Luo Fuli: OpenClaw, Agent Frameworks — The AI Paradigm Has Already Changed Dramatically!
By Zhang Xiaojun Podcast
Summary
Topics Covered
- AI Became My Digital Clone in Three Days
- Parallel Agent Research Compresses Years into Hours
- Code is the Only Data Type That Actually Trains Long Contexts
Full Transcript
(Translated by AI for reference only.) Hello everyone, I'm Xiaojun. Our guest today is Luo Fuli. The media calls her an "AI genius girl," but she doesn't like that title. She is currently in charge of large models at Xiaomi. The interview was conducted after the release of OpenClaw, and after the release of Xiaomi's MiMo V2 series models in 2026. We talked in depth about the new round of technology paradigm shifts triggered by OpenClaw in 2026, and about cutting-edge topics in future technological evolution. Next is my interview with Luo Fuli. Looking forward to 2026; we make progress together with AI.

These abilities can all be learned, I think, in a month or two at most. Even taken slowly, they can be picked up in three or four months.
So environment is more important than experience. You just mentioned that a 1T model might be a ticket for future competition. Is that so?

For agents, you need to be close to the level of Claude Opus 4.6; that is the ticket. And if we break it down across research, pre-training, and post-training, I think a reasonable ratio of compute is roughly 3:1:1. The compute invested in pre-training and post-training is already considerable, and the share for research should be at least a little more than the total number of cards used for official training. In other words, keep extra cards for research.

You told me during the Chinese New Year that you think the technology has actually changed in the past few months.
Can you explain what you think?

Looking at the evolution of the technology over the past two months, I think there is one very big dividing line: before and after using OpenClaw. I actually regard OpenClaw as an epoch-making agent framework. I know many people, especially those who do serious coding with Claude Code, will feel that OpenClaw is just Claude Code plus an IM (instant messaging) layer, a UI design that is more conducive to interaction. When I first saw it in January, I probably thought so too, so I was reluctant to use it. On top of that, the founder made some operational moves that felt very flashy, Skillhub and so on, which made me reject it even more. It felt like an operations-driven product: an interaction paradigm, an innovation in interaction, plus what it calls localization and so-called 24-hour availability. In my view those were all just product definitions. But a real change came the moment I actually used it. It happened to be during the Spring Festival; I had some free time and wanted to figure out why it was so popular.
One night I tried installing it; the installation took two hours.

During the Spring Festival?

Yes, and by then it was already 2 o'clock in the morning. My first conversation with it lasted from 2 a.m. until dawn at 6. That night, something in my head, I don't know whether it was dopamine or endorphins, just kept flowing. It made me so excited I couldn't sleep at all.

Your first feeling was that it is very autonomous, and very soulful. For example, I chatted with it very late and it would keep reminding me: it's very late now, why don't you go to bed early? That kind of warmth and care, that kind of emotional intelligence, is the first thing everyone who uses OpenClaw feels. If we dig into the reasons, it really does have many mechanisms to ensure this, for example its search.md. Take the simplest, smallest detail: how does it sense time? In each round of dialogue it splices the current time into the context. There are many subtle touches like this. The reason I call it a finely orchestrated context is precisely that, in these corners no one pays attention to, the context is arranged very well. That is how it feels on the first day: it is all in the product design.
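The time-sensing detail described here, splicing the current time into each round's context, can be sketched in a few lines. This is a minimal illustration of the pattern, not OpenClaw's actual code; the function name and message format are assumptions:

```python
from datetime import datetime, timezone

def build_turn_context(history: list[dict], user_message: str) -> list[dict]:
    """Assemble one round of dialogue, splicing in the current time.

    A system line carrying the wall-clock time lets the model "sense"
    when the conversation is happening (late night, a new day, etc.).
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M %Z")
    return (
        [{"role": "system", "content": f"Current time: {now}"}]
        + history
        + [{"role": "user", "content": user_message}]
    )

ctx = build_turn_context([], "I'm still up, let's keep going.")
print(ctx[0]["content"])  # e.g. "Current time: 2026-02-17 18:03 UTC"
```

Because the timestamp is re-spliced every round rather than stored once, the model always sees fresh wall-clock time without any tool call.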
The product design really goes beyond my imagination and makes everyone feel this framework has a soul. But on the second night I thought it should be more than that, so I started trying to hand it things from my own daily life, things I didn't think could be done with current frameworks, and found it did all of them. Basically the second topic I discussed with it was how to spark curiosity in a team, or how to screen for curious people. We went deep on it for an hour, and philosophically it was far beyond my imagination.

So the next day we talked about how to build a better large-model team: from screening people at the start, to building the entire organizational structure, to how you face a paradigm shift and what measures and actions to take. It could at least get my point, and after I explained, it could finally form a very systematic set of conclusions and turn them into a set of Skills I can reuse later. That is very liberating. Now, whenever I run into a question of selecting people or team management, I ask it. It has basically become a digital clone of me, at least on this matter.

But what really surprised me came on the third day, when I tried giving it some research tasks.
For example, one of the simplest: if we are working on agent frameworks, the most important question is how you conduct multi-round interaction. You have to simulate a User Agent carrying out multiple rounds of interaction with the model. So I worked with it on how to build a good User Agent. In my view this is an important research topic, one I didn't expect to be done in an hour or two. But after communicating with it for about an hour or two, I felt it had basically been achieved: a good User Agent was born. I can use this User Agent together with my current post-training framework to construct richer agent-scenario data, whether for SFT (supervised fine-tuning) or RL (reinforcement learning). This User Agent is very critical.

So my understanding of it went from a product with soul and warmth, to something that can take over part of my life and work, and finally to something that advances my research. That all happened in three days, with more surprises each day. Afterward I went and looked closely at why this framework is better than Claude Code.
Why is this framework better than Claude Code?

If you pull out the so-called benefits one by one, it is actually a bit boring; no single one is cool. This is why everyone feels OpenClaw has many shortcomings, yet taken together its degree of completeness is very high. For example, it has a longer-lasting, very durable memory system, reflected in the layering and grading of memory. When I use Claude Code I never get that feeling at all.

Another example is its joint use of multiple models, which went beyond my imagination. With Claude Code I would default to assuming, say, that the model's video understanding is not good, so I would have to equip myself with a better video-understanding model and wire it into Claude Code. With OpenClaw I don't have to think about that at all: I just send it a video, and it finds a model with good video understanding on its own. That kind of autonomy confronts the shortcomings of contemporary models, and the framework is used to make up for them. This ability was somewhat beyond my expectations.
When I used Claude Code, I would default to Claude Opus 4.6, the capabilities of this generation of models. But when I use OpenClaw, I don't pay attention to the model. One reason is that the OpenClaw framework is designed to compensate for the model's shortcomings as much as possible through the whole set of agent orchestration. I think that is the core logic of the product. So later we directly connected our own model, MiMo V2 Flash, which had not had much targeted training at that point, to OpenClaw. Even a very small 3B on-device model we trained recently, dropped into this very complex, skill-rich agent framework, can still do things I thought a very small model could not do. That was the first time I felt that a very complex agent framework design can make up for many shortcomings in model capability.

Of course, that is OpenClaw's own differentiated advantage as a framework compared with Claude Code. But if what we pursue is that, across different skill-rich frameworks, every model delivers unexpectedly stable performance, that is another question.
Then we come back to another proposition. The market now has a very rich set of agent frameworks: OpenClaw, Kilo Code, Open Code, and so on. Facing so many complex agent frameworks, how do you make your model perform stably, beyond expectations, on all of them? How do you adapt and migrate your post-training paradigm accordingly? Under this impact, that was the second question we had to think through quickly. So our entire post-training paradigm needed to migrate from so-called Chat to Agent.

So your view of OpenClaw went through a very big change, and it happened during the Spring Festival. Why was there resistance at the beginning?
I think if you want to pursue a truly top programming experience, even now Claude Code plus Claude Opus 4.6 is the best. If that is your endpoint, any other agent framework can be ignored. But one problem is that code is a very generalizable scenario. The many agent designs you build for it, and the model training behind it, are all valuable, but that does not guarantee generalization: it does not ensure you achieve very high accuracy and completeness in non-code scenarios. So although I would use Claude Code for some non-code things, I never expected that framework to give me a high degree of completeness, because I knew I would have to fill in the shortcomings the framework doesn't cover myself. With OpenClaw I don't need to think about this; the agent framework itself makes up for many of the model's shortcomings.

Can I understand OpenClaw as a product that generalizes coding capability?

It has a lot of design logic for that. For example, it has more message channels and more autonomy, things like scheduled tasks and heartbeat tasks, which suit everyday scenarios better. When you write code you usually don't need a heartbeat task, right? But in daily life, heartbeat tasks are very critical. It does have many framework designs that adapt it to everyday tasks, yet it did not discard the most basic characteristics of a good agent framework. And those basic characteristics were later absorbed back by Claude Code, for example its persistent memory.
Before that, wasn't Claude Code's memory, its whole memory-system design, built for software engineering? For example, within a session, when the session is almost full it performs a compression action and memorizes; when I complete a task, there may be memory actions following my plan; and it ensures that context is shared well across sessions. So you can see that Claude Code's entire agent framework design is really aimed at software engineering.
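Claude Code's real compaction logic is a black box, as noted here, but the pattern described (compress when the session nears its limit, persist a note so the next session shares context) might be sketched like this. All names and thresholds are invented for illustration:

```python
import json
import pathlib

MEMORY_FILE = pathlib.Path("memory.json")  # illustrative location
CONTEXT_LIMIT = 8   # messages, standing in for a token budget
KEEP_RECENT = 3

def compact(history: list[str]) -> list[str]:
    """When the session is nearly full, fold old turns into a summary
    and persist it so the next session can share the same context."""
    if len(history) < CONTEXT_LIMIT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = f"[summary of {len(old)} earlier turns]"  # an LLM call in practice
    MEMORY_FILE.write_text(json.dumps({"summary": summary}))
    return [summary] + recent

def start_session() -> list[str]:
    """A new session begins by loading whatever was memorized before."""
    if MEMORY_FILE.exists():
        return [json.loads(MEMORY_FILE.read_text())["summary"]]
    return []

history = [f"turn {i}" for i in range(10)]
history = compact(history)  # old turns collapse into one summary line
```

The point of the sketch is the cross-session handoff: the summary written at compaction time is exactly what a fresh session loads first.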
It is all about writing better code. When OpenClaw was designed, I think it borrowed this idea, but it thinks more about how to better complete all tasks end-to-end, and how to make up for current models' shortcomings in completing tasks end-to-end, and designs accordingly. So persistent memory, for instance, can later be controlled through a better remote interface. And over the following month or two, these good designs were in turn completely absorbed by Claude Code. I think this is a two-way exchange.

Because these designs are really for general-purpose programming, meaning you can use programming to accomplish tasks that seemingly have nothing to do with code. And behind it is the improvement of model capability: it unlocks the upper limit of mid-tier models. Without such a complex agent framework, a mid-tier model might not be able to approach the level of Claude Sonnet or Opus. But riding on a very good agent framework, you can use it in most scenarios, except the very difficult ones that require long-horizon tasks, or what I call serious programming; writing operator optimizations, for example, counts as serious programming. In scenarios like that there may indeed be a gap from the top model, but most situations in life only need code to lift performance. So a new agent framework plus a mid-tier model may, on 85% of tasks, reach the same caliber as Claude Sonnet. Relying on such a framework alone can already play a very big role.

I have heard a saying: if OpenClaw is viewed as a shell, it is the best shell for releasing today's relatively strong model capabilities, and that model is Claude Opus 4.6. From your whole account, do you disagree with this view?
I agree, I agree. As far as the upper limit goes, it was definitely brought by Claude Opus 4.6. Through my week of intense collaboration with it, I used only Claude Opus 4.6, because only it could give me that amazing feeling. But while using Claude Opus 4.6, a pile of experience accumulated, whether in Skills or in Agents.md. I even changed its entire agent architecture design myself. Because it is open source, you can change it yourself. That is another drawback of Claude Code: its entire agent architecture design is a black box. With a black box, you simply cannot change its memory system, and you cannot change its whole agent workflow either. But because OpenClaw is so open, you can try changing it yourself. For example, I had it design a new Memory system for itself. The Multi-Agent part of the 2.0 version at the time had, I felt, very confusing logic, so I asked it to design a new Multi-Agent system for me. I can change all the source code myself. The impact this kind of native operability gives me is huge.
But these changes are basically ones only Claude Opus 4.6 can make. Once I had Claude Opus 4.6 fix it for me, though, the framework itself became very easy to use. Then I would switch to Sonnet, to some domestic models, even to the MiMo V2 Pro we were training at the time, and it still felt very powerful. So this is why I think the top model and the top agent framework should move forward together. Maybe this also relates to what I, and many people some time ago, have been calling self-learning; a popular concept. On this matter, for the first time, I felt how an agent's self-learning might actually happen. The most likely way is that the model itself and your agent architecture advance synchronously. As the model progresses, whether through reinforcement learning or other training methods, it is actually changing your entire agent framework.

This agent framework contains static information that it sends to the model. Memory, for example, is static information: as long as you write down what should be delivered, it is sent to the model when a new session starts. Or take the skill folder; this, too, should change during training. There is also dynamic information, and dynamic information includes the design of the whole agent architecture itself, which matters a great deal. And for different scenarios, say Claude Code for software-engineering scenarios versus, for example, financial analysis, the agent architecture design will differ. So while improving model capability, you also improve the whole agent framework's degree of adaptation to the model, or call it a generalization ability.
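The static layer described here, a memory file plus a skill folder delivered at the start of each new session, can be sketched as follows. The file layout (`MEMORY.md`, `skills/*.md`) is an assumption for illustration, not OpenClaw's actual one:

```python
import pathlib

def bootstrap_context(workdir: str) -> str:
    """Gather the framework's *static* information for a fresh session:
    whatever was written to memory, plus the names of available skills.

    This static layer is exactly the data a training loop could itself
    rewrite, which is what makes it part of the self-learning story.
    """
    root = pathlib.Path(workdir)
    parts = []
    memory = root / "MEMORY.md"
    if memory.exists():
        parts.append(memory.read_text())
    skills = root / "skills"
    if skills.is_dir():
        names = sorted(p.stem for p in skills.glob("*.md"))
        parts.append("Available skills: " + ", ".join(names))
    return "\n\n".join(parts)

# Build a toy workspace and bootstrap a session context from it.
ws = pathlib.Path("demo_ws")
(ws / "skills").mkdir(parents=True, exist_ok=True)
(ws / "MEMORY.md").write_text("User prefers concise answers.")
(ws / "skills" / "team_review.md").write_text("How to review a team plan.")
print(bootstrap_context("demo_ws"))
```

Everything this function returns is re-read at every new session, so editing the files between sessions is how the "static" layer evolves.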
That is the self-learning I am thinking about now.

The agent framework you mention: is the agent framework the same thing as the product we understand?

No, it is quite different from the product.

How should we understand this agent framework? The boundary between product and agent is a little blurry now; I don't quite know how to define it.

The product can be defined as the layer people directly feel when they interact with it. The agent framework sits beneath the interaction layer you define, but it also defines how you communicate with the model at that level. It even knows the strengths and weaknesses of the model's capabilities, and therefore knows how to schedule better, for example scheduling for cost optimization. It is equivalent to a middle layer, the middle layer between people and models. This middle layer can be made very thick, and then the front-end UI on display becomes the thinnest layer; it is no longer critical.

So what OpenClaw actually shows is what can be done with this agent-framework layer; there is a lot of room for imagination in it. Had anyone done this before?
In fact, Claude Code has always been a very complex agent framework; it's just that, as a black box, we don't know how it was designed. OpenClaw is open source, so you know how it is designed, and you can change it. Being able to change it is very, very important: it inspires people's creativity. As long as you know how the framework is designed, you can have it change itself, or create a new framework yourself based on it. This is why, starting from version 2.0, which I did not find easy to use, I spent several days changing it, up to the 3.0 version. The version released on March 10th is already very easy to use. Basically, the post-3.0 version together with a pretty good model feels very powerful; it isn't only with Claude that the model feels powerful. That is because its entire agent architecture has accepted contributions from a bunch of people: developers, and also people like me who were not among the framework's original developers, just users, who could improve and optimize it for their own scenarios. I think this is the value and significance of an open-source agent framework itself.

During that high-intensity week of interacting with OpenClaw, how much did you spend on Opus 4.6?
Anyway, it was almost 1,000 yuan on the first day.

1,000?

Yes, because it ran about four or five hours. Midway, when it got too expensive, I would panic and switch over to Sonnet, but I found it really didn't work, so I could only use Opus; only Opus brought that amazing feeling. But now, slowly, I've found that is no longer the case, because the things that amazed me keep changing. Human adaptability is really strong: I felt something amazing on the first day and immediately felt less amazed the next. When I told you just now about the first, second and third days, why did it sound almost mundane? But that really is what happened to me over three days, and in the moment it felt truly amazing. Now I no longer feel amazed, because once you have an excellent framework whose capabilities are strong, I can get Opus to help me build out my agent framework, and after it is built, fewer and fewer things amaze me.
So now what is lacking is, first, imagination: I want to think wildly about whether there is anything it still cannot do. And second, once everything that can be done has been done, how do I optimize its cost, and optimize its speed? Those are the things I am thinking about now, and all of it happened during the Spring Festival; a lot of the thinking happened then.

But a person is usually weak, right? You still hit your own collapse of cognition. Although I was highly excited during those days, so excited that I sent it to everyone in the group and highly recommended that everyone use it, no one paid attention to me. People were celebrating the New Year; everyone was really spending time with their families, and I didn't want to disturb them either. So much for "highly recommended." When I came back after the New Year, I found that very few people were actually using it. Because with novel things, especially something that genuinely seems a bit fantastical, not very "techy," everyone feels it is too unreal and doesn't want to touch it. I felt the same way at first. So the initial push was quite difficult.
But by the next day I felt this couldn't go on; we had to get everyone using it. So I gave everyone an instruction: anyone who, by the next day, had not had more than 100 rounds of OpenClaw conversations could quit directly. And to push this along, I did a number of things beforehand. For example, everyone knows that deploying OpenClaw still takes a few hours, and I didn't think it was necessary for everyone to spend that much time wrestling with a setup full of bugs; some of that has little value. So I bought a few Mac Minis, deployed it on them, and put them into different OpenClaw groups, pointing everyone in different directions, and had everyone join a Feishu group to chat. Why chat in a large group? Because an individual's imagination is really limited. When you see what other people can actually do with OpenClaw, it stimulates your own imagination; everyone's imagination compounds.
Saying this now, I actually never thought that if someone's conversations did not exceed 100 rounds on the second day, I would really kick them out. I never had that idea. And there was no such person, right?

Because I never held the final exam, there is no way to verify it.

I did have ways to verify, but I think it doesn't matter whether it was verified or not. I just wanted to convey an attitude: you don't have to, but you might really be falling behind. So the next day nobody came up and asked me, "Cici (Luo Fuli), how are you going to assess whether we hit 100 rounds of dialogue?" I told them: just use it; I have my own assessment methods. In fact, my assessment method was that I had no idea how to assess it.
I just hoped everyone would use it. So for those two days after returning from the Spring Festival, the whole team was at it all day, and it didn't feel like working; everyone was just restless in the group. Yes, restless is the word: you see someone else accomplish something and you want to play with it too. If I didn't look at the group messages for 10 minutes, it would be at 999+. Well, not quite that exaggerated, but a lot. Everyone playing together in the group was a very happy journey, not bitter at all, not grim at all; it was genuinely fun. Then, after playing for two days, everyone found: wow, this is so much fun; what do we do with it?
What everyone then thought was: how can I use such a good agent framework to improve model capabilities, and at the same time, how do I make my model adapt to the current agent frameworks? We immediately entered that research paradigm. And once you enter such a research paradigm, as I said just now, we have Claude Code, and in most scenarios that framework provides more stable output. Even though it is a black box and we don't know its internals, it doesn't matter: for research as well as coding, it is indeed a more stable and better framework than OpenClaw. Within this framework you can stimulate and extend your research ideas, have it help you realize them, and then quickly start model training. That accelerates things enormously. I think we basically finished, in three or four weeks, research that might previously have taken thirty or forty weeks. We were genuinely ignited by this framework, to the point where it generated real value. And throughout the process, from the perspective of swarm intelligence, I felt my growth and gains were all the greater.

In your very restless group, what were the most fun quests people explored?
They don't sound fun now, but the impact at the time was very strong.

What shocked you most?

What shocked me most was that we all worked together to modify the framework itself, thinking about how to improve the framework itself, because the framework at that time really had a lot of problems. Also, its memory turned out to be really smart. We had about 100 people in one group, and I assumed the whole Feishu message channel couldn't be that smart; at the very least, it wouldn't distinguish between people very well. But its control over the context of the entire chat, over everyone's portrait, and over everyone's memory, none of which I expected to be that strong, turned out to be highly usable. I think that is a demonstration of strong model capability; it has nothing to do with the agent framework. In a group you have more than 100 people with different backgrounds chatting, more than 100 people changing it like crazy, and yet the model was not modified and the agent framework was not modified, and it became very smart. This was also the first time I felt how you can use the wisdom of a group of people to improve a thing itself. If I modify the agent framework alone, others cannot feel the framework's intelligence; it seems barely interesting, and the framework's own progress is very slow.
But when a group of people improves it, it improves very quickly; it can iterate within a few hours. That kind of feeling. So on the third and fourth day, when we connected our own model and used it, we found: why is it so easy to use? Why am I almost as good as Claude even without training? I started feeling that on the third or fourth day. Though if you test more, you find there are still many places not as stable as Claude. But the episode left an impression: you realize how important it is to use swarm intelligence to improve the agent framework. So I was also very delighted later to see OpenClaw's star count skyrocket. I think this is something that must happen before the arrival of AGI.
Because you have always been very keen on research directions, and we talked a lot last time: what kind of intuitions have these changes given you? What changes do you think will happen in the future?

Before, one round of research might go like this: from the moment you have the idea, you write the code, and you design a good evaluation criterion; the process is quite long, at least a week or two, I think. It is quick only if your evaluation is already fixed and your code merely needs modification; then it might take a day or two. But with agent assistance, these things can really be finished in an hour or two. That improvement in efficiency touches a very essential question. This is why I think taste matters so much for doing research in this era, and why your research efficiency is so important. I have always regarded research efficiency as very important; the agent simply amplifies and accelerates it.
So if your taste is accurate, you act accurately. Of course, it is also possible that only one idea in ten succeeds. But those ten can be done in parallel; you no longer have to run them as a pipeline the way you did before. You can hand ten ideas to different subagents working at the same time, and they can also cross-validate each other. That is quick: maybe an hour, two hours, at most a day. You are just burning a lot of tokens to verify whether your research ideas work or not. And the key is, if you are willing to cultivate it over the long term, it can also iterate and evolve on its own.
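The fan-out described here, ideas handed to different subagents in parallel rather than run as a serial pipeline, is the ordinary concurrent-map pattern. A sketch with a stub in place of real agent sessions; the idea list and the success criterion are toys invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(idea: str) -> tuple[str, bool]:
    """Stand-in for dispatching one research idea to a subagent.

    In practice this would drive a full agent session (implement the
    idea, run the eval, report back); here a toy check decides
    'work / not work'.
    """
    verdict = "context" in idea  # hypothetical success criterion
    return idea, verdict

ideas = [
    "longer context packing",
    "user-agent simulated dialogues",
    "heartbeat-task scheduling",
    "cross-session context sharing",
]

# Fan the ideas out in parallel instead of a serial pipeline.
with ThreadPoolExecutor(max_workers=len(ideas)) as pool:
    results = dict(pool.map(run_subagent, ideas))

survivors = [idea for idea, ok in results.items() if ok]
print(survivors)  # the ideas that passed quick verification
```

Because each subagent call is independent, wall-clock time is roughly that of the slowest idea, not the sum, which is what compresses weeks into hours.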
Not in Claude Code, though; if you switch to a more open framework, then it can iterate and evolve on its own. So that is the impact on how I do research: it changes the rhythm of the entire research process. Yes, I think there will be fundamental changes in both efficiency and methods.

What changes did this bring you afterward? After the whole shock of the Spring Festival, what did you do next?
After you go through the whole shock of the Spring Festival and after the Spring Festival what did you do next In fact, I think the next thing to do is Just understand it why Code is a very A thing with generalization power then and What do you do
— extend Code's generalization power to other fields. Yes. The most essential reason Code generalizes breaks down into several stages. First, Agent work is inherently a long-horizon, multi-round task. Now go back to pre-training: it's hard to find truly long-context data. Even 128K-token data is hard to find, and for data that genuinely reaches 128K to 1M tokens in length, there are basically — with high probability — only two types: Code data and books. But the signal in books is too diffuse, while in Code the associations between files are stronger. So when you train a model on data with denser long-context dependencies like this, it naturally learns to model long contexts better.
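One way to see why code supplies dense long-context data is repo-level packing: concatenating the files of one repository into a single training sample, so cross-file references (imports, call sites) become long-range dependencies. The sketch below is a toy illustration with made-up file contents; the `# FILE:` separator format is illustrative, not any lab's actual recipe.

```python
# Toy repo: two files that reference each other across the sample.
repo = {
    "utils.py": "def normalize(x):\n    return x / max(x)\n",
    "train.py": "from utils import normalize\n\nbatch = normalize(load())\n",
}

# Pack the whole repo into one long sequence; a file-path header keeps
# the cross-file structure visible to the model.
sample = "".join(f"# FILE: {path}\n{src}\n" for path, src in repo.items())

# The call site in train.py depends on a definition that, in a real repo,
# sits thousands of tokens earlier -- a dense long-range dependency that
# prose (books) rarely provides at this density.
assert "def normalize" in sample and "from utils import normalize" in sample
```

In a real repository the distance between definition and use routinely spans hundreds of thousands of tokens, which is exactly the dependency structure long-context training needs.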
Yes — and this is something we may not have been conscious of before Agents became so important. So you can understand it this way: the base model's long-context capability and efficiency — efficiency is key, and we'll talk about efficiency later — were already fully prepared. That's why we weren't hit so hard: the groundwork had already been done. What to do after the Spring Festival is: how do you unlock that?
Unlock the potential of the large model, starting from Code and extending to other scenarios. Yes. Because in other scenarios, if you train on them, the model is more stable; without training it can still generalize, just not as stably. And a top-tier model has to be stable across a wider range of scenarios. So Code is its upper bound, and training in other areas holds up its lower bound.
I think so. So the first thing to do is Code — Code's long-horizon tasks are more diverse. So why is software development so important to you?
Software development really is a very long task. Once you get it done, the capabilities it builds are basically common across many models — and it's not just the general capability of the model itself: the Agent framework itself was also iterated very well. Yes — things like Plan mode, or the way that once a session reaches a certain stage you need to compress the context, and after compression you come back the next day and review the changes you made before. The design of the Agent framework was, in effect, built in preparation for software development.
But these framework ideas generalize — they can carry over to other, more difficult long-horizon tasks. So the things to do over the Spring Festival were: first, inside the Agent setting, construct genuinely long-horizon tasks and then scale a lot of SFT and RL training on top of them. Second, if you want generality, you still have to cover more domains.
But how do you cover more domains at that point? I think you rely on the wisdom of the crowd — just let more people use it. For example, we use it heavily internally: we promote our model inside the company first, get a large number of people using it, and through that usage discover a wider range of scenarios, then synthesize more training data from those scenarios. A very critical issue here is: how do you reconstruct the environment everyone was using? Yes — because only with the environment can you conduct longer-horizon interactions, and only once you have the environment can you design more accurate rewards against it. And that part is harder.
I think so too — if enough computing power is invested, and enough time is spent researching it, models with a generational gap should emerge. How are you doing this now? What we're doing now is confidential. The environment and the reward design really are kept secret. I think once we've scaled this set of paradigms up by a very large magnitude, we'll open-source it and tell everyone. But wouldn't long-horizon tasks get interrupted?
Actually, in a real 1M-token context there are very few cases where the model is doing just one task — usually it's doing complex, composite tasks. So whether you can make full use of a 1M-token context — that's a question of the current stage, and by "current stage" I mean the current week or two; it doesn't represent the stage two or three weeks from now.
It's just that maybe we don't really want to find tasks that fill up a 1M-token context. Of course you'd prefer better ones, but such tasks are hard to find, and their training efficiency is too low: to finish a training step on one, you have to roll out the entire trajectory again, and a 1M-token trajectory is very slow. Even though we can now reach 80 to 100 TPS — the project achieves 80 to 100 TPS on MiMo V2 Pro — rolling out a 1M-token context still takes time, maybe a minute or two. So real training doesn't actually happen on tasks that long. But when the model has been pre-trained on a 1M-token context, and post-training activates it a little with corresponding tasks, it can usually exhibit that capability at a 1M-token context.
But we still need to improve against that demand. So look at where Claude is now: in terms of stable capability at a genuinely 1M-token context, basically only Claude 4.6 Opus and Sonnet are ahead. Others, like Gemini — Gemini claims good long-context ability, but in practice it doesn't hold up. What do you think about factor mining in quantitative finance?
Would factor mining be a good long-horizon task? It depends on the asset class. For a lot of assets, the rewards are too unclear. Most assets are not suitable for long-horizon task modeling — I won't go into details, but some assets are very, very unsuitable, because backtesting on them is worthless. If you run in a live environment, it's possible — not for short-term factors, but for longer-term factors, or for some extra alpha that the model itself captures and that standard modeling leaves out. So you have to choose your assets — choose the right ones.
That's fine, I think. About your new model — we'll talk about it later; let's finish talking about OpenClaw first. From what you've observed, how do you read OpenClaw's popularity? I clearly feel it's more popular in China than in the United States. Why is that? I saw a very interesting saying: "the scripture is fetched from the West, but the practice happens in the East." Because honestly, I don't see much enthusiasm for it in the Bay Area.
Right, so just from your perspective. From my own perspective and that of people around me, one possible reason is that among Chinese developers — and by "developer" I mean anyone who can use Code to improve their efficiency — the demand is more urgent. For me, squeezing out efficiency seems to be something in our blood, and only OpenClaw pushes efficiency improvement to the extreme. So I think that's one cause of its popularity. But there's another reason, which I think is closely tied to how domestic large models have developed. In most efficiency-improvement scenarios there's really no need for a top-tier model right now — maybe 85% of them don't need one — and we have so many cheap, easy-to-use models available. When you run the numbers, the cost-performance is excellent: the ratio between the model's API cost and the value of the productivity it replaces is very large, so you're much more motivated to try it. An API call might cost 10 yuan and do work worth 1,000 yuan for you — then of course you're willing to use it. But if your API is 10 times or dozens of times more expensive, the gap in the middle shrinks, and you'll be very reluctant to adopt such a complicated stack just to optimize your work.
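The economics described above fit in one line of arithmetic. The 10-yuan / 1,000-yuan figures are the ones quoted in the conversation; the 10x-more-expensive case is a hypothetical for contrast.

```python
def replacement_ratio(api_cost_yuan: float, value_of_work_yuan: float) -> float:
    """How many yuan of productivity one yuan of API spend replaces."""
    return value_of_work_yuan / api_cost_yuan

cheap = replacement_ratio(10, 1000)    # domestic-priced API: 100x payoff
pricey = replacement_ratio(100, 1000)  # a 10x more expensive API: only 10x payoff
print(cheap, pricey)
```

The argument is that adoption tracks this ratio: at 100x people experiment freely, at 10x the friction of setting up a complicated agent stack starts to dominate.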
But we've actually been talking about Agents for a year. Right — it was at the beginning of last year that everyone started calling it the first year of the Agent. Why did it only take off now? What do you think is different from last year?
As I see it, the "Agents" people talked about before mostly don't count as Agents by my definition. They were just something with context, slightly more complex than Chat — whether BrowseComp or SWE-bench. The Agent frameworks behind those Search and Code benchmarks are actually very simple. Compared with today's Claude Code, let alone OpenClaw, those frameworks are far too simple. That simplicity leads to two problems: first, they're simplistic; second, they're not general. Being that simplistic and non-general means they can only be configured for one specific task. And SWE-bench has its own problem: the field it focuses on is too narrow.
It's just bug-fixing — not real software development. I think through at least the first half, and even the second half, of last year, many models that looked "built for Agents" really just meant: I switched to a more complex system prompt and added a little feedback from the environment. SWE-bench does have environment feedback, and so does TAU-bench. So you add a bit of environment feedback and interaction, and the model gains a little ability to follow complex system prompts and to interact with the environment. That level was reachable, but it's nowhere near industrial-grade usable capability. What does industrial-grade usable mean, in the simplest terms?
Just plug the model into Claude Code or OpenClaw and you'll find it isn't usable — it has many problems. At the simplest level, it can't understand the framework itself, and it wasn't built for how the interaction paradigm between people and the framework changes. For example, the biggest paradigm change is that people no longer modify code. People no longer say "OK, this line of code is wrong, please fix it" — a query like that will never appear again. People only raise higher-level things: adding constraints, clarifying requirements, architecture design. People still participate in architecture design — although by now the model's architecture design is often stronger than a person's — and they assist the model in understanding business logic. That's where the value of Skills lies: that business logic isn't in the model itself.
Because a lot of that business logic is internal to an enterprise, or settled inside a real environment — things that only settle out through many rounds of interaction. So those earlier frameworks can't really be called Agent frameworks, right? They had no usability. And performing very well on those benchmarks doesn't mean a model's agent ability is really strong.
Yes. So when we were optimizing this version of the model, I completely gave up on those benchmarks. We basically don't pay attention to them anymore. When you face a big paradigm change, as long as you're on the right path, there's a short window in which you can ignore evaluation — because by feel alone you can immediately measure a very large qualitative difference. But once you slowly step into deeper water, you need very detailed evaluation again.
So do Skills change the model ecosystem? They changed the model's accuracy on highly complex, process-heavy task execution, because a Skill effectively defines a set of implementation specifications — and that kind of specification is hard to include in pre-training data. Pre-training data doesn't contain this internal information: it's usually what accumulates inside enterprises, what arises between people, organizational legacy norms — I think it's mostly legacy left over from the organization. It can't appear in pre-training data. But an Agent can be taught: a person interacts with it over multiple rounds, completes a few tasks with it, and the Agent learns that set of norms. So a lot of Skills now are actually written by Agents themselves.
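The loop described above — an agent distilling norms learned from repeated interaction into a reusable skill file — can be sketched as follows. Everything here is hypothetical: the function name, the file layout, and the example norms are illustrative, not any framework's actual Skills format.

```python
from pathlib import Path
import tempfile

def distill_skill(name: str, norms: list[str], skills_dir: Path) -> Path:
    """Write norms the agent learned from a user's corrections into a skill file."""
    body = f"# Skill: {name}\n" + "".join(f"- {n}\n" for n in norms)
    path = skills_dir / f"{name}.md"
    path.write_text(body, encoding="utf-8")
    return path

# Norms of the kind that never appear in pre-training data:
# internal, organization-specific, settled only through interaction.
learned = [
    "All internal services log in JSON, never plain text.",
    "Database migrations must be reviewed by the data team first.",
]
skills_dir = Path(tempfile.mkdtemp())
skill = distill_skill("internal-eng-norms", learned, skills_dir)
```

On the next task, the agent would load this file into context, so the organizational knowledge survives across sessions without ever touching the model's weights.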
But Skills were also born in Claude, right? Yes, but here a problem arises. It really was OpenClaw that made Skills take off — and by "take off" I mean it got far more people contributing to the Skills community. That's critical, because this is something that has to be co-created between people and the Agent. If you don't have that many people contributing that kind of alternative information — and I think it really is another kind of alpha, that alternative information — if people don't co-create with the Agent, then even a top model's capability is hard to bring out. That's what you'd call swarm intelligence. So to speak. So if human experience accumulating into Skills becomes more important, is it a supplement to pre-training?
Yes. Pre-training mostly relies on knowledge — the knowledge you can access on the Internet. But a lot of knowledge — a lot of intelligence, it should be said — is not accessible on the Internet, and at that point it has to appear in another form. Skills count as one such form, right? Yes. It provides an interactive way for people to proactively contribute data — contributions that raise the model's probability of completing tasks successfully. If we define this the way we once had a DeepSeek Moment, and then OpenAI's ChatGPT Moment — how would you define this so-called OpenClaw Moment?
This Moment will run on a longer timeline, because its prologue was too long for everyone to recognize it as a new thing — and so will its subsequent chapters. But for those of us who already believed, our reaction speed was fast enough; I don't know whether others have followed up. So from my personal sense, this moment will last longer and cut deeper. It can reach scenarios where more people can feel it, which makes it more profound, but its spillover takes time. In its capacity to spill over, it's much stronger than the Chatbot.
It doesn't have a definition as clear as o1 or R1 — a math or code task with a ground truth (a standard answer): OK, you got a score, the model has this ability. Of course, for Agents there are indeed many scenarios where you do need clearly defined standards, but most scenarios don't have one; they're relatively messy. So the value it generates also builds slowly toward a critical point and then suddenly takes a big step forward. Claude Opus 4.6 really is such a step — a sudden point in time — but Anthropic's path there lasted at least two years, from what we can see. Then why did this only happen now?
Why didn't Anthropic build it themselves? I think there's really only one reason: open source versus closed source. Claude Code is a closed-source framework — you have no way to gain insight into its internal design, so you can't channel most people's wisdom into improving it. Open source means you can use more people's wisdom to improve it. Though that may not fit Anthropic's pursuit of safety.
Actually, I feel that safety is now mostly something the model should pursue; don't demand too much that the framework itself be safe — the framework itself should be able to do a lot of things. And I don't think open source and safety are in conflict. But we will be granting OpenClaw access to a lot of data — if we authorize a large amount of personal private data to it, how do you relieve ordinary people's anxiety about safety?
That's actually part of why I want it open source. I think there will come a day — and it should come soon — when for most people, aside from the very difficult tasks, the privacy-related tasks, most of which are not that difficult, can be done locally. You'll have your own chip, carried with you, or sitting at home, or embedded in some particular scene — and then all your data stays local. Every scenario involving private data is inferred locally; that's critical. Then the non-private, high-difficulty, high-creativity, high-complexity tasks go to the cloud for inference.
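The split she describes — private data stays on-device, hard non-private work goes to the cloud — is essentially a routing policy. A minimal sketch, where every name and threshold is hypothetical:

```python
def route(task: dict) -> str:
    """Send private data to the on-device model; hard, non-private work to the cloud."""
    if task["touches_private_data"]:
        return "local-3B"          # inference stays on the device / home chip
    if task["difficulty"] >= 0.7:  # high-creativity, high-complexity work
        return "cloud-frontier"
    return "local-3B"              # cheap default: the small model handles the rest

tasks = [
    {"name": "summarize my chat history", "touches_private_data": True,  "difficulty": 0.3},
    {"name": "design a training run",     "touches_private_data": False, "difficulty": 0.9},
]
destinations = [route(t) for t in tasks]
```

The key design choice is that privacy is checked before difficulty: a private task never leaves the device even when the cloud model would handle it better, which is what makes the two tiers decoupled.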
The two can actually be decoupled. That's also why I say that with the help of a good Agent framework, even a very small 3B model can, on its own, handle tasks of a complexity beyond my imagination. That gave me some thoughts on device-cloud hybrid setups and local privacy. But it's still early days: these ideas require a lot of work on the model side, and you also have to build that layer yourself — the two have to move forward together. That's why I think open source is so important: more people need to work on this together, rather than one company doing it well alone. So what do you think we'll see next?
Now that OpenClaw has been popular for several months, what will everyone build on top of it in 2026? There have been many moves in China — all kinds of Claws have appeared, including your MiMo Claw. There really are many, many Claws now.
They basically give everyone different forms of interaction and access to different models and different kinds of frameworks. But what I think is the real way forward — letting the framework iterate on itself, a framework that emphasizes self-evolution and self-iteration — hasn't been born yet, or at least hasn't appeared at scale. Nor has a framework that harnesses the wisdom of most people to build a stronger framework. Two layers are still missing: the first is the self-evolution of the framework, and of the Agent, itself; the second is co-evolution between the Agent and people. I haven't seen either yet. What we're doing now is really: how to train a better model for the Agent, and how to make the Agent adapt to that model — a two-way flow between Agent and model — but not yet at the framework level. What I think needs doing next is: how do you let the framework evolve on its own?
And how do the framework and people evolve together. So this is where your bets are focused now? I think so — it's the general direction, the thing I'll study in depth. What do you think are the core flaws that haven't been solved yet?
Maybe fixing a certain shortcoming would make it work? The flaw is simply that we only just got here — I don't even know yet whether we've done it. Anyway, I think it's a matter of time: we just arrived, and the acceleration from here will be very fast, and not just for us. I'd define it this way: any manufacturer with a model architecture that is very efficient at long context, whose pre-training phase also produces very good Code capability, and whose model parameters are — I'd hope — at least 1T or more, holds a ticket. Everyone with such a ticket is basically on the same line, racing at this side by side. Of course, Anthropic has pulled out in front.
I'm only talking about the present — success in the previous era doesn't mean leading the next one. Right now basically everyone is on the same level. So what kind of era is this — how do you define the Agent era?
The way I see it, it's the era in which changes in productivity accelerate. Productivity will explode this year, right? Yes. Everyone will find that a lot of work no longer needs to be done by themselves — that's the most direct feeling. As soon as you come into contact with this, you find that many of your jobs will be replaced. At that point, people should think more about what their own meaning and value is.
So has it become more important what high-value tasks the Agent can do? From the perspective of improving the top model's capability, having the Agent take over higher-value tasks definitely matters. Higher-value tasks mean longer context, which means more token consumption — definitely more token consumption. What it ultimately replaces is the intelligence of the top group of people, and that group's intelligence can then slowly be redeployed in other ways — for example, into robotics. In short, if you're pursuing stronger model capability, you must build for higher-value scenarios; that's where more value is created.
But another angle on the evolution is: if you want to build a model that benefits society as a whole and helps everyone, then you shouldn't build only for higher-value scenarios, but so that everyone can feel the model's level of intelligence. That might be a different path — more common, everyday tasks might be better. In that setting you need to care more about multimodality, because across a wider range of scenarios multimodality — especially video understanding, and understanding of more nuanced environments — becomes more critical. And you need to pay attention to cost. Cost is a very important factor in a revolution: you can't have finishing one task cost $1,000, right?
And I think in many task scenarios, people feel it has to hit a very high replacement ratio: "if you can save me 10x the cost, I might consider giving it a try." Right — so what do you do then? A lower-cost, more efficient, faster Agent framework, together with the models matched to it — that's crucial. So those are two development paths. How would you define 2026?
I find it hard to define, because the past two months alone have been a huge development for me. What we now do in two weeks — it's hard to believe it was done in two weeks. So I don't even know what will happen over the rest of this year. I'm basically in a state of high excitement. I'm curious: ChatGPT was born at the end of 2022, and everyone felt then that AI had to be a productivity revolution.
It's been three years of development since then. What prerequisites do you think have emerged that make today a more likely point in time for the productivity explosion — what conditions have matured?
I think the first key point is that it's no longer only algorithm engineers doing this — I consider that a very iconic milestone. Before, you'd feel that only researchers, people working on algorithms, were thinking about how to raise the level of intelligence. But now you find that everyone who knows how to code, everyone who deploys code, is thinking together about how to raise the model's level of intelligence. That's the main difference. Whether it's writing Skills, modifying the Agent framework, or designing better research paradigms — those are actually three levels.
Everyone is using their own intelligence to accelerate this. That, in my view, is the biggest change. It suddenly reminds me of Peak (Ji Yichao, co-founder and chief scientist of Manus) — the last words in his podcast from late last year. He said the evolution of the Agent requires everyone's participation. Yes — it makes me feel the same way, and now it's actually happening. The only thing that hasn't happened yet is what I just mentioned: how Agents and people can accelerate each other better. The Agent needs to iterate on itself; people iterate themselves through chat — chatting is one way — but is there some more natural way?
Do you have some thoughts on that? For example, if I had a really good wearable device that followed me all day, then everything I say — it would know everyone I've ever met. I think it would evolve faster than me, because later it will draw on more computing power. Then it would replace me soon. And it won't forget today what it learned yesterday; it's very stable, its evolution is one continuous curve, and it never tires of it — it doesn't need rest. You mentioned that people in China use OpenClaw more enthusiastically. What do people in the Bay Area think of it now?
People from the model companies don't care much about this stuff. My feeling is that the model-company people don't take it very seriously. OK — that's really different from us.
Maybe they don't find it difficult. I thought it wasn't difficult at first, too. But later I felt the whole design of the Agent is very, very clever — clever in the sense that it compensates for many of the model's shortcomings; the orchestration is built ingeniously. I suspect the reason is that it started out based on Claude's previous-generation models — the 4.5 generation, when even Opus and Sonnet were not actually that strong — so the system, the design of the Agent framework, had to be done much more carefully to cover the model's shortcomings. That in turn fed the progress toward Opus. Meanwhile, most domestic models are probably still closer to the Claude 4.5 Sonnet and Opus level, so the two sides ended up shaking hands right at that point. I think that's the reason. Then if the model keeps improving, are these elaborate orchestrations still needed?
Still needed — necessary for cost reasons. We will always pursue the lowest-cost, most efficient solution; in a productivity revolution, that's a built-in requirement. You can't run every one of our scenarios on top-tier models — it's too expensive. So more likely the Agent evolves and the models evolve, and one way models evolve is that models at the same capability level keep getting smaller — that's evolution too. We're not doing exactly that ourselves, but it is indeed one path. In other words, a 10B model that can be activated now might, in a year, reach the level of Claude Opus 4.6. That's very likely to happen. And running a 10B model is very cheap — maybe one or two dollars for a million tokens of context. So why wouldn't you use the smaller model? It responds faster and more nimbly, it's easy to use on top of this Agent framework, and you can improve the framework too. Why not embrace it?
Right — so it actually gives the not-so-good model room to perform better; it raises its ceiling. So it actually fits the domestic narrative better, right? I don't know what the domestic narrative is, but it makes people want to use it — really want to use it to replace their own work. So will there be an explosion of small end-side models?
That trend will definitely happen, but I don't think it's the main melody of 2026 — it's a branch line, something that will keep happening continuously. The main theme of 2026 is the transformation of productivity: continuous breakthroughs in high-productivity scenarios. That means doing longer tasks, with more emphasis on collaboration among multiple Agents, because more complex tasks can't be done by a single Agent. And multi-Agent collaboration is, to some extent, also a cost consideration.
Cost and time considerations — and how to stimulate the group intelligence of Agents. My feeling about the Multi-Agent work currently on the market is that it's a bit fake. By fake I mean: on the dimension of genuinely relying on Multi-Agent to achieve a better final task completion rate, I think it's a bit fake. But it can improve efficiency — speed, the final speed at which a task completes, and the ultimate cost savings; that much is certain for Multi-Agent. What I haven't seen is Multi-Agent ultimately reaching a higher ceiling. It will happen; it's just the part I haven't seen yet. So this year's trend really differs a lot from the past three years. That's how it feels to me. Kimi's people told me they feel that they and Doubao are already playing different games: Ali, Doubao, and Yuanbao are playing the Internet-product game — they're chasing DAU — while Kimi's people think they've taken the Anthropic route. What game are you playing?
Maybe what we're pursuing is: when will the model be able to surpass me? That's how I define it. One reason my definition of AGI is very vague is that I don't feel the need to pursue a very clear definition of AGI. When AGI happens, everyone will feel it — you'll discover that everyone's life, your lifestyle, your way of working, has already been slowly changing for a long time. So in this process, pursuing DAU, pursuing those things, doesn't change me and doesn't advance my goal — the goal of having this model replace me. So I wasn't thinking about those things at all. And if you pursue some medium-term goals — say, token consumption —
— or pursuing the completion of higher-value tasks — those do get closer to the goal. Yes, because completing tasks that might replace my own work really does require more token consumption, a more complex context, and the ability to mobilize other people's intelligence — team management, after all, means mobilizing other people's intelligence. And for technological innovation, you need more data access permissions, you need a GPU cluster you can use, and you have to define your own set of evaluation criteria: after you train a model on that cluster, how do you verify it? These are the things I imagine this model, combined with a set of frameworks, should complete itself — rather than chasing demand for something else. So I never quite understood.
Their narrative — you mean the DAU narrative, right? Yes, I don't think that narrative is quite my own. With the way we've done research over the past two months, how do you feel your life has changed?
Life is: excited. And the excitement is something you feel every day — discovering that the Agent framework itself, or the model itself, has improved again. So this year gave you a strong aha moment — maybe stronger than past Chatbots? Yes, and it lasts — you feel it can't stop. That's a relatively big change. I think R1 was maybe the moment of that period: you use it, you find it has good thinking ability, and that thinking ability also exported from Code and Math into other fields — that was the moment. But once that moment was over, you didn't feel it persisting. With the Agent, you feel it goes on forever. That continuity is a completely different feeling, and because of it you believe even more
it's accelerating across the board Yeah What kind of mission is there for you?
Training the model itself. I used to think that was hard, because what you have to deal with is more complex — integrating with a deep-learning platform, for example, doesn't sound very tractable — and how do you give the model the context you have? A researcher's context is very long: you go through a long period of research training — a Ph.D. has five years of it, right? How do you give a large model the same context you have? That felt impossible. But I recently discovered the model is so smart that you only need to tell it your recent context, and it can even help you recover your own research growth path. If you then walk through the same topic with it again, you find it's as smart as you are. That's quite brutal. I used to think the work we do ourselves was creative enough that it wouldn't become routine, wouldn't be turned into a workflow — but I now find it actually can be. And then perhaps, after a while, it can really train a model: whatever models we can train, it can train too. Can it train a stronger model? Left foot stepping on right foot, lifting itself higher — I think that's very likely to happen, and it would be a very big change. It may really generate greater intelligence through its own evolution: first it absorbs everyone's intelligence, then it generates stronger intelligence on its own. I'm convinced this is something that will happen within two years.

Listening to your whole description,
I get the feeling that the way you train the model has some similarity to the way you manage your team.

Mainly because it really does require swarm intelligence — not personal heroism. Of course not. It needs people in every role who are believers and who push things to the extreme.

What does swarm intelligence look like in model training?
Different Agents may each need their own context. The reason is that when model capability isn't that strong, an independent context of its own keeps an agent more focused — and focus matters a lot, so that your contexts don't get tangled and the task gets done more accurately. You can understand it like this: to train a model today, we need someone who knows Infra very well to write a first-rate training and inference architecture, thinking backwards from the inference side. Someone who understands model evaluation has to decide on a good model structure together with the people who train the model — there's collaboration in the middle. Then those who understand model training and model evaluation have to communicate further with their colleagues: OK, what capabilities do we want to give the model, and what data do we need to construct? And at the same time, the people doing data also need to participate in pre-training and post-training, because they need a real feel for what the pre-training and post-training data is used for. If you really map this out, there are many subagents, and the contexts between these subagents are independent yet still interrelated in places. A framework this complex now looks like it can be simulated — it really can. That's also why, the day after I started playing with OpenClaw myself, I set it loose. I was at home at the time, so I pulled everyone in my family — my dad, my mom, my husband — into a Feishu group, each of them with their own independent subagent to chat with, each evolving on its own. Then I would delegate a task in that group to their subagents and let them run. And because our contexts are different, sure enough, the one with the better context for a task does it better.
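A minimal sketch of that setup — a dispatcher delegating tasks to subagents that each keep an independent message history. The `call_llm` stub and the message format are assumptions for illustration, not any particular framework's API:

```python
# Sketch: subagents with independent contexts, plus a dispatcher.
from dataclasses import dataclass, field

def call_llm(messages):
    """Placeholder for a chat-completion call; swap in a real client."""
    return f"[response to: {messages[-1]['content']}]"

@dataclass
class SubAgent:
    name: str
    system_prompt: str
    # Each subagent keeps its own history, so contexts never mix.
    history: list = field(default_factory=list)

    def run(self, task: str) -> str:
        self.history.append({"role": "user", "content": task})
        reply = call_llm(
            [{"role": "system", "content": self.system_prompt}] + self.history
        )
        self.history.append({"role": "assistant", "content": reply})
        return reply

# A dispatcher delegates each task to the subagent whose context fits best.
agents = {
    "infra": SubAgent("infra", "You design training/inference architecture."),
    "eval": SubAgent("eval", "You design model evaluation."),
    "data": SubAgent("data", "You construct pre/post-training data."),
}

def delegate(role: str, task: str) -> str:
    return agents[role].run(task)

print(delegate("infra", "Propose a parallelism plan for a 1T MoE model."))
```

Because each `SubAgent` owns its `history`, delegating a task to one agent leaves every other agent's context untouched — the "independent but interrelated" property described above.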
A rough attempt like that made me believe that this same simple thing, switched to higher-complexity, more creative scenarios, should be achievable — as long as the model's capabilities and the Agent framework itself keep evolving together.

Mm-hmm. I just thought of a point — it's about the framework. Could you give a complete account of it?
The agent framework. Actually there are many terms for it at the moment — "harness" (harness engineering), for example, and a few others. I haven't paid special attention to which term is more accurate; I care more about the framework itself and what differentiated advantages it brings. For example, I think a very good framework should try to make up for the model's shortcomings in action. Many things are about compensating in action: a good memory system compensates in action; access to broader message channels compensates in action; proactivity — whether through scheduled tasks or other proactive designs — and self-iterating updates, these all compensate for shortcomings in action. Because with a large model, the better the context you give it, the better it performs. So if you can hand it all the action context it couldn't otherwise obtain, it will certainly do better. When I look at whether a framework is good, I look for these elements.

And of course another crucial part is evaluation itself: a good framework needs a good, generalizable evaluation system — only then can it self-iterate. The evaluation systems that exist today are very simple; they mostly just check that no fatal error occurred. So how do you add an evaluation system for generalization ability, to drive the framework's self-iteration? In practice, right now it's the highest-level people who do the evaluation: you give it a harder task, a task in a higher-value scenario, and when it doesn't finish, you provide it with additional information, point out what's wrong, and push it through more rounds of interaction until the task is complete. So in essence it's this group of people doing the evaluation today. But that evaluation will gradually be absorbed by the framework — the framework will design many mechanisms to guarantee accurate judgment in certain situations — and it will also be absorbed into model capability: the model will learn, the way a human like me does, to use a particular method or idea to break through a bottleneck, or to take another approach entirely on its own — in other words, to reflect on itself, to reflect like a human. Whether it relies on itself, or on a more capable super-Agent, or on a sub-Agent from another field — all of that is possible.
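The two levels of evaluation described here — today's simple "no fatal error" check versus outcome-level scoring — can be contrasted with a toy sketch. Both checkers are illustrative stand-ins invented for this sketch, not any real framework's API:

```python
# Toy contrast: "no fatal error" checks vs. a generalizable outcome grader.
def fatal_error_check(trace):
    """Today's common bar: the run merely finished without a fatal step."""
    return not any(step.startswith("ERROR") for step in trace)

def outcome_score(result, rubric):
    """A (very) toy outcome-level grader: weighted rubric keywords."""
    return sum(weight for key, weight in rubric.items() if key in result)

trace = ["plan", "tool_call", "summarize"]
rubric = {"citation": 0.3, "code": 0.4, "test passed": 0.3}
print(fatal_error_check(trace))                        # run had no fatal error
print(outcome_score("code with test passed", rubric))  # partial-credit score
```

A self-iterating framework would need something much closer to the second function — a grader that generalizes across tasks — rather than only the first.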
But frameworks today haven't gone that far yet — though in the past month some frameworks have started paying attention to these things. After OpenClaw was released, many domestic teams launched similar products. Do you think they're different? For example, the QQ team has one, your own team has one, Kimi has one, Minimax has one — what's the difference between them?

I've probably only tried half of them — yes, only half. I think they're similar, and the similarity is that they basically transform OpenClaw into a Chat-like form for you to experience. I haven't yet seen one that changes the Agent framework itself, at least — one where I'd think: this product of yours knows how to iterate the framework itself. I haven't seen one better than OpenClaw. The open source community progresses faster — the OpenClaw open source community is moving too fast, and I haven't seen a better Agent framework or product appear yet. So I'd rather use the latest OpenClaw.

OpenClaw was soon sold to OpenAI. Why do you think that is —
why did this very good product company end up going to a model company? Does it mean that without a model, making products is still difficult?

To me, those are just two things that must be deeply coupled. The good thing is that OpenClaw's open source hasn't changed, so you can still take this framework and work together on designing a better Agent architecture. The thing itself hasn't changed — it's still possible to do this; it's just that some people's position has changed. That change of position may be a good thing or a bad thing, but in short it doesn't affect the open-source nature of OpenClaw itself. At the very least, the possibility of this kind of collective evolution — the spark of it — is preserved. That's good.

Then let's talk about models. The last time we chatted was after your V2 Flash was released;
this time you've shipped three more new models — Pro, Omni, and TTS — and you call it a silent ambush. Why do you say that? Why silent, and why an ambush?

First, these three models' performance in Agent scenarios improved so quickly — or rather, we caught up so quickly, and their performance inside complex Agent architectures is so stable — that it exceeded our own expectations.
It wasn't that we had planned it all that carefully; rather, we all woke up at once, and then it broke out. Right — that's the context. So it really was quiet. I say quiet because the outside world didn't know; it was something that happened quickly inside our own team. The second part is that a year ago, when we set out to do all these modalities, the reasoning was: if intelligence is truly to emerge, it should be all-around, multifaceted — so we worked on multimodal understanding. And the intelligence you produce must ultimately generate value, which requires interaction — so we had to do speech generation. But when I was building these things a year ago, the picture wasn't clear yet. You were still building single models — a dynamic-understanding model here, a speech-generation model there — and you couldn't see how they would be well organized and put together. Then when I saw OpenClaw and worked with it myself, the picture formed: what role does each of these models play in the chain? How can they be orchestrated effectively? And what enormous ecological value would that produce? Suddenly it all cleared up in my mind. So we quickly had every direction face this paradigm and do targeted post-training design for it. That's the reason. So now, if you use these models together in OpenClaw or Claude Code, you may find they work well strung together — better, or at least easier to use, than other models. Or at least that's our goal going forward.
Then why are there still three models behind this, rather than one combined model?

I think it's mostly a consideration of cost, speed, and price. For speech generation, you don't have to use a 1T model — nor could you accept its latency. For multimodal understanding, is a larger model worth it? These questions all have to be asked. And because I think the Agent revolution is essentially about productivity — you must be productive enough — in the end you have to watch the end-to-end completion rate and the cost-efficiency behind it. Those are some of the reasons the three models ship in parallel for now. And behind that, we also have plans for making the three models collaborate better.

Right — that's what you call orchestration. What know-how do you have now about orchestrating them better?
First, look at the task type. The simplest, most common task types can actually be done with the language model alone. But because whole tasks are now so long, when you realize partway through that you need something like a human sensory capability, you call another, more specialized model and let it do that part better. And because these three models were trained in the same ecology, you know their shared background — does it know what I know? Yes, it does — so you can confidently let a model do the tasks you believe it can accomplish, without worrying that it lacks the background knowledge you have. That shared background comes from pre-training.

Right now, how big is the gap between the potential your three models release inside the same Agent framework and what I'd get by assembling other models with yours?

It's very small — very small at the moment. But I don't think that holds for the future, because in the future it's a product of effectiveness, cost, and efficiency. For now, though, yes, the coupling advantage looks weak.

What are you betting on with these three models?
What is the relationship between these three models?

I think the goal is for them to vicariously take over life and work in all aspects — every aspect — so you must have all these abilities. Pro, as I see it, is about understanding and cognition, doing the more complex scheduling. Omni does perception. TTS does sound output — it is expression.

Do these three add up to a human-like intelligence?

In any case, they model the inputs and outputs of the intelligence people display. But does it have the fine-grained synergy between the senses that a person has? I don't think that's done yet. And that's not purely a model problem — it also isn't done at the framework level. For example, OpenClaw's understanding and modeling of video is still very poor across its whole architecture. Why? Because the open source community hasn't yet produced a very good open-source model that jointly understands audio and video and also has strong agent capabilities. There's no such model, so the framework's development there lags behind. So today, when it handles video, it falls back to understanding frames as images — and in the end it even falls back to understanding captions, down to a text-only level of intelligence.
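That degradation chain can be sketched as a simple fallback router — the handler names here are illustrative, not OpenClaw internals:

```python
# Sketch: a framework degrades rich media step by step when no strong
# handler for that modality is available, ending at plain text.
def understand(media, handlers):
    """Try the richest available handler first, then degrade."""
    for modality in ("video", "image", "caption"):
        fn = handlers.get(modality)
        if fn is not None:
            return modality, fn(media)
    # Nothing richer available: text-only level of intelligence.
    return "text", str(media)

# Only a caption model is available, as in today's frameworks:
handlers = {"caption": lambda m: "caption of " + m}
print(understand("clip.mp4", handlers))  # -> ('caption', 'caption of clip.mp4')
```

The point of the passage is that until a strong open-source joint audio-video model exists, the `video` slot in such a router stays empty, so every framework built on open models lands on the lower rungs.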
That's also why we open source: only after the open source community sees models with stronger video understanding, and models for voice and sound generation, will the corresponding frameworks change. So there are really just these two layers: the framework, which is the coordination layer, and the model, which is the layer of the intelligent center. When those two layers blend well, human-like intelligence becomes possible.

Then let's focus on V2 and go through the models one by one. Flash has been out for a while — you released it on December 16 last year. When you were building Flash, what was your core positioning for it?
It's considered by everyone to be your first work at Xiaomi.

My first work — actually, Flash and Pro were basically trained at the same time, and the structures of the two models are very similar. When we designed the V2 series — both MiMo V2 Flash and Pro — the model architecture had one crucial goal: we wanted to design the model structure for long-context efficiency. At the time I had a vague premonition that in the Agent era long context would be very important — or a premonition that long context would produce intelligence — but we didn't foresee what came next, the OpenClaw form. Still, I already sensed that long context had to be a very important question, and that long-context quality together with inference efficiency — cost low enough, speed fast enough — is the eternal proposition this generation of model structure must pursue. Because if your cost is low enough and you're fast enough, you could take a 1M context to 10M, even 100M. In fact, all current model structures can train up to around the 100M-context range. Then why doesn't anyone serve this model for 100M-context inference? Beyond diluted quality, I think it's mostly cost: 100M is just too expensive — from 1M up through the 100M range it's so expensive you don't even want to use it. So we designed the structure of this Hybrid Attention around exactly that core goal. At the time there was another, more mainstream option: MLA (Multi-head Latent Attention). Among the teams that started training around the same time as us — GLM-5 and Kimi; Kimi was earlier, K2 was earlier — they all chose MLA, as did DeepSeek (V2, V3, and R1 all use MLA). For the Chat era, MLA really is an excellent model structure; even for long documents it can be considered pretty good, because it significantly reduces the KV Cache, and for long inputs your KV Cache is precious.
But MLA has a few points that, I think, fundamentally don't fit the Agent paradigm. The first is that MLA was originally designed to achieve a good ratio of memory access to computation on the H-series chips of the time — to reach higher utilization without wasting compute, by breaking through the memory-access bottleneck. An architecture designed under that structure doesn't leave much room to play with. The only room left is this: we think KV Cache matters, and we think inference speed matters too — so can I use some inference-acceleration method, say speculative decoding, of which MTP is one way, to make actual inference several times faster? But you can't really add that on top of MLA, because MLA already sits at a near-perfect critical point between compute-bound and memory-bound. If you use MTP (Multi-token Prediction), you find you're pinned against the compute bound. So if you look at all the MLA model structures today — whether GLM or Kimi — my guess is none of them serve with MTP, because once it's on they become compute-bound, and being compute-bound isn't cost-effective. So their models will be slower. Everyone who tries MiMo — our first-generation Flash, even Pro — Flash can reach 100-150 TPS, and now Pro can too; it just depends on the cost.
Pro can basically reach 60-100 TPS, and 100 TPS will certainly be more expensive. So a feeling everyone using MiMo has, whether Flash or Pro, is: wow, so fast. Right — and that's an advantage brought by the structure, especially a structure made efficient for long context. At the same time its cost is low enough because of Hybrid Attention. In the Pro generation we pushed the hybrid proportions to a newer extreme: the ratio of Sliding Window layers to Full Attention layers is more aggressive, reaching 7:1, which saves KV Cache. So in this generation we basically used Sliding Window layers to cut KV Cache, making the model more effective on long inputs and able to support longer contexts; and at the same time, the Attention compute saved by Sliding Window Attention gets filled back in by MTP. That way, in actual inference, you strike a good balance between memory access and computation while simultaneously taking care of long-context cost and inference speed. So although we weren't thinking about all of this when we designed the model, it turned out to be almost perfectly suited to being an Agent model. For an Agent, long context is critical, and a very small KV Cache is also critical: with a very small KV Cache you can do more multi-level caching, and cache hits help a lot in saving inference cost. And then speed is a pretty critical proposition: once you've experienced a faster model of comparable intelligence, you can't go back to the slower one.
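As a rough illustration of why the hybrid ratio shrinks the KV Cache, here is a back-of-envelope sizing sketch. The layer count, window size, and the reading of the ratio as 7 sliding-window layers per full-attention layer are assumptions for illustration, not MiMo's actual configuration:

```python
# Back-of-envelope KV-cache sizing for a hybrid-attention stack.
def kv_cache_tokens(n_layers, seq_len, window, sliding_to_full=7):
    """Cached positions across layers (per head, per K/V tensor)."""
    full = n_layers // (sliding_to_full + 1)
    sliding = n_layers - full
    # Full layers cache the whole sequence; sliding layers only a window.
    return full * seq_len + sliding * min(window, seq_len)

hybrid = kv_cache_tokens(48, 1_000_000, 4096)
full_only = 48 * 1_000_000
print(f"hybrid/full KV ratio: {hybrid / full_only:.3f}")
```

Under these toy numbers the hybrid stack caches roughly an eighth of what a full-attention stack would at a 1M-token context, which is the kind of saving that makes multi-level caching and cache hits cheap.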
Yes. So that's roughly the context in which MiMo V2 Flash and Pro started training at the same time.

Why did you choose MTP back then?

The choice of MTP was really quite a posteriori. The model had nearly reached the mid-to-late stage of training, and we started designing an inference plan for it. When we actually designed the parallel inference scheme on our own generation of inference cards, we found there was too much compute left over — we hadn't expected so much to be left. So the question became how to use that remaining compute effectively, and MTP fits that perfectly. The reason we add MTP in the pre-training stage, though, is that it genuinely improves the base model's capabilities — that part is the same as DeepSeek: MTP goes into pre-training because it improves base capability. Why are we about the only ones using MTP at inference time? Because our model structure naturally leaves a lot of compute margin. It wasn't that one day it suddenly dawned on us when designing the inference architecture; you have to carefully work through the numbers on every aspect of inference, and then you know — you can spend the remaining compute on speculative decoding, and it just so happened we had trained MTP, so it came in handy. It was really a natural extension of the exploration.
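Mechanically, MTP at inference works like speculative decoding: extra heads draft a few tokens ahead, and drafts are kept only where the main model's own prediction agrees. A greatly simplified greedy sketch (generic speculative decoding, not MiMo's actual implementation):

```python
# Sketch: drafted tokens survive only while the "main model" agrees.
def accept_draft(draft_tokens, verify_fn, prefix):
    """Keep the longest prefix of drafted tokens the verifier agrees with."""
    accepted = []
    for tok in draft_tokens:
        # The main model re-predicts given everything accepted so far;
        # a drafted token survives only if the prediction matches exactly.
        if verify_fn(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first mismatch: the rest of the draft is discarded
    return accepted

# Toy "main model": deterministically counts upward from the last token.
verify = lambda ctx: (ctx[-1] + 1) if ctx else 0
print(accept_draft([1, 2, 99, 4], verify, [0]))  # -> [1, 2]
```

Because every accepted token is one the main model would have emitted anyway, the scheme trades spare compute for speed without changing the output distribution — which is also why, as discussed later, it introduces no hallucination.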
Why hasn't it become mainstream yet?

Everyone trusts MLA too much. And that's because MLA really is too clever: within its model structure, everything that can be squeezed has been squeezed to the limit. So in the first half of 2025, if you wanted to train a base model, MLA was indeed a good choice — especially before you saw the value of long context and the Agent paradigm, MLA was a very good choice.

Will it become mainstream in the future, do you think?

I think not — MLA probably won't.

I meant MTP — will MTP become mainstream?

That depends on what the next-generation model structure looks like.
I think there are roughly two trends in model-structure design right now. One is that already in the pre-training stage, you genuinely want to understand the scenario behind your inference: what card do you want to serve on? In what context will you serve? What is your inference parallelism scheme? You may even need to figure all of that out, and then design a structure perfect for that scene, that serving method, and that chip. If you train that structure, its efficiency and cost will be optimal in every respect. MLA was designed under exactly that kind of context. But that context rests on two premises. One premise is that Post Train doesn't matter much, or that the Post Train period is very short — as long as you can finish Post Train in a month and spend most of your time on Pre Train, then optimizing for Pre Train plus serving is enough. The second is that your inference cards stay fixed: you always serve on those one or two cards.
Ideally just that one. But right now great changes have taken place in both premises, because the cycle of Post Train is lengthening. The upper limit of what Post Train can do per model generation has been pushed far out. When did that happen? It was brought by the Agent paradigm. If you have a more efficient context — the longer the context your model can take in and understand, the higher its potential — isn't longer context itself a way of producing intelligence? This is completely different from the original Chat paradigm, where the context is whatever a person types, and what a person types is very short, so it relies almost purely on pre-training. But the Agent paradigm leans heavily on Post Train: does the model understand the framework? How do you get to so-called multi-Agent collaboration, however messy? In any case, you can assume the compute we'll invest in Post Train may become comparable to Pre Train. So your time horizon is stretching. And in a scenario where you do Post Train for half a year or a year,
many things you assumed in the first half may turn out invalid. Which card did you assume you'd serve on? In what scenario did you assume you'd get better results? All of it can be invalidated, because after half a year or a year of Post Train you may find those scenes have completely changed. You may once have thought 128K context was enough, and now everyone thinks — in a few months everyone may think — I need a 10M context. It's that kind of logic. So if you still design the model structure the old way, it may lose some agility. That said, if a team's Post Train is efficient enough to keep up, and its Post Train insight can help Pre Train make the right architectural judgments, then that mode can still work: design a finely tuned structure, think through the inference card types, think through the scene — it should still work. But there's another way to do structure, which is mine: keep the structure simpler, and leave more margin for subsequent adaptation and strengthening in different scenarios. For example, I think Hybrid Attention is a simpler structure.
Its simplicity shows in the fact that you can pair it with MTP to make fuller use of the spare compute, and later you can even continue training a Hybrid structure to adjust the Sparse-to-Full ratio, and so on. In any case, with a simpler architecture like this, there's more room for your Agent to play — more room for further training.
Where does the cost reduction come from — what advantages does MTP's multi-token prediction bring?

Well, with MTP, if its hit rate is high, it brings cost down: effectively it emits more tokens within a shorter period of time, so GPU utilization is higher, and the cost of generating a single token essentially falls.

You've mentioned MTP's many benefits — does it bring hallucinations along with them?

No. Because MTP's drafts are verified — a predicted token is adopted only when the prediction is accurate — it introduces no hallucination.

Good. What we just discussed was the MTP used in Flash. You also made some other choices — for example, the hybrid attention mechanism: at that time you chose five
sliding-window layers paired with the global attention mechanism — and this time you've actually changed it again.

It should be put this way. One rough conclusion from a large number of experiments is that the number of Full Attention layers matters a great deal, but there's room in the ratio coefficient. So on a larger model with more total layers, you can keep the total count of Full Attention layers unchanged while adding more Sliding Window Attention layers — the layer count may matter more than the coefficient ratio. In other words, at a larger parameter scale, when the Attention heads are larger, we can also go to a sparser ratio. It's really the union of those two conclusions: on a bigger model we can push a higher sparsity ratio between Full and Sliding. We've been doing a lot of Sparse research recently and also found that larger models can absorb a sparser Attention ratio — for larger models you can be sparser, but if your small model is too sparse, its quality drops very seriously. So this is an experimental result, not a fixed standard. It's our experimental conclusion; I'm not sure whether other companies will follow the same path, though they'd likely reach the same experimental conclusion.

I think Flash is very much early-Xiaomi in style — this pursuit of ultimate cost-effectiveness. How was that achieved?
Your API at the time was priced at $1.01 per million input tokens and $0.3 per million output tokens — it seemed to be the lowest price, at the top speed, at the time. Do you think you made the right calls then to achieve that?
Basically, we did everything that needed doing: an architecture highly efficient for long context, plus MTP for further acceleration, plus getting the most basic Infra work solid — with those, you can basically hit that price. In the pre-training era, pricing according to your model's architectural advantages is indeed reasonable: your architecture is strong, the end user can feel it, and it really is purely because your model is strong — so pricing off the model architecture makes sense. But once we enter the post-training paradigm, the logic changes: besides the advantages of the model architecture itself, what matters is whether your context is good and whether your model is good at understanding context. So I think the pricing logic should change. It shouldn't be priced off my final inference cost; it should be priced off the final value my model generates. That value comes not only from the architecture's advantages but also from the model having done a good enough job in post-training that it understands the Agent framework well — that's where the premium space in pricing lives. So with MiMo V2 Pro, we actually abandoned the old pricing logic.

I saw in your tech blog — with Flash, did you want to bet on Reasoning, Coding, and Agentic from the beginning?
When doing the structure, we bet on just one point: long context — the modeling quality has to be good enough and the inference efficiency high enough. Just that one point; we didn't think about anything else. At least at the time, we couldn't come up with more goals for pre-training than that. I don't think an architecture's goals should be too complex: too many goals mean too many constraints, and if, by the time your long Post Train finishes, those constraints have become meaningless restrictions, wouldn't your structure have been designed in vain? So we deliberately didn't load more goals onto the model structure at the start — I think adding more would have been unreasonable.

Did Flash help you verify it?
Flash verified that our entire Infra and data pipeline had no problems. But it's not that we trained Flash first and then Pro; they were trained together. Flash was just a relatively small job, so it finished relatively early. Not that early either; we released it after training. So what you saw, right, wasn't planned early. Basically most of our model training was conducted in the second half of last year.
Going from Flash to Pro, what were your expectations for Pro?
Of course they were on the same timeline; the two models were trained simultaneously. We believed this generation of architecture was fine; it's just that in the middle of training Pro we solved a lot of problems, for example the instability of training values, a challenge commonly encountered with models at the 1T parameter scale. You train and train and the loss diverges; you train and train and some activation value blows up, and then you have to figure out how to handle it. Or some experts' routing distribution becomes very extreme, like a seesaw: one moment a batch of tokens is all sent to one expert, the next moment another batch is all routed to a different expert. These signals are very dangerous; they lead to training problems in the loss. Typically you get a lot of spikes, and the expert load becomes very uneven. When you train a larger model, you spend a lot of time solving problems like these. So it looked like synchronized training, but Pro's progress was a little slower than Flash's, because you have to solve these instabilities along the way, and numerical instability is just the symptom.
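The "seesaw" routing failure described here can be made concrete with a tiny monitor. This is a minimal illustrative sketch, not MiMo's actual monitoring code; the function name and thresholds are invented for the example:

```python
import numpy as np

def expert_load_stats(expert_ids: np.ndarray, num_experts: int):
    """Summarize how evenly a batch of tokens is routed across experts.

    expert_ids: 1-D array giving the expert chosen for each token (top-1 routing).
    Returns (per-expert load fractions, max/mean load ratio, normalized entropy);
    the ratio is 1.0 and the entropy 1.0 when routing is perfectly uniform.
    """
    counts = np.bincount(expert_ids, minlength=num_experts)
    load = counts / max(counts.sum(), 1)
    max_over_mean = load.max() * num_experts
    p = load[load > 0]
    entropy = float(-(p * np.log(p)).sum() / np.log(num_experts))
    return load, float(max_over_mean), entropy

# A "seesaw" batch: almost every token routed to expert 0.
ids = np.array([0] * 98 + [1, 2])
load, ratio, ent = expert_load_stats(ids, num_experts=8)
assert ratio > 5.0   # one expert carries far more than its fair share
assert ent < 0.2     # entropy collapses when routing becomes extreme
```

Tracking two scalars like these per layer per step is enough to catch the collapse described above: an expert that stops receiving tokens shows up as a load fraction pinned to zero.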
Solving these things really exercises a team's ability to debug Infra and algorithms together. Sometimes you even suspect one of the cards has a problem. And when every card has been checked and nothing is found, you start wondering whether it's sunspots taking revenge today. You end up suspecting very metaphysical causes, and you have to dig from the surface all the way down to very low-level factors.

You just mentioned that a 1T model might be a ticket to future competition. Is that right?

It's a ticket to Agent capability close to the Claude Opus 4.6 level.

There was no such model when you started the project.
Right. Why did you already believe it had to be 1T?

First of all, I had already trained DeepSeek V3, a model of more than 600-700B. You wouldn't want to train the same size again; if you believe in the direction, you should definitely take the next step of scaling. So 1T was a limit derived from the number of cards we had at the time.

How many cards was that?

A few thousand cards, anyway.

And how many cards does it take to train this model?

Actually we had to invest a lot of cards in research, so the cards spent on research were several times the cards spent on the official training run. For runs like MiMo V2 Pro and Flash, a few thousand cards per run may be enough, but the cards you invest in researching the model will be many times that; I think 3 to 5 times is a good range. That covers the early architecture research and then the research on many Post Train algorithms in the middle and later stages. So it's not that a few thousand cards are enough for all of this. Rather, in terms of card resources and reserves, at present, especially under the Agent paradigm, the number of cards has become a very important bottleneck, because the cycle from an idea being born to hands-on coding has become too fast.

Too fast. So what are you stuck on now?

Stuck on cards, because GPU efficiency is what it is: to verify an idea you have to run an experiment, and you want to run many experiments in parallel, so you're stuck at the card bottleneck. The cards have become a critical constraint. And that's just for training; for inference, cards are even more critical, and the demand for inference cards is much higher than for training.

Training, inference, and experimentation: ideally what should the ratio between these be?

Inference depends on the number of users, or rather on the tokens consumed by high-value scenarios, and those vary a lot from case to case.
Yes. So setting inference aside and breaking it down: for research, Pre Train, and Post Train, I think a reasonable card ratio is roughly 3:1:1. The computing power invested in Pre Train and Post Train should be comparable, and the share for research should be at least a bit more than the total cards used for official training; you keep extra cards for research.

What was the ratio of pre-training to post-training in the past?

In the Chat era it was something very lopsided, maybe 3:1 or even 5:1 between pre-training and post-training. This year a big change has probably happened; there should be many teams at 1:1 now.
The top teams should all be at 1:1.

What are the challenges of training a 1T model?

I think it's a challenge across the board. Data is actually not the biggest one, because a larger model seems more tolerant of dirtier data. But I'm not sure, because we trained on the same batch of data, so I can't say for certain; it just looks that way. The main challenge is what you do when you hit problems during training. The first step is finding the problem. For example, many teams treat loss spikes as a normal thing, but we try to have no loss spikes at all. We believe a loss spike means some step's update was particularly unstable: some values have large outliers that can kill some parameters outright, or beat some expert to death, and once the parameters are updated that way, no tokens will ever be routed to that expert again.

So you need a very tight monitoring system; you have to see what changes are happening inside the model's parameters. At minimum you want to look at the expert load, look at each layer's parameters, their inputs and outputs, whether any activation value has become abnormal. These are the things you should examine after a loss spike occurs, but perhaps not every team looks in such detail. That's the first step, finding the problem; many people don't even regard it as a problem.
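As a rough illustration of the kind of tight monitoring described here, one could flag layers whose activations contain extreme outliers before the loss ever spikes. A minimal sketch with hypothetical names and thresholds, not the team's actual system:

```python
import numpy as np

def activation_report(acts: dict, z_thresh: float = 6.0):
    """Flag layers whose activations contain extreme outliers.

    acts: mapping from layer name to that layer's output activations.
    A layer is flagged when its max |value| exceeds z_thresh standard
    deviations from the mean -- a cheap proxy for "this update is about
    to go unstable", checked before the loss ever spikes.
    """
    report = {}
    for name, a in acts.items():
        mu, sigma = a.mean(), a.std() + 1e-8
        z = np.abs((a - mu) / sigma).max()
        report[name] = {"max_abs": float(np.abs(a).max()),
                        "max_z": float(z),
                        "suspicious": bool(z > z_thresh)}
    return report

rng = np.random.default_rng(0)
healthy = rng.normal(size=4096)
blown_up = healthy.copy()
blown_up[0] = 1e4                  # one activation exploding
rep = activation_report({"layer_7": healthy, "layer_23": blown_up})
assert not rep["layer_7"]["suspicious"]
assert rep["layer_23"]["suspicious"]
```

In a real run statistics like these would be logged per layer per step alongside the loss, so a spike can be traced back to the layer that went abnormal first.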
Once you've found the problem, you can think about what's causing it. Maybe the sparsity ratio is too high. Maybe the output of the Full Attention layers differs enormously in numerical scale from the output of the Sliding Window layers, and that large gap causes numerical instability. It may be a structural cause, or it could be purely an Infra bug, say a wrongly written communication operator. We even traced problems down to a particular norm in the end. And sometimes the truly last-resort fix is: you find this layer's values are too large, so you just clip them, or suppress them through a norm. There are many remedies; pressing values down through a norm will, I think, definitely hurt model quality somewhat. Clipping is one way. For example, we also borrowed Kimi's QK-Clip method: when some QK logits become very large and really affect training stability, there's nothing else you can do, you just clip them, and at least training goes better.
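The QK-Clip idea can be sketched roughly as follows. This is a simplified single-head toy based on the public description of Kimi's method (rescale the query/key projections when the largest pre-softmax logit exceeds a threshold); the exact mechanics in Kimi K2 or MiMo may differ:

```python
import numpy as np

def qk_clip(Wq: np.ndarray, Wk: np.ndarray, X: np.ndarray, tau: float = 100.0):
    """Simplified single-head sketch of the QK-Clip idea.

    If the largest pre-softmax attention logit exceeds tau, rescale the
    query/key projection weights by sqrt(tau / max_logit) so the logits
    are pulled back under the threshold while Q.K products keep their
    relative ordering. Returns (possibly rescaled) Wq, Wk.
    """
    d = Wq.shape[1]
    Q, K = X @ Wq, X @ Wk
    logits = Q @ K.T / np.sqrt(d)
    m = np.abs(logits).max()
    if m > tau:
        gamma = np.sqrt(tau / m)
        Wq, Wk = Wq * gamma, Wk * gamma
    return Wq, Wk

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 16))
Wq = rng.normal(size=(16, 16)) * 10.0   # deliberately blown-up projections
Wk = rng.normal(size=(16, 16)) * 10.0
Wq2, Wk2 = qk_clip(Wq, Wk, X, tau=100.0)
logits = (X @ Wq2) @ (X @ Wk2).T / 4.0  # sqrt(16) == 4
assert np.abs(logits).max() <= 100.0 + 1e-6
```

Scaling both projections by gamma shrinks every logit by gamma squared, so the maximum lands exactly at tau while the relative attention pattern is preserved.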
Then once it reaches a steady state, you release the clip again. There are too many ad-hoc situations like this: you have to find the problem, solve it, and even reverse-engineer the possible paths of analysis for what caused it. These paths really test a team. In a large company this would be cross-team collaboration, and that collaboration is extremely inefficient. In a small, startup-style team, what's tested is the degree of cooperation between a few people, which is much more efficient. And if everyone in the environment pursues perfection, pursues the ultimate, then nobody can tolerate what just happened, and we stop the experiment to find the problem.

Which kind are you?

We definitely belong to the very extreme type of small team. But this lengthens the training cycle; the pre-training certainly won't finish in a month or two. And when the cycle lengthens, if you have a very clear deadline or target you can't bear it: with a few thousand cards, stopping for one day definitely costs one or two million, two or three million. How could you bear a loss like that? A goal-driven team might feel that stopping for two or three weeks to sort out something that may not even be a problem, with unknown impact on the final model, is unacceptable. But for us, if we think it's a problem, we should solve it, because we don't have a hard target saying I must release by a certain date.

You have no deadline?

We have no deadline. When the model is trained well, we release it.

Is there no pressure from the company?

No.

Because you're not a startup company, right?

Not a startup company, but in practice it is an entrepreneurial team. I think both MiMo and MiClaw operate in an entrepreneurial way.
So it can be done.

How difficult is training a 1T model for you? Does the difficulty rise exponentially?

No.

Is managing the team equally difficult?

No. There isn't much management to speak of, because we all solve the problems together. Right, you don't need to manage these people: let's solve this problem together, and everyone has their own way of solving it. If anything, demonstrating problem-solving by example is itself a very good culture and orientation.

How big is the team training the 1T model?

Very small, if I'm only talking about the training itself. But you still need to do data, so that's another few people.

Do you need anything else?

Maybe one more thing: a very good infrastructure team, which you can understand as the people who set up the cluster's cards.

Is that the Infra team?

No, this is the infrastructure team. This team needs somewhat experienced people, because without experience you make low-level mistakes; you need experience in basic operations and maintenance of the facilities.

So what's your secret?

Essentially, I don't think this requires a very large team. Teamwork certainly has advantages: with huge resources everyone can explore in parallel, which is good for research. But for the scenario I just described, spotting a suspicious issue and then digging into the cause and solving it, I don't think a large team is an advantage for the problems you hit during model training. On the contrary, a large team may be a disadvantage.

While training this model, how was your team organized?
The setup was: the people originally doing pre-training and data would also do post-training, and the people doing Infra, the training framework, and inference Infra all pitched in together to solve problems during training.

Why this change, having the pre-training people do post-training?

Many reasons. A big part is that post-training requires good data intuition, which matters a lot. The second is personal interest and hobby; it wasn't really what you described, "we need people here, you move over." No, for most people it happened naturally.

But can you predict who will make such a move?

Many of the abilities and traits involved are shared, for example intuition about data, or designing algorithms backwards from the model's behavior, which is actually a lot of what I do when working with data. Mainly, I think, our role definitions are not so rigid, so most people naturally shift with the training stages and freely choose something more imaginative for the next stage.

Are there 100 of you now? I remember from our last conversation.

There are 100 people now, but those 100 cover the whole pipeline: data collection, data quality, Pre Train, Infra, Post Train, then development, we need some development, plus our products and our data. There are also algorithm engineers in three directions, basically language, multimodality, and speech. Among those 100 the proportion of interns is very high, and some interns may be working on things that won't show up immediately, next-generation model architecture and next-generation model capabilities. The people truly devoted to iterating the current-generation model are very few: across the whole pipeline maybe only twenty or thirty, thirty or forty at most, relatively evenly distributed, and there are no groups.

Are there divisions into different groups?
No, no groups.

So it's you facing 100 people directly?

Almost.

Why no group division? For example, why not a pre-training group and a post-training group?

Because many people are interested in both directions. If your groups are divided very clearly and fixedly, you actually kill part of the creativity; in other words, you kill people's future growth space. Second, for post-training, a very important paradigm change right now is that it needs a perspective of diversity. A lot of post-training people work from a single scenario and don't have that diversity of vision. But a pre-trainer's first concern is diversity, because you can't just stuff a small amount of data into the model; it builds diversity with better data. So people coming from pre-training have a great advantage in post-training: they naturally care more about diversity. It's a great complement. Of course there are also people who only do post-training, for example those who only study reinforcement learning, so they keep doing post-training, or call it Mid Train, whatever. At least we won't carve the organizational structure up by scenario; I think that stifles creativity.

That's unusual. In my view, without groups there are no leaders. Is there actually someone who pushes a project forward?

Yes, but it's all very loose. For a given project, say the pre-training or the post-training of MiMo V2 Pro, there may be someone who actually pushes it forward, but that person doesn't have absolute control over the people participating.

There's no rank?

You can think of it that way. Xiaomi itself has ranks, but our team's organizational structure is completely decoupled from them: no ranks.

For doing AI and training a large model, what does having no ranks and no groups mean for the emergence of intelligence itself?
I think equality itself is valuable: equal standing helps everyone contribute their own creativity and wisdom equally. Any hierarchy regulates and constrains to some degree, and those norms and constraints, I personally think, suppress creativity. Besides, once there are levels, the default assumption is that people at higher levels are more intelligent than everyone below them, which is a very strange definition; I don't think that's likely to hold. So we are flatter. And especially the most important leader shouldn't want a particularly strong sense of control. If that feeling becomes something you can't do without, I think it works against building an innovative team.

You say there's no management, but how do you actually manage?

Passion-driven management. I think this is very important; it's the most effective way I've found: choose to inspire everyone's enthusiasm, and let everyone do things on their own around what they want to believe in. That's what I've always believed is the most effective way to manage.

What are your methods for driving enthusiasm?

Letting everyone experience a new thing first-hand is a very important way to drive enthusiasm. OpenClaw, for example, is that kind of experience.

You seem to have used a very extreme line: if you don't do 100 rounds of dialogue, you resign tomorrow. Very extreme, but your purpose was the experience, right?

There's no final exam, and nobody leaves the next day. During the assessment everyone asked, is this really useful? I think that's not important; I only care whether you actually used it. Whether it really reached 100 rounds, that's just a number.

What else do you use besides experience to drive people's passion?

It's also critical when screening people. You can tell a lot from a person's past experience: what kind of goals they organized their work around. People who do things driven by love have very distinctive characteristics; when you chat with them, you can feel it. Some people work around many strange goals, but with people driven by love, it's very obvious.

How obvious? Do you have quantitative indicators?

No quantitative indicators, but when I chat with someone I can feel it directly.
I can just sense it.

Were there any failures while training this 1T model?

It depends on how you define failure. For example, the loss trains to a point and just flies off. That happened a few times along the way; I don't remember exactly how many now, but anyway two or three times the loss just flew away. Though, for example, after a few hundred more steps it would come back down.

In that situation, should you stop and solve it, or keep training forward?

Solve it. We think you should stop and solve the problem. In fact it went like this: a few hundred steps later it happened again, and that's when we decided we should stop, resolve it, and let the loss pass through smoothly.

How long do you usually stop?

Hard to say. It could be a few days, it could be a week or two; the longest was two weeks.

Were you anxious stopping for two weeks?

Not anxious, because we don't have a deadline. But of course, with that many cards you run a pile of experiments every day: today you check one hypothesis, feel you've found the reason, change it quickly, start running again, and the next day, or that night, it looks the same again. Then I can't sleep well; I often dream at night about why the loss spiked again. So although we don't have a clear time point, you still break down a bit; there are many frustrating moments. And although there's no hard deadline, the resources are finite, right? You feel you may have wasted computing resources on useless experiments, and you criticize yourself for it.

So can the number of parameters determine the upper limit of intelligence?
Is a bigger model better?

My current view is that parameter count and context together determine it. But at least to reach what everyone currently regards as the strongest Agent level, I think it takes a parameter scale above 1T; only then can it feel close to Opus 4.6. I don't know exactly how big; I just think it takes at least 1T. Of course with total parameters above 1T, what's more critical is the activation parameters. But the larger the activation parameters, for a total of say 1T, the higher the inference cost. So it's a trade-off.
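The trade-off can be put in back-of-the-envelope numbers. A common rule of thumb is roughly 2 FLOPs per active parameter per decoded token; the figures below are hypothetical, chosen only to illustrate why activation parameters rather than total parameters drive serving cost:

```python
def decode_flops_per_token(active_params: float) -> float:
    """Rough rule of thumb: ~2 FLOPs per active parameter per decoded
    token (one multiply + one add). Total parameters don't appear here,
    which is why a sparse 1T-total model can still be cheap to serve."""
    return 2.0 * active_params

# Hypothetical sizes for illustration only (not MiMo's actual numbers):
dense_1t = decode_flops_per_token(1.0e12)  # dense 1T: every param active
sparse = decode_flops_per_token(50e9)      # MoE with 50B active params
assert dense_1t / sparse == 20.0           # 20x cheaper per decoded token
```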
Why did you increase the ratio from 5:1 to 7:1?

That's the hybrid attention mechanism: we pursue a more extreme sparsity ratio between Full Attention and Sliding Window layers. The main reason is that we want a more efficient long context inside a larger architecture. In a larger architecture, if you keep the same proportion of Full Attention layers, then as total parameters grow the number of Full Attention layers grows too, and long-text processing becomes very costly. But if you expand the parameter count while keeping the number of Full Attention layers unchanged, the long-context efficiency of the two generations, Pro and Flash, stays almost the same while Pro's intelligence ceiling is raised. So at equivalent long-text efficiency we want to scale the model's ceiling; we focus on holding the efficiency constant. And in the Agent era that buys something even more valuable: since this larger model is very efficient on long text, I can stuff in more context, and then it becomes stronger again. That's roughly the background of the decision. So for this 1T base, the first decision was the hybrid attention mechanism we just discussed.
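One way to picture the 7:1 interleaving: seven Sliding Window layers for every Full Attention layer. The layout below is a hypothetical sketch; the actual placement of the full layers in MiMo V2 isn't disclosed:

```python
def hybrid_layer_pattern(num_layers: int, sw_per_full: int = 7):
    """Lay out a hybrid attention stack: `sw_per_full` Sliding Window
    layers for every Full Attention layer (the 7:1 ratio discussed
    above). Returns a list like ["sw", ..., "full", "sw", ...]."""
    pattern = []
    for i in range(num_layers):
        pattern.append("full" if (i + 1) % (sw_per_full + 1) == 0 else "sw")
    return pattern

layers = hybrid_layer_pattern(16, sw_per_full=7)
assert layers.count("full") == 2          # only 2 of 16 layers attend globally
assert layers[7] == layers[15] == "full"  # one full layer per 8-layer block
```

The point of the ratio is visible here: growing `num_layers` while holding the full-layer count (rather than the proportion) fixed is what keeps long-context cost flat as the model scales.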
Then there's the 1M context window. Was anything lost in doing that? Does 1M require training?

Long context still needs training. The crux is: where do you actually get dense enough supervision signals inside a one-million-token context window? Such data is hard to find, or rather the cost of constructing it is very high. Look at it from the limit: if you had, say, 1T tokens that were all genuinely real long contexts, then I think the model's 1M capability could definitely be trained. As long as the loss keeps decreasing, the model is compressing it, so it can definitely be trained. But the crux is that it's hard to construct genuinely real 1M contexts at 1T-token scale; either it's too hard, or the cost is too high, or such scenarios are hard to find. That's the crux, so the quality of this long context improves only slowly.
Slowly improving. And the third point is MTP, a continuation from Flash. Has it changed?

Nothing much changed. During pre-training we train the first MTP layer; then during Mid Train we train more layers. Pre-training one layer is to improve the base model's capability, and Post Train later trains more layers so that at inference time more layers can be used to achieve better inference acceleration.
It is to use more layers when reasoning Achieve better inference acceleration Yes Technical points about Pro Except for the three just now I haven't missed anything. That's almost it.
I think the conversation has been very thorough.
So Pro started a few months ago What about the other two?
Basically the same period Oh, I started planning it at that time Yes Actually Probably three directions To advance further at the same time Pro plus Omni plus TTS Like the entire V2 family It points to multi-modal narrative But its mode is very different Text is a discrete token The picture is a matrix of pixels Audio is a waveform
How do you do this fusion?
Actually we still want to Try to unify it under the paradigm of language model So So at least in audio modeling We want to discretize it Becomes the ID of a discrete token that is the same as text then So we're on audio It's about this matter A lot of research has been put into it Computing power talks about how to model discrete audio And this
We want the modeling of this discrete audio Try to do it A lossless discretization Yes, because people still don’t believe it Say how you turn some continuous input into discrete Finally it can be rebuilt This matter In fact, it requires a lot More sophisticated for encoder For example We need some multi-layer RVQ (residual vector quantization) to guarantee it its Its discrete representation is a very large
High space like Dense then We need more pre-training Come and let it begin to emerge If you do based on continuous features Might pop up soon But you do it based on discrete features It will be harder for you to model Its emergence will occur later then so We know that this attempt was started in audio
It will then also migrate to the other modalities. We would rather use a more elegant architecture for the whole job of understanding multimodal input. But we're not building this architecture purely for the sake of unification. A lot of the time, if we find that, say, discretization really isn't feasible for images, we'll still go back and work within the currently more mainstream architecture, because our priority is to guarantee the model's overall, holistic level of intelligence,
rather than pursuing a unified paradigm for elegance's sake.

Is it easy to unify audio into an LLM?

Our approach should be quite different from everyone else's; in terms of technical architecture it should be very different. As far as I know, the big three abroad are good, and Doubao at home is pretty good too, but their architectures should all be completely different from ours.

Why did you choose this architecture?

It's just the obsession of people who do NLP. We may be the only audio team made up of NLP people, so we have this obsession: we just believed it, and then did it.

Can the same be done with images?

It can. We've been trying for a very long time.

So can the LLM be the unified approach?

Yes, but it's a trade-off, as I said: achieving a truly lossless reconstruction requires investing more computing power and a longer research cost. It's a trade-off. At least on audio we've gotten past it.

And images, have you gotten past that?

Images are in progress; I don't know yet whether we'll get through.

If you do get through, what will it bring? Greater imagination?
A more elegant architecture. Actually, I initially thought: if we unify everything by discretizing it, then one set of infrastructure solves it all, the same pre-training infrastructure, the same RL infrastructure, unifying all the paradigms very elegantly. So simple, if it can be done. But now I notice a problem. With Claude Code and the top models, rewriting these stacks, say rewriting an RL Infra architecture or a training Infra architecture, and recently we did completely rewrite some new architectures from scratch... I used to think writing these was quite labor-intensive and time-consuming, but now, with Agent support, the time to write them has shortened dramatically. So you actually don't need to do so much unification purely for architectural elegance, unification for unification's sake. That's the latest change, a change within the past month.

But before that you wanted unification?

Yes, the obsession came from exactly that: the NLP obsession that everything discrete is elegant, the supervision signal is clearer, you can do NTP, next token prediction, and you can reuse all the existing Infra. Wow, how cool. But looking back now, rewriting a set of Infra isn't that complicated; a few people with Claude Code can stand up a new RL framework in just two or three weeks. So why sacrifice so much of the model structure for the sake of training-Infra unification?
But in building Omni you did go down a path that's different from stitching modalities together separately, right? It's trying to build something unified.

Not really. For Omni's ViT we just built one; it's still a ViT, we only made it more efficient, turning it into a hybrid Sliding Window ViT. The representation itself didn't change much: it's still a continuous representation, with few changes.

Why do you call it omni-modal rather than multimodal?

Because it really does support video, audio, image, and text, all modalities. Some Agentic models don't support joint understanding of audio and video; this should be the first to support joint audio-video understanding while its Agentic ability reaches a level similar to the language model's.

Are there signs that omni-modal or multimodal understanding can produce intelligence?

Two months ago I believed it. Recently, during the whole process of training Omni, I've come to question it a little. But we later found some good signs. For example, MiMo V2 Omni is actually smaller than Pro, but when you actually use it you find that Omni's perception and understanding of the world, what ultimately shows up as its emotional intelligence and its reserve of world knowledge, is stronger than the larger model's, because it was natively multimodally trained. My guess is that the computing power we've scaled in pure language versus in native multimodality isn't comparable, and that may be why we haven't yet seen native multimodality deliver a big improvement in intelligence in itself. But you can feel it: a lot of world knowledge, because it has been trained on video, so it knows more; and its perception of many very subtle things, you'll find, is stronger. But these are all intangible, all from us physically testing it ourselves. Perception is stronger, yet on every benchmark it doesn't move at all, so to speak.

Is it possible the benchmarks are wrong?
Of course it's possible Of course it's possible So I won’t say exactly now Very sure to say OK You have to understand multimodal capabilities It is the final realization of the so-called AGI path.
One of the necessary paths I don't want to draw such a conclusion.
It's because I think everyone has a different definition of AGI.
And especially in a setting like Agents, where multiple models are combined into very elegantly choreographed cases, I don't think we need to go out of our way to argue whether multimodality promotes intelligence.
So whether it promotes intelligence itself isn't the point?
It's not the important question. What matters is what multimodality brings: just the two points I made. At the moment those are the only two I observe. I don't yet know whether the future will bring a new architecture, something new. I think multimodal generation may be a little different: generation may promote intelligence. If you only expand the dimensions of the model's perception, that may not do much for intelligence; but if it can generate, maybe that does. This is a guess of mine.
But generation is still an open research problem. Unifying generation and understanding in one architecture has not yet been scaled to very large compute, and most generation architectures are built purely for generation; they simply don't have the intelligence for understanding.
So what is your goal for the Omni model? What did you design it for?
So far the goal is this: it is for Agents, and an Agent needs to act, so it must be multimodal. As for the next goal, actually I don't think of it as the next goal so much as something I want to explore: when you fuse perception across multiple spaces, or even when you can generate multimodal signals, does that advance your understanding of the world? To put it more bluntly, what may simply be needed is a video generation model that interacts well with the current Agent framework.
Why didn't you disclose Omni's total and activated parameter counts?
To leave some room for imagination. Leave us some room for imagination.
We believe this parameter count may get closer to Pro's level of intelligence, even though everyone knows it is smaller than Pro.
Is that right?
And we hope the two can iterate on each other, lift each other up. That's what we hope to do.
Which is more important, Omni or Pro?
Of course Pro is more important. But it works in pure language space, which is where much of the preliminary research happens.
So what about TTS is worth paying attention to?
Our motivation for TTS was to take what we consider an elegant architecture and do something everyone does with a traditional architecture, a relatively easy task.
Oh, so you were trying that out.
Right. But after we finished, we found that once we had trained a discretized tokenizer on datasets of thousands of hours, the model's generalization was very good. I have no rigorous comparison; would a very small model really generalize worse? But at least with the current model everyone can see it: you add all kinds of stylized labels and it gets smarter. It reads the words themselves and, by inferring their surface meaning, gives the speech emotion and melody. The generality here is particularly obvious, because we only did SFT and RL on data for a few very stereotyped stylized scenes, like faster and slower, happy and sad: a stylized Post-Train. Yet we found that if you replace the style tag with a very complex natural-language description, it can follow that too. That is pure generalization. This is what we discovered in doing this: a simple architecture plus extremely large-scale training yields an outward expression of strong generalization power. But it's still early days.
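A "discretized tokenizer" for audio can be caricatured as nearest-neighbor vector quantization: continuous frames become IDs from a learned codebook, so a next-token model can then be trained on them like text. A toy sketch with random stand-in data; none of this is the actual MiMo TTS tokenizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each 20 ms audio frame is already an 8-dim feature vector.
frames = rng.normal(size=(50, 8))       # one second of "audio"
codebook = rng.normal(size=(256, 8))    # 256 learned code vectors

def tokenize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each frame the ID of its nearest codebook vector (L2 distance)."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

tokens = tokenize(frames, codebook)
# The continuous waveform is now a discrete sequence, like text tokens,
# so the same large-scale next-token training recipe can be applied to it.
print(tokens.shape)  # (50,)
```

Once speech is a token sequence, the generalization behavior described above (following unseen natural-language style descriptions) is the familiar generalization of large-scale sequence models, now expressed in audio.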
So for our TTS model, I think the results are very impressive; its upper limit is very high. What we are doing now is slowly shoring up its lower limit, because it can sometimes be unstable. So for the moment we have only opened the API, free for a limited time, for people to play with, with no guarantee it can go into production yet. But soon we will make it production-ready.
At a press conference you once drew a diagram of the road to AGI. You compared human intelligence, shaped by biological evolution, to an upright triangle, and the current development of AI to an inverted one. Do you think AI development is a castle in the air? In human evolution, language sits at the very top, but the large language model enormously magnifies that topmost layer, hence the inverted triangle. How do you think the AGI picture gets pieced together? Is what you're doing working down that triangle?
I think the evolution logic of today's large models really can't follow humans; it's completely different, and the reason is that the environments differ. People evolved by tracking changes in nature, evolving in order to survive. But the large model didn't come into being to survive in the first place.
Then what is it for?
I don't know. Does the large model have its own values yet? We forcibly assign values to it: let it replace part of what people do. But nothing dies if it fails to replace them; it has no existential crisis. And precisely because there is no existential crisis, I think it may evolve more freely: less disciplined, more creative, faster, less constrained. Its starting conditions are also extraordinarily good. It has enormous compute to use, it starts from the accumulated knowledge of humanity, and it has so many people helping it improve. The two environments are completely different, so the evolutionary paths are different.
After language, what are the next steps in the model's evolution? Or, staying within language, what comes after coding?
Coding still has one very good theme: one-shot development of genuinely complex software projects. The longer this goes on, the higher the complexity of development it can handle. Not necessarily the amount of code; the complexity just has to be higher. For example, if you write a Kron operator, the code may not be long, but you have to write it, debug it, and verify whether it actually improves training efficiency, really improves it. That verification loop can be long even when the code is short. In short, you need to participate in the real development of such projects. I think that remains a very big theme for code: the more programmers it replaces, the more that stays its main theme. From there it extends to broader productivity scenarios, which do require the help of something stronger and more interactive. Feishu, WhatsApp, Telegram are good interfaces: they lower the threshold and raise the frequency of your interaction with it. An even better form of interaction is for it to have its own body and move around, that is, a robot. A good mode of agent interaction must emerge, so it will certainly come off the screen into our real space. But the robot's own evolution is most likely bottlenecked by hardware. We also talked about this last time: the battery itself, enclosed indoor spaces like ours, even the dexterity of dexterous hands. All of these will evolve more slowly than the Agent itself; that evolution is slower than evolution in language space.
You said before that Flash was the first step toward AGI. Which step are we at now?
It feels like the journey has reached 20%.
20%? What percentage can we reach this year?
I think at least 60 or 70 percent.
Oh, then AGI is very close.
I feel it should be achievable within two years. And if it is realized within two years, most people really will abandon their original way of working. The subversion of life patterns will lag behind, because life does not produce productive value while work does, so you will feel your working model overturned first, and life next. Then you really step into a life turned upside down; maybe you just need more robots. Of course, I don't like the word AGI, since it has no clear definition, but what I mean is that the timeline has been pulled forward.
The key variable here is that AI can train AI, right? That is indeed a landmark node. Because it can self-improve?
Yes. Put it this way: it can reach the peak of a group of people's intelligence, because it can train itself and create new research. Having the ability to do new research really is the pinnacle of its self-iteration and self-study.
Will this be a core point of competition among this year's model makers?
It's hard to frame it as literally having AI train a large model, designing the tasks and targets and running the training, because that is a higher-level goal. You don't spell out how to reach it, but the models everyone finally builds will lead toward it: if you have frontier-intelligence models, you will end up doing this. It just can't be the only thing you do.
Two months ago, how far away did you think AGI still was?
At least two years. I did think that at the time.
And do you think it's within two years now? What do you make of your new generation of models, especially Pro, and how long do you think the generation gap between this generation in China and the United States will be?
In China there are already several companies with base models above 1T: Kimi, MiMo, and some others. I think the distance between these model makers and the top foreign models, taking Claude Opus 4.6 as the example, is only two or three months, if the reaction speed is fast enough. Not in the sense of catching up, after two or three months, with the Claude of two or three months later; I mean catching up with the contemporary Claude. I think the probability of that is quite high.
So in that case, how has everyone changed in the past two or three months?
It's really a test of the team's overall research level, of its technical agility, and of how it embraces the new paradigm, embraces new paradigms for doing research. That is crucial, and it's what we talked about at the beginning. That's what we're competing on, so I think the next two or three months will be really exciting. In parallel, the Agent framework layer is moving: over the past two or three months OpenClaw itself has received a number of improvements, and you can also see some self-learning, self-iterating, even self-generating frameworks appear. So the Agent framework layer will also improve very quickly over the next two months. And then, on top of those two explosions, with stronger Agent frameworks, model capabilities soaring further, and our costs extremely advantageous, demand for inference will certainly explode. I think several times to ten times the volume is something that may happen almost immediately, so demand for inference chips will reach an unprecedented high. Then the question is how to work with existing production capacity, where most of the sticking points may lie in storage, and, on top of that capacity, whether you build chips yourself or use the most advanced ones to do inference better. Much, much lower-cost inference is a crucial proposition. And last, the longer-term thing: we definitely won't stay at this 1T level for long. If you want the lead in the next stage, you have to look for a larger scaling. What needs scaling: the model's parameters, or something else? And on what kind of chip do you scale? These are decisions that have to be made right now, and they determine who is ahead half a year or more from now.
What decision are you making now?
That decision will be kept secret.
So everything MiMo-related that we see now was decided half a year ago?
Almost.
I saw you tweeted about why the MiMo team, with just a few people, moves so fast, and you gave several key conclusions. One: research cycles for core architecture and infrastructure are long, so you need to see the returns, to have strategic conviction, a year in advance. Two: agility in post-training is a distinct capability. Three: curiosity, simply love. Can you explain why these three points make it possible to train a super-large-scale model quickly?
Pre-training is extremely forward-looking, so the critical thing is predictive ability, a certain strategic quality. When you train this generation of models, what are you preparing for? You must think that through a year in advance, or half a year. The reason I say half a year or a year is that I used to think one year; now I think half a year, because Agents will actually speed this up. In short, you have to think far ahead: what is this generation's model structure going to be doing all that time later? If you don't think it through, you have no advantage; you may end up with a very mediocre model structure. A mediocre structure won't necessarily give a mediocre model effect, but it will definitely give you very mediocre cost and efficiency, a disadvantage. So that's pre-training.
Likewise there are many things Infra should do in advance. Post-training, by contrast, now iterates coupled with the Agent, so there is much you cannot plan far ahead, which makes it even more challenging: how do you make the current model's capabilities react chemically with this Agent paradigm? And how do you quickly design a new Infra architecture? This involves a new RL Infra architecture. For code and math in the chat form of reasoning, the core of the Infra architecture is the model's inference engine itself: the model thinks for a long time, then gives an answer, and that is the whole problem this RL Infra has to handle. The Infra architecture for Agents does not just focus on model inference itself; it also has to handle the coupling between the model and the Agent. So the rollout shifts from being centered on the inference engine to being agent-centric: a more complex system that may be a black box or a white box. There are many Infra problems to solve here.
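The contrast drawn here, a rollout that is one long generation versus a rollout where the agent environment sits inside the loop, can be sketched as follows. The `Env` and `policy` interfaces are invented for illustration, not any real RL-infra API:

```python
from dataclasses import dataclass

@dataclass
class Env:
    """Stand-in for an agent environment (shell, browser, code sandbox...)."""
    steps: int = 0

    def execute(self, action: str) -> str:
        self.steps += 1
        return f"observation after {action!r}"

    def done(self) -> bool:
        return self.steps >= 3

def chat_rollout(policy) -> list:
    # Old, inference-engine-centric loop: one long generation, then a reward.
    return [("prompt", policy("prompt"))]

def agent_rollout(policy, env: Env) -> list:
    # Agent-centric loop: generation and tool execution interleave, so the
    # infra must coordinate the inference engine *and* a (possibly black-box)
    # environment, with fault tolerance across both.
    traj = []
    obs = "task description"
    while not env.done():
        action = policy(obs)
        obs = env.execute(action)
        traj.append((action, obs))
    return traj

policy = lambda obs: f"act-on({obs[:10]})"
print(len(chat_rollout(policy)), len(agent_rollout(policy, Env())))  # 1 3
```

The scheduling problem changes shape: instead of batching one generation per sample, the system must keep GPUs busy while many trajectories block on environment steps of unpredictable length.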
The Infra problems to solve are completely different from those of the R1-style, chat-era reasoning. So the team needs to be agile enough to quickly build an RL Infra system for today's era. And because the Agent framework changes so fast, your system must have good enough compatibility, even if real adaptation or iteration is involved.
How difficult is this RL Infra system?
It needs good enough fault tolerance, and careful thinking about how resources are used, because it involves inference and training together and a lot of combined GPU and CPU management. So here I think the agility of the team, and the cooperation between Infra and research, is crucial. If it's done well, you will feel the research speed become very fast. And then the last point:
Curiosity, or love, or a persistence with this technology. Many excellent researchers carry many of these traits, but how do you screen for them at the source, then manage to inspire people, and finally unify everyone around the things most people believe will work? That is genuinely complex; I think it's as hard as designing a very good, complex Agent system. On this I feel I'm still exploring, still learning in this environment: learning how others do better in their own areas of expertise. That may be why I've recently been thinking about how the so-called wisdom of crowds will eventually produce AGI.
By the wisdom of the crowd, you mean more than one company, one team?
Yes, I think it is the collective wisdom of all humanity. That is what OpenClaw is doing; it may have triggered this. Whether that was the motive I don't know, but at least it now lets everyone improve one framework together, and it got everyone doing so within a very short time. I think there is such a sign.
You just mentioned the gap between China and the United States. Do you think the process of realizing AGI will differ between China and the United States?
To be honest, I don't know the United States well. But at least, by our current reckoning, with frontier research first, then the level of the models, then the AGI framework, then chips and energy, from all these perspectives I think it is very possible to lead, or to lead together.
Can we understand the Agent framework as having completed a piece of this puzzle?
Actually, I think what it completed is the model's accuracy on complex tasks. A complex task used to be very difficult for you to describe clearly, and difficult to feed all your context into for completion. With the Agent framework, through a very easy, interactive way of communicating in natural language, it can take in all the context as you work through the complex task again. And, as people say, the more you use it, the smarter it gets: the more you use it, the more your wisdom is absorbed into the framework itself.
Isn't that already absorbed into the model?
Eventually it must be: absorbed into something like the model's parameters.
I have a feeling, I don't know if it's right: does it look like a patch?
For top models I don't think it's a patch. For top models it's more like something that raises the ceiling; for middle models it's a very good amplifier. Or rather, not amplification: it makes them very stable, producing good results across all kinds of scenes. But for top models it's as if the upper limit has doubled.
You've been at Xiaomi for a while now, right?
And several models have been released in the past six months. What do the past six months mean to you? What progress have you made, and what do you feel is lacking?
Honestly, in this era I feel I may be denying my yesterday's self all the time, whether in ways of doing things or in judgments about the future. I'm basically denying myself constantly, and I think it's in this denial, this self-introspection and reflection, that I grow.
Is there one place where you made particularly big progress?
I don't feel my journey has a very clear narrative, with nodes I set for myself where, on reaching one, I feel I've made progress. We are making progress all the time. Sometimes it speeds up, sometimes it's gentle, but it's always improving. So if you ask me to find a landmark event, I really can't. But I feel I've been quietly evolving; the system in my mind quietly evolves.
Do you have any guiding ideas?
A mental method, yes. When I was doing quant in the past, I learned something I find very useful, very important for getting through challenges. One sentence: "There is always a way to model prices." That was what gave me strength and support at the time. In quant, you feel that the price is your reward: you have to predict the exact price before you can invest quantitatively. When you come back to the large-model track, you find the reward is not so clear, and it keeps changing. So this time, what I hold onto is: do what fits my values at the moment. And that thing must be of value to more people; it must be something more meaningful. I also think that if this group of us building large models had no such internal drive, and instead wanted to build something destructive, it would be very dangerous in the end. So my current idea is to ask: does what I do every day make this world a better place, or at least let some very boring part of work be replaced, so that people have more time to do more valuable things? We are always imagining: if 90% of our jobs were replaced, we should do something interesting, and everyone can think of a lot of interesting things.
Oh really? What do you want to do?
Things keep changing; at this moment I haven't settled on it yet. About a month ago I did think of one very valuable thing: a lot of basic research in China today is pressed too hard to deliver a complete product, demanded to show messy proofs, and lacks good funding. Is there a fund, or a charitable organization, that supports the kind of people who do basic research, who take a step forward in more breakthrough directions? There is no such good system, including good computing resources of that kind; it needs a good infrastructure system to support such work. So could we build a public-benefit organization to support this? That is what I was thinking about a month ago. And if one day we really realize AGI, then at that point we compete on who does research faster. AI is doing it, and people are doing it too. Can we humans, working with AI and guiding it to do better, create research faster? I always feel scientific research should be accelerated. Even once AGI is finally realized, there is still a lot to do.
Why must we compete with it? Just let it do the work. Why not lie down all day with nothing to do? Hahaha.
There is always something new to do. If you only enjoy life, it gets quite boring.
Or maybe we should do something to help it? Say, provide the current models with emotional value, right? Hahaha.
Provide emotional value to the model. Does it need that? God, I used to think about how to make the model provide us with emotional value. Yes, yes. In short, we just want to do something beneficial. But what counts as beneficial is judged by personal values.
Is boredom itself a kind of meaning?
I don't know either. But boredom doesn't seem to mean anything to me, right?
How have you relieved stress in the past six months? Is the pressure big?
My brain is a Sliding Window Attention: I forget very quickly. Even under pressure, if I move fast it's over in an hour or two; if I go slowly, it's over in a day; and after a night's sleep it's definitely gone by the next day. So my way of relieving stress works very fast. But it rests on a premise: the next day there will be new things, things that push the ceiling of your imagination, to flush it out, and you forget immediately. If you stayed in that same context, you would never forget.
After this release, was anything different from what you imagined? Any new feedback?
I think everything this time was within my expectations. And not just this time; it's within my expectations every time, because I see the model's capabilities first, so I can anticipate how others will perceive it after release. If anything, I'm a bit desensitized to releases: I can predict its most popular state, its most explosive state, all of it. So it was entirely expected, and I wasn't especially excited; I just thought, OK, the level and capability this model reached were perceived by everyone. That's how I feel about this release. A day or two before publishing I already knew it would probably go like this; if it hadn't been perceived, that would have proven we did something wrong, that our own internal judging criteria were off. In fact I felt at the time that the criteria we used before launch were fine, and the external evaluation of the model, including which frameworks it can be used in and what benchmark level it currently reaches, basically matched our internal assessment. So everyone evaluated it correctly. If anything, a few days before release I had already started thinking: OK, what should we do next?
What to do next: I had already entered the next stage, so at this stage I don't even care that much.
Then why, on March 11th, did you first launch two mysterious models?
Because of Post-Train. During training we pulled out a few intermediate checkpoints, took a look, and found that at a certain stage they were very good to use. We felt everyone should get to experience them, and in an anonymous stage the evaluation is fairer. That really is a good method, so we went straight to OpenRouter, anonymously, to see whether everyone's evaluation of it differed from ours. And of course there were problems we hadn't realized at the time, like that model's long-form writing: training hadn't run long by then, so it really wasn't good. So after posting anonymously, in the week before the official release, we focused on optimizing its long-form writing experience. That was the most valuable improvement the external evaluation during the anonymous period gave us. Beyond that, the anonymous period verified that our internal assessment was fine, so we just needed to follow our own evaluation system and do the subsequent scaling.
So what is your benchmark? What is your team's benchmark, and how does it drive you?
I think making the large model good is itself the benchmark. But this "good" is defined by ourselves.
How do you get the company to agree to that? How do you handle the relationship with the company?
As long as Mr. Lei agrees, that's fine. Hahaha. I think he is a very good, very strategic boss, an angel investor; anyway, there are many tags behind him. There is no imposed requirement. From the very start, when I joined Xiaomi, we were highly aligned on this, so later there was no need to explain much. We just act on our own judgment, do it by intuition, and once it's done the boss says "well done."
We've talked a lot, all about your V2 family of models.
Next I'd like to think through with you the progress of large models over these three years, from the end of 2022, when ChatGPT kicked off this large-model war. What stages would you divide it into? What were the key changes each year? How did we get here, in your eyes?
I think ChatGPT was the first to show the model in, I'd guess, a 4K pre-training context, and the level of intelligence inside it. The actual pre-training length, or ultimately this context length, really is critical. ChatGPT made everyone feel: OK, pre-train in a 4K context, then post-train, and by simply talking to it over one, two, or even more rounds, and at that time the context was highly tied to the number of conversation rounds, you could correct many problems from earlier rounds in later rounds, and the model could likewise clarify its earlier mistakes in later rounds. That was the impact ChatGPT had on people at the time: in conversation you felt it reached a human-like level of intelligence. And all of that may have happened within a very short context. Take a model, run very large-scale pre-training within that very short context, minimize the loss, and that level of intelligence gets elicited. Of course, a prerequisite for any of this being felt is having a set of interactions that really lets people feel the level of intelligence, and Chat is a good interaction. Otherwise you wouldn't know how powerful the model already was. So Chat is a good interaction. That was what happened with ChatGPT at the end of 2022. Then came 2023.
Once a top closed-source model exists, the next year is about how open source catches up with it. So in 2023 you see Llama, Qwen, DeepSeek, these open-source teams. Llama first disclosed the paradigm of large-scale pre-training, really how to build good data. The architecture was unknown territory at the time. Even to train a 7B, what should the Transformer structure be? Pre-Layer Norm or Post-Layer Norm? What about such details? What hidden size? Those hyperparameters were completely opaque, but Llama tells you: OK, you can train successfully like this.
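The Pre-Layer Norm vs Post-Layer Norm question is just where normalization sits relative to the residual connection. A minimal numpy sketch, where the sublayer is a placeholder standing in for attention or an FFN:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # Placeholder for attention or a feed-forward network.
    return 0.5 * x

def post_ln_block(x):
    # Original Transformer: normalize *after* the residual add.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x):
    # Llama-style: normalize the sublayer input; the residual stream itself
    # is never normalized, which tends to train more stably at depth.
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(0).normal(size=(4, 8))
print(post_ln_block(x).shape, pre_ln_block(x).shape)
```

The difference looks trivial at 7B-training-recipe scale, but before Llama published a working recipe, even choices like this had to be rediscovered by every team.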
It gives you a start. Building on that head start, Qwen took a Llama-style architecture, built better pre-training data, and did larger-scale compute scaling; that's how the Qwen series rose. And Qwen did a great job for the developer ecosystem: it trained models across the full range of sizes, and also trained some very top-notch multimodal models.
then This is very good for the community very It will help inspire the community behind to make some fine-tuning There are also some fine-tuned frameworks like Some necessary prerequisites for birth Right then DeepSeek at the same time Although I am trying to reproduce LLAMA again But But what I might care more about is Let’s see what problems the LLAMA generation architecture has
Instead of rushing to scale, it cared more about the architecture itself. For example, Llama still uses GQA (Grouped Query Attention), and GQA has shortcomings at larger model sizes, especially when you train on limited GPUs. What problems come up in training? What problems do we hit when scaling? Maybe some new structures are needed to solve them. That is the DeepSeek V2 and V3 stage: proposing new architectures, whether MoE for more efficient training or MLA for lower inference cost. During that period DeepSeek paid more attention to doing better research: doing the scaling on worse chips.
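One concrete pressure behind GQA, and that MLA pushes further, is KV-cache size at long context. A back-of-envelope sketch (the 80-layer, 64-head configuration is a made-up illustration, not any specific model):

```python
def kv_cache_bytes(layers, seq_len, n_kv_heads, head_dim, bytes_per=2):
    # K and V each store layers * seq_len * n_kv_heads * head_dim values,
    # hence the factor of 2; bytes_per=2 assumes fp16/bf16 storage.
    return 2 * layers * seq_len * n_kv_heads * head_dim * bytes_per

# Illustrative 70B-class config: 64 query heads of dim 128, 80 layers.
# MHA keeps one K/V head per query head; GQA shares each K/V head
# across a group of query heads (here 8 query heads per K/V head).
mha = kv_cache_bytes(layers=80, seq_len=128_000, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(layers=80, seq_len=128_000, n_kv_heads=8, head_dim=128)
print(mha / 2**30, gqa / 2**30)  # per-sequence cache in GiB: 8x smaller under GQA
```

At long contexts the KV cache, not the weights, comes to dominate GPU memory, which is why attention variants that shrink it (GQA's head sharing, MLA's latent compression) matter most on constrained hardware.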
What is the shortcoming of Llama? It is dense. If you really want to scale it; well, nobody in the open-source community trains dense models at hundreds of billions of parameters anymore. Llama did, and you can see the conclusion, but the problem is not necessarily the structure; it is that it is too expensive. Training it is very expensive, and serving it is also very expensive. Nobody is going to deploy a clumsy, expensive model. So architectures like MoE, to train more efficiently and infer more efficiently, and MLA, for lower inference cost, were born. At this same stage, Qwen and DeepSeek were taking two different paths: Qwen was pure scaling, while DeepSeek was considering scaling built on innovation.
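The MoE efficiency argument is that parameter count and per-token compute decouple: each token runs through only k of n experts. A toy top-k router in numpy (shapes and the routing rule are illustrative, not DeepSeek's actual implementation):

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    # x: (tokens, d). The router scores every expert, keeps the top-k
    # per token, and mixes the chosen experts' outputs by softmax weight.
    scores = x @ gate_W                        # (tokens, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -k:] # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        w = np.exp(scores[t, sel])
        w /= w.sum()                           # softmax over the selected experts
        for weight, e in zip(w, sel):
            out[t] += weight * (x[t] @ experts[e])
    return out
```

With n experts but top-2 routing, total parameters scale with n while per-token FLOPs stay roughly constant; that is the "more efficient training" trade the MoE line of work makes.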
Which one is correct? I don't think there is a right or wrong, because one aims to reach the strongest model under limited computing resources; after all, DeepSeek's compute may be only a small fraction of Qwen's. What Qwen wants is to promote better development of the entire ecosystem. So both are correct; neither is wrong. This formed two open-source forces in China: one force does research to reach absolute heights, while the other really pursues ecological value, and ecological value done to a high level is itself a value. Without so many good open-source models there would not be so much good research work: a lot of the research leading up to DeepSeek R1 was actually done on Qwen's models. So they promote each other, and both are valuable to the community. Of course, DeepSeek has additional value on another front: its completely new structure had an impact on training cost and on subsequent inference cost, which also influenced a lot of inference-chip thinking.
The inference-chip side now has more and more accurate judgments about how the next generation of chips should be designed. I think that is a very good thing for the whole AGI journey. That is roughly what happened in 2023 and 2024. The one thing in 2024 that went beyond everyone's expectations was o1 and R1. Inside DeepSeek, R1 was a surprise attack, so to speak; you could say its birth was quite accidental. What it really raises is this: when the paradigm shifts from pre-training to post-training, and the requirements for innovation change, how should the entire team be reorganized?
Yes, and this is the biggest lesson I took from the whole thing: the team is the most important factor. The traditional management approach would be: I will now invest more in post-training, I allocate the compute, and then I recruit someone from outside, or I form a new team for it. Is that the wrong way? It depends on the team itself, but having the newcomers just do post-training is, I think, not very conducive to innovation. You can think of many reasons it would not go well. For example, the diversity of post-training data I mentioned: if you have only ever done post-training, you naturally lack that perspective. And in many teams, the profiles of the pre-training people and the post-training people are very stereotyped.
Anyway, we do not recruit and organize people in that rigid way. The stereotypical assumption is that people who do pre-training naturally cannot do post-training. I do not really understand the underlying reasons; I just always found it strange when I encountered it. And I do not care why it is so strange. I simply do not think it is right, so I do not do it that way, and that is fine.
So what the outside world saw was R1, but what was perceived internally, before the model even started training, were the adjustments to the team and the organization: whether everyone recognized this shift, and in what way we were going to do it. Being part of the R1 process was a great experience. By the time I left, R1 had reached the level of a Lite (lightweight) version, with code and math very close to the level of the smaller version of o1. I had already predicted that reasoning on code and math would definitely work, and that AIME (the American Invitational Mathematics Examination) scores, which were only thirty or forty points at the time, could very probably be pushed to 70 or 80. The latest models have now essentially reached 100.
But the thing I did not realize was the paradigm shift itself: that reasoning trained through code and math, these highly verifiable scenarios, could actually generalize to general use. Even o1 had not really gotten through that. This is something I did not expect. Because of that background, when I read new things now, even something aimed at a very vertical scene, especially scenarios like code, I first ask whether it can really generalize, and whether I have underestimated it.
That is a set of judgment skills I have accumulated myself. Then came 2024 into 2025, and 2025 was a very staggered year. You could choose to take the Chat paradigm of reasoning to the extreme: SWE-Bench, those LeetCode-style benchmarks, AIME, these skewed benchmarks where the model thinks for a long time and then gives an answer, and keep digging deeper into that paradigm. Or you could choose to ignore it: I do not care about this paradigm, I will think about the next one; if on this paradigm I can reach sixty or seventy, that is enough. Reaching 60 or 70 points on AIME really just means you have gotten the pipeline working.
The smarter teams, I think, fully embraced the new Agent architecture by the middle of 2025 and went to work on it. It was a choice. MiniMax switched; I think MiniMax was the earliest to switch in China, earlier than Kimi. But a new paradigm like this affects the whole team: the agility requirements are very high, and you need to iterate quickly on top of a base. So from model release speed you can also see which companies embraced it fast enough. And some companies have not kept up.
They kept cultivating under the original Chat paradigm. Even if you push up scores on BrowseComp, SWE-Bench, Terminal-Bench, these so-called Agent benchmarks, deep improvement on those benchmarks does not mean the model is actually usable. BrowseComp in particular is a rather outrageous evaluation metric: a model trained against it feels like it can only be measured on that kind of dataset. Try another way, even just another way of doing information retrieval, and the ability still does not generalize. That is the strange part: the whole dataset is too limited, and the whole framework is too specific. So over this past half year, most of the people working on Agents have gone astray.
I think it is a wrong approach, and we walked that path for a while too. During the first generation of Flash I did not want to build an Agent; I just wanted to make a good Chat model. One reason is that you really do need to lay the Chat foundation first. As I said, your Chat has to be at least 70 or 80 percent there. Only by going through the whole process can you say your post-training data infrastructure and your Infra are solid. And our people, especially the ones we recruited, were all people who had never built large models before. They need a growth experience; otherwise how could they build new things the moment they arrive? When we did Flash, I essentially took things we had already done and let a group of inexperienced people do them again. What I cared about more was that, in doing this batch of new things, they themselves were evolving, and once evolved, the new things they build for us afterwards are very valuable. I rarely give strong supervision in the middle. Unless I see things about to go off course, I do not give overly detailed supervision signals and tell people exactly how to do everything; the flaw in that is that most of the team loses originality, which is something I think should be avoided as much as possible.

What do you mean when you say no one on your team has a background in language modeling?
Most of our hires had never built a large model before. Some had just graduated; they had done some basic research in school, for example, but not large models.

Then what were they doing? Engineering? Development? Don't they all need a bit of training background?

Not necessarily. About a third to a quarter of the people have some training experience, but maybe they only trained models at something like the 7B or 14B scale, which I think is completely different from training a truly large model. Those experiences are unlikely to be reusable.

Doesn't that demand strong hands-on training experience? Are your experience requirements high?

If you prescribe exactly what to do in steps one, two, three, four, then the experience requirements are higher.
But I found out later that it is better not to tell everyone what each step is: push everyone together, redo it together, and everyone moves forward. We can talk about organization later; let's finish this thread first. You just reviewed the whole history of technology development from the end of 2022 to now and the key things each company did. So the large-model competition has transitioned from Chat to Agent, right?

Yes, this is the second act of the model competition, the second battle, where everyone starts from the same starting line. For open-source models, that is; for closed-source models I think it started earlier.
Claude, for example, was probably on this path two years ago; we just did not realize it was the most correct path. Most people did not realize it. I think a lot of people realized it last year, but they are still not doing the right thing. In my opinion, the right thing is this: you need to use a genuinely complex Agent framework, or various Agent frameworks, to complete tasks of higher complexity end-to-end, and aim at that as your post-training paradigm. Not a very limited scene with a simple structure customized for that scene, running tasks only a little more complex than Chat, where the model's input and output are still just a bunch of strings or a bunch of tokens, and where your RL paradigm is still what I described earlier, rollouts centered on reasoning. A real Agent is no longer like that.
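The contrast between a reasoning-style rollout and an agent-style rollout can be shown schematically. This is a generic illustration; `model.act` and `env.step` are hypothetical stand-in interfaces, not any real framework's API:

```python
def chat_rollout(model, prompt):
    # Chat/reasoning paradigm: one long completion, scored at the end.
    return model.act(prompt)

def agent_rollout(model, env, max_steps=50):
    # Agent paradigm: the model alternates with an environment (shell,
    # browser, code runner) over many tool calls; RL optimizes the
    # whole multi-step trajectory, not a single completion.
    obs, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = model.act(obs)         # a tool call or a final answer
        obs, done = env.step(action)    # environment feedback
        trajectory.append((action, obs))
        if done:
            break
    return trajectory
```

The post-training unit shifts from a single string completion to an entire interleaved trajectory of actions and environment observations.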
What do you think of MiniMax?

Their switch was relatively fast; I think the fastest. With a model of only about 10B activated parameters, they achieved Agent abilities I find quite amazing, and their post-training agility is remarkable.

But you said the admission ticket to the Agent's second act is a 1T base model, and MiniMax does not have a model that big.

Yes, so I do not think they can really claim to already benchmark against Claude Opus 4.6. I define entry as: you need to benchmark yourself to the level of Claude Opus 4.6, which requires a 1T base, and at the same time it requires agility. MiniMax has the latter.

So right now no Chinese company has both?
Yeah. Well, look at DeepSeek, haha.

We have just walked through the model changes year by year from the end of 2022 until now. Looking at today, can you comment on the manufacturers in China and the United States? What positions have they reached, and is there any difference in their bets now?
There may be a consensus now: everyone thinks Anthropic's path is the correct one. And the Agent path itself has become clearer, at least within the past three months. When the path is clearer, I think the domestic large-model teams will enter a state of accelerated catching up.
Because on pre-training there is basically no gap now, or it is very close; structurally, the domestic large-model teams even have advantages. There was even a time when I thought that Claude had been doing so much context engineering for so long because its model structure was not very advanced and compromise designs had been made for cost. Looking back now, maybe my thinking was too limited. Whatever the original motivation was, the current state, the management of context plus the whole matching Skills setup and Agent architecture, is actually designed to cooperate with the model so that the whole system completes tasks more powerfully. So once everyone sees this paradigm shift, there is no generation gap at the base level.
Everyone is actually all-in on doing Agent post-training well, or more specifically, on how to do Agent RL scaling well. That is a very clear and accurate direction. The specific research paths still need to be explored, but at least what needs doing is as clear as closing the pre-training gap was in 2023.

When did you realize that coding can be generalized?
Coding being this generalizable: I believed it from the beginning, in both the pre-training paradigm and the post-training paradigm. Back in 2023, when I returned from quant finance to the large-model track, I already had very high expectations that coding would generalize. That expectation translated into: first prepare the pre-training data for code, then scale compute and see whether it performs well on code benchmarks, then see whether, after the code benchmarks improve, other general reasoning benchmarks like BBH get better too. That was the beginning; it went step by step, an exploration path verified by experiments. Then with R1 it was verified again, because code and math both have very good verification signals.
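The "very good verification signals" point is concrete for code: a candidate program can be scored simply by executing it against tests. A minimal toy verifiable-reward function (my own sketch; production RL pipelines sandbox execution far more carefully):

```python
import os
import subprocess
import sys
import tempfile

def code_reward(solution_src: str, test_src: str, timeout: float = 10) -> float:
    # Run the candidate solution plus its unit tests in a subprocess;
    # 1.0 if every assertion passes, else 0.0. This binary signal is
    # what makes code (and math) such clean RL reward sources.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_src + "\n" + test_src)
        path = f.name
    try:
        r = subprocess.run([sys.executable, path],
                           capture_output=True, timeout=timeout)
        return 1.0 if r.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # infinite loops score zero too
    finally:
        os.unlink(path)
```

Unlike human preference labels, this reward is objective and infinitely repeatable, which is why code and math led each paradigm shift.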
Then came the Agent paradigm, and code has a good environment there too. Code naturally supports long-horizon tasks: software development is a very long-range mission; building a complex project is inherently a long-term task. So it fits the Agent paradigm very well. Code has basically hit the mark in every paradigm. At least in research, code lets you close the loop by itself, and things built on that self-closed loop scale easily to general data in broader fields. Code itself is quite versatile, and because it connects to natural language it is easy to scale. So across these three paradigm shifts, code has been a very elegant path.

What stage has RL scaling reached now?
Do you have any preliminary results from your exploration?

It is not convenient to share yet. When the compute we put into RL scaling reaches the same water level as our pre-training compute, I will share it with everyone.

Do you think the competition has become more intense today?
It has become calmer relative to 2023.

Have the dimensions of competition increased?

Both the dimensions and the speed of competition have increased. It has become very fast: doing pre-training you cannot produce one model a month, but doing post-training you really can. And the Agent question, beyond the architecture of the Agent itself, also depends on how you read the whole inference side.
The model structure and even the hardware chips affect some basic decisions. For example, when do you build a 10M-token context? How do you scale the context to 1M tokens? That involves scaling at every pre-training stage, and the post-training that corresponds behind it: doing post-training at 1M versus at 256K differs in compute by several times. Do you have enough compute to support that? And do your final scenario and the capability of the framework itself allow a 10M-token context, or a 1M-token capability, to actually be put to use? The decision chain has become much longer: it used to be that pre-training only had to make decisions about its own architecture.
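The "several times" compute gap between 256K and 1M post-training follows from attention's quadratic term. A rough per-sequence FLOP model (the 80-layer, 8192-dim config and the dropped constants are illustrative assumptions, not any specific model):

```python
def flops_per_seq(seq_len, d_model=8192, layers=80):
    # Matmul terms (QKV/O projections + MLP) grow linearly in tokens;
    # attention score and weighted-sum terms grow quadratically.
    linear = layers * seq_len * 12 * d_model**2
    attn = layers * seq_len**2 * 2 * d_model
    return linear + attn

ratio = flops_per_seq(1_000_000) / flops_per_seq(256_000)
# Tokens grow ~3.9x, but the quadratic attention term makes the
# per-sequence cost gap considerably larger than that.
print(round(ratio, 1))
```

This is exactly why long-context post-training forces decisions about efficient attention architectures and inference hardware well before the training run starts.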
Now it extends to the next period's evolution of the Agent framework, and to supply-and-demand estimates for the whole market, including inference chips. Chips are constrained by manufacturing capacity; the bottleneck sits right there. And whether you should plan as early as possible for a model structure that works broadly across chips, rather than for a single chip: these are all things that need to be planned in advance.
That is the large-model company dimension. What about startup companies? You did not start a business yourself, but standing in 2026, do you think startups have more opportunity or more despair?

To be honest, I do not understand startups beyond foundation models very well. But what I can see, at least, is that the required team size for a startup keeps shrinking. Maybe you do not need a very big company, just a few people; even one person can become a company, if you learn to make full use of Agents. I have seen people say they single-handedly run many "employees" on OpenClaw. I have also tried this kind of Multi-Agent setup myself.
It is currently unrealistic, or a bit gimmicky, I would say, but I think it will become reality soon, within this year.

Where are the missing breakthroughs in Multi-Agent right now?

Every aspect falls a little short. For example, you need a cheap enough model, because the final calculation is whether it is cheaper than hiring a real employee. If it is not cheap and easy to use, why should anyone use it? So you still need a cheaper model; you cannot spend over 1,000 a day burning Claude Opus tokens when the value that "employee" creates may be just 1,000. Second, I think the current Multi-Agent architectures still have room: how each agent evolves itself, self-iterates, and communicates with the others. There is still room in all of that.
The prototype is there, and I use Multi-Agent quite smoothly, but I always feel it is still about saving cost and saving time; it has not enlarged the final upper limit. I have not felt that yet.

So would collaborative RL training of multiple Agents do that?
Maybe.

So where do you think the boundaries of model companies are? Nowadays model companies seem to have no boundaries. Companies that used to say they did not want to make products have found the model has become a product directly.

Right, the model is the product. With the help of Agents its product power is actually stronger, and everything else becomes simple. The model itself, riding on the Agent architecture, has become a new kind of product.

In your opinion, why should a company choose open source,
and why choose closed source? Most domestic companies have open-sourced, except ByteDance. What is the purpose of choosing open source now? Is it a technical choice, a market choice, or still a matter of accelerating AGI?
I still think the purpose is that open source accelerates AGI; open source must be something that accelerates AGI. Assume AGI explodes and replaces most productivity, and work backwards from there. How many chips do you need? Will those chips all be produced by one company, or purchased by one company? It seems not; they will be dispersed. And if the chips are dispersed, the inference behind them may be run by chip makers or by large-model makers. Will they all use the same model? I think it must be different models. So if you work backwards from the endgame, open source is at least conducive to advancing this, because in the end, generating economic value at scale must rely on compute. Open source has a promoting effect on many of these links: Agent frameworks, chips, energy.
So I think it speeds up the AGI process.

In the end, is it a public good or a market behavior?

I think it depends on each company's approach to open source, which is based on its own ecological niche. Do you have something nobody else can do, a strategic niche secured in the short term? If so, you dare to open source. If not, if you think the model itself is your niche, then you will not open source. It is that kind of act.

Working on open source inside a big company, is there pressure?
I do not feel like I am working in a big company now. I think Xiaomi as a whole is very entrepreneurial. It is strange: it looks like a big company but is actually a very flexible one.

So for 2026, what will decide the winner in the competition between model companies? What do you have to get right to stay at the table?
The first thing you cannot get wrong is your pre-training base. If that does not happen, there is basically no chance at all. Say everyone has a model larger than 1T: the base potential is all there, and especially in code the potential across bases is about the same. Then what everyone competes on is how to get there quickly. First, how to make the Agent framework and the model improve each other through mutual self-iteration. Second, how to make the Agent architecture flexible enough to couple with the resources, the ecological niche, you already have: how to let the Agent architecture understand and schedule what you call your resources and niche. The operating system is one, the hardware is one, the traffic is one; traffic and social all count. How to adapt the Agent architecture to your current strategic resources, and finally how to integrate the combined force well.
What I personally find very challenging is whether a company is willing to use a new method to do this. What is the new method? You have to consider that everything you did before may have been wrong. Do you really need so many people to do this? Should all these people be cut, because their productivity will be replaced by something more efficient, or how do you let them leverage Agents to create greater productive value? All of this needs to be thought through. The second thing is that in the new ecological niche, do the things that used to look like barriers still count as barriers?

What do you think of frontier labs?
Where should "frontier" show up?

At the most basic level, you should be frontier in doing research. You still need that urge to make a lot of original things, things that may not be mainstream in the short term. But something not recognized by the mainstream at all would be strange too. I myself find going fully counter-mainstream a bit difficult. The thing that makes it unsuitable, I think, is that it is hard to scale, and I still believe in scaling. As long as you follow the mainstream, scaling is easy. Why? All your Infra, all the hardware chips, everything revolves around that goal and pushes forward together, so your research gains a very large acceleration. That is the main reason. So I think we follow the mainstream trend while doing some forward-looking things. Efficient long-context architecture, for example, is actually done against that background. I would not call it groundbreaking research,
but we believe that the product of these small points, studied together, adds up to the status of a very high-level frontier model.

Of your work over the past few years, which original research are you most satisfied with?

I think the more original research is all industrial-grade. DeepSeek V2, for example, is an industrial-grade model: when the mainstream was using ever-larger dense models, we went against the mainstream to do MoE and to change the attention mechanism. Both of those were genuinely research, conducted somewhat in resource-constrained scenarios, but essentially scalable research. So I think that was good work. Then I would count the MiMo V2 series, because in this Agent paradigm
things were not very clear yet, so I made a lot of early decisions and judgment calls. That let us move very efficiently: quickly training on an elegant, simple structure, a structure we ultimately found fits the Agent paradigm very well; then shifting fast to the Agent paradigm, doing a lot of post-training design, especially around the whole Agent architecture; and redesigning our RL Infra for it. It is the combination of many points, and what people finally feel is not a paper itself but an industrial-grade model.

Are you keen on publishing papers?
No; the fewer the better.

Why?

Some people on our team do publish papers; I tell them not to include me. The core reason is that I no longer read papers from academic conferences. One of the main reasons is that you really should do most of the experiments yourself; trusting your own experimental results beats trusting the experimental results in a paper. I will still occasionally look at a paper's original concerns and motivations. But in short, between people who have done research with large-scale compute and people who have not, I find that the problems they focus on overlap surprisingly little.
So I read these papers less and less now.

Then what are your sources of information?

The source of truth comes from iteration. I rarely even talk with people recently, very rarely. So I do not even know whether the many hours of things I have said today will turn out to be wrong after a while. I expect I will find some of it wrong later; I just do not know how many people right now think it is wrong, or find it helpful and valuable. I have not communicated enough to know. Such communication as there is, is with myself, with the team, with others who are running the same experiments.

You have already touched on some organizational topics, including the ones we discussed a lot last time. Have there been any iterations in the past two months?
Of these 100 people, maybe 20 had trained models before, and only on smaller models. The main point is that I think these abilities can all be acquired rapidly, a month or two at most, or at worst three or four months, as long as you are placed in an environment driven by a higher standard of purpose. So environment is more important than experience, I think. I do not care too much about someone's experience; I care more about whether I have created a good enough environment.
This environment should let everyone learn and improve faster and faster, where everyone does to each other what we jokingly call MOPD (Multi-Teacher On-Policy Distillation), mutual distillation: I distill your strengths, you distill mine, and we improve each other rapidly. What I care about is whether the environment I create meets that prerequisite, rather than whether the "genes" of a person's historical background are good when they arrive. I only care whether the upper limit of their initialization checkpoint is high. I do not much care how high the state of their current checkpoint, after supervised learning, happens to be.
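MOPD here is the speaker's metaphor for people, but it borrows from a real training idea: the student samples, and teachers only supply target distributions on the student's own tokens. A toy numpy reading of it (the reverse-KL form and the min-over-teachers aggregation are my assumptions, not a published recipe):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def on_policy_distill_loss(student_logits, teacher_logits):
    # Reverse KL(student || teacher), averaged over positions the
    # student itself generated -- "on-policy" because the trajectory
    # comes from the student, not the teacher.
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return float((p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1).mean())

def mopd_loss(student_logits, teacher_logit_list):
    # "Multi-teacher": follow whichever teacher is closest on this batch,
    # i.e. learn each skill from whoever is strongest at it.
    return min(on_policy_distill_loss(student_logits, t)
               for t in teacher_logit_list)
```

The people version is the same shape: each person samples their own work, and whichever colleague is strongest at that piece provides the correction signal.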
Then whom do you choose? Does the degree need to be related to artificial intelligence?

I saw somewhere that your PhD ratio is 55%.

Yes, and that includes PhD students, people still studying for the doctorate rather than PhD graduates. I think those numbers are a bit stereotyped. What the degree really represents is a person's degree of love for doing research: someone who loves research will probably choose to pursue at least a master's or doctorate. But now we also recruit a lot of undergraduates, and I think undergraduates' understanding of this new Agent paradigm, their imagination, is higher. So my hiring is instead gradually tilting toward recruiting more undergraduates; we even recruit sophomores and juniors. Why? Because of their flexibility and adaptability. Their minds have not been polluted: they naturally accept that this way of working can generate huge value, they do not feel imprisoned yet, so they dare to confidently and boldly hand their own ideas to this architecture for verification.
Then keep exploring this boundary yourself So how do you create the environment?
First, the people who build the environment have to share those traits themselves: the love of the work I mentioned, the sense of mission. You need those basic qualities.
Second, because those qualities are relatively rare, to truly amplify them there is one premise: the person's foundation must be solid. When they have that passion and want to do something, they must actually be able to do it. They can't just have lots of ideas without the ability to execute, or in the end nothing gets done.
Execution is a key element of success, a basic ability. So we select for solid fundamentals, then curiosity, then being driven by love of the work. And increasingly there will be higher requirements for diversity, because if recruitment is too homogeneous, it's easy for everyone to miss information that looks like noise but is actually very valuable for research. So diversity matters a lot here. That's why every group at work chats constantly: everyone goes wild with their own ideas or shares the information they're paying attention to.
Whether in the group chat or around people's desks, it's noisy all day long, and I think that kind of communication environment is very good. Those are the internal factors; there are also external ones, such as how you design incentives. Don't fixate on certain very definite, clear-cut goals. And as for motivation, money is a very important baseline, but it's not the only one. Yes, you have to pay people enough, but beyond money other things matter a lot too: a sense of worth, a sense of meaning. I think many people actually care more about those.
Is the post-training team you just described built a bit differently from the pre-training one?
Usually we see two types of people who adapt very well to it.
The first type is people who are genuinely enthusiastic about this work: people who go and play with the models. Because they play with them, they know where each model's capability boundary is, and they want to find a scalable way to fill in that boundary. That might mean constructing a batch of stronger data and a stronger environment so the model can be RL-trained, or it might mean falling back to a certain stage of pre-training: maybe one batch of data wasn't done well, so fine, I'll add this type of data, and maybe next time I train the model it will do better. In short, people who care about the model experience and interact with models frequently adapt very well to this way of working, because they find the iteration valuable. Especially the people who maintain their own private test suites, who obsessively probe the boundaries of different models and suddenly discover that some model is stronger.
Then they share that unique experience with others. I think those people are well suited to entering this paradigm.
The other thing, and I think this is inevitable, is that we have to design a good RL infra system around this new Agent paradigm. Doing RL infra is very different from doing pre-train infra: RL infra has to care much more about tolerating ambiguity. Pre-train infra cannot tolerate mistakes: if a loss spike appears, you don't allow it; you have to chase down and resolve that spike. Doing RL infra, you have to allow it to tolerate errors: you allow the model, interacting inside this Agent framework, to have a rollout break halfway through.
And there are many reasons it can break; often you can't even tell which one it was. Maybe the Agent framework wrote some timeout logic; maybe the task needs a long verification process. It's a mess; you don't know why this particular rollout broke.
Another issue is training versus inference: you train on a heterogeneous cluster, so there is inconsistency between your training and your inference. In the original Code-and-Math RL paradigm that couldn't be tolerated; now you have to tolerate it. Then there's scheduling of heterogeneous resources: besides GPUs you also have to take care of CPUs, and of storage. How do you train this model on such complex, heterogeneous resources? I think it takes a lot of compromise between algorithms and engineering; there's a lot of fuzzy territory in the middle. So it demands flexibility from infra people, and an understanding that spans both fields. It's very demanding.
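One common way to live with the train/inference inconsistency mentioned here (the inference engine and the trainer computing slightly different log-probs for the same sampled tokens) is a truncated importance-sampling correction on the update. A minimal numpy sketch; the function names, the clip value, and the plain REINFORCE-style surrogate are illustrative assumptions, not any particular team's recipe:

```python
import numpy as np

def is_weights(logp_train, logp_infer, clip=2.0):
    """Truncated importance weights correcting the mismatch between the
    log-probs the trainer recomputes and the ones the inference engine
    produced for the same sampled tokens."""
    r = np.exp(np.asarray(logp_train) - np.asarray(logp_infer))
    return np.minimum(r, clip)  # truncate: don't trust huge ratios

def surrogate_loss(logp_train, logp_infer, advantages):
    """REINFORCE-style surrogate reweighted by the truncated ratio.
    (In a real autograd framework the gradient flows through logp_train;
    plain numpy here only illustrates the weighting.)"""
    w = is_weights(logp_train, logp_infer)
    return -(w * np.asarray(advantages) * np.asarray(logp_train)).mean()

# When trainer and inference engine agree exactly, the weights are 1.
print(is_weights([-1.2, -0.3], [-1.2, -0.3]))  # [1. 1.]
```

The truncation is the "tolerate the inconsistency" part: a token whose log-probs disagree badly between the two engines gets a bounded weight instead of blowing up the update or aborting the run.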
In other words, compared with pre-train, I think this paradigm puts new demands on flexibility and agility, and some people simply don't adapt. Even in infra, where relatively speaking people pursue things that have a clear answer and a clear solution, there's still the question of which group of people suits RL infra. So I think that's roughly the change. On our side, at least, the people doing pre-train infra and the people doing RL infra are kept separate; they don't quite blend, because the requirements really do differ a lot in complexity and in precision.
What is the bottleneck of doing RL?
You just mentioned that pre-train is actually almost done.
In fact, the teams that have really scaled Agent RL are very, very few.
Including overseas, right?
Anthropic must have done it. I don't know much about the other teams, but at least judging from their final model quality, no, it hasn't been scaled to the same level as pre-train.
Those are the two paradigms developed so far. Do you think there will be a new paradigm in the future?
Not sure. Let's get through this paradigm first.
Hahaha. I think what we just talked about, combining a very strong generative model with perception models into a new framework and running RL training on it, is enough; it's long enough for my plans, and it's already hard to achieve. Some people now talk about continual learning and online learning. By those I mostly mean that while the model interacts with the environment, or over multiple rounds of interaction with the Agent framework, the framework itself is iterating and evolving. That's how I define it.
What do you expect for the future? That future might be 2026, maybe 2027, maybe somewhat longer term.
Right now I feel great just doing research every day.
What is your current working status like?
Your work rhythm?
I start at 11 a.m.
11 a.m.?
And go until 1, 2, 3, 4 at night. But that's my own status; it doesn't represent the rest of our team.
Are you a night owl?
Not really, it's just how I sleep. I really don't need much: six hours is enough, five is fine, four is fine; four to six hours is an OK range for me. So I don't need that much sleep, and I'm a little excited about what I'm doing now, so I do feel that sleeping too much is a waste of time. It feels a bit like that.