Xie Chen : Data Survey — History, Landscape, Pyramid Structure, and Recipes for AI and Robotics Data
By Zhang Xiaojun Podcast
Summary
Topics Covered
- Simulation Is a Prerequisite for Robotics, Not Just an Accelerator
- Failure Data Is More Valuable Than Perfect Success Data
- Today's Models Are Essentially Huge Compressors
Full Transcript
<b>Hello everyone, I’m Xiao Jun</b> <b>Data, computing power, and algorithms are the three driving forces behind artificial intelligence</b> <b>Today’s episode is very special</b> <b>We want to focus on one of these driving forces, which is data</b> <b>and provide an industry overview</b> <b>Large language models are hitting a wall when it comes to data</b> <b>while robotics data is still in a barren desert</b> <b>So how exactly does the data industry operate?</b>
<b>Our guest for today’s episode is</b> <b>a returning guest on our “Business Interview Series”</b> <b>Steve Xie Chen, founder and CEO of Guanglun Intelligence</b> <b>From your perspective, who has become more aggressive?</b>
<b>I think ByteDance has definitely become more aggressive</b> <b>I think</b> <b>Alibaba</b> <b>I think OpenAI</b> <b>I also think DeepMind has definitely become more aggressive</b> <b>I think NVIDIA</b> <b>has also become more aggressive</b> <b>These are the five teams competing to build the brain of robots</b> <b>In a way,</b> <b>I believe PI (Physical Intelligence) should also fall into this category.</b>
<b>Actually, the most valuable data is the data from failing first and then succeeding.</b>
<b>I think when it comes to the endgame,</b> <b>overall,</b> <b>just like Elon Musk said,</b> <b>we humans might just be living inside a simulation.</b>
<b>Hello Steve, please say hi to our audience first.</b>
<b>Thanks to Xiaojun for the invitation.</b>
<b>My name is Steve, Chinese name Xie Chen.</b>
<b>I am the founder and CEO of Guanglun Intelligence.</b>
<b>Steve has actually been on our podcast before,</b> <b>but since this is our first time recording a video podcast,</b> <b>let’s have Steve introduce himself again,</b> <b>and share some of his past experiences.</b>
<b>I originally studied physics as an undergrad at Peking University,</b> <b>then went to Columbia Business School for a PhD in quantitative finance.</b>
<b>Unlike many leaders in the tech world,</b> <b>especially those founding companies in embodied AI,</b> <b>my experience right after graduation was somewhat complicated.</b>
<b>I was actually the head of AI algorithms for dynamic pricing in e-commerce</b> <b>Which company in e-commerce?</b> <b>It was called Jet.com back then,</b> <b>an emerging startup aiming to disrupt Amazon.</b>
<b>It quickly raised a lot of funding</b> <b>and in the end was acquired by Walmart</b> <b>After that,</b> <b>I also worked as a product manager,</b> <b>responsible for products</b> <b>So I was always focused on algorithms and their practical implementation,</b> <b>while thinking about my next step</b> <b>Until 2018,</b> <b>when I was especially fortunate to go to Silicon Valley</b> <b>and join Cruise</b>
<b>At that time, those were among the two most advanced companies.</b>
<b>One was Waymo, and the other was Cruise, both L4 autonomous driving companies</b> <b>I went to Cruise to lead autonomous driving simulation</b> <b>That was my first time truly validating, across the entire industry, that simulation and synthetic data</b> <b>are not toys</b> <b>and can genuinely and effectively support the evolution of algorithms</b> <b>After that, I went to NVIDIA,</b> <b>where I was in charge of autonomous driving simulation</b>
<b>It was actually during my time at NVIDIA,</b> <b>in 2021, right after I joined, that I discovered something that completely changed my perspective</b> <b>I realized that for NVIDIA's vehicle-side chip,</b> <b>Orin, the biggest customers were not Waymo or Cruise</b> <b>but NIO, XPeng, and Li Auto</b> <b>This was a huge shock to me</b> <b>It made me realize that the next generation of autonomous driving</b> <b>wouldn't be in the US,</b> <b>wouldn't be in Silicon Valley,</b>
<b>but in China</b> <b>So I had to return to China</b> <b>In fact, just six months after joining NVIDIA,</b> <b>I moved back to China with my family</b> <b>and joined NIO</b> <b>to lead their autonomous driving simulation</b> <b>I'm especially grateful to my wife</b> <b>She gave me tremendous support at that time,</b> <b>choosing to leave behind many friends and her work in the US</b> <b>to return to China together with me</b> <b>Of course, after returning to China</b>
<b>I truly started working at NIO</b> <b>to implement simulation from the perspective of an OEM</b> <b>building it into a complete data closed loop</b> <b>that can support</b> <b>for example, autonomous driving algorithms</b> <b>synthetic data training</b> <b>as well as large-scale evaluation</b> <b>and deployment</b> <b>At that point,</b> <b>I started having a lot of reflections</b> <b>which is to say,</b> <b>is simulation merely an accelerator,</b> <b>a nice-to-have enhancement,</b>
<b>or is it more fundamental,</b> <b>more of a prerequisite?</b>
<b>At that time, I increasingly felt that for autonomous driving,</b> <b>simulation is mostly an accelerator,</b> <b>but for embodied intelligence in robotics,</b> <b>It might actually be more of a prerequisite</b> <b>After having this line of thinking</b> <b>Especially with the evolution of large models</b> <b>So in 2023</b> <b>My co-founder Yang Haibo and I</b> <b>Decided to establish Guanglun Intelligence together</b> <b>The starting point was really the hope to use simulation</b>
<b>And synthetic data to accelerate the robotics industry</b> <b>Why was your early work experience</b> <b>More diverse compared to others?</b>
<b>What were you looking for at that time?</b>
<b>Great question</b> <b>I think I was personally searching for</b> <b>the industry,</b> <b>the cause,</b> <b>where I could make the greatest contribution</b> <b>And that contribution might not just be icing on the cake,</b> <b>but rather</b> <b>where I could truly become a prerequisite</b> <b>and truly transform an industry</b> <b>Actually, I majored in physics during my undergraduate studies.</b>
<b>Physics is actually quite difficult</b> <b>When I first joined the Physics Department at Peking University,</b> <b>I ranked 110th in my grade</b> <b>I probably spent three years</b> <b>sometimes not going to bed until two in the morning.</b>
<b>I didn't even go home during winter and summer breaks</b> <b>I stayed at school the whole time</b> <b>In the end, I ranked in the top five of my grade</b> <b>This experience made me feel,</b> <b>first, that through hard work</b> <b>you really can do better</b> <b>but second,</b> <b>that talent is still the most crucial factor</b> <b>And I thought I might still be lacking</b> <b>in talent for physics</b> <b>Later, I moved into finance because I saw the potential there</b>
<b>At that time, the students from physics and mathematics</b> <b>who had progressed the furthest</b> <b>often ended up in the finance industry, as I did</b> <b>But it was only after pursuing my PhD that I realized</b> <b>this industry had actually started to lack innovation</b> <b>and perhaps didn't contribute much,</b> <b>in terms of real impact on society, from my perspective</b> <b>So I wanted to immerse myself more in the tech industry</b> <b>Once I entered the tech field</b>
<b>I was really searching for where I could add the most value</b> <b>From a product standpoint</b> <b>I especially wanted to do one thing</b> <b>which was to truly bring it to effective implementation</b> <b>to enhance value for users</b> <b>But after working on it for a while, I realized</b> <b>this lacked some technical difficulty</b> <b>some substantial challenges</b> <b>it wasn’t disruptive enough</b> <b>So with these thoughts in mind</b> <b>I kept searching</b>
<b>And of course, I think my greatest fortune was</b> <b>that around 2018, I truly found it</b> <b>the most meaningful thing,</b> <b>which I believed could become a product</b> <b>and even a business model:</b> <b>simulation</b> <b>I remember meeting one of your senior classmates,</b> <b>also from Peking University's physics department,</b> <b>and he said you are quite rare</b> <b>to have done your undergraduate in Peking University's physics department</b>
<b>and then quickly went to Columbia Business School</b> <b>How do you think your traits differ from your peers?</b>
<b>I think my trait is that</b> <b>I probably want to do one thing</b> <b>or not do it at all</b> <b>but if I do it, I want to do it the best</b> <b>ideally at an international level</b> <b>to be number one or number two</b> <b>or in other words, no one else can do it better than me</b> <b>at that level</b> <b>Also, I think another trait is</b> <b>I prefer to find a point of differentiation</b> <b>Going to Columbia Business School</b>
<b>One reason was that, academically,</b> <b>it was probably a better fit in many respects</b> <b>But the main reason was</b> <b>I realized I wasn't suited for physics</b> <b>At that point in time,</b> <b>compared to my peers,</b> <b>I think I probably thought more deeply,</b> <b>because I was constantly searching</b> <b>to figure out in which area</b> <b>I could truly have an advantage,</b> <b>something that would set me apart from others</b> <b>Did you find it?</b>
<b>I believe I hadn't found it back then,</b> <b>but I think I've found it now</b> <b>What I haven't mentioned is</b> <b>I also started businesses during my undergrad</b> <b>and during my PhD</b> <b>My undergrad experience was actually more complex</b> <b>By my junior year, I had reached the top five in my class,</b> <b>and I started to let loose</b>
<b>Yes, that grade was enough for me to apply to a prestigious school abroad,</b> <b>so the grades after that weren't as critical</b> <b>At that point, I started thinking:</b> <b>what was I missing?</b>
<b>What I was missing was probably a real club experience</b> <b>International exposure</b> <b>Because I had probably been studying hard for those three years</b> <b>While my classmates had all kinds of</b> <b>Different experiences</b> <b>So I applied at that time</b> <b>To do an exchange year at Columbia University</b> <b>That year really left a deep impression on me</b> <b>It was during the financial crisis</b> <b>Indeed, in 2008</b>
<b>I truly experienced a very different world</b> <b>I took some very interesting courses</b> <b>and made a lot of friends</b> <b>It also showed me that people like me,</b> <b>who really hoped to study abroad during their undergraduate years,</b> <b>were common at Peking University and Tsinghua University</b> <b>many students from these top universities</b> <b>also hoped to have such an experience</b> <b>to better understand the world during their undergraduate studies</b>
<b>and find their next direction</b> <b>So at that time, I organized a study group</b> <b>an exchange group</b> <b>which means that while I was at Peking University</b> <b>we held several sessions</b> <b>and took many students</b> <b>to go abroad to the United States</b> <b>including actually during my PhD</b> <b>I couldn’t just sit still</b> <b>so I started a business</b> <b>while I was doing my PhD</b> <b>at that time, I had a dog</b> <b>its name was Potato</b>
<b>it was a very cute pug</b> <b>When it was three months old</b> <b>it was diagnosed with a heart condition</b> <b>which made me very sad</b> <b>Because of my love for it</b> <b>and also because</b> <b>of interacting with many dog owners</b> <b>I realized that, on one hand, for Potato</b> <b>and on the other hand, for the dog owner community</b> <b>there might be a need for an app</b> <b>a mobile app</b> <b>to help everyone better maintain</b> <b>their relationships with each other</b>
<b>and also to help them better maintain their connection with their dogs</b> <b>So at that time, I downloaded a lot of apps,</b> <b>probably over five hundred,</b> <b>on my phone</b> <b>and tried them one by one</b> <b>I taught myself design,</b> <b>taught myself coding,</b> <b>and then developed this app</b> <b>Your first tech startup</b> <b>Yes, my first tech startup</b> <b>I developed an app for dog owners</b> <b>At that time, this dog owner app</b> <b>was actually ranked</b>
<b>probably among the top three social apps for dog owners in North America</b> <b>and it mostly had five-star reviews</b> <b>so it was quite popular</b> <b>But I think one problem was</b> <b>I didn’t really consider the business model back then</b> <b>So after finishing it</b> <b>it was actually hard to commercialize</b> <b>At that time, a few VCs in Silicon Valley</b> <b>offered me a term sheet</b> <b>wanting to invest in me</b> <b>But I was also close to finishing my PhD</b>
<b>I thought it over and decided to pass,</b> <b>because maybe this wasn't really the path</b> <b>I wanted to pursue for my life</b> <b>At the same time, I felt there wasn't a viable business model,</b> <b>and I didn't want to take investors' money</b> <b>and waste their funds</b> <b>as well as my own time</b> <b>So later on, I decided to shut down the company</b> <b>How long did you run this company?</b>
<b>The company operated for about three years,</b> <b>until I graduated with my PhD</b> <b>Your previous work experience is quite diverse</b> <b>Is it because you passed on many things</b> <b>and realized many things just weren't right for you?</b>
<b>Exactly</b> <b>Actually, I think it varies from person to person</b> <b>Let me give an example</b> <b>I think Buffett and Lang Lang are quite fortunate</b> <b>On one hand, I think they are amazing</b> <b>They have very strong abilities</b> <b>On the other hand, they are very lucky</b> <b>They probably, at the age of ten,</b> <b>identified what they excel at</b> <b>Maybe Buffett discovered at ten years old</b> <b>that he was especially fond of stocks</b>
<b>and also skilled at investing</b> <b>Lang Lang probably discovered he was good at playing the piano at around ten years old</b> <b>I think I actually spent a lot of time</b> <b>discovering what I'm not good at</b> <b>I had to go through trial and error to find out what I'm not good at</b> <b>I might not be that lucky</b> <b>I spent a long time</b> <b>before I truly discovered what I'm good at</b>
<b>What I'm good at, I believe,</b> <b>is taking a more disruptive technological innovation,</b> <b>building a product on top of it,</b> <b>and using that product to truly support an entire industry</b> <b>I believe this is something I'm good at</b> <b>Why did you ultimately choose to specialize in the field of simulation?</b>
<b>You could say you've been deeply cultivating this field.</b>
<b>And yet only six months after joining NVIDIA,</b> <b>you joined NIO</b> <b>You actually switched jobs quite quickly afterwards</b> <b>and didn't stay long at each company</b> <b>Yes</b> <b>I think first and foremost, I believe in the power of simulation</b> <b>This was something I realized when I went to Cruise</b> <b>Before I led the simulation team at Cruise,</b> <b>honestly, simulation was seen as a toy</b>
<b>Or more like a demo tool Cruise used to showcase to investors</b> <b>It was actually built using a game engine</b> <b>Using this fairly traditional set of technical art techniques</b> <b>To create a world and cars that looked very realistic</b> <b>And then used it to generate a large amount of data</b> <b>But the algorithm teams</b> <b>For example, the perception team at the time</b> <b>Were not able to effectively utilize it</b>
<b>Or actually, when they did use it,</b> <b>the performance of the trained models decreased</b> <b>rather than improved</b> <b>The CEO at the time, Kyle, was quite a competitive, driven person.</b>
<b>He brought me in hoping I could solve this problem.</b>
<b>At that time, I was probably given about three months.</b>
<b>The pressure was actually quite high back then.</b>
<b>The first step I took was actually different from others.</b>
<b>My background is a bit more complex.</b>
<b>I have a background in physics,</b> <b>I have experience in quantitative analysis,</b> <b>and I also have an AI background.</b>
<b>So the first thing I did wasn’t to improve the simulation,</b> <b>but to evaluate the simulation.</b>
<b>After establishing a set of evaluation criteria,</b> <b>the second step was to truly integrate generative AI</b> <b>with the simulation,</b> <b>to genuinely enhance it.</b>
<b>At the same time, we iterated effectively with the algorithm,</b> <b>and once the data was properly fed into the algorithm,</b> <b>we actually saw an improvement.</b>
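The "evaluate before you improve" idea described here can be sketched in a few lines. This is a toy illustration only, not Cruise's actual metric (the interview doesn't specify one); the function names, scores, and the 0.05 tolerance are all assumptions. The idea: compare a model's score on real data against its score on synthetic data, and only admit synthetic data into training once that gap is small.

```python
def sim_to_real_gap(score_on_real: float, score_on_synthetic: float) -> float:
    """Toy fidelity metric: how differently a model behaves on synthetic
    data versus real data. Smaller is better (hypothetical metric)."""
    return abs(score_on_real - score_on_synthetic)


def synthetic_data_usable(score_on_real: float, score_on_synthetic: float,
                          tolerance: float = 0.05) -> bool:
    """Gate synthetic data out of training until the sim-to-real gap
    falls below an agreed tolerance (illustrative threshold)."""
    return sim_to_real_gap(score_on_real, score_on_synthetic) <= tolerance
```

For example, a detector scoring 0.91 on real data and 0.88 on synthetic has a gap of about 0.03, within a 0.05 tolerance, so the synthetic data would be admitted; the point is that the evaluation criterion comes first, and improving the simulator comes second.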
<b>This was a very special moment I truly witnessed,</b> <b>a unique point in time.</b>
<b>It truly made me believe in this matter</b> <b>Of course, why I chose to go to NVIDIA at that time</b> <b>was because</b> <b>on one hand, NVIDIA</b> <b>Jensen Huang and the team really recognized</b> <b>that I was doing well in autonomous driving simulation</b> <b>and they were indeed looking for someone to lead this</b> <b>but secondly, from my perspective</b> <b>I was constantly throwing counterexamples at myself</b> <b>challenging myself</b> <b>asking why I should believe</b> <b>that I was the best at simulation</b>
<b>because at that stage,</b> <b>Waymo had its own approach,</b> <b>Cruise had its own approach,</b> <b>and the whole industry hadn't fully converged yet,</b> <b>so it was hard to say who was right or wrong</b> <b>But I think NVIDIA's advantage</b> <b>is that it is a supplier</b> <b>I believed I had already acquired</b> <b>the L4 perspective,</b> <b>so going back to Waymo might not make much sense for me</b> <b>But if I went to a supplier,</b>
<b>I could see it from the supplier's perspective</b> <b>and look at how to approach simulation</b> <b>That's why I joined NVIDIA</b> <b>How big was NVIDIA back then, in 2021?</b> <b>At that time, there were probably around ten thousand people</b> <b>But its autonomous driving team</b> <b>had already been developing for several years</b> <b>When it comes to autonomous driving,</b> <b>the investment is still relatively high</b> <b>At that time, was transitioning from Cruise to NVIDIA</b> <b>a mainstream choice?</b>
<b>Actually, at that time, I felt that many people</b> <b>still hadn't figured NVIDIA out</b> <b>To be honest,</b> <b>I didn't quite understand it that way myself at the time</b> <b>It wasn't until I joined NVIDIA</b> <b>that I understood it</b> <b>Do you regret leaving that job now?</b>
<b>No, I don’t regret leaving.</b>
<b>Yes, but the thing is,</b> <b>when I was actually inside NVIDIA,</b> <b>what really made me realize</b> <b>was that NVIDIA is an extremely hardcore tech company.</b>
<b>I remember telling my wife at the time,</b> <b>I said, don’t underestimate NVIDIA,</b> <b>it’s not just a gaming card company,</b> <b>it’s not just a GPU company,</b> <b>it’s a company focused on accelerated computing platforms.</b> <b>It’s a full-stack company.</b>
<b>This was something I truly saw from the inside back then.</b>
<b>Of course, that said,</b> <b>I think NVIDIA showed me</b> <b>how suppliers should approach simulation,</b> <b>but why did I go to NIO? On one hand,</b> <b>I wanted to return to China,</b> <b>and on the other hand, I wanted to experience it from a customer's perspective,</b> <b>from an OEM's perspective.</b>
<b>Because if I think about it</b> <b>the biggest demand for simulation in the future will come from OEMs</b> <b>because they will all develop their own autonomous driving systems</b> <b>so I should look at it from an OEM perspective</b> <b>to truly understand how to utilize simulation</b> <b>at the same time, I find it hard to answer myself</b> <b>another question is</b> <b>why must this be done externally?</b>
<b>isn’t it enough to do it internally?</b>
<b>so I feel I need to look from multiple perspectives</b> <b>to really gain a deep understanding myself</b> <b>whether there is truly an opportunity to do this externally</b> <b>you say simulation is not a toy</b> <b>then what exactly is simulation?</b>
<b>that’s a very good question</b> <b>to be honest</b> <b>in the beginning, I always called simulation a time machine</b> <b>without simulation</b> <b>autonomous driving might take fifteen years</b> <b>with simulation</b> <b>Maybe it can be achieved within five years</b> <b>I see it as an accelerator</b> <b>Why do I say that?</b>
<b>Because the primary data source for autonomous driving</b> <b>still comes from the real world,</b> <b>specifically data collected from cars being driven</b> <b>That data is very easy to collect</b> <b>Essentially, it is passive,</b> <b>because the drivers have already purchased the cars</b> <b>Right?</b>
<b>And then the data is collected from driving</b> <b>The industry actually prefers to use simulation</b> <b>to accomplish two things</b> <b>One is to supplement some edge-case scenarios,</b> <b>commonly known as corner cases</b> <b>These might be rare incidents on the road</b> <b>The other is to use simulation for evaluation</b> <b>Because within simulation,</b> <b>you can achieve better repeatability,</b> <b>so you can repeatedly verify</b> <b>the effectiveness of the algorithm</b> <b>and conduct regression testing</b>
<b>But at that time, my thought was</b> <b>Is simulation only useful as a time machine?</b>
<b>Is it possible</b> <b>that for AI,</b> <b>for the future development of AI,</b> <b>simulation is like NVIDIA's graphics cards:</b> <b>without it, AI simply would not develop,</b>
<b>rather than merely</b> <b>developing faster with it?</b> <b>So at this point in time,</b> <b>I started to look into the robotics industry</b> <b>At NVIDIA,</b> <b>one thing that really moved me was</b> <b>that I had the opportunity to work with Jensen Huang</b> <b>and with some of the leaders of Omniverse,</b> <b>and we had quite in-depth exchanges</b>
<b>At that time, I actually felt that NVIDIA</b> <b>was playing a big strategic game</b> <b>and what they were truly focusing on was robot simulation</b> <b>and they turned this into a complete platform</b> <b>because they strongly believed that through synthetic data</b> <b>through simulation</b> <b>this was the only path</b> <b>to truly enable robots to be deployed worldwide in the future</b> <b>At that time, I increasingly believed</b> <b>that this was indeed a major trend going forward</b> <b>At this stage</b> <b>I thought that</b>
<b>what I really should do is start a business</b> <b>not focused on autonomous driving simulation</b> <b>or synthetic data</b> <b>but to truly build the data infrastructure</b> <b>the data engine for the entire robotics industry</b> <b>Why do it externally?</b>
<b>Why not within a single company?</b>
<b>Why don’t these robotics companies do this themselves?</b>
<b>How should I put it</b> <b>Actually, it took me quite a long time to understand</b> <b>I think what matters more here is to consider the difficulty of this matter</b> <b>the market opportunity</b> <b>and I think it’s useful to compare it with</b> <b>some companies in this industry</b> <b>for example, companies like Scale AI</b> <b>I believe when the market opportunity is big enough</b> <b>and the difficulty is relatively high</b>
<b>in such a situation</b> <b>I think there’s a greater advantage to doing it externally</b> <b>Why?</b>
<b>Why?</b> <b>Because you can actually attract better and more talented people</b> <b>Let me give an example</b> <b>At Cruise, the best algorithm talent</b> <b>is hard to assign to the simulation team</b> <b>they will definitely be assigned to the perception team</b> <b>or the prediction team at that time</b> <b>Right?</b>
<b>Right?</b> <b>Then at Waymo, the best data talent</b> <b>It doesn't necessarily have to be given to the data infrastructure team</b> <b>It might be given to the algorithm team</b> <b>And at Scale AI</b> <b>Right?</b>
<b>Right?</b> <b>It attracts the best algorithm talents from around the world</b> <b>And data experts to build a data flywheel for it</b> <b>I think the same principle applies</b> <b>Which is, I believe as long as the task is sufficiently difficult</b> <b>And the business opportunity is large enough</b> <b>I think it should be done externally</b> <b>Unless, for example, this task is something like</b> <b>Just autonomous driving simulation</b> <b>Right?</b>
<b>Right?</b> <b>Then I do think</b> <b>It might not be worth doing this entirely externally</b> <b>Actually, today's program aims to discuss a very specialized</b> <b>Relatively niche</b> <b>But also fundamental topic: data</b> <b>Because nowadays, whether it's large language models, embodied intelligence</b> <b>Or robotics, all are deeply concerned with data issues</b> <b>However, the stages on both sides might be different</b> <b>Large language models have hit a data wall</b> <b>There is no more data available</b>
<b>All the internet data has already been consumed</b> <b>But for Robotics, data is still a barren desert</b> <b>So in your view, how important do you think the data issue is?</b>
<b>Is it a fundamental problem?</b>
<b>Regarding data, I actually believe it is a fundamental issue for AI</b> <b>If we think from first principles,</b> <b>I think data for models, or data for intelligence,</b> <b>is somewhat analogous to education for human learning</b>
<b>Data is roughly equivalent to education</b> <b>I believe data is extremely critical for intelligence,</b> <b>because data for intelligence is similar to how humans acquire knowledge</b> <b>and continuously improve themselves</b> <b>Knowledge is an extremely critical, first-principles requirement for human intelligence</b> <b>So by analogy, I think data is vitally important for intelligence</b> <b>How would you define data?</b>
<b>I think I tend to look at it from the perspective of the different stages of AI data development</b> <b>That helps us think about how to define AI data.</b>
<b>I believe the earliest data was more like the initial stage of machine vision.</b>
<b>At that time, Professor Fei-Fei Li defined ImageNet.</b>
<b>Back then, the data was more of a dataset.</b>
<b>It was a static collection including images,</b> <b>along with relatively accurate ground truth annotations.</b>
<b>This was the earliest phase.</b>
<b>It was the stage of static datasets.</b>
<b>If I were to compare it to human education,</b> <b>it was more like a one-time, spoon-fed style of teaching.</b>
<b>For example, buying some textbooks once,</b> <b>and providing them to students for learning.</b>
<b>Later on, I think we reached the era of Scale AI,</b> <b>where data production truly became industrialized.</b>
<b>At this point, data was more about</b> <b>a large-scale, factory-style process,</b> <b>including the subsequent techniques,</b> <b>to produce high-quality data at scale with relatively high timeliness.</b>
<b>So it was more like a factory process for mass-producing data.</b>
<b>At this stage, I think it’s somewhat akin to a wholesale style of education.</b>
<b>Moving forward, I believe we have entered the era of large language models.</b>
<b>In this era, I think the data used for pre-training</b> <b>has already exhausted the entire internet’s data.</b>
<b>So the focus of data shifts to the post-training and evaluation stages.</b>
<b>More and more, it relies on higher-level experts,</b> <b>such as highly skilled engineers,</b> <b>physicists, top mathematicians, lawyers, and doctors.</b>
<b>On one hand, they design the questions,</b> <b>and provide evaluation criteria.</b>
<b>On the other hand, based on these questions and the feedback from assessing the large models,</b> <b>we identify corresponding issues and then address them by</b> <b>providing more information, more experience, and guidance.</b>
<b>Helping the models improve.</b>
<b>At this point, I think the data becomes more like</b> <b>a more advanced stage of education,</b> <b>where the teacher imparts knowledge, nurtures, and resolves doubts.</b>
<b>It’s like a teacher who tailors instruction to your individual needs,</b> <b>based on your abilities and your current stage,</b> <b>using evaluations to identify problems,</b> <b>and then providing you with experienced, targeted feedback and guidance.</b>
<b>To help you improve</b> <b>So I think this is actually an evolution of data</b> <b>Of course, from the perspective of embodied intelligence,</b> <b>the data is even more complex</b> <b>For example, in large language models,</b> <b>the data is mostly in the digital world,</b> <b>and from the evaluation perspective,</b> <b>more feedback is provided to the model</b> <b>Embodied intelligence, I think, lives more in the physical world</b>
<b>Whether in the real physical world</b> <b>Or in the simulated physical world</b> <b>Based on evaluation and signals</b> <b>Offering more effective experience transfer and feedback</b> <b>Yes, so I think this might represent different stages in data development</b> <b>From this perspective</b> <b>I believe data should be more defined as</b> <b>A signal that helps you learn</b>
<b>Along with the corresponding transfer of experience</b> <b>So it evolves from static data</b> <b>Into an educational system</b> <b>Yes, I actually find this quite interesting.</b>
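The closed loop described here (evaluate the model, find its weak areas, commission targeted data for exactly those areas) can be sketched roughly as follows. All names, the 0.8 passing bar, and the even budget split are illustrative assumptions, not anything stated in the interview.

```python
def weakest_topics(eval_scores: dict, passing: float = 0.8) -> list:
    """Topics where the model scores below the passing bar, worst first.
    eval_scores maps topic -> score in [0, 1]."""
    failing = sorted((score, topic) for topic, score in eval_scores.items()
                     if score < passing)
    return [topic for _, topic in failing]


def targeted_data_order(eval_scores: dict, budget: int) -> dict:
    """Split a data-collection budget evenly across the weakest topics,
    instead of buying one undifferentiated batch of data."""
    topics = weakest_topics(eval_scores)
    if not topics:
        return {}
    return {topic: budget // len(topics) for topic in topics}
```

So an evaluation like `{"math proofs": 0.55, "contract law": 0.92, "code review": 0.70}` would direct the entire data budget to math proofs and code review, which is the "teacher who tailors instruction" pattern in miniature.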
<b>For example, I still remember the early days of autonomous driving.</b>
<b>At that time, the data teams,</b> <b>the datasets they provided didn’t really come with much feedback.</b>
<b>There wasn’t much feedback.</b>
<b>Usually, it was more like the algorithm teams raised some requirements.</b>
<b>Right, then the data teams would deliver accordingly.</b>
<b>And later, the algorithm teams would raise more requirements.</b>
<b>Exactly. So if we look at,</b> <b>many of today’s data annotation industries,</b> <b>especially in autonomous driving data annotation,</b> <b>I think they’re still at this stage.</b>
<b>Right, actually these data vendors or internal teams</b> <b>don’t really understand the state of the algorithms.</b> <b>They are mostly passively accepting the requirements put forward by the algorithm teams,</b> <b>and then providing the corresponding data deliverables.</b>
<b>But if we look at, for example, the large language model industry,</b> <b>of course, one factor is scale.</b>
<b>And then, for example, companies like Mercor and Surge,</b> <b>they tend to recruit more senior-level talent.</b>
<b>providing more evaluations of their clients' models and algorithms,</b> <b>using these evaluations to give feedback to their clients,</b> <b>and based on this feedback,</b> <b>proposing and stimulating greater demand for data,</b> <b>then helping these clients meet their increased data needs</b> <b>to improve their algorithms, forming a closed loop</b>
<b>At this stage, data providers actually have</b> <b>a very thorough understanding of their clients' algorithms,</b> <b>because the real evaluator becomes the data provider</b> <b>Exactly, so I think this is very much like the relationship between students and teachers</b> <b>For example, in mass education,</b>
<b>the teacher might not have much understanding of the students</b> <b>It's more of a rote, one-way teaching approach</b> <b>Whereas in a more advanced setting, like a university professor</b> <b>or a physics Olympiad teacher, the relationship with students</b> <b>is much more targeted and personalized</b>
<b>I believe data is evolving towards this kind of targeted guidance.</b> <b>We often hear a few phrases in the industry:</b> <b>one is data annotation,</b> <b>another is that the amount of data corresponds directly to the amount of manual labor.</b> <b>Let me give everyone a visual explanation</b> <b>of the workload implied behind these two phrases —</b> <b>specifically, what tasks an annotator's work includes</b> <b>and what the workflow looks like.</b> <b>I want to say that data is actually evolving.</b>
<b>it might have started with basic data labeling and now involves more data collection</b> <b>Let me give some examples here</b> <b>For instance, in the data labeling industry</b> <b>take Scale AI, which initially provided data for autonomous driving</b> <b>they might have received, for example,</b> <b>various sensor data from their clients</b> <b>and then performed extensive data cleaning</b>
<b>as well as more detailed slicing of the data</b> <b>on top of that, they probably developed their own toolchain</b> <b>though much of the process is still human-driven</b> <b>working based on these toolchains</b> <b>and following certain standardized procedures</b> <b>for example, drawing a box here</b> <b>this is a bicycle</b> <b>that is a pedestrian</b> <b>Including data that may be more sequential in nature</b>
<b>Labeling them</b> <b>Then possibly going through multiple layers of annotation</b> <b>Further development might first involve automated annotation</b> <b>Followed by human-in-the-loop quality inspection</b> <b>This way, the data is ultimately produced</b> <b>This is probably a more traditional approach to autonomous driving algorithm annotation</b> <b>Such an industry</b> <b>How much manpower does it require?</b>
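The workflow just described — automated pre-annotation followed by human-in-the-loop quality inspection — can be sketched roughly as follows. This is a minimal illustration, not any vendor's actual pipeline; all class and function names are hypothetical, and the "human correction" step is simulated.

```python
from dataclasses import dataclass, replace

@dataclass
class Box:
    label: str        # e.g. "bicycle" or "pedestrian"
    x: float
    y: float
    w: float
    h: float
    score: float      # confidence from the automated pre-annotation pass

def auto_preannotate(frame):
    """Hypothetical stand-in for a detection model proposing boxes."""
    return [Box(label, *coords, score) for label, coords, score in frame]

def human_review(box, threshold=0.8):
    """Low-confidence proposals are routed to a human annotator; here the
    correction is simulated by marking the box verified (score 1.0)."""
    if box.score < threshold:
        return replace(box, score=1.0)
    return box

def annotate(frames):
    # Automated pre-annotation first, then human QC only where needed.
    return [human_review(b) for frame in frames for b in auto_preannotate(frame)]

# A toy "frame": (label, (x, y, w, h), model confidence).
frames = [[("bicycle", (10.0, 20.0, 30.0, 40.0), 0.95),
           ("pedestrian", (50.0, 60.0, 20.0, 50.0), 0.40)]]
done = annotate(frames)
```

The point of the threshold is exactly the division of labor described above: confident machine proposals pass through untouched, while ambiguous ones consume expensive human time.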
<b>It requires a significant amount of manpower.</b> <b>Even now, I actually think the autonomous driving annotation industry</b> <b>is still largely at this stage.</b> <b>Of course, I believe there are already many automated algorithms on the client side.</b> <b>But if we look at the entire industry,</b> <b>there are probably many bases —</b> <b>many provinces and cities likely have numerous annotation centers,</b> <b>and each base might have tens of thousands of people</b> <b>working in the annotation industry.</b>
<b>So across the whole market, I estimate there might be</b> <b>I’m not sure</b> <b>I estimate there could be hundreds of thousands, maybe even a million people.</b>
<b>When it comes to manual annotation —</b> <b>so many people?</b> <b>Right,</b> <b>there are many people.</b> <b>Of course, to be honest,</b> <b>I believe this still belongs to the previous generation of data:</b> <b>it is based on a set of standards and guidelines,</b> <b>and people provide annotation information according to those guidelines.</b> <b>But I believe in the next generation of data,</b> <b>what people actually provide is shared experience.</b>
<b>Let me give an example, for instance</b> <b>The data of large language models</b> <b>Whether it's Mercor or Surge</b> <b>These might be the two current leaders in the Bay Area</b> <b>Relatively emerging data vendors</b> <b>So they provide fine-tuning for large language models</b> <b>And the data from the evaluations</b> <b>That includes, for example, RLHF, which is</b> <b>Including continuously interacting with this model</b> <b>Provide them with feedback</b> <b>And they come up with a lot of questions</b>
<b>They provide answers to help the client's algorithm —</b> <b>on one hand, to evaluate it,</b> <b>and on the other hand, to let it improve itself through better RL fine-tuning.</b> <b>Actually, at this point,</b> <b>these people are all very experienced,</b>
<b>Or rather, very expensive people</b> <b>You can look at their hourly wage</b> <b>All of them have hourly rates above one hundred dollars.</b>
<b>What they provide is more like original data.</b> <b>It's not just an annotation —</b> <b>it's not that they add a layer of labels</b> <b>on top of existing data.</b> <b>Instead, they directly provide feedback on the data,</b>
<b>Or directly generate new data</b> <b>Could you give an example?</b>
<b>Give an example</b> <b>For example, it's a question</b> <b>What is your perspective on AI data?</b>
<b>Right.</b> <b>So the algorithm might first generate its own perspective —</b> <b>for example, GPT might first generate its own perspective.</b> <b>Then, if there is a data expert here,</b> <b>he might, based on GPT's perspective,</b> <b>provide it with the appropriate feedback.</b> <b>Right.</b> <b>At the same time, he might also come up with more questions,</b> <b>more challenges.</b> <b>The role of a teacher.</b> <b>Exactly —</b> <b>he plays the role of a teacher.</b> <b>He will pose more questions.</b>
<b>At the same time, he might also provide more answers.</b> <b>For example —</b> <b>take programming:</b> <b>you might have ten different ways</b> <b>to write this piece of code.</b> <b>Which one is good?</b>
<b>Which one is bad?</b>
<b>Which one is ambiguous?</b>
<b>All of these need to be extracted accordingly</b> <b>and provided to the algorithm.</b>
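The "ten ways to write the same program — which is good, which is bad, which is ambiguous" idea above is essentially preference data. A minimal sketch of how such expert grades might be turned into (preferred, rejected) pairs for RL fine-tuning — the solution names, grades, and helper are all hypothetical illustrations:

```python
# Each candidate solution gets a grade from an expert annotator.
candidates = {
    "sol_a": "good",       # idiomatic and correct
    "sol_b": "bad",        # incorrect edge-case handling
    "sol_c": "ambiguous",  # works, but the intent is unclear
}

# Map grades onto an ordering so candidates can be compared.
ORDER = {"good": 2, "ambiguous": 1, "bad": 0}

def preference_pairs(graded):
    """Emit a (preferred, rejected) pair wherever one grade beats another."""
    items = list(graded.items())
    return [(a, b) for a, ga in items for b, gb in items
            if ORDER[ga] > ORDER[gb]]

pairs = preference_pairs(candidates)
# pairs: [("sol_a", "sol_b"), ("sol_a", "sol_c"), ("sol_c", "sol_b")]
```

The asymmetry is the point: the expert does not just mark answers correct, but ranks them, and the ranking itself is the training signal.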
<b>So at this point, the data is very different from before.</b>
<b>Previously, for example, in autonomous driving data,</b> <b>or the most traditional machine vision data,</b> <b>you probably needed to provide only correct information,</b> <b>right?</b>
<b>Perfectly correct information.</b>
<b>That was ideal.</b>
<b>But actually, with today's data,</b> <b>such as large language models or embodied AI,</b> <b>there isn’t a strict notion of correctness,</b> <b>nor strict perfection.</b>
<b>Everyone’s answers might be different,</b> <b>right?</b>
<b>But the distribution across these different people,</b> <b>this diversity,</b> <b>the logical relationships within it,</b> <b>and even some erroneous data</b> <b>can be extremely valuable.</b> <b>Let me give an example:</b> <b>the data from our embodied AI clients.</b> <b>In the early days,</b> <b>our clients —</b> <b>who are among the world's top embodied AI companies —</b> <b>their requirement to us might have been</b>
<b>that we provide a perfectly accurate, simulation-based</b> <b>robot to carry out a long-horizon task.</b> <b>For example, making a pizza:</b> <b>taking the dough out of the fridge,</b> <b>then adding various seasonings on top,</b> <b>as well as different fruits, vegetables, and meats,</b> <b>cheese, and so on,</b> <b>finally putting it into the oven</b> <b>and pressing the buttons.</b> <b>You have to make it perfectly.</b>
<b>Only this long-horizon task counts as effective data.</b> <b>But later on, our clients —</b> <b>and we ourselves, iterating together —</b> <b>discovered that</b> <b>actually, the most effective data is data that fails first and then succeeds.</b> <b>For example, I might want to put a slice of mushroom on it,</b> <b>but after I take out the mushroom and slice it,</b> <b>I don't hold it firmly,</b> <b>the mushroom falls onto the table,</b> <b>and I pick it up again</b>
<b>and put it back onto the pizza.</b> <b>This kind of data —</b> <b>we might call it negative samples or corrective data —</b> <b>is often more valuable.</b> <b>So actually, as the model's generalization ability improves,</b> <b>it can learn from mistakes and recover.</b> <b>This kind of cognition</b> <b>is closer to the human learning process.</b> <b>Exactly,</b> <b>it becomes closer to the human learning process.</b> <b>Not long ago, we had a podcast discussing a viewpoint</b>
<b>that Li Guangmi raised: they spent serious time researching the frontier labs —</b> <b>those Silicon Valley frontier-lab companies and their data annotation.</b> <b>The biggest impression was:</b> <b>if the model's training data distribution does not include a certain type of data,</b> <b>then that kind of task will not succeed;</b> <b>only by compressing that type of data into the model might it succeed.</b> <b>So today's model is still essentially a huge compressor.</b>
<b>so he (Li Guangmi) proposed that data is the model, and the model is the application</b> <b>do you agree with this viewpoint?</b>
<b>meaning all data should be trained and compressed into the model</b> <b>I think that</b> <b>Guangmi actually mentioned a</b> <b>very good point, which I think is a current issue at this stage</b> <b>that is, the model's generalization ability is still insufficient</b> <b>how do we define generalization ability?</b>
<b>I think it's called Zero Shot in English;</b> <b>in Chinese, it basically means zero-shot capability —</b> <b>the ability to learn from zero samples.</b> <b>Meaning I haven't shown you this sample,</b> <b>you haven't seen it before,</b> <b>but you are still able to handle it.</b> <b>Right.</b> <b>For example, suppose your robot's training</b> <b>has never included a video about making pizza,</b> <b>but it might have seen, for example, someone chopping vegetables,</b> <b>or someone making burgers.</b>
<b>But the task of making pizza for you</b> <b>Can you make it work?</b>
<b>This is the zero-shot capability</b> <b>Now, from an embodied perspective, I believe</b> <b>The zero-shot capability is still somewhat lacking</b> <b>Under such circumstances</b> <b>Indeed, you need to know the execution rate for the specific tasks.</b>
<b>You need to know which kinds of tasks to supplement with data.</b> <b>At this stage,</b> <b>I think this is reasonable.</b> <b>But as for the view that data essentially is the model —</b> <b>I don't believe it holds in the long term.</b> <b>I don't think it's fundamentally right.</b> <b>Because I believe, essentially speaking,</b> <b>the model's architecture still needs to be improved.</b> <b>I believe that if a model's architecture does not have</b>
<b>zero-shot generalization capability,</b> <b>then I believe this model is not truly on a path to</b> <b>general intelligence.</b> <b>Let me give another example:</b> <b>actually, people's learning algorithms are different too.</b> <b>The learning algorithm of an average person</b> <b>and Elon Musk's learning algorithm are different.</b> <b>Elon Musk's way of learning</b> <b>is probably more about starting from first principles,</b>
<b>based on his extensive knowledge</b> <b>and his practical experience —</b> <b>right —</b> <b>quickly transferring new knowledge over</b> <b>to help him better understand the matter.</b> <b>I think his model is probably</b> <b>much more effective than the average person's model.</b> <b>So, in my view, for intelligence now,</b> <b>on one hand, I definitely believe we need more</b> <b>effective, high-quality data.</b>
<b>But on the other hand, I still think it comes down to the model:</b> <b>more improvement is needed.</b> <b>So here we're talking about issues of architecture and algorithms.</b> <b>Exactly —</b> <b>essentially, it's still not smart enough.</b>
<b>I believe that generalization still requires the architecture of the algorithm to achieve it.</b> <b>Of course, there is also the Scaling Law side of it —</b> <b>it's a matter of timing:</b> <b>your data volume must accumulate to a certain large scale.</b>
<b>Only then can we see the emergence of its generalizability</b> <b>That's smart enough.</b>
<b>In fact, we are currently serving some of the largest large-model teams in the world.</b>
<b>So through our collaboration with them,</b> <b>what we actually found is that in embodied interaction,</b> <b>this zero-shot capability —</b> <b>the ability to handle tasks with zero samples —</b>
<b>I believe it has gradually started to emerge</b> <b>So I think, in this regard, I am actually quite optimistic.</b>
<b>In which scenarios does this zero-shot trend appear?</b>
<b>I think it might not be about the scenarios, but rather about the teams.</b> <b>Let me briefly share a difference I've observed.</b>
<b>Maybe this was more apparent about six months ago.</b>
<b>Our large model clients and our robot clients,</b> <b>their data requirements,</b> <b>whether in terms of volume,</b> <b>or from their specific definition perspective,</b> <b>were quite similar.</b>
<b>But perhaps in the last six months, a qualitative change has occurred.</b>
<b>What I found is that large model clients</b> <b>are now most focused on zero-shot capabilities.</b>
<b>What do they believe in?</b>
<b>They believe in the Scaling Law.</b>
<b>They believe in using a sufficiently effective algorithm,</b> <b>with enough high-quality data.</b>
<b>And this data is more about embodiment-agnostic simulation data and human data.</b>
<b>Based on large-scale simulation evaluations,</b> <b>they want to achieve, on a relatively simple embodiment —</b>
<b>for example, a robotic arm,</b> <b>not a robot with a wheeled chassis,</b> <b>not a legged robot,</b> <b>essentially just a robotic arm with a gripper —</b> <b>a sufficiently effective zero-shot transfer capability.</b> <b>This relates to why large model teams engage in hardware-related work:</b> <b>actually, it's precisely because they don't want to deal with hardware-related tasks,</b>
<b>so they choose the simplest robotic arms.</b> <b>Compared to them, humanoid robots or wheeled robots</b> <b>are actually much more complex —</b> <b>right, because they require a lot of maintenance work,</b> <b>and each physical unit requires extensive debugging.</b> <b>Yes. But what do large model teams use robotic arms for?</b> <b>Currently, the main large model teams are working on embodied VLAs.</b>
<b>large model teams are also working on VLAs</b> <b>it's not only embodied intelligence or autonomous driving teams working on VLAs</b> <b>yes, I think this is actually the most critical point</b> <b>for example, if we look at DeepMind, that's Tan Jie, right</b> <b>For example, companies like NVIDIA and OpenAI</b> <b>what direction are they aiming for with VLA?</b>
<b>I believe their top priority is definitely focused on general intelligence.</b>
<b>Their underlying logic is to build an embodied brain.</b>
<b>First and foremost, it must have generalization capabilities.</b>
<b>Right? The brain’s abilities don’t necessarily have to be extremely strong.</b>
<b>For instance, I might need a dexterous hand to screw in a bolt,</b> <b>but I should be able to create a brain</b> <b>that, after training on a hundred different tasks,</b> <b>can handle five other tasks it has never seen before.</b>
<b>It can perform those additional five tasks.</b>
<b>This, I think, is the focus of large model teams.</b> <b>They pay attention to this aspect.</b>
<b>They focus on zero-shot generalization capabilities.</b>
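The "train on a hundred tasks, then handle five it has never seen" setup described above is typically measured with a held-out task split. A minimal sketch, with illustrative task names and a made-up split function:

```python
import random

def split_tasks(tasks, n_heldout=5, seed=0):
    """Hold out n tasks the model never sees during training; zero-shot
    performance is then measured only on the held-out tasks."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return shuffled[n_heldout:], shuffled[:n_heldout]   # (train, held-out)

# e.g. a hundred manipulation tasks; five are never shown during training.
tasks = [f"task_{i:03d}" for i in range(100)]
train, heldout = split_tasks(tasks)

# The whole point of the protocol: no leakage between the two sets.
assert not set(train) & set(heldout)
```

Holding out whole tasks, rather than just individual episodes, is what distinguishes a zero-shot generalization test from an ordinary validation split.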
<b>On the other hand, from the perspective of robotics customers,</b> <b>they are increasingly applying these technologies to specific scenarios.</b>
<b>Right? In this area, they pay close attention to their own embodiment.</b>
<b>Right? The complexity of the embodiment might include wheels,</b> <b>legs,</b> <b>or hands.</b>
<b>They might also have sensors on the hands.</b>
<b>Right? So they might be very focused on whether these specific tasks</b> <b>can be executed well and implemented properly</b> <b>So these two types of clients actually had similar points of focus in the early days</b> <b>but now their focus has actually become quite differentiated</b> <b>Let me give another example</b> <b>For instance, the large model teams</b>
<b>They might primarily focus on the data that’s easiest to obtain,</b> <b>such as from home environments,</b> <b>or other scenarios like supermarkets and so on,</b> <b>and possibly some factories, to help them improve this kind of generalized cognition.</b>
<b>Right? Whereas the robot clients might have specific implementation paths</b> <b>Some might be deployed in hotels, others in various factory workshops</b> <b>For example, workshops in vehicle manufacturing plants</b> <b>Some might even be sent to deserts to replace solar panels</b> <b>They are more focused on data from concrete business scenarios</b>
<b>There's a small gap in my understanding here.</b> <b>The VLA teams and the LLM teams within these large-model companies —</b> <b>these should be two separate teams, right?</b>
<b>What kind of collaboration exists between them?</b>
<b>Actually, different companies are different.</b>
<b>Often, these are two separate teams.</b> <b>But in reality, I think they form an extremely close-knit collaborative team.</b>
<b>In fact, within a big-model company this probably includes the large language model team,</b>
<b>the world model team,</b>
<b>and the VLA team.</b>
<b>They actually have what I consider a highly symbiotic and cooperative relationship.</b>
<b>For example, VLA often relies on a foundational model.</b>
<b>If your company already ranks among the top five globally in large model capabilities,</b> <b>then you can definitely use your own base model to build upon.</b>
<b>Right?</b>
<b>But if you don’t have that,</b> <b>then I think it becomes somewhat more challenging.</b>
<b>So from our perspective,</b> <b>the companies we collaborate with</b> <b>usually have the largest datasets and simultaneously possess teams for large language models,</b> <b>world models,</b> <b>and VLA teams.</b> <b>It’s these kinds of teams that undertake this work.</b>
<b>If they don't have that,</b> <b>they will definitely use someone else's —</b> <b>some might use Qwen (Tongyi Qianwen),</b> <b>or they might use other open-source models.</b> <b>Yes, of course, that's one aspect.</b> <b>Secondly,</b> <b>it's about their understanding of data,</b> <b>which I think is extremely accurate —</b> <b>for example,</b> <b>not just purely correct data,</b> <b>but also data used for error correction,</b> <b>that is, data corrected after mistakes.</b> <b>This insight actually comes largely from large language models.</b> <b>Right.</b>
<b>Because it's more human-like.</b> <b>Additionally,</b> <b>regarding data volume —</b> <b>the level of data hunger is also vastly different.</b> <b>Because if you have already seen a large volume of data for a certain need,</b> <b>your expectations for the data volume here</b> <b>will be very high.</b> <b>Whereas if</b> <b>this team previously</b> <b>used a relatively small amount of data,</b> <b>then it's hard for them to suddenly open up a much larger data scale,</b>
<b>because their budget is completely different.</b>
<b>The third point is,</b> <b>I actually think it’s about infrastructure,</b> <b>specifically the training infrastructure,</b> <b>and I believe GPUs are a very relevant factor,</b> <b>as well as reinforcement learning,</b> <b>the whole infrastructure for RL,</b> <b>is a very important aspect.</b>
<b>Let me give an example:</b> <b>A robotics company might already have thousands of GPUs,</b> <b>but a large model team</b> <b>could have tens of thousands of GPUs,</b> <b>so this represents at least an order of magnitude increase.</b>
<b>Additionally, the reinforcement learning infrastructure</b> <b>is actually very difficult to develop in-house.</b>
<b>For embodied-model teams, it's difficult</b> <b>to develop a large-scale parallel</b> <b>reinforcement learning infrastructure in-house,</b> <b>while these large model teams</b> <b>often already have the best infrastructure</b> <b>available for their own use —</b> <b>they just shifted it from the large-language-model scenario</b> <b>to fine-tuning VLAs instead.</b> <b>So the LLM teams are working on what we call the general-purpose brain.</b> <b>Exactly.</b> <b>And the VLA teams are working on the robotic brain.</b>
<b>But most likely, they are not training it from scratch</b> <b>They are building on top of the large language model brain</b> <b>Exactly</b> <b>What about the world model teams?</b>
<b>Is this something new?</b>
<b>Actually, we've seen some of our clients</b> <b>who might be using their own world models,</b> <b>or hope to use these world models in the future,</b> <b>as a foundational base model,</b> <b>and then connect the VLA part on top of it later.</b> <b>Because I believe the world model has actually gained more</b> <b>predictive capability regarding the physical world,</b> <b>and understanding ability.</b> <b>Based on this,</b> <b>combined with the corresponding Action Head —</b> <b>right?</b>
<b>— then we can create a higher-quality VLA.</b> <b>Actually, I think the world model and VLA have a very interesting,</b> <b>mutually symbiotic relationship:</b> <b>the world model can serve as the foundation for VLA,</b> <b>and VLA, in turn, acts as an implementation,</b> <b>providing corresponding feedback to the world model.</b> <b>This is a very crucial aspect.</b> <b>Let me give an example.</b> <b>I believe the evaluation criteria for these two</b> <b>will increasingly converge,</b>
<b>then maybe these two things will eventually become one.</b> <b>For instance,</b> <b>I believe in embodiment,</b> <b>possibly the best benchmark dataset right now</b> <b>is called BEHAVIOR.</b> <b>BEHAVIOR is Professor Fei-Fei Li's</b> <b>simulation-based evaluation suite</b> <b>designed specifically for embodiment.</b> <b>Its tasks are all quite challenging</b> <b>long-horizon tasks,</b> <b>requiring data that is very difficult to collect</b> <b>to achieve.</b> <b>So</b> <b>I personally also feel very fortunate that</b>
<b>this December,</b> <b>at the NeurIPS conference,</b> <b>I helped</b> <b>present the awards for this year's first BEHAVIOR Challenge.</b> <b>And I found a very interesting situation:</b> <b>among the teams that actually competed on the BEHAVIOR leaderboard,</b> <b>there were also teams working on world models.</b> <b>So what they actually did was,</b> <b>based on their foundational world model,</b> <b>add this Action Head on top, right?</b>
<b>Then they also made it onto the leaderboard</b> <b>and performed very well.</b> <b>That's one example.</b> <b>Another one,</b> <b>which I find very interesting, is called ENACT.</b> <b>It's also based on BEHAVIOR —</b> <b>this evaluation system,</b> <b>which, essentially, is an evaluation system for VLAs —</b> <b>and they developed a set of metrics to evaluate</b> <b>world models.</b> <b>This was also done by Fei-Fei Li's team.</b> <b>So you can see</b> <b>that</b> <b>the same benchmark</b> <b>can be used</b> <b>both as a standard to evaluate VLAs</b>
<b>and as a standard to evaluate world models</b> <b>So if the evaluation systems</b> <b>become increasingly consistent</b> <b>it's quite possible that in the future</b> <b>these two things will become more and more</b> <b>I think they will become increasingly related</b> <b>Then the world model is not replacing VLA</b> <b>The world model is actually replacing the large language model, right?</b>
<b>I believe the world model will more likely serve as a brain in the cloud</b> <b>while the VLA, I think, will be a brain on the edge device</b> <b>I think this is probably a long-term</b> <b>and they will have a symbiotic relationship</b> <b>As for the large language model</b> <b>I believe the large language model</b> <b>essentially already possesses some world model capabilities within the digital realm</b> <b>but it actually lacks an understanding of the physical world</b>
<b>I think the world model has the ability to understand the physical world,</b> <b>as well as predictive capabilities.</b> <b>And I believe the embodied VLA</b> <b>likely requires more</b> <b>precise, effective, and efficient action capabilities in the physical world.</b> <b>So I believe these three might still be somewhat different.</b> <b>But later on, the training infrastructure for these three —</b> <b>the underlying foundation — will become increasingly unified,</b> <b>increasingly convergent.</b>
<b>It could become a unified, very large brain.</b> <b>Right.</b> <b>So perhaps in the future, the world model will be a cloud-based brain,</b> <b>VLA will be the brain on the edge side,</b> <b>and the digital world might have a brain of its own —</b> <b>essentially the brain of the large language model.</b> <b>It sounds like there are two forces at play right now.</b>
<b>One type is companies focused on the brain,</b> <b>and one type is companies focused on the embodiment — the robot body.</b>
<b>Which of these two types of companies do you think will become</b> <b>the more important force on the industry map?</b>
<b>I believe that in the long term, the brain will likely be more important.</b> <b>But let me briefly share one of my observations.</b> <b>This observation is about the data closed loop,</b> <b>or the matter of the data engine.</b> <b>Tesla actually invented</b> <b>the concept of the Data Engine,</b> <b>largely because</b> <b>they needed to implement their FSD autonomous driving system.</b> <b>At that time, they probably already had over a million cars</b> <b>on the road, operating around the clock</b>
<b>based on drivers, right?</b>
<b>collecting data from these users driving their cars</b> <b>to train their cloud-based brain,</b> <b>and then, based on continuous improvements in the cloud brain,</b> <b>deploying better autonomous driving capabilities to the edge devices,</b> <b>thus creating a data flywheel.</b> <b>This data engine —</b> <b>Tesla's data engine — at its core</b> <b>is essentially embodiment-centric logic:</b> <b>the autonomous driving vendor or the OEM,</b> <b>because they have deployed the most of their own cars worldwide,</b>
<b>they can collect the most data from their own vehicles</b> <b>and based on this data, they can train the best brain</b> <b>So these OEMs themselves are the biggest brains</b> <b>Right?</b>
<b>But I think when it comes to embodied intelligence,</b> <b>this logic might be overturned.</b> <b>Why is that?</b>
<b>Because from the perspective of embodiment</b> <b>there simply aren’t, say, millions of robots</b> <b>deployed at the edge</b> <b>automatically performing all kinds of tasks</b> <b>or with people at the edge</b> <b>remotely controlling them to carry out various tasks</b> <b>If people are remotely controlling them</b> <b>the cost becomes too high</b> <b>which is not a scalable approach</b> <b>Under these circumstances</b> <b>I believe the entire data architecture</b>
<b>will conform to a data pyramid,</b> <b>meaning the smallest amount of data</b> <b>will be based on data collected</b> <b>from robots actually deployed at the edge.</b> <b>Real device data.</b> <b>Exactly.</b> <b>Then the data volume in the middle section</b> <b>will be generated based on simulation.</b> <b>And the data at the bottom</b> <b>could be from the internet, for example,</b> <b>or from a first-person human perspective.</b>
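The pyramid just described — a small amount of real-robot data at the top, simulation in the middle, and internet/first-person human data at the bottom — could be sketched as follows. The tier names and especially the proportions are made up purely for illustration; the source does not give numbers.

```python
# Hypothetical relative scales for the three tiers of the embodied-data
# pyramid; the counts are invented for illustration only.
pyramid = {
    "real_robot":     1,         # top: scarce data from deployed robots
    "simulation":     1_000,     # middle: scales without physical hardware
    "internet_human": 100_000,   # bottom: web video, first-person human data
}

def composition(pyramid):
    """Fraction of the total dataset contributed by each tier."""
    total = sum(pyramid.values())
    return {tier: count / total for tier, count in pyramid.items()}

shares = composition(pyramid)
```

Whatever the real ratios are, the structural claim in the transcript is just the ordering: the embodiment-free tiers dominate, so the majority of embodied data cannot come from the robot body itself.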
<b>Now, these two types of data—simulation</b> <b>and first-person human perspective data</b> <b>what are their characteristics?</b>
<b>They don’t need to be based on a physical entity</b> <b>They don’t need to be based on a hardware entity</b> <b>to generate data</b> <b>And their scalability is far greater than</b> <b>deploying real robots</b> <b>So what happens as a result?</b>
<b>I think what happens is that</b> <b>the majority of embodied data</b> <b>definitely does not come from the physical entity itself</b> <b>Right?</b>
<b>Under such a premise,</b> <b>I believe Tesla's data closed loop</b> <b>doesn't really hold up in embodiment.</b>
<b>It's basically saying there won't be a single body provider</b> <b>whose hardware is itself the most widely used,</b> <b>and who at the same time serves as the world's best brain.</b>
<b>I think this fundamentally won't work.</b>
<b>Let me give another example</b> <b>to support this point.</b>
<b>Tesla,</b> <b>they are building robots,</b> <b>right?</b>
<b>Specifically Optimus.</b>
<b>They are making robots,</b> <b>but</b> <b>Optimus' brain</b> <b>is actually slated to be provided by xAI,</b> <b>right?</b>
<b>It's not Tesla providing it themselves.</b>
<b>The same principle applies:</b> <b>it will definitely be a large model provider who builds this brain.</b> <b>Under such circumstances,</b> <b>I believe the large model providers</b> <b>will use more embodiment-agnostic data</b> <b>to train this brain,</b> <b>while the body providers are more likely to use the brains provided by the large model providers</b> <b>for fine-tuning,</b> <b>deployment, and implementation.</b> <b>Under such circumstances,</b> <b>I think there might be two other types of companies involved.</b> <b>One type is data providers.</b> <b>And I believe data providers</b>
<b>have actually gone through different stages of evolution:</b> <b>starting from the earliest days, when data sets were static</b> <b>and their relationship with clients was purely</b> <b>a simple client-vendor relationship,</b> <b>to Scale AI, Surge AI, and Mercor, whose relationship with clients</b> <b>has become more like a partnership.</b> <b>And later on, I think</b> <b>the data vendor needs to provide evaluations,</b> <b>then provide more feedback based on those evaluations,</b>
<b>and use that feedback to stimulate customer demand</b> <b>to obtain more data,</b> <b>then train better models based on this data,</b> <b>and then run more evaluations through the data vendors.</b> <b>Therefore, I believe data vendors and large model providers</b> <b>will increasingly form a symbiotic relationship,</b> <b>because large model providers need data vendors</b> <b>to give them more effective evaluations</b>
<b>And more effective data</b> <b>While data vendors need large model providers</b> <b>To provide better data validation feedback based on the models</b> <b>To help them iterate their data production pipelines</b> <b>So I think these two sides will have a symbiotic relationship</b> <b>So I believe data vendors will be very crucial in this</b> <b>Another group I think is important are the scenario providers</b> <b>They are often overlooked by many</b> <b>Scenario providers</b> <b>Or simply scenario companies</b> <b>For example, companies like OEMs</b>
<b>It is essentially a scenario-based company</b> <b>It inherently encompasses many scenarios where robots need to be deployed.</b>
<b>Inside its workshops,</b> <b>at its factories.</b> <b>Including medical groups —</b> <b>they have many scenarios of their own where robots need to be deployed.</b> <b>Including companies in agriculture and so on.</b> <b>Even in industry, this represents a huge opportunity.</b> <b>So I believe these companies at different scenario levels</b> <b>all have the need for large-scale deployment of robots.</b>
<b>Among the clients we are actually serving currently,</b> <b>a significant proportion</b> <b>already started as these scenario-level clients.</b> <b>In such a situation, I believe it will be a mutual collaboration among four parties.</b> <b>The first is the large model vendor.</b> <b>Right —</b> <b>they mostly rely on embodiment-agnostic data provided by data vendors,</b> <b>continuously pushing the boundaries of the Scaling Law</b> <b>on generalization,</b> <b>and then provide the brain —</b> <b>provide the brain to the body companies.</b>
<b>The body companies may then leverage more scenarios and data</b> <b>to implement solutions in those scenarios.</b> <b>And the scenario companies have greater autonomy,</b> <b>because they can actually choose hardware from Company A</b> <b>or choose hardware from Company B.</b> <b>They might even have strong in-house R&D capabilities</b> <b>and can develop their own hardware.</b> <b>For example, I believe many OEMs</b> <b>will develop their own robots,</b> <b>because they have a better grasp of mass production,</b>
<b>Quality control</b> <b>Hardware stability</b> <b>And cost management</b> <b>They likely have a deeper understanding of these aspects</b> <b>They can also leverage the capabilities of large model brains</b> <b>to directly implement solutions for their own scenarios</b> <b>So I think moving forward, it will be about the connection among these four</b> <b>Going back to the point Li Guangmi made earlier</b> <b>You said we can't say data equals the model</b> <b>So from a long-term perspective, what do you think does equal the model?</b>
<b>I believe we have to return to first principles</b> <b>and look at how humans actually learn</b> <b>Then I think it might be the capacity for systematic learning</b> <b>I believe that at the core there should be a model</b> <b>So we can't say knowledge equals the model</b> <b>Exactly</b> <b>In other words</b> <b>I think we can't say knowledge equals the model</b> <b>It should be a continuously improving system-level capability</b>
<b>Because actually, every time the system-level capability improves</b> <b>It might also bring new demands for data</b> <b>Let me give an example</b> <b>A child’s learning might be fine with just looking at picture books</b> <b>But for someone like Musk or Buffett, their learning</b> <b>Probably involves more targeted, advanced knowledge</b> <b>As well as these signals</b> <b>It’s like having a personal coach</b> <b>Of course, I believe that personal training</b> <b>should not be human-centered</b> <b>it should be system-centered</b>
<b>Only in this way can sufficiently scalable personal training be provided</b> <b>truly scalable hands-on teaching</b> <b>On our podcast, we've actually had many guests from the large language model field</b> <b>and also many guests from the robotics field</b> <b>What do you think are the differences in the data challenges these two fields face today?</b>
<b>What are the differences?</b>
<b>At what stages are they respectively?</b>
<b>I think these two are quite different</b> <b>From the perspective of large language models,</b> <b>their pre-training data is sufficient</b> <b>because essentially it’s data from the entire internet</b> <b>Exactly</b> <b>So there’s a lot of it</b> <b>What they actually face more is the issue of post-training</b> <b>and evaluation</b> <b>And post-training and evaluation, essentially,</b> <b>are somewhat like hands-on teaching</b> <b>That requires finding teachers of increasingly higher levels</b>
<b>to provide them with this kind of mentorship and guidance</b> <b>Actually, these teachers often come from different industries</b> <b>right?</b>
<b>For example, they might be the best engineers</b> <b>or gold medalists in mathematics</b> <b>or top lawyers</b> <b>or the best doctors</b> <b>And increasingly, this mentorship transforms into</b> <b>posing questions</b> <b>For instance, an ordinary teacher might</b> <b>educate students through their own demonstrations</b> <b>while a better teacher might ask progressively harder questions</b> <b>to motivate the student to seek out the answers themselves</b> <b>So I believe, essentially speaking,</b>
<b>this is what large language models are about</b> <b>They face the challenge of data by</b> <b>finding better and better people</b> <b>and based on them, posing increasingly difficult questions</b> <b>And based on these questions,</b> <b>And these signals</b> <b>These additional experiences being imparted</b> <b>Continuously enhancing the model's capabilities</b> <b>As for embodiment</b> <b>I think its current challenges lie at two ends</b> <b>The first end is in pre-training</b>
<b>Right now, there is actually a lack of sufficient pre-training data</b> <b>I believe this pre-training data needs to come from the physical world</b> <b>Whether it’s the real physical world</b> <b>Or a simulated physical world</b> <b>And the assets it needs to interact with</b> <b>For example, computers</b> <b>Or things like the coffee cup here with us</b> <b>At the same time, it requires the imparting of these experiences</b>
<b>For instance, from a human or a robot</b> <b>on how to operate different objects</b> <b>how to interact with this physical world</b> <b>along with the corresponding language definitions</b> <b>and the relevant evaluation criteria</b> <b>to know where this knowledge is learned well</b>
<b>and where it falls short</b> <b>Right?</b>
<b>And then provide the corresponding learning signals</b> <b>This, I recognize, is a crucial pre-training requirement</b> <b>Actually, I think right now</b> <b>the entire industry still lacks large-scale pre-training data</b> <b>to support embodied intelligence</b> <b>in achieving the foundational capability</b> <b>of a base model after pre-training</b> <b>I think this is an extremely critical gap</b> <b>Secondly, I think it's about evaluation capability</b>
<b>This is something many people might not have considered</b> <b>For example, autonomous driving or large language models</b> <b>Why do their models improve so rapidly?</b>
<b>Autonomous driving, essentially, benefits because its evaluation is free</b> <b>How do I explain this?</b>
<b>Because autonomous driving evaluation is done through</b> <b>what’s called Shadow Mode, deployed on the edge side</b> <b>In Chinese, it’s called 影子模式 (Shadow Mode)</b> <b>Basically, it means deploying the algorithm on the vehicle itself</b> <b>running the algorithm silently in the background</b> <b>It doesn’t actually execute any real control</b> <b>but its output signals are compared</b> <b>against the actual inputs of the human driver behind the wheel</b> <b>When a difference is encountered</b>
<b>Take this back as feedback</b> <b>Because, for example, if this person is a teacher</b> <b>So when there is a difference between the student and the teacher</b> <b>It's very likely that the student made a mistake at this point</b> <b>So this is an extremely cheap, low-cost, or even free signal to acquire</b> <b>To help autonomous driving evaluate their respective situations</b> <b>And this signal also includes the corresponding demonstration</b> <b>Right</b> <b>Including the relevant experience as well</b>
<b>It shows the model what the teacher did at the point where it made the mistake</b> <b>how the human handled it</b>
<b>letting it learn more through imitation</b> <b>to improve itself</b> <b>I believe large language models actually have this kind of shadow mode as well</b> <b>This shadow mode comes into play once these large language models have already been deployed.</b>
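The shadow-mode comparison described here can be sketched in a few lines. This is a toy illustration only: the one-dimensional action, the fixed threshold, and the names (`shadow_mode_step`, `shadow_policy`) are all assumptions for the sake of the example, not any real vendor's API.

```python
# Toy sketch of "shadow mode": the candidate policy runs alongside the
# human driver without actuating; disagreements are logged as a free
# evaluation signal plus a demonstration of what the "teacher" did.

def shadow_mode_step(shadow_policy, sensor_frame, human_action, threshold=0.5):
    """Run the candidate policy without actuating; compare it to the human."""
    proposed = shadow_policy(sensor_frame)     # computed on-vehicle, never executed
    divergence = abs(proposed - human_action)  # how far the student is from the teacher
    if divergence > threshold:
        # A disagreement is both an evaluation signal and training data:
        # the human action doubles as the demonstration.
        return {
            "frame": sensor_frame,
            "proposed": proposed,
            "demonstration": human_action,
            "divergence": divergence,
        }
    return None  # agreement: nothing worth uploading
```

Because the vehicle is being driven anyway, every logged disagreement arrives with its correction attached, which is why the speaker calls this evaluation essentially free.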
<b>through interaction with users</b> <b>Right</b> <b>Actually, for example, when we use GPT or different large language models</b> <b>We also give it different feedback</b> <b>This feedback actually serves as a free shadow mode</b> <b>To help them understand what’s good and what’s not</b> <b>Give them some examples</b> <b>Helping them improve</b> <b>This is also a free evaluation</b> <b>As for embodiment,</b> <b>We currently do not have the capability to conduct such large-scale evaluations</b>
<b>Then I believe this must be provided based on simulation</b> <b>Embodied robotics</b> <b>doesn't have a fleet in the real world</b> <b>to provide a foundation for shadow mode</b> <b>The only thing it can do is scale evaluation based on simulation</b> <b>obtain more signals</b> <b>and then feed these signals back to the embodied brain</b> <b>pushing it to continuously improve</b> <b>So I think evaluation is actually</b>
<b>the core trend of embodied models regarding data</b> <b>So the data collection problem for robots is probably structurally harder than for large language models</b> <b>Exactly</b> <b>I think it’s much harder</b> <b>Possibly by several orders of magnitude</b> <b>If having enough data is a perfect score of 100</b> <b>How much do you think the data for large language models scores today?</b>
<b>How much do you think the data for robots scores?</b>
<b>I actually think it’s hard to define that perfect 100</b> <b>Let me give an example</b> <b>Human learning is actually endless</b> <b>So from a human perspective</b> <b>You can actually see that the more capable a person is</b> <b>The stronger their learning ability might be</b> <b>They actually engage with more data every day</b> <b>Not less</b> <b>Exactly</b> <b>Of course, what I mean is</b> <b>If we look at it from a conceptual standpoint</b>
<b>I think large language models may have already reached their peak in pre-training</b> <b>I believe they are probably focusing more on fine-tuning and evaluation</b> <b>Actually, I think in fine-tuning and evaluation</b> <b>I believe large language models still have a long way to go</b> <b>I estimate that large language models are currently at about 60%</b> <b>But to truly improve beyond that</b> <b>there is still a lot of room</b> <b>specifically in the fine-tuning and evaluation stages</b> <b>From the perspective of embodied AI</b>
<b>assuming the data returned from one million robots is a starting point</b> <b>that starting point might not even be 100%, but around 60%</b> <b>I think there aren’t even 10,000 robots currently</b> <b>whether in the real world, simulation, or human-generated data</b> <b>that can provide this kind of data</b> <b>right?</b>
<b>So I actually think from this perspective</b> <b>it might be less than 0.6%</b> <b>Less than 0.6%</b> <b>Exactly</b> <b>This actually gives everyone an intuitive sense</b> <b>Yes</b> <b>But I think the data issue with large language models today</b> <b>Is that as they evolve from Chatbots to Agents</b> <b>They actually face even more data scarcity on the Agent side</b> <b>Because AI has never witnessed real human work</b> <b>So it needs to find a large number of human experts</b>
<b>To collect data in real, authentic work environments</b> <b>Do you think the data problem Agents face today</b> <b>Is somewhat similar to Robotics?</b>
<b>Yes</b> <b>I think that’s a very good point</b> <b>I actually believe robots are Agents in the physical world</b> <b>While these large language model Agents</b> <b>Are actually Agents in the digital world</b> <b>And I think the problems they face</b> <b>Are quite similar: first, they need an environment</b> <b>Secondly, they need corresponding experience transfer</b> <b>Additionally, they also require proper evaluation</b> <b>Or signals for such evaluation</b> <b>to help them improve</b> <b>we can see</b>
<b>that for large language model Agents</b> <b>there is actually a very key</b> <b>data product for them</b> <b>called RLinf</b> <b>which is a service for reinforcement learning environments</b> <b>this environment is essentially a virtual environment</b> <b>but it’s not, for example, a physical simulation environment</b> <b>it’s more of a digital world environment</b> <b>such as a virtual Didi website</b> <b>a virtual JD.com website</b>
<b>a virtual shopping website</b> <b>a virtual programming website</b> <b>a virtual programming environment</b> <b>to help them continuously</b> <b>based on some predefined success metrics</b> <b>these definitions and these test questions</b> <b>continuously use reinforcement learning</b> <b>to fine-tune themselves</b> <b>Constantly experimenting</b> <b>and continuously improving oneself</b> <b>This is actually what I think Agents</b>
<b>currently need most in this digital world</b> <b>is primarily data-driven products</b> <b>As for embodiment,</b> <b>as I just mentioned,</b> <b>we haven’t truly reached the Agent stage yet</b> <b>right?</b>
<b>We are still in a pre-training</b> <b>and evaluation phase</b> <b>These two areas present the biggest challenges</b> <b>First, there isn’t enough pre-training to enable</b> <b>the model to achieve a basic level of capability</b> <b>Second, there isn’t a sufficiently robust large-scale evaluation</b> <b>to help these large model developers continuously measure</b> <b>and improve their foundational abilities</b> <b>Let me add one more detail here</b> <b>which is why the BEHAVIOR Challenge</b>
<b>specifically Fei-Fei Li’s BEHAVIOR Challenge, is so important</b> <b>Because other academic-level benchmarks</b> <b>actually, our clients</b> <b>the top large model providers</b> <b>have already completely surpassed their benchmarks</b> <b>the embodied benchmarks are actually relatively easier</b> <b>and have all been surpassed</b> <b>the truly challenging one is BEHAVIOR</b> <b>for BEHAVIOR’s 100 tasks</b> <b>the highest success rate currently is around 26%</b> <b>so there’s still a long way to go</b>
<b>of course, this is more of an academic-level benchmark</b> <b>whereas for the industry, for example</b> <b>they need a much larger scale</b> <b>and higher-quality BEHAVIOR dataset</b> <b>to help them challenge the fundamental capabilities of their models</b> <b>and certainly, based on these two points</b> <b>fine-tuning becomes very critical</b> <b>once the foundational capabilities from pre-training</b> <b>reach a certain standard</b> <b>fine-tuning through reinforcement learning is applied</b> <b>It becomes sufficiently important</b>
<b>So we also see some of our clients</b> <b>working with us on simulation-based</b> <b>fine-tuning through reinforcement learning after training</b> <b>Essentially, this process is very similar to</b> <b>the agents of large language models in the digital world</b> <b>Large language model agents operate in a virtual web environment</b> <b>right, a virtual programming environment</b> <b>constantly trial-and-error to perform fine-tuning</b> <b>Whereas the physical world agent, essentially,</b>
<b>in a simulated environment,</b> <b>based on predefined success metrics,</b> <b>standards, and large-scale scenarios,</b> <b>continuously tries and errors,</b> <b>fine-tuning itself</b> <b>In other words, this process</b> <b>I think, compared to pre-training and evaluation,</b> <b>is still probably a suboptimal issue at this stage</b> <b>Actually, just now we did a mapping of the entire data industry</b> <b>This is a horizontal perspective</b> <b>I also want to talk about the vertical aspect</b> <b>The data industry within the field of artificial intelligence</b>
<b>Is it a branch?</b>
<b>Within this ecosystem</b> <b>What kind of position does it roughly occupy?</b>
<b>Let's also discuss the past and present of the data industry</b> <b>I believe the development of the data industry</b> <b>Is actually closely related to each evolution in model learning paradigms</b> <b>There is a strong correlation</b> <b>For example, I can define it as</b> <b>In the earliest days, the data industry</b> <b>Probably started with Fei-Fei Li's ImageNet</b> <b>It served both as a training set</b> <b>And as an evaluation set</b> <b>It primarily served machine vision</b>
<b>Providing photos along with their corresponding ground truth annotations</b> <b>It was more of a static dataset</b> <b>Always providing the correct answers</b> <b>So at this point</b> <b>I think the data industry was more like</b> <b>a rote-learning style education industry</b> <b>Then moving forward, it’s autonomous driving</b> <b>I think Scale truly pioneered an industrial-grade data industry</b> <b>Starting from the earliest static data</b>
<b>That earlier mode was hard to control in terms of timing</b> <b>For example, ImageNet</b> <b>indeed took several years to build</b> <b>whereas Scale could actually industrialize it at a factory level</b> <b>Right?</b>
<b>Large-scale human operations managing quality</b> <b>Managing efficiency</b> <b>Managing delivery timelines</b> <b>To deliver this data</b> <b>I think this is more like a volume-driven education industry</b> <b>Then moving forward, I think it’s the data industry for large language models</b> <b>At this point, I think</b> <b>The core logic has changed</b> <b>In the earliest days, it might have been the user making a request</b> <b>And you deliver</b> <b>Right?</b>
<b>Then it was more of a factory-style approach</b> <b>but still relatively rough and broad in delivery</b> <b>it evolved into something more evaluation-driven</b> <b>which is about helping customers identify problems</b> <b>and then stimulating new demands</b> <b>followed by targeted delivery</b> <b>So at this point, actually</b> <b>for example, in terms of how Scale defines itself</b> <b>it might have started calling itself a Data Foundry</b> <b>which is somewhat similar to the model of</b> <b>TSMC’s semiconductor fabs</b>
<b>essentially still a factory</b> <b>but with more processes</b> <b>more standards</b> <b>more know-how</b> <b>more procedures</b> <b>these are its secret sauce</b> <b>right?</b>
<b>But I believe that</b> <b>the development moving forward will actually be quite different</b> <b>Why is that?</b>
<b>Because I believe</b> <b>at this moment in time,</b> <b>when evaluating large language models with RLHF,</b> <b>it’s still a human-centered process.</b>
<b>For example, Mercor and Surge operate the same way.</b>
<b>It relies on increasingly skilled humans to provide feedback,</b> <b>offering more experience and guidance.</b>
<b>From the perspective of embodied AI,</b> <b>the amount of data required is far greater than what large language models need.</b>
<b>At this point,</b> <b>it’s hard for me to imagine</b> <b>a Mercor or Surge at a thousandfold scale,</b> <b>which might already have hundreds of thousands of users worldwide,</b> <b>possibly even a million people,</b> <b>providing data at a scale a thousand times larger.</b>
<b>I think this is very difficult to scale</b> <b>and also very inefficient.</b>
<b>So I believe at this point,</b> <b>there will definitely be a shift.</b>
<b>From being centered around people,</b>
<b>it shifts to being centered around the system.</b>
<b>This system is an engine.</b>
<b>It takes different individuals, who may sit at the edge,</b>
<b>and relies on simulation</b>
<b>and on engineering capability</b>
<b>to amplify these people's signals,</b> <b>their experiences,</b> <b>enabling them to effectively support</b> <b>the evolution of embodied models.</b>
<b>And I believe this must be driven by evaluation,</b> <b>not by, for example,</b> <b>training-driven approaches.</b>
<b>So I think this is where the data industry could go,</b> <b>and I believe it will evolve there step by step.</b>
<b>Earlier, we talked about the people who label data,</b> <b>or those who collect data.</b>
<b>Their hourly wages have increased significantly.</b>
<b>Has the number of people decreased?</b>
<b>The number actually hasn't decreased.</b>
<b>There's actually something very interesting about this</b> <b>In fact, I thought a lot about this issue early on</b> <b>That is, whether one day</b> <b>either the algorithms</b> <b>will have a greatly improved learning efficiency</b> <b>right?</b>
<b>Or the model's</b> <b>capabilities will become so advanced</b> <b>that it will require less and less top-tier human cognition</b> <b>But up until now, that hasn't happened</b> <b>I think this point is very similar to when</b> <b>DeepSeek first came out</b> <b>right?</b>
<b>Then people talked about Test-time Scaling</b> <b>thinking that once it was introduced, pre-training</b> <b>or the overall demand for NVIDIA GPUs</b> <b>would very likely drop drastically</b> <b>But actually, what everyone found was</b> <b>after Test-time Scaling came out</b> <b>it actually stimulated even more demand for AI applications</b> <b>demand for AI Agents</b> <b>and, on the contrary, increased the demand for NVIDIA cards</b> <b>I think that’s very likely</b>
<b>I feel there’s one intuitive thing</b> <b>That is, the more capable a person is</b> <b>The more they love to learn</b> <b>In fact, the amount of reading they do daily doesn’t decrease, it increases</b> <b>I think it’s very possible that this will be the case going forward</b> <b>Of course, it will increase up to a certain point</b> <b>Let me give an example</b> <b>Maybe the AI model’s capabilities become too strong</b> <b>And by the end</b>
<b>It might have reached a Nobel Prize level in this world</b> <b>At that point, there won’t be many people left who can teach it</b> <b>And at that time, I believe what it needs to do is</b> <b>Continuous self-improvement</b> <b>Just like humans</b> <b>AI training AI</b> <b>Exactly</b> <b>I think it will actually be very much like a human</b> <b>When people are young,</b> <b>they probably look at a lot of picture books,</b> <b>and have teachers guiding them by example and instruction.</b>
<b>Later on,</b> <b>it’s more about continuous self-improvement.</b>
<b>Or I think many people,</b> <b>they don’t compare themselves to others,</b> <b>they compare themselves to who they were before.</b>
<b>How much better can I be today than I was yesterday?</b>
<b>It’s the same principle.</b>
<b>I believe AI will also reach this stage.</b>
<b>At this stage, what it truly needs,</b> <b>I think, is an environment,</b> <b>a standard of success,</b> <b>a constantly updated standard of success,</b> <b>right?</b>
<b>Then, based on its own experience,</b> <b>and leveraging reinforcement learning to continuously improve itself,</b> <b>I think it will reach this stage.</b>
<b>And this stage, in fact, is also where I believe</b> <b>simulation and synthetic data</b> <b>play a crucial role.</b>
<b>Because at this stage, you always need a physical environment,</b> <b>you always need evaluation metrics.</b>
<b>I think this will probably become</b> <b>the most essential requirement at that time.</b>
<b>What’s needed is like a school teacher and exams.</b> <b>This is the current phase.</b>
<b>The next phase might be self-learning.</b>
<b>Exactly.</b>
<b>Yes.</b>
<b>It will always require an environment,</b> <b>a context,</b> <b>an environment,</b> <b>and corresponding definitions of success.</b>
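The loop described here, an environment, a definition of success, and reinforcement learning by constant trial and error, can be sketched as a toy bandit-style example. Everything below (the `simulated_task` environment, the hidden target action, the epsilon-greedy update) is an illustrative placeholder, not any real robotics or agent framework.

```python
import random

def simulated_task(action):
    """Stand-in simulated environment with a predefined success metric."""
    target = 3                                # the hidden "correct" action
    return 1.0 if action == target else 0.0   # success signal (reward)

def train(actions=range(5), episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: improve by trying, scoring, and updating."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in actions}         # estimated success rate per action
    count = {a: 0 for a in actions}
    for _ in range(episodes):
        if rng.random() < epsilon:            # explore: trial and error
            action = rng.choice(list(actions))
        else:                                 # exploit: current best guess
            action = max(value, key=value.get)
        reward = simulated_task(action)       # scoring is cheap in simulation
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]  # running mean
    return max(value, key=value.get)          # best action found
```

The key property, as the guest argues, is that the scoring step against the success definition is nearly free once the environment is simulated, so the loop can be repeated at scale.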
<b>This data industry</b> <b>has given rise to some key people and companies.</b>
<b>First of all, I think Fei-Fei Li truly defined</b> <b>the concept of AI data.</b>
<b>I believe her contribution is immense.</b>
<b>After this, I believe true industrialization</b> <b>I think Scale AI</b> <b>truly led</b> <b>the wave of industrialized AI data</b> <b>and I think it actually led twice</b> <b>The first time was in autonomous driving</b> <b>during its startup phase</b> <b>when the entire industry's scalable AI data demand</b> <b>was really in autonomous driving</b> <b>It turned this into a streamlined production line factory</b>
<b>to reliably deliver labeled data for autonomous driving</b> <b>and later, possibly around 2021 and 2022</b> <b>entered the era of GPT-3</b> <b>and RLHF</b> <b>thus becoming one of the earliest to serve data for large models</b> <b>especially data for post-training and evaluation</b> <b>evaluation-driven data</b> <b>I think this industry is extremely critical</b> <b>Of course, this evaluation-driven data business</b>
<b>Later on, it evolved into things like Surge AI</b> <b>Like Mercor</b> <b>I think these are all companies of the same kind.</b>
<b>You just mentioned that simulation is very important for Robotics</b> <b>What role do you think simulation ultimately plays in this industry?</b>
<b>Do you think it’s an accelerator?</b>
<b>A tool, or something more fundamental?</b>
<b>Yes, I think this is a very good question</b> <b>Actually, this is also</b> <b>I feel that since I started my career</b> <b>Ever since we started doing simulations</b> <b>It's something I've been thinking about constantly</b> <b>Right</b> <b>Then I can say with certainty</b> <b>I believe simulation is crucial for robotics</b> <b>It is a prerequisite</b> <b>Without simulation, this definitely couldn't be done</b> <b>My starting point, I think, lies more in a few areas</b> <b>First of all,</b>
<b>We just mentioned the concept of a data closed loop</b> <b>I think the data loop for robots and autonomous driving</b> <b>will be completely different.</b>
<b>Because robots don’t have as much</b> <b>of this truly physical machine</b> <b>deployed on the edge,</b> <b>and based on human demonstrations,</b> <b>to collect large-scale data back.</b>
<b>It must rely on simulation</b> <b>to gather sufficiently large-scale data.</b>
<b>It’s an essential requirement.</b>
<b>So it is a must-have.</b>
<b>Secondly, I think another</b> <b>extremely essential element</b> <b>is, for example, when it comes to data,</b> <b>I think one is simulation,</b> <b>and the other is human data.</b>
<b>I believe these two will be</b> <b>the main sources of</b> <b>embodiment-agnostic data.</b>
<b>And regarding evaluation,</b> <b>I actually can’t think of</b> <b>any source</b> <b>besides simulation</b> <b>And what I’m proposing is large-scale evaluation,</b> <b>not small-scale</b> <b>For example, small-scale evaluations</b> <b>I can do at the lab level</b> <b>Or in a certain scenario</b> <b>I can build some prototypes</b> <b>10 units, 20 units</b> <b>to run some algorithm inference</b> <b>I think that’s acceptable for evaluation</b> <b>But I can’t, for example,</b> <b>if I want to deploy in the home scenario</b>
<b>run simultaneously in a thousand households</b> <b>possibly even more families</b> <b>evaluating, say,</b> <b>tens of thousands of different tasks</b> <b>retrieving the signal anytime</b> <b>and performing repeated measurements</b> <b>For example, my algorithm might</b> <b>evolve every day</b> <b>so I can measure multiple times daily</b> <b>to truly</b> <b>allow myself to more accurately understand</b> <b>the evolution of each version of the algorithm</b> <b>I believe the only solution</b>
<b>is through simulation</b> <b>Exactly</b> <b>Of course, another</b> <b>observation I find very interesting</b> <b>is about the clients we serve</b> <b>Actually, in the very beginning</b> <b>our clients were all</b> <b>strong believers in simulation</b> <b>They truly believed in synthetic data</b> <b>they believed in simulation</b> <b>and they used our synthetic data to train their AI</b> <b>At that time, some of the top-tier</b> <b>Frontier Labs</b> <b>The top-tier large model teams</b> <b>truly belonged to the real-machine camp</b>
<b>They absolutely refuse to try any kind of simulation</b> <b>But actually, if we look at</b> <b>the past three months or so</b> <b>the past three months</b> <b>basically, they have all become our clients</b> <b>to conduct large-scale evaluations</b> <b>Do you approach them, or do they approach you?</b>
<b>They come to us</b> <b>Yes, so this is a</b> <b>Who are they?</b>
<b>That’s not convenient to say</b> <b>Right</b> <b>But I think this is a very interesting signal</b> <b>To be honest</b> <b>At the very beginning</b> <b>I proactively sent them a lot of emails</b> <b>And they said</b> <b>I know you’re the best at simulation</b> <b>If I were to do simulation</b> <b>I would definitely come to you</b> <b>But maybe I’m not at that point yet</b> <b>However, in the past three months</b> <b>they have all come to us</b> <b>What common problem are they facing?</b>
<b>They can’t scale their evaluations</b> <b>This is their core issue</b> <b>They believe their algorithms are already good enough</b> <b>Previously, they relied on real device data</b> <b>Or some form of simulation</b> <b>Using evaluation datasets</b> <b>These academic-level benchmarks to test</b> <b>But in the real industry</b> <b>They don’t hold much significance</b> <b>Because they’re too simple</b> <b>They’re not scalable enough</b> <b>For example, we have some teams working on deploying in home scenarios</b>
<b>These brain teams</b> <b>They might</b> <b>fold clothes</b> <b>They might do housework</b> <b>They’re already doing very well</b> <b>What they hope for is</b> <b>to have a thousand different home scenarios</b> <b>where they can evaluate themselves at any time</b> <b>including these</b> <b>The key is not the scenarios</b> <b>but these tasks</b> <b>and these evaluation criteria</b> <b>that can help them</b> <b>to assess themselves anytime</b> <b>This is something they</b> <b>cannot achieve through real devices</b> <b>It sounds like</b> <b>those who want to build the brain</b>
<b>were probably the first to embrace</b> <b>simulation right?</b>
<b>Exactly</b> <b>Then there were</b> <b>companies that originally came out of a specific scenario</b> <b>maybe something like folding clothes</b> <b>or</b> <b>putting a robot in supermarkets to do a certain task</b> <b>these companies were slower to embrace simulation</b> <b>but when they need to generalize</b> <b>do they need simulation then?</b>
<b>I think there are two types of simulation</b> <b>for example</b> <b>the more traditional kind</b> <b>which supports RL</b> <b>right?</b>
<b>this kind of simulation</b> <b>then for example</b> <b>full-body</b> <b>like Full-body Control</b> <b>or Locomotion</b> <b>which is about how to make a</b> <b>humanoid robot walk more efficiently</b> <b>stand more steadily</b> <b>and perform</b> <b>full-body control tasks</b> <b>On this front,</b> <b>these robotics companies tend to fully embrace simulation</b> <b>They were actually among the earliest adopters of simulation</b> <b>However, the demand for simulation in this area is relatively small</b> <b>They might run RL</b> <b>on a single local machine</b>
<b>and accomplish it</b> <b>running reinforcement learning locally can accomplish this</b> <b>It’s not a large-scale demand</b> <b>But I believe the large-scale demand, as you mentioned,</b> <b>is more about these large model companies</b> <b>the brain companies</b> <b>They need to generalize</b> <b>They need to scale their data</b> <b>or scale their evaluations</b> <b>And on these two points</b> <b>they will definitely get stuck on at least one</b> <b>So they will definitely use simulation</b> <b>That’s why that group was the earliest to embrace it</b> <b>Exactly</b>
<b>You just mentioned changes starting to appear in the past three months</b> <b>It should be companies building robots for vertical scenarios</b> <b>Right?</b>
<b>Right?</b> <b>Not exactly</b> <b>I mean the large model teams</b> <b>Actually, they can also be divided into</b> <b>Teams that are staunch simulation advocates from the start</b> <b>And some who were true hardware believers from the very beginning</b> <b>I trust the data from real machines</b> <b>But maybe at a certain stage</b> <b>They realize this approach just won’t work</b> <b>So they have to use simulation</b> <b>Therefore, I believe that</b> <b>Our biggest growth in the past three months</b>
<b>I think first of all, it’s basically almost all the large model teams</b> <b>And their world model teams as well</b> <b>This might just be one company</b> <b>It’s possible that more than one team is collaborating with us</b> <b>There could be a VLA team</b> <b>And a world model team</b> <b>All working with us</b> <b>Because, in a way, there might actually be many VLA teams</b> <b>They might be building on the foundation of the world model</b> <b>Right?</b>
<b>So at this point</b> <b>Maybe the world model team is using us</b> <b>And possibly using us more effectively</b> <b>Exactly</b> <b>Then maybe the VLA team uses our evaluations</b> <b>And the world model team uses our data</b> <b>This is a phenomenon we might observe</b> <b>Quite common</b> <b>Do these three teams have different data needs?</b>
<b>Not exactly the same</b> <b>For example, the world model team</b> <b>Doesn’t necessarily require data with very strong action signals</b> <b>Exactly</b> <b>Then it must have better physical constraints</b> <b>Right?</b>
<b>This kind of grounding</b> <b>And it needs</b> <b>something that helps it better predict</b> <b>what will happen next</b> <b>in the physical world</b> <b>But it doesn’t necessarily need a first-person perspective</b> <b>the robot’s own viewpoint</b> <b>with data of physical interaction inside</b> <b>Whereas VLA might be more of an action-oriented agent</b> <b>It must have this kind of action data</b> <b>That could be from its own embodiment</b> <b>or from other embodiments</b> <b>cross-embodiment</b>
<b>Even data from human actions</b> <b>So I think there will be some differences here</b> <b>But overall, from an evaluation perspective</b> <b>They probably both really need simulation</b> <b>Because they need to operate in these physically realistic environments</b> <b>to either verify that their prediction capabilities are accurate enough</b> <b>or that their action capabilities</b> <b>can accomplish these different tasks</b> <b>You know, in China there are also many companies focused on</b>
<b>robotic brains, whether they are large corporations</b> <b>or startups</b> <b>From my conversations with them</b> <b>my intuitive sense is that the simulation camp is smaller than the real-machine camp</b> <b>because the common reasoning they give is that</b> <b>real-machine data generalizes well</b> <b>while simulation data does not generalize well</b> <b>Why do you think this phenomenon occurs?</b>
<b>Why does it seem that there are fewer simulation-focused teams among Chinese robotics groups?</b>
<b>I think there are a few reasons.</b> <b>First, fundamentally these companies</b> <b>are still robotics companies.</b> <b>If we look at their business models,</b> <b>at the core they still need to sell hardware.</b> <b>So with a simulation-based approach,</b> <b>I think it would be very difficult to</b> <b>convince their customers to buy the hardware.</b> <b>Why?</b>
<b>Because many domestic</b> <b>real-machine-based business models, I think,</b> <b>still center on selling a data acquisition center,</b> <b>right?</b>
<b>Customers buy the robots to collect data</b> <b>and then continuously improve.</b> <b>So they need to believe in real-machine data in order to sell the hardware.</b> <b>Exactly.</b> <b>Otherwise, essentially, it's a case of</b> <b>"where you sit determines what you think."</b> <b>I think they need to genuinely advocate the real-machine approach</b>
<b>in order to make these business models based on real-machine data acquisition actually work.</b> <b>Of course, I believe real-device data collection is definitely necessary;</b> <b>I don't deny its importance.</b> <b>I think the current volume is needed,</b> <b>it could grow tenfold,</b> <b>and that volume might be essential too.</b>
<b>But the key is what stage it grows to.</b> <b>In the data pyramid,</b> <b>the smallest slice should actually be real-device data,</b> <b>the data from the operational robot itself.</b> <b>Real-device data.</b> <b>Exactly.</b> <b>Its cost is the highest,</b> <b>but the most critical point is that it's the hardest to scale.</b> <b>It's not even about the cost:</b> <b>for example, how do you enter different scenarios</b> <b>and scale up quickly?</b>
<b>That is a very difficult challenge:</b> <b>how do you switch to new scenarios?</b> <b>If you visit most real-device data collection centers today,</b> <b>you'll see they are also using a kind of simulation.</b> <b>How so?</b> <b>They are doing real-world simulation:</b> <b>holding a fake banana, a fake apple,</b> <b>not a real banana or a real apple.</b> <b>The scene might change very little.</b>
<b>It might all be at the desktop level,</b> <b>or built in IKEA-style setups.</b> <b>It's hard to scale up, the way simulation can,</b> <b>to broader, more variable,</b> <b>physically realistic scenes.</b> <b>Exactly.</b> <b>So I think that is</b> <b>a core difference I see.</b> <b>Also, from our perspective,</b> <b>what it means is that</b> <b>the real players are the ones actually building large pre-trained models.</b>
<b>Actually, I also listened to that episode with Tan Jie,</b> <b>and I tend to agree with his perspective:</b> <b>it might not be very reasonable</b> <b>to build a purely embodied large model;</b> <b>it has to be based on a foundational platform,</b> <b>right?</b>
<b>At this point, I believe</b> <b>it should be a large model company</b> <b>that leverages its</b> <b>foundational capabilities and then uses more data,</b> <b>first for pre-training and then fine-tuning, to develop better VLAs.</b> <b>From that perspective,</b> <b>there probably aren't many robot companies</b> <b>truly doing this,</b> <b>truly building a large pre-trained model,</b>
<b>so the amount of data they need might not be that much.</b> <b>A point Tan Jie made left a deep impression on me,</b> <b>because I also told him</b> <b>that domestic real-device advocates often say</b> <b>real-device data generalizes better.</b> <b>He said simulation data raises a Sim-to-real problem,</b> <b>not a generalization problem;</b> <b>the generalization issue should be addressed by generating</b> <b>a massive amount of simulation data,</b> <b>right?</b>
<b>do you agree with his point?</b>
<b>I do agree</b> <b>speaking of which, let's define simulation</b> <b>because the definition of simulation is currently vague</b> <b>it used to mainly mean physical simulation</b> <b>now some also consider video generation</b> <b>as simulation</b> <b>how do you define simulation?</b>
<b>I actually prefer a stricter definition.</b> <b>I think simulation</b> <b>should mean operating within</b> <b>an environment that is physically accurate enough,</b> <b>that is reproducible,</b> <b>and in which you can change actions</b> <b>and observe the resulting outcomes.</b> <b>I believe this is what truly defines a simulation.</b> <b>Let me explain, of course.</b> <b>By physically accurate, I mean</b>
<b>the environment and the objects you interact with</b> <b>need to be sufficiently</b> <b>aligned with the physics of the real world.</b> <b>This alignment is not just about looking similar,</b> <b>not just geometric resemblance;</b> <b>factors like friction</b> <b>and many other physical parameters</b> <b>must be adequately aligned as well.</b> <b>That's the first point.</b> <b>Second, reproducibility:</b> <b>if I run the simulation a hundred times,</b> <b>I should get a sufficiently high consistency rate.</b>
<b>Not necessarily all one hundred;</b> <b>maybe ninety-five or ninety-nine of the runs</b> <b>give the same result.</b> <b>I think this is a very critical point.</b> <b>Another thing:</b> <b>in the same environment,</b> <b>starting from the same point, if I change my actions,</b> <b>I can see how the outcome changes.</b> <b>I think these points are all essential.</b> <b>Now let's look at video models.</b> <b>Regarding video models, I think</b>
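The reproducibility criterion described here (run the same actions a hundred times and expect, say, ninety-five or more matching results) can be sketched as a simple check. The `reset`/`step` interface and the toy environment below are illustrative assumptions for this sketch, not any particular simulator's API.

```python
class ToyEnv:
    """A deterministic toy 'simulator': state just accumulates actions."""
    def reset(self, seed=0):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        return self.state

def replay_consistency(env, actions, runs=100, tol=1e-6):
    """Fraction of runs whose final state matches the first run's final
    state within `tol` -- a crude reproducibility score."""
    reference = None
    matches = 0
    for _ in range(runs):
        state = env.reset(seed=0)  # same initial conditions every run
        for a in actions:
            state = env.step(a)
        if reference is None:
            reference = state
            matches += 1
        elif abs(state - reference) <= tol:
            matches += 1
    return matches / runs

score = replay_consistency(ToyEnv(), actions=[0.1, -0.2, 0.3])
print(score)  # 1.0 for a fully deterministic simulator
```

On a real, imperfectly deterministic simulator, a score around 0.95 to 0.99, as mentioned above, would indicate acceptable consistency.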
<b>They are more about predicting the next frame</b> <b>They can capture some transformations of the world, I believe</b> <b>Right?</b>
<b>But first, they might be very difficult to reproduce.</b> <b>If it's hard to reproduce,</b> <b>then it's very hard to conduct large-scale, reliable evaluations.</b> <b>Second, they lack actions;</b> <b>it's difficult to get sufficiently accurate actions,</b> <b>and that makes it hard for me</b> <b>to either run evaluations</b> <b>or generate data.</b> <b>Exactly.</b> <b>Third, when I change some</b> <b>initial conditions,</b>
<b>can they produce correspondingly different outcomes?</b> <b>That's also a very difficult challenge.</b> <b>So I think general video models</b> <b>cannot yet be called simulations.</b> <b>Of course, I believe world models have the potential</b> <b>to truly become a type of simulation.</b> <b>World models becoming a type of simulation.</b> <b>Exactly.</b> <b>So how should we understand this?</b> <b>I think at its core, a world model</b> <b>is actually a generative model.</b>
<b>So its advantage is that it can generate more broadly</b> <b>and relatively realistically.</b> <b>Not physically real the way a true simulation is, in my opinion,</b> <b>but relatively realistic</b> <b>predictions of the world.</b> <b>I even think that later, once robots are integrated,</b> <b>predicting the embodiment's next steps</b> <b>is feasible.</b>
<b>This is feasible for the foreseeable future.</b> <b>Right.</b> <b>But I don't believe simulation and world models</b> <b>are the same thing,</b> <b>or that one will replace the other.</b> <b>I believe the two have more of a symbiotic relationship.</b> <b>How should we understand this?</b> <b>For example, among the clients we serve in the future,</b> <b>a very large proportion</b> <b>might be world model customers.</b>
<b>World model clients rely on their models' predictive capabilities,</b> <b>which gradually improve,</b> <b>gradually enhancing physical grounding.</b> <b>They need better physical data to help them improve.</b> <b>Right.</b> <b>They need more realistic physics,</b> <b>and actions that are more human-like in behavior,</b> <b>to help them improve.</b> <b>So in this context,</b> <b>simulation actually assists them.</b>
<b>On the other hand, because a world model</b> <b>may have better generative capabilities,</b> <b>it can also support simulation data,</b> <b>helping simulation results generalize better;</b> <b>or, by building simulation on top of the world model,</b> <b>combining the two achieves better grounding</b> <b>and more accurate generative outputs.</b> <b>From our perspective,</b> <b>over the past few months,</b>
<b>we and our world model clients</b> <b>have increasingly formed a symbiotic relationship.</b> <b>A symbiotic relationship means</b> <b>they use our data,</b> <b>we use their models,</b> <b>and together we can scale this effort much further.</b> <b>Next, let's talk about the relationship</b> <b>between simulation and world models.</b> <b>It sounds like simulation is a means of creating a world model.</b> <b>I think it's actually hard to say which is a means to which.</b>
<b>I don't think simulation is a subset of world models,</b> <b>or world models a subset of simulation.</b> <b>The two should probably go together,</b> <b>achieving something bigger:</b> <b>providing better learning capabilities for intelligence.</b> <b>Of these three kinds of teams,</b> <b>world models, VLA, and LLM,</b> <b>which collaborates with you the most?</b> <b>I think world model and VLA teams collaborate with us more.</b> <b>That's because you work on robotics-related data, right?</b>
<b>Exactly.</b> <b>Because we are still primarily focused on the physical environment,</b> <b>the experience of taking action inside it,</b> <b>and the corresponding evaluation criteria.</b> <b>What we do relatively less of</b> <b>is the purely digital environment,</b> <b>which is basically just LLMs.</b> <b>Are world models and VLAs merging?</b>
<b>I think in the short term, they actually have a very symbiotic relationship</b> <b>I believe it's a mutually dependent relationship</b> <b>I think in the future, it's possible</b> <b>that one day the two might converge into one</b> <b>But essentially speaking</b> <b>I think for a long time</b> <b>they will remain mutually dependent</b> <b>Can we compare the robotics industry</b> <b>to how we used to think about autonomous driving?</b>
<b>Because in autonomous driving before</b> <b>the competition between Waymo and Tesla lasted a long time</b> <b>Today, these robotics brain companies</b> <b>seem to be following the Waymo path</b> <b>but it looks like Tesla has become the more mainstream approach</b> <b>Of course, Waymo is also doing very well</b> <b>What’s your take on this issue?</b>
<b>And who do you think are the Waymo and Tesla of the robotics field?</b>
<b>Why do you think these brain companies are more like Waymo?</b>
<b>Because they are light on the embodiment side.</b>
<b>They collect a lot of data.</b>
<b>I feel the robotics company is more like Tesla.</b>
<b>I understand what you mean.</b>
<b>What I might have observed is,</b> <b>let me just say first,</b> <b>I think this might be quite different from autonomous driving.</b>
<b>Meaning, robotics might not necessarily follow either Tesla or Waymo.</b>
<b>The reason I think this is,</b> <b>as I mentioned earlier, the underlying data logic:</b> <b>if the underlying data logic is a closed loop driven by the embodiment's own data,</b> <b>accounting for over 90% of the data volume,</b> <b>then I believe it will definitely follow either Tesla's or Waymo's logic.</b>
<b>Otherwise, it would be one or the other.</b>
<b>So I think</b> <b>they are working within a relatively more vertical scenario.</b>
<b>Exactly.</b>
<b>A relatively more vertical scenario, and their intelligence is relatively limited.</b>
<b>I think the intelligence in autonomous driving is still relatively limited.</b>
<b>It's more of an edge-side model</b> <b>Right?</b>
<b>An edge-side model,</b> <b>and its task is actually quite singular,</b> <b>isn't it?</b>
<b>Just to drive the car well.</b> <b>Exactly.</b> <b>For example, when it encounters a cup like this,</b> <b>its reaction is simply to avoid it.</b> <b>But in the robotics field,</b> <b>it needs to determine what material the cup is made of,</b> <b>how big the cup is,</b> <b>and then decide the strength of its grip.</b> <b>So the complexity is much higher.</b> <b>Exactly.</b> <b>Driving's scenarios are more singular;</b> <b>the only physics involved are between the car and the ground.</b>
<b>It doesn't want to collide with anything.</b> <b>Right.</b> <b>So I think its level of intelligence is somewhat lower.</b> <b>Exactly.</b> <b>Its intelligence level will be somewhat lower.</b> <b>Of course, I believe there are two ways to solve the autonomous driving problem.</b> <b>One way is not VLA</b> <b>but directly VA.</b> <b>Is VA the next generation after VLA?</b>
<b>I don't think so.</b> <b>VA is more about direct action output;</b> <b>Xpeng is currently taking this path.</b> <b>Right.</b> <b>I think it's mainly because edge computing power might not be that strong.</b> <b>Exactly.</b> <b>And the intelligence required is relatively limited.</b> <b>Once I have enough data,</b> <b>I can use imitation learning</b> <b>to compress the model to closely match driver behavior.</b> <b>That's sufficient.</b> <b>Right.</b>
<b>It's very likely that VA</b> <b>is the endgame here;</b> <b>that's one possibility.</b> <b>But there's another approach,</b> <b>which is to create a more general VLA</b> <b>and then have it drive.</b> <b>That will definitely be feasible in the future.</b> <b>Right?</b>
<b>So what I mean is, regarding autonomous driving,</b> <b>one point I haven't fully figured out yet</b> <b>is whether there could be two viable paths.</b> <b>One path is that, because its intelligence ceiling isn't that high,</b> <b>VA works.</b> <b>Right?</b>
<b>The other path is that a VLA I build can also drive,</b> <b>but that VLA might be able to do other things too.</b> <b>I think both paths could potentially succeed.</b> <b>Yes.</b> <b>With no more language,</b> <b>no language in the VLA,</b> <b>you presumably think its intelligence level will decrease?</b> <b>Its intelligence level will significantly drop.</b> <b>Of course, I'm discussing this from the perspective of intelligence.</b> <b>From the perspective of learning paradigms,</b>
<b>on the other hand, from the data perspective,</b> <b>autonomous driving essentially</b> <b>relies on imitation learning,</b> <b>primarily large-scale imitation learning with a small amount of reinforcement learning,</b> <b>to supply its intelligence.</b> <b>The data it needs is mostly embodiment-specific,</b> <b>data collected directly from the car as it drives.</b> <b>Whereas embodied AI</b> <b>will definitely follow a route of embodiment-independent data:</b>
<b>the volume of embodiment-specific data,</b> <b>the actual data from robots deployed in the field,</b> <b>will be relatively small.</b> <b>Under such circumstances,</b> <b>I think what will likely emerge</b> <b>is that a Tesla analogue probably won't exist.</b> <b>Because if it really followed the Tesla model,</b> <b>the brain might actually not be made by Tesla;</b> <b>it could be made by xAI.</b> <b>Right?</b>
<b>So what I mean is</b> <b>that in this case,</b> <b>it's actually two teams within one big ecosystem;</b> <b>they are essentially two different companies.</b> <b>Right?</b>
<b>So in the end there might be three models:</b> <b>the Waymo model;</b> <b>the current Tesla model,</b> <b>everything inside one company;</b> <b>and another model, like the Musk ecosystem,</b> <b>where one company builds the hardware</b> <b>and another builds the brain.</b> <b>Right?</b>
<b>So if we put this into the context of</b> <b>other companies,</b> <b>maybe DeepMind has built a brain,</b> <b>right?</b>
<b>and then they might implement this brain in a physical form.</b> <b>I think this is very likely the path.</b> <b>Besides Musk and Google,</b> <b>who else do you think can support this?</b>
<b>Everyone is doing it</b> <b>You mean autonomous driving and embodied intelligence, right?</b>
<b>No, brain and robots</b> <b>Oh</b> <b>brain and robots</b> <b>I think there are probably fewer in the US</b> <b>I think domestically</b> <b>I think Xiaomi is a possibility</b> <b>Hmm</b> <b>Yes</b> <b>But overall</b> <b>I think this is still a pretty difficult thing</b> <b>What about XPeng and Li Auto?</b>
<b>They are currently positioned as an intelligent driving vehicle company.</b>
<b>I believe this matter</b> <b>fundamentally comes down to how many GPUs you have.</b>
<b>Because essentially speaking,</b> <b>if you want to do this,</b> <b>it's somewhat similar to saying</b> <b>your premise is that you need a team and capability for a world model.</b>
<b>You might already have one of the best world models in the world,</b> <b>and then based on that,</b> <b>you simultaneously work on VLA.</b>
<b>In that case, I think the number of GPUs required could be quite high.</b>
<b>How many are needed?</b>
<b>For the customers we serve, GPU counts are usually in the tens of thousands.</b>
<b>At this level, this is how it's being done.</b>
<b>But I think among domestic players,</b> <b>these OEMs still have significant opportunities.</b>
<b>As for startups,</b> <b>I think it's very difficult for startups to build the brain.</b>
<b>I don't really think so.</b>
<b>From my perspective,</b> <b>I don’t think it’s very practical to build a unified brain.</b>
<b>Look at the intelligence level of autonomous driving.</b>
<b>You feel it’s not high enough.</b>
<b>Compared to that unified brain concept.</b>
<b>Is it possible that robots solve problems one vertical scenario at a time?</b>
<b>Focusing on a specific vertical scenario,</b> <b>collecting a lot of real-world data,</b> <b>and then training the system well for that scenario,</b> <b>just like autonomous driving today.</b>
<b>Could this be a faster path?</b>
<b>Is the unified brain too far off?</b>
<b>Yes.</b>
<b>I definitely think this path exists.</b>
<b>Actually, to me, this path looks more like Waymo’s approach.</b>
<b>This path resembles Waymo’s.</b>
<b>Yes, because I think it’s more about operating within a limited,</b> <b>non-generalized domain.</b>
<b>Right?</b>
<b>And doing one thing really well.</b>
<b>I still remember when I first joined Cruise,</b> <b>our focus was deploying autonomous driving</b> <b>in San Francisco;</b> <b>only after completing that would we consider a second city.</b> <b>So I think this approach is very similar to what Waymo and Cruise did back in the day:</b> <b>they spent a long time fully deploying in the first scenario,</b> <b>and after that they worked on generalizing it.</b> <b>Expanding, that is, adapting to more scenarios, can be quite challenging.</b>
<b>Yes, actually, I think Waymo is doing a great job now,</b> <b>but Tesla might be better in terms of scalability;</b> <b>they might perform much better when scaling up.</b> <b>Right.</b> <b>Exactly.</b> <b>So from my perspective,</b> <b>if you approach it this way,</b> <b>starting with a relatively narrow domain,</b> <b>first, that domain might be divided into</b> <b>one or two specific scenarios,</b> <b>and you focus on doing those well,</b>
<b>then gradually cover other specific scenarios within this domain;</b> <b>this will take a very long time.</b> <b>After that, you can move into other scenarios in parallel,</b> <b>but I think that might require a fundamental overhaul,</b> <b>because the entire model architecture and data setup might be completely different.</b> <b>Exactly.</b> <b>In that sense, it's somewhat similar to the early days of automation.</b> <b>I believe there will be successful cases in this area as well.</b> <b>For example, autonomous driving:</b>
<b>Actually, if you look at the domestic market, I think some are doing very well</b> <b>Take autonomous driving in mining, for instance</b> <b>They focus on a very specific vertical</b> <b>They fully master this vertical</b> <b>Within this vertical, they have a solid business model</b> <b>And corresponding barriers to entry</b> <b>I think this is a very successful case</b> <b>Exactly</b> <b>Of course, I think this case is probably hard to transfer to other scenarios</b>
<b>So you don’t agree with my view that brain companies are like Waymo</b> <b>And robot companies are like Tesla, right?</b>
<b>Exactly</b> <b>I think brain companies should be more like OpenAI in the later stages</b> <b>Exactly</b> <b>I think, at its core, autonomous driving is still something</b> <b>that doesn’t require extremely high intelligence</b> <b>I believe that when we talk about embodiment, we need to benchmark</b> <b>both large language models and autonomous driving</b> <b>I think embodiment might be a combination of the two</b> <b>Is there a Tesla in this industry?</b>
<b>Is there a Tesla in the embodiment field?</b>
<b>I think Figure hopes to become the Tesla of embodiment</b> <b>Right?</b>
<b>They have their own hardware,</b> <b>they are scaling up mass production,</b> <b>and while deploying they are also developing their own brain.</b> <b>But it's still far off.</b> <b>Exactly.</b> <b>Because the scenarios are just too ambiguous.</b> <b>Right.</b> <b>I think the difficulty is still very high.</b> <b>What I'm increasingly observing is that</b> <b>I might see large models' generalization ability emerge sooner.</b>
<b>And I think many people underestimate the difficulty of deploying in a vertical scenario;</b> <b>then, migrating to other verticals once it's deployed,</b> <b>the generalization becomes even harder.</b> <b>Exactly.</b> <b>Because I truly experienced it firsthand</b> <b>through Cruise and Waymo during this wave of autonomous driving.</b> <b>So first of all, within a vertical domain scenario,</b> <b>deploying in San Francisco</b>
<b>is already a very challenging problem</b> <b>and of course, once that is achieved</b> <b>when you move to other cities</b> <b>you actually need to collect and train with even more data for each city</b> <b>and conduct large-scale evaluations</b> <b>to truly ensure you can deploy safely enough in that city</b> <b>this is not something that generalizes easily</b> <b>whereas, for example, Tesla</b>
<b>they probably</b> <b>started collecting data from the very beginning</b> <b>that's right</b> <b>Yes, it is a much broader and more extensive data collection.</b>
<b>The real challenge is getting that right.</b> <b>But for robots, that kind of extensive data collection across scenarios is even harder.</b> <b>So your logic is that you must rely on simulation.</b>
<b>Simulation and human data are embodiment-independent data.</b> <b>Yes, I believe this will be absolutely critical.</b> <b>If that weren't the case,</b> <b>if there were no simulation and human data at the base of the pyramid,</b> <b>I believe general embodied intelligence simply could not emerge.</b>
<b>Speaking of this data pyramid,</b> <b>let's talk about its structure</b> <b>and the know-how involved in collecting each type of data.</b> <b>Actually, the data pyramid is a concept</b> <b>proposed by Fei-Fei Li's student, Professor Zhu Yuke.</b> <b>Essentially, his analysis is that embodied intelligence data is different from autonomous driving data:</b>
<b>most of it definitely doesn't come from data generated by the robot's own embodiment.</b>
<b>Because there isn't enough large-scale embodiment data,</b> <b>it relies more on simulation, the internet, and human data.</b> <b>So the pyramid has three parts, with the top being data collected from the real robot itself.</b>
<b>This is basically the real robot teleoperation data that we see most often nowadays.</b> <b>This data is certainly the most accurate and the most useful.</b>
<b>But the problem with this data is that it’s very difficult to scale.</b>
<b>It’s hard to scale robots and hard to scale scenarios.</b>
<b>The middle layer is data generated through simulation.</b>
<b>The advantage of simulation-generated data is that it can be scaled very well.</b>
<b>Of course, it also faces the Sim-to-real problem.</b>
<b>Actually, nowadays, since customers are using large models,</b> <b>they use a huge amount of simulated data as well as real data during the pre-training phase.</b>
<b>This actually makes the model’s generalization ability very strong.</b>
<b>In fact, the Sim-to-real gap—the difference between simulation and reality—is becoming smaller and smaller.</b>
<b>Yes, this is the middle layer of simulated data.</b>
<b>Going further down, there’s internet data and human video data.</b>
<b>Human video data is mostly first-person perspective data.</b>
<b>It might be data collected by people wearing glasses, for example.</b>
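The three layers just described can be summarized as a small structure of trade-offs. The qualitative cost and scalability labels below are my paraphrase of the discussion, not measured figures.

```python
from dataclasses import dataclass

@dataclass
class DataLayer:
    name: str
    position: str     # place in the pyramid
    cost: str         # qualitative collection cost
    scalability: str  # how easily volume can grow

# The pyramid as described: smallest, most expensive layer on top.
pyramid = [
    DataLayer("real robot teleoperation", "top", "highest",
              "hard to scale robots and scenarios"),
    DataLayer("simulation-generated data", "middle", "medium",
              "scales well, subject to the Sim-to-real gap"),
    DataLayer("internet and human video", "bottom", "lowest",
              "largest volume, mostly first-person view"),
]

for layer in pyramid:
    print(f"{layer.position:>6}: {layer.name} ({layer.cost} cost; {layer.scalability})")
```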
<b>If we look back over the past few months,</b> <b>I think in embodiment-agnostic data, simulation, and human data,</b> <b>there has been a qualitative breakthrough.</b>
<b>I actually believe we have now reached a Scaling Law,</b> <b>a Scaling Law for embodied data.</b>
<b>Why do I say that?</b>
<b>Look at Fei-Fei Li's BEHAVIOR Challenge,</b> <b>and NVIDIA's GR00T model,</b> <b>which uses a large amount of simulated data</b> <b>and demonstrates its effectiveness.</b> <b>Additionally, there's Generalist,</b> <b>which used 270,000 hours of UMI gripper data.</b> <b>The UMI gripper is essentially human-operated,</b> <b>with two hands holding the grippers</b> <b>to collect the data,</b>
<b>so it's actually a form of human data as well,</b> <b>in a relatively simple gripper morphology;</b> <b>going forward, it's really about finger-level data,</b> <b>dexterous hands.</b> <b>Exactly, and they have already demonstrated that with this 270,000 hours of data</b> <b>the model shows a Scaling Law.</b> <b>So, because of these points,</b> <b>based on our concrete observations</b> <b>and the demands our clients have brought to us,</b>
<b>the past few months have probably seen a qualitative leap,</b> <b>specifically in the demand for data volume.</b> <b>It could be a huge increase.</b> <b>Originally we might have been a team that</b> <b>needed to stimulate demand,</b> <b>but now we may need to scale our team</b> <b>to truly deliver on customer needs</b> <b>at this stage.</b> <b>Of course, let me share some more thoughts here:</b> <b>I think the pyramid isn't simply three layers,</b>
<b>real data,</b> <b>simulated data,</b> <b>and then human data.</b> <b>Actually, each layer</b> <b>needs to be further subdivided.</b> <b>Let me give an example,</b> <b>starting from the simulated-data layer.</b> <b>The topmost sub-layer might be simulation data driven by humans,</b> <b>because from an ROI perspective it's very close to real data.</b> <b>Right?</b>
<b>Its advantage is that it doesn't need to be based on the robot's physical body.</b> <b>Right?</b>
<b>And on the other hand,</b> <b>it still relies on humans to collect the highest quality data.</b>
<b>But the issue is that its scalability is relatively limited.</b>
<b>Right?</b>
<b>Then, moving further down, it’s more about algorithm-driven,</b> <b>model-driven automated data collection layers.</b>
<b>Here, human involvement is relatively minimal.</b>
<b>What it guarantees is scalability,</b> <b>but the quality won’t be higher than the upper layers.</b>
<b>Right?</b>
<b>Now, if we look even further down,</b> <b>the human data layer is similar.</b>
<b>It might also include more passive human data collection layers,</b> <b>where people might be wearing some kind of glasses, right?</b>
<b>But without strong quality control measures,</b> <b>they gather a lot of first-person human perspective data.</b>
<b>It could also include actively collected data layers,</b> <b>which might use higher quality hardware,</b> <b>and have better process controls,</b> <b>but their scalability tends to be lower.</b>
<b>I think this could be one way to structure the data pyramid.</b> <b>Yes, and of course there's another point,</b> <b>which is that, on some level, the data pyramid gives the impression</b> <b>that each layer exists as a very independent entity,</b> <b>as if real data, simulation, and internet human data were relatively independent concepts.</b> <b>From our practice,</b> <b>I increasingly believe the data pyramid might actually be</b>
<b>A closed loop centered around simulation</b> <b>With simulation as the core middle layer</b> <b>Yes, so how do we understand this?</b>
<b>If we really want to do simulation-based evaluation well,</b> <b>because evaluation must be scaled through simulation,</b> <b>it has to incorporate the largest possible amount of sufficiently realistic scenes,</b> <b>physical worlds,</b> <b>human trajectories,</b> <b>and experience.</b> <b>At the same time, I think a very critical factor is the evaluation criteria,</b> <b>the evaluation standards for different tasks.</b> <b>These are very difficult to develop in isolation inside simulation,</b>
<b>So in fact, it requires access to more real-world data</b> <b>Right, so this is actually why we are now starting to work on human data.</b>
<b>Human data</b> <b>Specifically, video data of humans.</b>
<b>You just mentioned first-person perspective data of humans.</b>
<b>Exactly.</b>
<b>Why first-person perspective?</b>
<b>Because, actually, we can consider humans as...</b>
<b>I think one capability that large models really focus on is the ability to work across embodiments.</b>
<b>From this perspective, isn’t a human also a kind of robot?</b>
<b>Yes, so essentially, this training paradigm treats humans as robots.</b>
<b>We take their data back,</b> <b>and feed it all together for training.</b>
<b>There’s another point,</b> <b>it’s like treating humans as vehicles.</b>
<b>Right.</b>
<b>Treating humans as vehicles.</b>
<b>Exactly.</b>
<b>That’s exactly what it means.</b>
<b>Also, if this continues, maybe in the future robots will become more and more human-like,</b> <b>because the more they resemble humans,</b> <b>the closer their form is to a human's, and the smaller the gap becomes.</b> <b>Right? So I think this is a core point of first-person perspective data from humans.</b>
<b>Of course, once this data is collected,</b> <b>it can be used, through Real-to-sim,</b> <b>many algorithms,</b> <b>and simulation capabilities, to bring that world back,</b> <b>to bring back the physical interactions involved</b> <b>as well as many of the tasks and evaluation criteria,</b>
<b>and truly integrate these into the simulation to expand its scale.</b> <b>Right?</b> <b>Another aspect is that</b> <b>this forms a loop from real to simulation,</b> <b>and from simulation back to reality:</b> <b>after simulation, it must be deployed in the real world.</b> <b>So how do we solve Sim2Real?</b> <b>On one hand, by incorporating more simulation into pre-training,</b> <b>and on the other, by better aligning it with the real world.</b>
<b>Right? So actually,</b> <b>benchmarking real teleoperation data</b> <b>against simulation becomes especially important, right?</b>
<b>It's not just about benchmarking on the training side,</b> <b>but also benchmarking on the evaluation side.</b>
<b>This truly allows Sim2Real to serve not only training</b> <b>but also evaluation.</b>
<b>So from this perspective,</b> <b>I think the data pyramid is, on one hand, a pyramid,</b> <b>a hierarchical pyramid.</b>
<b>On the other hand,</b> <b>I believe it could be a closed loop of data centered around simulation</b> <b>and driven by evaluation.</b>
<b>So which data do you think is overrated?</b>
<b>And which data is underrated?</b>
<b>First of all, I believe data from real robots is definitely overrated.</b>
<b>Looking at the actual industry developments over the past few months,</b> <b>I think most people have already recognized this point.</b>
<b>Originally, companies or large model teams that were strictly proponents of real hardware,</b> <b>now, I believe, are massively procuring simulated data,</b> <b>simulation-based evaluations, or human data.</b>
<b>So I think first of all, real robot data is definitely overrated.</b>
<b>Secondly, I think simulation is still being underestimated.</b>
<b>Why is that?</b>
<b>Because I believe people have already seen some of the capabilities of simulation data.</b>
<b>But as for evaluation through simulation,</b> <b>I think not many people have truly recognized its value.</b>
<b>I believe large model teams have fully realized it.</b>
<b>Why? Because they focus on large-scale evaluations.</b>
<b>Without simulation, large-scale evaluations are impossible.</b>
<b>And I think many robotics companies are only now beginning to see this stage.</b>
<b>Because their scale isn’t that large yet.</b>
<b>But as their scale grows,</b> <b>and they need to handle more tasks, more types of tasks, and increasingly open scenarios,</b> <b>they will increasingly feel this pain point.</b>
<b>Simulation becomes an unavoidable necessity.</b>
<b>Additionally, I think human data is also relatively underestimated.</b>
<b>I believe human data is actually extremely critical.</b>
<b>Of course, I think from our perspective,</b> <b>it can truly help us improve, supplement, and enhance</b> <b>our simulation-centered ecosystem.</b>
<b>Smart glasses sound incredibly useful.</b>
<b>Smart glasses are basically like a car.</b> <b>Exactly.</b> <b>Everyone goes out to collect data for robots.</b> <b>Right.</b> <b>Yes, I totally agree with that point.</b> <b>I think one issue with human data is that it basically has no barriers to entry.</b> <b>I've seen many people building hardware for human data.</b>
<b>But essentially, human data at its core requires people to wear consumer-grade hardware to collect data</b> <b>Does it have to be the eyes?</b>
<b>First-person perspective, it definitely has to be the eyes</b> <b>For example, I’ve seen hardware companies like Plaud making recording pens</b> <b>And there are also companies making devices worn on the chest</b> <b>Got it</b> <b>Is this kind of human first-person perspective data?</b>
<b>From a first-principles perspective, the closer you are to the human viewpoint, the better</b> <b>Right, so if your hardware is, say, mounted on the head</b> <b>On top of the head or on the chest</b> <b>There’s actually a gap between your viewpoint and the human eye’s viewpoint</b> <b>So essentially, this causes a problem</b> <b>Why does it have to be the eyes?</b>
<b>I think it comes more from first principles:</b> <b>this is simply how people work.</b> <b>Right, and this is actually where I see many real demands heading;</b> <b>I think they are all moving in this direction.</b> <b>So if you look at it from this angle,</b> <b>what's ultimately needed is a truly scaled,</b> <b>consumer-grade,</b>
<b>comfortable enough wearable device to truly serve human data</b> <b>I think the hardware on the edge</b> <b>how to get people willing to wear glasses on a large scale</b> <b>if you’re not nearsighted, or like me, I am nearsighted</b> <b>but I prefer to wear contact lenses</b> <b>ideally, people would just like wearing these glasses</b> <b>not wear them just for the sake of data</b> <b>I think this might be the real point that human data needs to reach</b>
<b>Let me give an example:</b> <b>Meta’s Ray-Ban glasses.</b> <b>Right? They actually changed their approach.</b> <b>At first, they probably wanted to make gaming glasses, right?</b>
<b>And then make it very flashy</b> <b>But it doesn't look good enough</b> <b>So I think Meta's Ray-Ban glasses</b> <b>have a particularly smart point</b> <b>First of all, these are very cool glasses</b> <b>They look like really good glasses</b> <b>Secondly, they have an AI assistant you can talk to</b> <b>Right, they have a camera</b> <b>I believe this kind of wearable might be the most useful in the long run</b> <b>Because this wearable is something everyone already has</b>
<b>Not something you need to buy for everyone</b> <b>So these companies first need to design glasses that are attractive enough</b> <b>So that we’re all willing to wear them</b> <b>Then they can use us to collect data for their robots</b> <b>That’s the idea</b> <b>But if you think from this perspective</b> <b>I think the premise</b> <b>Must be based on a consumer-grade product</b> <b>It’s basically like saying</b> <b>I believe companies focused on human data shouldn’t make their own hardware</b>
<b>If their hardware makes it difficult to reach a consumer-level market.</b> <b>By consumer-level, I mean possibly a million units,</b> <b>or even larger shipment volumes,</b> <b>with everyone loving the glasses.</b> <b>So I believe it should be based on existing consumer-grade hardware.</b> <b>Or, if such hardware hasn't been released yet,</b> <b>a company with consumer-grade hardware launches a hit product</b> <b>and everyone starts wearing it;</b> <b>that would be a real breakthrough.</b> <b>So why would they share the data with a</b> <b>company that trains AI brains for robots?</b>
<b>I think in this case,</b> <b>they would have different hardware,</b> <b>each with corresponding SDKs and APIs</b> <b>and apps, right?</b>
<b>So you can actually design such a data collection process</b> <b>We all know computing power is expensive</b> <b>Because the three pillars driving AI are computing power, algorithms, and data</b> <b>Computing power is very costly</b> <b>Is data expensive?</b>
<b>If you want to buy, for example, synthetic data or first-person human perspective data</b> <b>What kind of price range are we talking about?</b>
<b>I actually think data is becoming increasingly expensive</b> <b>And this is a very interesting point</b> <b>Because many people might assume data should get cheaper over time</b> <b>But I actually believe that, essentially,</b> <b>it depends on the type of data</b> <b>Like I mentioned earlier, the different stages of data</b> <b>Maybe starting from a static dataset</b> <b>Or a bulk-level dataset</b> <b>To data that provides feedback</b> <b>The value it brings to algorithms is completely different</b>
<b>Therefore, the price it can command is also entirely different</b> <b>Yes, of course</b> <b>I think if we look at it from the perspective of pre-training,</b> <b>fine-tuning, and evaluation,</b> <b>pre-training data is probably the cheapest</b> <b>And it should be a relatively standardized product</b> <b>Right?</b>
<b>Because I don't think any single company</b> <b>pays for the entire pre-training dataset out of pocket;</b> <b>it should be a shared cost.</b> <b>Right?</b> <b>For example, there might be five major large model companies worldwide</b> <b>sharing the cost of this pre-training data,</b> <b>and everyone is willing to share,</b> <b>because this is a relatively common foundation</b> <b>that improves everyone's basic capabilities.</b> <b>The most critical feedback-driven</b> <b>improvements still happen during fine-tuning and evaluation.</b> <b>And fine-tuning and evaluation are likely more targeted:</b> <b>data-wise,</b> <b>it's more evaluation-driven,</b>
<b>Providing sufficient signals</b> <b>As well as relatively more experience transfer</b> <b>The value and price of this data are much higher</b> <b>About how much?</b>
<b>Actually, it’s hard to say.</b> <b>Right now, for example, from a data perspective,</b> <b>one hour of data</b> <b>could range from tens of RMB to over a thousand RMB;</b> <b>all of that is possible.</b> <b>But that's data collected by experts, right?</b>
<b>Expert data collection,</b> <b>and not just that.</b> <b>When it comes to data, I think of it</b> <b>as embodied data,</b> <b>and it includes three key elements.</b> <b>First, it includes a physical scenario,</b> <b>whether real or simulated;</b> <b>there must be a scenario.</b> <b>Second,</b> <b>it includes the trajectory of experience,</b> <b>as well as the transmission of that experience;</b> <b>the transmission of experience includes language annotations.</b> <b>Third,</b> <b>it includes evaluation metrics:</b> <b>for example, this indicates success,</b>
<b>and this indicates failure.</b> <b>It might be labeled in even more detail.</b> <b>For instance, in the BEHAVIOR dataset,</b> <b>making a pizza</b> <b>could be a long-horizon task,</b> <b>and within it there might be</b> <b>smaller subtasks.</b> <b>I might fail at first:</b> <b>for example, I might place a mushroom first,</b> <b>fail the first time,</b> <b>then succeed the second time.</b> <b>All of this would be labeled.</b> <b>Exactly.</b> <b>Structured together, it forms a dataset,</b> <b>for example, an hour’s worth of</b> <b>pizza-making data,</b>
<b>which could be sold for anywhere</b> <b>from tens of RMB</b> <b>up to several thousand RMB.</b> <b>Currently,</b> <b>I think the entire industry is still in a relatively</b> <b>divergent phase on pricing.</b> <b>Of course,</b> <b>we focus heavily on high-quality data,</b> <b>because low-quality data is essentially meaningless here.</b> <b>High-quality data,</b> <b>I believe, falls within a range</b> <b>from several hundred RMB up to over a thousand RMB per hour.</b> <b>What defines high-quality data?</b>
<b>High-quality data,</b> <b>in my opinion, involves several key points</b> <b>First, the physical environment</b> <b>must be sufficiently diverse</b> <b>its interactions must be genuinely realistic</b> <b>and it must closely align with real-world physical scenarios</b> <b>Secondly,</b> <b>this refers to the recording of the trajectory,</b> <b>which needs to be sufficiently professional.</b>
<b>For example, making a pizza, right?</b>
<b>It has to be smooth enough.</b>
<b>There might be mistakes,</b> <b>but within those mistakes, there’s a correction.</b>
<b>Actually, this kind of data is more valuable.</b>
<b>It’s quite counterintuitive,</b> <b>because people might think a perfect</b> <b>pizza-making video</b> <b>would be the most expensive,</b> <b>but that’s not the case.</b>
<b>If, for example, a few toppings fall off in the middle,</b> <b>and then you pick them back up</b> <b>and remake the pizza properly,</b> <b>that data is actually more valuable.</b>
<b>I think it’s somewhat similar to human learning,</b> <b>human experience, right?</b>
<b>Failing first and then succeeding through experience.</b>
<b>It is often the most valuable.</b> <b>Then thirdly, I would say,</b> <b>its evaluation metrics</b> <b>and its annotations</b> <b>must be sufficiently accurate,</b> <b>especially for long-horizon tasks,</b> <b>which is actually quite challenging.</b> <b>It requires large-scale, automated,</b> <b>model-driven algorithms</b> <b>to truly help refine and optimize the labels.</b> <b>And for human data,</b> <b>hand</b> <b>and full-body tracking</b> <b>realism</b> <b>and accuracy are extremely critical.</b>
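As a toy illustration of the three elements mentioned earlier (a physical scenario, an annotated trajectory of experience, and evaluation labels), and of why fail-then-correct episodes are singled out as valuable, here is a hedged sketch; the class names, field names, and sample episodes are all invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str          # language annotation, e.g. "place mushroom"
    success: bool      # evaluation label for this step

@dataclass
class Episode:
    scene: str                                    # physical scenario (real or simulated)
    subtasks: list[Subtask] = field(default_factory=list)  # annotated trajectory

def has_correction(ep: Episode) -> bool:
    """True if some subtask fails and the same subtask later succeeds."""
    failed = set()
    for st in ep.subtasks:
        if not st.success:
            failed.add(st.name)
        elif st.name in failed:
            return True
    return False

episodes = [
    Episode("kitchen_sim", [Subtask("place mushroom", False),
                            Subtask("place mushroom", True),
                            Subtask("bake pizza", True)]),
    Episode("kitchen_sim", [Subtask("place mushroom", True),
                            Subtask("bake pizza", True)]),
]

# Keep the counterintuitively valuable fail-then-succeed episodes.
corrective = [ep for ep in episodes if has_correction(ep)]
print(len(corrective))
```

A real pipeline would produce these labels with model-driven algorithms plus human-in-the-loop checks rather than by hand; the point is only that per-subtask success/failure labels are what make the "failed first, then succeeded" episodes findable at all.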
<b>For example,</b> <b>what kind of data is good data?</b> <b>Would movie data be good data?</b>
<b>Would game data be good data?</b>
<b>That's a very good question</b> <b>These are the kinds of videos we encounter in daily life.</b>
<b>They could potentially be good data.</b>
<b>I actually think game data,</b> <b>and movie data as well, can be useful.</b>
<b>But the thing is,</b> <b>from the perspective of the data pyramid,</b> <b>I think the key point about the data pyramid is to tell everyone</b> <b>that any data can be useful.</b>
<b>But more importantly, we need to consider ROI,</b> <b>the cost-benefit ratio.</b>
<b>So let me give an example.</b>
<b>Movie data,</b> <b>video data is very likely to help improve the model,</b> <b>but the problem is,</b> <b>the processing cost might be quite high,</b> <b>and the improvement to the model might be relatively small.</b>
<b>So it could be that I spend</b> <b>a large amount of computing power to process this data,</b> <b>and then compress it,</b> <b>but the intelligence gains I get are relatively limited.</b>
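The ROI argument reduces to simple arithmetic: rank each data source by expected model improvement per unit of processing cost. The numbers below are purely illustrative assumptions, not measurements from the conversation:

```python
# Hypothetical (made-up) estimates: processing cost per hour of data
# and expected contribution to model improvement, in arbitrary units.
sources = {
    "movie_video":      {"cost": 10.0, "gain": 2.0},
    "game_replays":     {"cost": 8.0,  "gain": 2.5},
    "sim_algorithmic":  {"cost": 2.0,  "gain": 4.0},
    "human_egocentric": {"cost": 3.0,  "gain": 5.0},
}

# ROI = gain per unit cost; higher means a better use of compute budget.
ranked = sorted(sources,
                key=lambda s: sources[s]["gain"] / sources[s]["cost"],
                reverse=True)
print(ranked)
```

Under these assumed numbers, movie data ranks last despite being plentiful, because its processing cost dominates the modest gains it provides.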
<b>From an ROI perspective,</b> <b>I think the highest tier is still</b> <b>simulation-based data,</b> <b>sometimes with humans in the loop,</b> <b>but with collection driven by algorithms,</b> <b>or alternatively human data.</b> <b>These two are, from what I've seen so far,</b> <b>the data with the highest ROI during the pre-training phase.</b> <b>Why is data from movies</b> <b>and games so challenging to process?</b>
<b>On one hand, you might need to add more annotations.</b> <b>Another issue is that it isn't 3D information;</b>
<b>it is essentially 2D information.</b> <b>Games can be 3D,</b> <b>but game data can be</b> <b>a bit too cross-domain:</b> <b>compared to the real world, it's</b> <b>a completely different scenario,</b> <b>and its physics are actually unrealistic.</b> <b>Right?</b>
<b>So it has a different worldview.</b>
<b>Exactly.</b>
<b>But this kind of data is actually useful for world models.</b>
<b>For example, many world model teams</b> <b>use a large amount of game data.</b>
<b>Specifically, data from playing games.</b>
<b>They have dedicated teams</b> <b>that purchase the rights to these games,</b> <b>then use their agents to play them,</b> <b>and collect the data to train their world models.</b>
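The collection loop just described, where an agent plays a licensed game while its observations, actions, and rewards are recorded for world-model training, looks roughly like the following sketch. `ToyGame` is a stand-in invented here; a real setup would wrap an actual game engine, and the random agent is a placeholder for a trained game-playing agent:

```python
import random

class ToyGame:
    """Stand-in for a licensed game engine: a tiny number-guessing world."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def reset(self):
        self.state = self.rng.randint(0, 9)
        self.steps = 0
        return self.state
    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 9)
        self.steps += 1
        done = self.steps >= 50          # fixed-length episode
        return self.state, reward, done

def collect_episode(agent, env):
    """Roll the agent through one episode, recording (obs, action, reward)."""
    trajectory = []
    obs = env.reset()
    done = False
    while not done:
        action = agent(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory

def random_agent(obs):
    return random.randrange(10)          # placeholder policy

data = collect_episode(random_agent, ToyGame())
print(len(data))
```

The recorded `(observation, action, reward)` transitions are the raw material a world model would then be trained to predict.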
<b>But this matter,</b> <b>how should I put it,</b> <b>is useful,</b> <b>but its effectiveness</b> <b>is not that high.</b>
<b>From the perspective of providing data,</b> <b>the focus should be on the needs of high-ROI customers.</b>
<b>In other words, I think the data pyramid is very large,</b> <b>but you don’t actually need to serve every single layer within it.</b>
<b>But it should probably serve the highest-value links in the chain.</b> <b>Do you have many kinds of data internally that you price differently?</b>
<b>Yes, we do.</b> <b>That's right.</b> <b>But</b> <b>it's really hard to set prices.</b> <b>Overall, though,</b> <b>it's actually not that complicated.</b> <b>Broadly, there are</b> <b>two main categories:</b> <b>one is pre-training data.</b> <b>Right?</b> <b>The other is evaluation data.</b> <b>These two are actually</b> <b>what everyone is most lacking right now.</b> <b>Because many people call you a Data Factory, right?</b>
<b>It's like a digital factory</b> <b>Take us inside this digital factory</b> <b>What is your workflow like?</b>
<b>And roughly, how is your team composed?</b>
<b>For example, just now we talked about people doing data annotation</b> <b>Is this a profession?</b>
<b>Yes.</b> <b>That's a great question.</b> <b>First of all,</b> <b>I think we are more like a Data Engine,</b> <b>or rather, I prefer to define us as a Data Engine.</b> <b>You think Data Factory is an outdated concept?</b> <b>Yes.</b> <b>I feel "Data Factory" is somewhat misleading.</b> <b>A factory implies a</b> <b>production line,</b> <b>and a production line lacks advanced technology and advanced systems;</b> <b>it is not feedback-driven,</b> <b>not driven by evaluation feedback.</b>
<b>We see the Data Engine as a feedback-driven</b> <b>learning engine.</b> <b>So it is built on a system,</b> <b>on engineering and systems capabilities,</b> <b>leveraging the edge side</b> <b>to help generate data.</b> <b>Let me give an example.</b> <b>What people often see is the data we produce,</b> <b>but actually, our core is a full-stack system.</b>
<b>First, to create a sufficiently physically realistic world,</b> <b>we need simulation.</b> <b>This requires us to build a physically realistic world</b> <b>as well as interactive, physically realistic assets.</b> <b>This is a very challenging foundational task.</b> <b>For example, rigid-body assets are relatively simple to build,</b> <b>but non-rigid bodies, such as</b> <b>cables,</b>
<b>are quite difficult, especially since many industrial scenarios we serve involve plugging and unplugging cables.</b> <b>This is a very challenging task.</b> <b>It requires a proprietary underlying physics solver,</b> <b>a non-rigid-body solver,</b> <b>co-designed and jointly tuned with the simulation of these assets</b> <b>to truly generate them.</b> <b>At the same time, where does the physics itself come from?</b>
<b>The physics needs to come from real-world physics.</b> <b>So we actually have a physical measurement factory.</b> <b>This measurement factory is built on a sufficiently automated toolchain, including robotic arms and so on,</b> <b>that automates interaction with all kinds of real physical assets around the world,</b> <b>performs the interactions,</b> <b>retrieves their mechanical properties,</b> <b>and then, relatively automatically, puts them into the simulated assets and the simulated world.</b>
<b>All of this is so we can produce a</b> <b>physically realistic, interactive simulated world.</b> <b>This is the system we developed.</b> <b>On this basis,</b> <b>as I just mentioned, there are two types of simulation data.</b>
<b>One type is human-driven</b> <b>Its advantage lies in the fact that the quality of its data is the highest.</b>
<b>It provides the best example</b> <b>The issue is that its scalability is somewhat lacking.</b>
<b>So along this path, we have</b> <b>These are very high-quality.</b>
<b>For example, the teleoperation toolchain.</b>
<b>It's somewhat like seeing a person remotely operating a real-world robot.</b>
<b>We have people remotely operating</b> <b>robots in this simulated world.</b>
<b>Robots of different forms.</b> <b>Even robots we define ourselves.</b>
<b>They might be different from every other robot.</b>
<b>But they use a sufficiently standardized form</b> <b>to collect data on the various robot bodies,</b> <b>demonstration data,</b> <b>and bring it back.</b>
<b>At the same time, we have also trained</b> <b>a sufficiently good automated algorithm based on this approach,</b> <b>which can automate data collection in this direction using this algorithm.</b>
<b>Occasionally, human intervention is needed,</b> <b>right?</b> <b>So this is a more scalable data generation pipeline.</b>
<b>On top of these two foundations,</b> <b>the next step is annotation.</b>
<b>That means it has more semantic-level annotations,</b> <b>leveraging a lot of large-model capabilities,</b> <b>plus, at the end, a human-in-the-loop quality check</b> <b>to truly ensure the data</b> <b>is of sufficiently high quality.</b> <b>So this is the real foundation for how we generate data.</b> <b>Of course,</b> <b>as I just mentioned,</b> <b>evaluation also needs to be scaled,</b> <b>so you can think of</b> <b>evaluation as a data pipeline as well.</b>
<b>It starts from</b> <b>human-collected data:</b> <b>a set of</b> <b>edge-side hardware and cloud-based automated algorithms</b> <b>brings this data back,</b> <b>and then we do</b> <b>real-to-sim, which includes</b> <b>reconstructing the physics inside the video,</b> <b>extracting the tasks within the video</b> <b>relatively automatically,</b> <b>and extracting the evaluation criteria,</b>
<b>and putting them into our</b> <b>simulation:</b> <b>the assets,</b> <b>scenes,</b> <b>world,</b> <b>and task definitions,</b> <b>to make it more scalable,</b> <b>to generate a</b> <b>complete evaluation data pipeline</b> <b>to serve our clients.</b> <b>You've emphasized repeatedly that evaluation data is very important.</b> <b>So how do you do it?</b>
<b>Yes, I believe</b> <b>Evaluation data</b> <b>The biggest challenge is</b> <b>First, it needs to be very challenging</b> <b>Second, it needs to be highly scalable</b> <b>So it has to be both difficult and scalable</b> <b>Exactly</b> <b>This is really tough</b> <b>Let me give an example</b> <b>For instance</b> <b>Many</b> <b>robotics companies</b> <b>are doing demos</b> <b>They might, for example,</b> <b>fold clothes, right?</b>
<b>and so on.</b> <b>They are often working within</b> <b>a relatively fixed environment,</b> <b>performing relatively single tasks.</b> <b>But to test the generalization ability of large models,</b> <b>you need to evaluate</b> <b>truly at scale:</b> <b>across scenarios</b> <b>at the thousand level</b> <b>at least,</b> <b>with tasks</b> <b>in the thousands</b> <b>or even tens of thousands,</b> <b>each with corresponding definitions of success,</b>
<b>to help them truly evaluate.</b> <b>At this point, I think</b> <b>the first question is</b> <b>how to build</b> <b>these parallel worlds</b> <b>and their corresponding physics,</b> <b>which I</b> <b>briefly covered just now</b> <b>when discussing simulation</b> <b>and the production line from real to sim.</b> <b>The challenging part is</b> <b>the tasks</b> <b>and the evaluation criteria,</b> <b>which we derive from the real world.</b> <b>This is an extremely critical aspect for us,</b> <b>because</b>
<b>if simulation evaluations</b> <b>become disconnected from real-world evaluations,</b> <b>then even if they can be scaled,</b> <b>they won't generate</b> <b>substantial value.</b> <b>Another point:</b> <b>some might think we are a simulation-centric company</b> <b>that only does simulation,</b> <b>but that's not the case.</b> <b>We also have real-world evaluation infrastructure:</b> <b>for example, we have real robots,</b> <b>real scenarios,</b> <b>and evaluation algorithms.</b> <b>Their goal</b>
<b>is not to serve our customers</b> <b>by evaluating their robots in real-world settings;</b> <b>rather, their aim is to benchmark against our</b> <b>larger-scale simulation</b> <b>toolchain,</b> <b>production line, and</b> <b>evaluation challenges:</b> <b>for example,</b> <b>running the same algorithm</b> <b>both in simulation and in reality,</b> <b>can we observe a corresponding correlation?</b> <b>This is a crucial matter.</b> <b>I believe only by doing this well</b> <b>can we truly</b> <b>scale simulation-centered evaluations</b>
<b>and do them properly</b> <b>How many people do you have?</b>
<b>The whole team?</b>
<b>Right now, our full-time staff</b> <b>is mostly in engineering and technical roles;</b> <b>there might be around a hundred people.</b>
<b>Something like that.</b>
<b>I don't really believe</b> <b>that AI can completely generate data for itself</b> <b>and then serve itself.</b>
<b>The underlying logic here is different.</b>
<b>Because it’s more like a perpetual motion machine.</b>
<b>So essentially,</b> <b>I think one core point is,</b> <b>first, whether you have access to an accurate enough representation of the world,</b> <b>right?</b> <b>And accurate enough tasks.</b>
<b>Second, you need someone to provide experiential demonstrations within this process.</b>
<b>This is a key cognitive factor that helps the model improve.</b>
<b>Of course, that said,</b> <b>I think a very critical point</b> <b>is how you scale this demonstration.</b> <b>Right?</b> <b>If you are a human-centered data company,</b> <b>then what you might need,</b> <b>I believe,</b> <b>is tens of millions to hundreds of millions of people</b> <b>to ultimately get this done,</b> <b>because the volume required here is huge.</b> <b>But if you are simulation-centered,</b> <b>system-centered,</b> <b>then you actually have a scaling effect,</b> <b>because you are amplifying through technology</b> <b>the set of experiences generated by humans.</b>
<b>then I think the volume needed here might be about a hundred times smaller</b> <b>I remember last time Tan Jie said</b> <b>that Data Factory</b> <b>faced a problem where you collected a lot of data</b> <b>but for example, after providing data to them</b> <b>like giving data to these brain companies</b> <b>they still can't tell you whether the data is good or not</b> <b>And in the end, it just turns into a blame game</b> <b>Where the data company says</b>
<b>“Oh, it’s your model that wasn’t trained well”</b> <b>And then the model company says</b> <b>“Hey, it’s your data collection that’s not up to par”</b> <b>It’s basically a back-and-forth blame process</b> <b>What’s your take on this?</b>
<b>How should this issue be addressed?</b>
<b>Right</b> <b>I think this is an objectively existing problem</b> <b>But</b> <b>I actually want to give an example</b> <b>Let’s look at</b> <b>Scale AI and OpenAI during the GPT-2 phase</b> <b>They were actually at the same stage</b> <b>At this stage</b> <b>It basically means</b> <b>Everyone was collectively searching for</b> <b>the right “recipe” for the data</b> <b>And the overall direction was already relatively clear</b> <b>For example, simulation</b> <b>For example, human data</b> <b>For example, simulation evaluation</b> <b>But in the details</b> <b>there might be some differences</b>
<b>Let me give an example</b> <b>For instance, we have actually encountered</b> <b>In the early days</b> <b>the client’s requirement might have been</b> <b>to have perfect data</b> <b>Later on, they might prefer negative samples</b> <b>or just error-correcting data</b> <b>Additionally,</b> <b>they might need data with a wider distribution</b> <b>For example, if you pick up a bottle</b> <b>they might want</b> <b>the way you pick up the bottle to be different</b> <b>rather than always picking it up the same way</b> <b>from a similar direction</b>
<b>or the same position.</b> <b>Right?</b> <b>These are all part of a</b> <b>gradual iterative understanding, I think.</b>
<b>I believe the most crucial aspect here is</b> <b>collaborating with some of the industry's leading clients,</b> <b>working together</b> <b>to create a symbiotic relationship.</b>
<b>I think this is the most critical thing.</b>
<b>Additionally,</b> <b>we have encountered</b> <b>questions people have asked before:</b> <b>if a data company</b> <b>works on neither the brain</b> <b>nor the robot body,</b> <b>then its understanding</b> <b>of data</b> <b>can't keep up with a robot-body company</b> <b>or a brain company.</b>
<b>Understanding of Data</b> <b>I think</b> <b>actually, from our practical experience</b> <b>I don't think that's the case</b> <b>And why is that?</b>
<b>It's just that</b> <b>there are actually very few teams in the world</b> <b>that truly have an understanding of data,</b> <b>especially large-scale, pre-training-level data;</b> <b>probably only about five or so.</b> <b>We basically have</b> <b>collaborative relationships with them.</b> <b>I believe</b> <b>the most critical thing is to</b> <b>establish a relatively symbiotic</b> <b>collaborative relationship</b> <b>with core clients.</b> <b>Which five?</b>
<b>I probably won't go into too much detail on that,</b> <b>but</b> <b>you can imagine the largest large model companies;</b> <b>they usually have their own</b> <b>dedicated teams.</b> <b>Here,</b> <b>I think</b> <b>a very core point</b> <b>is whether both sides can iterate synchronously,</b> <b>iterating on mutual understanding.</b> <b>This is a crucial matter.</b> <b>In a way,</b> <b>we have gained a lot</b> <b>of insights from different clients.</b>
<b>At the same time, we have also provided our clients with</b> <b>more insights</b> <b>I think this is extremely necessary</b> <b>Let me give another example</b> <b>Actually,</b> <b>the concept of the data pyramid</b> <b>It needs to be validated.</b>
<b>The data pyramid is just a concept.</b>
<b>But which layer of data is actually the most effective?</b>
<b>What is the right proportion?</b>
<b>That needs to be validated.</b>
<b>We are actually probably working with</b> <b>about two companies or so</b> <b>constantly</b> <b>iterating and evolving the data pyramid.</b>
<b>This is a very critical matter.</b>
<b>Of course, that means</b> <b>if you want to validate the data pyramid,</b> <b>you need a certain number of GPUs,</b>
<b>probably tens of thousands of GPUs,</b> <b>to truly and effectively validate the data pyramid.</b>
<b>So I believe that some of these core understandings are extremely important.</b>
<b>How to allocate</b> <b>the right proportions.</b>
<b>My view is that</b> <b>maybe it’s not necessary to get too detailed.</b>
<b>But that said,</b> <b>I actually believe</b> <b>it's increasingly leaning towards</b> <b>the embodiment-independent layer.</b> <b>This is</b> <b>first of all a fact.</b> <b>Additionally,</b> <b>we might gain a deeper understanding,</b> <b>not just of the pre-training phase,</b> <b>but also of post-training:</b> <b>how to approach the RL stage,</b> <b>how to fine-tune,</b> <b>how much to leverage simulation,</b> <b>how much to leverage real-world data,</b> <b>and how to structure the subsequent evaluation.</b> <b>I think it's an integrated, systematic understanding.</b>
<b>preparing data is absolutely critical</b> <b>Could you share with everyone</b> <b>some of your key secrets?</b>
<b>I think maybe what I want to share</b> <b>is a somewhat counterintuitive insight.</b> <b>It still comes back to</b> <b>what kind of data</b> <b>counts as good data.</b> <b>Actually, I feel it's becoming more and more like human learning,</b> <b>and less and less like the earliest autonomous driving</b> <b>and the earliest machine vision,</b> <b>where perfect data was the best</b> <b>because there was a standard answer.</b> <b>I think nowadays data</b>
<b>increasingly doesn't have a standard answer.</b> <b>And at this point,</b> <b>I think, from first principles,</b> <b>the data that best helps people learn</b> <b>might be the best data:</b> <b>for example,</b> <b>data that shows you some mistakes</b> <b>and lets you learn from those mistakes.</b> <b>I think this is a crucial point.</b> <b>Additionally,</b> <b>as we grow up from childhood,</b> <b>just watching a teacher</b> <b>explain problems to you</b>
<b>which might not be the most effective way</b> <b>Maybe if you treat every classmate as your own teacher,</b> <b>a single problem could have different approaches,</b> <b>and from this diverse distribution,</b> <b>you can draw your own conclusions,</b> <b>which might be better.</b>
<b>I think these might be</b> <b>I believe its Secret Sauce is</b> <b>that it aligns with human</b> <b>learning, which is becoming increasingly universal.</b>
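The "learn from mistakes" idea above can be sketched concretely. Below is a toy, hypothetical illustration of turning failed and successful attempts at the same task into DPO-style preference pairs; all names here (`Episode`, `make_preference_pairs`) are invented for illustration and are not from any specific framework.

```python
# Hypothetical sketch: pairing "failed" and "succeeded" episodes of the
# same task into preference data, in the spirit of DPO-style post-training.
from dataclasses import dataclass

@dataclass
class Episode:
    task: str          # task identifier, e.g. "pick_cup"
    actions: list      # recorded action sequence
    success: bool      # did the episode achieve the goal?

def make_preference_pairs(episodes):
    """Pair each failed attempt at a task with a successful one.

    The successful trajectory becomes the 'chosen' sample and the failure
    the 'rejected' one, so a model can learn not only what works but also
    what to avoid -- the extra signal that failure data carries.
    """
    by_task = {}
    for ep in episodes:
        by_task.setdefault(ep.task, {"ok": [], "fail": []})
        by_task[ep.task]["ok" if ep.success else "fail"].append(ep)

    pairs = []
    for task, group in by_task.items():
        for good in group["ok"]:
            for bad in group["fail"]:
                pairs.append({"task": task,
                              "chosen": good.actions,
                              "rejected": bad.actions})
    return pairs

episodes = [
    Episode("pick_cup", ["reach", "grasp", "lift"], True),
    Episode("pick_cup", ["reach", "slip"], False),
    Episode("open_door", ["pull"], True),
]
pairs = make_preference_pairs(episodes)
print(len(pairs))  # → 1: the pick_cup success paired with the pick_cup failure
```

A perfect-demonstrations-only dataset would produce zero pairs here; the single failed `pick_cup` episode is what makes a contrastive pair possible at all.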
<b>So I increasingly feel that what we might really be running is an education company.</b>
<b>An AI education company?</b>
<b>Yes, I think the ultimate data companies may end up looking very similar to education companies.</b>
<b>So what do you think is the difference between educating AI and educating humans?</b>
<b>At present, I think embodied AI is still not that intelligent, right? So quite a few of today's examples are demonstrations; it is still rote memorization, learning by imitation. But I believe the further we go, the more you need to challenge it. Also, embodiment is essentially about interacting with the physical world, so this kind of education is somewhat different from our usual book-based education: it requires more physical demonstration and physical interaction.</b>
<b>Because you interact with many companies at home and abroad, whether brain-computer interface companies, robot hardware companies, or large-model companies, and you know them all quite well, can you give everyone an overview of how Chinese and American robotics teams approach the data landscape, and what their core beliefs are?</b>
<b>No problem. Because I know them very well, I probably can't go into too much detail, but I can categorize them. First, there is a large-model faction, and I believe it is probably growing.</b>
<b>Specifically, the large-model teams at the big companies. Their starting points may have differed at first, but they are becoming more and more convergent. What they need is zero-shot generalization capability.</b>
<b>Are you referring to the large language model teams, or which teams?</b>
<b>The big companies' VLA teams and their world-model teams, probably those two. What they need is zero-shot generalization; I think that is an extremely important capability, and it is what they value most. They don't put much emphasis on the complexity of the robot body itself. What they most want is to use a relatively simple, standardized body and still be able to validate their scalable zero-shot capability. They have a strong belief in data, and they strongly believe in body-agnostic data: trusting simulation, trusting simulated evaluation, trusting human data, because that follows the logic of large language models.</b>
<b>Exactly. At the same time, on the infrastructure side, they are earlier adopters of RL, doing large-scale RL, with the focus on simulation. That is a core trend we can observe among the major large-model teams.</b>
<b>Let me interject here. These big companies of course have abundant funding and very strong infrastructure capabilities, but each also has a large language model team alongside its VLA and world-model work in robotics. Surely they will allocate resources to the large language model team first, right? They wouldn't tilt toward the robotics team, would they? So could it be that, even inside these resource-rich big companies, the resources the robotics teams can actually secure are not as large as people imagine?</b>
<b>That's a very good point. I think that was indeed the real situation three to six months ago, or really before this year: the big companies basically did not get involved directly. OpenAI might not get involved; ByteDance might not even participate; nobody was seriously scaling this up. But starting this year, the core change is that the direction of large models has become relatively clear, so some resources could be freed up and applied to robotics and VLA.</b>
<b>From your perspective, who has become more aggressive?</b>
<b>I think ByteDance has definitely become more aggressive. Alibaba too, and OpenAI, and DeepMind has definitely become more aggressive. NVIDIA as well. These are the five teams competing to build the robot brain, and there will be others too. In a way, I believe PI (Physical Intelligence) should also fall into this category.</b>
<b>But it's a startup.</b>
<b>Right, it is a startup, but by positioning it leans more toward a frontier lab than a robotics company, so I think it can be counted in this category as well. It is truly training its own models at scale, which is what I consider a large-model player.</b>
<b>Now let's look at the robot companies. In the very beginning, I think they were all in the real-robot, real-data camp. Now some of them are starting to follow simulation and simulated evaluation, which I see as a turnaround. Some are also following human data at the same time, led by Generalist, for example. Sunday, likewise, uses its UMI-like gripper, which is also a form of human data, and some domestic companies are doing the same. So the robot companies are diversifying. At the core, the question for these companies is: is their main business model data collection, or is it to develop the brain, the intelligence itself?</b>
<b>I think there will be differentiation here, specifically in what kind of data they build their operations around.</b>
<b>It seems brain intelligence can't become a business model at this stage?</b>
<b>By brain intelligence I mean taking robots, deploying them into real-world scenarios, and carrying out tasks there, rather than just being a data collection factory. Right now, many robot companies are essentially operating as data collection factories at their core.</b>
<b>Personally, I am quite optimistic about Yushu (Unitree).</b>
<b>On Yushu, I think it follows a body-first model. If the key data ends up being body-independent, the big companies' large models may truly become the ultimate brain, and Yushu's differentiation is the most distinctive: it focuses firmly on perfecting the robot body itself. Going forward, its positioning is very clear. It doesn't compare itself with, or compete against, the brain companies, and it doesn't want to become a brain either. They are very pragmatic: they know where their strengths lie and which directions they don't want to develop in. Knowing your own boundaries is crucial.</b>
<b>So what role will it play within this ecosystem?</b>
<b>This kind of body company, the hardware manufacturer, will be a core element. If the big companies' brain teams later want to deploy their brain technology in real scenarios, they are very likely to look at Yushu first and collaborate with Yushu, because Yushu has already proven it is stable enough and ready for mass production.</b>
<b>Besides Yushu, which other robotics companies do you favor?</b>
<b>I believe Zhiyuan is doing very well in commercialization, because from day one they had a clear vision: for this to become a system, they have to fully integrate upstream and downstream. I also believe embodied AI is, in some sense, still a supply-driven market for now: you need to produce volume first to truly drive the whole industry and the whole supply chain forward. They have a very clear vision on this, and their mass production is handled very well in every respect.</b>
<b>What do you think about this industry? Of course it's still very early today, but if we have to talk about the endgame, what form will it take? Will the robot brain be dominated by a hegemon?</b>
<b>Will it be monopolized by a single company?</b>
<b>I think it might be similar to today's large-model industry. People used to think OpenAI could monopolize it.</b>
<b>Exactly.</b>
<b>But today that seems unlikely, because at the core it is still a closed data loop. If that closed data loop is controlled by a single entity, one with scale and its own largest deployment in real scenarios, collecting the most data back and training its own strongest brain, then it could indeed form a hegemony. Tesla is such a hegemon: they are leading in autonomous driving and doing very well. Domestically, of course, OEMs like Li Auto, XPeng, and NIO are all doing well too. But if the data model is independent of any single body, then the brain must evolve symbiotically, in cooperation with data providers, and in that case it is very difficult for a large-model provider to establish a monopoly on its own. So I believe the end state is more of an ecosystem: the best brain companies, the best data companies, and the best robot hardware companies collaborating strongly, together truly enabling the scenario companies to put these robots on the ground. Of course, some scenario companies may themselves be the best hardware companies; that is entirely possible.</b>
<b>It seems the brain is developing faster in the US, while the body is developing faster here in China. What impact will that have? Will the Chinese teams be able to catch up on the robot's brain?</b>
<b>My judgment, since we actually serve a large number of clients, is that they are very likely to catch up. For example, Qianwen (Qwen) is probably the best open-source large model right now. So the capability of domestic large models is extremely high, their determination here is strong enough, their infrastructure is good enough, and their talent density is high enough. It's more that, previously, the domestic big companies focused mainly on large models, large language models; they were determined to win that battle first. Now they have already started allocating resources specifically toward embodied AI. So I believe we will see significant improvement in this area.</b>
<b>Why have they begun shifting resources to embodied AI in the past three to six months?</b>
<b>What signs have they observed?</b>
<b>Actually, I think it’s not just the past three to six months.</b>
<b>It's probably more like the past year, almost.</b>
<b>Yes. I believe it's more that, first, the trend around large models has become relatively clear.</b>
<b>So they have the bandwidth to invest here.</b>
<b>Secondly, I think they have indeed recognized a core logic:</b>
<b>is the data fundamentally tied to the robot body, or independent of it?</b>
<b>If the data has to come from the body, I think it's very difficult for large-model companies to fully get involved.</b>
<b>Right? Then the best approach is to collaborate with a body provider, a hardware maker.</b>
<b>Exactly.</b>
<b>But if the core of this data is independent of the body, then I believe this is a clear opportunity for large-model companies.</b>
<b>So I think this is something the entire industry</b> <b>should gradually start to clarify.</b>
<b>Who will be the OpenAI of the Robotics field?</b>
<b>I think, first of all, OpenAI may well still be the OpenAI of robotics, because their robotics team is actually still a very strong team.</b>
<b>I think they definitely shouldn’t be underestimated.</b>
<b>And DeepMind, I think, absolutely could still be the DeepMind of this field.</b>
<b>I think they are an extremely stable,</b> <b>and exceptionally excellent team.</b>
<b>I think NVIDIA is very promising, because NVIDIA places great emphasis on physical AI. Jim Fan's team and Ming-Yu Liu's team are both strong enough, and they are well-resourced. On the domestic side, I think ByteDance and Alibaba's Qianwen are both extremely outstanding.</b>
<b>You're not optimistic about Musk?</b>
<b>I think xAI has potential, but Musk's current focus is on his core hardware. For xAI itself, the center is still the large model: he needs to keep his focus there and perfect the large model, because he hasn't won that battle yet.</b>
<b>That's right.</b>
<b>So that may be the most critical thing for xAI. And he has an inherent advantage that others don't, the hardware, and he must leverage it to the fullest. That, I think, is currently Tesla's main focus with the robot. So these two threads haven't fully converged yet.</b>
<b>Do you think the current approaches to the robot's brain have diverged, or have they converged?</b>
<b>I don't think they have fully converged. As with what we just said about the model and the data, the architecture of the robot's brain probably hasn't fully converged yet. Of course, on the existing architectures there are already hints of scaling laws, based on body-independent data, that is, data independent of any specific robot, generated from simulation and from human data. The question is whether this brain architecture can evolve further, and how it can make better use of world models, and so on. These are still open research problems.</b>
<b>We now have many new terms, including world models and spatial intelligence, as well as AI in the physical world. Are these all talking about the same thing?</b>
<b>Or similar things?</b>
<b>Let me explain these new concepts. They are actually not quite the same. Physical AI refers more to models that can act in the physical world, so it mainly covers autonomous driving and embodied intelligence; that is one definition of physical AI. Spatial intelligence, I think, is more about 3D spatial vision: whether we can effectively not just reconstruct but also generate 3D space, and make predictions based on it. World models, meanwhile, are more about having a sufficiently good understanding of, and predictive ability for, the physical world, while perhaps lacking the ability to act within it. That's roughly the distinction.</b>
<b>Today, since our main topic is data: if you could solve just one critical data problem that would enable a significant leap forward, what would it be?</b>
<b>For embodied AI, the most critical issue right now is probably evaluation, specifically scaling up evaluation. I believe that is the core problem. Why do I say that? Because the pathway of pretraining on body-agnostic data, and its scaling law, has already emerged, so evaluation is now the real bottleneck. If it can't be solved, it will be very difficult for anyone to measure their own intelligence improvements. That is the core issue.</b>
<b>Exactly.</b>
<b>So, as I just mentioned, we need to establish truly large-scale, realistic evaluations and build them well. This will be a capability everyone needs.</b>
<b>What about large language models?</b>
<b>Regarding data issues, what do you think is the most critical problem to solve?</b>
<b>For large language models, I think it is also about evaluation and the post-training process. Many current agents need better evaluation capabilities. So what is the problem we face now? It is an arms race: as your model's capabilities improve, you need even more skilled people to provide better feedback, design harder test questions, and build more effective evaluation metrics. That is probably the biggest problem large language models face now; it is essentially a higher-order competition over evaluation.</b>
<b>When do you think data issues will become completely irrelevant?</b>
<b>Early on, I believed there would come a day, maybe in fifteen or twenty years, when data would no longer be a problem. But now I increasingly reflect on this from first principles about humans: when do people stop wanting to read?</b>
<b>Or when do people stop wanting to learn?</b>
<b>I actually think the more outstanding a person is, the more they want to improve. Learning simply shifts from learning from others to benchmarking against yourself: against your own yesterday, against this morning, with your present self as the standard. Such a model will be even more eager to engage with a broader range of knowledge, but sometimes books alone are not enough; it may need to go out into the real world to practice, face setbacks, receive feedback, and keep pushing itself to improve. So my view has changed somewhat from before.</b>
<b>My current view is: the smarter it is, the more urgent its appetite for knowledge becomes; the demand for data only grows stronger. It may simply become unwilling to learn from outside and instead learn on its own.</b>
<b>Yes, I completely agree.</b>
<b>I think at the endgame, just as Elon Musk said, we might actually be living inside a simulation. Within itself, in a simulated environment we configure for it, based on success metrics it sets for itself, it continuously trains its own internal skills. There may come a day when AI learns from AI.</b>
<b>So does that mean the Data Factory will disappear?</b>
<b>I agree with that point.</b>
<b>What I mean is, I believe the Data Factory is not a first-principles demand.</b>
<b>Right? I think knowledge, or rather the human thirst for learning, is the first-principles demand.</b>
<b>Exactly. So I believe the Data Factory is more of a bulk-production pathway: large-scale generation of relatively standardized knowledge.</b>
<b>I think this pathway might soon become unnecessary.</b>
<b>So won’t you disappear then?</b>
<b>We are not just a Data Factory.</b>
<b>We, I believe, are still system-driven: centered on systems and on evaluation, helping clients' models identify problems and, based on that effective feedback and experience, helping them build up a set of capabilities. That set of capabilities includes demonstrations and also simulated environments. By the end stage, very likely no one will use my data; everyone will use simulated environments, applying RL within them, continuously refining their internal skills.</b>
<b>I think there may be such a day.</b>
<b>Will AI not need this? If AI is strong enough, does it still need an education system?</b>
<b>I think in the end it might not be an education system; it might be an environment. This environment is somewhat like a person learning in society: they always need an environment, whether a more digital one or a more physical one. Self-improvement has to happen within such a scenario. That scenario, that environment, is essentially what we may ultimately provide to our clients. It is somewhat like large language model training, where many providers, like Scale and others, offer something called RLinf, the reinforcement-learning environment services I mentioned, so that the model can refine its internal skills on its own. I believe that is the ultimate level of demand.</b>
<b>You might ask, what kind of environment did someone like Einstein have?</b>
<b>I think Einstein constructed a lot of this inside his own brain: many thought experiments and their premises. First, he had a basic understanding of physics. Then, based on that foundational understanding, on those fundamental theorems, he constructed many thought experiments, which in a way we can understand as simulations. Many of his concepts in general and special relativity may have come from mental thought experiments in his own head, trial and error to arrive at ideas. So essentially, to construct such a thought experiment you need some physics, some grounding, that is, some constraints. You need enough of these environments to let him run enough large-scale experiments inside.</b>
<b>We started talking about this earlier: do you think simulation is the direction you had always been looking for but never found before, the one you have now found?</b>
<b>A life direction? Yes, I think simulation is it, because I believe simulation is the cornerstone that can truly solve the embodied data problem. Or rather, simulation is the prerequisite that all of embodied intelligence needs in order to learn. Of course, simulation alone probably cannot completely solve the problem. It needs to be the pyramid I mentioned earlier: a system capability centered on simulation, but not built solely on simulation.</b>
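The "environment plus scaled-up evaluation" idea that runs through this conversation can be made concrete with a toy sketch: a minimal gym-style simulated environment and a success-rate evaluation loop over many rollouts. Everything here (`ToyReachEnv`, `greedy_policy`, `evaluate`) is invented for illustration and is not any real simulator, benchmark, or product.

```python
# Toy sketch of an RL-style environment with a batch evaluation loop.
import random

class ToyReachEnv:
    """1-D reaching task: move a point toward a random goal within a step budget."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.pos, self.goal, self.steps = 0.0, self.rng.uniform(-5, 5), 0
        return (self.pos, self.goal)

    def step(self, action):
        self.pos += max(-1.0, min(1.0, action))  # clamp the move to [-1, 1]
        self.steps += 1
        success = abs(self.pos - self.goal) < 0.1
        done = success or self.steps >= 20
        return (self.pos, self.goal), done, success

def greedy_policy(obs):
    pos, goal = obs
    return goal - pos  # head straight for the goal

def evaluate(env, policy, episodes=100):
    """Success rate over many simulated rollouts -- the kind of scaled-up,
    repeatable evaluation the conversation treats as the current bottleneck."""
    wins = 0
    for _ in range(episodes):
        obs, done, success = env.reset(), False, False
        while not done:
            obs, done, success = env.step(policy(obs))
        wins += success
    return wins / episodes

print(evaluate(ToyReachEnv(), greedy_policy))  # → 1.0 for this toy policy
```

The point of the sketch is the shape, not the task: once tasks live in a simulated environment with a programmatic success signal, evaluation scales by simply running more episodes, which is far harder to do with real robots.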