
A 7-hour marathon interview with Saining Xie: World Models, AMI Labs, Yann LeCun, Fei-Fei Li, and 42

By Zhang Xiaojun Podcast

Summary

Topics Covered

  • B-Class Trajectories Outperform Elite Paths
  • Vision Drives Cambrian Explosion
  • Follow Heart Over Rankings
  • Research Nonlinearity Yields Breakthroughs
  • Representation Learning Core of Intelligence

Full Transcript

This subtitle was translated by AI. We cannot guarantee its accuracy and it is provided for entertainment purposes only.

<b>Hello everyone</b> <b>I'm Xiaojun</b> <b>In this episode, we have come to New York, USA</b> <b>It is the Chinese New Year right now</b> <b>New York just had a heavy snowfall</b> <b>This is the coldest winter New York has had in years</b> <b>The streets are still covered with unmelted ice and snow</b> <b>But today's conversation</b> <b>gave me a feeling of</b> <b>the warmth of everyday life after the thaw</b> <b>Sitting across from me today</b>

<b>is young scientist Xie Saining</b> <b>He has just set out on an entrepreneurial journey together with Turing Award winner Yann LeCun</b> <b>Their new lab, AMI Labs</b> <b>has just completed its first mega-scale funding round</b> <b>The team currently has 25 members</b> <b>Xie Saining has always told me</b> <b>he is not the "chosen one"</b> <b>he is the ordinary one</b>

<b>And now, here is my interview with Xie Saining</b> <b>Ilya called me</b> <b>and I didn't say anything</b> <b>I just turned down OpenAI</b> <b>They sent me an offer</b> <b>and I said I'm not going, sorry</b> <b>But wherever there is love, there must also be hate</b> <b>They are two sides of the same coin</b> <b>[laughter]</b> <b>This morning we are in New York</b> <b>shooting B-roll in Brooklyn</b> <b>I really like it here</b> <b>Because I live near Times Square</b> <b>I think that area</b>

<b>is still a very stereotypical New York</b> <b>But coming here</b> <b>feels like a New York full of artistic vibe</b> <b>and lively neighborhood energy</b> <b>Yeah</b> <b>I think this area of Dumbo is of course very artistic</b> <b>Right, in many films</b> <b>There was a Korean film called Past Lives</b> <b>In that film, you may have seen</b> <b>the carousel</b> <b>And the Dumbo bridge over there, right</b> <b>Only tourists go to Times Square</b> <b>I am a tourist</b>

<b>Real New Yorkers would never go</b> <b>But actually the area near NYU is also really good</b> <b>That area is called</b> <b>Greenwich Village</b> <b>And that area is also a "village"</b> <b>And that area also has a great neighborhood vibe</b> <b>Why did you come to New York to do academia?</b>

<b>That doesn't seem like a choice many people make</b> <b>Well, not really</b> <b>But there is quite a long history</b> <b>That is true</b> <b>Various reasons</b> <b>I think</b> <b>Of course</b> <b>Also because I genuinely yearned for this city</b> <b>Right</b> <b>I longed for many elements of this city</b> <b>The people here</b> <b>And including NYU</b> <b>That was also part of it</b> <b>And of course the main reason was still Yann (Yann LeCun, Turing Award winner and Executive Chairman of AMI Labs)</b>

<b>And the AI efforts here</b> <b>Right</b> <b>NYU actually does quite well</b> <b>But on the other hand</b> <b>NYU also has a very strong film school</b> <b>And many directors I admire</b> <b>Like Martin Scorsese</b> <b>Including more recently Chloé Zhao</b> <b>are all NYU graduates</b> <b>So that's also part of the reason</b> <b>Right, I</b> <b>I told you yesterday</b>

<b>I think — how many years has it been since I came to America</b> <b>I came in 2013</b> <b>So it's been about 13 years</b> <b>My 'post-training' is a bit broken now</b> <b>So I tend to mix Chinese and English</b> <b>Sorry about that, viewers</b> <b>I'll try my best to explain</b> <b>Please bear with me</b> <b>Mm, it seems I haven't found anywhere</b> <b>a podcast of yours</b> <b>or an interview</b>

<b>So</b> <b>Is this your first time doing a podcast or interview?</b>

<b>First time doing a podcast</b> <b>First time doing an interview</b> <b>Right, you can probably find many</b> <b>Me going out to various conferences, right</b> <b>giving talks at conferences and such</b> <b>many of those</b> <b>Why haven't you been on a podcast all these years</b> <b>or done an interview</b> <b>I think</b> <b>Mm</b> <b>I don't know</b> <b>I think I'm more suited to being a listener</b> <b>I really enjoy podcasts</b>

<b>Right</b> <b>I often listen to a lot of podcasts</b> <b>My Spotify</b> <b>YouTube, commuting every day, and before bed</b> <b>I often listen to podcasts in my spare time</b> <b>Mm right</b> <b>And I think I have quite a desire to express myself</b> <b>Or rather</b> <b>I also talk about a lot of things with friends privately</b> <b>With students</b> <b>I think, mm</b> <b>Getting everyone together to chat, I think that's very enjoyable</b> <b>Mm, but this podcast thing</b> <b>I don't know either</b>

<b>Maybe it's because nobody invited me</b> <b>That shouldn't be the case</b> <b>Um, well, a little I guess</b> <b>But I still think</b> <b>Maybe it's also because I'm more introverted</b> <b>A lot of times I feel, mm</b> <b>I don't know which things should be said</b> <b>which things are worth saying</b> <b>which things people would want to hear</b> <b>But now I think, gradually</b> <b>as I get older</b> <b>it's fine, it's okay</b>

<b>I have gained the courage to be disliked</b> <b>I actually looked up a lot about you online</b> <b>a lot of information</b> <b>But I found</b> <b>everyone's description of you</b> <b>all starts from SJTU's ACM Class</b> <b>And I'm also very curious</b> <b>What was Xie Saining like before that?</b>

<b>Could you start from your</b> <b>earliest memories of the world</b> <b>as the starting point</b> <b>and tell us about your childhood and growing up</b> <b>I</b> <b>Ah OK</b> <b>See, this is exactly why I didn't want to do a podcast</b> <b>[laughter]</b> <b>Because honestly</b> <b>I've never prepared for this</b> <b>Or rather, you have to let me think back</b> <b>from the earliest memories</b> <b>Well it's</b> <b>I think starting from when I was little</b> <b>Maybe</b>

<b>When I was four or five years old</b> <b>Mm, my mom would take me traveling everywhere</b> <b>That might be my earliest memory</b> <b>Oh, where did you travel?</b>

<b>All kinds of places</b> <b>Right, because she did some business</b> <b>and traveled all around the country, right</b> <b>I remember very clearly, right</b> <b>my first impression of Shanghai</b> <b>And going to</b> <b>Sichuan, and then</b> <b>all kinds of tourist spots you can imagine</b> <b>Um</b> <b>But for me</b> <b>If I really have to dig into the family background</b> <b>My dad is a complete homebody</b> <b>Mm</b> <b>never goes out</b>

<b>But his favorite thing to do is read books</b> <b>So at home, there is a study room</b> <b>with several walls full of books</b> <b>So</b> <b>When I was young, I was basically in this state</b> <b>either running around outside</b> <b>being taken traveling by my mom</b> <b>or at home browsing through all kinds of books</b> <b>books I should read, books I shouldn't — I'd look at them all</b> <b>Right</b> <b>And I think that was my early childhood</b> <b>And then later on</b>

<b>And indeed later</b> <b>I think our generation's growing-up experience</b> <b>was quite different</b> <b>Because I think — well, I don't know</b> <b>I think kids today</b> <b>might, in this AI era</b> <b>have the same feelings</b> <b>But back then for me</b> <b>When I was about 9 years old</b> <b>I got my first computer</b> <b>And from that time on</b> <b>not for anything productive, right</b> <b>buying games box by box and playing them</b> <b>Then the internet came along</b>

<b>and for the first time I felt this information explosion</b> <b>So</b> <b>That was the first time I understood what "content" meant</b> <b>And at that time I felt</b> <b>I suddenly had more desire to express myself</b> <b>Because reading books is still one-directional</b> <b>this learning process</b> <b>though also very broadening</b> <b>But online, there were BBS forums back then</b> <b>And you could go online to share your opinions</b> <b>I still remember, right</b> <b>There was Sina Blog</b> <b>It probably doesn't even exist anymore</b>

<b>But I wrote a lot of blog posts</b> <b>Oh really?</b>

<b>Ah um</b> <b>about all kinds of random topics</b> <b>Now</b> <b>Looking back now, it's definitely very funny</b> <b>But</b> <b>What was the most popular article?</b>

<b>Quite a few, I think</b> <b>I remember</b> <b>It felt like forced melancholy — writing sad words without real cause</b> <b>Oh</b> <b>Maybe including QQ Space back then, right</b> <b>Everyone always wanted</b> <b>a platform to express themselves</b> <b>And then later</b> <b>there were actually even more new media emerging</b> <b>including blogs</b> <b>then Weibo, right</b>

<b>But back then it wasn't Weibo actually</b> <b>It was Fanfou — I don't know if you've heard of it</b> <b>Of course</b> <b>Wang Xing, right</b> <b>And at that time I was also a heavy Fanfou user</b> <b>On it</b> <b>Fanfou can still be logged into now</b> <b>But it's really hard to look at</b> <b>Sometimes I look at it</b> <b>I think, oh gosh</b> <b>Should I just delete it all</b> <b>But then I think</b> <b>Let it stay there</b>

<b>Let it become part of the internet memory</b> <b>Mm</b> <b>But I think at that time</b> <b>I think</b> <b>I think this explosive growth of the internet</b> <b>made me become</b> <b>someone interested in many things</b> <b>Mm</b> <b>I think that's how it was</b> <b>So, your parents</b> <b>Your mom was in business</b> <b>Were you from a business family?</b>

<b>Not really, not really</b> <b>Um</b> <b>Well, my dad basically</b> <b>He studied psychology in college</b> <b>He also did some education work before</b> <b>And later also did some</b> <b>media work at TV stations</b> <b>Oh</b> <b>Maybe the same profession as you</b> <b>Oh</b> <b>Right</b> <b>So my memory of him when I was little</b> <b>is of him carrying a camera everywhere</b> <b>Oh, that's interesting</b>

<b>Right right right</b> <b>But in my family there really wasn't</b> <b>anyone who studied pure science and engineering</b> <b>This also gave your personality</b> <b>I think quite an artistic side</b> <b>Maybe but</b> <b>But I think I</b> <b>I think the one thing I want to say is</b> <b>Growing up in such a relaxed family environment</b> <b>has really shaped my model of the world</b> <b>I think, about my own</b>

<b>I'm still quite proud of it</b> <b>Mm</b> <b>quite proud</b> <b>Because I think I would</b> <b>Or rather, you just asked why I came to New York</b> <b>I think that's part of it too</b> <b>Mm</b> <b>I think I would hope for myself</b> <b>or hope for the people around me</b> <b>to look at the world with a more open mind</b> <b>Were your grades always very good?</b>

<b>Because you were admitted to SJTU's ACM Class through recommendation</b> <b>Um, not at all</b> <b>It was from high school</b> <b>Right, I think it was like this</b> <b>So, you can see</b> <b>Now I have many, many friends around me</b> <b>who are actually all</b> <b>those who've come up through the top track</b> <b>Right</b> <b>the best high school, right</b> <b>then the best undergraduate</b> <b>competing in competitions</b> <b>the best undergraduate</b> <b>then the best PhD</b>

<b>then after finishing, going to teach at, say, the top four universities</b> <b>There's a very clear main path, right</b> <b>And I have great respect for them</b> <b>I'm completely not like that</b> <b>I'm a, um</b> <b>At most, I have a B-class kind of trajectory</b> <b>Oh</b> <b>Like you</b> <b>And many</b> <b>My decisions are actually quite mystical</b> <b>Because I think</b> <b>I haven't deliberately, in some kind of</b> <b>meritocratic</b> <b>this kind of</b>

<b>setting</b> <b>framework to strive for things</b> <b>Many times it was actually quite random</b> <b>And maybe that's just the way it is</b> <b>The intelligence just isn't enough</b> <b>But indeed</b> <b>For example, when being admitted via recommendation, right</b> <b>That was also very accidental</b> <b>Anyway, there were two</b> <b>awards in informatics and math competitions</b>

<b>And at that time SJTU happened to have this</b> <b>program where you could enter early</b> <b>basically trying to recruit some students</b> <b>and have them skip the college entrance exam</b> <b>Right</b> <b>Actually, I was originally on the gaokao path</b> <b>being prepared to</b> <b>take the gaokao</b> <b>So I struggled with this for a long time</b>

<b>The teachers at school all said, no, that won't do</b> <b>How can you back out at the last minute</b> <b>Your grades are already very good, right</b> <b>You should of course aim for Tsinghua or Peking University</b> <b>But my inner thought was</b> <b>Well, SJTU seems great, I think</b> <b>I've been to Shanghai</b> <b>I feel like me and this city</b> <b>and this school share a compatible spirit</b> <b>And I just wanted to study computer science</b> <b>And I think</b>

<b>SJTU's computer science was also very good at that time</b> <b>I had also heard of this ACM program</b> <b>Although the selection process back then</b> <b>actually required you to</b> <b>enter early</b> <b>and after entering there was</b> <b>a summer-camp-like program</b> <b>Right, and you would undergo some tests</b> <b>before you could enter this class</b> <b>Right</b> <b>But many interesting things happened in that process</b> <b>Of course, first let me say</b>

<b>I think I was quite</b> <b>How should I put it</b> <b>If I could choose again</b> <b>I wouldn't regret it at all</b> <b>Right, I think that summer before entering early</b> <b>was a highlight of my life</b> <b>Why</b> <b>Because during those two months, I did nothing</b> <b>just played games in the dorm</b> <b>Why is that a highlight?</b>

<b>Because never again in my life</b> <b>did such a moment come</b> <b>What games were you playing back then?</b>

<b>Um, many games</b> <b>Playing Dota and such</b> <b>Just in the dorm</b> <b>It was that kind of</b> <b>the kind I saw online during high school</b> <b>college life</b> <b>You know?</b>

<b>Ah, it was</b> <b>There was the studying part</b> <b>But also some</b> <b>finding yourself</b> <b>and in this kind of</b> <b>aimless wasting of time</b> <b>kind of experience</b> <b>Right</b> <b>So Xie Saining's life highlight was wasting time</b> <b>Really? In the dorm?</b>


<b>[laughter] You could say that</b> <b>Haha, that's very interesting</b> <b>You keep saying you weren't among those with the best grades</b> <b>But you've also had a pretty smooth path</b> <b>You seem to be among the highest achievers too</b> <b>Why is your self-perception like that?</b> <b>My grades are actually average</b> <b>It depends on who I'm comparing to</b> <b>Compared to the top competition winners</b> <b>like what I just described</b> <b>those who had a very smooth path</b> <b>the top students from Yao Class</b>

<b>and then comparing with the top four PhD programs, top four professors</b> <b>Then I really am</b> <b>far behind</b> <b>But on the other hand</b> <b>I think</b> <b>I'm still quite grateful for all of these experiences</b> <b>Because I feel</b> <b>continuing the story from here</b> <b>I think it's actually quite interesting</b> <b>For example, when I went to SJTU</b> <b>SJTU wasn't necessarily</b> <b>in terms of computer science</b>

<b>and artificial intelligence</b> <b>a particularly leading</b> <b>school</b> <b>And now</b> <b>for example, the ACM Class has become</b> <b>Of course, this has nothing to do with me</b> <b>But my juniors</b> <b>including my seniors, right</b> <b>whether doing entrepreneurship or academia</b> <b>shining and contributing everywhere</b> <b>And also</b> <b>We have a very strong</b> <b>alumni network</b> <b>everyone connected, working on things together</b>

<b>I think</b> <b>I still think</b> <b>it's an upward trajectory</b> <b>An upward trajectory</b> <b>And then later</b> <b>There is another very interesting thing here</b> <b>I want to mention</b> <b>which is my ACM Class interview</b> <b>And in the interview process</b> <b>there would be senior professors</b> <b>Back then it was Professor Shen Enshao who interviewed us</b> <b>This interview</b>

<b>didn't actually ask you technical questions</b> <b>He would ask you, what books do you like to read</b> <b>Mm</b> <b>And I feel this was somehow destined</b> <b>there was some fate involved</b> <b>Because I was very anxious back then</b> <b>and almost couldn't answer</b> <b>Then I told him</b> <b>A book I actually really like</b> <b>and one I just finished recently, is this</b> <b>This book is called What Is Mathematics?</b>


<b>Then Professor Shen Enshao followed up and asked</b> <b>Who is the author of this book</b> <b>to test me</b> <b>And I was a bit stunned</b> <b>And you know, right</b> <b>A high school student</b> <b>I can't remember foreign names either</b> <b>I thought about it</b> <b>and ultimately managed to answer</b> <b>It was Richard Courant</b> <b>And then Professor Shen said</b> <b>Ah right</b> <b>You must remember this name</b> <b>Because he is</b>

<b>one of the greatest mathematicians of the 20th century</b> <b>Why does this make me feel</b> <b>there's a certain destiny at play</b> <b>or some coincidence in this</b> <b>is because now at NYU</b> <b>the department I'm in</b> <b>this institute is the Courant Institute of Mathematical Sciences</b> <b>which is Richard Courant's institute</b> <b>the department he built from the ground up</b> <b>Mm</b> <b>So, I think it's quite interesting</b> <b>Right</b>

<b>And the application process later was actually similar</b> <b>I think</b> <b>Or to put this from another angle</b> <b>I think</b> <b>It seems like the world</b> <b>always doesn't want me to do what I want to do</b> <b>Why</b> <b>But</b> <b>But I insist on doing exactly what I want to do</b> <b>Oh</b> <b>For example, during my undergraduate years</b> <b>I was initially interested in computer vision, right</b> <b>Or rather</b>

<b>I developed some interest in artificial intelligence</b> <b>At that time also</b> <b>Starting out in the ACM Class</b> <b>Everyone would start doing this kind of</b> <b>research internship</b> <b>and would go to various labs within the school</b> <b>to different laboratories</b> <b>And the lab I went to</b> <b>was one doing</b> <b>neuroscience + AI work</b> <b>called BCMI</b> <b>And the bookshelves had so many books about consciousness</b>

<b>about the brain</b> <b>about images</b> <b>And then</b> <b>about how we perceive the real world</b> <b>books like these</b> <b>And after looking at them I thought, wow</b> <b>That's so interesting</b> <b>And um</b> <b>Later, in this process</b> <b>I also got to know a senior classmate of mine</b> <b>This senior was Hou Xiaodi</b> <b>Oh</b> <b>And he is also very well known</b> <b>He had previously also started a company</b>

<b>and now is also doing entrepreneurship</b> <b>And every time I talk with him</b> <b>he always says</b> <b>The world has changed</b> <b>But we haven't changed</b> <b>By "we" I specifically mean him and me</b> <b>Because every time we chat</b> <b>it's exactly the same as what we talked about over ten years ago</b> <b>Right, at that time he was a legend at the school</b> <b>Right, and he did two legendary things</b> <b>The first legendary thing was</b>

<b>that as an undergraduate</b> <b>he published a paper at CVPR (one of the world's top computer vision conferences)</b> <b>Right, and in this paper</b> <b>was a very elegant algorithm</b> <b>with only 7 lines of code in total</b> <b>that solved a very important problem</b> <b>Mm</b> <b>CVPR now accepts maybe several thousand papers each year</b> <b>Right, tens of thousands of submissions</b> <b>So now, when we're looking to recruit undergrads</b>

<b>everyone has three, four, five papers each</b> <b>CVPR is already nothing special</b> <b>But at that time</b> <b>at schools in mainland China</b> <b>being able to publish work at such a top conference</b> <b>was actually extremely, extremely difficult</b> <b>very rare</b> <b>very rare</b> <b>And then</b> <b>For an undergraduate to publish such work</b> <b>was unheard of</b> <b>So</b> <b>Everyone truly admired him very, very much</b> <b>Mm</b> <b>But then</b>

<b>he did a second very impressive thing</b> <b>which was, um</b> <b>he led a team</b> <b>and wrote something</b> <b>called the "SJTU Survival Guide"</b> <b>"SJTU Student Survival Guide"</b> <b>Oh, this was written by a team?</b>

<b>Um, he should be the main author</b> <b>I don't know</b> <b>A team worked on it with him</b> <b>This thing still has an archive online now</b> <b>I welcome everyone</b> <b>to go check it out</b> <b>So what does this guide talk about</b> <b>Some of the things in it</b> <b>some passages</b> <b>I went back and revisited just a couple of days ago</b> <b>I found it very, very interesting</b> <b>Right um</b> <b>What does it talk about</b> <b>It talks about</b>

<b>why people should learn</b> <b>China's education system</b> <b>the university model</b> <b>what exactly is wrong with it</b> <b>where you should spend your time</b> <b>to achieve the life you want</b> <b>Mm</b> <b>And it also guides everyone on how to do research</b> <b>what the purpose of research is</b> <b>the purpose of research is not to churn out papers</b> <b>but is truly about exploring the infinite unknown</b> <b>things like this</b> <b>Of course</b>

<b>It also teaches everyone how to skip class</b> <b>how to</b> <b>complete assignments</b> <b>in a quicker way</b> <b>Right, it's this kind of pamphlet</b> <b>I also went and read it</b> <b>It says if a person</b> <b>treats grade scores as their highest pursuit</b> <b>then they are a casualty of that system</b> <b>Mm, I completely agree</b> <b>Right, looking back now, I think these things</b> <b>probably had a subtle influence</b>

<b>really influenced my understanding of many things</b> <b>When he published this</b> <b>what year were you in?</b>

<b>Um</b> <b>First or second year</b> <b>First or second year</b> <b>You already knew him in your first or second year?</b>

<b>By that time he had already been admitted</b> <b>and gone to</b> <b>Caltech for his PhD</b> <b>Because he also graduated from this same lab</b> <b>he and I essentially communicated online</b> <b>He had been admitted to a great school</b> <b>And we were all very, very envious</b> <b>At that time</b> <b>And he and I would still</b> <b>on Google Chat back then</b>

<b>chat about many, many things</b> <b>And he really</b> <b>gave me a lot of advice</b> <b>I still remember</b> <b>What advice?</b>

<b>Um, nothing specific</b> <b>More often</b> <b>when chatting with him online</b> <b>it was about research</b> <b>Right, what exactly should be done</b> <b>sharing my own confusion with him</b> <b>And then</b> <b>how to get a paper published</b> <b>roughly seeking his advice</b> <b>Right</b> <b>But at that time</b> <b>I think through Xiaodi</b> <b>through the books I read</b> <b>I had basically decided</b> <b>I felt this is what I want to do with my life</b>

<b>I think this thing is just so fascinating</b> <b>computer vision</b> <b>Um</b> <b>At that time there wasn't actually a name for it</b> <b>or rather, computer vision was slowly starting</b> <b>as a term</b> <b>But actually before</b> <b>Right</b> <b>and people had been processing image or visual information</b> <b>for a long time already</b> <b>For example, people would do so-called image processing</b> <b>Um</b> <b>more often starting from an EE major</b>

<b>Right, and computer vision</b> <b>was, um</b> <b>gradually becoming more and more popular</b> <b>Mm</b> <b>which was around when I started learning</b> <b>these things</b> <b>Right, and then</b> <b>Um, as I just said</b> <b>the world never wants me to do what I want</b> <b>and that's because when I was in SJTU's ACM Class</b> <b>there was actually another feature</b>

<b>which is that every student in this class</b> <b>had to do an internship in their third year</b> <b>Mm</b> <b>That's actually quite common now</b> <b>But at that time</b> <b>it was an innovation of this class's</b> <b>founder, Professor Yu Yong</b> <b>So at that time, most people in the ACM Class</b> <b>would work with Microsoft Research Asia</b> <b>which is MSRA</b> <b>through a cooperative program</b>

<b>so many of our students were sent there</b> <b>to do approximately</b> <b>a 6-month internship</b> <b>Right so</b> <b>Um, originally for me</b> <b>If I did nothing</b> <b>I would go to MSRA for an internship</b> <b>Right, although that was also good</b> <b>But at that time</b> <b>there actually wasn't a vision group</b> <b>willing to accept undergrads from the ACM Class for internships</b> <b>Why is that?</b>

<b>Um, I don't know</b> <b>Maybe because back then, professors like Ma Yi</b> <b>and Sun Jian were all there</b> <b>Kaiming should have been there too by then</b> <b>And I think</b> <b>they probably didn't like having too many</b> <b>undergrads who don't know anything</b> <b>coming to participate in things, right</b> <b>At that time, they were extremely talented</b> <b>Yes yes yes exactly</b> <b>But we really didn't know anything</b> <b>Right</b> <b>I think I can gradually understand this now</b>

<b>Um, but at that time, um, there was a choice</b> <b>which was still to go to MSRA</b> <b>but doing research</b> <b>unrelated to vision</b> <b>And Professor Yu also told me, well</b> <b>actually for you undergrads</b> <b>the most important thing now is still to have research experience</b> <b>and learn how to do research</b> <b>the specific</b> <b>direction</b> <b>isn't very important</b> <b>Mm right um</b> <b>But I didn't think that was okay</b>

<b>I felt I couldn't accept that</b> <b>doing a completely different</b> <b>direction</b> <b>I wanted to understand this field more</b> <b>I hoped to work diligently</b> <b>on some things</b> <b>And then</b> <b>and hopefully one day be like senior Xiaodi</b> <b>being able to publish a CVPR paper</b> <b>Xiaodi was already your idol at that time, wasn't he</b> <b>A bit</b> <b>He was many people's idol</b> <b>Right, during SJTU days</b> <b>Oh</b>

<b>um, and then</b> <b>So I started thinking about how to handle this</b> <b>And started sending emails</b> <b>So I contacted NUS in Singapore, right</b> <b>National University of Singapore's</b> <b>Professor Yan Shuicheng's lab</b> <b>Mm right</b> <b>This was entirely my own doing</b> <b>I didn't even tell Professor Yu</b> <b>And after it was confirmed, hey</b>

<b>I can have this internship opportunity</b> <b>And on his side there were already some</b> <b>subsidies</b> <b>and we had discussed timing and arrangements</b> <b>the structure was already fairly well set up</b> <b>Then I went to find Professor Yu</b> <b>I said, Professor Yu</b> <b>I really don't want to go to MSRA</b> <b>I want to go to Singapore</b> <b>to this school's lab</b> <b>to do the research I want to do</b> <b>Mm</b> <b>Professor Yu was silent for a few seconds</b>

<b>Right, um, maybe I guess</b> <b>I don't know</b> <b>I haven't asked him this question</b> <b>But I guess his inner thought was</b> <b>this student is so headstrong</b> <b>Right</b> <b>Because in the professor's mind</b> <b>MSRA was a better choice</b> <b>Yes yes</b> <b>One, a better choice</b> <b>Two, it keeps everyone together</b> <b>Right</b> <b>I think one reason is of course</b> <b>easier to manage</b>

<b>Second, there would be more synergy</b> <b>Right, everyone could still exchange ideas</b> <b>Then you going to a new place</b> <b>what does that even mean</b> <b>is this place even reliable</b> <b>is what you want to do reliable</b> <b>this thing might be uncontrollable</b> <b>Were you conflicted about it?</b>

<b>I wasn't conflicted</b> <b>But I really appreciate Professor Yu</b> <b>in that he</b> <b>Anyway, he was silent for a few seconds</b> <b>and finally said okay</b> <b>You go ahead. Right, um, and so I went</b> <b>But this thing</b> <b>after it happened</b> <b>Professor Yan's group</b> <b>NUS's lab</b> <b>became an option for my juniors</b> <b>an available</b> <b>position</b> <b>Mm</b>

<b>So I think</b> <b>I still wanted to take some initiative</b> <b>and do what I wanted to do</b> <b>Right</b> <b>Image-related</b> <b>artificial intelligence</b> <b>was still very early at that time</b> <b>what exactly attracted you to it</b> <b>why did it attract you</b> <b>so much that it led you to make many different choices</b> <b>Because I think the way I experience the world</b>

<b>is through vision</b> <b>Mm, I would think</b> <b>I was probably a bit bored when I was little</b> <b>and I would think, hey</b> <b>humans have so many</b> <b>senses, right</b> <b>If I had to remove one</b> <b>which would I remove</b> <b>I think maybe I could be deaf</b> <b>maybe I can't speak</b> <b>maybe I have no touch, no smell</b> <b>I would live very miserably</b> <b>but maybe that could still be accepted</b>

<b>But if I had no vision</b> <b>then I can't watch cartoons anymore</b> <b>I also can't watch movies</b> <b>I also can't play games</b> <b>I would seem to have</b> <b>lost my independence as a person</b> <b>And I think</b> <b>Of course</b> <b>these initial thoughts resonated quite well</b> <b>with what I later read in some books</b> <b>Um, because visual signals</b> <b>actually occupy a large part of the brain's cortex</b> <b>um, depending on how you say it, right</b> <b>the main visual areas</b>

<b>might be about</b> <b>um, 30% of the entire brain</b> <b>But um</b> <b>when the entire brain sees an image</b> <b>the activated parts might make up 70%</b> <b>Mm</b> <b>Right</b> <b>So</b> <b>Actually, all of us humans are visual creatures</b> <b>And this</b> <b>Right, that's what I think</b> <b>I'm also a visual creature</b> <b>I also very much like</b> <b>looking at things</b> <b>Animals too</b> <b>Not just humans</b>

<b>Not just humans, right</b> <b>What you said is very, very correct</b> <b>Mm, actually it's not entirely like that</b> <b>Because actually 530 million years ago</b> <b>on Earth</b> <b>these creatures actually had no eyes</b> <b>everyone lived in the deep sea</b> <b>where light couldn't get in</b> <b>And then suddenly one day</b>

<b>some creatures were able to</b> <b>develop their vision</b> <b>Although still very weak</b> <b>only able to see a faint</b> <b>signal</b> <b>Right</b> <b>But at this point they were amazing</b> <b>They could see the prey they wanted to hunt</b> <b>where it is, and swim over quickly</b> <b>and eat it</b> <b>They could also avoid predators</b> <b>someone's coming to catch me</b> <b>I immediately run away</b> <b>Once vision was born</b>

<b>Um</b> <b>other creatures in the evolutionary process</b> <b>had to evolve stronger vision</b> <b>Right because</b> <b>if you don't have stronger vision</b> <b>you'll be eaten</b> <b>Right</b> <b>So an arms race began</b> <b>So this is the so-called Cambrian Explosion</b> <b>That is to say, on Earth before the Cambrian period</b>

<b>there may have been only a handful of species</b> <b>But after the Cambrian</b> <b>suddenly, like a big bang</b> <b>hundreds of thousands of species emerged</b> <b>One leading theory is</b> <b>that the origin</b> <b>of this explosion</b> <b>was actually an arms race among creatures</b> <b>at the visual level</b> <b>Yes yes</b> <b>So what you said is completely right</b> <b>I think</b> <b>This is actually not something unique to humans</b>

<b>I think all animals are actually the same</b> <b>Mm</b> <b>And so</b> <b>I'm still quite interested in this</b> <b>And you know</b> <b>this thing called vision</b> <b>isn't just a sense</b> <b>There is a saying that</b> <b>the eye is actually part of the brain</b> <b>but it's the only</b> <b>part of the brain exposed to the real world</b> <b>because other parts of the brain</b> <b>are all hidden behind our skull</b> <b>Mm right</b>

<b>So thinking about it this way</b> <b>solving vision isn't about solving vision itself</b> <b>but about solving intelligence itself</b> <b>Right, so I think everything can be connected</b> <b>From before you even officially started your first year</b> <b>hiding in the dorm playing games</b> <b>wasting time</b> <b>to you finding computer vision</b> <b>as the main thread of your life</b> <b>what happened in between?</b>

<b>Mm, actually nothing much happened</b> <b>Actually many times</b> <b>I think it all comes from chance</b> <b>Mm</b> <b>Just like if I hadn't read that book back then</b> <b>I probably wouldn't have taken this path</b> <b>But sometimes I feel this is also inevitable</b> <b>I still quite believe</b> <b>everyone actually has their own destiny</b> <b>Or rather</b> <b>Sometimes I tell students</b> <b>Don't think that if you don't do this</b> <b>someone else will</b> <b>do it</b>

<b>Instead think: if you don't do this</b> <b>this thing will never happen in this world</b> <b>What does that mean?</b>

<b>meaning</b> <b>you are now working on a research topic</b> <b>Right</b> <b>and the thing you're doing</b> <b>how you got here step by step</b> <b>to this endpoint</b> <b>this thing</b> <b>completely depends on yourself</b> <b>your personal life experiences</b> <b>your background growing up</b> <b>maybe a book you read</b> <b>maybe a conversation you had with someone</b> <b>maybe it's genetics</b> <b>your genes</b>

<b>being simply different from others'</b> <b>Right, I think</b> <b>every individual</b> <b>in this world is very unique</b> <b>everyone is a variable in this world</b> <b>and who can say for certain</b> <b>It's possible</b> <b>you are the most important variable in this world</b> <b>This is your worldview</b> <b>I think it's my optimistic side</b> <b>[laughter]</b> <b>Right</b> <b>Mm</b> <b>During your time at NUS</b> <b>Did you get what you wanted to get?</b>

<b>Um, I think</b> <b>I think yes</b> <b>First of all, I made a lot of very good friends</b> <b>I can gradually elaborate on that later</b> <b>But I got to know</b> <b>For example</b> <b>Actually the main person who mentored me then</b> <b>my mentor was Feng Jiashi</b> <b>He was a PhD student at the time</b> <b>Right, and he mentored me</b> <b>And then did some work</b> <b>We published a paper</b> <b>Not a top conference either</b> <b>Unfortunately, I still couldn't publish at CVPR during undergrad</b>

<b>Mm</b> <b>But we published</b> <b>a decent one</b> <b>this BMVC paper</b> <b>Right, it was</b> <b>a not-so-top-tier computer vision</b> <b>paper</b> <b>So um</b> <b>I think</b> <b>I still think there was a lot to gain</b> <b>For the first time I learned</b> <b>um research</b> <b>what it's about</b> <b>Right</b> <b>Having actually written a paper versus not having written one</b> <b>I think there's still a big difference</b> <b>Was that your first paper on CV?</b>

<b>Yes yes</b> <b>But you could say</b> <b>this was a CV paper</b> <b>but actually it wasn't really about CV</b> <b>Its only application</b> <b>was face recognition</b> <b>it was more like a</b> <b>machine learning paper</b> <b>But that was normal at the time</b> <b>everyone studying CV</b> <b>or researching CV</b> <b>was doing similar things</b> <b>the so-called</b> <b>manifold clustering related things</b>

<b>Right, but it was at that time point</b> <b>That was 2012, 2013</b> <b>2012 right</b> <b>So it was right at the AlexNet moment</b> <b>Mm</b> <b>So I was also at that time point</b> <b>learning about this</b> <b>Right, and then right</b> <b>and learning about ImageNet</b> <b>learning about deep learning</b> <b>So I think that was actually a starting point</b>

<b>That was when I just started doing research</b> <b>and learning how to do research</b> <b>and also a starting point for all of deep learning</b> <b>This was your third year</b> <b>Third year, right</b> <b>University was almost over at that point</b> <b>So you actually during your undergraduate years</b> <b>had already found your main thread</b> <b>I think so</b> <b>Mm</b> <b>What was your intrinsic reward mechanism at that time?</b>

<b>I think it's still curiosity</b> <b>Right, it's that I</b> <b>I think</b> <b>I want to know why</b> <b>Right</b> <b>Or rather</b> <b>This might also be my own explanation</b> <b>I also don't know</b> <b>what exactly my intrinsic motivation is</b> <b>But</b> <b>Mm</b> <b>I want to understand more</b> <b>I want to understand</b> <b>more about this field</b>

<b>I want to engage with the top</b> <b>students in this field</b> <b>researchers</b> <b>professors</b> <b>and have deeper exchanges</b> <b>Mm-hmm</b> <b>So this is also why later</b> <b>I decided I still wanted to go abroad</b> <b>wanted to apply</b> <b>I think also</b> <b>Probably this reason too</b> <b>Here I want to ask a small extra question</b> <b>You must also have many friends from Tsinghua's Yao Class</b> <b>Right, I also have many friends from Tsinghua's Yao Class</b>

<b>who have come on my show</b> <b>Yes, I want to know</b> <b>Tsinghua's Yao Class</b> <b>do you think compared to SJTU's ACM Class</b> <b>what is the biggest difference</b> <b>in terms of training</b> <b>I think the ACM Class is probably less competitive</b> <b>One difference is, um, again</b> <b>this thing</b> <b>is actually still Professor Yu's design</b> <b>He, I think, is, um</b> <b>quite a great educator</b> <b>I can say that</b> <b>Mm right</b> <b>Like back in our days</b> <b>actually in our curriculum design</b>

<b>um, there would be many</b> <b>seemingly quite strange settings</b> <b>For example, we had a course</b> <b>that Professor Yu was actually very proud of</b> <b>called the 'Student Forum'</b> <b>What is this Student Forum?</b>

<b>It means everyone comes to this class</b> <b>and spends maybe 45 minutes to 1 hour</b> <b>to do a presentation</b> <b>give a talk</b> <b>And this talk cannot be related to studying</b> <b>It can be about anything in the world</b> <b>but cannot be related to studying</b> <b>Right so um</b> <b>some people would talk about philosophy</b> <b>some about history</b> <b>some about society</b>

<b>some about many very interesting things</b> <b>Of course science was also allowed</b> <b>Mm right</b> <b>And I think</b> <b>I think this might be a difference in cultivation approach</b> <b>Of course I've never been to Yao Class</b> <b>so I'm not sure</b> <b>But I think</b> <b>everyone was still in a relatively relaxed</b> <b>and more liberal arts-focused</b> <b>kind of setting moving forward</b> <b>Mm, the impression you give me is</b> <b>you don't seem like someone who likes excessive competition</b>

<b>Um, I think I'm not afraid of competition</b> <b>but I genuinely don't like excessive competition</b> <b>And I think</b> <b>excessive competition definitely doesn't help innovation</b> <b>Right, I think</b> <b>I think this</b> <b>Of course that's not saying ACM Class has no competition</b> <b>there is actually very strong competition</b> <b>Were you a winner in this competition?</b>

<b>I wasn't eliminated</b> <b>OK</b> <b>Right</b> <b>But actually it can't really be called elimination</b> <b>which was</b> <b>everyone felt whether they were suited or not</b> <b>and would choose to stay or leave</b> <b>What was your approximate ranking in undergrad?</b>

<b>There were maybe 30-40 people total</b> <b>Maybe ranked around the teens</b> <b>Just not pushing myself too hard</b> <b>Not pushing myself too hard</b> <b>Mm</b> <b>Did you ever think about becoming</b> <b>for example, first or second in the ACM Class?</b>

<b>Was that your goal?</b>

<b>I couldn't have</b> <b>Right [laughter]</b> <b>Really, really couldn't</b> <b>Because we had very strong</b> <b>Right um</b> <b>students with competition backgrounds</b> <b>And the evaluation criteria</b> <b>I think were actually quite multidimensional</b> <b>it's hard to say who was first or second</b> <b>Or if you only look at GPA</b> <b>then I really couldn't</b> <b>Mm right</b> <b>And I think</b> <b>And for this</b> <b>maybe also inspired by the Survival Guide</b> <b>I also didn't care that much</b> <b>So from that time you started</b>

<b>following your interests very closely</b> <b>Yes right</b> <b>I think pursuing my interests</b> <b>and I would do everything possible to make it happen</b> <b>Right, especially in the application process it was the same</b> <b>Mm</b> <b>A previous example was you going to NUS</b> <b>instead of going to Microsoft Research Asia</b> <b>Right, when applying</b> <b>Actually</b> <b>there's another story here</b> <b>which is that I almost didn't get into any school</b> <b>well, not quite</b> <b>I did have some offers</b>

<b>but none from a professor I wanted to work with</b> <b>doing computer vision</b> <b>Oh</b> <b>This made me very, very depressed</b> <b>And at one point I would think</b> <b>Okay, I could go do some</b> <b>recommendation system research</b> <b>some more</b> <b>um, you know</b> <b>machine learning research</b> <b>Oh</b> <b>Um, until finally</b> <b>And then I</b> <b>I started frantically writing emails to everyone</b> <b>those cold-contact emails</b> <b>Mm right</b>

<b>And then Professor Tu Zhuowen</b> <b>Right, Professor Tu</b> <b>replied to me</b> <b>But by then it was already very, very late</b> <b>Because you know</b> <b>For PhD applications</b> <b>the deadline is generally April 15th</b> <b>Right, I actually received this reply in April</b> <b>Oh</b> <b>Right</b> <b>Who was the professor you most wanted to work with?</b>

<b>At that time</b> <b>Um</b> <b>At that time there weren't many professors doing computer vision</b> <b>Right, and then</b> <b>I think Professor Tu</b> <b>was certainly</b> <b>a professor I admired very, very much</b> <b>So I think he was also my top choice</b> <b>Right mm</b> <b>And of course</b> <b>there would be many</b> <b>You would of course say</b> <b>Like at Stanford</b> <b>Berkeley right</b> <b>MIT would have</b> <b>many pioneers of computer vision</b> <b>But at that time</b>

<b>those were beyond my reach</b> <b>Mm right</b> <b>So I sent this email to Professor Tu</b> <b>And he replied to me</b> <b>And I remember very clearly</b> <b>Because of the time difference</b> <b>So Professor Tu asked if we should have a call</b> <b>When are you free</b> <b>I said I'm free at any time</b> <b>And so at 3 AM</b> <b>downstairs in the dormitory</b> <b>I had this phone call with Professor Tu</b>

<b>Telling him why I thought</b> <b>I wanted to do this</b> <b>Mm, what things I had done before</b> <b>And why I thought</b> <b>I very much admire your research</b> <b>I think we can work together</b> <b>Right so</b> <b>Later, Professor Tu rescued me</b> <b>Very, very, very lucky</b> <b>In the last few days</b> <b>In the last few days he rescued me</b> <b>But there was another twist later</b> <b>Because at first Professor Tu Zhuowen</b>

<b>was actually at UCLA</b> <b>Right</b> <b>So the offer I received was UCLA's offer</b> <b>And I got my visa sorted and was ready to enroll</b> <b>And then about a week before</b> <b>Professor Tu said</b> <b>I'm sorry</b> <b>I'm going to change jobs</b> <b>I'm at UCLA</b> <b>for various reasons</b> <b>I don't want to stay anymore</b> <b>I don't want to continue here</b> <b>I'm going somewhere else</b> <b>Where am I going?</b>

<b>Right now I can't tell you either</b> <b>I don't know either</b> <b>Because he was also in interviews at that time</b> <b>Oh really?</b>

<b>Oh really?</b> <b>And he told me</b> <b>You have a few options</b> <b>One is you can stay at UCLA</b> <b>and I'll hand you over to other professors</b> <b>Or you can wait</b> <b>and see how my situation works out</b> <b>And possibly</b> <b>if I go to a school you're willing to come to</b> <b>you can come with me</b> <b>So did you wait?</b>

<b>Or did you immediately say, I choose you?</b>

<b>I basically said</b> <b>I immediately said, I choose you</b> <b>You didn't care about the school?</b>

<b>Um</b> <b>I think I didn't care about the school</b> <b>And I still think</b> <b>all these things are very interesting</b> <b>Because back then if you looked at UCSD</b> <b>in terms of overall rankings</b> <b>it was nothing compared to UCLA</b> <b>Mm</b> <b>Now it's completely different</b> <b>If you look at CS rankings</b> <b>or at AI hiring</b> <b>and students</b> <b>including faculty resources</b>

<b>in terms of AI strength</b> <b>I think UCSD</b> <b>is already among the top few</b> <b>Back then, it was completely different</b> <b>Back then</b> <b>And I actually always wanted to collaborate with a professor</b> <b>named Serge Belongie</b> <b>who had just decided to leave UCSD too</b> <b>Well, so I felt everything was hopeless</b> <b>which was</b> <b>the place I was going didn't seem highly ranked</b> <b>um, and then</b> <b>faculty were also leaving</b>

<b>But I thought about it and said</b> <b>none of this matters</b> <b>none of it is important</b> <b>what matters is who I'm working with and on what</b> <b>and whether this is something I want to do</b> <b>I think putting aside all this noise</b> <b>this is the only thing I want to care about</b> <b>Mm, that's very interesting</b> <b>Mm</b> <b>So this kind of thing happened several times</b> <b>I just said</b> <b>At SJTU it was also an upward trajectory</b> <b>And then going to</b>

<b>UCSD</b> <b>That was also part of it</b> <b>which was</b> <b>Of course</b> <b>I'm not saying this has anything to do with me</b> <b>I don't think it has anything to do with me</b> <b>But somehow I feel I can see a place</b> <b>or even a person</b> <b>their upside potential</b> <b>that is, their potential</b> <b>Mm</b> <b>And I'm willing to grow together with those places</b> <b>I think</b> <b>This is something I feel quite deeply</b>

<b>How long did it take you to find out Professor Tu was going to UCSD?</b>

<b>Um, maybe a few months later</b> <b>Right, maybe one or two months later</b> <b>Were you worried at the time?</b>

<b>Of course I was worried</b> <b>Right</b> <b>Because Professor Tu is actually very humble</b> <b>extremely capable but very humble</b> <b>So he would always give me a heads-up saying</b> <b>the school I'm going to</b> <b>might be ranked lower</b> <b>you should think about it</b> <b>Right, what did you say?</b>

<b>I don't remember very well what I said</b> <b>But again, for me</b> <b>this might not be that important</b> <b>And</b> <b>and at that time it wasn't yet time to make a decision</b> <b>Right, why should I</b> <b>worry in advance about things that haven't happened</b> <b>So I didn't think too much about it</b> <b>Did anyone else make this choice?</b>

<b>Among the students Professor Tu communicated with</b> <b>Um, basically none</b> <b>I was the first student he recruited at UCSD</b> <b>I think just based on that</b> <b>Professor Tu must like you very much</b> <b>Um, I think all of this is</b> <b>I think it was also him saving me</b> <b>Um indeed</b> <b>But it wasn't only the rescue at the beginning</b> <b>later, doing research</b> <b>during the PhD process</b> <b>I think he truly helped me</b>

<b>Right, like my internship in Singapore and such</b> <b>you could say we were doing some research</b> <b>but in reality</b> <b>it was still small-scale stuff</b> <b>having someone next to you teaching you</b> <b>the feeling is still different</b> <b>Professor Tu is the type who sits beside your monitor</b> <b>and goes through the code with you line by line</b> <b>that kind of teacher</b> <b>Mm, and he often</b>

<b>I think proudly would tell us these things</b> <b>And I think he is very deserving</b> <b>of this pride, meaning</b> <b>he published several papers</b> <b>that actually had an important influence</b> <b>on later computer vision</b> <b>all completed as sole author works</b> <b>And these works didn't have, like now</b> <b>everyone using PyTorch</b> <b>with so many open-source communities</b> <b>so many libraries you can use</b>

<b>right, having GPUs</b> <b>in his time there was nothing</b> <b>he had to write from the ground up</b> <b>For example, for a task like image segmentation</b> <b>he had to write from scratch</b> <b>about 50,000 lines of code</b> <b>He even sent me this code to look at</b> <b>That included the lowest level</b> <b>including some distributed training</b> <b>a whole series of things</b> <b>all written in C++</b> <b>Right, 50,000 lines of code</b>

<b>I think</b> <b>On one hand I feel I'm very lucky</b> <b>not needing to go through all that</b> <b>But on the other hand</b> <b>I think actually</b> <b>their generation in America</b> <b>these scientists</b> <b>these professors are truly admirable</b> <b>Right, if not for them</b> <b>there would be no us today</b> <b>They actually, um</b> <b>blazed a trail</b> <b>Right, this path didn't originally exist</b>

<b>As I said, right</b> <b>publishing a CVPR paper</b> <b>was actually a very, very difficult thing</b> <b>And there was a certain circle</b> <b>a certain fixed circle</b> <b>Right, and I think it required Professor Tu</b> <b>and actually his boss</b> <b>Professor Zhu Songchun</b> <b>and including later people like Fei-Fei (Li Fei-Fei, Stanford professor, co-founder and CEO of World Labs)</b> <b>and so on</b> <b>Professor Fei-Fei</b>

<b>everyone blazing this trail</b> <b>so that we have a path to walk</b> <b>Mm, I saw a Xiaohongshu comment saying</b> <b>Xie Saining was unremarkable in China</b> <b>nothing special</b> <b>made a big splash when he got to America</b> <b>So what exactly is the variable?</b>

<b>First, I don't think I was unremarkable in China</b> <b>Mm, I don't accept that</b> <b>And I didn't make a big splash in America either</b> <b>I don't accept that either</b> <b>I feel like the things I've done</b> <b>have been a fairly smooth</b> <b>a very gradual process</b> <b>Right, and or rather I think this is also what I hope</b> <b>um, as a researcher, right</b> <b>this kind of science practitioner</b>

<b>I hope to be in</b> <b>meaning this is not a momentary</b> <b>burst of hormones or adrenaline</b> <b>this thing</b> <b>might be a lifetime of building</b> <b>a very quiet process</b> <b>I hope</b> <b>to be in such a state</b> <b>When I say such a state</b> <b>it's because I know</b> <b>many people are in this state</b>

<b>the researchers I most admire</b> <b>they are in this state</b> <b>they didn't say</b> <b>there was this sudden rise to fame</b> <b>or at least their way of doing things is not</b> <b>or their purpose is not to become suddenly famous</b> <b>Right, I think so</b> <b>Then what is it for?</b>

<b>It's for thinking problems through</b> <b>Mm</b> <b>How did your PhD work unfold?</b>

<b>The PhD work was also very interesting</b> <b>PhD work</b> <b>Um, I think it was also through</b> <b>Professor Tu's hands-on mentoring</b> <b>Right, but um</b> <b>We had our first paper</b> <b>By the way, I</b> <b>During my PhD</b> <b>I wasn't a successful PhD student either</b> <b>by today's standards</b> <b>I published maybe five or six</b> <b>top conference papers</b> <b>What level is that?</b>

<b>I don't know</b> <b>That should have been fine for that era</b> <b>the level needed to get a job at a top lab</b> <b>Now it might already be different</b> <b>Right, now</b> <b>many of my students</b> <b>publish many more papers than I did</b> <b>and the quality of work is also much better</b> <b>But anyway</b> <b>At the beginning</b> <b>I think we did a work called</b> <b>Deeply Supervised Nets</b> <b>Mm</b> <b>This work</b> <b>was actually</b> <b>Me and another more senior PhD student</b>

<b>completed it together in collaboration</b> <b>And at this time</b> <b>This was around 2013, 2014</b> <b>And at this time, deep learning finally began to explode</b> <b>But I think this was also a very interesting moment</b> <b>Because actually many people didn't accept this</b> <b>Especially many professors working in computer vision</b> <b>didn't even accept this</b> <b>Everyone thought</b>

<b>deep learning was still a kind of alchemy</b> <b>still a black box</b> <b>people trusted traditional machine learning theory more</b> <b>trusting SVMs, or trusting some</b> <b>Bayesian theories</b> <b>Right</b> <b>being able to pivot in time to do deep learning research</b> <b>This now, looking back</b> <b>with the benefit of hindsight</b> <b>is a no-brainer</b>

<b>you didn't need to make that choice</b> <b>right, you should just do it</b> <b>But at the time, making such a choice</b> <b>I think required some courage</b> <b>So this is actually</b> <b>another reason I admire Professor Tu very, very much</b> <b>and I was</b> <b>deeply affected by</b> <b>this one thing</b> <b>That is to say</b> <b>he actually pivoted very promptly</b> <b>So this Deeply Supervised Nets</b> <b>was in this era</b>

<b>our first deep learning work</b> <b>Right, so this thing</b> <b>was actually simple</b> <b>it was about how</b> <b>all of these neural networks</b> <b>Um</b> <b>previously were just a single stream</b> <b>a long chain</b> <b>with your input</b> <b>and getting your output</b> <b>And now Deeply Supervised Nets</b> <b>meaning</b> <b>you can now actually have multiple branches</b>

<b>that is, your neural network</b> <b>can actually have multiple exits</b> <b>and at different exits</b> <b>you can apply a supervision signal</b> <b>In this way</b> <b>the most direct benefit is</b> <b>that you can backpropagate</b> <b>not only from the signal at the far end</b> <b>back to</b> <b>the early layers</b> <b>you don't need</b> <b>to backpropagate from the far end</b>

<b>all the way to the beginning</b> <b>you can actually backpropagate</b> <b>from an intermediate node</b> <b>This way</b> <b>you can partially solve the vanishing gradient problem</b> <b>Mm</b> <b>And this actually relates to what came later</b> <b>for example, ResNet actually bears some resemblance</b> <b>it's actually</b> <b>or rather in that era</b> <b>everyone actually wanted to solve this problem</b> <b>So Deeply Supervised Nets</b> <b>was one</b> <b>way to solve this problem</b>
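The multi-exit scheme described here — an auxiliary classifier attached to each stage, each receiving its own supervision signal so gradients reach early layers directly — can be sketched roughly as follows. This is a toy PyTorch sketch, not the paper's actual architecture: the layer sizes, the MLP stages, and the `aux_weight` companion-loss weight are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    """Toy network with auxiliary 'exits': each hidden stage gets its own
    classifier head, so a supervision signal can be applied mid-network."""
    def __init__(self, in_dim=16, hidden=32, num_classes=10, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.heads = nn.ModuleList()  # one auxiliary classifier per stage
        dim = in_dim
        for _ in range(num_stages):
            self.stages.append(nn.Sequential(nn.Linear(dim, hidden), nn.ReLU()))
            self.heads.append(nn.Linear(hidden, num_classes))
            dim = hidden

    def forward(self, x):
        logits = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            logits.append(head(x))  # an "exit" at this depth
        return logits  # last element is the final prediction

def deeply_supervised_loss(logits_list, target, aux_weight=0.3):
    """Final-exit loss plus down-weighted companion losses on earlier exits,
    so gradients flow into early layers directly from their own heads."""
    ce = nn.CrossEntropyLoss()
    loss = ce(logits_list[-1], target)
    for aux in logits_list[:-1]:
        loss = loss + aux_weight * ce(aux, target)
    return loss
```

The design point is exactly what the transcript describes: backpropagation no longer has to travel from the far end all the way to the beginning, because each intermediate exit injects gradient at its own depth.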

<b>Actually this thing</b> <b>though it was long ago</b> <b>right, this was again 12 years ago</b> <b>but I think research is like this</b> <b>12 years later</b> <b>actually some of our current papers</b> <b>are again using the same</b> <b>kind of design</b> <b>sometimes we don't even realize it</b> <b>I think this is very interesting</b> <b>But let's not talk about 12 years later</b> <b>Right, so my second paper</b> <b>was called Holistically-Nested Edge Detection (HED)</b>

<b>a work on edge detection</b> <b>HED</b> <b>Right, I think about this paper</b> <b>I'm actually quite proud of it</b> <b>Because frankly</b> <b>it solved a research problem</b> <b>um, it was both lucky</b> <b>and unlucky</b> <b>The lucky part is</b> <b>this paper was a good paper</b> <b>The unlucky part is</b> <b>once the problem was solved</b> <b>nobody worked on it afterward</b> <b>so nobody cited your paper</b> <b>[chuckles]</b>

<b>so it lost many citations</b> <b>[chuckles]</b> <b>Um, but um</b> <b>But this work</b> <b>is essentially Deeply Supervised Nets</b> <b>DSN applied to</b> <b>images</b> <b>or edge detection</b> <b>but it's really a holistic</b> <b>what we call pixel labeling</b> <b>a pixel-level</b> <b>prediction</b> <b>task</b> <b>implementation</b> <b>Mm</b>
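The HED-style setup — DSN-style supervision carried over to per-pixel prediction — can be sketched as a tiny network where every stage emits its own side-output edge map and a learned 1x1 convolution fuses them. This is a simplified sketch only: the real HED builds on a pretrained VGG backbone with pooling and upsampled side outputs, which this toy version omits so that all maps share one resolution; the channel widths are arbitrary.

```python
import torch
import torch.nn as nn

class TinyHED(nn.Module):
    """Simplified sketch of HED-style side outputs: each conv stage emits its
    own edge map, and a learned 1x1 fusion combines all of them."""
    def __init__(self, stages=3, width=8):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = 1
        for _ in range(stages):
            self.blocks.append(
                nn.Sequential(nn.Conv2d(ch, width, 3, padding=1), nn.ReLU()))
            ch = width
        # one 1-channel side-output head per stage
        self.side = nn.ModuleList(nn.Conv2d(width, 1, 1) for _ in range(stages))
        # fuse all side maps into the final edge map
        self.fuse = nn.Conv2d(stages, 1, 1)

    def forward(self, x):
        sides = []
        for block, side in zip(self.blocks, self.side):
            x = block(x)
            sides.append(side(x))  # deep supervision can target each map
        fused = self.fuse(torch.cat(sides, dim=1))
        return sides, fused
```

Each side output can be trained against the ground-truth edge map (the deep supervision), while the fused map is the final prediction that approximates human-perceived edges.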

<b>And this</b> <b>also opened up many new ways of thinking for me</b> <b>because I would discover</b> <b>that a neural network</b> <b>each of its layers</b> <b>actually has implicit structure</b> <b>and information in it</b> <b>your neural network, again</b> <b>has not only input and output</b> <b>in between there is a lot of information</b> <b>it represents</b> <b>a so-called hierarchical</b>

<b>hierarchical structure of the world</b> <b>For edge detection</b> <b>it means</b> <b>that your early layers</b> <b>output edges that are</b> <b>so-called coarser</b> <b>rougher edges</b> <b>Right, and the further up you go</b> <b>the more refined your edges become</b> <b>So</b> <b>finally you can take all of these edges</b> <b>and fuse them together</b> <b>to get an edge output result</b> <b>that best approximates human perception</b> <b>I think this</b> <b>was actually</b>

<b>also giving me a new understanding of deep learning</b> <b>It's a very interesting, very interesting thing</b> <b>You can think of it as a black box</b> <b>but each part of this black box</b> <b>you can open up</b> <b>connect some new inspiration</b> <b>and reach some new goals</b> <b>I think this was very enlightening for me</b> <b>And this paper at the time</b> <b>also had a big impact on my life</b>

<b>because it was published at ICCV</b> <b>and also received an award</b> <b>This award was the Marr Prize</b> <b>Best Paper Award nomination</b> <b>not the Best Paper Award itself</b> <b>just a nomination</b> <b>But actually the Marr Prize</b> <b>selects two papers</b> <b>that is to say</b> <b>the Marr Prize and the Honorable Mention are two separate awards</b> <b>So this made me feel</b> <b>if you want to talk about sudden fame</b> <b>I really did feel at the time</b>

<b>look, I also became famous at a young age</b> <b>Now, of course</b> <b>we have many Chinese students</b> <b>also on the world stage</b> <b>winning so many Best Papers</b> <b>Right, but back then for me</b> <b>walking onto that stage</b> <b>or that podium</b> <b>and giving the award presentation</b> <b>giving this talk</b> <b>I think it moved me greatly</b> <b>I felt, wow</b>

<b>my life has begun</b> <b>Right, and I will keep working hard</b> <b>I will have more and more best papers</b> <b>Ah unfortunately</b> <b>this was my last time receiving Best Paper</b> <b>[laughter]</b> <b>What year of your PhD was this?</b>

<b>Second year of PhD</b> <b>[laughter]</b> <b>And up until now</b> <b>Just a few days ago during Spring Festival</b> <b>people were still texting saying</b> <b>Happy New Year</b> <b>May you have many Best Papers</b> <b>I said it's been 10 years</b> <b>everyone has been wishing this for me</b> <b>and I still haven't received another one</b> <b>Do you still want one?</b>

<b>Um</b> <b>Good question</b> <b>Well I think</b> <b>this thing isn't that important to me anymore</b> <b>On one hand</b> <b>I know the process</b> <b>I know actually</b> <b>um, whether I get a Best Paper or not</b> <b>might not represent the quality of the work</b> <b>And I also know the Best Paper I got</b> <b>Honorable Mention</b> <b>was mostly luck too</b> <b>Mm-hmm</b> <b>It's a hugely random process</b>

<b>whether a paper gets accepted or not</b> <b>what kind of award it can get</b> <b>I think this thing</b> <b>is very, very random</b> <b>And if something is this random</b> <b>it shouldn't be something a researcher</b> <b>should focus on</b> <b>So in your second year</b> <b>you felt life had finally begun</b> <b>Right, and life finally began</b> <b>and then reality immediately knocked me over</b> <b>Right um</b> <b>[chuckles]</b> <b>but it wasn't that exaggerated</b> <b>That is to say, um</b> <b>I think this is another</b>

<b>during my PhD</b> <b>years</b> <b>I'm again grateful to Professor Tu</b> <b>in that he</b> <b>was actually a very, very open-minded</b> <b>person who let us explore all kinds of</b> <b>different directions</b> <b>So during my PhD I did 5 internships in total</b> <b>I think even today that seems</b> <b>even though schools</b> <b>and industry already collaborate so broadly</b> <b>I think it's still hard to imagine</b> <b>Why did you want to do internships?</b>

<b>I just wanted to go out and see</b> <b>Mm</b> <b>maybe it's the same as traveling when I was young</b> <b>I wanted to know in different places in this world</b> <b>different organizations</b> <b>what kind of things were happening</b> <b>what people were doing what things</b> <b>I wanted to know all of this</b> <b>And on one hand I tell you</b> <b>right, I always wanted to do</b> <b>artificial intelligence</b> <b>or wanted to do computer vision</b> <b>But on the other hand</b>

<b>I would also ask myself</b> <b>What if I'm wrong?</b>

<b>Right</b> <b>What if</b> <b>what if</b> <b>right, what if</b> <b>the world</b> <b>has something even more interesting happening</b> <b>what would I do</b> <b>Right so</b> <b>I think</b> <b>This is another motivation of mine</b> <b>You went to NEC Labs America</b> <b>went to Adobe</b> <b>went to Meta</b> <b>went to Google Research and DeepMind</b> <b>Right, thank you</b> <b>for the background check</b> <b>Right yes</b> <b>Those are the 5 places</b> <b>And um</b>

<b>actually the first four were all in the Bay Area</b> <b>So</b> <b>I was actually quite happy during that time</b> <b>every year</b> <b>I had my own beat-up car</b> <b>and every summer</b> <b>I would sublet my dorm room</b> <b>drive my car all the way from Southern California to Northern California</b> <b>Mm</b> <b>an 8-hour drive</b> <b>Sometimes,</b> <b>once or twice, with friends</b> <b>but most of the time I was on the road alone</b>

<b>I think this was actually quite cool</b> <b>Right, all my worldly possessions in my car</b> <b>two suitcases</b> <b>not taking anything else</b> <b>because I'd given up my place too</b> <b>when I came back I'd have to find housing again</b> <b>Right, um, no fixed abode</b> <b>this nomadic researcher lifestyle</b> <b>I was still quite happy</b> <b>Which of these 5 places did you like most?</b>

<b>I think each has its own characteristics</b> <b>Like among these 5</b> <b>So I recently also told students</b> <b>I have many students</b> <b>and their internships</b> <b>actually didn't produce much good work</b> <b>And I told them</b> <b>I would use myself as an example</b> <b>I said, I did 5 internships</b> <b>and half of them I didn't produce anything</b> <b>Mm</b> <b>And how long were these internship periods?</b>

<b>Generally 3 to 6 months</b> <b>So about half of each year</b> <b>half the time at school</b> <b>half the time in the Bay Area</b> <b>of course for the last one I was in London</b> <b>And I think it's not about liking or not liking</b> <b>I would try to diversify</b> <b>Um, that is</b> <b>I would</b> <b>hope each place I went was different</b> <b>I hoped for a more diverse experience</b> <b>So NEC Labs America</b> <b>was of course the first place I went</b>

<b>And I think there</b> <b>I also published a CVPR paper</b> <b>And there, um, there were many great colleagues</b> <b>mostly Chinese people</b> <b>Mm</b> <b>and after work at lunch everyone would go together</b> <b>to Cupertino to eat</b> <b>That's my impression of it</b> <b>I very, very much liked this group</b> <b>really liked everyone's attitude toward research</b>

<b>And I also published my own paper</b> <b>So I think I was very happy about this experience</b> <b>Right</b> <b>NEC Labs America back then should have also been a gathering place for deep learning</b> <b>Dr. Yu Kai (founder and CEO of Horizon Robotics) also worked there</b> <b>Yeah</b> <b>Mm</b> <b>Yes</b> <b>Of course, it had two divisions</b> <b>one in Princeton</b> <b>and one in Cupertino (in Silicon Valley, California)</b> <b>All the vision</b> <b>and media people were in the Bay Area</b> <b>And all those doing</b>

<b>traditional</b> <b>machine learning work</b> <b>were all</b> <b>concentrated in Princeton</b> <b>Right</b> <b>And some of what follows we can skip</b> <b>But anyway, at Adobe I just didn't produce anything</b> <b>The reason is, um</b> <b>Adobe is a very, very artistic</b> <b>company with an artistic temperament</b> <b>Oh</b> <b>Makes sense</b> <b>And at that time I was in San Francisco</b> <b>And then</b>

<b>having me do things related to design</b> <b>and crowdsourcing</b> <b>meaning you'd write some</b> <b>Mechanical Turk</b> <b>internet</b> <b>user feedback systems</b> <b>right, some user feedback systems</b> <b>and using it to guide some</b> <b>machine learning and, um, this kind of</b> <b>computer vision tasks</b>

<b>like segmentation</b> <b>this thing</b> <b>I just didn't do well</b> <b>I still feel guilty toward my mentor</b> <b>Of course they were all very kind</b> <b>Right, but this</b> <b>was also a time that made me realize it's OK</b> <b>not producing anything</b> <b>is actually not the end of the world</b> <b>right, it's not the end of the world</b> <b>But that period was actually quite depressing</b> <b>And this depressive period</b>

<b>actually continued until my Meta internship</b> <b>in school</b> <b>also didn't seem to produce any interesting work</b> <b>And then after going to Meta</b> <b>then um</b> <b>the internship was maybe only three months</b> <b>In the first two months I basically also</b> <b>was exploring some things</b> <b>exploring some things</b> <b>also related to neural network architecture</b> <b>some things</b> <b>but also didn't discover anything</b> <b>worth mentioning</b>

<b>And then suddenly a turning point happened</b> <b>This um</b> <b>He Kaiming (main inventor of ResNet) joined FAIR</b> <b>At that time</b> <b>Right</b> <b>So this was about halfway through my internship</b> <b>Professor He Kaiming then joined FAIR</b> <b>and became a</b> <b>full-time researcher</b> <b>Mm so</b> <b>That was my first time working with Kaiming</b> <b>That was my first time</b> <b>learning from him</b> <b>Right, and then</b>

<b>And then</b> <b>And we built some deep friendships then</b> <b>I think</b> <b>Because at that time he was coming to America for the first time</b> <b>It was his first time</b> <b>He had many firsts that were</b> <b>at FAIR</b> <b>right</b> <b>At that time he also couldn't drive</b> <b>first time in America, unfamiliar with everything</b> <b>I had to drive him out to eat</b> <b>and drive him home sometimes</b> <b>[chuckles]</b> <b>But he later learned to drive himself</b>

<b>And he also didn't know how to use Linux</b> <b>Mm, that's also very interesting</b> <b>Right, because at Microsoft they all used</b> <b>they could only program with Windows</b> <b>Right</b> <b>So I had to teach Kaiming how to use the cluster</b> <b>how to use Linux</b> <b>Right, but you'll find</b> <b>Kaiming</b> <b>this is Kaiming</b> <b>not without reason</b> <b>Right, and I think</b> <b>someone like him truly has this kind of</b>

<b>you could call it an aura</b> <b>or I could call it some kind of</b> <b>reality distortion field</b> <b>this is actually Steve Jobs's term</b> <b>meaning</b> <b>the people around Steve Jobs, influenced by him</b> <b>would all feel reality had been distorted</b> <b>right, some things that were completely impossible</b> <b>could now gradually actually be done</b> <b>I think Kaiming also has this kind of magic</b> <b>Right, and then</b> <b>So this was my first time seeing</b>

<b>how a truly top-level researcher does</b> <b>their research</b> <b>At that point your internship only had one month left</b> <b>How were you able to build such deep friendship?</b>

<b>I think, one is daily life interactions</b> <b>Why did he choose you?</b>

<b>Why did he communicate with you?</b>

<b>Because I was an intern there</b> <b>and my</b> <b>manager entrusted me to Kaiming</b> <b>because I wasn't doing well anyway</b> <b>hadn't produced anything</b> <b>Then when Kaiming came, my manager said, hey</b> <b>Kaiming, you come guide him</b> <b>have him join in the discussions</b> <b>Right, so there was still a month left</b> <b>And Kaiming said</b> <b>why don't we participate together</b> <b>in the ImageNet Challenge</b> <b>Right, just compete in this competition</b> <b>Mm</b>

<b>And then I said, hey</b> <b>Sure, let's compete in this competition</b> <b>Because when Kaiming was at Microsoft</b> <b>his work came about</b> <b>through competing in ImageNet</b> <b>right, building up step by step</b> <b>Simply put</b> <b>Mm</b> <b>And so we also went to</b> <b>play with this ImageNet</b> <b>challenge</b> <b>Mm</b> <b>And in this process we discovered</b>

<b>hey, some ideas we had thought of before</b> <b>were actually reasonable</b> <b>actually very good ideas</b> <b>Right</b> <b>And I actually proposed this idea to Kaiming</b> <b>Kaiming's magic is</b> <b>he can take</b> <b>all very ordinary things</b> <b>and turn them into gold-like</b> <b>valuable ideas</b> <b>So we did this ResNeXt work</b> <b>And then</b> <b>this was also our</b>

<b>solution for the ImageNet challenge</b> <b>a submitted solution</b> <b>And we got second place</b> <b>Didn't get first place</b> <b>But I think we were actually the most effective</b> <b>Should have been first</b> <b>Because the first-place solution was</b> <b>an ensemble solution</b> <b>which combined some previous algorithms</b> <b>doing model ensembling</b> <b>a combined solution</b> <b>Right</b> <b>And we were actually a completely new framework</b> <b>Mm</b> <b>Right, and then</b> <b>And at that time</b>

<b>Um</b> <b>Right, I think</b> <b>I think what ResNeXt wanted to convey</b> <b>is also about how we</b> <b>by modifying the neural network architecture</b> <b>learn a more scalable</b> <b>right, a more extensible representation</b> <b>such a representation</b> <b>this thing is also very interesting</b> <b>because this</b> <b>idea is very, very simple</b> <b>It says</b> <b>originally</b> <b>for example, my ResNet is just a serial network</b>

<b>right, just layer by layer by layer</b> <b>like this</b> <b>conv layers</b> <b>now I can in parallel</b> <b>expand into several different groups</b> <b>each group with its own</b> <b>small network</b> <b>so you have networks</b> <b>within a large network</b> <b>distributed in parallel with many small networks</b> <b>Mm</b> <b>why is this interesting</b> <b>because in today's terms</b> <b>this is MoE (Mixture of Experts)</b> <b>Oh</b>

<b>So</b> <b>So at least on ImageNet at the time</b> <b>we already saw a kind of scaling behavior</b> <b>that is, the more groups you have</b> <b>the more sparse your neural network becomes</b> <b>and the more sparse your neural network</b> <b>the wider it gets</b> <b>but you can at the same flops</b> <b>computation level</b> <b>get better results</b> <b>it converges faster</b> <b>and your final results also improve</b>
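The grouped-parallel idea described above can be sketched with a quick parameter count (a rough illustration, not the actual ResNeXt block; the function name and the channel/cardinality numbers are made up for the example):

```python
def conv3x3_params(c_in, c_out, groups=1):
    # A 3x3 convolution split into `groups` parallel branches:
    # each branch maps c_in/groups input channels to c_out/groups outputs.
    return groups * (c_in // groups) * (c_out // groups) * 3 * 3

dense = conv3x3_params(256, 256)               # one wide serial layer
grouped = conv3x3_params(256, 256, groups=32)  # ResNeXt-style, cardinality 32

print(dense, grouped, dense // grouped)  # 589824 18432 32
```

At equal compute, that 32x saving is the budget you can reinvest in making each branch wider, which is the sparser-but-wider trade-off described above and is structurally reminiscent of modern MoE layers.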

<b>I think this</b> <b>resonates with what people are doing with MoE today</b> <b>aligns very well</b> <b>Does this work count as</b> <b>an extension of Kaiming's ResNet?</b>

<b>Yes yes</b> <b>So why is it called ResNeXt</b> <b>Kaiming said, right</b> <b>this is Xie's ResNet</b> <b>so the X stands for both "next"</b> <b>the next-generation ResNet</b> <b>and also for Xie</b> <b>Um</b> <b>giving me some</b> <b>giving me some credit</b> <b>Mm</b> <b>I think</b> <b>Kaiming is someone very good at naming things</b>

<b>Right</b> <b>at naming papers</b> <b>many later papers</b> <b>were actually named by him for us</b> <b>Mm</b> <b>Would he hide people's names in them?</b>

<b>Not really</b> <b>Not really</b> <b>not every time</b> <b>but it was just a clever touch</b> <b>I think this is also part of his research taste</b> <b>Then why was your name hidden in it?</b>

<b>I don't know</b> <b>I think maybe also</b> <b>Ah</b> <b>I actually don't know</b> <b>I never asked him</b> <b>Mm</b> <b>How long had you been working together at that point?</b>

<b>Did your internship get extended?</b>

<b>All of this happened in that one month</b> <b>Right, it all happened in one month</b> <b>There are countless stories like this</b> <b>Many of my best works</b> <b>actually follow the same rhythm</b> <b>starting out unable to produce anything</b> <b>Oh</b> <b>and then at the end suddenly a burst of inspiration</b> <b>and then converging on the thing</b> <b>research is never a linear process</b> <b>or rather, linearly developing research</b> <b>is never good research</b> <b>Mm</b>

<b>And then</b> <b>Much of our work is actually non-linear</b> <b>I can tell you more stories later</b> <b>Mm</b> <b>Um right</b> <b>Anyway</b> <b>At this time it was with Kaiming</b> <b>And I</b> <b>I finished</b> <b>and that period ended</b> <b>But your friendship continued, right?</b>

<b>I think so</b> <b>Right</b> <b>And then went to Meta</b> <b>This was a productive</b> <b>internship</b> <b>I think it was a productive internship</b> <b>And at Google?</b>

<b>At Google</b> <b>I think it also went pretty well</b> <b>Because</b> <b>I started to learn how video works</b> <b>Right, these internships</b> <b>were all different from what I'd done before</b> <b>Each internship</b> <b>was a different topic from what I'd done</b> <b>which led to my final dissertation</b> <b>actually, on the surface looking scattered</b> <b>but I was still able to find a way</b> <b>to connect them</b>

<b>and I'll tell you the way to connect them shortly</b> <b>Good</b> <b>But, anyway, at Google</b> <b>I went to study some video</b> <b>this kind of</b> <b>neural network architecture and training</b> <b>process and what it should look like</b> <b>I think it was also quite rewarding</b> <b>Hey, I have a question</b> <b>Because you worked so well with Kaiming at Meta</b> <b>And then</b> <b>and he's a very famous AI researcher</b> <b>why didn't you stay and continue collaborating with him</b>

<b>I think many people might make that choice</b> <b>why did you keep going to other places</b> <b>to explore</b> <b>Um, this is</b> <b>actually Kaiming's suggestion</b> <b>Kaiming would advise everyone</b> <b>to intern at different places</b> <b>this is the only way to</b> <b>maximize your gains</b> <b>Right</b> <b>So like us back then</b> <b>me</b> <b>and Wang Xiaolong</b> <b>we had all done one internship</b> <b>And then</b> <b>um, we of course all wanted to stay</b>

<b>but Kaiming said go check out other places</b> <b>maybe there will be different gains</b> <b>Mm</b> <b>But after your PhD you returned to Meta</b> <b>Yes right</b> <b>I think</b> <b>I think also after finishing the Google internship</b> <b>I immediately went to intern at DeepMind</b> <b>I think that experience</b> <b>was actually very enlightening for me</b> <b>Mm, at that time DeepMind wasn't yet</b> <b>Google</b> <b>Had it not been acquired yet?</b>

<b>No no</b> <b>acquired acquired</b> <b>already acquired</b> <b>but they were two different organizations</b> <b>because it, um, was only in London</b> <b>Right</b> <b>So during that time I went</b> <b>doing some RL-related research</b> <b>Ah</b> <b>And the reason was</b> <b>I really didn't know how this thing worked</b> <b>and I wanted to go and see</b> <b>And it was very painful doing it</b> <b>And London's winter</b> <b>that period was winter</b> <b>so cold</b> <b>London winters are also very cold</b> <b>I still remember very clearly</b>

<b>I'd get off the London underground</b> <b>working until very late</b> <b>at night maybe 10 or 11 o'clock</b> <b>and the biting cold wind</b> <b>mixed with rain</b> <b>hitting my face</b> <b>and clothes and hat couldn't block it</b> <b>step by step back to my tiny room</b> <b>Right, the temporary dorm</b> <b>It was actually quite hard</b> <b>Right</b> <b>But that period for me</b> <b>I think was also very enlightening</b>

<b>First</b> <b>made me feel like I didn't really enjoy doing</b> <b>RL (reinforcement learning) related research</b> <b>Or rather</b> <b>I didn't enjoy robotics-related research</b> <b>Robotics</b> <b>Because</b> <b>at that time RL was actually in this kind of</b> <b>virtual environment</b> <b>simulated environment</b> <b>doing some embodied agent tasks</b> <b>Mm</b> <b>But I think my bigger gain</b> <b>actually came from</b>

<b>my understanding of an organization like DeepMind</b> <b>being built up at that time</b> <b>Mm</b> <b>I thought, wow</b> <b>this place is so different</b> <b>different from everywhere I'd been</b> <b>Right</b> <b>They had a very different management model</b> <b>For example, they would have many</b> <b>PMs coordinating different research teams</b> <b>and the operations between them</b> <b>They would have these different working groups</b>

<b>where everyone still had many bottom-up ideas</b> <b>these bottom-up ideas</b> <b>But</b> <b>it wasn't simply a top-down management model</b> <b>it was a tiered management mode</b> <b>Starting with purely exploratory</b> <b>ideas</b> <b>where everyone could have their own small group</b> <b>to do some early studies</b> <b>and then immediately transition to</b> <b>once something takes shape</b>

<b>it would immediately enter a more top-down</b> <b>more organized management mode</b> <b>I think this is very, very interesting</b> <b>And thinking back now</b> <b>Right, I also mentioned this on Twitter before</b> <b>That Demis also met with many interns</b> <b>And everyone organized a meeting</b> <b>And Demis said to everyone</b> <b>or rather, someone actually asked him this question</b> <b>Saying hey</b> <b>what exactly is DeepMind's mission</b> <b>this company</b> <b>what do you ultimately</b>

<b>want to become as a company</b> <b>Demis's answer was</b> <b>DeepMind will ultimately become</b> <b>a company that can win multiple Nobel Prizes</b> <b>and the key point here is multiple</b> <b>a company that wins multiple Nobel Prizes</b> <b>I think we all said back then, wow</b> <b>that's so ambitious</b> <b>isn't that a bit far-fetched</b> <b>they're just doing AI</b> <b>But now we see</b>

<b>they have already achieved at least one step</b> <b>I think</b> <b>I think it's truly very, very admirable</b> <b>Actually the entire AlphaFold team</b> <b>was in the process of forming during my internship</b> <b>gradually coming together</b> <b>Right</b> <b>I could actually see which people were doing these things</b> <b>And at the beginning</b> <b>some interns were also participating in this process</b> <b>and step by step</b> <b>how it went from an exploratory idea</b>

<b>to gradually becoming organized</b> <b>focused on execution</b> <b>step by step able to achieve</b> <b>ultimately completely changing the world</b> <b>such a project's process</b> <b>The organization question</b> <b>we'll</b> <b>discuss in detail later</b> <b>Mm, I'm thinking</b> <b>did you do too many internships</b> <b>so you didn't get any more best papers after</b> <b>Mm</b> <b>I think that might be the case</b>

<b>or rather, I think what I did</b> <b>was maybe too much, too scattered</b> <b>Which year of your PhD did you start internships?</b>

<b>from the first year</b> <b>Oh, from the first year</b> <b>So these two were always</b> <b>intertwined</b> <b>Mm right</b> <b>So I think you're very right</b> <b>actually my timeline was disrupted</b> <b>Right, it does lose some focus</b> <b>But I think this was also a design of my own</b> <b>So coming back</b> <b>how to connect all these things</b> <b>I think my doctoral dissertation title is</b> <b>Um</b> <b>this</b>

<b>Deep Representation Learning with Induced Structural Priors</b> <b>roughly about some structural priors</b> <b>Um</b> <b>using these priors</b> <b>to guide us</b> <b>how to learn a better</b> <b>deep learning representation</b> <b>Mm</b> <b>And this</b> <b>again, many many years have passed</b> <b>but I</b> <b>I find what I'm doing now is still this</b>

<b>And then</b> <b>And at this conference in November or December</b> <b>there was a workshop</b> <b>their workshop title</b> <b>was Representation Learning with Induced Structural Priors</b> <b>roughly about using structural priors and representation</b> <b>a topic roughly like this</b> <b>And I gave a talk there</b> <b>And at the end of my talk</b> <b>I said, actually over the past 12 years</b> <b>your workshop topic</b>

<b>is still a frontier now</b> <b>even if we are</b> <b>discussing it with somewhat different meanings</b> <b>But</b> <b>this was also the problem I wanted to study at the beginning</b> <b>and also what I feel now</b> <b>is still not fully solved</b> <b>Right, so on one hand</b> <b>I think during my PhD</b> <b>the timeline was a bit fragmented</b> <b>The reason is</b> <b>I was doing different things in different places</b> <b>But on the other hand</b> <b>This is also, if you want to tackle</b>

<b>representation learning as a topic</b> <b>this is also unavoidable</b> <b>because it's like planting a tree</b> <b>your representation is actually the root of this tree</b> <b>after this tree grows</b> <b>it needs to have different branches</b> <b>Right</b> <b>each branch is actually a different</b> <b>what we call downstream</b> <b>application</b> <b>a new application</b> <b>So I've done image recognition</b> <b>image segmentation</b>

<b>edge detection</b> <b>video recognition</b> <b>action recognition</b> <b>right, and even later</b> <b>some embodied RL-related tasks</b> <b>when doing all these things</b> <b>the problems I saw</b> <b>they are all branches on those tree branches</b> <b>they are not roots</b> <b>Right</b> <b>I think it's possible</b> <b>what you said is right</b> <b>I haven't considered this</b> <b>whether I would have more best papers</b> <b>[chuckles]</b> <b>but I hope to plant more of this tree</b>

<b>and put down deeper roots</b> <b>rather than</b> <b>going further on the branches</b> <b>Right mm</b> <b>And I think, again</b> <b>I think this is the core of deep learning</b> <b>that is, representation learning</b> <b>Representation Learning</b> <b>is basically equivalent to deep learning</b> <b>Let me explain to everyone what representation learning is</b> <b>Um</b>

<b>Good question, right, this thing</b> <b>Um, I think</b> <b>I think the reason I like saying</b> <b>I am someone who does representation learning</b> <b>is because this is still hard to define</b> <b>Mathematically speaking</b> <b>you can think of representation learning as</b> <b>you have data</b> <b>right x</b> <b>and you now want to map it to a</b> <b>space</b> <b>and now this space</b> <b>might have some properties</b>

<b>these properties</b> <b>maybe these good properties</b> <b>may make it easier for you on downstream tasks</b> <b>to achieve better results</b> <b>Right</b> <b>So what you want to learn</b> <b>Um</b> <b>is the mapping</b> <b>function</b> <b>from the raw data</b> <b>to this space with good properties</b> <b>this is what is called representation learning</b>

<b>And then</b> <b>this function is also not just a simple mapping</b> <b>it might be a hierarchical</b> <b>hierarchical mapping</b> <b>And now</b> <b>of course this can be implemented in different ways</b> <b>now the mainstream implementation</b> <b>is to use a non-linear neural network</b> <b>to implement this</b> <b>function</b> <b>Right, so I think this is a definition</b> <b>But I just said I would</b> <b>I would be willing</b> <b>to say</b> <b>I myself am someone who does Representation Learning</b>
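That definition can be sketched in a few lines (a toy illustration with made-up random weights, not any real trained model): a hierarchical non-linear function f mapping data x into a representation space z.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weights for a two-layer (hierarchical) non-linear map
# f: R^4 -> R^2, standing in for a learned representation function.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

def represent(x):
    h = np.maximum(W1 @ x, 0.0)  # first level: linear map + ReLU non-linearity
    return W2 @ h                # second level: project into the representation space

z = represent(np.ones(4))
print(z.shape)  # (2,)
```

In practice the learned z is what downstream tasks (classification, segmentation, and so on) build on, which is why its properties matter more than any single branch task.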

<b>is because I think this is a timeless title</b> <b>because this field develops too fast</b> <b>many times we do many things</b> <b>including, let me give an example</b> <b>this might be a very, very</b> <b>very negative example</b> <b>which is that in the past, actually</b> <b>when I</b> <b>at what time</b> <b>maybe just after finishing my PhD</b> <b>something was very, very hot</b> <b>called NAS (Neural Architecture Search)</b> <b>which is called</b>

<b>neural architecture</b> <b>search</b> <b>I'm not sure how to translate it into Chinese</b> <b>it's Neural Architecture Search</b> <b>Mm</b> <b>Um, in this field</b> <b>there is a lot of consensus that</b> <b>this kind of topic</b> <b>wasted about two years of the entire field</b> <b>This was a wrong direction</b> <b>Everyone went down this wrong path</b> <b>publishing thousands of papers</b> <b>but ultimately got nothing out of it</b> <b>Mm</b> <b>And so</b> <b>Why do I say</b> <b>representation learning is a good</b>

<b>title like that</b> <b>or I am willing to tell everyone</b> <b>I am someone who does representation learning</b> <b>is because this is a fundamental problem</b> <b>If you say now</b> <b>I am someone doing Neural Architecture Search</b> <b>then this becomes very problematic</b> <b>It's possible after 2 years</b> <b>you'd have to immediately change fields</b> <b>You'd have to update your website</b> <b>My research direction is Neural Architecture Search</b> <b>delete that sentence</b>

<b>and replace it with the next more fancy</b> <b>or different</b> <b>term</b> <b>It is not a timeless theme</b> <b>It is not a timeless theme</b> <b>Mm</b> <b>Representation is a timeless theme</b> <b>the most fundamental theme</b> <b>and a theme that has not yet been solved</b> <b>Mm</b> <b>So ah hey</b> <b>I may have talked about my PhD a bit too long</b> <b>[chuckles] But</b> <b>But I still want to say</b> <b>That is to say, I think during my PhD</b> <b>I also experienced more setbacks</b>

<b>For example</b> <b>Our initial Deeply Supervised Nets paper</b> <b>this also</b> <b>At first we submitted to NeurIPS</b> <b>and got a pretty high score</b> <b>something like an 8, 8, 6</b> <b>or an 8, 8, 7</b> <b>but was ultimately still rejected</b> <b>And this was also a blow to me</b> <b>Mm, I found, wow</b> <b>Publishing a paper is actually this hard</b> <b>Even with very good reviews,</b>

<b>it was still rejected for some ridiculous reasons,</b> <b>and got rejected.</b>

<b>What was so ridiculous?</b>

<b>The ridiculous reason was that</b> <b>we had a mathematical formula in the paper,</b> <b>which should have been squared,</b> <b>and we had a typo —</b> <b>we left out the squared term.</b>

<b>Didn't write it.</b>

<b>It was purely a typo,</b> <b>very easy to fix.</b>

<b>But the PC said —</b> <b>the Program Chair,</b> <b>the person responsible for</b> <b>these conferences — said</b> <b>this makes your math invalid,</b> <b>it's an error.</b>

<b>And during the rebuttal,</b> <b>when responding to the reviewers,</b> <b>the reviewers didn't see it,</b> <b>so unfortunately</b> <b>there was no way to fix it.</b>

<b>So at that point all we could do was</b> <b>Now it seems unimaginable.</b>

<b>First of all,</b> <b>nowadays perhaps</b> <b>people don't check the formulas in papers anymore.</b>

<b>Second,</b> <b>I think people have become relatively more tolerant.</b>

<b>Back then,</b> <b>people were extremely nitpicky about details.</b>

<b>Yeah right.</b>

<b>But it's fine.</b>

<b>We ended up submitting to AISTATS</b> <b>— another conference —</b> <b>a machine learning conference.</b>

<b>And that paper</b> <b>won their Test of Time Award last year.</b>

<b>The Test of Time Award.</b>

<b>So I think</b> <b>After all this time.</b>

<b>Right.</b>

<b>Because all Test of Time Awards evaluate</b> <b>things 10 years later —</b> <b>at the 10-year mark,</b> <b>among all papers published 10 years ago,</b> <b>which paper had the greatest influence</b> <b>on the field.</b>

<b>Right. So I think</b> <b>I suddenly felt at peace again.</b>

<b>I think</b> <b>Research truly is a long-term process.</b>

<b>And so,</b> <b>That's also why</b> <b>I tell many of my students this:</b> <b>And I think</b> <b>don't worry about</b> <b>your wins and losses at every moment.</b>

<b>Or, to describe it mathematically,</b> <b>don't worry about a point estimate.</b>

<b>Don't, on this timeline,</b> <b>at every point,</b> <b>evaluate whether you're doing well or not.</b>

<b>Because all evaluations</b> <b>are ultimately an integral.</b>

<b>You need the accumulation of time.</b>

<b>In the end, look —</b> <b>everything you've ever done,</b> <b>added together,</b> <b>determines whether you're a good researcher.</b>

<b>But in that moment,</b> <b>you'll still feel very down.</b>

<b>Very down. Right.</b>

<b>Extremely down.</b>

<b>In that moment it's hard to think about 10 years later.</b>

<b>Hard to think about what happens 10 years from now.</b>

<b>Mm.</b>

<b>When you finished your PhD,</b> <b>what expectations did you have for your life?</b>

<b>You had published some good papers,</b> <b>you had 5 internship experiences,</b> <b>did you think you should go into research</b> <b>or into industry?</b>

<b>Did you make that choice?</b>

<b>I was never very confident back then.</b>

<b>At that time I never even considered a faculty position.</b>

<b>Because I thought I didn't deserve it.</b>

<b>[laughter] Because</b> <b>Why did you feel unworthy at every moment?</b>

<b>It's a bit better now.</b>

<b>But uh,</b> <b>Maybe that's a bit of an exaggeration.</b>

<b>It's not that I really felt unworthy.</b>

<b>But compared to my peers,</b> <b>they were on the established track,</b> <b>like I said,</b> <b>moving step by step toward a good faculty position.</b>

<b>That path.</b>

<b>I felt I wasn't on that path.</b>

<b>Oh.</b>

<b>Or rather,</b> <b>What you just said makes a lot of sense.</b>

<b>If your final destination</b> <b>was really a faculty position,</b> <b>at least at that point in time,</b> <b>you shouldn't have gone to 5 places</b> <b>for 5 internships,</b> <b>working on 5 different projects.</b>

<b>That's very unfavorable for</b> <b>finding a faculty position.</b>

<b>If you wanted a faculty position,</b> <b>staying in Kaiming He's team</b> <b>would have let you publish more papers,</b> <b>gotten more results,</b> <b>during that period,</b> <b>it might have been a smoother path</b> <b>toward a definite goal.</b>

<b>I don't know if it was a definite goal.</b>

<b>I really think it's quite mysterious.</b>

<b>All these decisions came down to:</b> <b>I only thought about where I should go</b> <b>to do what I most wanted to do,</b> <b>ideally with the people I most wanted to work with.</b>

<b>Working together.</b>

<b>I think</b> <b>This idea is actually very, very simple.</b>

<b>So when job hunting back then,</b> <b>actually I</b> <b>I was looking everywhere.</b>

<b>There were quite a few offers from major companies.</b>

<b>Right.</b>

<b>and</b> <b>I've talked before about my OpenAI interview experience.</b>

<b>It was actually pretty cool.</b>

<b>Basically, I was in a small dark room</b> <b>for five or six hours,</b> <b>working on one problem.</b>

<b>When I came out, it was already dark.</b>

<b>Right.</b> <b>I found the experience quite fascinating.</b>

<b>It felt quite extraordinary.</b>

<b>But back then actually</b> <b>Who was the interviewer at OpenAI?</b>

<b>John Schulman (OpenAI co-founder, Thinking Machines co-founder and Chief Scientist)</b> <b>Oh right.</b>

<b>Oh right.</b> <b>I saw you wrote about this experience on Zhihu.</b>

<b>Right?</b> <b>Uh, not on Zhihu,</b> <b>it was on Twitter,</b> <b>on X.</b>

<b>on X.</b> <b>Right, Zhihu reposted it.</b>

<b>That's it.</b>

<b>Yes.</b>

<b>So his original</b> <b>interview questions were on a single A4 sheet of paper,</b> <b>handwritten in pencil,</b> <b>line by line, handwritten interview questions.</b>

<b>I think</b> <b>it really moved me deeply.</b>

<b>I found it so fascinating.</b>

<b>This place is very interesting.</b>

<b>And uh,</b> <b>In the end,</b> <b>Actually,</b> <b>There was an offer, of course,</b> <b>but in the end</b> <b>I didn't go to OpenAI.</b>

<b>I didn't go to OpenAI.</b>

<b>This is where the timeline</b> <b>— quantum mechanics — starts to diverge.</b>

<b>That was 2018.</b>

<b>So early.</b>

<b>Mm.</b>

<b>So if I had gone to OpenAI, maybe, uh,</b> <b>you'd now be part of the LLM world.</b>

<b>Maybe. I don't think so.</b>

<b>I don't know.</b>

<b>I don't know.</b>

<b>Don't know what would have happened.</b>

<b>Back then I didn't even think about it.</b>

<b>I just wanted to go to FAIR.</b>

<b>If FAIR gave me the offer,</b> <b>I would definitely go.</b>

<b>Your reason for wanting to go to FAIR was Kaiming?</b>

<b>Uh right.</b>

<b>Kaiming, Piotr Dollar,</b> <b>Ross Girshick.</b>

<b>Ross Girshick.</b> <b>The so-called</b> <b>the three pillars of computer vision back then.</b>

<b>They weren't that senior —</b> <b>university professors or anything like that —</b> <b>they were all</b> <b>young to mid-career,</b> <b>researchers.</b>

<b>researchers.</b> <b>But the absolute top three.</b>

<b>Right, they were there.</b>

<b>And the research they were doing was</b> <b>the absolute top-tier computer vision research.</b>

<b>So for me,</b> <b>there was no choice to make.</b>

<b>So it was kind of fun back then.</b>

<b>Here's the thing —</b> <b>Ilya (Ilya Sutskever, SSI founder and CEO, OpenAI co-founder and former Chief Scientist)</b> <b>called me, and I said almost nothing,</b> <b>and I rejected OpenAI.</b>

<b>They sent me an offer,</b> <b>and I said I'm not going, sorry.</b>

<b>What did Ilya say on the call?</b>

<b>Uh, he was very angry.</b>

<b>He asked me,</b> <b>"Why didn't you even discuss it</b> <b>before rejecting the offer?"</b>

<b>"Is the money not enough?"</b>

<b>How much was it?</b>

<b>Uh,</b> <b>I don't remember exactly.</b>

<b>It was actually very, very low.</b>

<b>Maybe uh,</b> <b>probably in the hundreds of thousands.</b>

<b>Back then the pay for</b> <b>a top PhD student</b> <b>around 2018 would be</b> <b>roughly $400K to $500K</b> <b>dollars.</b>

<b>dollars.</b> <b>Dollars. Right.</b>

<b>Dollars. Right.</b>

<b>And now it's at least tripled.</b>

<b>But anyway,</b> <b>at that time</b> <b>OpenAI was at that level too,</b> <b>which was fine.</b>

<b>Right. And then</b>

<b>But Ilya was very angry.</b>

<b>So I</b> <b>I could only give vague responses</b> <b>and told him</b> <b>that I couldn't go.</b>

<b>and</b> <b>At that time indeed</b> <b>what did he say when angry?</b>

<b>Uh, not much actually.</b>

<b>His tone was just very stern.</b>

<b>Why did he decide to make this call?</b>

<b>I don't know.</b>

<b>That shows he really cared about recruiting.</b>

<b>He had never been rejected before.</b>

<b>Uh no.</b>

<b>I don't think that's the case.</b>

<b>In 2018,</b> <b>I think he was probably often rejected.</b>

<b>Because FAIR at that time</b> <b>— not just in Vision —</b> <b>in many areas,</b> <b>for the top PhD graduates,</b> <b>FAIR was more certain than OpenAI,</b> <b>more open,</b> <b>more like an academic environment.</b>

<b>Such an institution.</b>

<b>I think, at least at that time,</b> <b>everyone around me,</b> <b>if given that choice,</b> <b>unless</b> <b>they really wanted to do what OpenAI was already doing,</b> <b>the things OpenAI excelled at,</b> <b>I think most people would still lean toward FAIR.</b>

<b>Did you get the FAIR offer smoothly?</b>

<b>Uh, not that smoothly.</b>

<b>I think it was also quite</b> <b>rocky all the way.</b>

<b>When you rejected OpenAI,</b> <b>was it because you already had the FAIR offer?</b>

<b>Yes right.</b>

<b>But at FAIR,</b> <b>I gave a talk,</b> <b>this talk —</b> <b>I had no experience at all,</b> <b>it seemed everyone at my stage</b> <b>was quite experienced at job hunting,</b> <b>while I knew nothing.</b>

<b>So I gave a talk,</b> <b>and uh,</b> <b>the talk was scheduled for one hour.</b>

<b>Normally you'd speak for 45 to 50 minutes</b> <b>with 10 minutes for questions.</b>

<b>But I finished in 30 minutes.</b>

<b>Done.</b>

<b>Everyone looked at each other,</b> <b>not knowing what to do.</b>

<b>of course,</b> <b>many of the researchers there</b> <b>gave me a lot of face</b> <b>and asked many questions,</b> <b>so the time was somehow stretched to 45 minutes.</b>

<b>It wasn't too awkward.</b>

<b>Later Kaiming told me</b> <b>that everyone thought this was</b> <b>first, very unconventional.</b>

<b>How could you finish so fast?</b>

<b>Second,</b> <b>Maybe interviews should all be like this —</b> <b>a 30-minute talk works fine,</b> <b>saves everyone's time.</b>

<b>So many times</b> <b>I've done things</b> <b>without doing them perfectly.</b>

<b>Hmm, why did you finish so quickly?</b>

<b>Why didn't you follow the rules?</b>

<b>I didn't know there was a rule.</b>

<b>Oh.</b>

<b>Didn't read it.</b>

<b>Uh, I didn't know about this rule.</b>

<b>Because</b> <b>this rule is actually a job talk convention.</b>

<b>Nobody told me this rule.</b>

<b>Right, people just said,</b> <b>"There's a talk starting at 11,"</b> <b>but this is actually an established convention</b> <b>because that's how academic interviews work.</b>

<b>and</b> <b>FAIR back then was actually an academic institution.</b>

<b>Mm.</b>

<b>It was really like a university.</b>

<b>Its operating model was like a PI</b> <b>leading a group of young people —</b> <b>whether interns</b> <b>or newly joined members —</b> <b>working together.</b>

<b>And when I joined FAIR,</b> <b>I was probably</b> <b>among the first few — I'm not sure —</b> <b>Chen Xinlei was probably the first,</b> <b>but I was probably the second —</b> <b>a fresh PhD graduate who could join FAIR.</b>

<b>At first they didn't recruit new PhD graduates.</b>

<b>If you were just a PhD graduate,</b> <b>they didn't want you.</b>

<b>They would only recruit people like Kaiming,</b> <b>who had already done very impressive work,</b> <b>those kinds of researchers.</b>

<b>Mm. Right.</b>

<b>So I was also quite</b> <b>lucky. Right.</b>

<b>Mm.</b>

<b>I think FAIR</b> <b>really was the holy temple at that time.</b>

<b>Mm.</b>

<b>And so,</b> <b>I didn't agonize much over</b> <b>too many other possibilities.</b>

<b>Mm. And then</b>

<b>About the Ilya situation,</b> <b>let me add one more thing.</b>

<b>I've only talked to Ilya on the phone twice.</b>

<b>This was the first time.</b>

<b>We can talk about the second time later.</b>

<b>It was</b> <b>in July 2024,</b> <b>right after he founded SSI.</b>

<b>He emailed me and asked</b> <b>if I'd be willing to come work together.</b>

<b>And you rejected him again.</b>

<b>Uh right.</b>

<b>Why this time?</b>

<b>This time because I had just started at NYU.</b>

<b>and</b> <b>Mm. I think there were several reasons.</b>

<b>When I talked with him,</b> <b>Uh,</b> <b>the main topic we discussed</b> <b>wasn't salary or anything like that.</b>

<b>We didn't talk about any of that.</b>

<b>The main topic was</b> <b>how to give future artificial intelligence</b> <b>the ability to love.</b>

<b>Discussing philosophy.</b>

<b>Of course, I finally asked him</b> <b>one question.</b>

<b>I asked how he viewed multimodality,</b> <b>how he viewed computer vision,</b> <b>or general perception models —</b> <b>what did he think?</b>

<b>Ilya's response was</b> <b>he felt this was already solved well enough.</b>

<b>Okay, so I thought</b> <b>maybe uh,</b> <b>SSI has its own language-based</b> <b>approach.</b>

<b>And that approach,</b> <b>at least for now,</b> <b>is not the path I want to pursue.</b>

<b>This is your fundamental disagreement —</b> <b>LLM versus vision.</b>

<b>Right. We can talk more about this later.</b>

<b>But I don't actually see this as a disagreement.</b>

<b>I see it as an organism.</b>

<b>Everyone is just in different places,</b> <b>doing different things at different times.</b>

<b>I always like to say,</b> <b>"Brothers climbing a mountain,</b> <b>each making their own effort."</b>

<b>Everyone doing their own thing.</b>

<b>No problem with that at all.</b>

<b>It's not a fight to the death.</b>

<b>LLMs don't conflict with what I want to do.</b>

<b>And without the recent developments in LLMs,</b> <b>there might not have been</b> <b>the current state of computer vision.</b>

<b>Mm.</b>

<b>That topic you discussed —</b> <b>how to give AI the ability to love —</b> <b>did you reach any conclusions?</b>

<b>The conclusion is that this is very important.</b>

<b>Why?</b>

<b>Because without it,</b> <b>we face a very uncertain</b> <b>and very dangerous future.</b>

<b>But with love comes hate.</b>

<b>They're two sides of the same coin.</b>

<b>It can't only have love.</b>

<b>When it learns to love,</b> <b>it will definitely</b> <b>know what the opposite is.</b>

<b>For me,</b> <b>I completely agree with you.</b>

<b>Mm.</b>

<b>This becomes a philosophical proposition.</b>

<b>Mm.</b>

<b>But let me ask a counter-question.</b>

<b>Why do people trust their own children,</b> <b>trust humans so much,</b> <b>but have such worry and fear</b> <b>about AI, this new</b> <b>form of intelligent entity?</b>

<b>I don't have an answer to that.</b>

<b>But I think</b> <b>there will be technical ways</b> <b>to have control.</b>

<b>We can use technical means</b> <b>to make AI more trustworthy in the future,</b> <b>safer,</b> <b>and more controllable.</b>

<b>Mm. Controllable.</b>

<b>And this is also one reason</b> <b>why we need to work on</b> <b>world models.</b>

<b>Why did he want to reach out to you?</b>

<b>Uh, I don't know.</b>

<b>Maybe he reached out to</b> <b>a thousand people,</b> <b>ten thousand people.</b>

<b>I guess. Right.</b>

<b>When we were waiting in line at a restaurant that day,</b> <b>we actually walked through the streets of New York together,</b> <b>and our conversation naturally extended to</b> <b>people who have greatly influenced you.</b>

<b>In what you shared just now,</b> <b>the human factor</b> <b>takes up a very large share of many of your choices.</b>

<b>Why are people so important to you?</b>

<b>And in your personal bio,</b> <b>you clearly listed</b> <b>which collaborators are important to you.</b>

<b>That's very rare.</b>

<b>Why are people so crucial to you?</b>

<b>Is this unusual?</b>

<b>I don't think it's unusual at all.</b>

<b>I think</b> <b>in academic circles,</b> <b>this is a common behavioral pattern.</b>

<b>People organize themselves into</b> <b>these social networks.</b>

<b>Mm. And these people shape your thinking,</b> <b>because they may be your students,</b> <b>they may be your teachers, right?</b>

<b>But teachers don't always teach students.</b>

<b>Sometimes students teach the teachers.</b>

<b>All of this can be true.</b>

<b>So it's a huge graph</b> <b>where everyone is connected.</b>

<b>And I think</b> <b>that's also why research,</b> <b>or science, is especially fascinating.</b>

<b>Mm. Because many times</b> <b>the mutual</b> <b>trust between people,</b> <b>mutual appreciation,</b> <b>mutual feelings —</b> <b>these aren't built through</b> <b>living together</b> <b>and being friends.</b>

<b>Many times it's through scientific discovery,</b> <b>kind of</b> <b>this research aspect, that connections are built.</b>

<b>Relationships between people.</b>

<b>I think this is actually very interesting.</b>

<b>For example, those who deeply influenced me —</b> <b>I may get to know them personally,</b> <b>of course I try to get to know them personally,</b> <b>right, but that's not what matters most to me.</b>

<b>I seem to understand them through their papers,</b> <b>learning their way of thinking.</b>

<b>and</b> <b>I think that's the real meaning of research.</b>

<b>I don't think the purpose of research is to publish papers.</b>

<b>I don't think</b> <b>publishing papers</b> <b>is the goal.</b>

<b>Not at all.</b>

<b>The purpose should be —</b> <b>what is the purpose?</b>

<b>ah,</b> <b>Is it a journey through people?</b>

<b>What Kaiming told me the purpose is:</b>

<b>Mm.</b>

<b>At its core, it means</b> <b>sharing knowledge.</b>

<b>That is,</b> <b>the purpose of publishing a paper isn't for others to see it,</b> <b>but so that after others see the paper,</b> <b>they have something to work on.</b>

<b>You publish a paper,</b> <b>others understand some of the content,</b> <b>and they feel</b> <b>their own horizons have expanded.</b>

<b>Mm. It's about helping others.</b>

<b>Being helpful to others.</b>

<b>Right. Being able to inspire others,</b> <b>or enlighten others.</b>

<b>Oh, that's the purpose of research.</b>

<b>I think that's the purpose of research.</b>

<b>Or, to put it more romantically:</b> <b>this comes from Hannah Arendt, the political philosopher,</b> <b>who said</b> <b>she doesn't care about impact.</b>

<b>She doesn't care about influence.</b>

<b>Because</b> <b>in researcher circles,</b> <b>people say</b> <b>we publish papers to create some kind of impact,</b> <b>right?</b>

<b>In my own dictionary,</b> <b>I actually have a bit of an aversion to the word impact.</b>

<b>Aversion.</b>

<b>A bit of an aversion.</b>

<b>Oh.</b>

<b>Uh why?</b>

<b>What is it about it that you resist?</b>

<b>Again, Arendt said</b> <b>she felt, uh,</b> <b>the word "impact" is overly aggressive,</b> <b>overly masculine.</b>

<b>For her,</b> <b>the purpose of doing these things is not to create impact</b> <b>but for understanding itself.</b>

<b>If you can understand something,</b> <b>the feeling is wonderful.</b>

<b>If you can write down what you've understood,</b> <b>whether it's an article or a paper,</b> <b>and spread it,</b> <b>then you can</b> <b>potentially allow more people in the world</b> <b>to understand</b> <b>such a question in the same way you do.</b>

<b>And this</b> <b>will be transmitted step by step,</b> <b>creating a kind of resonance.</b>

<b>And Arendt's view is that</b> <b>she would find in this</b> <b>a feeling of family.</b>

<b>She would feel that she understood something,</b> <b>told others,</b> <b>allowed others to understand,</b> <b>which means these people also understood her to some degree.</b>

<b>Mm.</b>

<b>But humans, as social beings,</b> <b>need to be understood.</b>

<b>Right.</b>

<b>She reframed the word "impact"</b> <b>in a very soft way —</b> <b>seeking to be understood.</b>

<b>I think so.</b>

<b>You agree more with this view?</b>

<b>I agree with her very much.</b>

<b>Because I think</b> <b>Creating impact is fine in itself.</b>

<b>But it's very self-centered.</b>

<b>Mm-hmm.</b>

<b>I'm going to create impact. Mm.</b>

<b>Right. Me-centered.</b>

<b>And yes,</b> <b>you're absolutely right.</b>

<b>I'm going to create this impact,</b> <b>I'm going to change the world,</b> <b>but do the people in this world agree to be changed by me?</b>

<b>[laughs]</b> <b>Or rather, many disasters in the world</b> <b>are because people want to create impact,</b> <b>want to transform the world.</b>

<b>Right.</b>

<b>I think</b> <b>I would tend to agree with this softer expression.</b>

<b>I think</b> <b>If all people in this world,</b> <b>through our research,</b> <b>can gain a new layer of understanding,</b> <b>a new layer of knowledge,</b> <b>the total intelligence on Earth would increase.</b>

<b>And increasing total intelligence on Earth</b> <b>is never wrong.</b>

<b>It's always something beneficial to the world.</b>

<b>Whether it's called impact</b> <b>or being understood by more people.</b>

<b>Do you want to be known and remembered by more people?</b>

<b>Mm. Do you have a need for fame?</b>

<b>I certainly don't have that need.</b>

<b>You don't have that need.</b>

<b>But I think</b> <b>I don't have that need.</b>

<b>But really?</b>

<b>Uh,</b> <b>or rather, from where I stand now,</b> <b>I'm actually a victim</b> <b>of a kind of false fame.</b>

<b>Uh, the reason is</b> <b>people now take some of our papers</b> <b>and post them on Xiaohongshu to discuss,</b> <b>talking about the so-called top-three conferences</b> <b>and promoting the work, right?</b>

<b>I have never once</b> <b>asked any such media outlet</b> <b>to do this kind of promotion.</b>

<b>Mm.</b>

<b>And I tell my students:</b> <b>please don't go on Xiaohongshu</b> <b>or Zhihu</b> <b>to promote your own work.</b>

<b>You can explain your work,</b> <b>you can comment on your work.</b>

<b>That's fine.</b>

<b>Just don't promote yourself.</b>

<b>Why is it okay on X?</b>

<b>I think on X,</b> <b>uh, it's more about</b> <b>how you define promotion.</b>

<b>What I focus on</b> <b>is briefly summarizing things</b> <b>and telling people what it's about.</b>

<b>It's more like attracting people to look at my work,</b> <b>and I think that's fine.</b>

<b>But the promotion I'm referring to</b> <b>is more like the fame you mentioned,</b> <b>because what I really can't accept is</b> <b>people now say "so-and-so's team"</b> <b>published such-and-such</b> <b>work.</b>

<b>Oh.</b>

<b>It reinforces that person.</b> <b>Saying "someone's team"</b> <b>keeps reinforcing that one person.</b>

<b>Right, uh,</b> <b>if any editors hear this,</b> <b>I hope they can stop doing this.</b>

<b>Don't write "Xie Saining's team".</b>

<b>Don't put my photo on it.</b>

<b>Don't put my name on it.</b>

<b>We need to encourage young people more —</b> <b>the people who actually did the work,</b> <b>give them more visibility.</b>

<b>Right?</b>

<b>Well, people might think you're the first author.</b>

<b>Uh right.</b>

<b>If I am the first author, that's fine.</b>

<b>But I'm not the first author.</b>

<b>Right?</b>

<b>I'm just the team lead.</b>

<b>And much of this work is done by students.</b>

<b>So what should it be called?</b>

<b>Not "Xie Saining's team".</b>

<b>Just focus on the work itself.</b>

<b>Talk about what problem this solves</b> <b>and why it matters.</b>

<b>That's enough.</b>

<b>Right.</b>

<b>But I think</b>

<b>You really hate being used as a target by others.</b>

<b>Is that so?</b>

<b>Uh yes.</b>

<b>Because I think it adds</b> <b>a lot of risk.</b>

<b>I think</b>

<b>Mm. Tell us about those who influenced you.</b>

<b>We've already talked about a few people.</b>

<b>Kaiming, Professor Tu — anyone else?</b>

<b>Oh yes.</b>

<b>Uh,</b> <b>I think, right,</b> <b>this goes</b> <b>back to FAIR.</b>

<b>We can follow the FAIR thread.</b>

<b>After FAIR,</b> <b>I came to NYU.</b>

<b>I think this was another decision-making point.</b>

<b>Stayed at FAIR for 4 years.</b>

<b>A full 4 years.</b>

<b>Right. OK.</b>

<b>Yes. Yes.</b>

<b>Also with ups and downs.</b>

<b>For me,</b> <b>as I said,</b> <b>many places I've been</b> <b>actually grew alongside me.</b>

<b>FAIR might be an exception.</b>

<b>When I joined, it was at its peak.</b>

<b>The high point.</b>

<b>Probably the high point.</b>

<b>Right. And then</b>

<b>Right. It's a pity.</b>

<b>What's happening there now.</b>

<b>But I also think</b>

<b>Mm.</b>

<b>Right. Because I left relatively early,</b> <b>I wasn't there</b> <b>at its lowest point.</b>

<b>Right. [laughs]</b>

<b>I also saw some warning signs.</b>

<b>Right.</b>

<b>OK.</b>

<b>But, right.</b>

<b>And I think</b> <b>if I'm talking about people who influenced me,</b> <b>then in this process, when going to NYU,</b> <b>I think</b> <b>that was another quite mysterious decision-making process.</b>

<b>Right. Deciding to go to New York at that time</b> <b>— I just mentioned this —</b> <b>was partly because I might enjoy the city.</b>

<b>But I think,</b> <b>uh, another very important thing</b> <b>was also that Yann LeCun is here.</b>

<b>Right, Yann is here.</b>

<b>Mm right uh.</b>

<b>Why, with him here,</b> <b>were you willing to go?</b>

<b>You worked together at FAIR.</b>

<b>Uh,</b> <b>he likes to say he's recruited me</b> <b>three times, right?</b>

<b>The first time was at FAIR.</b>

<b>But at that time,</b> <b>because he was the overall director of FAIR,</b> <b>I didn't directly work with him,</b> <b>but I was influenced by him, of course.</b>

<b>Or have you had long-term exchanges?</b>

<b>Yes, we've talked.</b>

<b>Right.</b>

<b>But never directly collaborated.</b>

<b>Mm.</b>

<b>Then going to NYU was the second time.</b>

<b>We can talk about the third time later.</b>

<b>Mm.</b>

<b>And the NYU experience —</b> <b>I think why it matters that he's here</b> <b>is also because</b> <b>I think</b> <b>he's a person with a very strong vision.</b>

<b>so</b> <b>Right.</b>

<b>I think many of these decisions were very intuitive.</b>

<b>For example, NYU's building,</b> <b>which we call the Center for Data Science,</b> <b>was actually led by Yann</b> <b>over ten years ago.</b>

<b>He established this organization.</b>

<b>Right. It's independent of</b> <b>traditional computer science departments</b> <b>or math departments.</b>

<b>It's a new department.</b>

<b>So we have a new building,</b> <b>and the first time I walked into this building,</b> <b>I felt great.</b>

<b>Because</b> <b>Everything is glass doors.</b>

<b>Right.</b> <b>I can take you to see it sometime.</b>

<b>All glass doors.</b>

<b>Uh, everything is very, very open.</b>

<b>And it feels a bit like a company for students.</b>

<b>And the color scheme is very nice.</b>

<b>Right, I keep saying I'm a visual person.</b>

<b>There are warm tones in there,</b> <b>with an orange floor,</b> <b>various sofas,</b> <b>and everyone, uh,</b> <b>though it's quite chaotic —</b> <b>all kinds of robots</b> <b>running around on the floor,</b> <b>various students on this sofa,</b> <b>that sofa,</b> <b>sitting and studying.</b>

<b>And there's absolutely no privacy —</b> <b>zero privacy.</b>

<b>Through all the professors' office glass doors,</b> <b>you can see clearly everything happening inside.</b>

<b>Mm. Right.</b>

<b>But I thought, wow,</b> <b>this is very interesting.</b>

<b>This environment is very interesting.</b>

<b>Right.</b>

<b>More and more American schools now</b> <b>are making efforts like this,</b> <b>saying we want to have</b> <b>mm, this kind of</b> <b>uh interdisciplinary</b> <b>cross-disciplinary centers.</b>

<b>Right? Usually</b> <b>these are AI centers,</b> <b>used to attract talent</b> <b>and to bring different departments together,</b> <b>because AI really serves as</b> <b>this middle layer,</b> <b>this connecting identity and position.</b>

<b>Connecting everyone.</b>

<b>Everyone needs it.</b>

<b>Right. Mm. Yeah.</b>

<b>Whether you're doing science, right,</b> <b>doing physics, chemistry,</b> <b>math,</b> <b>statistics, business school,</b> <b>and including computer science,</b> <b>I think AI is a very good</b> <b>middle connecting node.</b>

<b>Mm right.</b>

<b>But Yann's foresight was that</b> <b>more than ten years ago</b> <b>he had already established this.</b>

<b>Mm.</b>

<b>So I think he is</b> <b>quite a visionary person.</b>

<b>Mm. Right. And then</b>

<b>So NYU's positioning in AI is also very good.</b>

<b>So actually, uh, again,</b> <b>I think</b> <b>the computer science department isn't the school's strong suit.</b>

<b>But it has many</b> <b>AI talent reserves.</b>

<b>Right.</b> <b>It has gathered many very impressive AI</b> <b>faculty members.</b>

<b>Right. Mm.</b>

<b>Yann is one reason you chose NYU.</b>

<b>There are also many, many reasons.</b>

<b>He's one of them.</b>

<b>Because he needed to interview me,</b> <b>and he had the final say.</b>

<b>Right. Mm.</b>

<b>Or rather, it was he who chose me.</b>

<b>Mm.</b>

<b>Important people.</b>

<b>Are there others?</b>

<b>Mm. I think there are.</b>

<b>For example, during my time at NYU,</b> <b>I also collaborated with many other professors,</b> <b>and one person who I think influenced me greatly</b> <b>would be Professor Fei-Fei.</b>

<b>Right.</b>

<b>I think Professor Li Fei-Fei —</b> <b>uh, everyone should definitely read the book she wrote.</b>

<b>Right, her autobiography.</b>

<b>Right.</b>

<b>And I've read it too.</b>

<b>But after having deep conversations with her,</b> <b>I gained even more.</b>

<b>Right. Sometimes I would</b> <b>tell her</b> <b>I was facing</b> <b>this difficulty and challenge,</b> <b>and Professor Fei-Fei would tell me earnestly</b> <b>some stories from her past.</b>

<b>Mm. And then</b>

<b>This was actually a great comfort to me.</b>

<b>What kind of stories?</b>

<b>Specific things</b> <b>might not be appropriate to share.</b>

<b>But in short,</b> <b>her journey wasn't smooth sailing at all.</b>

<b>Mm. She also had to</b> <b>wade through many thorns,</b> <b>overcoming many obstacles step by step,</b> <b>and now</b> <b>standing on the world stage,</b> <b>becoming a pride of the Chinese community,</b> <b>or becoming a North Star for the entire research field,</b> <b>especially computer vision,</b> <b>allowing everyone to see</b>

<b>what she's thinking</b> <b>and being able to</b> <b>in some sense</b> <b>set some new directions.</b>

<b>I think</b> <b>Right, her influence on me has been enormous.</b>

<b>Mm.</b>

<b>And I think Professor Fei-Fei's greatest strength is</b> <b>that she's someone who can define problems.</b>

<b>Mm. This point</b> <b>is actually not very intuitive.</b>

<b>When people talk about Professor Fei-Fei,</b> <b>her greatest achievement</b> <b>is building ImageNet,</b> <b>this dataset.</b>

<b>But in fact, this isn't just a dataset.</b>

<b>This isn't just data.</b>

<b>It's hard to imagine</b> <b>that back then, right,</b> <b>around 2012 or 2011,</b> <b>image classification wasn't a well-defined problem.</b>

<b>Defining this problem clearly</b> <b>was far more important</b> <b>than building such a dataset —</b> <b>far, far more important.</b>

<b>Mm-hmm.</b>

<b>And I think Professor Fei-Fei</b> <b>set this agenda,</b> <b>defined this problem clearly,</b> <b>so that subsequently</b> <b>Deep Learning could have a playground,</b> <b>have such a platform</b> <b>to showcase its capabilities.</b>

<b>I think</b> <b>This is her greatest achievement,</b> <b>and also what I always want to learn from.</b>

<b>Mm. Right.</b>

<b>So I worked with her on</b> <b>two pieces of work.</b>

<b>One is Thinking in Space,</b> <b>and</b> <b>this paper</b> <b>mainly involves,</b> <b>within multimodal base models,</b> <b>how to better solve this kind of</b> <b>uh, spatial intelligence problem.</b>

<b>Well, recently we have another paper called Cambrian-S,</b> <b>and this paper also addresses</b> <b>questions about video —</b> <b>how do we define problems,</b> <b>which problems are actually important.</b>

<b>Right.</b>

<b>I think this collaboration with her</b> <b>has also helped expand the boundaries of my research.</b>

<b>How did you come to know Professor Fei-Fei well?</b>

<b>Uh, it was all quite serendipitous.</b>

<b>She came to New York on a business trip once,</b> <b>and we had a meal together.</b>

<b>And she told me</b> <b>a lot of things.</b>

<b>Right. And she would often come to New York later,</b> <b>and because she's also starting a company,</b> <b>we would often get together</b> <b>and chat.</b>

<b>and chat.</b> <b>Right, roughly that.</b>

<b>And normally we'd have</b> <b>some research meetings.</b>

<b>Mm. I'm curious about something,</b> <b>and I think many people are curious about this too.</b>

<b>Mm.</b> <b>How did you go from being a very young</b> <b>researcher just starting out in academia,</b> <b>and gradually,</b> <b>come to be alongside these well-known names in AI,</b> <b>come together with them</b> <b>and stand alongside them?</b>

<b>That is,</b> <b>how did you enter the core of AI?</b>

<b>I still don't feel I'm at the core of AI,</b> <b>or that I've gotten close to it.</b>

<b>Mm. But the people you just mentioned,</b> <b>certainly many people would love to collaborate with them.</b>

<b>Is that so?</b>

<b>Ah, of course.</b>

<b>Right. I think</b>

<b>And look — all of it was serendipity.</b>

<b>With Kaiming it was just happening to be there</b> <b>as an intern and getting him to open up.</b>

<b>And with Professor Fei-Fei,</b> <b>you just had one meal together.</b>

<b>How did you get them to open up to you?</b>

<b>I think this is very hard to do intentionally.</b>

<b>Mm. Or this is a bit mysterious.</b>

<b>You could call it some kind of law of attraction.</b>

<b>Or you could think of it as</b> <b>people whose thoughts align</b> <b>ultimately converging together.</b>

<b>Though you may have countless small streams,</b> <b>in the end, they may all converge into one river.</b>

<b>I think, for example,</b> <b>uh, all the people I've mentioned,</b> <b>at least they're all working on vision.</b>

<b>Or rather,</b> <b>Even including Yann,</b> <b>who can be seen as doing general AI,</b> <b>but his starting point, right,</b> <b>was also digit recognition,</b> <b>which is also a visual problem.</b>

<b>Right.</b>

<b>I think everyone's foundation</b> <b>is very, very aligned.</b>

<b>So I think</b> <b>I really didn't make these things happen intentionally.</b>

<b>Right.</b>

<b>And many things,</b> <b>I think,</b> <b>don't need to be made to happen intentionally.</b>

<b>Everyone is just based on these research questions,</b> <b>and their understanding of these questions,</b> <b>collaborating together.</b>

<b>Right.</b>

<b>I would think of it this way.</b>

<b>The thing is that</b> <b>from the outside,</b> <b>I'd see you as someone very goal-oriented</b> <b>and very logical.</b>

<b>But through our conversation just now,</b> <b>I find you're someone whose choices are quite disorderly.</b>

<b>Right?</b>

<b>Right.</b>

<b>I think there's a certain disorder.</b>

<b>Mm. But I think</b> <b>this is also a by-design process.</b>

<b>I choose this disorder.</b>

<b>I think,</b> <b>to use this clichéd phrase:</b> <b>"follow your heart."</b>

<b>Right. But in many cases</b> <b>right, there's no way around it.</b>

<b>Many of my choices couldn't truly optimize</b> <b>for a result.</b>

<b>I think this is the source of the disorder.</b>

<b>So in</b> <b>these disorderly choices,</b> <b>can you string together all of your research journey</b> <b>into a single thread?</b>

<b>We've actually already discussed a few works.</b>

<b>Yes. Yes.</b>

<b>Yes right.</b>

<b>I think we can go through it bit by bit.</b>

<b>I think one benefit is</b> <b>I don't have that many papers,</b> <b>so maybe it's relatively easy to string together.</b>

<b>And I think indeed, uh,</b> <b>I can't say there's a hidden thread,</b> <b>but there really is a thread in the background</b> <b>guiding me to keep doing this.</b>

<b>Or rather, before talking about these papers,</b> <b>I want to say:</b> <b>computer vision has developed for such a long time,</b> <b>right, I have many friends</b> <b>who are slowly exploring new directions,</b> <b>like doing some</b> <b>robotics, right,</b> <b>or 3D vision.</b>

<b>I'm also trying to expand my boundaries outward.</b>

<b>But looking back,</b> <b>I find on this main thread,</b> <b>right, I think this main thread for me</b> <b>— representation learning —</b> <b>Mm.</b>

<b>there are too many unsolved problems. Right.</b>

<b>So I want to stay on this main thread</b> <b>and push forward what we're doing.</b>

<b>So the starting point of all this,</b> <b>if we trace it back,</b> <b>of course involves Deep Learning,</b> <b>involves Deep Neural Networks,</b> <b>the design of these architectures.</b>

<b>I think this part</b> <b>is of course related to representation learning.</b>

<b>Mm. And then</b>

<b>this is also what I think, in the past,</b> <b>everyone has been working toward.</b>

<b>Not just me.</b>

<b>Right. And everyone is doing this —</b> <b>how to design a better architecture</b> <b>so we can learn better representations</b> <b>and better solve</b> <b>problems.</b> <b>Mm.</b>

<b>Right. And then, uh,</b> <b>later on,</b> <b>things start to change.</b>

<b>We find</b> <b>that architecture itself isn't necessarily the most important.</b>

<b>It's definitely important,</b> <b>but not necessarily the most important,</b> <b>or it's not everything.</b>

<b>So there are at least several different things</b> <b>that intertwine.</b>

<b>Right, architecture is one thing,</b> <b>and your data is also important.</b>

<b>Mm-hmm.</b>

<b>And there's also your objective —</b> <b>your goal is also very important.</b>

<b>Right?</b>

<b>I think architecture determines</b> <b>what you use for training.</b>

<b>We can imagine it as</b> <b>having a massive engine.</b>

<b>And the hardware of this engine</b> <b>is essentially the architecture of a neural network.</b>

<b>Mm.</b>

<b>But having just the engine's architecture</b> <b>is actually useless.</b>

<b>You have no fuel.</b>

<b>You can't start it.</b>

<b>Right. So, uh,</b>

<b>there's the data dimension</b> <b>and there's the objective dimension,</b> <b>the objective function considerations.</b>

<b>And so</b> <b>my subsequent research</b> <b>has also followed this main thread of representation learning,</b> <b>advancing around architecture, data, and objective.</b>

<b>Mm-hmm.</b>

<b>And uh,</b> <b>during my time at FAIR,</b> <b>in that full-time job,</b> <b>I think one core aspect was</b> <b>that I worked with Kaiming,</b> <b>and Kaiming was leading some</b> <b>self-supervised learning work.</b> <b>Right.</b>

<b>And actually, again,</b> <b>now everyone says scaling</b> <b>is already a buzzword.</b>

<b>Everybody's talking about scaling.</b>

<b>Mm. Right.</b>

<b>But actually the first person who really told me</b> <b>that we need a scalable model,</b> <b>that we need to make the model bigger and bigger,</b> <b>these were Kaiming's exact words.</b>

<b>Bigger and bigger.</b>

<b>Right yes.</b>

<b>Kaiming told me this.</b>

<b>What year did he tell you?</b>

<b>Uh, roughly around 2018 or 2019.</b>

<b>Right. And then</b>

<b>So from the very beginning his conviction was</b> <b>that we must make models bigger,</b> <b>make data bigger,</b> <b>and this would allow us to get</b> <b>a better result.</b>

<b>I think very early on,</b> <b>Kaiming already had this vision.</b>

<b>Mm.</b>

<b>Uh. And then</b>

<b>so we also</b> <b>made some efforts along this path.</b>

<b>And so I think</b> <b>in the initial discussion about self-supervised learning,</b> <b>Yann, uh,</b> <b>was a big advocate.</b>

<b>He is</b> <b>very invested in</b> <b>self-supervised learning,</b> <b>and he has this classic cake analogy.</b>

<b>This metaphor.</b>

<b>Right, the base layer is</b> <b>the body of the cake,</b> <b>and this part must be Self-Supervised Learning.</b>

<b>On top of that you can have Supervised Learning,</b> <b>right, this is the icing on the cake,</b> <b>the cream on your cake.</b>

<b>And further on top is Reinforcement Learning,</b> <b>it's just the cherry on top,</b> <b>just a little cherry at the very top.</b>

<b>Mm.</b>

<b>Each layer of this cake is actually important,</b> <b>but they're not ranked by importance.</b>

<b>Mm.</b>

<b>If you don't have the cake's base,</b> <b>you can't get to intelligence</b> <b>relying only on the cherry on top.</b>

<b>Mm.</b>

<b>Right. So because we were at FAIR</b> <b>doing vision,</b> <b>we were actually paying attention to this very early.</b>

<b>But the process of this research went like this:</b> <b>around 2015 and 2016,</b> <b>people already knew that self-supervised learning</b> <b>was actually a future for vision.</b>

<b>So at that time, uh,</b> <b>people would design</b> <b>all kinds of</b> <b>what we call pretext tasks,</b> <b>or proxy objectives.</b>

<b>that is,</b> <b>what is self-supervised learning?</b>

<b>I don't have a label to directly give you,</b> <b>unlike ImageNet,</b> <b>where I have 1000 classes</b> <b>and can directly train</b> <b>a supervised classifier</b> <b>and get a representation this way.</b>

<b>In the old days,</b> <b>this is what everyone was doing.</b>

<b>Through 1000 class labels, by the way,</b> <b>within these 1000 classes there are 200</b> <b>different dog breeds.</b>

<b>Even so,</b> <b>this is why</b> <b>ImageNet is so powerful.</b>

<b>Right? Even with that distribution,</b> <b>it can still let</b> <b>our neural networks learn good representations.</b>

<b>I think this is extremely impressive.</b>

<b>But people also see the limitations.</b>

<b>Once everything is just Supervised Learning,</b> <b>there are many things you can't capture.</b>

<b>Mm.</b>

<b>Because what it learns</b> <b>— for example, we're sitting here now,</b> <b>we see these chairs,</b> <b>Right?</b>

<b>And we now have a lot of images</b> <b>of different chairs.</b>

<b>Some chairs might be quite ordinary,</b> <b>chairs in a studio like ours,</b> <b>or chairs in a home,</b> <b>or some designer chairs,</b> <b>right, or like an avocado chair,</b> <b>a chair shaped like an avocado.</b>

<b>For supervised learning,</b> <b>you need to map all of this</b> <b>to a single label,</b> <b>this label is called "chair".</b>

<b>So what your network has to learn,</b> <b>this mapping,</b> <b>is actually very, very difficult.</b>

<b>Right.</b>

<b>And it's an infinite mapping.</b>

<b>It's an infinite mapping.</b>

<b>Mm.</b>

<b>So it can only either memorize,</b> <b>just remember,</b> <b>recite all the chairs it's ever seen,</b> <b>or else,</b> <b>through what we call spurious correlations,</b> <b>some false correlations,</b> <b>tell you it's a chair.</b>

<b>For example, it may not look at the chair itself</b> <b>but look at the background behind the chair,</b> <b>or it thinks</b> <b>all chairs will be next to a table,</b> <b>so it uses that to make a decision boundary</b> <b>and says,</b> <b>hey, this is a chair.</b>

<b>But this is not what we want.</b>

<b>What we want to achieve</b> <b>is, from this very diverse visual knowledge,</b> <b>these visual observations, to gain some kind of common sense,</b> <b>some kind of intuition.</b>

<b>Mm. Intuition.</b>

<b>Right. Or some kind of common understanding.</b>

<b>So this is why people initially wanted to do</b> <b>so-called Self-Supervised Learning</b> <b>or Unsupervised Learning.</b>

<b>A common misconception back then was</b> <b>people say</b> <b>we want to do Unsupervised Learning</b> <b>because labeling data</b> <b>is too hard and too expensive.</b>

<b>We need to hire people</b> <b>to label,</b> <b>spending money and time.</b>

<b>We don't want to do that.</b>

<b>But that's just</b> <b>one very small part of the problem.</b>

<b>The bigger issue is, in the eyes of computer vision researchers,</b> <b>ah,</b> <b>everyone knew long ago</b> <b>that through supervised learning alone</b> <b>there's no way to give AI systems this kind of common sense.</b>

<b>So in 2015 and 2016,</b> <b>everyone was very, very creative.</b>

<b>That period</b> <b>was actually a quite creative era.</b>

<b>People would design</b> <b>all kinds of crazy tasks.</b>

<b>These tasks —</b> <b>for example, you take an image,</b> <b>rotate it 90 degrees,</b> <b>or 180 degrees,</b> <b>or 270 degrees.</b>

<b>You don't give these images a label,</b> <b>but because you decided</b> <b>how to rotate these images,</b> <b>right, these images</b> <b>and their</b> <b>rotation angles</b> <b>can form a valid pretext task.</b>

<b>You can predict how these rotated images</b> <b>were actually rotated.</b>

<b>This becomes a so-called</b> <b>proxy task.</b>

<b>proxy task.</b> <b>Mm.</b>

<b>Similar proxy tasks also include</b> <b>giving an image,</b> <b>converting it to grayscale,</b> <b>removing all its colors,</b> <b>but then using a neural network</b> <b>to reconstruct the original colors.</b>

<b>Essentially, from a grayscale image,</b> <b>how do you predict</b> <b>the color of each object</b> <b>as it should be.</b>

<b>Mm.</b>

<b>And there are other similar examples,</b> <b>too many to count.</b>

<b>Let me give one last example.</b>

<b>The so-called Context Encoder —</b> <b>you take an image, cut out a piece in the middle,</b> <b>make it white,</b> <b>and then train a neural network</b> <b>to fill in this empty part.</b>

<b>Fill it in.</b>

<b>Mm.</b>

<b>The rationale behind all these pretext tasks is</b> <b>that</b> <b>humans can actually do this.</b>

<b>The reason humans can do this,</b> <b>the reason humans know,</b> <b>hey,</b> <b>whether this image was rotated 90 or 180 degrees,</b> <b>or what color the butterfly</b> <b>or house in this image should be,</b> <b>or</b> <b>how to predict the information missing in the middle —</b> <b>all of this is because humans,</b> <b>based on some understanding of the physical world,</b> <b>have this common sense,</b> <b>so they can guess</b>

<b>these corrupted signals,</b> <b>these already lost signals,</b> <b>how they should be reconstructed.</b>

<b>The masked signals.</b>

<b>Right.</b>
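All three pretext tasks mentioned here (rotation, colorization, inpainting) follow the same recipe: corrupt the image in a known way, then ask the network to identify or undo the corruption, so the label comes for free. A minimal sketch of the rotation task's label generation in plain Python; the function names and the tiny 2x2 "image" are illustrative, not from any specific paper:

```python
import random

def rotate90(img):
    """Rotate a 2D grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_example(img, k=None):
    """Derive a free (input, label) pair: the label is how many
    90-degree turns were applied -- no human annotation needed."""
    if k is None:
        k = random.randrange(4)  # 0, 90, 180, or 270 degrees
    rotated = img
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k

img = [[1, 2],
       [3, 4]]
x, y = make_rotation_example(img, k=1)
# x is the rotated image [[3, 1], [4, 2]]; y is the rotation class 1.
```

In a real setup, a classifier would be trained to predict `y` from `x`; doing that well forces the network to pick up on object structure, which is the whole point of the pretext task.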

<b>But back then the problem was a hundred flowers blooming —</b> <b>all kinds of papers,</b> <b>Mm.</b>

<b>But none of them worked well.</b>

<b>All the results were actually quite poor,</b> <b>all worse than ImageNet pre-training,</b> <b>by roughly 15-20 percentage points.</b>

<b>Percentage points.</b>

<b>So people were making some progress,</b> <b>moving forward step by step,</b> <b>but the representation ImageNet could give through Supervised Learning,</b> <b>learned on large-scale data</b> <b>with labels,</b> <b>was still far, far better.</b>

<b>Right?</b>

<b>So uh,</b> <b>we did something at that time,</b> <b>and this was done together with Kaiming.</b>

<b>And this,</b> <b>this architecture is called</b> <b>called MoCo,</b> <b>Mm.</b>

<b>Momentum Contrast,</b> <b>momentum contrastive learning.</b>

<b>Right.</b>

<b>Even the Chinese name sounds interesting.</b>

<b>Right yes.</b>

<b>Yes, momentum contrastive learning.</b>

<b>Uh, I think</b> <b>you don't need to dig into</b> <b>the specific technical details.</b>

<b>Because now</b> <b>much of it is no longer important.</b>

<b>But in short,</b> <b>it was the first to take what's called contrastive learning</b> <b>as a framework</b> <b>and make it actually work, as a paper.</b>

<b>And what is contrastive learning?</b>

<b>Also quite simple.</b>

<b>We're now in this Representation Space,</b> <b>in this representation space,</b> <b>there are different points.</b>

<b>These points may be the same object</b> <b>or completely different objects.</b>

<b>For example,</b> <b>I have several images of this chair,</b> <b>Right?</b>

<b>And also some that may be tables,</b> <b>or images of cats or dogs.</b>

<b>These images are all different,</b> <b>but in this space,</b> <b>we can measure their distances.</b>

<b>Or we know</b> <b>all these different chairs —</b> <b>their images should be closer,</b> <b>their representations should be closer.</b>

<b>But a chair and a cat</b> <b>should be farther apart.</b>

<b>Mm-hmm.</b>

<b>So this is the basic</b> <b>logic of contrastive learning.</b>
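This "pull the same object together, push different objects apart" logic can be sketched with a generic InfoNCE-style contrastive loss. This is a toy illustration of the general idea, not MoCo's exact formulation; the hand-picked 2-D vectors stand in for learned representations:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: similarity to the positive (another view
    of the same object) should beat similarity to every negative."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / tau for s in sims]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

chair_a = [1.0, 0.1]   # two views of the same chair: nearby points
chair_b = [0.9, 0.2]
cat     = [-0.2, 1.0]  # a cat: should sit far away in the space

low  = contrastive_loss(chair_a, chair_b, [cat])  # correct geometry
high = contrastive_loss(chair_a, cat, [chair_b])  # wrong geometry
```

The loss is small when the anchor sits near its positive and far from the negatives, which is exactly the "chairs close, chair and cat far apart" geometry described above.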

<b>And this</b> <b>is actually not new.</b>

<b>It's been done for many, many years.</b>

<b>By the way, this</b> <b>early work</b> <b>was actually first done by Yann</b> <b>together with his students.</b>

<b>That's very interesting.</b>

<b>Of course the problem being solved</b> <b>wasn't directly Representation Learning,</b> <b>but some Metric Learning problems.</b> <b>But that's okay.</b>

<b>This was around 2019,</b> <b>I think we gave contrastive learning</b> <b>some new meaning.</b>

<b>But</b> <b>But this didn't come out of nowhere.</b>

<b>Actually before that,</b> <b>the entire field was slowly moving in this direction,</b> <b>expanding.</b>

<b>For example, there was a paper called CPC,</b> <b>and another paper called Memory Bank.</b>

<b>These two papers were already moving in this direction —</b> <b>using contrastive learning to do</b> <b>self-supervised learning, having already taken several steps.</b>

<b>Right, and then</b> <b>this is</b> <b>where I can't help but admire Kaiming's ability.</b>

<b>I think this is also</b> <b>a moment that made me think, wow,</b> <b>what a top-tier researcher.</b> <b>Or rather, I can't just say top-tier researcher.</b>

<b>Kaiming in my heart</b> <b>is simply the best researcher.</b>

<b>How does he actually work day-to-day?</b>

<b>Mm okay.</b>

<b>I think there are several points.</b>

<b>Maybe we can briefly talk about it.</b>

<b>that is,</b> <b>I think he has a kind of extreme focus.</b>

<b>This focus allows him to enter a kind of flow state,</b> <b>this kind of mind flow,</b> <b>right, he can immerse himself in a problem</b> <b>without needing to consider what's happening</b> <b>in the rest of the world.</b>

<b>Mm.</b>

<b>And I find this particularly</b> <b>particularly admirable.</b>

<b>And another thing is,</b> <b>how does his focus manifest?</b>

<b>I think his focus shows in that</b>

<b>every day, apart from this one problem,</b> <b>he won't think about anything else.</b>

<b>He'll grab the people collaborating with him</b> <b>to talk about it,</b> <b>and grab other people to talk about it too.</b>

<b>In any case, this topic is the main subject</b> <b>of his thinking.</b>

<b>Oh.</b>

<b>And most of his mental cycles</b> <b>are allocated to</b> <b>this one specific problem.</b>

<b>Oh.</b>

<b>This is very difficult.</b>

<b>I think it's extremely, extremely hard.</b>

<b>Right because</b> <b>thoughts are often very hard to control.</b>

<b>Yes yes yes.</b>

<b>Ah right.</b>

<b>This is related to world models.</b>

<b>Thoughts are hard to control.</b>

<b>That's a good point.</b>

<b>But Kaiming is actually someone very</b> <b>capable of this kind of focused decision-making,</b> <b>able to concentrate.</b>

<b>Mm.</b>

<b>I actually think there are several points.</b>

<b>I think a top researcher</b> <b>needs this ability to varying degrees.</b>

<b>They need sufficient focus,</b> <b>they need good</b> <b>research taste.</b>

<b>research taste.</b> <b>How do you define that?</b>

<b>We can talk about it later.</b>

<b>Mm.</b>

<b>And they also need a certain steadfastness —</b> <b>you can't just go with the flow</b> <b>and</b> <b>do what others are interested in.</b>

<b>And of course</b> <b>you also need strong engineering skills,</b> <b>research intuition,</b> <b>including when you read literature,</b> <b>you know what's important</b> <b>and what's not.</b>

<b>This is very important.</b>

<b>You also know</b> <b>that this</b> <b>is actually something quite odd</b> <b>about academia.</b>

<b>That is,</b> <b>you have to be able to highlight the key points.</b>

<b>Right.</b>

<b>The main reason is also that people often don't state them clearly.</b>

<b>You know?</b> <b>Sometimes people simply can't articulate the key points,</b> <b>sometimes people are unwilling to state them,</b> <b>and sometimes</b> <b>people haven't realized what the key points are.</b>

<b>But Kaiming's ability is</b> <b>he can peel away the layers</b> <b>and extract these key points,</b> <b>then tell you,</b> <b>and establish</b> <b>these connections in this high-dimensional abstract space.</b>

<b>These connections.</b>

<b>Oh.</b>

<b>I find this extremely, extremely impressive.</b>

<b>Right. So</b>

<b>many times</b> <b>each of Kaiming's ideas</b> <b>didn't come from sitting in some corner somewhere,</b> <b>dreaming them up at home.</b>

<b>They actually come</b> <b>from constant exploration,</b> <b>extensive reading,</b> <b>extensive thinking,</b> <b>derived little by little.</b>

<b>And this, I think, truly deeply</b> <b>influenced the way I do research,</b> <b>and what I now tell my students</b> <b>about how research should be done.</b>

<b>It's about increasing input.</b>

<b>Increasing input.</b>

<b>And</b> <b>I think</b> <b>there's actually a paradigm here.</b>

<b>Mm, in this,</b> <b>this paradigm is also something Kaiming taught me.</b>

<b>Right, he said</b> <b>actually all these ideas</b> <b>you can't just sit there and think up,</b> <b>because if you want to come up with an idea</b>

<b>by just thinking, it's definitely not a good idea.</b>

<b>There are really only a few possibilities.</b>

<b>The first possibility:</b> <b>you're smarter than everyone else in the world,</b> <b>so</b> <b>you come up with an incredibly brilliant idea</b> <b>that no one else can think of.</b>

<b>But I think the probability of this is extremely small.</b>

<b>So the more likely two possibilities: first,</b> <b>while you're thinking of this idea,</b> <b>100 people,</b> <b>1000 people,</b> <b>10,000 people in the world are thinking the same idea.</b>

<b>So you'll have to compete with them,</b> <b>and your execution speed may not be faster than theirs.</b>

<b>The second possibility:</b> <b>this is a very bad idea</b> <b>that others have already tried many times</b> <b>unsuccessfully.</b>

<b>Mm.</b>

<b>Then you probably don't need to try either.</b>

<b>Mm. So</b>

<b>So I think Kaiming's greatest influence on me is</b> <b>he taught me how to find a research idea.</b>

<b>Mm. How?</b>

<b>I think this is a process of seeking.</b>

<b>so</b> <b>Now I,</b> <b>when new students come in,</b> <b>I will tell everyone</b> <b>about a research cycle.</b>

<b>Uh, of course I hope it could be longer,</b> <b>but in today's competitive environment,</b> <b>there might be at most 6 months.</b>

<b>That is,</b> <b>from the start of those 6 months,</b> <b>you need to start thinking about an idea,</b> <b>and then</b> <b>you need to write this idea into a paper</b> <b>and publish it.</b>

<b>This whole cycle is about 6 months.</b>

<b>What does this process look like?</b>

<b>You need to have a general direction,</b> <b>you need to know what you want to do.</b>

<b>You can't know nothing at all;</b> <b>just saying "I want to do research" isn't enough.</b>

<b>This</b> <b>can be achieved by talking with your advisor,</b> <b>or with your peers,</b> <b>discussing with your classmates,</b> <b>or through your own reading,</b> <b>developing some general direction,</b> <b>this directional understanding.</b>

<b>Mm right?</b>

<b>But</b> <b>you must give yourself enough time and space</b> <b>to explore.</b>

<b>And this exploration,</b> <b>this exploration phase,</b> <b>I think</b> <b>should last at least one to two months.</b>

<b>What should you do during the exploration phase?</b>

<b>The exploration phase —</b> <b>good question. What do you do during exploration?</b>

<b>You can't just sit there thinking.</b>

<b>What you need to do in exploration is</b> <b>constantly hack on things,</b> <b>ah,</b> <b>that is,</b> <b>you really have to be like a hacker,</b> <b>playing with things,</b> <b>messing around with things.</b>

<b>Treat research like a game,</b> <b>like a toy to play with.</b>

<b>Mm, this might involve, for example,</b> <b>working through formulas,</b> <b>reading more papers,</b> <b>finding some connections,</b> <b>of course,</b> <b>and perhaps more importantly, actually doing things,</b> <b>writing code.</b>

<b>But when you're writing code,</b> <b>what you need to note is</b> <b>the code you write</b> <b>is not your initial starting idea</b> <b>or direction,</b> <b>but an exploration process.</b>

<b>So the code you write</b> <b>might simply reproduce a baseline,</b> <b>take what someone else's paper is doing</b> <b>and reproduce it.</b>

<b>And it might also be some kind of extension</b> <b>built on top of this baseline.</b>

<b>Mm.</b>

<b>And the most important thing in all this</b> <b>is to find a signal.</b>

<b>that is,</b> <b>it's still a bit like what you just said —</b> <b>all of this decision-making process</b> <b>is actually a quite disorderly exploration process.</b>

<b>It's what we call stochastic gradient descent.</b>

<b>Right?</b>

<b>This is a cornerstone of all machine learning,</b> <b>but it equally applies to research itself</b> <b>and to our lives.</b>

<b>that is,</b> <b>In everyone's pursuit of their ultimate goal,</b> <b>they're all going through a stochastic</b> <b>gradient descent process.</b>

<b>Mm.</b>
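For readers unfamiliar with the metaphor's source: stochastic gradient descent simply follows a noisy estimate of the slope downhill, step by step. A toy 1-D sketch, where the `sgd` helper and its parameters are purely illustrative:

```python
import random

def sgd(grad, x0, lr=0.1, steps=200, noise=0.5, seed=0):
    """Minimize a function given noisy gradient evaluations.
    Each step follows the gradient estimate downhill, so even a
    noisy signal, accumulated over many steps, finds the minimum."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        g = grad(x) + rng.gauss(0, noise)  # noisy gradient "signal"
        x -= lr * g
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x_min = sgd(lambda x: 2 * (x - 3), x0=-5.0)
# x_min ends up close to 3 despite the noise in every step.
```

Each individual gradient is noisy, yet the accumulated steps still find the minimum, which is the point of the analogy: noisy signals from many experiments still tell you which way to move.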

<b>And I think research is the same.</b>

<b>For you,</b> <b>the most important thing in research</b> <b>is not going from point A to point B.</b>

<b>For example, A is an idea,</b> <b>B is a paper,</b> <b>but rather in this process,</b> <b>what kind of signal can you find?</b>

<b>Your gradient,</b> <b>where exactly is your gradient?</b>

<b>Right. So</b>

<b>Kaiming's view is</b> <b>this gradient itself</b> <b>is the source of your real idea.</b>

<b>When you've gone through constant exploration,</b> <b>tried many things,</b> <b>possibly unsuccessful,</b> <b>possibly successful,</b> <b>by the way, it doesn't have to be a successful experiment</b> <b>to give you this gradient.</b>

<b>Sometimes a failed experiment</b> <b>gives you a larger gradient.</b>

<b>Right?</b>

<b>That is,</b> <b>the most feared thing is not knowing which direction to go.</b>

<b>Mm.</b>

<b>So a good result,</b> <b>a bad result,</b> <b>are both good results.</b>

<b>For research,</b> <b>a surprise,</b> <b>something surprising,</b> <b>such an observation,</b> <b>is always the most joyful thing</b> <b>for a researcher.</b>

<b>Something unexpected that you observed.</b>

<b>Right.</b> <b>You saw something unexpected.</b>

<b>Mm.</b>

<b>So he said,</b> <b>it's in this kind of exploration process</b> <b>that the ideas you discover</b> <b>are truly your own ideas.</b>

<b>The idea you started with isn't your idea.</b>

<b>That thing doesn't belong to you.</b>

<b>The idea found in exploration is your own idea.</b>

<b>And the research process</b> <b>is about finding</b> <b>your own idea.</b>

<b>And the word "own" matters:</b> <b>this thing</b> <b>is truly your own.</b>

<b>Like heaven gave you an inspiration,</b> <b>injected it into your head.</b>

<b>Right, on one hand heaven gives you inspiration,</b> <b>on the other hand,</b> <b>it's also based on extensive empirical work and practice.</b>

<b>Right?</b>

<b>There's no free lunch here.</b>

<b>Maybe you're truly a genius,</b> <b>or maybe you're extremely lucky,</b> <b>God holding your hand</b> <b>wrote this formula.</b>

<b>It can happen.</b>

<b>But most of the time, most progress,</b> <b>even most work that has great</b> <b>influence on the field,</b> <b>I think still happens step by step.</b>

<b>You can always trace back</b> <b>to find its starting point.</b>

<b>So I also tell students</b> <b>what's actually the worst kind of research?</b>

<b>It's when you define a problem at the start,</b> <b>say this is my idea,</b> <b>and in the end publish a paper</b> <b>whose idea</b> <b>is exactly the same as what you started with.</b>

<b>You didn't encounter any obstacles,</b> <b>you didn't encounter any difficulties.</b>

<b>Why is it the worst?</b>

<b>Because this shows</b> <b>your idea is a boring idea,</b> <b>and your published paper is a boring paper.</b>

<b>Right.</b>

<b>I think</b> <b>after many years of observation,</b> <b>this is indeed very, very accurate.</b>

<b>So I think this is also why</b> <b>I tell students this —</b> <b>because</b> <b>people sometimes can't accept this fact.</b>

<b>People always think</b> <b>I should start by thinking of a clever trick,</b> <b>then implement it,</b> <b>make it work,</b> <b>publish a paper,</b> <b>I've succeeded,</b> <b>and I move on to the next thing.</b>

<b>But what this can give for personal accumulation</b> <b>is actually very, very limited.</b>

<b>The exploration process is actually very difficult.</b>

<b>Many people don't know how to explore.</b>

<b>Exploration is very hard.</b>

<b>And this is why</b> <b>all these papers in my view are nonlinear.</b>

<b>This nonlinearity shows in two aspects.</b>

<b>The first is your 6 months of time —</b> <b>by the 5th month,</b> <b>like I just told you,</b> <b>your mindset collapses.</b>

<b>This ResNeXt story —</b> <b>on one hand people hear, wow,</b> <b>you changed direction in the last month</b> <b>and made it work.</b>

<b>That time period is so short,</b> <b>and you still managed to do it.</b>

<b>It sounds unbelievable.</b>

<b>But once you know this happens too often,</b> <b>you find there really is a pattern.</b>

<b>You often go through this.</b>

<b>I often go through this.</b>

<b>Or rather, my best work always happens this way.</b>

<b>So how do you maintain your mindset for the first 5 months?</b>

<b>Uh, there's no way around it.</b>

<b>You have to accept this fact,</b> <b>you have to be able to tell yourself</b> <b>this is a normal research process.</b>

<b>Would you consider switching direction in the first 5 months?</b>

<b>I might go for that boring idea.</b>

<b>I think you would.</b>

<b>And</b> <b>changing direction is actually very, very important.</b>

<b>You must learn to pivot.</b>

<b>Because I just said,</b> <b>the worst work is</b> <b>when your starting idea is the same idea</b> <b>as your ending idea.</b>

<b>The best work is</b> <b>when you've gone all around,</b> <b>jumping here and there,</b> <b>taken a long, winding road,</b> <b>and only then arrived at this point.</b>

<b>Mm.</b>

<b>Though this road is very bumpy,</b> <b>from the final destination</b> <b>step by step</b> <b>you can always trace back to the very beginning.</b>

<b>Only then can it be connected into a line.</b>

<b>But during the process, you can't.</b>

<b>Yes, during the process</b> <b>I think</b> <b>you're in the process —</b> <b>because you don't know,</b> <b>you can't predict the future.</b>

<b>So this is always an exploration process.</b>

<b>So I think about two months of exploration,</b> <b>gradually forming an idea,</b> <b>then gradually expanding,</b> <b>then scaling up,</b> <b>Right?</b>

<b>Right?</b> <b>then supplementing experiments sufficiently,</b> <b>This thing,</b> <b>might take another two to three months,</b> <b>and finally writing the paper</b> <b>— then spending one to two months —</b> <b>this is</b> <b>already a very</b> <b>smooth research process.</b>

<b>Mm.</b>

<b>And I think</b> <b>this again,</b> <b>in today's era,</b> <b>faces many, many challenges.</b>

<b>People face all kinds of pressure.</b>

<b>Right? I think the competitive pressure now is too great.</b>

<b>The competitive pressure is too great.</b>

<b>and</b> <b>I think</b> <b>It makes people feel</b> <b>they must chase the cutting edge</b> <b>and finish things as soon as possible,</b> <b>seize the opportunity.</b>

<b>Mm.</b>

<b>Claim the territory.</b>

<b>but looking back,</b> <b>I think, as I just said,</b> <b>Professor Fei-Fei's greatest strength</b> <b>is</b> <b>that she's someone who can define problems.</b> <b>If you lose the ability to define problems,</b> <b>you essentially also lose much of the ability to innovate,</b> <b>essentially also lose the ability to do research.</b>

<b>And this</b> <b>I just said research is nonlinear,</b> <b>that's in terms of time.</b>

<b>But in terms of results,</b> <b>it's also nonlinear.</b>

<b>Mm.</b>

<b>This actually comes from MIT professor Bill Freeman —</b> <b>he has a very classic</b> <b>plot,</b> <b>an illustration.</b>

<b>He often talks about it when giving talks.</b>

<b>So,</b> <b>This graphic has a horizontal axis</b> <b>and a vertical axis.</b>

<b>The horizontal axis goes from a very poor work,</b> <b>to a decent work,</b> <b>to a very good work,</b> <b>to an exceptionally impressive work.</b>

<b>This is the horizontal axis.</b>

<b>The vertical axis</b> <b>is the impact on your entire career.</b>

<b>The impact of this paper on your career.</b>

<b>So you can guess</b> <b>what this curve actually looks like.</b>

<b>Right? It's not a linear curve.</b>

<b>It's not that a very poor work</b> <b>has a very bad career impact,</b> <b>and the best work</b> <b>or a fairly good work</b> <b>gives you a very good return,</b> <b>gradually increasing.</b>

<b>It's not linear.</b>

<b>It's not linear.</b>

<b>It's saying</b> <b>basically, a very poor work</b> <b>actually won't hurt you much,</b> <b>nobody cares.</b>

<b>Mm.</b>

<b>No one will notice.</b>

<b>A decent work —</b> <b>no one notices either.</b>

<b>The gains it brings you are also small.</b>

<b>Mm.</b>

<b>But sometimes,</b> <b>when you produce a very good piece of work,</b> <b>an exceptionally impressive work,</b> <b>work that everyone knows about,</b> <b>your impact</b> <b>— I said I don't like the word impact —</b> <b>reaches the top.</b>

<b>This thing,</b> <b>this,</b> <b>immediately shoots up to</b> <b>the top.</b>

<b>Right?</b>

<b>So we often say in academia</b> <b>what people measure is the so-called signature work.</b>

<b>Or another way to put it:</b> <b>people say</b> <b>what you optimize for is not an average —</b> <b>not the average of all your previous work —</b> <b>an average.</b>

<b>But</b> <b>what you're optimizing is</b> <b>the maximum of your work.</b>

<b>Right, the highest point.</b>

<b>I think this illustrates</b> <b>the research game's</b> <b>nonlinear characteristic.</b>

<b>Mm.</b>

<b>So is the highest point good or not?</b>

<b>Of course it's good!</b>

<b>That is,</b> <b>you only need to</b> <b>succeed just once in your lifetime.</b>

<b>And this</b> <b>I actually gave a talk about this at CVPR,</b> <b>I called it research: the infinite game.</b>

<b>Mm right?</b>

<b>This</b> <b>got quite a strong response from everyone.</b>

<b>I think actually</b> <b>I rarely give these non-technical</b> <b>talks,</b> <b>because this is more about philosophical thinking</b> <b>and some summaries.</b>

<b>That one was actually quite good.</b>

<b>But it also</b> <b>contained everything I talked about above.</b>

<b>Because think about it,</b> <b>research as a</b> <b>career,</b> <b>a researcher as a</b> <b>profession,</b> <b>what is its</b> <b>true essence?</b>

<b>Oh.</b>

<b>It's not a chess player,</b> <b>it's not</b> <b>even a Winter Olympics athlete.</b>

<b>Because for a chess player and an athlete,</b> <b>your final achievement depends on your worst step</b> <b>to some extent.</b>

<b>You have to ensure every step,</b> <b>your moves must be correct.</b>

<b>If you make even a small mistake in the middle,</b> <b>if you make a small error in chess,</b> <b>placed a piece wrong once,</b> <b>you've lost.</b>

<b>You've lost.</b>

<b>Right?</b>

<b>So this is a finite game.</b>

<b>In this process,</b> <b>there are always winners</b> <b>and always losers.</b>

<b>But a researcher is more like an inventor:</b> <b>you in your lifetime</b> <b>truly only need to succeed once.</b>

<b>Mm.</b>

<b>If you're lucky enough,</b> <b>you can succeed a few times.</b>

<b>Twice maybe. But you don't need to succeed 100 times.</b>

<b>Two times gets you to the top?</b>

<b>I think</b> <b>I think so.</b>

<b>Oh.</b>

<b>So I think</b> <b>this is actually quite interesting.</b>

<b>so</b> <b>I think</b> <b>as the entire field moves forward,</b> <b>there needs to be some reflection.</b>

<b>I think now,</b> <b>the traditional academic world,</b> <b>whether in its social responsibility</b> <b>or its positioning in the entire research landscape,</b> <b>was always the one setting the rules of the game,</b> <b>always the one deciding where we go next.</b>

<b>Right?</b>

<b>Now it's completely different.</b>

<b>Now the ones deciding where things go</b> <b>are OpenAI,</b> <b>ah,</b> <b>maybe Google,</b> <b>or Meta or other major companies.</b>

<b>Right, they're playing a finite game —</b> <b>they're playing a finite game against each other.</b>

<b>But this has caused them to drag academia into</b> <b>a finite game,</b> <b>this kind of decision-making chain.</b>

<b>Right?</b>

<b>So you see</b> <b>many times when a major company releases something,</b> <b>whether it's called some o-series,</b> <b>or some GPT series,</b> <b>or the Nano Banana series,</b> <b>a specific piece of work,</b> <b>a product launch,</b> <b>immediately everyone in academia swarms in</b> <b>saying, how can we within this paradigm,</b> <b>using what you'd call peanuts of resources,</b>

<b>Mm.</b>

<b>try to chase it?</b>

<b>Oh chasing.</b>

<b>What's the point?</b>

<b>Reproduce right?</b>

<b>Or maybe people don't believe they can,</b> <b>right, as you said,</b> <b>they probably can't catch up anyway.</b>

<b>So it becomes some kind of</b> <b>reproduction in a sense,</b> <b>or building on top of it.</b> <b>I think</b> <b>this kind of research process</b> <b>is actually very, very painful.</b>

<b>Because there's one more thing I haven't mentioned.</b>

<b>For the past two years at NYU,</b> <b>I've actually also been working part-time at Google.</b>

<b>Mm.</b>

<b>Working part-time.</b>

<b>And this</b> <b>was in the Nano Banana team,</b> <b>right, in the Nano Banana team,</b> <b>the team within GenAI.</b>

<b>and</b> <b>This went on for two years.</b>

<b>so</b> <b>Not sure if I should share this,</b> <b>but let's share. Sometimes I tell some friends,</b> <b>the reason I went to do this work at Google</b> <b>is I wanted to see what people at Google were doing,</b> <b>so I would know what</b> <b>not to do in academia.</b>

<b>Oh.</b>

<b>That is, I need to know what you're doing,</b> <b>so I know what not to do.</b>

<b>Because if I know you're doing this,</b> <b>why would I do it alongside you?</b>

<b>Makes sense.</b>

<b>Because they have more resources.</b>

<b>it has more resources.</b>

<b>No need to compete with them.</b>

<b>Yes yes yes.</b>

<b>So this is also something that guides us.</b>

<b>Right, I don't want to be too preachy.</b>

<b>By the way, this disclaimer:</b> <b>all of what I've said</b> <b>is only based on my experience at NYU,</b> <b>not particularly successful,</b> <b>just sharing some experience.</b>

<b>It doesn't represent the diversity</b> <b>and complexity of research worldwide.</b>

<b>And looking back,</b> <b>I can also say</b> <b>some papers I do want to</b> <b>share with everyone,</b> <b>but looking back,</b> <b>I haven't produced a paper</b> <b>that I truly think has real value.</b>

<b>You're saying this to tell everyone</b> <b>I haven't reached the highest point yet,</b> <b>I haven't reached that Max yet.</b>

<b>You're right.</b>

<b>I'm still young.</b>

<b>[laughs]</b> <b>I can still work harder.</b>

<b>Mm.</b>

<b>But it really is like this.</b>

<b>Because yesterday I was thinking about this question.</b>

<b>I think there might be about 20 such papers,</b> <b>twenty-something papers,</b> <b>and</b> <b>that have profoundly influenced all of deep learning</b> <b>and the progress of AI.</b>

<b>If this world has 20 such papers,</b> <b>or 25 papers,</b> <b>I don't have a single one.</b>

<b>What reason do I have not to keep working hard,</b> <b>to keep going?</b>

<b>I think this is a goal.</b>

<b>Doesn't DiT count?</b>

<b>Uh, I think it counts as 0.25.</b>

<b>Or DiT</b> <b>is more like</b> <b>pushing along the tangent of the research frontier,</b> <b>taking a small step forward.</b>

<b>If we didn't do it,</b> <b>someone else would have.</b>

<b>It doesn't completely belong to you.</b>

<b>Right, it doesn't.</b>

<b>Completely belong to me.</b>

<b>Mm.</b>

<b>You're right.</b>

<b>Yes.</b>

<b>Yes.</b>

<b>Or rather,</b> <b>I think</b> <b>diffusion models certainly count,</b> <b>including</b> <b>maybe DDPM.</b>

<b>Right.</b>

<b>and</b> <b>I don't know.</b>

<b>Maybe we can list some.</b>

<b>I think this might be quite interesting.</b>

<b>I think LeNet counts.</b>

<b>I might</b> <b>not be able to list them all.</b>

<b>Okay, let's just list some.</b>

<b>Papers that have influenced AI's progress, right?</b>

<b>Right.</b>

<b>Or rather,</b> <b>in my view,</b> <b>these are things that can truly be called signature works,</b> <b>works that I'm still very far from.</b>

<b>Right?</b> <b>I think</b> <b>ah,</b> <b>LeNet of course counts.</b>

<b>AlexNet of course counts.</b>

<b>Mm, and then</b> <b>ImageNet of course counts.</b>

<b>ResNet of course counts.</b>

<b>Mm.</b>

<b>R-CNN or Faster R-CNN, the detection part,</b> <b>of course counts.</b>

<b>Kaiming's already on there several times.</b>

<b>and</b> <b>What else?</b>

<b>What else?</b> <b>Transformer of course counts.</b>

<b>Attention is all you need,</b> <b>of course counts.</b>

<b>GPT-3 of course counts.</b>

<b>BERT of course counts.</b>

<b>I think CLIP counts too.</b>

<b>ViT I think counts too.</b>

<b>Vision Transformer,</b> <b>I think counts too.</b>

<b>And GAN,</b> <b>I think counts too.</b>

<b>Okay,</b> <b>can't list them all.</b>

<b>Roughly at that level.</b>

<b>Including in 3D,</b> <b>NeRF (Neural Radiance Field),</b> <b>Gaussian Splatting,</b> <b>I think both count.</b>

<b>They all count.</b>

<b>so</b> <b>Across different fields.</b>

<b>They all have these works.</b>

<b>The significance of these works is that</b> <b>everyone was originally</b> <b>gradually moving toward a direction,</b> <b>and then suddenly a paper like this appears out of nowhere,</b> <b>completely changing the stochastic gradient</b> <b>descent process I just mentioned.</b>

<b>So you see the convergence curve</b> <b>take a sudden drop.</b>

<b>Mm.</b>

<b>This is how I define this.</b>

<b>And I think,</b> <b>if this long river of history</b> <b>means the curve continues forward,</b> <b>then time and time again,</b> <b>papers like these appear,</b> <b>allowing everyone to break out of previous local optima</b> <b>or enter the next stage.</b>

<b>But I think we're still far from done.</b>

<b>This path is far from convergence.</b>

<b>I think there are still many things to be done.</b>

<b>I think it doesn't need to be me personally,</b> <b>but at least I hope to be able to participate.</b>

<b>Right. I hope,</b>

<b>assuming there's a next revolution,</b> <b>that looking back,</b>

<b>maybe it's not about me creating some impact,</b> <b>but because of my personal experience,</b> <b>the patterns of collaboration around me,</b> <b>my own understanding,</b> <b>my own thinking,</b> <b>I am able to understand certain things,</b> <b>and what I understand can somehow</b> <b>have some influence on</b> <b>the world's or AI's development.</b>

<b>Mm.</b>

<b>I think</b> <b>this is something I care very much about now.</b>

<b>Mm.</b>

<b>Is there no hope from LLMs for this?</b>

<b>The next revolution.</b>

<b>Again,</b> <b>I think absolutely not.</b>

<b>No hope?</b>

<b>Or rather,</b> <b>I would say LLMs will eventually fade.</b>

<b>No no no.</b>

<b>LLMs</b> <b>will never die,</b> <b>but will eventually fade.</b>

<b>Old soldiers never die,</b> <b>they just fade away.</b>

<b>Right?</b>

<b>Why will they eventually fade?</b>

<b>They won't die.</b>

<b>They will just fade away.</b>

<b>That is, it will definitely have its value,</b> <b>it's a very good tool.</b>

<b>I use LLMs every day now.</b>

<b>But it's not the foundation for building a universal,</b> <b>a general intelligence system.</b>

<b>It's not the foundation</b> <b>on which the</b> <b>world-model building will stand.</b>

<b>World model,</b> <b>we'll talk about it later.</b>

<b>Your work —</b> <b>do you want to expand on it?</b>

<b>Well, since you've brought it up,</b> <b>let me say a bit more.</b>

<b>Is there time?</b>

<b>Yes. You've already said you haven't reached Max.</b>

<b>Yes yes right.</b>

<b>Put that way,</b> <b>it seems there's nothing much to talk about with these works.</b>

<b>But I think there's still some significance.</b>

<b>Because,</b> <b>just like I said about non-linear research,</b> <b>in a paper,</b> <b>we first do some things,</b> <b>then gradually</b> <b>build up some reserves,</b> <b>and then in the last month,</b> <b>find a new direction</b> <b>and deliver</b> <b>the final result.</b>

<b>Mm. I think,</b>

<b>When I look at all my previous work,</b> <b>I also have this feeling:</b> <b>I'm still in that initial confused exploration phase.</b>

<b>But who knows:</b> <b>maybe this year,</b> <b>maybe next year,</b> <b>I'll suddenly</b> <b>have a spiritual awakening</b> <b>and produce some more meaningful work.</b>

<b>Mm-hmm.</b>

<b>But I think the foundation here is</b> <b>as I just said,</b> <b>it needs to be able to string together a thread.</b>

<b>Or rather,</b> <b>it's actually not a line,</b> <b>it's a graph.</b>

<b>It has different nodes,</b> <b>different nodes connected to each other,</b> <b>each node is a paper,</b> <b>all with connections between them.</b>

<b>Your subsequent papers</b> <b>are all influenced by all the previous papers.</b>

<b>Mm right.</b>

<b>So later,</b> <b>for example, with Contrastive Learning,</b> <b>making it work meant</b> <b>we saw it work for the first time on visual tasks</b> <b>with MoCo,</b> <b>especially since we had</b> <b>V1, V2,</b> <b>and V3.</b>

<b>And in V3,</b> <b>we used a Transformer,</b> <b>and we scaled up,</b> <b>and the representations were actually already better than what ImageNet pretraining could give,</b> <b>across all kinds of tasks.</b>

<b>This for us was</b> <b>actually a major surprise.</b>

<b>Mm.</b>

<b>Mm-hmm.</b>

<b>At that time,</b> <b>at that point,</b> <b>I thought, wow,</b> <b>everything is flourishing again.</b>

<b>Our problem can basically be answered.</b>

<b>We found a way —</b> <b>self-supervised learning —</b> <b>that can work.</b>

<b>Going forward,</b> <b>we just need to scale up what we're doing now,</b> <b>and</b> <b>the future is incredibly bright.</b>

<b>But unfortunately,</b> <b>this also didn't happen.</b>

<b>Right?</b>

<b>But before that,</b> <b>we had another paper.</b> <b>By the way,</b> <b>MoCo and MAE were both projects Kaiming led.</b>

<b>Actually, people say</b> <b>what does it mean to lead a project?</b>

<b>I think</b> <b>Kaiming truly demonstrated this leadership:</b> <b>he truly took on 80-90% of the first-author</b> <b>plus last-author,</b> <b>or corresponding-author,</b> <b>responsibilities.</b>

<b>He needed to write the baseline himself,</b> <b>run many, many experiments himself,</b> <b>finalize the paper himself,</b> <b>tell the story, present it,</b> <b>all of these things</b> <b>basically Kaiming did single-handedly.</b>

<b>And accomplished it.</b>

<b>So what about others?</b>

<b>Others,</b> <b>we</b> <b>of course also participated</b> <b>and made contributions.</b>

<b>But I'm just saying</b> <b>this is a path Kaiming led.</b>

<b>Right we</b> <b>accelerated the progress of this,</b> <b>and may have made the results much better too.</b>

<b>Mm.</b>

<b>But it doesn't change the essence of this.</b>

<b>Right.</b>

<b>So this is Kaiming.</b>

<b>Even now, for example, just a couple of days ago he told me</b> <b>he really enjoys this kind of</b> <b>IC work,</b> <b>the individual contributor</b> <b>type of role.</b>

<b>Mm.</b>

<b>He doesn't enjoy managing a large team,</b> <b>getting everyone together,</b> <b>just being a manager pointing the direction.</b>

<b>He doesn't like that.</b>

<b>How many people does he manage now?</b>

<b>He has many, many people.</b>

<b>He now has many undergraduates</b> <b>visiting him,</b> <b>and he</b> <b>is also doing a lot of really great work.</b>

<b>So I actually don't believe him.</b>

<b>I tell him,</b> <b>"You're actually a very good manager."</b>

<b>At least for me,</b> <b>even though you never really managed me,</b> <b>just being around you,</b> <b>I could feel my own efficiency improving,</b> <b>feeling like I was getting smarter.</b>

<b>I think,</b> <b>if I were going to have a manager,</b> <b>I'd want one like that:</b>

<b>one who can empower the people around him to get better.</b>

<b>Right.</b>

<b>I think this is Kaiming.</b>

<b>So MAE —</b> <b>in any case,</b> <b>we explored the Contrastive Learning path,</b> <b>and found</b> <b>it couldn't scale up.</b>

<b>So we wanted to switch directions.</b>

<b>So we went back</b> <b>and used a simpler approach,</b> <b>which is a kind of denoising</b> <b>autoencoder:</b> <b>the Masked Autoencoder (MAE).</b>

<b>This method is even simpler.</b>

<b>Everyone can go read the paper,</b> <b>but in short,</b> <b>it works by taking some images</b> <b>and corrupting them,</b> <b>then reconstructing</b> <b>these</b> <b>noisy,</b> <b>cropped,</b> <b>or masked images,</b> <b>to learn representations.</b>

<b>Mm.</b>
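The corruption-and-reconstruction recipe described above can be sketched in a few lines. This is a toy numpy illustration, not the actual MAE (which uses a ViT encoder over visible patches, learned mask tokens, and a small transformer decoder); the encoder and decoder here are deliberately trivial stand-ins. The only thing it demonstrates is the shape of the method: mask most patches, encode only the visible ones, and compute the loss on the masked patches alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def mae_step(image_patches, mask_ratio=0.75):
    """Toy sketch of one Masked Autoencoder step (illustrative only).

    image_patches: (num_patches, patch_dim) array.
    Returns the reconstruction loss, computed only on masked patches.
    """
    n, _ = image_patches.shape
    num_masked = int(n * mask_ratio)
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

    # "Encoder": identity over the visible patches (a real MAE uses a ViT).
    latent = image_patches[visible_idx]

    # "Decoder": predict every patch from the mean visible latent
    # (a real MAE uses learned mask tokens plus a small transformer).
    prediction = np.tile(latent.mean(axis=0), (n, 1))

    # Loss is computed ONLY on the masked patches, as in the paper.
    loss = np.mean((prediction[masked_idx] - image_patches[masked_idx]) ** 2)
    return loss, masked_idx, visible_idx

patches = rng.normal(size=(16, 8))   # 16 patches, 8 dims each
loss, masked, visible = mae_step(patches)
print(len(masked), len(visible))     # 12 masked, 4 visible
```

The 75% mask ratio mirrors the default reported in the MAE paper; masking that aggressively is what makes the reconstruction task hard enough to force useful representations.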

<b>This is</b> <b>fundamentally different from Contrastive Learning,</b> <b>but its results were also very good,</b> <b>although it has very different characteristics.</b>

<b>For example, it doesn't explicitly</b> <b>model invariance</b> <b>to certain transformations,</b> <b>which causes it, when doing linear probing,</b> <b>to perform slightly worse,</b> <b>but with fine-tuning</b> <b>(these are two different ways to test representations)</b> <b>the results are much better.</b> <b>In any case, they have different properties,</b> <b>the representations they learn also look different,</b> <b>and these things</b> <b>would have far-reaching consequences down the line.</b>

<b>we can talk more about this later</b> <b>but this was MAE</b> <b>at the time we thought</b> <b>wow, MAE is incredible</b> <b>MAE should at least win a best paper award, right?</b>

<b>Turns out it didn't.</b> <b>And surely scaling up MAE would solve all problems, right?</b>

<b>turned out it didn't scale up either</b> <b>right</b> <b>actually I heard</b> <b>you and Xiangyu (chief scientist at StepFun) had talked about this before</b> <b>because he also paid attention to self-supervised learning</b> <b>he actually also</b> <b>talked a lot about</b> <b>why self-supervised learning can't scale up</b> <b>some of the reasons</b> <b>I won't go into it again here</b> <b>feel free to go back and relisten to that episode</b> <b>but anyway, in short,</b> <b>back then there was this kind of</b>

<b>rollercoaster ride</b> <b>on the one hand, we got some really good results</b> <b>but on the other hand, these papers were just papers</b> <b>we were never able to</b> <b>truly deliver something real</b> <b>right, like GPT</b> <b>that could point everything toward a completely different</b> <b>scalable paradigm for the future</b> <b>yeah right</b> <b>I think this whole thing</b> <b>had, at that point, kind of</b> <b>come to a close</b>

<b>of course, at that time I also did</b> <b>some other work</b> <b>for example, I extended self-supervised learning</b> <b>for what you could call the first time</b> <b>into the 3D domain, for instance</b> <b>I also did some work on point clouds</b> <b>these</b> <b>were called Point Contrast</b> <b>but these works were perhaps more about</b> <b>demonstrating that representation learning</b> <b>as a concept</b> <b>is not just a problem for the image domain</b> <b>it's a very universal</b> <b>approach</b>

<b>or rather, a methodology</b> <b>it doesn't only work with images</b> <b>it also works in 3D space</b> <b>later on</b> <b>many people tried it on all kinds of medical imaging</b> <b>and also on robotics tasks</b> <b>all kinds of domains</b> <b>it holds up</b> <b>so this thing</b> <b>I don't see it as a failure</b> <b>because it really has been</b> <b>influencing many many different</b> <b>fields beyond what we were focused on</b>

<b>like computer vision itself</b> <b>but on the other hand</b> <b>it still hasn't achieved the same kind of impact as LLMs</b> <b>in terms of influence</b> <b>mm</b> <b>so then</b> <b>after all that, what came next?</b>

<b>right yeah</b> <b>it seems like we went back to</b> <b>an exploration phase</b> <b>all of this was at FAIR</b> <b>all done at FAIR</b> <b>you were there for 4 years during that phase</b> <b>4 years</b> <b>mm</b> <b>so was that the end of your FAIR chapter?</b>

<b>not yet</b> <b>still early, still early</b> <b>that was probably the first year or two</b> <b>right</b> <b>there's another fun story,</b> <b>let me brag about Kaiming again</b> <b>[laughter]</b> <b>so</b> <b>back then, resources were always an issue</b> <b>GPUs were always in short supply</b> <b>and then FAIR made a decision</b> <b>to give TPUs a try</b> <b>see if this thing is any good</b> <b>Google had been using them</b> <b>they</b>

<b>had fully transitioned to using TPUs</b> <b>so</b> <b>we got about 5,000 TPU chips</b> <b>these chips</b> <b>weren't bought, more like rented</b> <b>on Google Cloud</b> <b>and then</b> <b>it was originally set up for the people doing language models</b> <b>people played around with it</b> <b>and quickly found</b> <b>ugh, it's way too hard to use</b> <b>really not user-friendly</b> <b>okay,</b> <b>so Kaiming stepped up and said, let me handle it</b>

<b>so he truly, single-handedly</b> <b>I mean, again</b> <b>all on his own</b> <b>from start to finish</b> <b>built an entire infrastructure on TPUs</b> <b>which enabled us to do</b> <b>all the subsequent work</b> <b>including MoCo</b> <b>including MAE</b> <b>including the later DiT</b> <b>all of it</b> <b>happened on top of TPUs</b>

<b>so for me, this was</b> <b>a really important lesson</b> <b>which is</b> <b>how to summarize it...</b>

<b>it's like</b> <b>a craftsman who wants to do good work</b> <b>must first sharpen his tools</b> <b>mm</b> <b>one thing Kaiming taught me was</b> <b>the ceiling of your research</b> <b>actually depends on how good your baseline is</b> <b>oh</b> <b>because if your baseline is weak</b> <b>you can easily fool yourself</b> <b>oh</b> <b>you won't produce anything meaningful</b> <b>if you haven't put enough thought</b> <b>into the baseline level</b>

<b>into building this system properly</b> <b>into pushing the engineering to its limits</b> <b>you don't have a platform</b> <b>to do real exploration</b> <b>because you might find an interesting</b> <b>seemingly valuable signal</b> <b>but that signal could be completely wrong</b> <b>the reason being your baseline</b> <b>your benchmark itself wasn't good enough</b> <b>mm</b> <b>so this is actually quite counterintuitive</b> <b>because people always say</b>

<b>if my baseline is a bit weaker</b> <b>then the performance gains I can show</b> <b>would be larger</b> <b>so it's easier for me to publish papers</b> <b>right, but Kaiming doesn't think this way</b> <b>mm</b> <b>he thinks about how to</b> <b>push the baseline as high as it can go</b> <b>and then starting from that foundation</b> <b>whatever new things we build</b> <b>that's groundbreaking work</b> <b>that's a genuine breakthrough</b> <b>right</b> <b>anything you build on top of a weak baseline</b> <b>any improvement</b>

<b>might just be a throwaway paper</b> <b>so this thing</b> <b>has also been an inspiration to me</b> <b>including when they were working on detection</b> <b>I wasn't part of that work</b> <b>I was still doing my PhD</b> <b>but all of those</b> <b>Fast R-CNN, Mask R-CNN</b> <b>Focal Loss, and the whole series of work</b> <b>all of that work was because they</b> <b>including Ross Girshick</b> <b>including Kaiming</b>

<b>including Wu Yuxin</b> <b>who is now at Kimi</b> <b>they put enormous effort into building the infra</b> <b>and building that codebase</b> <b>so that the baselines</b> <b>the baselines for these methods</b> <b>already far exceeded all of those</b> <b>random mediocre CVPR papers</b> <b>mm</b> <b>our baseline was already stronger than yours</b> <b>so if I take one more step up</b> <b>of course I'm going to go even further</b> <b>mm</b> <b>so</b>

<b>I think I've always maintained this kind of</b> <b>methodology</b> <b>I think I place a lot of importance on</b> <b>this kind of</b> <b>I don't want to call it engineering</b> <b>because it's not entirely just about</b> <b>the codebase itself</b> <b>it's not</b> <b>like building a codebase at a product company</b> <b>that kind of relationship</b> <b>it's more like</b> <b>the scaffolding for a research breakthrough</b>

<b>if your scaffolding is unstable</b> <b>you can't build anything</b> <b>so</b> <b>this thing</b> <b>also influences what we do now</b> <b>but anyway, the point is</b> <b>Kaiming in terms of building this scaffolding</b> <b>was also truly exceptional</b> <b>I think you were so lucky</b> <b>because very early on someone told you</b> <b>a lot of the right ways to do things</b> <b>so in many areas</b> <b>you avoided a lot of wrong turns</b> <b>I think I was incredibly lucky</b>

<b>but I also hope</b> <b>though I think a lot of this really is</b> <b>on one hand, common sense</b> <b>but as you said, on the other hand</b> <b>for a student</b> <b>this might not be so obvious</b> <b>not so apparent</b> <b>mm</b> <b>like with this scaffolding thing</b> <b>when we were at FAIR</b> <b>there was a running joke</b> <b>kind of a joke, sort of</b> <b>the story goes that the first lesson for everyone interning at FAIR</b>

<b>guess what it was?</b>

<b>mm</b> <b>the first lesson</b> <b>was to use a certain tool</b> <b>guess what that tool was?</b>

<b>no idea</b> <b>that tool was</b> <b>an Excel spreadsheet</b> <b>[chuckles]</b> <b>this thing is also quite interesting</b> <b>so</b> <b>we'd have this whole system for tracking experiments</b> <b>of course, this might be a bit outdated now</b> <b>because nowadays there might be better</b> <b>tools like Feishu</b> <b>many better tools</b> <b>but back then</b> <b>we would meticulously</b>

<b>build this kind of template</b> <b>and this template was just an Excel file</b> <b>so sometimes we felt like office clerks</b> <b>I do research every day</b> <b>but it's not a screen full of code</b> <b>writing some fancy stuff</b> <b>instead, it's staring at this spreadsheet</b> <b>this Excel file</b> <b>looking at what each row represents</b> <b>the research part of this</b> <b>is how you design the spreadsheet</b> <b>how do you make sure</b>

<b>every experiment gives you</b> <b>what I just called this gradient</b> <b>right</b> <b>because you can always hit two extremes</b> <b>one extreme is you run too few experiments</b> <b>so your signal is unclear</b> <b>you don't know anything</b> <b>the other extreme is</b> <b>I don't care at all what experiments I'm running</b> <b>I just run experiments blindly</b> <b>right</b> <b>I have all these resources</b> <b>I just maximize my resources</b> <b>run all the jobs</b>

<b>dump all the results</b> <b>just throw everything into the spreadsheet</b> <b>and then feel satisfied</b> <b>thinking my research is done</b> <b>both of these are a pretty poor</b> <b>pattern for a student's research</b> <b>mm</b> <b>but back then, by watching how Kaiming</b> <b>built that kind of spreadsheet</b> <b>I learned an enormous amount</b> <b>right</b> <b>because you really have to make some decisions</b>

<b>those decisions being:</b> <b>what metrics should I actually focus on,</b> <b>right,</b> <b>what should I actually be recording,</b> <b>what columns should there be,</b> <b>how should I define control variables,</b> <b>and how to make each experiment as informative as possible</b> <b>mm</b> <b>okay so let's move on</b> <b>right, so what other things happened at FAIR</b> <b>then there's also the thing about DiT right</b> <b>but let's not jump to that yet</b> <b>let's continue the FAIR story</b>

<b>so after the self-supervised learning phase</b> <b>you entered an exploration phase again</b> <b>right</b> <b>so at that time, like I mentioned</b> <b>actually there's no real transition</b> <b>right, these things are all overlapping</b> <b>I may be doing one thing while also exploring something else</b> <b>right</b> <b>and at that time</b> <b>what I was most interested in actually was</b> <b>generative models</b> <b>at the time generative models was a big topic</b>

<b>GAN was already quite mature by then</b> <b>right</b> <b>then</b> <b>VAE and various other things</b> <b>were also starting to emerge</b> <b>yes</b> <b>then there was a paper</b> <b>which, back in maybe 2021 or 2022</b> <b>at the time of the DDPM paper</b> <b>right, it's the Denoising Diffusion Probabilistic Model</b> <b>mm</b> <b>this paper was very interesting to me</b> <b>because at the time the image quality</b> <b>actually wasn't that impressive yet</b> <b>I think the image quality was about on par with GAN</b>

<b>or even a bit worse, right</b> <b>but in terms of sample diversity</b> <b>it was much better than GAN</b> <b>right</b> <b>because GAN always has this mode collapse problem</b> <b>right, it tends to just generate one kind of image</b> <b>right</b> <b>but this thing was able to generate</b> <b>much more diverse content</b> <b>so I thought</b> <b>there might be something here</b> <b>but it's still not clear enough yet</b> <b>then we had a meeting</b> <b>in the group</b> <b>and we discussed this paper</b>

<b>and at the time Kaiming also said</b> <b>he thought this was interesting</b> <b>he also thought this was something worth pursuing</b> <b>but he had one question</b> <b>and this question I still remember to this day</b> <b>he asked, have you thought carefully</b> <b>about whether this is a discriminative model</b> <b>or a generative model?</b>

<b>mm</b> <b>I think this is very profound</b> <b>because the essence is</b> <b>you're doing denoising</b> <b>when you're doing denoising</b> <b>essentially you're doing discriminative prediction</b> <b>right</b> <b>but at the same time</b> <b>through multiple steps of denoising</b> <b>you're also doing generation</b> <b>right</b> <b>so the interesting question Kaiming raised was</b> <b>in the end, is this thing a discriminative model</b>

<b>or a generative model?</b>

<b>and what does this boundary mean?</b>

<b>mm</b> <b>I thought this was a very deep question</b> <b>because in the end</b> <b>the things that Diffusion models are capable of doing</b> <b>completely blurred this boundary</b> <b>right</b> <b>it can do generation, it can do discrimination</b> <b>it can do representation learning</b> <b>all kinds of things</b> <b>so I think this is a fairly profound question</b> <b>yes</b> <b>so at the time, based on this question</b> <b>we did a lot of exploration</b>
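Kaiming's question can be made concrete with a sketch of DDPM ancestral sampling. This is an illustrative toy under stated assumptions: `predict_noise` is a hypothetical stand-in for the trained network eps_theta(x_t, t), which a real DDPM learns with a U-Net or Transformer. Each call to the predictor is a discriminative regression on the noise, yet chaining T such calls is exactly what generates a sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained noise predictor eps_theta(x_t, t):
# each call is a *discriminative* prediction (regress the noise in x_t).
def predict_noise(x_t, t):
    return 0.1 * x_t  # dummy; a real DDPM uses a trained U-Net/Transformer

def ddpm_sample(shape, T=50):
    """Sketch of DDPM ancestral sampling: T discriminative denoising
    steps chained together become a generative process."""
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.normal(size=shape)          # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = predict_noise(x, t)       # one discriminative step
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                       # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)
    return x                            # many steps together = generation

sample = ddpm_sample((4,))
print(sample.shape)
```

The inner update is the standard DDPM posterior-mean step; the blurred boundary the conversation points at is visible right in the loop, where a regression subroutine, iterated, becomes a sampler.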

<b>including</b> <b>things like trying to use DDPM</b> <b>or diffusion models for classification</b> <b>and checking</b> <b>whether the representation it learns is good</b> <b>and how it compares to a self-supervised model</b> <b>mm</b> <b>this was a line of exploration we pursued</b> <b>it was interesting</b> <b>and there's a paper I'm not sure if it was published</b> <b>actually I know it was published</b> <b>but it wasn't published by us</b>

<b>someone else did it</b> <b>mm</b> <b>but anyway, we did a lot of this kind of exploration</b> <b>but let's first talk about the process</b> <b>when did this happen at FAIR?</b>

<b>this was around 2022 to 2023</b> <b>mm</b> <b>at that time</b> <b>diffusion models had started to take off</b> <b>mm</b> <b>not yet, not right away</b> <b>this is before ChatGPT, right?</b>

<b>mm</b> <b>this is before ChatGPT</b> <b>right, so this was around 2022</b> <b>before or after Stable Diffusion?</b>

<b>roughly the same time</b> <b>it was approximately the same time</b> <b>mm</b> <b>at that time Stable Diffusion was already getting attention</b> <b>right, that whole community</b> <b>was also very active</b> <b>right</b> <b>so at the time I was</b> <b>very curious about diffusion models</b> <b>mm</b> <b>and we started exploring</b> <b>is the exploration you're describing</b> <b>something you can do freely on your own</b> <b>without needing to report to anyone?</b>

<b>yes, this is the freedom of FAIR</b> <b>right, that's exactly the freedom I was talking about</b> <b>yes</b> <b>so at the time</b> <b>in terms of the direction of research</b> <b>within the team, nobody was doing diffusion models at all</b> <b>so I was the first to start exploring this</b> <b>and later brought in an intern</b> <b>who was Bill Peebles</b> <b>yes, who is now head of Sora</b> <b>we started together</b> <b>right</b> <b>but I was the first to start at FAIR</b>

<b>and then brought Bill in later</b> <b>mm</b> <b>so back then</b> <b>I was exploring all kinds of angles</b> <b>and then later we kind of settled on</b> <b>the most important one</b> <b>which was the DiT direction</b> <b>mm</b> <b>and by the way</b> <b>let me also mention this</b>

<b>DiT wasn't the original goal</b> <b>at the very beginning</b> <b>right</b> <b>the original goal was actually</b> <b>exploring the connection between</b> <b>discriminative and generative models</b> <b>mm</b> <b>yes, that was the original question</b> <b>mm</b>

<b>right, and during this exploration</b> <b>we kind of discovered</b> <b>that this direction of DiT was more interesting</b> <b>mm</b> <b>and we focused on that</b> <b>ok then let's not jump there yet</b> <b>let's continue talking about FAIR</b> <b>what was life like at FAIR?</b>

<b>what was the culture like?</b>

<b>what was special about FAIR?</b>

<b>mm</b> <b>I think the most special thing about FAIR is</b> <b>it's the most academic-like place</b> <b>inside industry</b> <b>that I've ever been to</b> <b>right, so a lot of the culture</b> <b>is actually quite similar to academia</b> <b>for example</b> <b>everyone has a very high degree of freedom</b> <b>you can basically choose</b> <b>what you want to work on</b> <b>mm</b> <b>and at the same time</b>

<b>you have a lot of resources</b> <b>the resources are beyond what you'd have in academia</b> <b>right</b> <b>so I think FAIR</b> <b>was a very ideal research environment</b> <b>for me at that stage</b> <b>mm</b> <b>but it also has some problems, right</b> <b>like you said</b> <b>later on</b> <b>there were some cultural shifts</b>

<b>right</b> <b>I think around 2022 or 2023</b> <b>after ChatGPT appeared</b> <b>FAIR was going through a lot of changes</b> <b>mm</b> <b>right</b> <b>you're using such a fancy-sounding term</b> <b>and you even have to say it in English</b> <b>which shows how hard these things are to define</b> <b>it really is a</b> <b>research aesthetic</b> <b>right</b> <b>I think</b> <b>it encompasses everything I've mentioned above</b>

<b>the specifics of how you do things</b> <b>I think all of that is included</b> <b>but it also involves some higher-level</b> <b>philosophical</b> <b>considerations</b> <b>like how Kaiming gave me the Diamond Sutra</b> <b>because the Diamond Sutra says</b> <b>all things are like dreams, illusions, bubbles and shadows</b>

<b>and one passage also says: all phenomena are illusions</b> <b>if you see all phenomena as not phenomena, you see the Tathagata</b> <b>mm</b> <b>taking this a bit further</b> <b>it's actually quite similar to certain ideas in Western philosophy</b> <b>quite similar actually</b> <b>for example, Kant's concept of the thing-in-itself</b> <b>and then</b>

<b>Schopenhauer's</b> <b>the world as will and representation</b> <b>right</b> <b>what they're all trying to express</b> <b>I don't know much about philosophy, I don't want to sound pretentious</b> <b>but in my humble understanding</b> <b>I think what they're all trying to discuss is</b> <b>what you see</b> <b>is not the essence of the thing</b> <b>what you see of the world is not its true substance</b> <b>so when you're reading a paper</b>

<b>what matters is</b> <b>to break through the illusion the paper presents to you</b> <b>and question</b> <b>what lies behind this paper</b> <b>what kind of</b> <b>substantive essence does it actually contain</b> <b>I think the source of researcher taste lies in</b> <b>whether people can</b> <b>truly set aside all these superficial appearances</b> <b>and then</b> <b>keep pursuing the path toward truth</b> <b>keep seeking</b> <b>mm</b>

<b>I think Kaiming does this best</b> <b>if you think about this from a long-term perspective</b> <b>the question is: what is the right way</b> <b>to guide how you choose a topic</b> <b>what kind of things to work on</b> <b>right</b> <b>this thing also connects to</b> <b>while you're doing research</b> <b>what exactly should each step involve</b> <b>I think everything is consistent</b> <b>mm</b> <b>and then</b> <b>I think</b> <b>one problem with not having good research taste is</b>

<b>people might get caught up in these appearances</b> <b>these appearances might be a paper's acceptance</b> <b>or the kind of fame you mentioned from the outside world</b> <b>or</b> <b>being able to get something done quickly</b> <b>and getting the kind of momentary praise</b> <b>and adulation</b> <b>right</b> <b>I think for Kaiming</b>

<b>this is completely outside of his world model</b> <b>he simply doesn't care</b> <b>I think</b> <b>right</b> <b>but if you ask me to list out research taste as points a,</b> <b>b, c,</b> <b>d...</b>

<b>that becomes pretty hard to articulate</b> <b>this thing</b> <b>because it involves so many things</b> <b>because research itself, as I said</b> <b>is also a creative process</b> <b>it's also a writing process</b> <b>from the writing side, by the way</b> <b>Kaiming is also the person with the strongest writing ability</b> <b>he also strongly encouraged us, saying</b> <b>make sure to start writing early</b> <b>this thing</b> <b>very unfortunately</b> <b>even now</b> <b>at my age</b>

<b>I still can't do it well</b> <b>like Kaiming</b> <b>all his papers</b> <b>were finished a month before the deadline</b> <b>at least that was the case at FAIR</b> <b>mm</b> <b>meaning</b> <b>while everyone else was pulling all-nighters to meet the deadline</b> <b>and then</b> <b>feeling this huge sense of satisfaction</b> <b>Kaiming, you know</b> <b>was like a carefree free spirit</b> <b>having finished everything a month ago</b>

<b>and then polishing it over and over again</b> <b>watching all of you rush to meet your deadlines</b> <b>I, in a very relaxed way</b> <b>have already made this thing perfect</b> <b>he finished everything a month in advance</b> <b>everything done</b> <b>meaning the paper was fully written</b> <b>ah</b> <b>not just the results obtained, but the paper fully written</b> <b>this is already a publishable</b> <b>solid piece of work</b> <b>so</b> <b>that means he had to start writing when</b>

<b>two months before the deadline</b> <b>and he only needed one month to write it</b> <b>no</b> <b>one month is a long time</b> <b>right</b> <b>of course he would keep writing afterward</b> <b>during that month before the deadline</b> <b>he would</b> <b>polish every table</b> <b>every</b> <b>single</b> <b>word</b> <b>every punctuation mark</b> <b>ah</b> <b>for example, this habit</b> <b>also influenced me</b> <b>for instance, I now have this OCD</b> <b>like this kind of</b> <b>how to put it</b> <b>obsession</b>

<b>that also came from my time with Kaiming</b> <b>which is that in your paper</b> <b>no line should be less than 60% filled with text</b> <b>filled -- what does that mean?</b>

<b>meaning if you have a line</b> <b>and more than half of it is empty</b> <b>it doesn't look good</b> <b>you need to fill that line</b> <b>or have it filled roughly</b> <b>sixty to seventy percent</b> <b>then your paper looks more elegant</b> <b>elegant, or uniform</b> <b>oh</b> <b>and now with every paper</b> <b>I always ask all the students</b> <b>right, look carefully</b> <b>if you have some trailing word</b>

<b>if people aren't paying attention</b> <b>you'll end up with a word</b> <b>sitting alone on a line somewhere</b> <b>it looks terrible</b> <b>understood</b> <b>mm</b> <b>and also</b> <b>when Kaiming thinks about this, his view is</b> <b>this paper is not for you to read</b> <b>this paper is for others to read</b> <b>so you need to care about how others experience it</b> <b>mm</b> <b>how can you -- a paper is just a vessel</b>

<b>how do I, through this vessel of knowledge</b> <b>let people relatively smoothly get to</b> <b>the core of what you want to express</b> <b>this communication interface needs to be pleasing to the eye</b> <b>that's a great way to put it, right</b> <b>the communication interface must be pleasing to the eye</b> <b>so you can't let your paper look too bad, right</b> <b>you have to get the details right</b> <b>so all of this</b>

<b>you can consider it a kind of research taste</b> <b>but I think this is</b> <b>actually something more general</b> <b>a kind of aesthetic toward life</b> <b>or toward everything in the universe</b> <b>mm</b> <b>I think these things are all connected</b> <b>right</b> <b>this is also why</b> <b>we care so much about our own papers</b> <b>being as unique as possible</b> <b>having our own distinctiveness</b>

<b>we can have our own webpage design</b> <b>we'll record our own videos</b> <b>record videos</b> <b>but there are many</b> <b>people who wonder why you bother with all this</b> <b>this stuff</b> <b>has nothing to do with research</b> <b>isn't this just a distraction?</b>

<b>why spend extra energy</b> <b>polishing all this</b> <b>are you just doing this for hype and marketing?</b>

<b>ah, I hope people don't think that</b> <b>because I think</b> <b>having your own style</b> <b>is actually very important</b> <b>mm</b> <b>and then</b> <b>this is also why</b> <b>all of our papers use a consistent template</b> <b>we have our own designs</b> <b>and indirectly</b> <b>I also hope to pass on some of my taste</b> <b>though again, I can't guarantee it's all good</b> <b>but at least I can</b> <b>discuss it with my students</b>

<b>we can work on this together</b> <b>at least together we can conceptualize</b> <b>think it through together</b> <b>right, I think this too, in my view</b> <b>in this broader sense</b> <b>is part of research taste</b> <b>mm, it contains many very concrete small details</b> <b>an enormous number of details</b> <b>right</b> <b>but I think</b> <b>this is also what makes research interesting</b> <b>I told you yesterday</b>

<b>my childhood dream was actually to become a film director</b> <b>right</b> <b>mm</b> <b>childhood dream</b> <b>no no</b> <b>when did that dream fade?</b>

<b>it faded pretty quickly</b> <b>unfortunately</b> <b>but I still watch a lot of films</b> <b>but I think, eventually, I came to realize</b> <b>the research process and filmmaking process</b> <b>are actually not that different</b> <b>why?</b>

<b>because a film also needs to discover a theme</b> <b>it also involves exploration</b> <b>I have a story I want to tell</b> <b>and it shouldn't be that I just stand at this moment</b> <b>and think oh</b> <b>this is how my story goes</b> <b>and then I just go straight toward the finish</b> <b>it shouldn't work that way either</b> <b>you should also go make the film</b> <b>I think you'd have great intuition</b> <b>right</b> <b>yes, exactly right</b>

<b>the worst films are the ones that just go through the motions</b> <b>I start with A</b> <b>no conflict along the way</b> <b>and arrive at B</b> <b>and then it's over</b> <b>I just</b> <b>play it for you</b> <b>a good film actually is</b> <b>or, why do we say when writing a paper</b> <b>people say</b> <b>they told the story really well</b> <b>it might even have a bit of a narrative</b> <b>storytelling quality</b> <b>mm</b> <b>film is a storytelling process</b> <b>there's a book</b>

<b>I actually recommended it to students before</b> <b>I learned from Kaiming</b> <b>I share with people</b> <b>some unexpected books</b> <b>let me recommend a book</b> <b>it's called Story, by Robert McKee</b> <b>mm</b> <b>this book is a book about screenwriting</b> <b>mm</b> <b>but I think this book</b> <b>actually speaks to a lot of things about research</b> <b>and life</b> <b>there's one thing this book talks about</b> <b>that I think is particularly interesting</b> <b>it talks about</b>

<b>what makes a good story</b> <b>it's not</b> <b>a story that has no conflict from beginning to end</b> <b>a good story must be driven by conflict</b> <b>and through conflict to discover</b> <b>the true character's core</b> <b>mm</b> <b>and in research</b>

<b>it's the same thing</b> <b>a good research paper</b> <b>must also set up the conflict</b> <b>and then through conflict</b> <b>you discover the core of this problem</b> <b>and the solution to this problem</b> <b>right</b> <b>so I think this book</b> <b>has a lot of profound insights</b> <b>including about life</b> <b>mm</b> <b>and I think the concept of conflict in the book</b> <b>is actually similar to what I was just talking about</b>

<b>that gradient</b> <b>mm</b> <b>you need enough contrast</b> <b>to let you see the difference</b> <b>right</b> <b>for example</b> <b>if in your experiment</b> <b>you don't have a good enough control group</b> <b>or experimental group</b> <b>your signal will be weak</b> <b>and you won't know the answer</b>

<b>right</b> <b>so having this kind of conflict</b> <b>this gradient</b> <b>is extremely important for research</b> <b>mm</b> <b>I think this is really interesting, thank you</b> <b>so let me ask about another topic</b> <b>which is about your transition from FAIR to NYU</b> <b>right</b> <b>you transitioned from FAIR to NYU around 2023</b> <b>right, to become a professor</b> <b>right</b> <b>can you talk about how this transition happened?</b>

<b>right, so actually</b> <b>I spent a total of five years at FAIR</b> <b>mm</b> <b>and for me this experience at FAIR</b> <b>I think it was the most formative five years</b> <b>of my career</b> <b>so I think I'm extremely grateful</b> <b>and this experience has really shaped</b> <b>who I am today</b> <b>mm</b> <b>but at the same time</b> <b>I always had this desire</b> <b>to someday</b>

<b>run my own lab</b> <b>and take on students</b> <b>because I think this experience</b> <b>the experience of someone guiding you</b> <b>is something I'm very thankful for</b> <b>and I want to pass on</b> <b>what I learned</b> <b>right</b> <b>so after five years at FAIR</b> <b>I decided to make a move</b> <b>and go into academia</b> <b>mm</b> <b>and so I joined NYU</b> <b>mm</b> <b>which by the way, NYU is a very interesting place</b>

<b>why?</b>

<b>because NYU is somewhat unique</b> <b>it's located in New York City</b> <b>in Manhattan</b> <b>mm</b> <b>right, so it's surrounded by a lot of industry</b> <b>which gives you a lot of collaboration opportunities</b> <b>mm</b> <b>and given NYU's location</b> <b>there is a relatively strong AI community here in New York</b> <b>right</b>

<b>for example, NYU has Yann LeCun</b> <b>mm</b> <b>who is of course a figure you don't need to introduce</b> <b>mm</b> <b>and NYU also has</b> <b>Kyunghyun Cho</b> <b>who is also a very well-known researcher</b> <b>mm</b> <b>and then there's also this whole community in New York</b> <b>like, for example</b> <b>Google has a large office here in New York</b> <b>Microsoft also has offices here</b>

<b>Morgan Stanley, Goldman Sachs</b> <b>lots of different types of companies</b> <b>mm</b> <b>so I think this is</b> <b>a very unique place</b> <b>where you can combine</b> <b>industry and academia</b> <b>mm</b> <b>right, so actually now when we're talking about</b> <b>is Dumbo a community in New York?</b>

<b>Dumbo is a very interesting place</b> <b>in Brooklyn</b> <b>mm</b> <b>and Dumbo has become one of</b> <b>the more important areas of New York's AI community</b> <b>mm</b> <b>there are a lot of AI startups</b> <b>here in Dumbo</b> <b>for example, some of the more well-known ones</b> <b>like Hugging Face's office is here</b>

<b>mm</b> <b>and then Runway's office is also here</b> <b>mm</b> <b>and then there are many other startups</b> <b>so New York is actually quite vibrant</b> <b>and the reason I chose NYU</b> <b>is partly because of this</b> <b>and also partly because of the people there</b> <b>mm</b> <b>so that's how I ended up at NYU</b> <b>mm</b> <b>right, so then</b> <b>it turns out that the professor role</b> <b>after you actually start doing it</b> <b>is somewhat different from what you imagined</b> <b>right?</b>

<b>mm, I think many aspects are different</b> <b>for example, a professor</b> <b>has to deal with a lot of administrative work</b> <b>right</b> <b>things like grant applications</b> <b>various committee work</b> <b>right</b> <b>also things</b> <b>completely unrelated to research</b> <b>right</b> <b>I was quite well protected at FAIR</b> <b>from a lot of this</b>

<b>right</b> <b>but at a university</b> <b>you have to deal with all of it yourself</b> <b>mm</b> <b>so I think this is a very different experience</b> <b>and also</b> <b>advising students</b> <b>is very different from doing research yourself</b> <b>mm</b> <b>because advising students requires</b> <b>not just doing the research</b> <b>but also helping students</b> <b>grow as researchers</b>

<b>right</b> <b>and this is a very different skill set</b> <b>mm</b> <b>so I think</b> <b>transitioning into the professor role</b> <b>was actually a big challenge</b> <b>mm</b> <b>but at the same time, it's very rewarding</b> <b>because you can see your students</b> <b>grow</b> <b>right</b> <b>and I think this is</b> <b>one of the most rewarding things</b> <b>about being a professor</b>

<b>mm</b> <b>I think that's a beautiful thing to say</b> <b>so let me ask</b> <b>about the startup you founded</b> <b>right</b> <b>I heard that you are now a professor at NYU</b> <b>and also a co-founder of a startup</b> <b>right</b> <b>what's the story behind that?</b>

<b>right, so the startup</b> <b>started a bit over a year ago</b> <b>right</b> <b>and the company is called Emu Video</b> <b>no, wait, that's a product</b> <b>[laughter]</b> <b>it's called Oasis</b> <b>mm</b> <b>so what does Oasis do?</b>

<b>right, so Oasis is focused on</b> <b>AI-generated video</b> <b>mm</b> <b>and specifically</b> <b>a game that is generated by AI in real time</b> <b>mm</b> <b>so the original idea</b> <b>was inspired by</b> <b>the DiT work</b> <b>and also by Sora</b> <b>mm</b> <b>and we thought</b> <b>this technology</b> <b>can be applied to games</b>

<b>mm</b> <b>right, because games are actually</b> <b>an extremely good use case for this kind of technology</b> <b>mm</b> <b>because games</b> <b>require very fast frame generation</b> <b>right</b> <b>and at the same time</b> <b>games require a lot of interactivity</b> <b>right</b> <b>so these two things together</b>

<b>make games a very interesting application</b> <b>mm</b> <b>this thing</b> <b>can be applied to many many different papers</b> <b>no matter what your topic is</b> <b>right, so I think this is also very interesting</b> <b>mm</b> <b>and then later</b> <b>we could maybe talk about</b> <b>DiT right</b> <b>but this paper also</b> <b>this paper</b> <b>was again one of those</b> <b>that brings us to NYU</b>

<b>no no</b> <b>this one was also FAIR</b> <b>it was the last piece of work at FAIR</b> <b>oh</b> <b>and then at that time FAIR was already starting to have some</b> <b>culture shift</b> <b>because at that point ChatGPT had just come out</b> <b>OpenAI and then DeepMind were also doing very well</b> <b>OpenAI as an emerging</b> <b>research force</b> <b>mm, and then</b> <b>had actually done a lot of things</b>

<b>that nobody dared to even dream of</b> <b>uh</b> <b>and even if they dreamed it they couldn't do it</b> <b>right, so everyone started thinking</b> <b>what went wrong with this organizational model</b> <b>does there need to be a major overhaul</b> <b>there had already been many</b> <b>reorganizations</b> <b>this was also a trigger</b> <b>for why</b> <b>I felt by then it was no longer a good idea</b> <b>for me to keep staying at FAIR</b> <b>things were already starting to decline</b> <b>not exactly decline</b> <b>just that</b>

<b>everyone's focus was no longer on research</b> <b>people would</b> <b>have these meetings that lasted several hours</b> <b>research alignment meetings</b> <b>coordination meetings</b> <b>alignment meetings</b> <b>alignment meetings</b> <b>and the only topic of these meetings was</b> <b>what exactly should we be doing</b> <b>but these meetings</b>

<b>went on for</b> <b>several weeks</b> <b>and still no conclusion</b> <b>because nobody knew what they wanted to do</b> <b>because this is completely counter to what I just described</b> <b>the normal</b> <b>bottom-up logic of research</b> <b>mm right</b> <b>now it had become</b> <b>let's all sit together</b> <b>and discuss what research project</b> <b>we should do over the next one or two years</b> <b>in my view</b> <b>or in Kaiming's view</b> <b>or in the minds of many researchers</b>

<b>this looks completely anti-research</b> <b>right</b> <b>so at that time it had a lot of effect on us</b> <b>for example, at the time I</b> <b>was working on DiT</b> <b>Diffusion was also just getting started</b> <b>nobody yet</b> <b>not a single person at FAIR</b> <b>was doing Diffusion Model research</b> <b>but I thought, hey</b> <b>this thing seems really interesting</b> <b>I think I should give it a try</b> <b>and then Bill Peebles</b>

<b>was an intern I recruited at the time</b> <b>mm</b> <b>and he's now head of Sora</b> <b>and also the main character in Sora's various generated videos</b> <b>mm right</b> <b>he's an extremely sharp person</b> <b>or,</b> <b>in my view</b> <b>what I'd call a perfect PhD student</b> <b>in all directions, uh</b> <b>at least a well-rounded, all-around student</b> <b>right, but anyway</b> <b>our starting point back then</b> <b>was not to do Diffusion Model research</b> <b>nor to do DiT</b>

<b>in the first two months of exploration</b> <b>it was entirely focused on representation learning</b> <b>that is, we wanted to look at</b> <b>the representation a Diffusion Model learns</b> <b>how it compares to what a normal Supervised Learning</b> <b>or rather</b> <b>a Self-supervised Learning model learns</b> <b>what the differences are</b> <b>actually</b> <b>there was a lot of follow-up work in this direction</b> <b>but what we started doing</b> <b>after working on it for a while, the feeling was</b>

<b>this thing is okay</b> <b>just so-so</b> <b>a generative model can learn a decent representation</b> <b>but this representation</b> <b>was much, much worse</b> <b>than the representation from self-supervised learning</b> <b>mm</b> <b>completely not competitive, right</b> <b>so we gave up on that</b> <b>but in the process</b> <b>in the final month</b> <b>we discovered</b> <b>hey</b> <b>by the way, this thing</b> <b>the premise being</b> <b>because DiT</b>

<b>we needed to compare at the representation level</b> <b>against, say, ViT-based systems</b> <b>to make a comparison</b> <b>so that's why we didn't use a U-Net</b> <b>but instead used a ViT for this Diffusion Model</b> <b>that was the starting point, right</b> <b>and then we found out, hey</b> <b>from the representation angle</b> <b>this doesn't seem to add much value</b> <b>but it seems like our new architecture</b> <b>is indeed more efficient</b>

<b>and indeed more scalable</b> <b>more stable than U-Net</b> <b>and from a code perspective</b> <b>I care a lot about these things</b> <b>what I call Minimum Description Length (MDL)</b> <b>your code is actually quite important</b> <b>it can reflect some things</b> <b>if your code is short</b> <b>and can achieve the same purpose</b> <b>then your method will typically be better than one that</b>

<b>requires thousands of lines of code</b> <b>an extremely complex system</b> <b>even if it can do the same thing</b> <b>but the former</b> <b>this more elegant solution</b> <b>the simpler solution is always better</b> <b>I think this is also a kind of research taste in a sense</b> <b>so we found, hey</b> <b>this thing is both simple and it works</b> <b>and scalable</b> <b>and efficient</b> <b>so it seems like this thing</b> <b>is the direction we should be pursuing</b>
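The architectural move described here, replacing the U-Net with a ViT-style backbone, starts from one simple step: treat the noisy image latent as a sequence of patch tokens, the way ViT treats an image. A minimal illustrative sketch in plain Python (a hypothetical `patchify` helper for intuition only, not the actual DiT code):

```python
# Illustrative sketch: the key move in a DiT-style model is to turn an
# H x W x C latent into a flat sequence of patch tokens (as in ViT),
# rather than processing it with a convolutional U-Net.
def patchify(latent, patch):
    """Split an H x W x C latent (nested lists) into flattened patch tokens."""
    H, W = len(latent), len(latent[0])
    tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            tok = []
            for di in range(patch):
                for dj in range(patch):
                    tok.extend(latent[i + di][j + dj])  # append channel values
            tokens.append(tok)
    return tokens

# A 4x4 latent with 3 channels and patch size 2 yields 4 tokens of length 12.
latent = [[[0.0, 0.0, 0.0] for _ in range(4)] for _ in range(4)]
tokens = patchify(latent, 2)
print(len(tokens), len(tokens[0]))  # → 4 12
```

In the actual DiT design, these tokens are then fed through standard transformer blocks conditioned on the diffusion timestep, which is why the architecture can share so much infrastructure with other transformer-based models.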

<b>so, in the final month</b> <b>we went to work on this</b> <b>mm</b> <b>and at that point we were competing for a lot of resources</b> <b>people said</b> <b>why are you working on this?</b>

<b>we need to consolidate resources now</b> <b>and we need to do something more meaningful</b> <b>a bigger project</b> <b>what, for example?</b> <b>nobody knew</b> <b>so we needed these alignment</b> <b>meetings to discuss it</b> <b>but</b> <b>at least Diffusion Models</b> <b>wouldn't be an important part of this critical path</b> <b>not a key member on that critical path</b> <b>right</b> <b>so there was a lot of opposition</b>

<b>but I felt I could see</b> <b>that this is actually something very important</b> <b>because I think this, from an architecture standpoint</b> <b>I've been doing architecture work for so long</b> <b>I think this is the future of Diffusion architectures</b> <b>right, it's not just the Diffusion Model</b> <b>as I said, the data, the overall architecture</b> <b>and the objective</b> <b>are all very important</b> <b>right, but on the architecture side</b> <b>this is an indispensable piece</b> <b>so this is why</b>

<b>in the last month we pushed in this direction</b> <b>and the results were very good in the end</b> <b>and we were able to show</b> <b>this really great</b> <b>scaling behavior</b> <b>and we submitted the paper to CVPR</b> <b>and we were all very happy</b> <b>and then the paper got rejected</b> <b>mm</b> <b>right, LeCun apparently tweeted about this</b> <b>yes</b> <b>saying not enough novelty</b> <b>you might have done this thing</b> <b>uh right</b> <b>you don't have long stretches of math</b>

<b>you don't have a long complex structure</b> <b>you came up with a very simple structure</b> <b>and even though you got good results</b> <b>the reviewers weren't convinced</b> <b>mm right</b> <b>this is another lesson</b> <b>but by that point</b> <b>I had actually started to come around</b> <b>I realized</b> <b>this whole thing about research papers</b> <b>in this huge random process</b> <b>whether you get accepted or not</b> <b>doesn't matter at all</b>

<b>so we then submitted to another conference</b> <b>didn't change a thing</b> <b>and it got accepted as an Oral Paper</b> <b>mm, which proves once again</b> <b>this is a completely random process</b> <b>but what happened afterward was more interesting</b> <b>after getting this paper</b> <b>I realized</b> <b>in every dimension</b> <b>this was better than a U-Net based system</b> <b>why not just use this</b> <b>right, you've unified the underlying logic</b> <b>at least on the architecture side, unified the logic</b>

<b>you can share a lot of infrastructure</b> <b>it's so efficient</b> <b>results are good and scalable</b> <b>you can build even larger models</b> <b>so we thought</b> <b>this thing</b> <b>once this paper is out, there will definitely be a lot of attention</b> <b>which, by the way</b> <b>there was indeed a lot of attention</b> <b>lots of people discussing it on Twitter</b> <b>but we found, hey</b> <b>nobody was actually using it for anything</b> <b>oh</b> <b>and then we started talking to people</b>

<b>like we reached out to the Stable Diffusion folks</b> <b>by the way, I think Stable Diffusion</b> <b>LDM is also one of</b> <b>what I'd call those twenty-something foundational papers</b> <b>one of them</b> <b>but I also talked to some people there</b> <b>and then</b> <b>we also talked to some other big companies</b> <b>so we were kind of at school</b> <b>at that time I was -- this paper had just</b> <b>landed right at the end of my time at FAIR</b>

<b>and the beginning of my time at NYU</b> <b>oh, so both affiliations were listed?</b>

<b>well</b> <b>right, right -- actually, no</b> <b>actually only NYU was listed</b> <b>and Berkeley</b> <b>because FAIR didn't let us list their name</b> <b>why?</b>

<b>because first, they felt this paper, it's OK</b> <b>it's a paper. second, you had already left</b> <b>so don't list our name</b> <b>mm, so then after this paper</b> <b>a lot of people started using DiT</b> <b>right</b> <b>and then we found that Sora used DiT as the backbone</b> <b>right</b> <b>which was a huge affirmation</b> <b>mm</b> <b>because at the time the Sora paper</b> <b>mentioned DiT by name</b>

<b>yes</b> <b>right, so this was something we were very proud of</b> <b>mm</b> <b>and then, later</b> <b>a lot of other models</b> <b>also started using DiT</b> <b>mm</b> <b>yes, basically all the main video generation models now</b> <b>use DiT as the backbone</b> <b>mm</b> <b>so I think this was a very important paper</b> <b>mm</b> <b>right, so then</b> <b>let's talk about the startup</b>

<b>right</b> <b>so why start a company?</b>

<b>right</b> <b>I think for me</b> <b>the main motivation was</b> <b>I wanted to see</b> <b>whether this technology</b> <b>that I had been working on for so many years</b> <b>could have real impact</b> <b>mm</b> <b>because in academia</b> <b>you write papers</b> <b>and other people read your papers</b> <b>and they may use your ideas</b> <b>but you never really get to see</b> <b>the end-to-end impact</b> <b>mm</b>

<b>right, so I wanted to</b> <b>take this technology all the way</b> <b>to building a product</b> <b>mm</b> <b>and also</b> <b>I think</b> <b>that games are a very interesting application</b> <b>mm</b> <b>because games are one of the few places</b> <b>where both high visual quality</b> <b>and very low latency</b> <b>are required at the same time</b> <b>mm</b> <b>and this is actually a very hard technical problem</b>

<b>right</b> <b>so we thought</b> <b>if we can solve this problem</b> <b>for games</b> <b>then the technology will be applicable</b> <b>to a much wider range of use cases</b> <b>mm</b> <b>right, and also</b> <b>games are a massive market</b> <b>right</b> <b>so there's a lot of commercial potential as well</b> <b>mm</b> <b>right, so that's kind of the story</b>

<b>behind starting the company</b> <b>mm</b> <b>so what has the journey been like</b> <b>since you started the company?</b>

<b>mm</b> <b>I think</b> <b>building a company is very different from doing research</b> <b>mm</b> <b>for many reasons</b> <b>right</b> <b>one is that in a company</b> <b>you have to think about</b> <b>the product</b> <b>and users</b> <b>mm</b> <b>which is not something you think about in research</b> <b>right</b> <b>another is that</b> <b>in a company you have to think about</b> <b>the business model</b> <b>and how to sustain the business</b>

<b>mm</b> <b>right, which is also not something</b> <b>you think about in research</b> <b>right</b> <b>and a third is that</b> <b>building a team is very different</b> <b>from advising students</b> <b>mm</b> <b>because in a company</b> <b>you're hiring professionals</b> <b>who have different skills and backgrounds</b> <b>mm</b> <b>and you have to think about</b> <b>how to align everyone</b> <b>toward a common goal</b> <b>mm</b>

<b>which is quite different from</b> <b>advising PhD students</b> <b>mm</b> <b>right</b> <b>so I think building a company</b> <b>has been a huge learning experience</b> <b>mm</b> <b>and I've learned a lot from it</b> <b>mm</b> <b>right, and the product you mentioned</b> <b>Oasis</b> <b>has gotten quite a lot of attention</b> <b>right?</b>

<b>yes, I think Oasis got quite a lot of attention</b> <b>mm</b> <b>when it was first released</b> <b>mm</b> <b>and the demo got a lot of</b> <b>views and discussion</b> <b>mm</b> <b>right</b> <b>and what's the current status of the company?</b>

<b>right</b> <b>we're still pretty early</b> <b>mm</b> <b>we're building out the technology</b> <b>and the product</b> <b>mm</b> <b>and we're also thinking about</b> <b>the go-to-market strategy</b> <b>mm</b> <b>right, I think</b> <b>the vision is very clear</b> <b>mm</b> <b>but the execution is always</b> <b>the hard part</b> <b>mm</b>

<b>right, so we're still working on it</b> <b>mm</b> <b>I think that's very relatable</b> <b>so</b> <b>let me ask</b> <b>about your thoughts on</b> <b>the current AI landscape</b> <b>mm</b> <b>what do you think</b> <b>are the most important</b> <b>open problems right now?</b>

<b>mm</b> <b>I think there are many</b> <b>mm</b> <b>but one thing that I think is particularly interesting</b> <b>is the question of</b> <b>how do you build AI systems</b> <b>that can reason</b> <b>and plan</b> <b>mm</b> <b>right, because current systems</b> <b>like LLMs</b> <b>are very good at pattern matching</b>

<b>mm</b> <b>but they struggle with</b> <b>systematic reasoning</b> <b>mm</b> <b>right, so I think this is a very important</b> <b>open problem</b> <b>mm</b> <b>and another one is</b> <b>how do you make AI systems</b> <b>more efficient</b> <b>mm</b> <b>right, because current systems are</b> <b>very computationally expensive</b> <b>mm</b> <b>and this limits their deployment</b> <b>mm</b>

<b>right</b> <b>so I think efficiency is a very important problem</b> <b>mm</b> <b>and then there's also</b> <b>the question of alignment</b> <b>mm</b> <b>right, how do you make sure</b> <b>that these systems</b> <b>do what you want them to do</b> <b>mm</b> <b>right, so these are all very important open problems</b> <b>mm</b> <b>right</b> <b>and where do you see things going</b> <b>in the next five years?</b>

<b>mm</b> <b>I think</b> <b>the next five years will be</b> <b>very exciting</b> <b>mm</b> <b>I think we'll see</b> <b>a lot of progress</b> <b>on the reasoning side</b> <b>mm</b> <b>and I think we'll also see</b> <b>AI systems being deployed</b>

<b>in many more real-world applications</b> <b>mm</b> <b>right, because the technology is</b> <b>getting good enough</b> <b>mm</b> <b>and the cost is coming down</b> <b>mm</b> <b>so I think we'll see</b> <b>a lot more real-world impact</b> <b>mm</b> <b>right</b> <b>and what about</b> <b>on the video generation side specifically?</b>

<b>mm</b> <b>I think video generation will</b> <b>continue to improve very rapidly</b> <b>mm</b> <b>and I think</b> <b>the quality will get</b> <b>to the point where</b> <b>it's indistinguishable from real video</b> <b>mm</b> <b>in the next year or two</b> <b>mm</b> <b>right</b> <b>what it means is</b> <b>a possible random event like this</b> <b>a kind of black swan event</b> <b>or some kind of shock</b>

<b>a kind of, uh</b> <b>this kind of</b> <b>this kind of event that takes you by surprise</b> <b>if for this organization</b> <b>or for this person</b> <b>or for this matter</b> <b>your gains outweigh your losses</b> <b>then your organization</b> <b>is what's called antifragile</b> <b>mm</b> <b>so this concept I think is very interesting</b> <b>right</b> <b>because normally when we think about</b> <b>risk management</b>

<b>we think about</b> <b>how to avoid risk</b> <b>right</b> <b>but the antifragile concept says</b> <b>no, you should actually seek out certain kinds of risk</b> <b>or rather, certain kinds of volatility</b> <b>mm</b> <b>because these</b> <b>can make you stronger</b> <b>mm</b> <b>right</b> <b>and I think this applies very well</b> <b>to research</b> <b>mm</b> <b>because in research</b>

<b>you're constantly facing uncertainty</b> <b>mm</b> <b>and you need to be antifragile</b> <b>right</b> <b>meaning that when things don't work out</b> <b>you should actually learn from that</b> <b>and become stronger</b> <b>mm</b> <b>right, and I think this is</b> <b>a very important mindset</b> <b>mm</b> <b>and I think Kaiming embodies this very well</b> <b>mm</b> <b>because when things don't work out</b> <b>he doesn't get discouraged</b>

<b>mm</b> <b>he just tries something different</b> <b>mm</b> <b>right</b> <b>and I think this is</b> <b>a very important trait</b> <b>for a researcher</b> <b>mm</b> <b>right</b> <b>so is there anything else</b>

<b>you want to share</b> <b>before we wrap up?</b>

<b>mm</b> <b>I think</b> <b>one thing I'd like to say is</b> <b>to young people who want to do research</b> <b>or start a company</b> <b>mm</b> <b>I think</b> <b>the most important thing is</b> <b>to find something you're genuinely passionate about</b>

<b>mm</b> <b>because research and startups are both</b> <b>very long journeys</b> <b>mm</b> <b>and there will be a lot of hardship along the way</b> <b>mm</b> <b>and if you don't have genuine passion</b> <b>it's very hard to keep going</b> <b>mm</b> <b>right</b> <b>and also</b>

<b>I think</b> <b>finding good mentors</b> <b>and good collaborators</b> <b>is extremely important</b> <b>mm</b> <b>because, as I've been saying throughout</b> <b>a lot of what I've learned</b> <b>came from the people around me</b> <b>mm</b> <b>and so</b> <b>surrounding yourself with</b> <b>great people</b> <b>is one of the most important things you can do</b> <b>mm</b> <b>right</b>

<b>that's really great advice</b> <b>thank you so much</b> <b>this has been a wonderful conversation</b> <b>thank you</b> <b>yeah, thank you too</b> <b>alright</b> <b>so let's talk about</b> <b>your view on the AI landscape right now</b> <b>mm</b> <b>especially in New York</b> <b>right</b> <b>what are some of the interesting things</b>

<b>happening here?</b>

<b>mm</b> <b>I think New York</b> <b>is becoming a more and more important</b> <b>AI hub</b> <b>mm</b> <b>right, there's a lot of talent here</b> <b>mm</b>

<b>and a lot of interesting companies</b> <b>mm</b> <b>and I think</b> <b>New York has a unique advantage</b> <b>in that it's a very diverse city</b> <b>mm</b> <b>and this diversity</b> <b>can lead to</b> <b>very interesting collaborations</b> <b>mm</b> <b>between AI and</b> <b>other industries</b> <b>mm</b> <b>like finance</b>

<b>media</b> <b>fashion</b> <b>healthcare</b> <b>mm</b> <b>all of these are</b> <b>very well represented in New York</b> <b>mm</b> <b>so I think</b> <b>New York is going to play</b> <b>an increasingly important role</b> <b>in the AI landscape</b> <b>mm</b>

<b>right</b> <b>and what about</b> <b>comparing New York to</b> <b>Silicon Valley?</b>

<b>mm</b> <b>I think</b> <b>Silicon Valley is still</b> <b>the center of the AI world</b> <b>mm</b> <b>right</b> <b>but New York is</b> <b>growing fast</b> <b>mm</b> <b>and I think</b>

<b>New York has a different kind of energy</b> <b>mm</b> <b>right, it's more</b> <b>multi-disciplinary</b> <b>mm</b> <b>and I think that's</b> <b>actually very good for AI</b> <b>mm</b> <b>because AI is ultimately</b> <b>going to touch every industry</b> <b>mm</b> <b>so having this cross-disciplinary</b> <b>environment</b>

<b>is very valuable</b> <b>mm</b> <b>right</b> <b>that's really interesting</b> <b>so</b> <b>let me ask one more question</b> <b>which is</b> <b>if you were advising</b> <b>a young researcher</b> <b>who wanted to make an impact</b> <b>in AI</b> <b>mm</b> <b>what would you tell them?</b>

<b>mm</b> <b>I think</b> <b>first and foremost</b> <b>work on problems</b> <b>that you genuinely care about</b> <b>mm</b> <b>right, because your passion</b> <b>will drive you</b> <b>through the hard times</b> <b>mm</b>

<b>and second</b> <b>be willing to</b> <b>work hard on the fundamentals</b> <b>mm</b> <b>right, don't skip the basics</b> <b>mm</b> <b>because the fundamentals</b> <b>are what give you the tools</b> <b>to solve hard problems</b> <b>mm</b> <b>and third</b> <b>find good mentors</b> <b>and collaborate with great people</b> <b>mm</b> <b>right, as I said</b>

<b>a lot of what I've learned</b> <b>came from the people around me</b> <b>mm</b> <b>and so</b> <b>the people you surround yourself with</b> <b>will have a huge impact</b> <b>on your own growth</b> <b>mm</b> <b>right</b> <b>thank you so much</b> <b>this has been really insightful</b> <b>mm</b>

<b>I think</b> <b>we've covered a lot of ground today</b> <b>mm</b> <b>right</b> <b>from your early research</b> <b>all the way to</b> <b>starting a company</b> <b>mm</b> <b>and your thoughts on</b> <b>the AI landscape</b> <b>mm</b> <b>so thank you so much</b> <b>for being here today</b>

<b>thank you</b> <b>it was great talking to you</b> <b>yeah likewise</b> <b>alright</b> <b>so that wraps up</b> <b>our conversation today</b> <b>mm</b> <b>I hope you all found it</b> <b>as interesting as I did</b> <b>mm</b> <b>right</b> <b>and please</b> <b>subscribe to the channel</b>

<b>and leave a comment</b> <b>if you have any thoughts</b> <b>mm</b> <b>right</b> <b>see you next time</b> <b>bye</b> <b>in a really difficult position</b> <b>right</b> <b>why</b> <b>mainly because, first</b> <b>not enough resources</b> <b>let me give a simple example</b> <b>for instance, when we apply for funding</b> <b>the U.S. funding system</b>

<b>I might be going off on a tangent here</b> <b>but the U.S. funding system</b>

<b>over the past few decades</b> <b>has barely grown at all</b> <b>even with high inflation, right</b> <b>everything has become more expensive</b> <b>tuition fees have also gone up a lot</b> <b>but government grants</b> <b>as well as the kind of proposal programs</b> <b>that companies offer</b> <b>the funded projects</b> <b>are still maintained at a very low level</b> <b>so on average</b>

<b>a body like NSF</b> <b>a U.S. government agency</b>

<b>can give each individual PI</b> <b>a total of</b> <b>about $500,000 in funding</b> <b>over five years</b> <b>so about $100,000 a year</b> <b>right, and then a lot of companies</b> <b>have actually cut back a lot</b> <b>again because of ChatGPT</b> <b>because the era of LLMs has arrived</b> <b>and everyone has gradually started to pull back</b> <b>we can talk more about this later</b>

<b>but in any case, there are fewer and fewer</b> <b>opportunities from industry</b> <b>for this kind of sponsorship</b> <b>and once in a while</b> <b>if there's some kind of funding opportunity</b> <b>they'll typically give you</b> <b>maybe $100,000 to $150,000</b> <b>that's just a one-time thing</b> <b>a one-time lump sum of that much as a grant</b> <b>but you know</b> <b>there are probably about 100 schools</b> <b>100 professors at the same time</b> <b>or even more, competing for that $100,000</b> <b>what can you do with $100,000?</b>

<b>you can fund one student for one year</b> <b>as tuition</b> <b>what else?</b>

<b>you can buy half of a small H100 cluster</b> <b>mm</b> <b>maybe 3 to 4 GPUs</b> <b>so you really can't get much done with that</b> <b>and of course, this isn't just</b> <b>me venting</b> <b>all of us</b> <b>so-called</b> <b>junior faculty in the U.S.</b>

<b>are living in quite difficult conditions</b> <b>everyone has to find their own way</b> <b>to get different resources</b> <b>so this is also why</b> <b>it's a bit like a startup</b> <b>you're in a very constrained resource situation</b> <b>resource-wise</b> <b>and you have to find resources from different places</b> <b>you have to fundraise, right?</b>

<b>Xiaojun</b> <b>this is a business interview show</b> <b>I said I'm not commercial at all</b> <b>but actually in some ways</b> <b>there might still be some similarities</b> <b>and then there are people at Google</b> <b>I had a collaborator at Google</b> <b>and he's quite unusual</b> <b>he never goes into the office</b> <b>and he said, hey</b> <b>we could have a chat</b> <b>and I said, sure</b> <b>let me come chat</b> <b>I flew to the Bay Area to see him</b>

<b>and he said we could talk</b> <b>but not in an office</b> <b>let's go on a trail</b> <b>hiking on the trail next to Google's campus</b> <b>mm, go hiking</b> <b>mm, talk while hiking</b> <b>mm, so in the middle of summer</b> <b>I hiked with him for an hour</b> <b>and I told him about</b> <b>the infrastructure work we'd been doing on TPUs</b> <b>these contributions</b> <b>and also why building this</b>

<b>longer-term collaborative</b> <b>partnership</b> <b>this kind of relationship</b> <b>would be good for Google</b> <b>and good for us</b> <b>right, so I thought</b> <b>hey, isn't this just like a fundraising process?</b>

<b>so in the end</b> <b>it became a kind of alms-seeking</b> <b>a process of seeking alms</b> <b>right right right</b> <b>indeed because</b> <b>this kind of sponsorship actually asks for nothing in return</b> <b>right, so I'm very grateful to Google</b> <b>but anyway</b> <b>I think who I should be even more grateful to is</b> <b>my students</b> <b>and they, bit by bit</b> <b>overcame many, many obstacles</b> <b>I have several students</b> <b>like</b> <b>Peter Tong</b> <b>Boyang Zheng</b>

<b>Shusheng Yang</b> <b>and many others</b> <b>and they all made very significant contributions on TPUs</b> <b>mm</b> <b>right, and good</b> <b>so that's the background</b> <b>meaning we now have some GPUs to work with</b> <b>and now</b> <b>we can work on things that are a bit more</b> <b>closely related to large models</b> <b>so this is why I started working on</b> <b>the Cambrian project</b> <b>right uh</b> <b>and of course</b> <b>all of these narratives</b> <b>these stories</b>

<b>are still completely rooted in my</b> <b>logic from all these years</b> <b>which is, uh</b> <b>first, representation is extremely important</b> <b>second, regardless of whether you're solving</b> <b>a standard computer vision task</b> <b>or we're now in</b> <b>the era of multimodal large models</b> <b>and solving these problems through VQA</b> <b>I think all of these are alike</b>

<b>right, and underneath it all</b> <b>there's still something substantive</b> <b>that we need to think through</b> <b>right, and this part</b> <b>anyway, about language and vision</b> <b>we can talk about that later</b> <b>and then</b> <b>we later also had a paper called Cambrian-S</b> <b>this paper goes even further</b> <b>we're not just doing image-level VQA tasks</b> <b>we also want to involve video</b> <b>to deal with video</b>

<b>right and this thing</b> <b>actually the real reason I genuinely wanted</b> <b>to work on this</b> <b>goes back to films again</b> <b>and also has to do with</b> <b>two Chinese directors I like</b> <b>quite a lot</b> <b>director Jia, you know</b> <b>Jia Zhangke and Bi Gan</b> <b>both very well-known Chinese directors</b>

<b>right, Bi Gan's Kaili Blues extensively uses</b> <b>long takes</b> <b>and this made me think, okay</b> <b>while to him it's a visual tool</b> <b>for humans, this is also a very important</b> <b>a very important medium</b> <b>for visual understanding</b> <b>because, what is a long take?</b>

<b>life itself is one long take</b> <b>our eyes are our camera</b> <b>mm</b> <b>we are constantly</b> <b>doing all kinds of things in this world</b> <b>right, and the things we see</b> <b>the medium is video</b> <b>it's all video</b> <b>right</b> <b>but</b> <b>we can see the pixels in this video</b> <b>and everything behind them</b> <b>we can reason about causality</b> <b>we can perceive space</b> <b>right</b>

<b>and Jia Zhangke said something I</b> <b>deeply agreed with</b> <b>he told me this in New York</b> <b>he said what makes film so interesting</b> <b>is that if you just look at the timeline</b> <b>it's a linear timeline</b> <b>but at every point on this timeline</b> <b>you need a space to extend its time</b> <b>like we're talking right now</b> <b>even though it seems like a static frame</b>

<b>but imagine you had a long take</b> <b>or rather</b> <b>you're on the streets of New York right now</b> <b>under the bridge in Dumbo</b> <b>right</b> <b>what you see is still frame after frame</b> <b>mm right</b> <b>but what those frames represent behind them</b> <b>is the state of the world</b> <b>the global information of the entire space</b>

<b>this thing completely transcends</b> <b>what a single lens encodes</b> <b>in each individual, isolated frame</b> <b>I think this makes a lot of sense</b> <b>so this is what made me think</b> <b>we still need to work on video going forward</b> <b>even if video is hard to work with</b> <b>even if video requires handling massive amounts of data</b> <b>we still have to do it</b> <b>so with Cambrian-S</b> <b>that's what we're doing</b>

<b>and this work is a bit like a position paper</b> <b>a position paper is, how should I put it</b> <b>the translation would be an opinion paper</b> <b>meaning</b> <b>I want to put forward this kind of viewpoint</b> <b>so in that paper</b> <b>we discuss the concept of super sensing</b> <b>meaning the concept of hyper-perception</b> <b>and it's also a paper about data</b> <b>and a paper about</b>

<b>architectural structure</b> <b>and it's also</b> <b>a paper about spatial intelligence</b> <b>so Professor Fei-Fei also gave us</b> <b>a lot of invaluable advice</b> <b>mm-hmm</b> <b>but the core idea is we want to define a paradigm</b> <b>for where multimodal AI should go from here</b> <b>right, and then</b> <b>so</b> <b>if you look at this problem step by step</b> <b>meaning we</b>

<b>this may be an imperfect analogy</b> <b>but you can draw a parallel with autonomous driving</b> <b>you might have an L0 system</b> <b>a system with nothing at all</b> <b>it's basically an old language model</b> <b>it can't perceive the world at all</b> <b>all this visual knowledge</b> <b>it can't see images</b> <b>it can't see videos either</b> <b>right</b> <b>but it can, through language</b> <b>like Plato's Cave allegory</b>

<b>indirectly understand the world</b> <b>that's fine</b> <b>we call it L0</b> <b>L1 is the current multimodal system</b> <b>with slightly better capabilities</b> <b>it's capable of what you'd call show and tell</b> <b>meaning you show it something</b> <b>and then it can tell you</b> <b>some answers about what you showed it</b> <b>right, you ask it a question</b> <b>and it gives you an answer</b> <b>this might be L1</b> <b>then L2, I think, is</b>

<b>what I call streaming event cognition</b> <b>meaning now this thing</b> <b>doesn't just look at a static image</b> <b>you'd have a continuous, streamable</b> <b>visual stream like this</b> <b>a visual stream</b> <b>your intelligent system</b> <b>needs to be able to understand this visual stream</b> <b>and be able to process this visual stream</b> <b>and also be able to answer questions</b> <b>be able to understand</b> <b>what's happened</b>

<b>right, and then the next stage</b> <b>uh, I call it spatial cognition</b> <b>meaning this is about</b> <b>what I was just saying</b> <b>which is that you</b> <b>at every point in this temporal sequence</b> <b>how to see beyond the present moment</b> <b>to the space behind these pixels</b> <b>right</b> <b>this is also a very deep</b> <b>and very unique ability for humans</b> <b>and ultimately</b> <b>actually um</b>

<b>I think the endgame is</b> <b>we need a predictive world model</b> <b>yes, some kind of predictive world model</b> <b>this is what can tell you</b> <b>everything about the real world you observe</b> <b>yes, I think</b> <b>what I want to convey through this paper is</b> <b>we're building a staircase</b> <b>step by step</b> <b>leading toward a future with a world model</b> <b>mm-hmm</b>
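The capability staircase he describes, analogous to autonomous-driving levels, can be summarized as a small enumeration. A minimal sketch: the level names below are my paraphrase of the conversation, not an official taxonomy from the Cambrian-S paper.

```python
from enum import IntEnum

class MultimodalLevel(IntEnum):
    """Staircase of capabilities toward a predictive world model,
    paraphrasing the levels discussed above (an analogy to
    autonomous-driving levels, not an official taxonomy)."""
    L0_LANGUAGE_ONLY = 0           # no perception; knows the world only through text
    L1_SHOW_AND_TELL = 1           # static-image VQA: show it something, it answers
    L2_STREAMING_EVENT = 2         # understands a continuous visual stream over time
    L3_SPATIAL_COGNITION = 3       # infers the space behind the pixels of each frame
    L4_PREDICTIVE_WORLD_MODEL = 4  # predicts how the observed world will evolve

# Each level presupposes the capabilities of the levels below it,
# so ordinary integer comparison captures the ordering.
assert MultimodalLevel.L2_STREAMING_EVENT > MultimodalLevel.L1_SHOW_AND_TELL
print(list(MultimodalLevel))
```

Using `IntEnum` rather than `Enum` is a deliberate choice here: it makes the "higher level subsumes lower level" ordering directly comparable.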

<b>um, although we may not know</b> <b>exactly how to define this world model</b> <b>at least in this paper</b> <b>we won't attempt to do that definitional work</b> <b>but we can identify</b> <b>which capabilities are absolutely necessary</b> <b>yes, so that's the core of this paper</b> <b>and this paper</b> <b>um, we also filmed a short video</b> <b>which I also posted on Twitter</b>

<b>some students</b> <b>we didn't spend any money</b> <b>it wasn't for promotion</b> <b>just some students with cameras</b> <b>filming on the streets of New York</b> <b>um, unfortunately we weren't able to</b> <b>shoot a Bi Gan-style long take</b> <b>but</b> <b>filming as we walked</b> <b>it was a love letter to New York, I suppose</b> <b>and then</b> <b>but a lot of people didn't understand</b> <b>saying why are you filming this</b> <b>does this have anything to do with your paper</b>

<b>mm-hmm</b> <b>I said of course it does</b> <b>our paper itself is about</b> <b>an intelligent agent living in the real world</b> <b>how it can ingest this continuous</b> <b>visual stream signal</b> <b>and</b> <b>be able to perceive what's happening in the world</b> <b>it might be moved by certain things</b> <b>right</b> <b>be surprised</b> <b>feel astonished</b> <b>but most of the time</b>

<b>its brain will have some kind of</b> <b>spontaneously operating world model</b> <b>guiding everyone to be themselves</b> <b>guiding everyone to live in this world</b> <b>yes, I think</b> <b>this paper is actually quite interesting</b> <b>because I had never done this kind of work before</b> <b>kind of like</b> <b>wanting to set an agenda</b> <b>defining the problem like this</b> <b>so</b> <b>so, I also hope to learn more from Professor Fei-Fei</b>

<b>Professor Fei-Fei often talks about the North Star, right</b> <b>so the question I've always been asking is</b> <b>what exactly is the North Star of vision</b> <b>mm-hmm, what exactly is that question</b> <b>and how should we solve it</b> <b>yes, so that's this paper</b> <b>did you find the answer</b> <b>um, I couldn't find the answer</b> <b>if I'd found the answer I wouldn't be sitting here</b> <b>I think this is an ultimate question</b> <b>mm-hmm</b>

<b>I don't think this is just a computer vision problem</b> <b>or rather, what I actually want to say is</b> <b>actually, the term computer vision</b> <b>is also very interesting</b> <b>it's called vision</b> <b>and vision has a double meaning</b> <b>it's a very ambiguous word</b> <b>vision refers to both your eyesight</b> <b>and your foresight about the future</b> <b>right, when you say someone has great vision</b> <b>meaning they have a grand vision</b> <b>visionary vision yes</b> <b>um, so I think computer vision</b>

<b>actually</b> <b>um</b> <b>I can say I am someone who</b> <b>works in computer vision</b> <b>yes, but computer vision in my definition</b> <b>is a perspective</b> <b>it's not a specific task</b> <b>it's not even a</b> <b>specific field</b> <b>it's a perspective</b> <b>a perspective means a point of view</b> <b>yes, or rather it is</b>

<b>I think it's quite fundamental to intelligence</b> <b>it's a collection of problems</b> <b>that intelligence must solve</b> <b>it's a collection</b> <b>right, let me be more specific</b> <b>so what is vision</b> <b>or what problems does vision address</b> <b>mm-hmm</b> <b>I may not be able to articulate it clearly</b> <b>let me think</b> <b>um,</b> <b>first, the signals it handles are in continuous space</b> <b>high-dimensional, noisy signals</b>

<b>mm-hmm</b> <b>right, these are the problems computer vision needs to solve</b> <b>it's not about writing lots of text on paper</b> <b>we need to evolve some kind of intelligence</b> <b>that doesn't avoid this problem</b> <b>it addresses this domain</b> <b>its target domain</b> <b>is completely different from language</b> <b>right</b> <b>continuous, high-dimensional, noisy signals</b> <b>these are the problems Vision needs to solve</b> <b>second, from the very first day of doing Vision</b>

<b>from the first paper I just mentioned</b> <b>starting from DSN or HED</b> <b>I already knew</b> <b>or rather I had this kind of bet</b> <b>that for vision</b> <b>the most important thing</b> <b>is to learn this kind of hierarchical representation</b> <b>this is extremely important</b> <b>if your representation lacks hierarchy</b> <b>you won't be able to solve</b>

<b>many, many problems in this world</b> <b>the hierarchical process is an abstraction process</b> <b>and the process of abstraction</b> <b>is what's called a generalization process</b> <b>this is also very different from a language model</b> <b>because a language model</b> <b>operates purely in the semantic space</b> <b>when thinking about this problem</b> <b>so</b> <b>there are of course other characteristics</b> <b>for example, I say vision as a perspective, um</b>
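The hierarchy-as-abstraction idea above can be made concrete with a toy sketch. This is pure Python, not the actual DSN/HED architecture; `downsample` and `hierarchical_features` are hypothetical helper names for illustration only. Each level summarizes the one below it, trading spatial detail for abstraction.

```python
def downsample(grid):
    """2x2 average-pool a square grid (list of lists), halving each side."""
    n = len(grid)
    return [
        [(grid[2 * i][2 * j] + grid[2 * i][2 * j + 1] +
          grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
         for j in range(n // 2)]
        for i in range(n // 2)
    ]

def hierarchical_features(image, levels=3):
    """Toy hierarchical representation: a pyramid where each coarser
    level abstracts away spatial detail from the level below it."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

# A fake 8x8 "image" whose pixel value is just row + column.
img = [[float(i + j) for j in range(8)] for i in range(8)]
shapes = [len(level) for level in hierarchical_features(img)]
print(shapes)  # [8, 4, 2, 1]
```

The coarse levels generalize (one number summarizes many pixels) while the fine levels localize, which is the sense in which hierarchy enables abstraction.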

<b>for example, I think it's also</b> <b>this kind of large-scale parallelization</b> <b>we can now see many, many things</b> <b>many areas of our brain's cortex are firing</b> <b>right, and then</b> <b>we're processing in parallel</b> <b>many many different objects</b> <b>and their</b> <b>causal patterns</b> <b>and their physical changes</b> <b>these things are happening at different times</b> <b>and in different spaces</b> <b>all simultaneously</b>

<b>and we have a way</b> <b>to capture all these changes</b> <b>I think this thing</b> <b>is also an important characteristic of vision</b> <b>um</b> <b>and finally, there may be one more, which is some kind of</b> <b>um</b> <b>I'm not sure how to define this thing</b> <b>some kind of feature sharing</b> <b>what this means is</b> <b>for example, I look at</b> <b>the semantic part of this matter</b>

<b>or the real understanding part</b> <b>that is to say</b> <b>I now see a dog drawn by a child</b> <b>and a cartoon dog in an animation</b> <b>and a real dog running around in the real world</b> <b>right, and then</b> <b>how do I connect all these different visual</b> <b>entities together, right</b> <b>building this kind of abstract cognition</b> <b>saying, hey, they're all dogs, right</b>

<b>even though they're vastly different</b> <b>in this, um</b> <b>from a data perspective, you know</b> <b>they're so far apart</b> <b>not a single pixel is comparable</b> <b>so what I want to say is, um</b> <b>vision may have even more problems to solve</b> <b>I actually haven't thought carefully about this</b> <b>yes, anyway it'll have some common characteristics like these</b> <b>these features</b> <b>right, hierarchical structure</b> <b>and this kind of continuous domain modeling, um</b>

<b>and also this kind of</b> <b>large-scale parallelism and large-scale sharing</b> <b>I think these things</b> <b>are all part of an intelligent agent</b> <b>this thing</b> <b>cannot simply be reduced to</b> <b>just a computer vision system</b> <b>solving a small subset of problems</b> <b>mm-hmm</b> <b>so that's why I think</b> <b>computer vision</b> <b>I think</b>

<b>I think although fewer and fewer people are working on</b> <b>this direction</b> <b>fewer and fewer students are applying to this area</b> <b>when people are undergraduates</b> <b>choosing a direction</b> <b>they're also increasingly unwilling to choose</b> <b>something called computer vision</b> <b>um, and then</b> <b>when faculty are hiring, too</b> <b>we're probably increasingly less likely to</b>

<b>hire a professor doing pure computer vision</b> <b>but I think</b> <b>if you consider computer vision</b> <b>as a perspective</b> <b>it's the essence of intelligence</b> <b>look at the past few years</b> <b>after ChatGPT arrived</b> <b>CV previously</b> <b>occupied a very central position in artificial intelligence</b>

<b>of course, this happened after you entered the field</b> <b>um, in recent years LLMs have risen</b> <b>and CV has been pushed back to a more marginal position</b> <b>in this process</b> <b>do people like you feel discouraged</b> <b>um</b> <b>I don't feel discouraged at all</b> <b>not the least bit</b> <b>I think, as I said</b> <b>I should be grateful for LLMs</b> <b>yes, without LLMs</b> <b>Vision couldn't have expanded into the truly</b>

<b>large scope of multimodal intelligence it has now</b> <b>from the perspective of vision's development history</b> <b>there are actually two axes</b> <b>you can draw them — this axis</b> <b>goes back to ancient times, right</b> <b>at the earliest stage</b> <b>the things computer vision needed to handle</b> <b>were always the most singular</b> <b>most concrete and simplest tasks</b> <b>like MNIST digit recognition, right</b>

<b>1234, I need to</b> <b>determine which digit it is</b> <b>and then later there were some small datasets</b> <b>like CIFAR data</b> <b>a 32×32 pixel</b> <b>ten-class classification problem</b> <b>is it a cat or a dog</b> <b>is it a car or an airplane</b> <b>and then later</b> <b>datasets like ImageNet appeared</b> <b>it became a 256×256</b> <b>level</b> <b>doing classification, right</b>

<b>um, but at those times</b> <b>things were relatively controllable</b> <b>and then later</b> <b>there were detection and segmentation</b> <b>this more structured kind of</b> <b>cognitive process</b> <b>and these are compositions</b> <b>and then later, right</b> <b>if this axis continues to advance, it leads to</b> <b>the rise of multimodal large-scale models</b> <b>because of the introduction of multimodality</b> <b>we can easily abandon many</b>

<b>of these specific</b> <b>relatively rigid</b> <b>task designs</b> <b>and now I can take an image</b> <b>and ask all kinds of questions</b> <b>language as a great interface</b> <b>can help you solve many many problems</b> <b>right, so you can see over this time</b> <b>um, this axis</b>

<b>goes from simple to complex tasks</b> <b>such an axis</b> <b>but also an axis where language starts</b> <b>gradually entering computer vision</b> <b>so then</b> <b>there are two issues here</b> <b>the first is that after language entered vision</b> <b>it brought us enormous benefits</b> <b>allowing us to freely define problems</b> <b>we can ask anything</b> <b>and we can get any answer</b> <b>mm-hmm</b>

<b>but the second important risk is</b> <b>language's involvement has led to</b> <b>your dependence on language also increasing</b> <b>mm-hmm</b> <b>so many so-called multimodal cases</b> <b>these tasks are actually unrelated to vision</b> <b>purely a language problem</b> <b>mm-hmm</b> <b>from this perspective</b> <b>um, of course I think, yes</b> <b>vision seems to have become marginalized</b>

<b>mm-hmm right</b> <b>but of course I don't feel discouraged</b> <b>I see it as an enormous opportunity</b> <b>because in the end</b> <b>if the problems you're solving now</b> <b>are relatively simple</b> <b>then it doesn't matter</b> <b>problems you can solve with language</b> <b>just use language to solve them</b> <b>right um</b> <b>even though I can't do so-called grounding</b> <b>meaning I can't know</b> <b>for the red apple you describe to me</b> <b>what exactly red is</b>

<b>what exactly is an apple</b> <b>but somehow through statistical information</b> <b>in language</b> <b>I can still complete some decision-making tasks</b> <b>no one can fault you for this</b> <b>I think that's fine</b> <b>but the huge hidden opportunity is</b> <b>when the day truly comes</b> <b>that we need to deal with the real world</b> <b>real tasks</b> <b>to build some kind of real intelligence</b> <b>ah</b> <b>then this currently imperfect</b>

<b>visual representation</b> <b>will be a major deficiency</b> <b>so Yann LeCun's view is</b> <b>everyone right now is just using a crutch</b> <b>that crutch being the language model itself</b> <b>right, and even though you can walk</b> <b>and you'd think</b> <b>hey, I'm walking pretty well</b> <b>but you probably can't run</b> <b>and you can't participate in the Olympics</b> <b>right, because one of your legs</b> <b>the so-called leg of visual representation</b>

<b>which is still not good enough</b> <b>why do you call it real intelligence</b> <b>why isn't an LLM real intelligence</b> <b>because I think</b> <b>an LLM is virtual intelligence</b> <b>but our intelligence</b> <b>so-called intellect</b> <b>isn't that also virtual</b> <b>oh, I think the word virtual may not be right</b> <b>what I define as real</b> <b>is something that has to interact with the real world</b> <b>yes, what does that mean</b>

<b>meaning look</b> <b>the problems that LLMs can solve well now</b> <b>mostly still occur in the digital space</b> <b>mm-hmm</b> <b>mm-hmm, for example</b> <b>um, it can memorize</b> <b>all this factual knowledge</b> <b>it can know</b> <b>right, we can put all</b> <b>these Wikipedia articles</b> <b>all in there</b> <b>and it can tell us everything we want to know</b> <b>it can serve as a very good legal advisor</b> <b>it can</b>

<b>even help summarize knowledge</b> <b>and do education</b> <b>do teaching</b> <b>a lot of these things</b> <b>right, and I think LLMs</b> <b>um, are of course revolutionary</b> <b>but this is different from the problems that vision</b> <b>as a perspective needs to solve</b> <b>actually they're completely different domains</b> <b>meaning</b> <b>if what you need to handle is continuous</b>

<b>high-dimensional space</b> <b>in this kind of noisy domain</b> <b>then things like, for example, robots</b> <b>these domains aren't just robots</b> <b>by the way, robots are one good example</b> <b>I'll get to that in a moment</b> <b>ah, these things are very hard to tokenize</b> <b>they've already left this virtual space</b> <b>left this digital space</b> <b>right, what kind of tasks does this involve</b> <b>you're absolutely right</b> <b>I think robots are</b>

<b>there will also be many</b> <b>industrial applications, right</b> <b>industrial process control</b> <b>meaning all those involving</b> <b>modeling sensory signals</b> <b>with many different kinds of sensors</b> <b>right, and they perceive what's happening</b> <b>in this world</b> <b>and you now need a unified algorithm</b> <b>to model this environment</b> <b>this system</b> <b>so that you then</b>

<b>perform an action or intervention</b> <b>meaning that when you</b> <b>take an action or make an intervention</b> <b>you're able to predict</b> <b>how this system</b> <b>will change next</b> <b>this is very hard for LLMs to do</b> <b>mm-hmm</b> <b>and you're absolutely right about that</b> <b>I think from my perspective, there are actually two extremes</b> <b>one extreme is LLMs, um</b> <b>very good at operating in the digital space</b>

<b>doing many many things</b> <b>and also very good at</b> <b>using coding as an interface</b> <b>right, through agents</b> <b>to intervene in our physical lives</b> <b>um, this will also happen</b> <b>and that's fine</b> <b>but ultimately it's still based on discrete tokens</b> <b>these one-by-one positions</b> <b>ah, on the far right is Robotics</b> <b>and this Robotics must be</b>

<b>truly general-purpose robotics</b> <b>meaning it can generalize</b> <b>to a certain degree</b> <b>such that it can do everything a human can do</b> <b>mm-hmm, it has its own decision-making system</b> <b>and it has its own brain</b> <b>mm-hmm, and I feel these two extremes now exist</b> <b>right</b> <b>and how from LLMs</b>

<b>step by step it extends to Robotics</b> <b>I think this is what computer vision</b> <b>or, in the new era,</b> <b>visual intelligence needs to solve</b> <b>right</b> <b>and then</b> <b>I think this is also the future of multimodal</b> <b>mm-hmm</b> <b>because obviously, robotics still doesn't work now</b> <b>and I often tell students</b>

<b>or people around me</b> <b>actually um</b> <b>the thing I most want to achieve</b> <b>is to solve the Robotics problem</b> <b>without doing Robotics</b> <b>why is that</b> <b>mm-hmm, is it because you think</b> <b>the Robotics approach can't solve the Robotics problem</b> <b>not exactly</b> <b>it's because I think</b> <b>Robotics is advancing too quickly</b> <b>right</b> <b>now at the Spring Festival Gala there's Unitree Robotics and all that</b> <b>yes I think</b>

<b>I find it all rather jaw-dropping</b> <b>but on the other hand</b> <b>I think</b> <b>there still needs to be someone focused on the pre-training part</b> <b>which is the so-called robot brain</b> <b>what exactly it is</b> <b>mm-hmm</b> <b>or how this brain includes your visual system</b> <b>right, as for the control part</b> <b>the hardware part</b> <b>it's a case of</b> <b>brothers climbing the mountain, each making their own effort</b> <b>I don't think I need to</b>

<b>intervene in hardware too early</b> <b>and do those things</b> <b>right</b> <b>I think there are fundamental research problems now</b> <b>that haven't been solved at the software level</b> <b>haven't been solved in building this brain</b> <b>we need to focus first on solving this part</b> <b>of course many people will argue</b> <b>you have to have</b> <b>something like a closed loop</b> <b>you need some kind of collaborative approach</b> <b>you need to validate on your robots</b> <b>otherwise</b> <b>if you build some algorithm now</b>

<b>some model may not be useful</b> <b>mm-hmm</b> <b>I fully agree with that</b> <b>but I think</b> <b>this can be done through some kind of partnership</b> <b>yes, I just don't want to buy robots</b> <b>I also don't have the money</b> <b>I can't afford that many robots</b> <b>robots also have their own hardware scaling</b> <b>by the way</b> <b>you need to buy many robots</b> <b>to do hardware well</b> <b>mm-hmm</b> <b>yes, I want to focus on the brain part</b>

<b>and I think this</b> <b>is a problem that computer vision needs to solve</b> <b>a problem that representation learning needs to solve</b> <b>and also</b> <b>I think ultimately the problem that a world model needs to solve</b> <b>look at Kaiming, he started thinking about this so early</b> <b>wanting bigger, bigger, bigger</b> <b>mm-hmm</b> <b>why</b> <b>why did LLM Scaling Laws come so much earlier than CV</b> <b>um, good question</b>

<b>yes, I think first of all we can't say that much earlier</b> <b>because CV currently doesn't have a Scaling Law</b> <b>right, and actually before</b> <b>we were all pretty desperate</b> <b>I said, oh no</b> <b>this vision thing</b> <b>how come it still doesn't have a Scaling Law</b> <b>now maybe it's alright</b> <b>now for example these video diffusion models</b> <b>have some Scaling Behavior</b> <b>what's called Scaling</b> <b>is that you can consume the data</b> <b>yes, and then</b>

<b>you can get better results</b> <b>right</b> <b>or rather</b> <b>this is the more formal characterization</b> <b>of your Scaling Behavior</b> <b>meaning if you now have a Transformer system</b> <b>then it satisfies this</b> <b>ratio, C = 6ND</b> <b>meaning your</b> <b>compute is basically equal to 6 times</b> <b>your tokens times your</b> <b>number of parameters</b>
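The C = 6ND rule of thumb he cites can be checked with a quick back-of-the-envelope calculation. A minimal sketch: the standard accounting behind the rule is roughly 2ND FLOPs for the forward pass and 4ND for the backward pass of a dense Transformer; `training_compute_flops` is an illustrative helper name, and the 7B/2T figures are an arbitrary example, not numbers from the conversation.

```python
def training_compute_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense Transformer:
    C ~= 6 * N * D, i.e. ~2ND FLOPs forward plus ~4ND backward
    per training token."""
    return 6.0 * n_params * n_tokens

# Example: a 7B-parameter model trained on 2T tokens.
c = training_compute_flops(7e9, 2e12)
print(f"{c:.1e} FLOPs")  # 8.4e+22 FLOPs
```

The point of writing it out is that the rule makes compute a simple function of two knobs, parameters N and tokens D, which is exactly the kind of clean budget relationship that, as he argues next, vision may not need.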

<b>and I want</b> <b>I want to use this</b> <b>formal definition to make this point</b> <b>because I now think</b> <b>more and more that vision doesn't need a Scaling Law</b> <b>oh, why is that</b> <b>because again</b> <b>what vision cares about</b> <b>is completely different from what language cares about</b> <b>it's not a radical claim</b> <b>but it is a viewpoint</b> <b>a long-held view</b> <b>and many people doing NLP</b> <b>actually agree with this view</b>

<b>that is, a language model</b> <b>is actually not a self-supervised learning process</b> <b>it's actually a strongly</b> <b>supervised learning process</b> <b>meaning it's a strongly supervised process</b> <b>it depends on how you look at it</b> <b>what does supervised or unsupervised mean</b> <b>yes, the logic here is as follows</b> <b>generally speaking</b> <b>we say whether you have external annotations</b> <b>external labels</b> <b>this determines whether you are self-supervised</b> <b>or</b>

<b>or strongly supervised learning</b> <b>right, but language is such a special case</b> <b>what is language</b> <b>language</b> <b>is what humans over the past few thousand years of civilization</b> <b>through continuous evolution</b> <b>whether in a sociological sense</b> <b>or in each individual person's sense</b> <b>and processed</b> <b>everything about this world</b>

<b>and stored it in a tokenized form</b> <b>storing it down</b> <b>and we happened to have something called the internet</b> <b>and we uploaded this knowledge</b> <b>all to the internet</b> <b>so for all LLM researchers</b> <b>this is for free</b> <b>but something being free doesn't mean it has no labels</b> <b>then one question is</b> <b>suppose we didn't have the internet</b> <b>then if you wanted to train language models now</b> <b>could you still do it</b>

<b>put books in</b> <b>yes</b> <b>or suppose you had no books</b> <b>right yes</b> <b>exactly, this kind of</b> <b>knowledge upload</b> <b>this thing</b> <b>is itself a process of supervision construction</b> <b>right</b> <b>so this is different from vision</b> <b>so it's somewhat like language</b> <b>um, wanting to solve problems</b> <b>always staying in this target y space</b> <b>as we usually say</b> <b>you have a mapping from x to y</b>

<b>that's all machine learning</b> <b>you can through some</b> <b>regardless of where x and y are</b> <b>you can define the problem this way anyway</b> <b>and y is usually what people call supervision</b> <b>is the label, and x is your data</b> <b>right</b> <b>you can think of this</b> <b>language model as</b> <b>actually only characterizing things in the y space</b> <b>mm-hmm</b>

<b>mm-hmm, but this is true</b> <b>going back to the earlier question</b> <b>meaning this is actually insufficient to represent</b> <b>the totality of this world</b> <b>there are many things</b> <b>that you can't describe and characterize</b> <b>through language</b> <b>or rather this is both the advantage of language</b> <b>and also language</b> <b>may eventually, as I said, gradually fade</b> <b>or rather</b>

<b>LLM won't be the foundation of the entire world model</b> <b>that's one reason</b> <b>the reason is</b> <b>its advantage is</b> <b>you don't need to do anything</b> <b>to achieve some kind of alignment with humans</b> <b>because every sentence and every word you write</b> <b>is written by humans</b> <b>is written by humans</b> <b>mm-hmm right</b> <b>when you write this down</b> <b>what is language</b> <b>language is a communication tool</b> <b>language is not a</b> <b>thinking map</b>

<b>language is not even a decision-making tool</b> <b>it's a form of communication</b> <b>it's actually a communication tool</b> <b>mm-hmm</b> <b>so if it is a communication tool</b> <b>you always have to make some trade-offs</b> <b>you always have to sacrifice something</b> <b>so, ah, and then I think</b> <b>I think, um</b> <b>what I mainly want to say is yes</b>

<b>as a communication tool</b> <b>it aligns well with humans</b> <b>but on the other hand</b> <b>it has also lost a lot</b> <b>which it originally</b> <b>as an intelligent system</b> <b>should be modeling</b> <b>mm-hmm right</b> <b>for example, right now</b> <b>I have a cup of water</b> <b>I have a cup that fell on the ground and broke</b> <b>this is actually a linguistic</b> <b>the reason we say it this way</b> <b>is because this is the</b>

<b>most suitable thing for our communication</b> <b>we only care about the outcome and state of things</b> <b>right</b> <b>we don't care how a cup fell to the ground</b> <b>and how exactly it broke</b> <b>right, which physical</b> <b>laws it obeyed</b> <b>the dynamics behind it</b> <b>what exactly they are</b> <b>yes, so what exactly are its dynamics</b> <b>we don't care about these things</b> <b>right</b> <b>so I think this is also a limitation of it</b> <b>mm-hmm</b> <b>LLM people would complain that</b>

<b>after adding vision</b> <b>it might affect their intelligence</b> <b>ah why really</b> <b>yes, he hopes, um</b> <b>like Yang Zhilin, saying adding multimodal</b> <b>they hope it won't be a dumb multimodal</b> <b>ah yes</b> <b>I agree</b> <b>of course you shouldn't use a dumb multimodal</b> <b>but I think if you don't add vision</b> <b>you'll definitely be dumb</b> <b>and, but I think</b> <b>the fundamental issue is</b>

<b>how to define smart and dumb</b> <b>yes, it's about intelligence</b> <b>the definition of intelligence is different</b> <b>the definition of intelligence is different</b> <b>and or rather</b> <b>how exactly to define</b> <b>what is a simple task</b> <b>what is a difficult task</b> <b>mm-hmm</b> <b>over the past few decades</b> <b>all these AI researchers</b> <b>would continuously encounter</b> <b>this so-called Moravec's paradox</b>

<b>this Moravec's paradox</b> <b>what this paradox says is</b> <b>things that are easy for machines</b> <b>or um</b> <b>the easy problem is hard</b> <b>the hard problem is easy</b> <b>this is a paradox</b> <b>meaning things that are easy for machines</b> <b>are actually hard for humans</b> <b>and things that are hard for machines</b> <b>are actually easy for humans</b> <b>you seem to have several works at NYU</b> <b>um right</b>

<b>I think starting with V*</b> <b>um, V* is actually just one piece of work</b> <b>I think it's quite interesting</b> <b>could you talk about it</b> <b>because we were the first to think about</b> <b>wanting to build in a multimodal system</b> <b>a system two</b> <b>what's called</b> <b>that can</b> <b>do scaling at test time</b> <b>such a model</b> <b>meaning we</b> <b>when we look at the world around us</b> <b>for example I want to ask you a question now</b> <b>right</b>

<b>for example</b> <b>like something around you</b> <b>there's a trash can nearby</b> <b>what color is it</b> <b>you won't directly like a language model</b> <b>directly tell me an answer</b> <b>you'll definitely first think</b> <b>where is this trash can</b> <b>you might turn around and look</b> <b>discover</b> <b>there's a refrigerator over there</b> <b>maybe the trash can is next to the refrigerator</b> <b>then you'd localize this object</b> <b>and find this object</b> <b>right, and then tell me an answer</b> <b>so you have this visual reasoning here</b>

<b>right, some kind of visual reasoning here</b> <b>and then</b> <b>this thing</b> <b>it's entirely a behavior in a reasoning process</b> <b>right, and then</b> <b>and then this thing</b> <b>we built such a system back then</b> <b>and this is also</b> <b>um,</b> <b>for example, before o1</b> <b>a very long time</b> <b>yes, at least a few months</b> <b>and we started doing this</b> <b>mm-hmm right</b> <b>at that time this kind of test time scaling</b> <b>was not a buzzword at all</b>

<b>nobody had been talking about this</b> <b>okay right</b> <b>and I think this is worth talking about</b> <b>because for me</b> <b>it's actually an inspiration</b> <b>I think it's both</b> <b>I think it's a bittersweet</b> <b>kind of lesson</b> <b>meaning it</b> <b>the bitter part is</b> <b>let me first tell you what happened</b> <b>after we had this paper</b> <b>we had our own benchmark</b> <b>and then we found</b>

<b>meaning</b> <b>I have two friends</b> <b>Alex Kirillov</b> <b>who's also the author of SAM</b> <b>and Bowen Cheng</b> <b>both of them work at OpenAI</b> <b>mm-hmm so</b> <b>I talked with them for a long time</b> <b>we told them</b> <b>what our work had done</b> <b>our benchmark is here now</b> <b>you can try it out</b> <b>and I also discussed</b> <b>some of the logic behind it</b> <b>right meaning</b> <b>how you can do this kind of visual thinking</b> <b>and later</b>

<b>Alex and Bowen drove this project at OpenAI</b> <b>drove this project</b> <b>this project is called think with image</b> <b>and later, maybe over a year later</b> <b>right, and then this product launched</b> <b>mm-hmm, and after this product launched it was called</b> <b>think with image</b> <b>and inside, many examples or their benchmarks</b> <b>were actually the benchmarks from our paper</b> <b>oh</b> <b>so</b> <b>what makes me very happy about it is</b> <b>this is the first time</b>

<b>I thought, hey</b> <b>we can actually find a way</b> <b>to truly take a different path</b> <b>this can somehow</b> <b>inspire researchers at OpenAI</b> <b>to improve their own models</b> <b>mm-hmm</b> <b>I think this at least makes me feel</b> <b>there are things to do in academia</b> <b>mm-hmm</b> <b>but on the other hand</b> <b>um, it's also rather bitter</b> <b>because</b>

<b>you see, at that time OpenAI, right</b> <b>at the time of Sora</b> <b>why people were able to accept DiT</b> <b>was also because DiT</b> <b>um</b> <b>would be cited in Sora's blog post</b> <b>or Bill's name being on it</b> <b>letting people find this logic</b> <b>and the clues behind it</b> <b>mm-hmm right</b> <b>but unfortunately</b> <b>I think, gradually</b> <b>in recent years</b> <b>industrial research labs</b>

<b>have become increasingly closed</b> <b>so at first everyone published papers</b> <b>later people couldn't publish papers anymore</b> <b>you could write some blog posts</b> <b>you could add acknowledgments</b> <b>and also list the names of each team member</b> <b>and further on</b> <b>you could publish a blog post</b> <b>but there could no longer be author credits</b> <b>only</b> <b>OpenAI team or Gemini team</b> <b>that's it</b> <b>so I think this</b> <b>mm-hmm</b> <b>will lead to, I don't know</b>

<b>whether the next, originally healthy</b> <b>kind of exchange between academia and industry</b> <b>those channels</b> <b>will be cut off</b> <b>mm-hmm right</b> <b>doing research</b> <b>is fundamentally a labor of love</b> <b>we explore these questions</b> <b>not really because</b> <b>it can deliver some product</b> <b>or earn how much money</b> <b>but on the other hand, um</b> <b>some kind of credit assignment</b>

<b>meaning letting everyone know who did what</b> <b>I think this is something that over the past few decades</b> <b>has supported academia's ability to move forward</b> <b>a mechanism</b> <b>but now</b> <b>this mechanism is gradually being</b> <b>being eroded by LLMs</b> <b>this generation of models</b> <b>and the organizational structures behind this generation of models</b> <b>I think gradually broke it</b> <b>it's become commercial competition</b> <b>it has become a form of commercial competition</b> <b>mm-hmm yes</b>

<b>right, and then</b> <b>let me quickly conclude</b> <b>I think there are two more</b> <b>I want to briefly mention</b> <b>this paper, that is</b> <b>this REPA</b> <b>this thing is called representation alignment</b> <b>look, there's another keyword: representation</b> <b>so</b> <b>that's why I really like this paper</b> <b>but this paper also</b> <b>went through such a long time</b> <b>and all these past works</b>

<b>combined in a strange way</b> <b>formed a kind of chemical reaction</b> <b>mm-hmm, and then</b> <b>opening up, at least</b> <b>a small research domain</b> <b>and what it does is quite simple</b> <b>it's essentially</b> <b>a Deeply Supervised Net</b> <b>meaning a model you have now</b> <b>doesn't only have a diffusion loss at the top</b> <b>which is your final objective</b> <b>you also pull out some other things in the middle</b> <b>these objectives</b> <b>you can have other objectives</b>

<b>the objective we used is</b> <b>I want to make a Diffusion Model</b> <b>which is a generative model</b> <b>by the way</b> <b>have its internal representation</b> <b>able to align with an external self-supervised</b> <b>model's representation</b> <b>to align together</b> <b>mm-hmm</b> <b>here</b> <b>again, what's being said is</b> <b>representation is the most important thing</b> <b>not only for systems like Cambrian 1</b> <b>for doing multimodal understanding is it important</b>
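The REPA-style objective described here (a diffusion loss at the top, plus an intermediate-layer alignment toward a frozen self-supervised encoder) can be sketched roughly as follows. This is a sketch of the general idea only, not the paper's exact recipe: the function names, the cosine-similarity choice, and the 0.5 weight are illustrative assumptions.

```python
import numpy as np

def repa_style_loss(pred_noise, true_noise, hidden, target_repr, weight=0.5):
    """Sketch of a deeply supervised diffusion objective:
    denoising loss + representation-alignment loss."""
    # standard denoising objective: MSE between predicted and true noise
    diffusion = np.mean((pred_noise - true_noise) ** 2)
    # alignment objective: negative cosine similarity between the
    # denoiser's intermediate features and a frozen self-supervised
    # encoder's features (projection assumed already applied)
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    t = target_repr / np.linalg.norm(target_repr, axis=-1, keepdims=True)
    alignment = -np.mean(np.sum(h * t, axis=-1))
    return diffusion + weight * alignment
```

When prediction and alignment are both perfect, the loss reaches its minimum of -weight; any denoising error or misalignment raises it.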

<b>it's important for a generative model</b> <b>generating images</b> <b>generating videos too</b> <b>yes so</b> <b>this thing</b> <b>I think it's something for me</b> <b>quite a big inspiration</b> <b>but this hasn't been done thoroughly yet</b> <b>meaning</b> <b>why do we need to use</b> <b>this kind of Deeply Supervised approach</b> <b>such an indirect way to do alignment</b> <b>ah</b> <b>what if</b> <b>can we directly use this powerful</b> <b>representation</b>

<b>as a</b> <b>encoder for your generative model</b> <b>or as its foundation</b> <b>mm-hmm right</b> <b>and this thing took another step forward</b> <b>we also got very good results</b> <b>this paper is called Representation Autoencoder</b> <b>yes, it also involves representation</b> <b>and autoencoder</b> <b>but anyway</b> <b>in this</b> <b>the logic in this thing</b> <b>I think</b>

<b>again I don't want to talk too much about this paper's details</b> <b>but I think there's one thing</b> <b>Professor Ma Yi (founding director of the Institute of Data Science at HKU), when I visited Hong Kong</b> <b>I think what he said was absolutely right</b> <b>he said</b> <b>a student would ask, hey</b> <b>you're doing this right</b> <b>your autoencoder</b> <b>your representation layer will now become very high-dimensional</b> <b>because it's a representation now</b> <b>it's not the original</b> <b>simple pixel-level representation</b>

<b>nor is it a low-dimensional</b> <b>VAE-type representation</b> <b>it's a high-dimensional representation</b> <b>you want to do</b> <b>denoising and image generation on this high-dimensional representation</b> <b>this is actually a very difficult thing</b> <b>right, and a student asked at the time</b> <b>this dimension is too high</b> <b>it might not necessarily be a good thing</b> <b>and then</b> <b>it might make our learning system more complex</b> <b>or make training harder</b>

<b>first of all our results</b> <b>are completely the opposite conclusion</b> <b>but Professor Ma Yi got very excited</b> <b>he stood up and said</b> <b>I want to sincerely tell everyone</b> <b>you must not be afraid of high dimensions</b> <b>high dimensionality is in all machine learning</b> <b>an extremely important cornerstone</b> <b>um including</b> <b>whether in previous</b> <b>so-called kernel learning methods</b>

<b>kernel methods</b> <b>or why in a Transformer</b> <b>we need to have an Up Projection Layer</b> <b>right, you need to have a</b> <b>low-dimensional vector coming in</b> <b>and then turning it into a</b> <b>4 times larger, 4 times wider</b> <b>Fully Connected layer</b> <b>and then</b> <b>all these things</b> <b>are all telling us the following fact</b> <b>that in a high-dimensional space</b>

<b>many problems</b> <b>that couldn't be solved in low-dimensional space</b> <b>can now be solved</b> <b>many problems</b> <b>many types of information that didn't exist in low-dimensional space</b> <b>can now exist</b> <b>and you'll also have better efficiency</b> <b>ah</b> <b>this is</b> <b>this is traditional machine learning theory</b> <b>why you need to do</b> <b>after increasing dimensions</b> <b>making things</b> <b>making your data points linearly separable</b> <b>all the same logic</b> <b>but I feel very encouraged</b>
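The up-projection the speaker mentions, a low-dimensional vector widened 4x before being projected back, can be sketched as a bare position-wise feed-forward block. A minimal sketch: biases, residual connections, and normalization are omitted, and the toy weights below are illustrative.

```python
import numpy as np

def ffn_block(x, w_up, w_down):
    """Position-wise feed-forward block: project the model dimension
    up 4x, apply a nonlinearity in the wide space, project back down."""
    wide = np.maximum(x @ w_up, 0.0)   # up-projection to 4x width + ReLU
    return wide @ w_down               # down-projection back to model dim

d = 8
x = np.ones((2, d))
w_up = np.ones((d, 4 * d)) * 0.1
w_down = np.ones((4 * d, d)) * 0.1
y = ffn_block(x, w_up, w_down)        # shape (2, 8)
```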

<b>in that you should not be afraid of high dimensions</b> <b>I think these are very good words</b> <b>because many times people feel afraid</b> <b>right</b> <b>feel afraid</b> <b>not just high-dimensional representation</b> <b>this thing</b> <b>but also afraid of escaping from some current local optimum</b> <b>meaning right now</b> <b>many things we've done before</b> <b>were all done to jump out of this local optimum</b> <b>mm-hmm</b>

<b>like VAE</b> <b>is the current era's</b> <b>local optimum</b> <b>we hope to use a representation learning approach</b> <b>to link everything together</b> <b>and this thing</b> <b>is actually a very natural thing</b> <b>and then</b> <b>now many people are also working on related papers</b> <b>there are many contemporaneous works</b> <b>all also very good</b> <b>but on the other hand</b> <b>this is also a not-so-natural thing</b> <b>because you need to break out of the existing framework</b>

<b>to do something new</b> <b>yes, but when you can jump out of this local optimum</b> <b>and do something new</b> <b>I think you</b> <b>you'll feel like your world has opened up</b> <b>because RE for us</b> <b>or for my research</b> <b>I think it's still a fairly important work</b> <b>because it tells me something</b> <b>or allows me to make a bet</b> <b>or predict a future</b> <b>what that future is</b> <b>or whether it's right or wrong</b>

<b>we can look again in a few years</b> <b>so this thing is also related to language</b> <b>and also to Diffusion Models</b> <b>like the recently popular Seedance</b> <b>and Sora</b> <b>mm-hmm</b> <b>my current bet is</b> <b>there's only one thing in this world</b> <b>that is important</b> <b>which is how to learn</b> <b>to learn this representation</b> <b>this is important</b> <b>when you have a good enough representation</b>

<b>handling other problems on top of it is simple</b> <b>your Language Model</b> <b>will gradually degrade to a simple</b> <b>communication interface</b> <b>unlike now</b> <b>all this multimodal intelligence</b> <b>is driven by large language models</b> <b>your representation layer only provides some simple</b> <b>a little bit of context</b> <b>right</b> <b>most of the so-called heavy lifting</b> <b>the dirty and heavy work</b>

<b>is all done by large language models</b> <b>mm-hmm</b> <b>the bet I want to make is</b> <b>the future won't be like this</b> <b>in the future you'll have a great foundation</b> <b>mm-hmm</b> <b>it's a great foundation</b> <b>but it's also a great world model</b> <b>mm-hmm, and then</b> <b>what does this world model mean</b> <b>we can talk more about this</b> <b>but this foundation itself</b> <b>may not be a checkpoint</b> <b>it might be neural modules</b>

<b>connected together, multiple components</b> <b>forming a cognitive architecture</b> <b>wow, that sounds quite complex</b> <b>but essentially it's your brain</b> <b>it has different areas handling different things</b> <b>right</b> <b>the language, LLM layer</b> <b>will gradually become</b> <b>your essential representation</b> <b>or rather</b> <b>the foundation of your world model</b> <b>an interface of</b> <b>mm-hmm</b> <b>it's still very important</b> <b>it will never disappear</b>

<b>because humans need a Large Language Model</b> <b>to</b> <b>ask questions</b> <b>and answer questions</b> <b>right</b> <b>to communicate with it</b> <b>need to communicate with it</b> <b>it's a communication interface</b> <b>right</b> <b>also</b> <b>there's another line</b> <b>which is Pixel Generation itself</b> <b>meaning how you generate an image</b> <b>a video itself</b> <b>this thing</b> <b>through REPA</b> <b>some of our previous work</b> <b>we can see</b>

<b>it also needs to be based on a good enough</b> <b>representational foundation</b> <b>ah</b> <b>or you can think of it</b> <b>it's a world model</b> <b>um</b> <b>again in my view</b> <b>in my definition</b> <b>representation is a world model</b> <b>the most, most important part</b> <b>mm-hmm</b> <b>it's not all of it</b> <b>it's the most important part</b> <b>but when we have such a foundation</b> <b>you can think of it</b> <b>we can easily decode it into language</b> <b>right</b>

<b>and then</b> <b>we can easily decode it into pixels</b> <b>and generate videos</b> <b>we can also decode it into some kind of action</b> <b>some kind of movement</b> <b>so it might be some kind of</b> <b>analog to current VLAs</b> <b>mm-hmm</b> <b>but it's based on a stronger representation</b> <b>a stronger world model architecture</b> <b>what parts does the current representation include</b> <b>language is one of them</b> <b>um, I think it's one of them</b> <b>and then</b> <b>but this is also controversial</b> <b>meaning</b>

<b>like Zhilin you just mentioned</b> <b>he might say he doesn't want vision to contaminate language</b> <b>ah</b> <b>they'll still do multimodal</b> <b>but they want to think about</b> <b>how to make multimodal a smart multimodal</b> <b>right</b> <b>without lowering the overall intelligence level of the brain</b> <b>yes yes yes</b> <b>hey, about this thing</b> <b>but I want to say again</b> <b>this thing</b> <b>it really depends on how you define the problem</b> <b>but let me finish the earlier point first</b> <b>meaning</b>

<b>um this</b> <b>you say</b> <b>for example, the position of language in this</b> <b>right</b> <b>I think we also have our own worries</b> <b>meaning language is actually a poison</b> <b>or language is actually an opiate</b> <b>you add more language</b> <b>you'll always feel happier</b> <b>oh mm-hmm</b> <b>that shows it's useful</b> <b>this crutch</b> <b>it's useful</b> <b>but it's a shortcut</b> <b>if you as a person</b>

<b>if you keep taking this opiate</b> <b>you'll be ruined</b> <b>if it's a crutch</b> <b>and you keep using it</b> <b>you also can't train</b> <b>your leg muscles</b> <b>mm-hmm</b> <b>alright alright</b> <b>this is yours and Zhilin's</b> <b>two perspectives</b> <b>yes, so I'm very worried about language</b> <b>contaminating vision</b> <b>mm-hmm</b> <b>I'm extremely worried about this</b> <b>and moreover</b> <b>this contamination is already happening</b> <b>this</b> <b>the state of this contamination is as follows</b> <b>the state of this contamination happening is</b>

<b>the entire Large Language Model</b> <b>has a huge value chain</b> <b>that transmits step by step from industry to academia</b> <b>this value chain means</b> <b>we have a narrative at the top</b> <b>this narrative is whatever AGI, Scaling Law</b> <b>The Bitter Lesson, LLM</b> <b>the logic of these narratives</b> <b>the current bible</b> <b>yes um</b> <b>let me tell you about The Bitter Lesson</b>

<b>because I absolutely don't think</b> <b>the Large Language Model is</b> <b>a demonstration of</b> <b>The Bitter Lesson</b> <b>mm-hmm</b> <b>um</b> <b>the Large Language Model is actually anti-Bitter Lesson</b> <b>ultimately what representations will be general enough</b> <b>what is its endpoint</b> <b>ah, the endpoint</b> <b>we can call it the world model</b> <b>so maybe we can discuss</b> <b>in my definition</b>

<b>or in the context of this representation</b> <b>what exactly does world model mean</b> <b>what is a world model</b> <b>right</b> <b>this is about to enter your entrepreneurship topic</b> <b>let's first</b> <b>from multimodal to world model</b> <b>mm-hmm right</b> <b>mm-hmm, that's right</b> <b>in strict definitional terms</b> <b>a world model means</b> <b>you're now given a system</b> <b>or the state of an environment</b> <b>um</b>

<b>um</b> <b>this environmental state</b> <b>might be, for example, um</b> <b>you can think of it</b> <b>as the state at the current moment</b> <b>but a world model</b> <b>doesn't necessarily</b> <b>just make temporal predictions</b> <b>but let's not worry about that for now</b> <b>anyway, you first have a system or an environment</b> <b>you have a state s_t</b> <b>right</b> <b>and you have an intervention or action</b> <b>let's call it a_t</b> <b>at the current moment</b>

<b>you apply an action to this system</b> <b>you now hope to learn a predictive function</b> <b>or transition function F</b> <b>so that it can take your action</b> <b>together with your current state</b> <b>this environmental state</b> <b>to predict the next state</b> <b>right, the state at the next moment</b> <b>so this is the most basic general kind of</b> <b>definition of a world model</b>
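The transition function F(s_t, a_t) -> s_{t+1} just described can be written as a toy sketch. The point-mass dynamics below are an illustrative assumption; in an actual world model this hand-written rule would be replaced by a learned predictive network.

```python
def world_model(state, action):
    """Toy transition F(s_t, a_t) -> s_{t+1}: state is (position, velocity)
    of a 1-D point mass, action is an acceleration applied this step."""
    position, velocity = state
    velocity = velocity + action        # the action changes the velocity
    position = position + velocity      # the velocity changes the position
    return (position, velocity)

s1 = world_model((0.0, 0.0), 1.0)       # one unit push from rest -> (1.0, 1.0)
```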

<b>and this definition itself is actually incredibly straightforward</b> <b>or even somewhat trivial</b> <b>because this isn't a new concept</b> <b>because actually back in 1943</b> <b>there was</b> <b>this psychologist</b> <b>called</b> <b>Kenneth Craik, a Scottish philosopher and psychologist</b> <b>mm-hmm</b> <b>who first proposed this concept</b> <b>he said humans have in their minds</b> <b>such a world model</b> <b>this world model can tell us</b>

<b>when we take some action</b> <b>what consequences will follow</b> <b>mm-hmm</b> <b>because we can predict our actions</b> <b>the consequences our actions bring</b> <b>so this can guide us</b> <b>in what kind of action to take</b> <b>and what kind of decision to make</b> <b>if I know that putting my hand in a fire</b> <b>will hurt, then I won't</b>

<b>put my hand in the fire</b> <b>this thing</b> <b>this kind of prediction structure</b> <b>is also from the past</b> <b>including control theory</b> <b>in the 1960s and 70s</b> <b>how everyone would put</b> <b>a lunar probe to the moon</b> <b>or send it to</b> <b>wherever</b> <b>right</b> <b>and then</b>

<b>everyone actually needs to be based on such a control system</b> <b>for example a classic algorithm</b> <b>called Model Predictive Control</b> <b>this also involves a Model</b> <b>but this Model is actually also a kind of World Model</b> <b>this algorithm is actually very very simple</b> <b>meaning you now need to decide</b> <b>what control signal exactly I should apply</b> <b>to this system</b> <b>to enable it to complete</b> <b>a predetermined task</b> <b>mm-hmm right</b> <b>and what I need to do is</b>

<b>at the current moment</b> <b>roll out through my model</b> <b>to continuously output the next</b> <b>k steps of actions</b> <b>an action sequence</b> <b>meaning I need to output</b> <b>my next action sequence</b> <b>a sequence of actions</b> <b>and through this action sequence</b> <b>use my Model to get the next step</b>

<b>or the state at each step</b> <b>and finally I'll also have a, um</b> <b>some kind of cost function</b> <b>a metric function</b> <b>which tells me</b> <b>after I execute this action sequence</b> <b>how far I am from my ultimate goal</b> <b>how far the distance is</b> <b>so this algorithm is very simple</b> <b>you continuously sample your action sequence</b> <b>then jump back to the first step</b> <b>and find</b>

<b>the action sequence with the lowest cost</b> <b>execute its first step</b> <b>then repeatedly iterate to do this action</b> <b>and roll out the next action sequence</b> <b>yes, so each time you need to make a decision</b> <b>and the source of this decision</b> <b>is based on your prediction of the future</b> <b>mm-hmm</b> <b>yes, this is the so-called Model Predictive Control</b> <b>how people use this World Model</b> <b>and then later</b> <b>for example in Model-Based Reinforcement Learning</b>
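The Model Predictive Control loop just described (generate candidate action sequences, roll each out through the world model, score with a cost function, execute only the first action of the cheapest sequence, then replan) can be sketched with a toy 1-D world model. All names here are illustrative assumptions, and exhaustive enumeration stands in for the sampling a real controller would use:

```python
import itertools

def step(state, action):
    """Toy world model F(s, a) -> s': a 1-D point that moves by the action."""
    return state + action

def mpc_action(state, goal, horizon=3, actions=(-1, 0, 1)):
    """One iteration of Model Predictive Control: try every action
    sequence, roll it out through the model, and return only the
    FIRST action of the lowest-cost sequence."""
    best_cost, best_first = float("inf"), 0
    for seq in itertools.product(actions, repeat=horizon):
        s, cost = state, 0.0
        for a in seq:
            s = step(s, a)              # predict the next state with the model
            cost += abs(s - goal)       # distance-to-goal cost at each step
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

s, goal = 0, 3
trajectory = [s]
for _ in range(5):
    s = step(s, mpc_action(s, goal))    # execute the first action, then replan
    trajectory.append(s)
# trajectory -> [0, 1, 2, 3, 3, 3]
```

Replanning at every step is what makes this "predictive" control: each decision comes from rolling the model forward, exactly as the transcript describes.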

<b>in Reinforcement Learning</b> <b>people also realized</b> <b>that a World Model is actually very important</b> <b>alright</b> <b>there's a classic paper here</b> <b>called Dyna</b> <b>this paper is actually by Richard S. Sutton, the father of reinforcement learning</b> <b>oh</b> <b>yes, so Richard Sutton himself wrote such a paper</b> <b>and he talked about</b> <b>ah</b> <b>a very interesting viewpoint</b> <b>or a framing</b> <b>he says the human intelligence system</b>

<b>can perhaps be divided into two types</b> <b>one called a reactive policy</b> <b>and one possibly called</b> <b>a more intelligent model-based policy</b> <b>right</b> <b>this thing</b> <b>actually um</b> <b>this analogy is</b> <b>the so-called System 1 and System 2 analogy</b> <b>right, which is human cognition</b> <b>also has so-called thinking fast and slow</b> <b>for very difficult problems</b> <b>we may need more mental cycles</b> <b>to study these problems</b> <b>mm-hmm</b>

<b>but for some problems</b> <b>for example when we drive, right</b> <b>when we first learned to drive we were very nervous</b> <b>looking left and right</b> <b>needing to make many decisions</b> <b>but when you truly learned to drive</b> <b>you internalize these decisions</b> <b>as part of your own muscle memory</b> <b>it becomes a reactive</b> <b>policy right</b> <b>so Richard Sutton in the Dyna paper</b> <b>said something very interesting</b> <b>he said, um</b> <b>what is Reinforcement Learning</b>

<b>Reinforcement Learning is a very primitive</b> <b>a very basic</b> <b>model-free</b> <b>without this world model</b> <b>a learning algorithm</b> <b>ah</b> <b>so Richard Sutton himself was somewhat anti-pure</b> <b>Reinforcement Learning</b> <b>at least at that time</b> <b>in his paper</b> <b>he talks about a better system</b> <b>which of course is</b> <b>if you have a strong enough</b> <b>world model</b>

<b>you can based on the current state</b> <b>predict the next state</b> <b>right, and then</b> <b>you'd have this so-called</b> <b>planning capability</b> <b>which is planning</b> <b>the so-called ability to make plans</b> <b>mm-hmm</b> <b>and then</b> <b>planning and reasoning are in some sense</b> <b>also the same concept</b> <b>reasoning is now very hot in Large Language Models</b> <b>but in fact, um</b> <b>this kind of planning we need</b> <b>and also</b> <b>the significance of planning for decision making</b>

<b>was actually discussed very early on in Control Theory</b> <b>and Reinforcement Learning where everyone was discussing it</b> <b>so I think this is the history of World Models</b> <b>so if we start from this angle</b> <b>the essence of a World Model is</b> <b>how to characterize a system and an environment</b> <b>such that you can make predictions in this system</b> <b>and this prediction can guide your</b> <b>your</b> <b>action sequence</b>

<b>and your own decision-making</b> <b>large language models predict the next word</b> <b>this one, given an action</b> <b>predicts the next state</b> <b>right</b> <b>how to understand state</b> <b>state is</b> <b>the minimum information needed to describe</b> <b>a system in full</b>

<b>in that way</b> <b>a source of information, you could say</b> <b>you can think of it that way</b> <b>meaning a state</b> <b>means, for example</b> <b>this thing</b> <b>also involves a very interesting thing</b> <b>very interesting</b> <b>another thing</b> <b>we need to discuss</b> <b>namely what exactly is the relationship between this and representation</b> <b>mm-hmm right</b> <b>um, why do we say</b> <b>it's the minimum information characterization unit</b> <b>it's because suppose right now</b>

<b>our current physical world</b> <b>right</b> <b>let me say Earth</b> <b>ah, or let me not go that far</b> <b>let's first talk about this room of ours</b> <b>right</b> <b>this is also an environment</b> <b>right</b> <b>so what is the state that characterizes this environment</b> <b>right, this state</b> <b>if you don't pursue this so-called minimum information</b> <b>or minimal descriptions</b> <b>then it can be</b> <b>for example, we now reconstruct this entire space</b> <b>entirely</b> <b>right</b>

<b>and we precisely characterize</b> <b>all the parameters in this system</b> <b>including the texture of this table</b> <b>including our sound waves</b> <b>including</b> <b>we</b> <b>the mass of this table</b> <b>this microphone's</b> <b>various physical parameters</b> <b>mm-hmm alright</b> <b>but we won't characterize this system that way</b> <b>right</b> <b>because much of this information</b> <b>is not important for our decision-making</b> <b>right</b> <b>because</b>

<b>actually if we assume an intelligent agent now</b> <b>living here for the purpose of</b> <b>we're having a conversation</b> <b>mm-hmm</b> <b>then I only need</b> <b>to know some basic facts</b> <b>for example, my microphone can</b> <b>stay on this table</b> <b>and then</b> <b>I won't care about every point of lighting</b> <b>nor will I care about</b> <b>every detail of the texture on the table</b> <b>mm-hmm right</b> <b>these things are all unimportant</b>

<b>so this state</b> <b>can actually contain a lot of information</b> <b>or can contain enough information</b> <b>meaning sufficient information</b> <b>this thing</b> <b>it depends on what kind of task you need to solve</b> <b>so what is this thing</b> <b>which is how to</b> <b>build such a state</b> <b>this thing</b> <b>is actually directly connected to representation learning</b> <b>mm-hmm</b> <b>representation learning</b> <b>like I just said, right</b>

<b>we need to have a hierarchical representation</b> <b>this hierarchical representation</b> <b>the purpose is actually</b> <b>how we can gradually develop</b> <b>layer by layer, iterating up</b> <b>and becoming increasingly abstract</b> <b>increasingly meaningful for my decision making</b> <b>and increasingly valuable representation</b> <b>mm-hmm</b> <b>it won't be fine-grained to every point</b> <b>it doesn't need to be fine-grained to every point</b> <b>so how do you abstract</b> <b>mm-hmm</b>

<b>and we also can't be fine-grained to every point</b> <b>it just can't be done</b> <b>right</b> <b>because this is very obvious</b> <b>right</b> <b>for example, say we're building an airplane</b> <b>and for example we want to model</b> <b>the dynamic system of this airplane</b> <b>right, I want to know how to make it</b> <b>more energy-efficient and fuel-efficient</b> <b>ah</b> <b>we can of course</b> <b>start from the lowest level</b>

<b>we can say</b> <b>per cubic centimeter there might be</b> <b>on the order of 10 to the nineteenth molecules</b> <b>and we model every molecular collision</b> <b>right</b> <b>and then</b> <b>through this approach</b> <b>characterize our system</b> <b>but this of course won't work</b> <b>it's a totally stupid way</b> <b>right, what we do instead</b> <b>is</b> <b>how we can statistically</b>

<b>study this problem</b> <b>so that's why there's fluid dynamics</b> <b>and then there would be this</b> <b>Navier-Stokes equation</b> <b>and a series of such settings</b> <b>right, everything becomes increasingly abstract</b> <b>and then</b> <b>but the world we're able to characterize</b> <b>becomes broader and broader</b> <b>mm-hmm</b> <b>actually language is in some sense abstraction</b> <b>language is some kind of abstraction</b> <b>but it's a</b>

<b>proven abstraction</b> <b>it's highly condensed</b> <b>meaning it's an existing abstraction</b> <b>so</b> <b>what you want to build now is a new abstraction</b> <b>beyond language</b> <b>yes</b> <b>it's somewhat</b> <b>it must be a latent representation</b> <b>mm-hmm</b> <b>and this thing</b>

<b>people can understand indirectly</b> <b>what kind of representation you've learned</b> <b>or which representations</b> <b>which representations are meaningful</b> <b>all of this is fine</b> <b>it's not a complete black box</b> <b>but it's not constrained by the syntax of language</b> <b>and logic like that</b> <b>this is why I say LLMs are far from embodying The Bitter Lesson</b> <b>The Bitter Lesson says</b>

<b>you should minimize human knowledge as much as possible</b> <b>right</b> <b>put away your so-called</b> <b>human arrogance</b> <b>this hubris</b> <b>and these so-called</b> <b>clever hand-designed structures</b> <b>minimize them as much as possible</b> <b>and instead do as much as possible</b>

<b>using search and learning to find answers</b> <b>right, but you can imagine</b> <b>if what we're discussing now is how to</b> <b>characterize this world</b> <b>ah</b> <b>language is exactly such a structure</b> <b>language is an extremely clever product of humans</b> <b>mm-hmm</b> <b>it has intricate design</b> <b>it's not a question of more or less design</b> <b>it is design through and through</b> <b>right mm-hmm</b>

<b>so</b> <b>I think language</b> <b>has its own very strong points</b> <b>and it will definitely, in future</b> <b>intelligent systems</b> <b>occupy a very, very important position</b> <b>but it can do CoT (chain of thought)</b> <b>mm-hmm</b> <b>but CoT is another matter</b> <b>CoT is also</b> <b>um, how should I put it</b> <b>a product of this stage</b> <b>right</b>

<b>oh, CoT is also a stage-specific product</b> <b>everything about LLMs</b> <b>is a fairly stage-specific product</b> <b>oh</b> <b>that's also why LLMs</b> <b>I also quite agree with Yann</b> <b>meaning LLMs</b> <b>are actually not controllable</b> <b>not safe either</b> <b>because they don't have a true world model</b> <b>we even use LLMs as world models</b> <b>but it's fundamentally flawed</b> <b>it's a flawed world model</b> <b>right</b> <b>and um</b>

<b>what this means is</b> <b>actually meaning</b> <b>all current controllability or safety</b> <b>how does an LLM do this</b> <b>it's entirely designed through fine-tuning</b> <b>to achieve it</b> <b>you need to feed it a lot of data</b> <b>to let it know what should be done</b> <b>what shouldn't be done</b> <b>or what it can't do</b> <b>what can be said</b> <b>what can't be said</b> <b>right</b> <b>what kind of speech might bring danger</b>

<b>what kind of speech</b> <b>might be more friendly</b> <b>so this is called alignment</b> <b>but all of this is based on some kind of</b> <b>post-training or some kind of</b> <b>fine-tuning alignment</b> <b>mm-hmm</b> <b>yes, but with a true world model</b> <b>you actually don't need to do this</b> <b>because you can predict</b> <b>what consequence your action will lead to</b> <b>what results your behavior will bring</b> <b>and you can then during the inference</b>

<b>process</b> <b>try to avoid such behavior</b> <b>mm-hmm</b> <b>you can add some external constraints</b> <b>to tell it</b> <b>you really can't do this</b> <b>for example</b> <b>I have a robot holding a knife cutting vegetables</b> <b>right</b> <b>and how do I ensure now</b> <b>that this robot holding the knife</b> <b>won't turn backward</b> <b>and slash you</b> <b>how do you guarantee this</b> <b>from the perspective of a Language Model</b>

<b>you</b> <b>the way you can achieve this is by feeding</b> <b>it a lot of data</b> <b>mm-hmm</b> <b>right, it needs to have seen these things</b> <b>but that isn't a world model, right</b> <b>a world model</b> <b>doesn't necessarily need that</b> <b>because you're able to foresee this outcome</b> <b>meaning I'm able to</b> <b>take an action</b> <b>and I can understand</b> <b>if this knife turns around now</b> <b>and creates a certain danger, what the result would be</b>
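The inference-time safety mechanism described here (predict each candidate action's consequence with the world model, and reject unsafe outcomes before acting) can be sketched as a toy model-predictive-control step. The one-dimensional knife-angle dynamics, the 90-degree safety threshold, and all function names are invented for illustration.

```python
# Toy world model: the state is the knife's angle in degrees
# (0 = toward the vegetables, 180 = toward the person behind you).
def predict(state, action):
    """Stand-in for a learned world model: predict the next state
    that this action would lead to."""
    return state + action

def is_safe(state):
    """External constraint added at inference time: the knife
    must never point backward (beyond 90 degrees)."""
    return state <= 90

def plan(state, candidate_actions, goal=0):
    """One MPC-style step: roll each candidate action through the model,
    discard actions whose predicted outcome violates the constraint,
    then pick the safe action whose outcome is closest to the goal."""
    outcomes = [(a, predict(state, a)) for a in candidate_actions]
    safe = [(a, s) for a, s in outcomes if is_safe(s)]
    best_action, _ = min(safe, key=lambda pair: abs(pair[1] - goal))
    return best_action

# From 60 degrees, a +40 swing would point the knife backward (100 > 90),
# so it is rejected without any fine-tuning data ever saying so.
print(plan(60, candidate_actions=[-30, +10, +40]))  # → -30
```

The constraint lives outside the model and is checked against predicted futures, which is the contrast being drawn with alignment-by-fine-tuning.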

<b>how do you let it know</b> <b>um, that's part of your training</b> <b>about the world model</b> <b>it seems the definition hasn't converged yet</b> <b>for example, the world model you define</b> <b>and the world model Fei-Fei Li's team defines</b> <b>what is the difference</b> <b>ah right</b> <b>so what I just elaborated on</b> <b>all falls under the world model in our definition</b> <b>but I think the problem we're encountering now is</b> <b>that this world model is hard to define</b> <b>the reason</b>

<b>is actually that it's not a technical approach</b> <b>it's not an algorithm</b> <b>it's a goal</b> <b>mm-hmm</b> <b>meaning all of us</b> <b>whether you're working on LLMs</b> <b>or Video Diffusion Models</b> <b>or Gaussian Splatting</b> <b>all of us</b> <b>are on the path toward the world model</b> <b>so</b> <b>I say</b> <b>sometimes these competitions</b> <b>or these arguments</b>

<b>I think before long</b> <b>maybe in 1 to 2 years</b> <b>will all seem extremely ridiculous</b> <b>because</b> <b>because we're actually all developing toward this path</b> <b>and everyone knows</b> <b>this should</b> <b>lead to</b> <b>should</b> <b>be the right path</b> <b>it's just that</b> <b>everyone is thinking about this problem from different directions</b> <b>for example</b> <b>in our definition</b> <b>or let me first talk about other people's definitions</b>

<b>for example</b> <b>for a Video Diffusion Model company</b> <b>for example like</b> <b>like Sora</b> <b>like Bytedance's models</b> <b>like Genie (developed by Google DeepMind)</b> <b>right, and then</b> <b>all these models</b> <b>including Runway</b> <b>Luma</b> <b>every company making generative models</b> <b>is doing this</b> <b>all positioning themselves as World Model companies</b> <b>but they're actually still mainly focused on</b>

<b>building a world model simulator</b> <b>a world simulator</b> <b>the so-called world simulator</b> <b>mm-hmm</b> <b>their goal is still</b> <b>to render visually compelling videos</b> <b>with some kind of consistency</b> <b>able to have sufficiently long content</b> <b>and so on, and you can apply controls to it</b> <b>mm-hmm, you can choose</b> <b>like Genie</b> <b>right</b> <b>take two steps forward</b> <b>take two steps backward</b> <b>you need to ensure you have some memory</b>

<b>or whatever</b> <b>this</b> <b>is the kind of problem their</b> <b>world simulator</b> <b>or this generative world simulator</b> <b>wants to solve</b> <b>and um</b> <b>Professor Fei-Fei's side</b> <b>at World Labs</b> <b>I think it's more like a frontend</b> <b>an interface for assets</b> <b>this is also very important</b> <b>because it's a strong 3D representation</b> <b>so</b>

<b>By the way</b> <b>also congratulations</b> <b>didn't they just successfully raise funding</b> <b>if you look at</b> <b>their lead investors</b> <b>the people they're in discussions with</b> <b>for example I saw in the news</b> <b>Autodesk invested $200 million in them</b> <b>mm-hmm</b> <b>so</b> <b>what kind of company is Autodesk</b> <b>Autodesk is a company doing 3D modeling, visualization and CAD</b> <b>that kind of design company</b> <b>right</b> <b>so in this scenario</b>

<b>you need a very, very concrete 3D</b> <b>you can</b> <b>also call it a representation</b> <b>it's also some kind of representation</b> <b>but it means this thing</b> <b>is not an abstract concept</b> <b>right, it's not hidden in your parameters</b> <b>it needs to have an explicit 3D</b> <b>form there</b> <b>that way</b> <b>you can then in this space</b> <b>master some kind of spatial intelligence</b> <b>you can then explore in this space</b>

<b>and you can be one hundred percent certain</b> <b>you won't make mistakes</b> <b>for a World Simulator</b> <b>a Generative World Simulator</b> <b>this</b> <b>is not necessarily true</b> <b>right, although you can through longer context</b> <b>have better memory</b> <b>it cannot be guaranteed</b> <b>mm-hmm</b> <b>and what we want to do</b> <b>is actually more like</b> <b>building a predictive brain</b> <b>yes meaning</b> <b>we</b>

<b>the core of how we view this problem</b> <b>is still about how to enhance</b> <b>intelligence itself</b> <b>yes, so that means</b> <b>you think LLMs are not intelligent enough</b> <b>I think, again</b> <b>an LLM is a crucial</b> <b>part of this intelligence system</b> <b>it's a module</b> <b>but it's not everything</b> <b>right</b> <b>let me give another example</b> <b>of why LLMs, when doing world modeling</b> <b>are fundamentally</b> <b>flawed</b> <b>for example</b>

<b>let's go back to this vision question</b> <b>right, we're now sitting here</b> <b>mm-hmm</b> <b>if we turn our head slightly</b> <b>say 5 or 10 degrees</b> <b>that generates hundreds of frames</b> <b>the frequency is actually very, very high</b> <b>human vision can perceive</b> <b>fluctuations up to around 100 Hz</b> <b>which is extremely impressive</b> <b>right</b> <b>if you process this problem the way an LLM does</b> <b>what would happen</b>

<b>mm-hmm</b> <b>at least processing it the current way</b> <b>what would happen is</b> <b>I would need to tokenize every frame</b> <b>we flatten it</b> <b>stringing it into a very very long sequence</b> <b>every frame</b> <b>I can do some downsampling</b> <b>or whatever, doesn't matter</b> <b>and then we string them together</b> <b>right, say I have 256 tokens per frame</b> <b>now you might have 32 frames or 128 frames</b> <b>stringing them together</b>

<b>then you'd have 256 times 128 tokens</b> <b>then you put them into a Large Language Model</b> <b>and align it with language</b> <b>and finally answer a question</b> <b>but does this make sense</b> <b>it makes no sense at all</b> <b>mm-hmm</b> <b>because you're actually taking this kind of world</b> <b>representation</b> <b>mm-hmm</b> <b>behind it</b> <b>there's actually some kind of global state</b> <b>right</b>
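The arithmetic of this naive frame-tokenization scheme works out as below; the 256 tokens per frame and 128 frames are the figures used in the example, while the 100 Hz extrapolation is a rough illustration of the speaker's perceptual-rate estimate.

```python
tokens_per_frame = 256   # after downsampling, as in the example above
num_frames = 128

# Flattening every frame and concatenating gives the sequence length:
sequence_length = tokens_per_frame * num_frames
print(sequence_length)   # → 32768 tokens for a short clip

# At a perceptual rate of roughly 100 Hz, one minute of vision would
# naively become a far longer sequence:
print(tokens_per_frame * 100 * 60)  # → 1536000 tokens
```

Even a short clip already exceeds typical language-model context budgets, which is the redundancy problem the speaker is pointing at.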

<b>you serialize it</b> <b>into a very very redundant token</b> <b>mm-hmm</b> <b>and Transformer</b> <b>people say it doesn't have much</b> <b>inductive bias</b> <b>it actually still has some inductive bias</b> <b>its inductive bias is</b> <b>it has to pay equal attention to every single token</b> <b>oh</b> <b>well, that itself is unreasonable</b> <b>right</b> <b>what this represents is</b> <b>the modeling technique of language models</b>

<b>cannot handle the cognition of these continuous</b> <b>spatial signals</b> <b>it just doesn't hold up</b> <b>so</b> <b>this is why</b> <b>for us,</b> <b>when it comes to the world model we're building,</b> <b>I think</b> <b>it needs to have the following characteristics</b> <b>right, it needs to</b> <b>um,</b> <b>be able to understand the physical world</b> <b>and the definition here</b> <b>is that it must be the physical world</b>

<b>although world model applications will also extend to</b> <b>things like</b> <b>digital agents</b> <b>like a gaming agent</b> <b>which will of course also benefit from the World Model</b> <b>but</b> <b>I think its primary task</b> <b>is to solve the problem of physical world understanding</b> <b>and it needs to have a sufficiently large associative memory</b> <b>memory is also a very very important</b> <b>component of a World Model-based</b> <b>system as a whole</b>

<b>mm-hmm</b> <b>and it needs to be able to reason</b> <b>able to plan</b> <b>mm-hmm</b> <b>we just talked about planning</b> <b>able to do this kind of counterfactual reasoning</b> <b>or this kind of causal inference</b> <b>also very very important</b> <b>and the last point</b> <b>is that it needs to be sufficiently controllable and safe</b> <b>it needs to be a safe system</b> <b>right, I think for all these things</b> <b>I'm actually borrowing Yann's</b> <b>talking points</b> <b>but I think</b>

<b>these points are actually very very insightful</b> <b>right, not too many, not too few</b> <b>mm-hmm</b> <b>it and large language models</b> <b>are not in a derivative relationship</b> <b>they're in a replacement relationship</b> <b>uh</b> <b>I think</b> <b>it's not exactly a replacement relationship either</b> <b>uh</b> <b>why did I just say that everyone in the field</b> <b>is moving toward world models</b> <b>moving forward?</b>

<b>the reason is</b> <b>large language models also want to evolve toward world models</b> <b>actually that's not quite what I mean</b> <b>what I mean is before large language models existed</b> <b>we couldn't really talk about world models at all</b> <b>if you have a purely RL-based system</b> <b>you're purely overfitting</b> <b>to the current environment</b> <b>Large Language Models</b> <b>gave you a certain degree of</b> <b>cognitive ability about the real world</b> <b>it forms one element</b> <b>mm-hmm, it forms one element</b>

<b>but this thing</b> <b>as I said, is fundamentally flawed</b> <b>because its cognition is too indirect</b> <b>yeah</b> <b>what language can give you is really just too little</b> <b>mm-hmm right</b> <b>and language has other problems too</b> <b>namely it is a</b> <b>fundamentally a communication tool</b> <b>so when we use language</b> <b>unless you're saying something like</b> <b>in a dream state</b> <b>like talking in your sleep</b>

<b>most of the time</b> <b>you use language with an intention</b> <b>you want to convey a purpose</b> <b>so LLMs are more like</b> <b>in my view, more like an extension of a search engine</b> <b>right?</b> <b>or a chatbot is more like an extension of a search engine</b>

<b>we always bring the purpose in our mind</b> <b>to ask a question</b> <b>and expect an answer</b> <b>right?</b>

<b>but this is not what</b> <b>a World Model is</b> <b>in essence</b> <b>as I just said</b> <b>the World Model in our brain</b> <b>is doing a lot of work</b> <b>in the background</b> <b>there are even some</b> <b>counterintuitive findings in psychology</b> <b>that say</b> <b>your brain has already made the decision for you</b> <b>before you decide to</b>

<b>say there are three buttons on my desk</b> <b>before I know which button I want to press</b> <b>I can already detect</b> <b>that my brain</b> <b>has already made that decision for me</b> <b>this experiment</b> <b>is called the Libet experiment</b> <b>it's a controversial experiment</b> <b>but what it demonstrates is</b> <b>many things are already happening</b> <b>in the background in your brain</b> <b>this is part of your world model</b> <b>a Language Model is not like that</b>

<b>language is just a communication tool</b> <b>you always come with a purpose</b> <b>throw out a question</b> <b>and want to get an answer</b> <b>it's also a reasoning tool</b> <b>right</b> <b>it's also a reasoning tool</b> <b>of course, but only a symbolic-level reasoning tool</b> <b>so you want to build</b> <b>a world model like the human brain</b> <b>I think we need to look more and more at people</b> <b>mm-hmm, actually not just people</b> <b>all kinds of animals</b>

<b>how their intelligence actually arises</b> <b>mm-hmm right</b> <b>let me first wrap up</b> <b>what I just said</b> <b>which is</b> <b>why is everyone step by step</b> <b>converging on this World Model?</b>

<b>the reason is language models</b> <b>have already shown a bit of</b> <b>World Model-like behavior</b> <b>even though it has no actions</b> <b>it has no real understanding of the physical world</b> <b>and it can't truly reason and plan</b> <b>because its planning through CoT</b> <b>and its reasoning through CoT</b> <b>is still very different</b> <b>from what I just described</b> <b>like MPC-level</b> <b>planning</b> <b>CoT also brings its own set of problems</b> <b>but all that's fine</b> <b>but the next step</b> <b>you'll see</b>

<b>for example everyone's doing</b> <b>whether DiT or</b> <b>whatever model</b> <b>but people started doing generative models</b> <b>and that has made things somewhat different</b> <b>right?</b>

<b>mm-hmm, and that's why many people</b> <b>who do video generation call it a world model</b> <b>I think that's understandable</b> <b>although</b> <b>I don't agree that the video generation</b> <b>model they're doing</b> <b>is the final end game world model</b> <b>but it has indeed pushed one step beyond language models</b> <b>right</b> <b>how does it do that?</b>

<b>on top of language models</b> <b>uh</b> <b>I think all these systems now</b> <b>actually still rely on language models</b> <b>right?</b>

<b>they still use language models to do prompt</b> <b>rewriting and then to help</b> <b>serve as conditioning</b> <b>fed into the video generation model</b> <b>and language models have actually become</b> <b>you know</b> <b>the historical progression here is quite interesting</b> <b>language models used to be the main thing</b> <b>now language models have become</b> <b>a preparatory step for video generation models</b> <b>a scaffolding</b>

<b>in the old language models</b> <b>what you modeled was P(y)</b> <b>right?</b> <b>and that y is still in some semantic space</b>

<b>information in some kind of label space</b> <b>mm-hmm, but now with video generation models</b> <b>what you model is the probability P(x|y)</b> <b>what this means is</b> <b>what you're modeling now is already x</b> <b>x is the data itself</b> <b>your y has become</b> <b>a condition, and this is already very different</b> <b>okay</b> <b>why is it so different?</b>

<b>it's because when you have a low dimensional y</b> <b>space</b> <b>and then you</b> <b>go to model such a distribution</b> <b>your probability density</b> <b>only competes within your y's distribution</b> <b>meaning</b> <b>the likelihood you assign</b> <b>I'm getting a bit too technical here</b> <b>but anyway</b> <b>or let's not talk about language models first</b> <b>let's first talk about</b> <b>say</b> <b>a model that classifies 1000 categories</b>

<b>you can think of</b> <b>these few labels as a precursor to language</b> <b>it's also a low-dimensional vocabulary</b> <b>right?</b>

<b>and then</b> <b>if you're doing a classification problem like this</b> <b>all the decisions you need to make are</b> <b>if this thing is a cat</b> <b>it can't be a dog</b> <b>right?</b>

<b>this thing is constrained by my label set</b> <b>mm-hmm</b> <b>but when you start modeling P(x|y)</b> <b>when you're doing a generative model</b> <b>the likelihood you assign in this case says</b> <b>what phenomena actually exist in the world</b> <b>which things are more likely to exist</b> <b>that becomes very very different</b> <b>right? because what you need to learn now</b>

<b>the amount of information it contains</b> <b>is far greater than what you get from modeling P(y)</b> <b>you need to understand why in this world</b> <b>a four-legged cat</b> <b>is more common than a three-legged cat</b> <b>right?</b>

<b>why if I'm generating a video</b> <b>say I have, I don't know</b> <b>a running video</b> <b>why would I have</b> <b>a smooth running state</b> <b>rather than suddenly hallucinating three legs</b> <b>four legs</b> <b>which is more believable</b> <b>more probable, right?</b>

<b>in probability space</b> <b>more probable</b> <b>this already carries enormous amounts of information</b> <b>what you need to model</b> <b>far exceeds what you need to capture in language space</b> <b>or in label space</b> <b>right?</b>
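One rough way to quantify the gap between label-space and pixel-space modeling is to count output bits; the 1000-class label set comes from the example above, while the 256x256 RGB frame size is an arbitrary illustrative choice, not a figure from the conversation.

```python
import math

# Modeling P(y): the output lives in a 1000-class label set,
# so specifying one outcome takes about log2(1000) bits.
label_bits = math.log2(1000)
print(round(label_bits, 2))  # → 9.97

# Modeling P(x|y): the output is the data itself, e.g. one 256x256 RGB
# frame at 8 bits per channel per pixel.
pixel_bits = 256 * 256 * 3 * 8
print(pixel_bits)            # → 1572864

# The generative model must spread likelihood over a vastly larger
# output space than the classifier ever sees:
print(f"~{pixel_bits / label_bits:,.0f}x more raw output bits per decision")
```

The raw-bit count overstates the *useful* information (pixels are redundant, as the next passage argues), but it shows why assigning likelihood over data rather than labels forces the model to learn far more about the world.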

<b>you already need some understanding of the world</b> <b>so this is already more</b> <b>in line with the Bitter Lesson in my view</b> <b>meaning</b> <b>you've abandoned more of the</b> <b>cognition in language space</b> <b>and its logic</b> <b>and its syntactic structure</b> <b>and started modeling pixels</b> <b>started modeling</b> <b>the pixels themselves</b> <b>but taking it one step further</b>

<b>pixels themselves might also be wrong</b> <b>pixels themselves are also not Bitter Lesson enough</b> <b>mm-hmm</b> <b>what are pixels</b> <b>pixels are a human-defined</b> <b>regular grid</b> <b>just a grid of little boxes</b> <b>each little box might have</b> <b>8 bits of information</b> <b>and you might have this kind of lattice</b> <b>like a cell by cell by cell arrangement</b> <b>this is a pixel</b>

<b>this is each frame of the image we see</b> <b>right?</b>

<b>this is also an interface</b> <b>mm-hmm</b> <b>this is also made for humans to see</b> <b>right?</b>

<b>that's why world simulators</b> <b>why do people think Genie</b> <b>is so cool</b> <b>because we create a video</b> <b>we create a game</b> <b>this is for humans to see</b> <b>but taking it one step further</b> <b>the real Bitter Lesson says</b> <b>I don't need to make it for humans to see</b> <b>why do I need to make it for humans?</b>

<b>right?</b>

<b>who is it for?</b>

<b>it's for your system to see</b> <b>it's for your world model to see</b> <b>mm-hmm</b> <b>it depends on what you ultimately want</b> <b>it can be for humans to see</b> <b>but being for humans to see</b> <b>is not the core of a World Model</b> <b>it's the interface of the World Model</b> <b>the World Model itself</b> <b>is spontaneously</b> <b>learning better representations</b> <b>making better predictions</b> <b>right?</b>

<b>but this thing itself</b> <b>whether or not you want to generate a cool video</b> <b>is actually irrelevant</b> <b>and whether or not you can answer</b> <b>some questions about your input space</b> <b>is also actually irrelevant</b> <b>so again</b> <b>let me repeat what I was just trying to say</b> <b>each of us</b> <b>is moving forward on the road toward world models</b> <b>the world model is a goal</b> <b>not a specific path</b> <b>uh, not a specific algorithm</b>

<b>or a specific technical roadmap</b> <b>and someday</b> <b>we will have a better world model</b> <b>mm-hmm</b> <b>language models will, on top of that</b> <b>also get stronger</b> <b>we'll have better multimodal models</b> <b>that can better understand the world</b> <b>and we'll have better video generation models</b> <b>mm-hmm</b> <b>and I think RAE is</b> <b>an early prototype in this process</b> <b>mm-hmm yeah</b> <b>so now there's also a very hot concept</b>

<b>the so-called Unified Model or Omni Model</b> <b>where people try to stack all the data</b> <b>together</b> <b>so that we can have one system</b> <b>that can do both understanding</b> <b>and generation</b> <b>what people also discuss is</b> <b>does understanding help generation</b> <b>or does generation help understanding</b> <b>mm-hmm</b> <b>I think neither really matters</b> <b>understanding and generation are one</b> <b>both need a real World Model</b>

<b>as their foundation</b> <b>right</b> <b>once you have that good World Model</b> <b>that can do some kind of prediction</b> <b>can do some kind of planning and reasoning</b> <b>the upper-layer decoding</b> <b>is actually very very simple</b> <b>so you think they're all built on top of</b> <b>the world model</b> <b>which is the base layer</b> <b>right</b> <b>you can think of it as</b> <b>what we want to do</b> <b>or what the representation school wants to do is</b> <b>the very bottom layer of the cake</b>

<b>this base</b> <b>what the representation school wants</b> <b>is to unify representations</b> <b>unify them with language</b> <b>ultimately condensing everything</b> <b>into a few abstract representations</b> <b>so you still need scaling, right?</b>

<b>you still need to</b> <b>besides language, what other scaling</b> <b>can we currently see?</b>

<b>language scaling</b> <b>we just touched on this</b> <b>language scaling itself</b> <b>I think is again</b> <b>something a bit hard to articulate clearly</b> <b>because we also know</b> <b>there's a theory</b> <b>which says compression is intelligence</b> <b>right?</b>

<b>compression equals intelligence</b> <b>yes, but what it's saying is</b> <b>your language model</b> <b>is actually a lossless compression process</b> <b>or rather, language models</b> <b>getting bigger and improving results</b> <b>is not because they're memorizing by rote</b> <b>having memorized all of this content</b> <b>it's simply a stronger model</b>

<b>so it can have a better compression ratio</b> <b>to compress all of your input information</b> <b>it brings some kind of generalization ability</b> <b>I think I agree with this view</b> <b>but I want to step back a bit</b> <b>I want to say</b> <b>actually because of the nature of the problems language models care about</b> <b>their Scaling Laws</b> <b>contain some padding</b> <b>and what I mean by padding is</b> <b>it doesn't actually need the smallest model</b>

<b>to answer questions by truly understanding the world</b> <b>it doesn't need that</b> <b>and all our benchmarks</b> <b>and the tasks humans use Large Language Models</b> <b>to achieve</b> <b>also require it to be able to retrieve</b> <b>right, to be able to</b> <b>retrieve factual knowledge</b> <b>if a model</b> <b>right, can't tell me</b>

<b>say a specific person's name on Wikipedia</b> <b>what they did in the past</b> <b>that's a very poor</b> <b>Large Language Model</b> <b>so</b> <b>so what I want to say is</b> <b>the Scaling Law of language models</b> <b>is based on a representation of knowledge</b> <b>that's the Scaling Law derived from that</b> <b>so that's why</b> <b>it may have a relatively balanced ratio</b> <b>meaning your number of tokens</b> <b>your data and your parameters</b>

<b>need to be roughly 1:1</b> <b>that's how it works</b> <b>one approach</b> <b>right?</b>
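Scaling laws are conventionally summarized as power laws, roughly L(C) = a * C^(-b), so "a different slope" later in this passage means a different exponent b on a log-log plot. Below is a minimal sketch of recovering that exponent from (compute, loss) points; all the numbers are synthetic, generated here purely so the fit can be checked.

```python
import math

# Synthetic (compute, loss) points generated from a known power law,
# so the fit below can be verified against the true exponent.
def loss(compute, a=100.0, b=0.05):
    return a * compute ** (-b)

def fit_exponent(points):
    """Least-squares slope in log-log space: log L = log a - b * log C."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(l) for _, l in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den

points = [(c, loss(c, b=0.05)) for c in (1e18, 1e19, 1e20, 1e21)]
print(round(fit_exponent(points), 3))  # → 0.05
```

A visual world model with a "completely different slope" would simply yield a different b from the same kind of fit; the claim is about the exponent and the data-to-parameter ratio, not the existence of a power law.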

<b>then scale up</b> <b>world models, especially visual intelligence-based</b> <b>world models</b> <b>I think</b> <b>will have a very very different Scaling Law</b> <b>it will have a Scaling Law</b> <b>but the slope of that Scaling Law may be completely different</b> <b>or its ratio may be completely different</b> <b>my current intuition is</b> <b>the model won't be that large</b> <b>the model doesn't need that many parameters</b> <b>because you don't need to remember</b>

<b>if you want to do video generation</b> <b>that's a different story</b> <b>but you don't need to remember everything</b> <b>all the subtle details in the world that you can see</b> <b>you don't need to</b> <b>solve equations</b> <b>in some very high-dimensional space</b> <b>to determine whether an apple falls</b> <b>mm-hmm</b> <b>it doesn't need to do these things</b> <b>it doesn't need human intelligence</b> <b>the highest level of human intelligence</b> <b>we can discuss separately what human intelligence actually is</b> <b>but anyway</b>

<b>it doesn't need these things</b> <b>it doesn't need to memorize all</b> <b>this knowledge</b> <b>it needs good understanding capability</b> <b>to filter information</b> <b>processing and filtering out information</b> <b>because ultimately</b> <b>what really matters is the decision itself</b> <b>mm-hmm</b> <b>right so</b> <b>this will become more and more like humans</b> <b>because that's how humans are</b> <b>there are many very important facts about humans</b> <b>right?</b>

<b>like the human visual system</b> <b>or rather</b> <b>all of human sensors combined</b> <b>including hearing, vision, smell</b> <b>touch, all of these</b> <b>this</b> <b>is actually extremely high bandwidth</b> <b>this bandwidth might reach</b> <b>say 1 billion bits per second</b> <b>in the range of 100 million to 1 billion</b> <b>mm-hmm</b> <b>but when we're talking right now</b> <b>the bandwidth is extremely low</b> <b>the bandwidth is only</b>

<b>ten to one hundred bits per second</b> <b>mm-hmm</b> <b>so what's actually happening?</b>

<b>right?</b>

<b>what kind of model is our brain</b> <b>that at twenty watts of power</b> <b>takes in one billion bits per second of information</b> <b>through our eyes</b> <b>and all kinds of sensory inputs</b> <b>and converts it into 10 bits per second of</b> <b>behavioral output</b> <b>this is the World Model itself</b> <b>it filters out large amounts of useless information and noise</b> <b>right, there's a lot of redundancy</b> <b>it knows what's important</b>

<b>and what's not important</b> <b>so the filtering system is very important</b> <b>right, of course</b> <b>this is also a hierarchical filtering system</b> <b>mm-hmm</b> <b>mm-hmm, that's indeed the case</b> <b>so how do you train this world model?</b>
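The bandwidth figures quoted above imply a striking reduction ratio; a quick sanity check, using the speaker's rough order-of-magnitude estimates rather than measured values:

```python
# Rough order-of-magnitude figures from the conversation:
sensory_in_bps = 1_000_000_000   # ~10^9 bits/s across all human senses
behavior_out_bps = 10            # ~10 bits/s of behavioral output

ratio = sensory_in_bps / behavior_out_bps
print(f"{ratio:.0e}")            # → 1e+08
# i.e. the brain discards about eight orders of magnitude of input,
# on a roughly 20 W power budget.
```

That hundred-million-to-one funnel is the "hierarchical filtering system" being described: the world model's job is deciding what to throw away.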

<b>uh, language models are easy to train</b> <b>because internet information is just sitting there</b> <b>so you can train it</b> <b>but with world models, it seems like</b> <b>I don't even know where to begin</b> <b>right, I think this is the biggest bet</b> <b>because the closer you get to</b> <b>the essence of intelligence</b> <b>things become</b> <b>much harder</b> <b>mm-hmm right</b> <b>I think like you said</b> <b>we went through the period of dumping the entire internet</b> <b>to train models</b>

<b>that era</b> <b>I think going forward</b> <b>uh</b> <b>I honestly don't know if this path will work</b> <b>I have enough confidence</b> <b>but if you asked me whether it's 100% guaranteed to succeed</b> <b>not necessarily</b> <b>the reason still comes down to data</b> <b>can we actually pull this off</b> <b>to the fullest extent</b> <b>how much data does it need?</b>

<b>what kind of data?</b>

<b>I think the past era was about dumping</b> <b>or downloading, I should say</b> <b>downloading the Internet</b> <b>and the coming era is about downloading</b> <b>humanity</b> <b>mm-hmm</b> <b>we need to download humanity</b> <b>mm-hmm</b> <b>so right now, again</b> <b>right, everyone produces this knowledge</b> <b>we have something called the Internet</b> <b>we can upload it</b> <b>we can train a Transformer</b> <b>everything is good</b> <b>but for truly understanding the world</b>

<b>a 4-year-old child</b> <b>the videos they've seen — Yann often cites this example</b> <b>already exceed all the tokens</b> <b>used to train all of these</b> <b>large language models</b> <b>right?</b>

<b>a four-month-old baby</b> <b>the amount of video they've seen</b> <b>exceeds all 30 trillion tokens</b> <b>of the best large language models' data</b> <b>right?</b>
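This comparison can be checked with rough arithmetic. A back-of-envelope sketch, assuming commonly quoted public estimates (roughly 20 MB/s of combined optic-nerve bandwidth, about 2 bytes per token); apart from the 30-trillion-token count, these figures are assumptions, not numbers from the conversation, and only the order of magnitude matters:

```python
# Commonly quoted rough estimates; only the order of magnitude matters.
optic_nerve_bytes_per_s = 2e7        # ~20 MB/s across both optic nerves
waking_seconds = 4 * 30 * 12 * 3600  # 4 months at ~12 waking hours/day

visual_bytes = optic_nerve_bytes_per_s * waking_seconds  # ~1e14 bytes

llm_tokens = 30e12    # ~30 trillion training tokens
bytes_per_token = 2   # rough average after tokenization
text_bytes = llm_tokens * bytes_per_token  # ~6e13 bytes

print(f"baby's visual input: ~{visual_bytes:.1e} bytes")  # → ~1.0e+14 bytes
print(f"LLM training text:   ~{text_bytes:.1e} bytes")    # → ~6.0e+13 bytes
print("baby exceeds LLM:", visual_bytes > text_bytes)     # → True
```

With these estimates the four-month-old's visual input already edges past the full text corpus, which is the direction of the claim; smaller bandwidth estimates just push the crossover a few months later.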

<b>so this magnitude is just enormous</b> <b>so when I said we need to download humanity</b> <b>the data that human eyes see</b> <b>how do we actually collect that data?</b>

<b>right?</b>

<b>I think video is still</b> <b>that's why</b> <b>before</b> <b>I was still very eager to do more work on video</b> <b>related research</b> <b>I think this is the best hope we have right now</b> <b>right mm-hmm</b> <b>oh this might have a very high barrier</b> <b>but I don't think it's necessarily impossible</b> <b>I think we can proceed in several stages</b> <b>first we can start with internet data</b> <b>start with YouTube</b> <b>mm-hmm</b> <b>as I was saying</b> <b>no matter what</b>

<b>all of these training tokens</b> <b>tens of trillions of tokens</b> <b>the amount of information a four-month-old baby has already seen</b> <b>all that data</b> <b>equals about 30 minutes of YouTube uploads</b> <b>there's a massive amount of data on YouTube</b> <b>mm-hmm</b> <b>is there a copyright issue with that?</b>

<b>uh</b> <b>everyone knows there are copyright issues</b> <b>and everyone</b> <b>everyone is continuing</b> <b>continuing to do it anyway</b> <b>mm-hmm yeah</b> <b>I think</b> <b>at some point there will definitely be major copyright issues</b> <b>or rather this isn't just a copyright issue</b> <b>because YouTube may not own the copyright to these videos</b> <b>but it's a terms of service issue</b> <b>YouTube prohibits you from scraping this data</b> <b>which makes this data extremely hard to collect</b> <b>basically impossible to get</b> <b>you download a few videos</b>

<b>and YouTube blocks your IP</b> <b>and then</b> <b>you have to switch to a new IP</b> <b>right, so it's kind of</b> <b>now I think</b> <b>uh</b> <b>these data companies and these platforms</b> <b>are in this cat-and-mouse dynamic</b> <b>mm-hmm</b> <b>one side</b> <b>one side is tightly guarding against data collection</b> <b>blocking you from scraping</b> <b>the other side</b> <b>the other side is trying every means to get more data</b> <b>mm-hmm right</b>

<b>I don't know how it will end</b> <b>right</b> <b>wow, ByteDance has such a huge advantage</b> <b>and ByteDance doesn't care</b> <b>right?</b>

<b>but they've received a lot of cease-and-desist letters too</b> <b>so I don't know</b> <b>I think going forward there may be more</b> <b>right, but I think</b> <b>this gets into human society's</b> <b>more political optimization</b> <b>mm-hmm alright</b> <b>step one is video</b> <b>and then next</b> <b>running in parallel is</b> <b>I think</b> <b>this kind of world model</b> <b>or</b>

<b>this very vision-centric world model</b> <b>will have some very promising application prospects</b> <b>because I think doing only research isn't enough</b> <b>the reason LLM succeeded</b> <b>is also because the chatbot interface</b> <b>was so successful</b> <b>so natural</b> <b>it relies on</b> <b>the internet</b> <b>on mobile devices</b> <b>but it's a very good interface</b> <b>a very very good product</b>

<b>so even OpenAI's own people didn't realize it</b> <b>right but</b> <b>when we talk about world models</b> <b>especially</b> <b>the world model we just defined</b> <b>what is the ultimate product exactly?</b>

<b>I think this</b> <b>might be the real hard problem</b> <b>mm-hmm</b> <b>maybe an even harder problem than data</b> <b>so right now</b> <b>if I just brainstorm ideas</b> <b>off the top of my head</b> <b>the ideas might all be wrong in the end</b> <b>but there are at least two outlets</b> <b>one is something like AI glasses</b> <b>this kind of truly personal assistant</b> <b>this needs a World Model</b> <b>with only a language model</b>

<b>that's not enough</b> <b>with only a language model</b> <b>it's still just ChatGPT</b> <b>but with a screen and voice interaction</b> <b>right?</b>

<b>it can't break out of that product form</b> <b>for example I often give people this example</b> <b>I'm now wearing some wearable devices</b> <b>they're not real AI wearable devices</b> <b>right?</b>

<b>but somehow</b> <b>they possess some traits I think are</b> <b>world model-like</b> <b>mm-hmm</b> <b>the reason is they're an always-on device</b> <b>it's always on</b> <b>always monitoring your vital signs</b> <b>right?</b>

<b>and there's a large amount of information</b> <b>because every second</b> <b>right, I'm not sure</b> <b>at what frequency it collects this information</b> <b>but my heart is always beating</b> <b>so it can always track this information</b> <b>and then where does this information go?</b>

<b>right?</b>

<b>this information itself is meaningless to me</b> <b>knowing my heart rate</b> <b>BPM at a certain moment</b> <b>has no meaning to me at all</b> <b>so it needs intelligent decision-making</b> <b>to tell me</b> <b>you seem to be under too much stress</b> <b>right, you're under too much pressure now</b> <b>you need to slow down</b> <b>and then saying</b> <b>your sleep hasn't been very good the past few days</b> <b>you might need to consider</b> <b>some remedial measures</b>

<b>or maybe you should take a day off today</b> <b>right?</b>

<b>I think this is actually quite world model-like</b> <b>except</b> <b>this is the most basic world model possible</b> <b>because the information it can get is just too little</b> <b>mm-hmm</b> <b>it's very narrow information</b> <b>right, very very narrow</b> <b>mm-hmm right?</b>

<b>but I think this</b> <b>is a glimpse of a future world model</b> <b>in AI wearables</b> <b>mm-hmm</b> <b>because if we imagine there were actually glasses</b> <b>or right</b> <b>I know you don't like wearing glasses</b> <b>but suppose there were some kind of wearable device</b> <b>that could truly be always on</b> <b>we don't know how to solve the power consumption issue</b> <b>never mind the hardware issues</b> <b>let's set that aside</b> <b>but it could see in real time</b> <b>everything we can see</b>

<b>right?</b>

<b>with completely always-on</b> <b>and infinite tokens</b> <b>flowing into the system</b> <b>mm-hmm</b> <b>I think this</b> <b>actually has enormous potential</b> <b>and first of all</b> <b>I'd really want this thing</b> <b>because I want to know at what time I drank a coffee</b> <b>and whether I drank that coffee an hour too early</b> <b>or an hour too late</b> <b>causing my sleep that night to not be as good</b> <b>or say I'm an athlete</b>

<b>who wants guidance on every movement</b> <b>or say I work in a hospital</b> <b>and I want to equip every elderly person in the nursing home</b> <b>with such a wearable</b> <b>so I know</b> <b>what their daily behavioral patterns are</b> <b>what medications they've taken</b> <b>what they've been doing</b> <b>ah</b> <b>how they're feeling emotionally</b> <b>right, what their condition is</b> <b>mm-hmm yeah</b> <b>and link it to their medical records in the background</b> <b>and provide better intelligent decision-making</b>

<b>I think there are many many similar examples</b> <b>right, but this is based on current LLMs</b> <b>existing multimodal intelligence</b> <b>which I think actually can't do this</b> <b>mm-hmm</b> <b>and then</b> <b>another outlet</b> <b>we also just touched on this</b> <b>I think it's Robotics</b> <b>I think Robotics</b> <b>faces the problem of</b> <b>the brain not being good enough</b> <b>mm-hmm</b> <b>and even if it can do martial arts</b> <b>it can perform</b> <b>of course</b> <b>you can't deny</b>

<b>that's also a good vertical domain</b> <b>right, the entertainment market</b> <b>might also be quite big</b> <b>so let robots go perform then</b> <b>I think that's fine too</b> <b>but this is far from a general-purpose robot</b> <b>that can enter every home</b> <b>carry elderly people up and down stairs</b> <b>take care of their daily needs</b> <b>this is</b> <b>still extremely far away</b> <b>mm-hmm, robots that can actually work are still a wasteland</b>

<b>[laughs] yes, yes</b> <b>oh and I think this part you can see</b> <b>robotics</b> <b>is actually</b> <b>a very good downstream application</b> <b>because no matter what new upstream</b> <b>we talk about in the broad world model sense</b> <b>like these glasses</b> <b>ah</b> <b>robots can benefit from it</b> <b>mm-hmm</b> <b>for example LLM came out</b> <b>and we had VLA, right?</b>

<b>that was hot for a while</b> <b>now video diffusion is doing well</b> <b>action-conditioned video diffusion is doing well</b> <b>right?</b>

<b>this generative approach</b> <b>this world simulator doing well</b> <b>so we're also discussing</b> <b>how robots can use these models</b> <b>to do</b> <b>better action planning</b> <b>right, there's a lot of work like that</b> <b>so as I said</b> <b>I think</b> <b>there's still a long way to go here</b> <b>and then</b> <b>but I think</b> <b>watching robots online</b> <b>watching robots on the Spring Festival Gala</b>

<b>versus in private</b> <b>talking to researchers in the robotics industry</b> <b>the feelings are very different</b> <b>how so?</b>

<b>the latter are willing to tell me the truth</b> <b>oh</b> <b>that doesn't mean</b> <b>they're normally being dishonest</b> <b>just that the latter are more willing to tell me</b> <b>exactly where the shortcomings of current systems lie</b> <b>why something sounds like it should work</b> <b>but existing models just can't solve it</b> <b>so we just talked about</b> <b>your decade-plus long research journey</b> <b>how did you make the jump to world models?</b>

<b>mm-hmm</b> <b>I think there wasn't really a jump</b> <b>as I've been saying throughout</b> <b>I think</b> <b>what I call representation learning</b> <b>world models and the entire development of AI</b> <b>form a fairly smooth transition</b> <b>and</b> <b>I'm actually not a big fan of the term world model</b> <b>as a label</b> <b>I think it sounds a bit hyped</b>

<b>and now it's become a kind of</b> <b>catch-all term for everything</b> <b>and everyone is claiming they're doing world models</b> <b>I think</b> <b>on one hand</b> <b>this isn't exactly a process</b> <b>that a researcher</b> <b>would enjoy</b> <b>but on the other hand</b> <b>I think a field moving forward</b>

<b>may still need some of these</b> <b>buzzwords</b> <b>and I think if I had to name something</b> <b>I might appreciate one thing</b> <b>about the world model</b> <b>about the so-called World Model</b> <b>and that is this</b> <b>this comes from Jitendra Malik, a professor at Berkeley</b> <b>he said</b> <b>the one thing he likes about World Model</b> <b>is that it lets him tell people</b> <b>I'm doing a World Model</b> <b>not a Word Model</b>

<b>word as in W-O-R-D</b> <b>right, I'm doing a world model</b> <b>not a word model</b> <b>and a word model is an LLM</b> <b>I quite agree with that</b> <b>so I think</b> <b>as I keep repeating</b> <b>world models</b> <b>are a destination that everyone will eventually reach</b> <b>it's a goal</b> <b>right</b> <b>mm-hmm actually</b>

<b>as you started pursuing world models</b> <b>you also made a very major decision</b> <b>which is</b> <b>to start a company</b> <b>a very different choice from your previous research career</b> <b>why did you make this choice</b> <b>and how did it come about?</b>

<b>oh</b> <b>this decision was also something of a metaphysical one</b> <b>people might think I'm being too mystical about this</b> <b>but it really was</b> <b>because before, I had many friends in the Bay Area</b> <b>some</b> <b>mentors who've been very helpful to me</b> <b>some of them investors</b> <b>or other entrepreneurs</b> <b>and they said</b> <b>Saining, you should also try starting a company</b> <b>mm-hmm</b>

<b>because at the university</b> <b>as I was saying earlier</b> <b>resources are scarce</b> <b>right, but that doesn't mean university is worthless</b> <b>I think</b> <b>university is actually a very good platform</b> <b>it gives me enough space</b> <b>to truly find what I want to do</b> <b>but I suddenly felt</b> <b>that now seems like a moment</b> <b>where</b> <b>what I want to explore</b> <b>has been explored to a certain extent</b>

<b>and going further might fall into</b> <b>what I call the medium paper trap</b> <b>[laughs] like the middle income trap</b> <b>meaning you'd publish decent papers</b> <b>but because of resource constraints</b> <b>you can't truly turn your</b> <b>your ideas into</b> <b>what might be a new breakthrough in some sense</b>

<b>right, so I thought</b> <b>this might be a good moment</b> <b>and then a manager asked me</b> <b>at quite an interesting moment</b> <b>probably around year-end last year</b> <b>or maybe it was in the fall</b> <b>year-end of '25</b> <b>mm-hmm right</b> <b>and he said</b> <b>go ask Yann LeCun</b> <b>he seems to not be very happy at Meta lately</b>

<b>but at that time it wasn't actually that turbulent yet</b> <b>Alexandr Wang hadn't come yet (Scale AI founder, joined Meta as Chief AI Officer)</b> <b>the layoffs at FAIR</b> <b>and</b> <b>all that turbulence hadn't happened</b> <b>my first instinct was</b> <b>oh, how could that be?</b>

<b>right Yann right?</b>

<b>we can later</b> <b>talk more about</b> <b>what kind of person Yann is</b> <b>but at least at that time</b> <b>I would have thought he's still</b> <b>the godfather of AI, right?</b>

<b>and</b> <b>he</b> <b>is a pure researcher</b> <b>how could he be pulled into a startup?</b>

<b>and then we had this conversation</b> <b>the Monday two weeks after that</b> <b>we happened to have a one-on-one meeting</b> <b>a one-on-one meeting</b> <b>with Yann LeCun</b> <b>yeah</b> <b>and before I could say anything</b> <b>Yann said to me, hey</b> <b>Saining, don't tell anyone yet</b> <b>but I've already decided</b> <b>this</b> <b>what I want to do now</b> <b>should be done outside</b>

<b>I want to start and build a company</b> <b>and then I asked him</b> <b>what do you want to do?</b>

<b>what's the business model behind this?</b>

<b>mm-hmm</b> <b>and then I realized wow</b> <b>this is completely aligned with what I'd imagined</b> <b>mm-hmm, very interesting</b> <b>right, and what is this thing?</b>

<b>I think you can</b> <b>you can call it world models</b> <b>or the logic behind this is</b> <b>I think on the thing I want to do</b> <b>in the current</b> <b>any country in the world</b> <b>I don't think it can be done</b> <b>including in the Bay Area</b> <b>can't be done in Silicon Valley either</b> <b>so what is this thing?</b>

<b>that is to say</b> <b>you still need a certain degree of research depth</b> <b>right?</b>

<b>it's not completely saying, hey</b> <b>we now have a Large Language Model</b> <b>we want to deploy this system</b> <b>and push to product</b> <b>and then</b> <b>go get some revenue</b> <b>it's actually not like that</b> <b>right?</b> <b>and I think</b>

<b>this has a strong research-oriented</b> <b>inclination</b> <b>mm-hmm right?</b>

<b>but it's also not a purely academic</b> <b>setting</b> <b>it's not the old FAIR</b> <b>and it's not NYU either</b> <b>it's not a university</b> <b>but on the other hand</b> <b>it's also not the Bay Area's</b> <b>big tech companies and the many neo labs now</b> <b>operating in a completely closed manner</b> <b>what does closed mean?</b>

<b>closed means</b> <b>you don't open source</b> <b>you can't publish papers</b> <b>and like the blog I mentioned</b> <b>mm-hmm</b> <b>you can't put your name on it</b> <b>and</b> <b>like when I was actually at Google</b> <b>at GTM</b> <b>I was in GenAI</b> <b>and I was the only one there</b> <b>who had, in a sense, a foot in both worlds</b> <b>a double affiliation</b> <b>still doing things at the university</b> <b>people there actually have</b>

<b>some resistance to academia</b> <b>to this kind of purely exploratory research</b> <b>that's the Bay Area's</b> <b>current state</b> <b>right</b> <b>resistance</b> <b>how do you understand that?</b>

<b>who's resisting?</b>

<b>resistance means</b> <b>first, I think people look down on</b> <b>the work academia is doing</b> <b>they don't think academia's work can truly</b> <b>ah</b> <b>generate any kind of impact</b> <b>second</b> <b>because they also don't publish</b> <b>you don't know what they're working on</b>


<b>right? even within these big companies</b> <b>actually some large companies</b> <b>have research departments</b> <b>and more product-oriented departments</b> <b>but even between these two departments in the same company</b> <b>there's still a big divide</b> <b>because</b> <b>again, the departments doing</b> <b>say core model</b> <b>training at these companies</b> <b>need to be in this highly competitive</b>

<b>race</b> <b>mm-hmm</b> <b>at the very front</b> <b>that's their only goal</b> <b>it's an arms race</b> <b>it's an arms race</b> <b>mm-hmm</b> <b>and</b> <b>this squeezes out your research space</b> <b>mm-hmm</b> <b>it</b> <b>it sucks away the oxygen</b> <b>in that environment</b> <b>the oxygen that gives you sufficient freedom to do research</b>

<b>mm-hmm, so you never considered joining any lab</b> <b>you couldn't stand that suffocating feeling</b> <b>yes</b> <b>I think this is also a very interesting phenomenon</b> <b>the phenomenon being</b> <b>there were indeed some opportunities back then</b> <b>and I was considering other options too</b> <b>and</b> <b>but after thinking about it</b> <b>I felt that maybe this</b> <b>if you really want to do</b> <b>truly cutting-edge exploration</b> <b>if you want to define the problems</b>

<b>you probably have to do it at your own startup</b> <b>for that to work</b> <b>mm-hmm, someone else's startup</b> <b>means they define the problems</b> <b>and you come to execute</b> <b>that's other startups</b> <b>well first of all</b> <b>I don't think among all these other startups</b> <b>there's any single startup</b> <b>or any big company</b> <b>that's focused on what we're doing</b> <b>what is called building the predictive brain</b> <b>right?</b>

<b>working at what you might call the most foundational layer</b> <b>or the most upstream layer</b> <b>doing things there</b> <b>that simply doesn't exist</b> <b>even more interesting is</b> <b>actually many of my friends</b> <b>when I talk with them</b> <b>everyone realizes</b> <b>this is actually necessary</b> <b>as I just said</b> <b>this thing</b> <b>on one hand is somewhat of a</b> <b>counter-consensus view</b> <b>right, a contrarian view</b> <b>but on the other hand</b>

<b>over the past year</b> <b>it has gradually become a consensus</b> <b>so what I'm saying isn't all that new</b> <b>nothing particularly new</b> <b>mm-hmm</b> <b>but I briefly mentioned</b> <b>I think in the entire AI industry right now</b> <b>there's this enormous AI</b> <b>this kind of</b> <b>value chain</b> <b>at the very top of this value chain as I just said</b> <b>there's Bitter Lesson</b>

<b>there's a narrative of AGI and LLM</b> <b>this has defined a series of benchmarks</b> <b>mm-hmm</b> <b>right, so you compete on leaderboards</b> <b>mm-hmm mm-hmm</b> <b>and you just compete</b> <b>the leaderboard might be LLM</b> <b>Arena or other leaderboards</b> <b>right, there are</b> <b>a series of benchmarks</b> <b>these benchmarks define resource allocation</b> <b>meaning</b> <b>how you allocate resources</b> <b>mm-hmm</b>

<b>right, because my goal</b> <b>if it's to be number one on the leaderboard</b> <b>then I can only pour in the most resources</b> <b>to be able to compete at that level</b> <b>and then resource allocation</b> <b>actually means this</b> <b>has already drifted somewhat from what researchers think is right</b> <b>or wrong</b> <b>although some</b> <b>very strong researchers know</b> <b>we may need to do some research</b>

<b>but under this value chain</b> <b>resource allocation means</b> <b>they can't do this part of the research</b> <b>so for example I think</b> <b>hmm</b> <b>video</b> <b>understanding is actually quite important</b> <b>but now it seems neither academia</b> <b>nor industry</b> <b>is doing much of it</b> <b>or people are doing it but not with a fundamental</b> <b>World Model angle to approach this problem</b> <b>to solve this problem</b>

<b>but why is that?</b>

<b>but this is a very interesting phenomenon</b> <b>you'll see</b> <b>it's not that no one is willing to do it</b> <b>it's not that no one has the ability to do it</b> <b>mm-hmm</b> <b>it's that all of them, without exception</b> <b>regardless of which company</b> <b>have been assigned to a video generation model</b> <b>team</b> <b>mm-hmm</b> <b>because this is the only</b> <b>position within this value chain</b>

<b>from which they can indirectly</b> <b>participate in this value chain</b> <b>even though they all know</b> <b>we haven't solved this problem</b> <b>we need a better</b> <b>as I just said</b> <b>a World Model</b> <b>based video understanding model</b> <b>and this</b> <b>might be an important prerequisite</b> <b>for actually training that World Model</b> <b>but people won't have space to do</b> <b>such exploration</b> <b>mm-hmm</b>

<b>so back when I was at Google</b> <b>I had that frustration too</b> <b>including when we did the RAE paper</b> <b>this paper</b> <b>with a student, Boyang Zheng</b> <b>we probably spent almost a year on it</b> <b>because this student in between might also have</b> <b>had some health issues</b> <b>anyway</b> <b>there might have been some gaps in there</b> <b>right?</b>

<b>anyway, to finish this work</b> <b>it took us a year</b> <b>mm-hmm</b> <b>when we published this work</b> <b>I was actually a bit worried</b> <b>I thought hmm</b> <b>would there be some Google researcher</b> <b>coming to me saying</b> <b>why did you publish a paper</b> <b>we're doing the same thing</b> <b>you've exposed our secrets</b> <b>mm-hmm</b> <b>turns out yes</b> <b>oh</b> <b>several researchers came to me</b>

<b>and their feedback was</b> <b>I think this is right</b> <b>I worked on this for two weeks</b> <b>but my manager said</b> <b>you can't do this anymore</b> <b>we have product cycle one coming up</b> <b>product cycle two</b> <b>product cycle three, right?</b>

<b>these</b> <b>product launch timelines</b> <b>need to be completed</b> <b>their motivation is different</b> <b>their motivation is different</b> <b>so it all comes back to</b> <b>I think we need to return to</b> <b>what we discussed at the beginning</b> <b>in this kind of finite game</b> <b>in this highly competitive environment</b> <b>every company</b> <b>seems to have lost its ability to define problems</b> <b>for example</b>

<b>you see that before, like OpenAI, right?</b>

<b>they actually had that ability</b> <b>mm-hmm</b> <b>many of these problems were defined by them</b> <b>right?</b>

<b>including GPT</b> <b>including models like CLIP</b> <b>right?</b> <b>or say</b>

<b>from their very first day</b> <b>as a research unit</b> <b>they had this kind of problem-defining capability</b> <b>mm-hmm</b> <b>right?</b>

<b>but now</b> <b>it seems like even OpenAI</b> <b>to some extent</b> <b>is being swept into this race</b> <b>mm-hmm, of course they were once the ones who defined the race</b> <b>now they're the ones being competed against</b> <b>mm-hmm</b> <b>so I think the AI industry right now</b> <b>needs new problem-definers</b> <b>and Yann has this conviction</b> <b>that the current path</b> <b>mm-hmm</b> <b>cannot lead to true intelligence</b> <b>right?</b>

<b>so someone needs to define new problems</b> <b>on this larger scale</b> <b>I think Yann and I share a lot of common ground</b> <b>on this matter</b> <b>mm-hmm, so you found a kindred spirit</b> <b>yeah, that's a better way to put it</b> <b>mm-hmm</b> <b>so then you started the company</b> <b>right?</b>

<b>then</b> <b>you mentioned Yann</b> <b>let me ask you</b> <b>what kind of person is Yann?</b>

<b>what's it like working with Yann?</b>

<b>mm-hmm</b> <b>Yann is</b> <b>a very unique person</b> <b>mm-hmm</b> <b>I'll start with a few of his characteristics</b> <b>mm-hmm</b> <b>he's very principled</b> <b>mm-hmm</b> <b>and I think his principles are</b> <b>very rooted in his deep understanding of the problem itself</b> <b>mm-hmm</b> <b>which is why he</b> <b>when he says something is right</b>

<b>I think he truly believes in what he says</b> <b>mm-hmm</b> <b>and won't be swayed by other people's opinions</b> <b>mm-hmm</b> <b>and I think this quality</b> <b>in the current research environment</b> <b>is actually very rare</b> <b>mm-hmm</b> <b>because most people</b> <b>well first of all researchers are human beings</b> <b>mm-hmm</b> <b>they also need to consider their career</b> <b>their citations</b> <b>right, their impact factor</b>

<b>mm-hmm</b> <b>and follow the trend</b> <b>when everyone else is doing LLMs</b> <b>I should also publish some papers on LLMs</b> <b>mm-hmm</b> <b>but Yann clearly hasn't done this</b> <b>mm-hmm</b> <b>right?</b>

<b>and for me</b> <b>I feel like I also</b> <b>belong to this type of person</b> <b>mm-hmm</b> <b>second</b> <b>I think Yann is</b> <b>from my observations</b> <b>a very good leader</b> <b>mm-hmm</b> <b>right, how so?</b>


<b>Yann's leadership style is</b> <b>he actually doesn't</b> <b>manage people much</b> <b>mm-hmm</b> <b>mm-hmm</b> <b>and Yann's approach to leading</b> <b>is through his vision</b> <b>mm-hmm</b> <b>and through what he stands for</b> <b>and all the values that he represents</b> <b>mm-hmm</b> <b>to attract people to join him</b>

<b>mm-hmm</b> <b>and then</b> <b>he'll also give you a lot of freedom</b> <b>mm-hmm</b> <b>he's very empowering</b> <b>mm-hmm, that's great</b> <b>right?</b>

<b>and I think this is a style that works best for me</b> <b>because I also don't want to be managed very much</b> <b>mm-hmm</b> <b>mm-hmm, so you two get along really well</b> <b>mm-hmm</b> <b>yeah, I think we complement each other</b> <b>mm-hmm</b> <b>because I think Yann</b> <b>is more of a visionary</b> <b>mm-hmm</b> <b>and I'm more</b> <b>sort of more grounded</b> <b>someone who can actually execute</b> <b>mm-hmm</b>

<b>good at figuring out</b> <b>given Yann's direction</b> <b>what should we specifically do</b> <b>mm-hmm</b> <b>so I think this pairing</b> <b>is interesting</b> <b>mm-hmm</b> <b>yeah, I feel like Yann also</b> <b>has this kind of</b> <b>very outspoken</b> <b>internet celebrity vibe</b> <b>[laughs] [laughs]</b> <b>very outspoken person</b> <b>right?</b>

<b>and you're relatively more low-key?</b>

<b>mm-hmm mm-hmm</b> <b>yeah, I think that's relatively true</b> <b>mm-hmm</b> <b>I like</b> <b>speaking through work</b> <b>mm-hmm</b> <b>okay, so then you co-founded this company together</b> <b>mm-hmm</b> <b>and then you're in New York</b> <b>right?</b>

<b>let's talk about New York</b> <b>mm-hmm</b> <b>why not Silicon Valley?</b>

<b>ah, this question</b> <b>this is indeed a question a lot of people are very</b> <b>curious about</b> <b>right?</b>

<b>uh</b> <b>I think</b> <b>first of all</b> <b>honestly</b> <b>I'm a New York person myself</b> <b>I've been at NYU for many years</b> <b>mm-hmm</b> <b>and Yann has been at NYU even longer than me</b> <b>right?</b>

<b>and the feeling of New York, speaking truthfully</b> <b>is very different from San Francisco</b> <b>mm-hmm</b> <b>I've been to San Francisco many times</b> <b>and I've lived in the Bay Area</b> <b>mm-hmm</b> <b>but the Bay Area atmosphere</b> <b>is really</b> <b>a pure tech bubble</b> <b>mm-hmm</b> <b>but you know what</b> <b>it's not necessarily a bad thing</b>

<b>mm-hmm</b> <b>in that bubble</b> <b>everyone can be very focused on doing one thing</b> <b>mm-hmm</b> <b>so the entire Bay Area culture is</b> <b>just about building companies, right?</b>

<b>mm-hmm</b> <b>and New York is</b> <b>I think, a more</b> <b>real world</b> <b>mm-hmm</b> <b>this real world in New York</b> <b>has given me many inspirations</b> <b>right?</b>

<b>and then</b> <b>many of the ideas around the product</b> <b>especially the kind of embodied AI products</b> <b>or world model products</b> <b>I've imagined</b> <b>actually come from life in New York</b> <b>mm-hmm</b> <b>right?</b>

<b>and then also</b> <b>in terms of recruiting</b> <b>I think many people in New York</b> <b>have a stronger desire to</b> <b>do something more fundamental</b> <b>mm-hmm</b> <b>right, because the Bay Area is</b> <b>actually quite saturated now</b> <b>yes</b> <b>in terms of talent</b> <b>it is saturated</b> <b>but in terms of culture</b> <b>everyone is doing product, product, product</b> <b>mm-hmm</b>

<b>right?</b>

<b>so I also feel</b> <b>that for what I'm doing</b> <b>New York might be</b> <b>a better fit</b> <b>mm-hmm</b> <b>mm-hmm yeah</b> <b>right, as we talked about earlier</b> <b>there are actually many AI startups</b> <b>in New York</b> <b>and there's quite a vibrant</b> <b>AI scene</b> <b>in New York</b> <b>right?</b>

<b>but New York still doesn't have an</b> <b>absolutely top-tier</b> <b>AI company</b> <b>like OpenAI-level</b> <b>right?</b>

<b>I think that</b> <b>is also an opportunity</b> <b>mm-hmm</b> <b>right, Hugging Face is in New York</b> <b>mm-hmm</b> <b>mm-hmm, well Hugging Face is headquartered in New York</b> <b>but their team might be quite distributed</b> <b>but their HQ is New York</b> <b>so I think this is</b> <b>a very interesting trend</b> <b>mm-hmm</b> <b>okay, so then let's talk about</b> <b>the current state of the company</b>

<b>how many people do you have?</b>

<b>how's it going so far?</b>

<b>mm-hmm</b> <b>right, so we're still very early</b> <b>the company is only about</b> <b>six months old or so</b> <b>mm-hmm</b> <b>and we currently have</b> <b>about 15 people</b> <b>mm-hmm</b> <b>the team is</b> <b>very very strong</b> <b>how big will your pre-training dataset be?</b>

<b>ah, these things</b> <b>that's the research part</b> <b>right</b> <b>we actually now have a very good roadmap</b> <b>and we've also hired many many people</b> <b>everyone actually cares a lot about</b> <b>how to make something land in reality</b> <b>not just simply doing research</b> <b>although research is very very important</b> <b>and now</b> <b>if we want to achieve</b> <b>the goal of a truly good world model</b> <b>how much compute does it need?</b>

<b>mm-hmm</b> <b>I think compute is definitely needed</b> <b>but as I was saying earlier</b> <b>I think the compute efficiency will be</b> <b>very very different</b> <b>mm-hmm</b> <b>so the amount of compute</b> <b>might not be comparable to</b> <b>training a frontier LLM</b> <b>mm-hmm</b> <b>but</b> <b>one thing I think is very important</b> <b>is the structure of how we use compute</b>

<b>mm-hmm</b> <b>right, there are many ways to use compute</b> <b>for example</b> <b>you can use compute to train language</b> <b>or use compute to train video</b> <b>mm-hmm</b> <b>or you could train both simultaneously</b> <b>mm-hmm</b> <b>I think for our approach</b> <b>the distribution of compute might be</b> <b>very different</b> <b>mm-hmm</b> <b>um</b> <b>a larger portion</b> <b>might be used on video</b>

<b>mm-hmm</b> <b>but not just the kind of</b> <b>prediction-based</b> <b>purely the kind of</b> <b>prediction target, right?</b>

<b>this approach</b> <b>mm-hmm</b> <b>but a combination of generative and discriminative</b> <b>methods</b> <b>and then</b> <b>with a combination of language too</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>so I think</b> <b>the goal is</b> <b>through the least amount of compute possible</b> <b>to train the best world model</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>and then in doing so</b> <b>you also need to be able to</b> <b>make a product</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>so it'll be a long journey</b> <b>mm-hmm</b> <b>but I think the path</b> <b>is relatively clear to me</b> <b>mm-hmm</b> <b>yeah</b> <b>right well</b> <b>you did also mention Yann</b> <b>right, earlier you mentioned</b> <b>that before you started the company</b> <b>you were at NYU as a professor</b> <b>and also had a collaboration with Google</b>

<b>right?</b> <b>you were in quite a good position</b> <b>mm-hmm</b> <b>and then you made a decision</b> <b>to step out and do this</b> <b>mm-hmm</b> <b>what was the tipping point?</b>

<b>or the final straw</b> <b>that made you decide</b> <b>okay, I'm going to do this</b> <b>mm-hmm</b> <b>I think it's a combination of many things</b> <b>but I think</b> <b>the biggest factor was</b> <b>the conversation with Yann, as I mentioned</b> <b>mm-hmm</b> <b>because I had never considered</b> <b>that Yann would want to do this</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>and once Yann decided he wanted to do this</b> <b>mm-hmm</b> <b>the whole thing became a lot more</b> <b>compelling</b> <b>mm-hmm</b> <b>because I think with Yann</b> <b>doing this kind of thing</b> <b>is much more legitimate</b> <b>right, meaning it's not just</b> <b>two or three young researchers</b> <b>thinking they can change the world</b> <b>right?</b>

<b>right?</b> <b>right, and Yann has the experience</b> <b>the vision</b> <b>and the prestige</b> <b>mm-hmm</b> <b>to attract talent</b> <b>attract investment</b> <b>right?</b>

<b>right?</b> <b>so I think this is</b> <b>when I found out about this</b> <b>I basically decided immediately</b> <b>mm-hmm</b> <b>without even thinking about it much</b> <b>right?</b>

<b>right?</b> <b>I think this kind of</b> <b>opportunity</b> <b>is once in a lifetime</b> <b>right?</b>

<b>right?</b> <b>mm-hmm, and also</b> <b>I've always said</b> <b>I actually really like Yann</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>and I feel like having the chance to work closely</b> <b>with someone like Yann</b> <b>is something very rare</b> <b>mm-hmm</b> <b>mm-hmm, so that's also why</b> <b>you didn't hesitate</b> <b>mm-hmm</b> <b>yeah</b> <b>alright, so last question</b> <b>mm-hmm</b> <b>if you had to send a message</b> <b>to the Chinese AI research community</b>

<b>or students who are interested in AI research</b> <b>right?</b>

<b>right?</b> <b>what would you want to say to them?</b>

<b>hmm</b> <b>I think</b> <b>there are a few things I want to say</b> <b>mm-hmm</b> <b>the first thing</b> <b>is about attitude</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>keep thinking for themselves</b> <b>mm-hmm</b> <b>don't be swayed by trends</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>think</b> <b>about what they really want to do</b> <b>mm-hmm</b> <b>and why they want to do it</b>

<b>right?</b> <b>because I see many people</b> <b>in AI research</b> <b>and many people are doing it</b> <b>but actually</b> <b>sometimes it's a bit</b> <b>following the crowd</b> <b>mm-hmm</b> <b>right, because it seems like this field is hot</b> <b>mm-hmm</b> <b>so let me get into it</b> <b>mm-hmm</b>

<b>but actually the more important thing is</b> <b>you yourself</b> <b>have a genuine passion for</b> <b>this kind of creative work</b> <b>mm-hmm</b> <b>you genuinely want to figure out</b> <b>the essence of intelligence</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>if you just see this as a career path</b> <b>that's also fine</b> <b>right, if you just want a good job</b> <b>mm-hmm</b> <b>but I think for researchers</b> <b>or people who really want to push the frontier</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>I think this genuine love for the work</b> <b>is really important</b> <b>mm-hmm</b> <b>the second thing</b> <b>is about approach</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>think about problems</b> <b>more deeply</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>I think</b> <b>a lot of current AI research</b> <b>is quite shallow</b> <b>mm-hmm</b> <b>meaning</b> <b>a lot of it is</b> <b>just following what others are doing</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>people follow trends</b> <b>mm-hmm</b> <b>but the most interesting things</b> <b>come from people who ask</b> <b>why?</b>

<b>mm-hmm</b> <b>why does this work?</b>

<b>mm-hmm</b> <b>why doesn't that work?</b>

<b>mm-hmm</b> <b>what is the essence here?</b>

<b>mm-hmm</b> <b>and I think</b> <b>this kind of</b> <b>thinking deeply about a problem</b> <b>is a quality that's becoming rarer</b> <b>mm-hmm</b> <b>so I hope people can cultivate this quality</b> <b>mm-hmm</b> <b>and the third thing</b> <b>is about community</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>be more open</b>

<b>to collaboration</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>I think one of the beauties of the AI field is</b> <b>it's a very open field</b> <b>mm-hmm</b> <b>right, many papers are open</b> <b>much code is open</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>and this openness</b> <b>has driven a lot of progress</b> <b>mm-hmm</b> <b>I hope this spirit can be maintained</b> <b>mm-hmm</b> <b>yeah, thank you Saining</b> <b>mm-hmm</b> <b>this has been a very good conversation</b> <b>thank you</b> <b>thank you</b> <b>mm-hmm</b> <b>okay so now</b> <b>let me introduce</b>

<b>the next guest</b> <b>mm-hmm</b> <b>this next guest</b> <b>is also a very</b> <b>very special person</b> <b>mm-hmm</b> <b>he is</b> <b>a PhD student</b> <b>currently at NYU</b> <b>mm-hmm</b> <b>but he's not your ordinary PhD student</b> <b>mm-hmm</b> <b>he's also an entrepreneur</b> <b>mm-hmm</b>

<b>and then</b> <b>we just learned</b> <b>mm-hmm</b> <b>that he's also</b> <b>Forbes 30 Under 30</b> <b>wow</b> <b>yes</b> <b>this is very impressive</b> <b>mm-hmm</b> <b>let's welcome</b> <b>mm-hmm</b> <b>Zhiyuan Zeng (Tommy)</b> <b>mm-hmm</b> <b>hi everyone</b> <b>hi</b> <b>hello</b>

<b>mm-hmm</b> <b>alright Tommy</b> <b>why don't you first</b> <b>introduce yourself</b> <b>mm-hmm</b> <b>sure, hi everyone</b> <b>I'm Tommy</b> <b>currently I'm a PhD student at NYU</b> <b>and my research direction is</b> <b>AI agents</b>

<b>mm-hmm</b> <b>and at the same time</b> <b>I'm also the co-founder and CTO of a company</b> <b>called Simular AI</b> <b>mm-hmm</b> <b>and the direction of this company is also AI agents</b> <b>mm-hmm</b> <b>specifically</b> <b>we are building a desktop AI agent</b> <b>mm-hmm</b> <b>the product is called S2</b> <b>mm-hmm</b> <b>cool, desktop AI agent</b> <b>right?</b>

<b>right?</b> <b>does it work on a computer?</b>

<b>mm-hmm</b> <b>yes, it works on a computer</b> <b>mm-hmm</b> <b>then I want to ask you</b> <b>what exactly does it do?</b>

<b>mm-hmm</b> <b>right, so this thing basically</b> <b>can do everything you can do on a computer</b> <b>mm-hmm</b> <b>for example</b> <b>browsing the web</b> <b>mm-hmm</b> <b>writing code</b> <b>mm-hmm</b> <b>managing files</b> <b>mm-hmm</b> <b>using various applications</b> <b>mm-hmm</b> <b>right, using various software</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>so it can help you do tasks on the computer</b> <b>mm-hmm</b> <b>so it's more like</b> <b>a full automation of</b> <b>computer tasks</b> <b>mm-hmm</b> <b>yes, it's a computer automation tool</b> <b>right?</b>

<b>right?</b> <b>and it can</b> <b>handle more complex tasks</b> <b>mm-hmm</b> <b>right, like what?</b>

<b>for example</b> <b>say I need to</b> <b>book a flight</b> <b>mm-hmm</b> <b>but this booking involves</b> <b>multiple steps</b> <b>mm-hmm</b> <b>like opening a browser</b> <b>going to a website</b> <b>searching for flights</b>

<b>comparing prices</b> <b>mm-hmm</b> <b>and then ultimately booking it</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>all of these steps</b> <b>S2 can automatically complete for you</b> <b>mm-hmm</b> <b>so you just tell it what you want</b> <b>and then it does it for you</b> <b>right?</b>
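The multi-step flight-booking example can be pictured as a simple plan-then-act agent loop. The sketch below is purely illustrative: every name in it (`ToyEnvironment`, `run_task`, `StepResult`) is invented for this example and is not Simular's actual S2 API; it only shows the general pattern the conversation describes.

```python
# Hypothetical sketch of a desktop-agent loop for a multi-step task
# (e.g. booking a flight). None of these names are Simular's real API;
# this only illustrates the plan -> act -> check pattern.
from dataclasses import dataclass

@dataclass
class StepResult:
    ok: bool
    detail: str = ""

class ToyEnvironment:
    """Stands in for the real desktop; it just records actions."""
    def __init__(self):
        self.log = []

    def apply(self, action):
        # A real agent would click, type, or scroll here.
        self.log.append(action)
        return StepResult(ok=True)

def run_task(goal, steps, env):
    """Execute a pre-planned list of steps, stopping on failure."""
    history = []
    pending = list(steps)
    while pending:
        step = pending.pop(0)
        result = env.apply(step)
        history.append((step, result.ok))
        if not result.ok:   # a real agent would replan here
            break
    return history

env = ToyEnvironment()
plan = ["open browser", "search flights", "compare prices", "book ticket"]
history = run_task("book a flight", plan, env)
print(len(history))  # 4: every step executed
```

In a real system each `apply` call would issue GUI actions and each failure would trigger replanning, but the control flow stays this shape: a goal decomposed into steps, executed and checked one by one.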

<b>right?</b> <b>mm-hmm</b> <b>yes</b> <b>mm-hmm, that's pretty amazing</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>then tell me</b> <b>what's the difference between S2</b> <b>and similar products out there?</b>

<b>mm-hmm</b> <b>right, so I think</b> <b>S2's biggest differentiation is</b> <b>mm-hmm</b> <b>reliability</b> <b>mm-hmm</b> <b>right?</b> <b>because right now</b>

<b>many similar products</b> <b>might be able to demo well</b> <b>but in actual use</b> <b>the reliability is not so good</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>because computer tasks</b> <b>are inherently</b> <b>very complex</b> <b>mm-hmm</b> <b>there are many unexpected things that can go wrong</b> <b>mm-hmm</b> <b>right, like pop-up windows</b> <b>mm-hmm</b> <b>or maybe the website</b> <b>has changed its UI</b> <b>mm-hmm</b> <b>or maybe the network is slow</b> <b>mm-hmm</b>

<b>all sorts of situations</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>and S2's solution is</b> <b>we built a</b> <b>proprietary model specifically for computer tasks</b> <b>mm-hmm</b> <b>so that it can</b> <b>handle these complex situations</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>and at the same time</b> <b>we also have</b> <b>a proprietary planning module</b> <b>mm-hmm</b> <b>so that it can</b> <b>plan more efficiently</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>mm-hmm, so it has a self-developed model</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>a proprietary model</b> <b>mm-hmm</b> <b>so to do this you need a lot of data</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>how do you get that data?</b>

<b>mm-hmm</b> <b>right, so data is indeed</b> <b>one of the biggest challenges</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>so our approach is</b> <b>to build a</b> <b>data synthesis pipeline</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>we use AI to generate data</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>and then use this data</b> <b>to train the model</b> <b>mm-hmm</b> <b>right?</b>

<b>mm-hmm</b> <b>right?</b> <b>mm-hmm, and where does this synthetic data come from?</b>

<b>mm-hmm</b> <b>right, so the synthetic data</b> <b>mainly comes from</b> <b>we have an environment</b> <b>mm-hmm</b> <b>this environment simulates</b> <b>various computer tasks</b> <b>mm-hmm</b> <b>and then we have an AI agent</b> <b>in this environment</b> <b>completing these tasks</b> <b>mm-hmm</b> <b>and recording the process</b>

<b>right?</b> <b>mm-hmm</b> <b>so this is the source of the data</b> <b>mm-hmm</b> <b>right?</b>
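The synthesis pipeline described above (a simulated environment hands out computer tasks, an agent attempts them, and the process is recorded as training data) can be sketched minimally as follows. All names and structures here are assumptions for illustration, not Simular's actual pipeline.

```python
# Minimal sketch, under stated assumptions, of a synthetic-data
# pipeline: simulate tasks, let an agent act, record trajectories.
import json
import random

def simulate_task(task_id, n_steps=3):
    """Pretend-run one computer task, recording each (obs, action) step."""
    trajectory = []
    for step in range(n_steps):
        observation = f"screen_state_{task_id}_{step}"  # stand-in for a screenshot
        action = random.choice(["click", "type", "scroll"])
        trajectory.append({"obs": observation, "action": action})
    return {"task": task_id, "trajectory": trajectory}

def build_dataset(n_tasks):
    """Collect recorded trajectories into a training dataset."""
    random.seed(0)  # reproducible toy data
    return [simulate_task(i) for i in range(n_tasks)]

dataset = build_dataset(5)
print(len(dataset))  # 5 recorded trajectories
print(json.dumps(dataset[0]["trajectory"][0]))
```

In a production version, the environment would be a real or virtualized desktop, the agent a trained policy, and the recorded trajectories would feed model training; the recording loop itself keeps this simple shape.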

<b>right?</b> <b>mm-hmm, that's clever</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>mm-hmm</b> <b>so then</b> <b>tell me</b> <b>who are your target users?</b>

<b>mm-hmm</b> <b>right, our target users</b> <b>are mainly</b> <b>knowledge workers</b> <b>mm-hmm</b> <b>right?</b>

<b>right?</b> <b>people who spend a lot of time</b> <b>on computers every day</b> <b>mm-hmm</b> <b>for example</b> <b>software engineers</b> <b>mm-hmm</b> <b>data analysts</b> <b>mm-hmm</b> <b>right, product managers</b> <b>mm-hmm</b> <b>designers</b> <b>mm-hmm</b> <b>and so on</b> <b>mm-hmm</b>

<b>right?</b> <b>but I think</b> <b>trying to accomplish something this different</b> <b>is still quite difficult</b> <b>because as I said</b> <b>I've been emphasizing all along</b> <b>we're actually looking for a kind of balance</b> <b>this balance means</b> <b>it's neither a purely academic research lab</b> <b>nor is it one of today's</b> <b>closed large-model companies</b> <b>Mm-hmm</b> <b>and this balance also means</b>

<b>take me personally, for example</b> <b>it's also a kind of balance</b> <b>it's like</b> <b>I'm neither a very senior</b> <b>already accomplished and established</b> <b>kind of distinguished professor</b> <b>but I'm also not an eighteen or nineteen year old</b> <b>who can just roll up their bedding and head to a factory in Shenzhen</b> <b>[laughter]</b> <b>and set down roots</b> <b>to do data collection</b> <b>or whatever</b> <b>I'm neither of those</b> <b>Mm-hmm</b>

<b>some of the data comes from factories in Shenzhen</b> <b>Yes</b> <b>someone is doing it</b> <b>the example I just mentioned is</b> <b>a specific company</b> <b>they have a company</b> <b>called build.ai</b>

<b>I actually really admire that kid</b> <b>named Eddy</b> <b>he took a few people and dropped out of Columbia</b> <b>then went and lived in a Shenzhen factory</b> <b>Ah</b> <b>and then</b> <b>built a startup like that</b> <b>I think that's so impressive</b> <b>right</b> <b>I think this is both about finding balance</b> <b>which I find challenging for myself</b> <b>but it's also a new opportunity</b> <b>I think maybe</b> <b>this era</b>

<b>Uh</b> <b>might not belong to the old guard</b> <b>nor to the young guns</b> <b>but rather to a generation of mid-career entrepreneurs</b> <b>You said no to Ilya (SSI founder) twice</b> <b>but said yes to LeCun</b> <b>Why is that?</b>

<b>What kind of person is he in your eyes?</b>

<b>oh right</b> <b>Yann</b> <b>is a fighter online</b> <b>right?</b>

<b>right?</b> <b>actually firmly opposed to the LLM camp</b> <b>well, it's not just opposing LLMs</b> <b>he actually doesn't oppose LLMs</b> <b>he's never said he opposes LLMs</b> <b>he's very</b> <b>he even says he uses Gemini himself</b> <b>he's completely fine with</b> <b>LLMs</b> <b>he just opposes</b> <b>the narrative that LLMs can lead to human-level</b> <b>intelligence</b> <b>that's the narrative he opposes</b> <b>that's what he pushes back on</b> <b>Mm-hmm</b> <b>he has no objection to LLMs at all</b>

<b>but anyway he's a fighter online</b> <b>constantly engaging in battles</b> <b>but I think</b> <b>privately he's a really wonderful person</b> <b>he's someone I</b> <b>genuinely admire and look up to from the heart</b> <b>Were you close before?</b>

<b>we collaborated on some papers</b> <b>but</b> <b>definitely not like being in a startup together</b> <b>as co-founders</b> <b>like</b> <b>working closely like this</b> <b>we hadn't done that before</b> <b>Are you close with Kaiming?</b>

<b>definitely not</b> <b>mm-hmm right</b> <b>Yes</b> <b>but I think</b> <b>Yann is someone</b> <b>who truly has</b> <b>a reality distortion field</b> <b>I think he's incredibly, incredibly impressive</b> <b>whenever I start to have doubts about something</b> <b>I always want to go have a chat with him</b> <b>he can easily make the people around him</b> <b>at least that's how I feel</b> <b>feel a sense of calm</b> <b>feel like, hey</b>

<b>these challenges aren't really challenges</b> <b>the road ahead is bright</b> <b>yes, he has that ability</b> <b>Mm-hmm</b> <b>and moreover</b> <b>of course</b> <b>his research vision</b> <b>I deeply admire as well</b> <b>admire</b> <b>like many of what I just mentioned</b> <b>such as what a world model is</b> <b>why we need to filter information</b> <b>this is essentially also JEPA</b> <b>the core of the JEPA idea he proposed</b> <b>is that you can't build a general model</b>

<b>you can't memorize everything</b> <b>and reconstruct it all</b> <b>you need to work in an abstract representation space</b> <b>to make predictions in an abstract representation space</b> <b>Mm-hmm</b> <b>that's the core of JEPA</b> <b>but what I want to say is</b> <b>Yann, I think, really practices what he preaches</b> <b>he himself is pretty JEPA as a person</b> <b>he consistently holds fast to</b> <b>his logical principles</b> <b>and the things he believes are right</b> <b>this</b>

<b>is undisturbed by anything external</b> <b>but this doesn't mean</b> <b>he's completely stubborn</b> <b>and won't listen to anyone</b> <b>that's not really the case</b> <b>sometimes he's been wrong</b> <b>sometimes he's been right</b> <b>he's right most of the time</b> <b>but he can actually take in what people say</b> <b>mm-hmm, and he also said</b> <b>there was a press piece about how Yann</b>

<b>can't be moved</b> <b>that Yann LeCun can never be moved</b> <b>right, no one can</b> <b>move him</b> <b>Oh</b> <b>meaning he's stubborn, right?</b>

<b>saying he's too stubborn</b> <b>Yann said</b> <b>I can absolutely be moved</b> <b>but I need to be moved based on facts</b> <b>not just because someone tells me what to do</b> <b>and I go do it</b> <b>that's when I'll be moved</b> <b>so back when he was at Meta actually</b> <b>Mm-hmm</b> <b>many people also told him</b>

<b>we at Meta are now going to build Large Language Models</b> <b>we need to do all these things</b> <b>you can't keep saying these things publicly anymore</b> <b>right?</b>

<b>right?</b> <b>you can't go around</b> <b>constantly dissing Large Language Models as not working</b> <b>Yann couldn't accept this at all</b> <b>Yann said</b> <b>my integrity as a scientist cannot accept this</b> <b>so I think this is something I deeply admire too</b> <b>I think he truly</b> <b>the things he says</b> <b>Mm-hmm</b> <b>aren't because something is now</b> <b>trending</b> <b>and then he goes and says it</b>

<b>everything can be traced back to its origins</b> <b>including his talk about world models</b> <b>he didn't just start talking about it because world models became popular recently</b> <b>it was also</b> <b>something he was already talking about many, many years ago</b> <b>and he also has a really great paper</b> <b>I</b> <b>I genuinely recommend it to everyone around me</b> <b>it's called</b> <b>"A Path Towards Autonomous Machine Intelligence"</b> <b>right</b> <b>it's his position paper</b>

<b>also an opinion paper</b> <b>and at that point you'll find</b> <b>there are many layers to his thinking</b> <b>these layers are presented in a very engineering-oriented</b> <b>and implementable</b> <b>or mathematically expressed form</b> <b>so you see, when people ask him</b> <b>Yann, what exactly is a world model</b> <b>he never</b> <b>says something vague and high-level</b>

<b>something relatively</b> <b>abstract and empty</b> <b>he'll always write out formulas for you</b> <b>Uh</b> <b>he always will</b> <b>still does now</b> <b>still does now</b> <b>and</b> <b>he still spends one day a week at NYU</b> <b>and still leads his own group</b> <b>he still holds group meetings</b> <b>during group meetings</b> <b>he walks up to the whiteboard</b> <b>and walks everyone through the equations</b> <b>step by step</b> <b>Mm-hmm</b> <b>highly technical</b>

<b>very, very technical</b> <b>right</b> <b>What's the division of responsibility between you two?</b>

<b>Yann is executive chairman</b> <b>so</b> <b>he's more like the captain of our big ship</b> <b>I also talked with him</b> <b>about this</b> <b>who's the captain</b> <b>he's the captain</b> <b>no, I'm not</b> <b>talking about who's the captain</b> <b>I don't want to be the captain</b> <b>right, right, right, but he said</b> <b>on one hand</b> <b>he really doesn't like</b> <b>managing day-to-day operational matters</b> <b>he's not a good CEO</b>

<b>but on the other hand I feel — you're not either</b> <b>right, I'm probably not either</b> <b>but I also think</b> <b>he's a very wise manager</b> <b>he gave me this example</b> <b>he said</b> <b>his management philosophy is like</b> <b>sailing a boat</b> <b>this</b> <b>by the way, that's one of his hobbies</b> <b>I can talk about it later</b> <b>his other interesting things</b> <b>but he has this hobby</b> <b>he's heading out in March</b> <b>to go sailing in the Caribbean again</b>

<b>he says his management style is</b> <b>giving everyone enough trust</b> <b>to let them do what they're supposed to do</b> <b>but once some turbulence arises</b> <b>right?</b>

<b>right?</b> <b>once we need to correct something</b> <b>he'll promptly</b> <b>Uh</b> <b>as early</b> <b>as possible make that adjustment</b> <b>right?</b>

<b>right?</b> <b>but before that</b> <b>trust everyone to do their work</b> <b>that is, believe in everyone</b> <b>to do what they're best at</b> <b>yeah, I think that's Yann's role</b> <b>he's for this company</b> <b>on one hand a kind of spiritual leader</b> <b>but on the other hand also</b> <b>navigating the open sea</b> <b>you need a helmsman</b> <b>he also has this</b>

<b>captain identity</b> <b>right and</b> <b>but I think what I feel about him</b> <b>I think</b> <b>what truly makes me feel</b> <b>I really enjoy working with this person</b> <b>is more personal reasons</b> <b>we've talked a lot</b> <b>these decisions aren't purely logical ones</b> <b>sometimes it still comes down to whether you click</b> <b>Mm-hmm</b> <b>it all comes down to people</b> <b>it all comes down to people</b> <b>right</b>

<b>like Yann, even though he really is a big shot</b> <b>you'll often see him at conferences</b> <b>holding out his phone</b> <b>taking selfies with everyone</b> <b>taking group photos</b> <b>and privately</b> <b>he's also a pretty pure and warm person</b> <b>right</b> <b>and being around him</b> <b>mainly I don't feel any sense of fear</b> <b>even though he's accomplished and distinguished</b> <b>mm-hmm, and then</b>

<b>I won't worry that I said something wrong</b> <b>and upset him</b> <b>I think that's actually quite rare</b> <b>especially given his status and standing</b> <b>to be like that</b> <b>and I can, or rather</b> <b>including everyone in this company</b> <b>can very directly tell him</b> <b>this is how I think about this</b> <b>I think you're right, or I think you're not right</b> <b>but let's discuss together</b>

<b>what way to move forward</b> <b>that would be best</b> <b>for this company</b> <b>I think</b> <b>that's also truly very rare</b> <b>right</b> <b>Tell us about your progress so far</b> <b>in terms of capital</b> <b>and team development</b> <b>of course by the time this is released</b> <b>it'll be after your announcement</b> <b>uh yes</b> <b>right uh</b> <b>I think in terms of capital</b> <b>Uh</b> <b>there's no way around it</b> <b>my world model</b>

<b>isn't sufficient to support making that kind of prediction</b> <b>but our target</b> <b>might be around one billion dollars</b> <b>right</b> <b>if that turns out to be wrong</b> <b>we'll just have to cut it</b> <b>[laughter]</b> <b>[laughter]</b> <b>[laughter]</b> <b>in terms of team composition</b> <b>we'll have many great partners</b> <b>like-minded people joining this company together</b> <b>so we'll start with around 25</b> <b>as an initial team</b>

<b>mm-hmm, and we hope to gradually grow the team</b> <b>we don't want to go too fast</b> <b>but not too slow either</b> <b>and in this there's actually so much</b> <b>I think</b> <b>I think that's part of the magic of building a startup</b> <b>because before, at big companies</b> <b>I would also, uh</b> <b>refer some friends from the past</b> <b>my students</b> <b>to join the company together</b> <b>but it was never really a unified thing</b>

<b>everyone went to different teams and did their own thing</b> <b>but</b> <b>but after starting a company</b> <b>I find</b> <b>you can truly bring everyone together</b> <b>Oh</b> <b>and find a shared mission like this</b> <b>Mm-hmm</b> <b>I think that's just so fascinating</b> <b>Mm-hmm</b> <b>and honestly I'm very moved by this myself</b> <b>because we have several friends</b>

<b>who actually have tens of millions of dollars in</b> <b>unvested OpenAI stock</b> <b>that they would give up by leaving OpenAI</b> <b>and also, say, at Google</b> <b>there are also several like this</b> <b>Uh</b> <b>not at Google</b> <b>at Meta</b> <b>there are also those 15 to 20 million dollar</b> <b>offers like that</b> <b>and everyone just, without even thinking</b>

<b>gave it all up</b> <b>to join us</b> <b>Why?</b>

<b>I think</b> <b>maybe we're all just a little crazy</b> <b>[laughs]</b> <b>it seems like</b> <b>the thing is, you need to</b> <b>consider, on one side is research</b> <b>and on the other side is financial outcome</b> <b>right, of course</b> <b>I think if a startup ultimately succeeds</b> <b>the upside can be very significant</b> <b>mm-hmm financially</b> <b>at least for now</b>

<b>I think most people are still very mission driven</b> <b>right and everyone still believes</b> <b>this is the only place</b> <b>where we can do this</b> <b>Have you already started</b> <b>thinking about business models?</b>

<b>Uh</b> <b>I think the reason for raising this much money</b> <b>might be partly to reduce some of that pressure</b> <b>but of course</b> <b>this is a serious company</b> <b>so our CEO</b> <b>and COO spend a lot of energy every day thinking about</b> <b>business model matters</b> <b>Mm-hmm</b> <b>right and, oh</b> <b>can I go back and talk about Yann again?</b>

<b>Sure!</b>

<b>oh right</b> <b>we'll see how to adjust it later</b> <b>but</b> <b>I think what I just said</b> <b>this thing about having a compatible spirit</b> <b>is really not a commercial decision at all</b> <b>right, and then I think</b> <b>mm-hmm, consistent with your mystical style of decision-making</b> <b>ah, of course</b> <b>of course the consideration is</b> <b>for example</b> <b>at the same time I would have had other opportunities too</b> <b>those opportunities</b> <b>might also have had much better</b> <b>short-term financial</b>

<b>returns</b> <b>Mm-hmm</b> <b>higher salary, higher returns</b> <b>but the way I've always thought about it is</b> <b>some people advised me</b> <b>go make money for two years first</b> <b>once you've made enough, come back and start a company — isn't that better?</b>

<b>Mm-hmm</b> <b>I partly agree, but I also worry</b> <b>right, at my current</b> <b>as</b> <b>at this stage of life</b> <b>do I still have two years</b> <b>in a good enough mental state</b> <b>to do this fully exploratory research</b> <b>Mm-hmm</b> <b>I think that's hard to say</b> <b>it's possible that once you have money</b> <b>your lifestyle</b>

<b>will change</b> <b>[laughter]</b> <b>and then</b> <b>this</b> <b>might also cause you to lose</b> <b>some of that original courage</b> <b>Oh</b> <b>and I think this is just for me personally</b> <b>I have many, many friends right now</b> <b>who are at Meta</b> <b>especially at Meta</b> <b>right everyone</b> <b>is actually making a lot of money</b> <b>they're also very competitive</b> <b>they work overtime every day too</b> <b>and basically everyone has moved near the office</b>

<b>working overtime every day</b> <b>seventy or eighty hours a week</b> <b>Yeah</b> <b>I think</b> <b>I also believe</b> <b>they will definitely build a great frontier model</b> <b>but I also want to say to them</b> <b>when you finish building that model</b> <b>mm-hmm, come check us out</b> <b>[laughter]</b> <b>I think yeah</b> <b>hopefully it's not too late</b> <b>but I think everyone I know</b>

<b>they all have this sense of mission</b> <b>right</b> <b>Meta FAIR's hiring strategy</b> <b>is it aligned with your hiring strategy?</b>

<b>uh, definitely not</b> <b>we don't have the money to hire like Meta FAIR does</b> <b>definitely different</b> <b>mm-hmm right</b> <b>or like Thinking Machines (the frontier AI lab founded by former OpenAI CTO Mira Murati)</b> <b>including xAI</b> <b>I think they're all very different</b> <b>right, I feel</b> <b>although in terms of fundraising scale</b> <b>it's actually pretty good</b> <b>right</b> <b>at least in the top few historically, right?</b>

<b>top few — what's the valuation?</b>

<b>I don't know, I don't know</b> <b>Valuation</b> <b>we haven't changed</b> <b>still 3 billion pre-money</b> <b>right</b> <b>[laughter]</b> <b>mm-hmm, but the money is actually not a lot</b> <b>right, this capital</b> <b>is still very, very precious</b> <b>it's not like being at Meta</b> <b>or at Google, where you really have a money-printing machine</b> <b>that just keeps printing money</b> <b>so it's okay, you can do</b> <b>whatever you want</b> <b>I think in a startup</b>

<b>we still need to be very, very careful in how we deploy resources</b> <b>I think you deliberately chose not to start up in Silicon Valley</b> <b>is that right?</b>

<b>uh yes</b> <b>I think</b> <b>Silicon Valley again</b> <b>it's very complicated</b> <b>people often say</b> <b>that it's already deeply mired in</b> <b>already hypnotized by Large Language Models</b> <b>[laughter]</b> <b>and I think</b> <b>I think</b> <b>Uh</b> <b>but I don't think this state of affairs will last very long</b> <b>people who are hypnotized will eventually wake up</b> <b>and I think</b> <b>at that point we</b> <b>we don't rule out at all setting up a company in Silicon Valley</b>

<b>I think in the end</b> <b>or maybe very soon</b> <b>our company's location will definitely be wherever the talent is</b> <b>that's where our company will be</b> <b>having an office</b> <b>that's a perfectly normal thing</b> <b>Mm-hmm</b> <b>right</b> <b>oh well, let me</b> <b>go back to Yann for a moment</b> <b>Sure. [laughter]</b>

<b>no, what I want to say is</b> <b>I think Yann</b> <b>one thing that really appeals to me is</b> <b>he's truly a multi-hyphenate</b> <b>or rather a quite artistic person</b> <b>or in Kaiming's words</b> <b>Yann is someone whose adolescence at 16</b> <b>has continued all the way to 65</b> <b>oh, that's wonderful</b> <b>oh I think</b> <b>I think he must be pretty happy</b> <b>but he often says with great pride</b> <b>he has four great hobbies</b>

<b>the first hobby is</b> <b>building model airplanes</b> <b>the second is astrophotography</b> <b>so on Zoom you often see behind him</b> <b>there's a nebula, right?</b>

<b>a nebula-like</b> <b>wallpaper</b> <b>desktop background</b> <b>which he actually photographed himself</b> <b>in his own backyard</b> <b>and his third interest is making electronic music</b> <b>and getting into some jazz</b> <b>and things like that</b> <b>mm-hmm</b> <b>and if you look at his webpage</b> <b>it's a treasure</b> <b>I often go look at it from time to time</b> <b>he talks about which jazz clubs in New York</b>

<b>yes, the better jazz spots</b> <b>which musicians are particularly good</b> <b>and he also says</b> <b>that generally speaking</b> <b>French people look down on American</b> <b>popular culture</b> <b>except for jazz</b> <b>so he talks about Charlie Parker</b> <b>and a whole series of people</b> <b>and how great these musicians are</b> <b>I find it so interesting</b> <b>mm-hmm</b> <b>and he has another hobby which is</b> <b>as I already mentioned</b> <b>sailing</b>

<b>so I think a person like this appeals to me</b> <b>actually very, very much</b> <b>because I think his world is actually very big</b> <b>his world isn't just limited to research</b> <b>and now we're going to build world models</b> <b>I hope, you know</b> <b>the helmsman of this big ship is someone with vision</b> <b>and a love of life</b> <b>[laughter]</b>

<b>and there's another very interesting example</b> <b>coming up in March</b> <b>maybe when this show airs</b> <b>we'll have another paper to release</b> <b>the paper is called Solaris</b> <b>Solaris (from Stanisław Lem's 1961 novel)</b> <b>this is actually a sci-fi novel</b> <b>a novel by Lem, and</b> <b>later adapted into a film by Tarkovsky</b> <b>and the reason we chose this name</b> <b>is because we're building a so-called</b> <b>video generation model</b>

<b>and the film is also about</b> <b>an ocean</b> <b>this ocean</b> <b>that can read the subconscious memories of people</b> <b>and ultimately materialize and generate things from them</b> <b>I think that's really fascinating</b> <b>of course</b> <b>in Tarkovsky's film</b> <b>the message is</b> <b>our greatest enemy</b> <b>is not some alien civilization</b> <b>or some unknowable ocean</b> <b>it is actually humanity itself</b>

<b>it is humanity's own suffering and memories</b> <b>so</b> <b>the ocean is just a projection of humanity onto itself</b> <b>I want to bring this up because</b> <b>I think this</b> <b>film parallels what happens with LLMs so closely</b> <b>I think LLMs may not actually be understanding humans</b> <b>it's just a projection of humanity</b> <b>just a reflection</b> <b>but what I want to say is</b> <b>in relation to Yann</b>

<b>one day I said to him, hey</b> <b>this paper of ours</b> <b>what do you think of this name?</b>

<b>and I wanted to see if he knew the film</b> <b>and he said, oh</b> <b>you know this is a film title, right?</b>

<b>I said yes</b> <b>that's exactly</b> <b>why I chose this name</b> <b>he asked me</b> <b>which version did you watch?</b>

<b>[laughter]</b> <b>the 1972 one</b> <b>or the one from the early 2000s?</b>

<b>I felt</b> <b>I found the right person</b> <b>was it the Tarkovsky one or</b> <b>the Soderbergh one, right?</b>

<b>and I said, OK</b> <b>I think, mm-hmm</b> <b>I don't just admire you for your research</b> <b>it seems you also know more than me about film</b> <b>mm-hmm</b> <b>I think</b> <b>that's one thing</b> <b>quite interesting</b> <b>might not matter to many people</b> <b>but it's quite important to me personally</b> <b>a reflection of personal charisma</b> <b>a Chinese investor once told me</b>

<b>all startups born with a silver spoon</b> <b>none of them have succeeded</b> <b>almost none</b> <b>what do you think?</b>

<b>Uh</b> <b>I don't know what silver spoon means here</b> <b>enormous fundraising</b> <b>I see</b> <b>very famous</b> <b>as a founder who is already accomplished</b> <b>and very highly accomplished</b> <b>Mm-hmm</b> <b>ah, we weren't born with a silver spoon</b> <b>as I said, we're completely</b> <b>I won't say a ragtag bunch</b> <b>it's a grassroots coalition startup model</b> <b>how could Yann LeCun be grassroots?</b>

<b>Yann</b> <b>is not grassroots</b> <b>but in the AI industry right now</b> <b>or on the internet</b> <b>including in front of investors</b> <b>often it's half support, half opposition</b> <b>I don't know what the exact ratio is</b> <b>but in any case</b> <b>he's not the kind of hero everyone rallies around</b> <b>he's someone who holds firm to himself</b>

<b>and always tries to do the next thing</b> <b>but that thing hasn't been proven yet</b> <b>like that</b> <b>mm-hmm right?</b>

<b>and I think</b> <b>this means we weren't born with a silver spoon</b> <b>we don't have a silver spoon</b> <b>we don't have that feeling at all</b> <b>I think we're an underdog</b> <b>we're underdogs</b> <b>we actually</b> <b>are surviving under a kind of industry pressure</b> <b>a company like that</b> <b>right?</b>

<b>that's so humble-bragging</b> <b>no, no</b> <b>there's no humble-bragging</b> <b>we may have raised a lot</b> <b>but compared to the resources LLMs are mobilizing now</b> <b>this is just</b> <b>I don't know what percentage, it's so far off</b> <b>Was it difficult to raise funding?</b>

<b>with Yann on board</b> <b>it really wasn't difficult</b> <b>right</b> <b>but I</b> <b>I think</b> <b>a seed round is just a seed round</b> <b>I think you have to look ahead</b> <b>right?</b>

<b>I think you have to see what comes next</b> <b>which is to say</b> <b>can we ultimately deliver on our mission</b> <b>can we</b> <b>achieve this research breakthrough</b> <b>I think</b> <b>that's the most critical thing for us</b> <b>but anyway I feel</b> <b>I really enjoy this underdog identity</b> <b>especially as an entrepreneur</b> <b>because I think</b> <b>it's the same as being a researcher</b> <b>the more you don't believe in me</b> <b>the happier I am</b> <b>Have you felt anyone not believing in you</b>

<b>since you started the company?</b>

<b>mm-hmm, I think many people</b> <b>a lot of investor feedback</b> <b>more disbelief</b> <b>or more belief?</b>

<b>Uh</b> <b>I don't know what the ratio is</b> <b>we have many, many people who believe in us</b> <b>we have many people who don't</b> <b>mm-hmm, many of our</b> <b>or in Silicon Valley most people don't believe us</b> <b>in the rest of the world most people believe us</b> <b>so putting it all together</b> <b>I don't know</b> <b>Uh</b> <b>but that's okay</b> <b>I think the thing I most want to see is</b> <b>right?</b>

<b>you can not believe in us</b> <b>but then let's see</b> <b>right well</b> <b>I'm all in on this path now</b> <b>are you with me?</b>

<b>Mm-hmm</b> <b>How do you think entrepreneurship compares to being a researcher?</b>

<b>What's different?</b>

<b>I think there are many similarities</b> <b>but also many differences</b> <b>mm-hmm, I think about entrepreneurship... do you ski, Xiaojun?</b>

<b>I don't</b> <b>you don't?</b>

<b>I don't like sports</b> <b>I couldn't ski before either</b> <b>but I've been skiing recently</b> <b>and I've gotten quite a lot of</b> <b>insight from it</b> <b>I think</b> <b>first, skiing is a sport about balance</b> <b>once you master the balance</b> <b>you can actually ski</b> <b>second, you have to be fearless</b> <b>and point your shoulders down the slope</b> <b>I think this is so counterintuitive</b>

<b>people are always afraid</b> <b>when you're facing the downhill slope</b> <b>you always want to lean back</b> <b>Mm-hmm</b> <b>counter-instinct</b> <b>yes, you go against instinct</b> <b>and once you follow your instinct</b> <b>you fall backward</b> <b>and you completely lose control</b> <b>and completely fall</b> <b>right?</b>

<b>only when you completely let go</b> <b>only with enough courage</b> <b>not fearing anything</b> <b>and pointing your shoulders down the slope</b> <b>do you actually become more stable</b> <b>right?</b>

<b>and you can actually control your speed better</b> <b>so</b> <b>there's a quote I really like</b> <b>right this</b> <b>it might be from</b> <b>somewhere</b> <b>from JoJo's</b> <b>the anime JoJo's Bizarre Adventure — it says the hymn of humanity is the hymn of courage</b> <b>I think that's also my understanding of entrepreneurship</b> <b>I think it requires courage</b> <b>but what you just asked</b> <b>is it the same in academia?</b>

<b>I think it requires even more courage</b> <b>but many of the decisions I made in academia</b> <b>mm-hmm, I think</b> <b>were also quite courageous decisions</b> <b>right?</b>

<b>and there's also this saying</b> <b>I think you never walk alone</b> <b>mm-hmm</b> <b>there'll be many people helping you</b> <b>Mm-hmm</b> <b>and precisely because you have people around you</b> <b>you become even braver</b> <b>Mm-hmm</b> <b>you just mentioned your taste in research</b> <b>what do you think about your taste in people?</b>

<b>First of all</b> <b>I don't think you should have a "taste" in people</b> <b>I think having a taste in people</b> <b>seems like a condescending way to put it</b> <b>Yeah</b> <b>How would you describe your ability to read people?</b>

<b>let me rephrase</b> <b>but I think it's also a mutual process</b> <b>mm-hmm, I think</b> <b>again, I think there's a kind of attraction</b> <b>that brings together people who can work together</b> <b>and we</b> <b>just need to follow that attraction</b> <b>to find those people</b> <b>and be with them</b> <b>right</b> <b>I don't think I would</b> <b>of course there will be some specific</b> <b>these</b> <b>metrics</b>

<b>we certainly have some</b> <b>like we're conducting interviews now</b> <b>I can't just say you don't need to interview</b> <b>mm-hmm, I have a set of mystical logic</b> <b>for hiring</b> <b>that's not realistic either</b> <b>Mm-hmm</b> <b>but I do care about</b> <b>Yeah</b> <b>certain things</b> <b>I think I care about</b> <b>whether you truly have that kind of</b> <b>desire to solve a problem</b>

<b>and the courage to want to understand something</b> <b>and that kind of persistence</b> <b>I think this matters for research</b> <b>and is also very important for entrepreneurship</b> <b>and when I recruit students</b> <b>I also need to be able to see</b> <b>this kind of</b> <b>personality in people</b> <b>Mm-hmm</b> <b>[laughter]</b> <b>so this</b> <b>what does it actually mean?</b>

<b>from the perspective of doing research</b> <b>it means</b> <b>if you have a problem in front of you right now</b> <b>Kaiming told me this too</b> <b>he said</b> <b>you should be thinking about the problem when you wake up</b> <b>thinking about it while eating</b> <b>thinking about it in the shower</b> <b>maybe you can stop thinking while sleeping</b> <b>or maybe you even sleep with it on your mind</b> <b>do you truly have that kind of</b>

<b>passion</b> <b>right?</b> <b>that drive to keep thinking about this problem</b>

<b>or are you just treating this</b> <b>as just a job</b> <b>I think</b> <b>it's something that distinguishes people from one another</b> <b>a yardstick</b> <b>Do you have that problem right now?</b>

<b>Yeah</b> <b>What kind of problem?</b>

<b>mm-hmm, the kind of problem you carry with you every day</b> <b>yes absolutely</b> <b>of course</b> <b>but my current issue is</b> <b>that's also why I feel</b> <b>uh in</b> <b>after spending a long time in academia</b> <b>it gets a bit difficult</b> <b>because in academia, functioning</b> <b>you need to do all kinds of</b> <b>what we call context switching</b> <b>you need to switch contexts, right?</b>

<b>because you have so many parts</b> <b>to manage</b> <b>and coordinate</b> <b>I think being in a startup is actually quite good</b> <b>I can now focus on one thing</b> <b>I can think about</b> <b>what kind of team we should build</b> <b>what kind of people this team needs</b> <b>what problems we should solve</b> <b>in the next 1 month, 3 months, 6 months</b> <b>or a year</b> <b>Mm-hmm</b> <b>I might not be thinking about this correctly</b> <b>but that's okay</b>

<b>as long as the entire team works together</b> <b>we can fail together</b> <b>pivot together</b> <b>then I think this company won't fail</b> <b>I can't guarantee</b> <b>every plan I have now is correct</b> <b>I don't think Yann can guarantee that either</b> <b>Mm-hmm</b> <b>but I still believe in people</b> <b>as you said</b> <b>I still believe that gathering these people</b> <b>with ideals and passion</b>

<b>who want to</b> <b>forge a new path together</b> <b>will definitely achieve something remarkable</b> <b>Did you agree on the spot?</b>

<b>LeCun?</b>

<b>no no no</b> <b>there was a long, long gap in between</b> <b>and Yann wasn't the first to approach me</b> <b>anyway later</b> <b>Yann took charge of recruiting the team</b> <b>so he also had to think about</b> <b>what role each person should have</b> <b>right, I think later we discussed together</b> <b>negotiated together</b> <b>and</b> <b>I think it was quite a long process</b> <b>and I think</b> <b>everyone eventually found their right place</b> <b>How long did you agonize over it?</b>

<b>from the first time he</b> <b>told you</b> <b>maybe about a week of agonizing</b> <b>What were you agonizing over?</b>

<b>whether I should start a company at all</b> <b>to do this</b> <b>whether I should do this with Yann</b> <b>Mm-hmm</b> <b>or</b> <b>maybe look for some new opportunities</b> <b>mm-hmm right?</b>

<b>and then later</b> <b>but I didn't agonize for very long</b> <b>right, I feel</b> <b>I thought, OK</b> <b>Yann used his magic</b> <b>I'll tell you all</b> <b>talking to Yann is kind of like</b> <b>he's a bit like</b> <b>it's like he's</b> <b>casting spells</b> <b>like Harry Potter</b> <b>casting some enchantments on you</b> <b>mm-hmm, he says some things</b> <b>[laughter]</b> <b>and you</b> <b>stop thinking about other things</b> <b>mm-hmm, what spell did he cast on you?</b>

<b>nothing really</b> <b>he just shared his vision</b> <b>he just explained</b> <b>why this was a better choice</b> <b>a better choice for me</b> <b>and also a better choice for this company</b> <b>why here</b> <b>I can have enough agency and autonomy</b> <b>the so-called ability to make independent decisions</b> <b>and build a team</b> <b>and help us design this entire</b> <b>execution</b> <b>roadmap</b> <b>I also</b>

<b>incredibly, incredibly grateful</b> <b>so grateful that Yann could give me that trust</b> <b>right</b> <b>but our company has several other co-founders</b> <b>everyone is really, really wonderful</b> <b>there are 6 co-founders in total</b> <b>oh, that many</b> <b>Yes</b> <b>and there's a CEO</b> <b>what else?</b>

<b>there's a CEO</b> <b>right</b> <b>there's also a COO</b> <b>there's a COO</b> <b>right and there's also</b> <b>VP of world models</b> <b>and then there's also</b> <b>whose current temporary title is CRIO</b> <b>who is also Chinese</b> <b>by the way, her name is Pascale</b> <b>Pascale Fung</b> <b>What kind of position is that?</b>

<b>Uh</b> <b>it's more of something between</b> <b>pure research and product</b> <b>a role at the alignment layer</b> <b>responsible for our innovation</b> <b>she also has a lot of entrepreneurial experience</b> <b>Mm-hmm</b> <b>and our VP of world models</b> <b>was the original JEPA team's</b> <b>director, Mike</b> <b>and the COO was formerly Meta's</b>

<b>VP for all of Southern Europe</b> <b>Mm-hmm</b> <b>roughly that kind of combination</b> <b>so</b> <b>definitely not a purely researcher-background combination</b> <b>Mm-hmm</b> <b>Will you explore consumer-facing products?</b>

<b>uh yes</b> <b>and the ultimate goal</b> <b>will definitely include a consumer-facing product</b> <b>but we hope</b> <b>we won't be under any pressure</b> <b>because we still want to first build this world model</b> <b>however you define it</b> <b>first make it happen</b> <b>How many years out can your roadmap realistically plan?</b>

<b>planning years out is unrealistic</b> <b>I think if we can plan to a year</b> <b>that's already pretty good</b> <b>right</b> <b>and I think we don't need longer-term planning</b> <b>Mm-hmm</b> <b>Can greatness not be planned?</b>

<b>uh yes</b> <b>it's just, I'm not</b> <b>it's just like doing research</b> <b>I think you need an exploration process</b> <b>start by exploring</b> <b>start doing things</b> <b>mm-hmm, then gradually find your own ideas</b> <b>I think</b> <b>this applies to startups too</b> <b>What do you think</b> <b>about where your ideas have progressed to?</b>

<b>I think we've reached the point where</b> <b>we now have things to work on</b> <b>and we also feel there will be some</b> <b>quite promising results coming soon</b> <b>that's where we are</b> <b>but this thing</b> <b>what specifically?</b>

<b>we can talk about it</b> <b>in a few months</b> <b>but coming back to it</b> <b>the thing is</b> <b>people outside have a misconception about this company</b> <b>and another misconception about Yann</b> <b>people actually don't know what JEPA is</b> <b>mm-hmm right</b> <b>[laughter]</b> <b>I personally also went through several phases</b> <b>from doubting JEPA, to understanding JEPA</b> <b>then to becoming JEPA</b> <b>those three life stages</b> <b>Mm-hmm</b> <b>[laughter]</b>

<b>I think this is also quite fun</b> <b>because at first, doubting JEPA</b> <b>was because we had just started doing self-supervised learning</b> <b>doing MoCo, doing MAE</b> <b>and I think</b> <b>JEPA seemed like yet another self-supervised learning algorithm</b> <b>that's it — then gradually understanding JEPA</b> <b>was because I felt JEPA actually</b> <b>goes deeper than we imagined</b> <b>there's a lot of underlying logic inside it</b>

<b>many mathematical principles</b> <b>and we also need someone on this path</b> <b>to keep persisting</b> <b>because what we discovered early on</b> <b>couldn't be scaled up</b> <b>so we stopped</b> <b>mm-hmm, and then</b> <b>but later with JEPA</b> <b>for example including me</b> <b>to give a simple example</b> <b>recently there was a paper called LeJEPA</b> <b>and with a very rigorous proof they showed</b> <b>if you want a good representation</b> <b>if you want this representation</b>

<b>to be agnostic to your downstream task</b> <b>then it must be an isotropic Gaussian distribution</b> <b>this is a bit technical</b> <b>essentially it means</b> <b>it's a characterization</b> <b>of a certain property of representations</b> <b>and I found</b> <b>this actually has merit</b> <b>truly becoming JEPA</b> <b>is because I feel JEPA is not a model</b> <b>JEPA is not a specific algorithm</b> <b>JEPA is a complete cognitive architecture</b>

<b>it's a cognitive system</b> <b>this</b> <b>in Yann's 2022 paper</b> <b>is what he wrote about</b> <b>so in my view, this cognitive system</b> <b>is a path to intelligence</b> <b>a universal intelligent agent's</b> <b>in my current view</b> <b>a very reasonable path</b> <b>so what JEPA requires</b> <b>JEPA is not just self-supervised learning</b>

<b>it needs world understanding capability</b> <b>it needs the ability to understand the world</b> <b>it needs the ability to make predictions</b> <b>it needs the ability to do planning</b> <b>mm-hmm right?</b>

<b>prediction and planning</b> <b>right</b> <b>I think</b> <b>this gave me new insights into JEPA</b> <b>and I found that JEPA actually isn't a specific</b> <b>as people outside tend to say</b> <b>like Yann has this method</b> <b>and he must stick to this method</b> <b>and turn it into something specific</b> <b>it's not like that</b> <b>JEPA is a very, very vast ocean</b>

<b>in this ocean there can be many, many ships</b> <b>sailing on it</b> <b>sailing</b> <b>[laughter]</b> <b>ultimately</b> <b>this entire system will have a lot of collaboration</b> <b>and LLMs are also part of it</b> <b>Mm-hmm</b> <b>so this makes me feel, mm-hmm</b> <b>this company can succeed</b> <b>and has a great chance of succeeding</b>

<b>the reason is it's not about shrinking things down</b> <b>under today's LLM settings</b> <b>everyone is narrowing things down</b> <b>but Yann's company is deliberately thinking big</b> <b>mm-hmm, he has enough space for us to explore</b> <b>to let us scale up</b> <b>until the very end</b> <b>we can achieve some kind of new breakthrough</b> <b>when exactly will this happen</b> <b>will it happen</b> <b>we can't predict</b> <b>but I feel</b>

<b>this is a path I'm willing to invest my life in</b> <b>to walk</b> <b>How does it feel after starting the company?</b>

<b>Your genuine feelings</b> <b>it's gotten busier and more tiring</b> <b>of course, definitely</b> <b>mm-hmm, there are lots of ups and downs</b> <b>there'll be</b> <b>a lot of tedious things</b> <b>but also because</b> <b>watching this company grow bit by bit</b> <b>watching some</b> <b>because we have 4 offices</b> <b>with so many legal issues</b> <b>whatever</b> <b>so much internal friction</b>

<b>slowly, what was originally</b> <b>internal friction</b> <b>gradually becomes smooth</b> <b>that process is actually quite enjoyable</b> <b>and in that process</b> <b>we also received help from many, many people</b> <b>so</b> <b>looking at it so far</b> <b>I think I made the right choice</b> <b>Mm-hmm</b> <b>maybe a bit different from your expectations?</b>

<b>maybe more optimistic</b> <b>Mm-hmm</b> <b>right, I feel</b> <b>the moment you jump, the fear disappears</b> <b>mm-hmm right</b> <b>I think as long as you have courage</b> <b>everything else is manageable</b> <b>and I feel in this company</b> <b>Ah</b> <b>I can find that courage</b> <b>Mm-hmm</b> <b>You just said AGI is a false premise</b> <b>can you elaborate on that?</b>

<b>AGI is a false premise</b> <b>this is also something Yann often says</b> <b>didn't he have a debate with Demis (DeepMind founder)?</b>

<b>right, he asked what exactly is general intelligence</b> <b>does general intelligence actually exist?</b>

<b>I won't go into too much detail on this</b> <b>but his logic here is also very mathematical</b> <b>very Yann</b> <b>what he says basically comes down to this</b> <b>a person has, for example</b> <b>2 million optic nerve fibers</b> <b>mm-hmm, and you can model this</b> <b>the space of all possible visual functions</b> <b>is actually enormously vast</b>

<b>it is</b> <b>as many as 2 to the power of 2 to the power of 200 functions</b> <b>but what humans can actually process</b> <b>and perceive</b> <b>is actually</b> <b>approaching zero</b> <b>right?</b>

<b>we are limited by our consciousness</b> <b>we are limited by our own neural</b> <b>bandwidth limitations</b> <b>we cannot see</b> <b>everything that happens in this world</b> <b>Mm-hmm</b> <b>so</b> <b>human intelligence is a very specialized intelligence</b> <b>it can only</b> <b>humans can only perceive what they can see</b> <b>Mm-hmm</b> <b>and later I also added a tweet about it</b> <b>I read a book</b>

<b>called "Are We Smart Enough to Know How Smart Animals Are?"</b>

<b>which asks whether we're smart enough</b> <b>to know how smart animals are</b> <b>Mm-hmm</b> <b>and after reading this book</b> <b>I let go of more of that human arrogance</b> <b>I think the evolution of intelligence</b> <b>is a continuous process</b> <b>it's not one where</b> <b>humans are truly unique</b> <b>right, we often say</b> <b>humans are intelligent</b> <b>because humans use tools</b> <b>but animals also use tools</b>

<b>and some people say</b> <b>humans actually have a certain</b> <b>self-awareness and consciousness</b> <b>one laboratory said</b> <b>humans can look in a mirror</b> <b>and recognize that the person in the mirror</b> <b>is themselves and not another entity</b> <b>can dogs?</b>

<b>they can too</b> <b>right, many animals can</b> <b>oh right?</b>

<b>because some animals can't</b> <b>dogs actually quite enjoy looking at themselves in mirrors</b> <b>[laughter] right</b> <b>anyway, many animals</b> <b>indeed can't</b> <b>but many animals can</b> <b>mm-hmm right?</b>

<b>and there are also many very interesting things</b> <b>like chimpanzees, right?</b>

<b>and this author</b> <b>so</b> <b>de Waal also wrote another book</b> <b>called "Chimpanzee Politics" (a 1982 classic of animal behavior)</b> <b>which is about</b> <b>four chimpanzees</b> <b>and how they engage in power struggles</b> <b>very much like House of Cards</b> <b>or how there's a lot of scheming</b> <b>how you form alliances</b> <b>then maneuver and rise to the top</b> <b>and so on</b>

<b>I think that's very interesting</b> <b>[laughter]</b> <b>and one thing that left a deep impression on me</b> <b>was that</b> <b>for example, they</b> <b>these animals actually</b> <b>including chimpanzees, also have a kind of theory of mind</b> <b>they can also have their own world model</b> <b>and their world models are actually quite good</b> <b>for example, there's an example where</b> <b>an experimenter is in a room</b> <b>with two boxes</b> <b>one box containing a banana</b> <b>the other containing an apple</b>

<b>the chimp is shown this</b> <b>then the boxes are closed</b> <b>[laughter]</b> <b>and the experimenter takes the chimp out</b> <b>after a long, long time</b> <b>it's brought back into the room</b> <b>and the first thing the chimp notices</b> <b>is that</b> <b>the experimenter is eating a banana</b> <b>and the chimp</b> <b>immediately</b> <b>goes straight to open the box with the apple</b>

<b>and eats the apple</b> <b>without even glancing at the banana</b> <b>so</b> <b>chimpanzees also have a kind of reasoning ability</b> <b>right?</b>

<b>and although language is indeed unique</b> <b>language is something only humans have</b> <b>but that doesn't mean other animals don't communicate</b> <b>they have their own language</b> <b>including</b> <b>like whales also have their own language</b> <b>anyway this is all quite fascinating</b> <b>I highly recommend that book</b> <b>[laughter]</b> <b>and there's also</b> <b>I read about some kind of bird (scrub jays)</b> <b>I forgot what they're called</b> <b>apparently they're very good at</b>

<b>if one is burying food</b> <b>burying food underground</b> <b>if it notices</b> <b>that one of its peers saw it happen</b> <b>it will first bury it there</b> <b>then wait for the peer to leave, dig it up</b> <b>and rebury it in a different spot</b> <b>I think that's quite interesting</b> <b>and of course we also know</b> <b>dogs have a keen sense of smell</b> <b>and bats navigate by hearing</b>

<b>I think the boundaries of intelligence are very broad</b> <b>people now talk about jagged intelligence</b> <b>so your world model</b> <b>which type of biological intelligence will it aim for first?</b>

<b>the goal is of course human intelligence</b> <b>human intelligence is certainly</b> <b>still</b> <b>at least in one dimension still the strongest</b> <b>or</b> <b>it's also what can most benefit the world</b> <b>Mm-hmm</b> <b>so we still want to build a world model</b> <b>toward human-like intelligence</b> <b>Mm-hmm</b> <b>but I just want to let go of human arrogance</b> <b>and recently I've been very inspired by this</b> <b>because I watched Rich Sutton</b>

<b>in this</b> <b>podcast</b> <b>talk about a theory</b> <b>because before I</b> <b>didn't know how to address this</b> <b>because people say</b> <b>LLMs are amazing, right?</b>

<b>LLMs can now write code</b> <b>can win gold at the IMO and IOI</b> <b>can help us go to the moon and Mars</b> <b>these things are incredible</b> <b>and I can't deny these things</b> <b>they really are impressive</b> <b>right?</b>

<b>but Rich Sutton's reply</b> <b>I think</b> <b>was very good — he replied</b> <b>you think these things are great and impressive?</b>

<b>that they're hard? well, feel free to think that</b> <b>because I don't think so</b> <b>I think building the intelligence of a squirrel</b> <b>is the hard problem</b> <b>once you have a squirrel's intelligence</b> <b>once you can build a squirrel's intelligence</b> <b>and make it survive in the real world</b> <b>with its own goals</b> <b>its own objectives</b> <b>its own intrinsic rewards as you described</b> <b>it knows hunger</b> <b>it has its own emotions</b>

<b>and it can engage in social activities</b> <b>after that, writing code, going to Mars, going to the moon</b> <b>those things would be the easy ones</b> <b>Good</b> <b>I'm gradually coming to strongly agree with this view</b> <b>if you set aside human arrogance</b> <b>building a squirrel's intelligence</b> <b>is actually a harder problem</b> <b>but that's not how it looks to humans</b> <b>from a human's</b> <b>perspective</b> <b>it doesn't seem that way</b>

<b>but that's entirely due to human arrogance</b> <b>you're also building human-level intelligence</b> <b>ah yes</b> <b>but what I mean is</b> <b>human intelligence has many, many aspects</b> <b>human intelligence is not just a language model</b> <b>human intelligence encompasses many types of intelligence</b> <b>that cannot be defined</b> <b>by language models or language itself</b> <b>right, I think that's a core insight</b> <b>What is your definition of intelligence?</b>

<b>mm-hmm, so as I was just saying</b> <b>Rich Sutton talked about this</b> <b>he feels that squirrel intelligence is the real intelligence</b> <b>I think his framing is a bit different</b> <b>he's not positioning from a human perspective</b> <b>looking at things from an anthropocentric view</b> <b>he's standing at the perspective of the universe</b> <b>and of the creator</b> <b>from this angle</b> <b>of course</b> <b>being able to recreate a squirrel</b> <b>is greater than what your</b>

<b>human civilization has created</b> <b>in the last 8 seconds of these 530 million years</b> <b>by far</b> <b>in this sense</b> <b>I think</b> <b>that's elevated the discussion</b> <b>I think that elevated perspective has merit</b> <b>but defining intelligence</b> <b>I don't want to give it a definition</b> <b>I think different animals have different intelligence</b> <b>and humans have human-level intelligence</b> <b>Mm-hmm</b>

<b>and what I want to encourage everyone to do</b> <b>don't only focus on what</b> <b>we as individuals cannot do</b> <b>pay attention to what we're already doing well</b> <b>pay attention to what a 4-year-old child</b> <b>or a child of a few years old does very well</b> <b>those things</b> <b>are actually what our world model</b> <b>next needs to focus on solving</b> <b>mm-hmm, so this is also</b> <b>why Robotics is ultimately</b> <b>a very fitting outlet</b>

<b>because before you talk about AGI</b> <b>or super intelligence</b> <b>can we first have a sufficiently reliable</b> <b>and general robot</b> <b>that can function in our home environment</b> <b>and help with household chores</b> <b>right, because a few-year-old child</b> <b>can actually do many, many household chores</b> <b>there's actually a list</b> <b>you can search for it online</b> <b>a 12-year-old child</b> <b>can basically do all the household chores</b>

<b>but is there a robot right now</b> <b>that can function like a 12-year-old child</b> <b>and handle these chores?</b>

<b>of course not</b> <b>Jie Tan from DeepMind</b> <b>he also says that robotic development is extremely uneven</b> <b>extremely imbalanced</b> <b>its developmental trajectory</b> <b>is different from a child's</b> <b>mm-hmm, for example</b> <b>the physical capabilities of robots' limbs</b> <b>have already surpassed humans</b> <b>Mm-hmm</b> <b>but many other capabilities are still not as good as a child's</b> <b>because of the brain</b> <b>nobody is building the brain</b>

<b>nobody is building a robot brain</b> <b>all the robotics startups</b> <b>including the robotics divisions at big companies</b> <b>haven't solved this</b> <b>Doesn't DeepMind count?</b>

<b>DeepMind is now entirely based on Gemini</b> <b>so it's also working within the VLA framework</b> <b>Yes</b> <b>everything converges to</b> <b>Gemini</b> <b>Oh</b> <b>but this needs a second half of pre-training</b> <b>Mm-hmm</b> <b>in Shunyu Yao's classic formulation</b> <b>[laughter] I think there needs to be a second half</b> <b>but I think this is the second half of pre-training</b> <b>Mm-hmm</b> <b>Jim Fan recently also expressed the same view</b>

<b>so this pre-training is the world model</b> <b>who will do this pre-training?</b>

<b>that's not clear to me</b> <b>if I knew</b> <b>there was another place that could also do this</b> <b>then I might actually reconsider</b> <b>I wouldn't necessarily need to be</b> <b>at this startup</b> <b>doing this myself</b> <b>right?</b>

<b>robotics startups</b> <b>have no energy to do this</b> <b>they need to put their resources</b> <b>into the so-called hardware</b> <b>scaling law</b> <b>that is</b> <b>you need to buy more robots</b> <b>to deploy these robots</b> <b>or do these things in simulators</b> <b>these imitation learning approaches</b> <b>that can give you a good enough</b>

<b>to solve some specific problems</b> <b>in the short term</b> <b>a robotics team that creates value</b> <b>What about PI (Physical Intelligence)?</b>

<b>VLA right?</b>

<b>PI is the same</b> <b>PI is already very, very research-oriented</b> <b>and doing very, very well</b> <b>and is inspiring</b> <b>as a company</b> <b>but again, they won't do pre-training</b> <b>they won't do pre-training</b> <b>they'll take</b> <b>language models as their foundation</b> <b>Yeah</b> <b>right?</b>

<b>How should we understand your second half of pre-training?</b>

<b>what goes in</b> <b>what comes out</b> <b>I don't know</b> <b>at least the first step is</b> <b>in the long run</b> <b>the inputs are all</b> <b>continuous-space signals as I just described</b> <b>high-dimensional</b> <b>potentially noisy signals</b> <b>Mm-hmm</b> <b>at first it might still be video</b> <b>but we might also have multi-modal encoders</b> <b>to handle different</b> <b>signals beyond visual</b> <b>and the outputs</b> <b>that's a research question</b>

<b>the self-supervised question is still unknown</b> <b>I</b> <b>not necessarily unknown</b> <b>but</b> <b>it may become clearer later</b> <b>Mm-hmm</b> <b>but</b> <b>I think</b> <b>it's definitely not that simple</b> <b>but I think that's where the excitement lies</b> <b>I also find it quite interesting</b> <b>because the first time we met</b> <b>you said "you are not the chosen one"</b> <b>"you are just the normal one"</b>

<b>why do you like saying this?</b>

<b>No</b> <b>you see, throughout our conversation we discussed my</b> <b>growth story</b> <b>I</b> <b>I didn't expect we'd talk about all this</b> <b>but</b> <b>I definitely don't feel like a chosen one</b> <b>[laughter]</b> <b>this quote is actually from a team I love</b> <b>Liverpool right?</b>

<b>I've been a Kopite (the Kop is the famous terrace at Anfield and a symbol of devoted Liverpool fans)</b> <b>for over 20 years</b> <b>[laughter]</b> <b>I think there's a kindred spirit there</b> <b>and my favorite manager</b> <b>Klopp</b> <b>Jürgen Klopp</b> <b>[laughter]</b> <b>he was half-joking when he said to everyone</b> <b>when another manager</b> <b>José Mourinho</b> <b>said "I am the special one"</b> <b>then Klopp said</b>

<b>"I'm not the special one"</b> <b>"I'm the normal one"</b> <b>and I think</b> <b>on one hand he himself is very punk</b> <b>he has that rock 'n' roll spirit</b> <b>[laughter]</b> <b>Uh</b> <b>and he often tells everyone</b> <b>that his role in the team</b> <b>is like a battery</b> <b>he hopes through his own passion</b> <b>and his own energy, you know</b>

<b>to empower others</b> <b>to generate their own electricity</b> <b>mm-hmm right</b> <b>I also want to be that kind of person</b> <b>I want to be a battery for a team</b> <b>whether that team is in academia</b> <b>or in a startup</b> <b>I think this is actually not easy</b> <b>because sometimes</b> <b>everyone has their moments of discouragement</b> <b>Mm-hmm</b> <b>I also want to</b> <b>complain more</b>

<b>and let out my feelings</b> <b>but I'm gradually coming to feel</b> <b>in academia, like in front of students</b> <b>and in front of the startup team</b> <b>someone needs to play that battery role</b> <b>or I think Yann is a giant battery</b> <b>he inspired me</b> <b>but I hope to pass this electrical charge through me</b> <b>and send it further</b> <b>What was the last time you felt discouraged, and why?</b>

<b>I feel discouraged every day</b> <b>I think it's become</b> <b>a kind of researcher's fate</b> <b>I think everyone has this underlying melancholy</b> <b>because the process of research inquiry</b> <b>is like groping around in a dark</b> <b>lightless place</b> <b>Mm-hmm</b> <b>when you can't see any light</b> <b>you always feel lost and discouraged</b>

<b>and when people truly feel</b> <b>this kind of joy</b> <b>it's only when you actually get something working</b> <b>but this part of the time</b> <b>is very, very brief</b> <b>maybe only 5% or 10%</b> <b>Kaiming has said something similar</b> <b>so over time</b> <b>right, eventually everyone's</b> <b>mental state can become concerning</b> <b>but I think it's okay</b> <b>I think</b> <b>Uh</b>

<b>I think this era now</b> <b>is still not quite the same as before</b> <b>I think now there's more discussion</b> <b>I think</b> <b>this is one of the benefits of this AI wave</b> <b>at least</b> <b>people won't feel</b> <b>like they're in a closed space</b> <b>exploring alone</b> <b>at least people can scroll through Xiaohongshu</b> <b>scroll through Weibo, Zhihu</b> <b>and see how everyone is discussing this</b>

<b>I think that's sometimes quite stress-relieving</b> <b>but sometimes it also adds pressure</b> <b>when people criticize you, you stop feeling that way</b> <b>Does your company have people with an entrepreneurial personality?</b>

<b>entrepreneurial personality</b> <b>generally quite optimistic</b> <b>I think Yann himself is very optimistic</b> <b>very, very optimistic</b> <b>why isn't he a researcher</b> <b>with that melancholy undercurrent?</b>

<b>hmm, I don't know</b> <b>because he's been through hardship</b> <b>and then succeeded</b> <b>Oh</b> <b>he lived through the AI winter</b> <b>and then showed everyone</b> <b>he was right</b> <b>and they were wrong</b> <b>if I went through something like that</b> <b>I might not be so melancholy either</b> <b>he's still quite optimistic</b> <b>I think</b> <b>or rather, his past experiences</b> <b>have also given him more confidence</b> <b>and something he often says is</b> <b>this</b>

<b>what happened before with deep learning neural networks</b> <b>is exactly the same</b> <b>which thing?</b>

<b>it's that now, world models</b> <b>or whatever you call it</b> <b>the current systems</b> <b>building intelligent systems now</b> <b>he says there's always a small group of people</b> <b>who can clearly see</b> <b>the trajectory of the world's development</b> <b>the progress of technology</b> <b>but they're only a small minority</b> <b>most people can't see it</b> <b>right</b> <b>because most people are busy doing other things</b> <b>back then with deep learning</b>

<b>people were doing whatever</b> <b>other things</b> <b>traditional machine learning</b> <b>mm-hmm, and now</b> <b>what you're doing is</b> <b>you can, mm-hmm</b> <b>let's not say it — think about it</b> <b>[laughter]</b> <b>and I think</b> <b>he's actually quite optimistic</b> <b>or rather he has</b> <b>enough confidence</b> <b>and says</b> <b>the things I can see are important things</b> <b>the path I can see</b> <b>is a clear path</b>

<b>and on this matter</b> <b>I still believe him quite a lot</b> <b>Have you ever doubted him?</b>

<b>Uh</b> <b>as I said</b> <b>I questioned JEPA</b> <b>then understood JEPA</b> <b>then became JEPA</b> <b>so of course there was doubt</b> <b>but I feel that trust in a person</b> <b>and trust in a research direction</b> <b>takes time</b> <b>I was just telling students the other day</b> <b>every time Yann gives a talk</b> <b>he gives exactly the same talk</b> <b>his slides</b> <b>are honestly pretty ugly</b> <b>[laughter]</b> <b>[laughter]</b>

<b>but they have his personal style</b> <b>style and design</b> <b>is also interesting</b> <b>some things are originally ugly</b> <b>but if you use them enough</b> <b>and time passes</b> <b>they become the new fashion</b> <b>but</b> <b>every time he gives that same talk</b> <b>I've been feeling this very, very strongly recently</b> <b>I said</b> <b>this talk</b> <b>I've watched it at least 10 times</b>

<b>20 times now, but each time I get something new</b> <b>every time I feel</b> <b>like I understand a bit more what he really means</b> <b>and this</b> <b>this deeper understanding</b> <b>is not because I've watched the same content 10 or 20 times</b> <b>and got this new understanding</b> <b>it's because</b> <b>I'm doing what I want to do</b> <b>Mm-hmm</b> <b>and I find</b> <b>that is</b>

<b>when watching his talk</b> <b>each time I do this translation work</b> <b>and association work</b> <b>I find</b> <b>that what he said</b> <b>in my current understanding</b> <b>can be interpreted this way</b> <b>and it doesn't conflict at all with</b> <b>even today's large language model or multimodal paradigms</b> <b>everything</b> <b>Yann says can be clearly mapped onto</b>

<b>what we're doing now</b> <b>concretely</b> <b>and guide us</b> <b>to perhaps escape some local optimum</b> <b>[laughter]</b> <b>and perhaps lead to a different future</b> <b>mm-hmm, so it's become an inspiration</b> <b>right?</b>

<b>it's not just knowledge</b> <b>it's an inspiration</b> <b>Mm-hmm</b> <b>so I think that's also wonderful</b> <b>Mm-hmm</b> <b>we just talked a lot about world models</b> <b>do you have any new thoughts on your world model</b> <b>for the real world?</b>

<b>In the past year or two</b> <b>I think this thing must definitely</b> <b>go beyond the limitations of research</b> <b>the limitations of being a researcher</b> <b>it must enter real life</b> <b>and</b> <b>understand what's happening in the real world</b> <b>but I think New York is very different</b> <b>I go to work every day</b> <b>first, I don't have to drive</b> <b>so I've already started to emerge</b>

<b>from a kind of armor</b> <b>and enter real life</b> <b>by walking</b> <b>this</b> <b>I think has many</b> <b>wonderful effects</b> <b>for example</b> <b>some days I'm still under quite a lot of pressure</b> <b>sometimes something happens</b> <b>and it's quite discouraging</b> <b>but every time I walk through</b> <b>from my home to my office at school</b> <b>there's a park called Washington Square Park</b>

<b>Washington Square Park</b> <b>[laughter]</b> <b>inside there are all kinds of people</b> <b>all sorts</b> <b>everyone living their own lives</b> <b>there are street performers playing piano</b> <b>dancers</b> <b>mothers pushing strollers</b> <b>old men playing chess</b> <b>and young people sitting on the steps doing nothing</b> <b>daydreaming</b>

<b>and NYU students studying with laptops</b> <b>[laughter]</b> <b>I think my most stress-relieving moments every day are</b> <b>this roughly 5 to 10 minute walk</b> <b>I find</b> <b>the world is much bigger than we imagine</b> <b>not everyone cares about what AI is</b> <b>they may not care about this at all</b> <b>and they have their own lives</b> <b>the world is big</b> <b>but on the other hand</b>

<b>maybe AI someday in the future</b> <b>will indeed affect their lives</b> <b>so what should we actually be doing?</b>

<b>as researchers</b> <b>do we have some kind of social responsibility?</b>

<b>but this might be getting a bit far-reaching</b> <b>but I just feel</b> <b>more contact with people</b> <b>more contact with people living in this world</b> <b>helps me understand what AI is</b> <b>and how to build the next generation of AI</b> <b>in new ways</b> <b>and this</b> <b>is exactly what Ilya wanted to talk about when he called me</b> <b>what he wanted to discuss</b> <b>but I hadn't arrived at these insights yet</b> <b>Have you picked up any new hobbies?</b>

<b>New hobbies</b> <b>In New York?</b>

<b>right</b> <b>no real new hobbies</b> <b>I think</b> <b>skiing counts as one</b> <b>most other times</b> <b>I genuinely don't have time</b> <b>but the nice thing about New York is</b> <b>you know that once you go out</b> <b>you can find a new hobby</b> <b>that itself</b> <b>is enough to make me happy</b> <b>whether or not I actually have time to step out</b>

<b>and do those things</b> <b>Mm-hmm</b> <b>having that possibility available</b> <b>I think is quite different</b> <b>and very different from the Bay Area</b> <b>Can you share</b> <b>aside from work</b> <b>what music you like</b> <b>books you enjoy</b> <b>films and games you enjoy?</b>

<b>Right now</b> <b>Yeah</b> <b>that's hard to think about</b> <b>off the top of my head I'm not sure</b> <b>I think let me approach this through AI</b> <b>let me think about what I've watched recently</b> <b>let me think</b> <b>Mm-hmm</b> <b>I actually enjoy watching TV shows</b> <b>so I can recommend some shows</b> <b>for everyone</b> <b>Mm-hmm</b> <b>there's a show called POI</b> <b>it's also quite an old show</b> <b>Person of Interest</b>

<b>I watched this many years ago</b> <b>in it</b> <b>they discuss what a super intelligence is</b> <b>you have a good super intelligence</b> <b>and a bad super intelligence</b> <b>their competition</b> <b>and the threat to human society</b> <b>and I think</b> <b>I won't spoil it</b> <b>but it's quite multi-modal</b> <b>and this might</b> <b>have a certain prophetic quality</b>

<b>I think it's quite remarkable</b> <b>mm-hmm right</b> <b>at its core it's about how</b> <b>an AI in a box</b> <b>a language model</b> <b>or</b> <b>an agent that can write code</b> <b>step by step breaks free</b> <b>and becomes a multi-modal model</b> <b>I think everyone should check it out</b> <b>and later there's also</b> <b>something I really like</b> <b>like Pantheon (American animated series)</b> <b>it's also</b> <b>I think a kind of AI prophecy</b> <b>yes, it's an animation</b>

<b>its author is Ken Liu (Chinese-American science fiction writer)</b> <b>he's also from my hometown</b> <b>and he's also someone who</b> <b>worked as a lawyer</b> <b>worked as a programmer</b> <b>and ultimately became</b> <b>a novelist</b> <b>like that</b> <b>incredibly impressive</b> <b>I admire him greatly</b> <b>and I love reading his books too</b> <b>right</b> <b>but this show was also recommended by Sam Altman before</b> <b>so many people have seen it</b> <b>and also</b>

<b>recently of course there's this very popular film, Companion</b> <b>I think this is also a kind of AI prophecy</b> <b>the slightly troubling thing now is</b> <b>popular culture has been too saturated with AI</b> <b>making everything seem AI-related</b> <b>it's a bit overwhelming</b> <b>but</b> <b>maybe it's just because I'm an AI professional</b> <b>so sometimes</b>

<b>it feels different</b> <b>but I think</b> <b>these things are still quite inspiring</b> <b>including the sci-fi novels I mentioned</b> <b>including these older films</b> <b>I think</b> <b>they may all be a kind of prophecy about reality</b> <b>but generally speaking</b> <b>these</b> <b>works of film and TV</b> <b>don't point toward a very bright future</b> <b>usually the endings are quite bleak</b>

<b>Mm-hmm</b> <b>ah, I recently watched a film</b> <b>I think it's called No Other Choice</b> <b>a film by Park Chan-wook</b> <b>and it's also about AI's alienation of humanity</b> <b>throughout the entire film</b> <b>it never mentions anything about AI</b> <b>until the very end</b> <b>but the whole thing is about</b> <b>the changes brought about by AI's arrival</b> <b>what changes humans have undergone</b> <b>people's mindsets</b> <b>relationships between people</b>

<b>what exactly has changed</b> <b>I think these things are also instructive</b> <b>and speaking of</b> <b>one last word on films</b> <b>welcome everyone to come to New York</b> <b>in New York</b> <b>I used to attend one film festival</b> <b>the New York Film Festival</b> <b>with many films to watch</b> <b>now I'll be going to two</b> <b>the second one is</b> <b>the AI film festival Runway holds every year</b> <b>and I think it's very cool and interesting</b> <b>if I were to recommend one</b>

<b>very relevant to everything we just talked about</b> <b>one that won their grand prize this year</b> <b>an AI film called Total Pixel Space</b> <b>[laughter]</b> <b>I won't spoil it</b> <b>anyway</b> <b>this is a very interesting AI short film</b> <b>and it actually talks about a lot of</b>

<b>what we just discussed</b> <b>about world models</b> <b>or why human intelligence</b> <b>is not simply</b> <b>or is not</b> <b>purely general intelligence</b> <b>some arguments</b> <b>I think it's quite fun</b> <b>mm-hmm, each of our guests</b> <b>recommends a life-changing book to our audience</b> <b>one that has truly influenced you</b> <b>and changed you</b> <b>what would yours be?</b>

<b>a book? mm-hmm</b>

<b>that's hard — you have to let me think</b> <b>Mm-hmm</b> <b>one book</b> <b>I guess people often recommend</b> <b>but</b> <b>the reason this book changed my life</b> <b>I wouldn't say it changed my life hugely</b> <b>but it was during my undergraduate years</b> <b>a collective memory</b> <b>everyone would read</b> <b>this book called GEB</b> <b>have you heard of it?</b>

<b>which is Gödel, Escher, Bach</b> <b>the Chinese title is "GEB: An Eternal Golden Braid"</b> <b>it talks about philosophy</b> <b>mathematical logic</b> <b>and these three people, right?</b>

<b>Gödel, Bach, and Escher, right?</b>

<b>a mathematician</b> <b>a composer</b> <b>and a</b> <b>painter mm-hmm</b> <b>what philosophical commonalities they share</b> <b>you could put it that way</b> <b>right</b> <b>and it's very interesting</b>

<b>because during our undergraduate days</b> <b>the book is this thick</b> <b>we studied it together as a group</b> <b>it was also recommended by our teacher</b> <b>so everyone studied it together</b> <b>and actually back then nobody really understood it</b> <b>but later it started feeling more and more</b> <b>mm-hmm, like it makes sense</b> <b>Mm-hmm</b> <b>this book</b> <b>I think</b> <b>if you don't have time to read every page carefully</b>

<b>you can read an abridged version</b> <b>or some kind of summary</b> <b>some of its ideas</b> <b>I find very, very interesting</b> <b>and also</b> <b>there's a book</b> <b>this one was probably also read during undergrad</b> <b>called Zen and the Art of Motorcycle Maintenance</b> <b>or is it motorcycle repair</b> <b>"Zen and the Art of Motorcycle Maintenance: An Inquiry into Values"</b> <b>I think it's called that</b> <b>right</b>

<b>and this book is also a process of inner seeking</b> <b>it's about a person riding a motorcycle</b> <b>with</b> <b>this might be a spoiler</b> <b>an imagined</b> <b>philosopher</b> <b>but this philosopher is actually a projection of himself</b> <b>mm-hmm, my feeling reading this book was</b> <b>I also</b> <b>didn't fully understand what he was saying</b> <b>right mm-hmm</b>

<b>but some books and films fill you up</b> <b>and some books or films empty you out</b> <b>my feeling after finishing this book was</b> <b>it kind of emptied me out</b> <b>Oh~</b> <b>and it made me feel</b> <b>Mm-hmm</b> <b>right, this gets abstract again</b> <b>anyway, it made me feel</b>

<b>Uh</b> <b>it made me sense</b> <b>what truly matters in this world</b> <b>what doesn't</b> <b>for you</b> <b>what matters</b> <b>what doesn't</b> <b>I don't know</b> <b>I think I'm always looking for that balance</b> <b>I think, mm-hmm</b> <b>I think</b> <b>genuine communication between people is important</b> <b>perhaps nothing else matters</b> <b>but at any given moment</b> <b>if you ask me this question</b> <b>I might say</b> <b>entrepreneurship is important</b> <b>research is important</b>

<b>but at the end of the day</b> <b>I still believe</b> <b>that communication between people is what matters</b> <b>it sounds like you want to do research also for the sake of connection</b> <b>uh yes</b> <b>I think so</b> <b>and I think</b> <b>research itself is also a form of deeper connection</b> <b>Mm-hmm</b> <b>Mm-hmm</b> <b>this</b> <b>actually helped us during fundraising</b> <b>too</b> <b>why not?</b>

<b>an investor was very willing to invest in us</b> <b>and his reason was that</b> <b>someone he knew, a very strong entrepreneur</b> <b>who is also a researcher</b> <b>said, hey</b> <b>you absolutely must invest in Saining</b> <b>and whatever it takes</b> <b>we need to help him</b> <b>but</b> <b>I had only met this person once at a meeting</b> <b>who was it?</b>

<b>Who?</b>

<b>Robin</b> <b>Robin Rombach</b> <b>he's the</b> <b>first author of Stable Diffusion</b> <b>and the current CEO of Black Forest Labs</b> <b>Oh</b> <b>right</b> <b>Flux, right?</b>

<b>[laughter]</b> <b>so</b> <b>the investor told me</b> <b>the reason he did this</b> <b>is that this kind of trust</b> <b>is built on your academic work</b> <b>this trust</b> <b>can sometimes even surpass</b> <b>genuine personal</b> <b>connection</b> <b>Oh</b> <b>people get to know you through your work</b> <b>and this</b> <b>carries forward</b>

<b>and can go very far</b> <b>What do you think of Seedance?</b>

<b>Seedance is incredibly impressive</b> <b>it really impressed</b> <b>even our film crew today</b> <b>I think it's extremely strong</b> <b>[laughter]</b> <b>I've heard it's also a very, very large model</b> <b>and it's a MoE model</b> <b>I don't know if this rumor is true</b> <b>because before this</b> <b>I know</b> <b>nobody had been able to make MoE work</b> <b>within a Diffusion Model architecture</b> <b>if they truly managed to do</b> <b>200 billion parameters</b>

<b>and with an MoE architecture</b> <b>and they were able to ingest all that data</b> <b>I think</b> <b>that's incredibly, incredibly impressive</b> <b>Mm-hmm</b> <b>but for all these generative models</b> <b>90% is still a data problem</b> <b>architecture doesn't matter much</b> <b>90%, or let me say 95%</b> <b>it's all a data problem</b> <b>mm-hmm, their data is inherently abundant</b> <b>but volume alone isn't enough</b>

<b>Mm-hmm</b> <b>they must have done enormous work</b> <b>to clean the data</b> <b>to do captioning</b> <b>to calibrate the data distribution</b> <b>their diversity-quality balance</b> <b>as well as their prompt alignment with language</b> <b>the degree of that</b> <b>I believe</b> <b>a large number of people must have been involved in this work</b> <b>and done an enormous amount</b> <b>right</b>

<b>but once you've done all these things well</b> <b>subsequent things</b> <b>become much simpler</b> <b>but I think</b> <b>I think Seedance is very impressive</b> <b>I think</b> <b>including Sora</b> <b>including Veo</b> <b>wanting to surpass them</b> <b>I don't think it's necessarily that simple</b> <b>Our studio is called Language and World Studio</b> <b>what comes to mind when you hear that name?</b>

<b>what are you thinking?</b>

<b>I see you wrote me a line:</b> <b>"let go of Wittgenstein"</b> <b>well, that's not a great way to end</b> <b>I'm going to start complaining again</b> <b>right, go ahead</b> <b>go ahead and complain</b> <b>I say "let go of Wittgenstein"</b> <b>because people shouldn't take Wittgenstein</b> <b>and really stretch him</b> <b>taking "the limits of my language</b> <b>mean the limits of my world"</b> <b>and using that quote as an endorsement for LLMs</b> <b>or linguistic determinism</b>

<b>so that's completely absurd</b> <b>and likewise</b> <b>there are other quotes</b> <b>like people citing Feynman</b> <b>Feynman said what I cannot create</b> <b>I do not understand</b> <b>this</b> <b>being used to endorse unified models</b> <b>I think</b> <b>both of these things are really unacceptable to me</b> <b>what's the first thing?</b>

<b>the first is Wittgenstein, right?</b>

<b>when he said the limits of my language</b> <b>mean the limits of my world</b> <b>there were strong preconditions</b> <b>in his Tractatus Logico-Philosophicus</b> <b>the language he referred to</b> <b>covers only what can be captured in propositions</b> <b>the part of the world that can be described</b> <b>and this does not represent</b>

<b>the entirety of what we call the world</b> <b>[laughter]</b> <b>so</b> <b>first, the language he spoke of</b> <b>and the world he spoke of</b> <b>are already different from the language in today's LLMs</b> <b>and the world it refers to</b> <b>second, in his later period Wittgenstein</b> <b>had completely overturned his earlier</b> <b>entire philosophical system</b> <b>he later stopped saying that</b> <b>and what he talked about instead was</b>

<b>language is actually a game</b> <b>the so-called concept of language games</b> <b>meaning language itself has no inherent meaning</b> <b>these symbols themselves have no meaning</b> <b>the reason they acquire meaning</b> <b>is because they are connected to real-world practice</b> <b>and engaged with it</b> <b>Mm-hmm</b> <b>and this is very much the world model view</b> <b>that is</b> <b>we're not saying</b> <b>that language can perfectly</b>

<b>represent the entire world</b> <b>what we're saying is that the world's practice</b> <b>the world's actions determine the game of language</b> <b>its intension and extension</b> <b>mm-hmm again</b> <b>I don't understand philosophy</b> <b>I don't understand Wittgenstein either</b> <b>but I just don't like seeing people's papers</b> <b>open with a pulled famous quote</b>

<b>I think that doesn't fit my aesthetic sensibilities</b> <b>the Feynman quote is the same</b> <b>mm-hmm, he said</b> <b>what I cannot create</b> <b>I do not understand</b> <b>that quote itself is not wrong</b> <b>but the create and understand he's referring to mean</b> <b>for example, we have a world</b> <b>we want to understand this world</b> <b>we want to transform this world</b> <b>we want to understand the world</b> <b>through transforming it</b>

<b>whatever</b> <b>the things he was talking about</b> <b>are still within a real, concrete world</b> <b>requiring some kind of action</b> <b>mm-hmm, even when you're in class</b> <b>you go and make a PowerPoint</b> <b>you're still engaged in a process of creation</b> <b>but now many people take this quote</b> <b>and use it to make this kind of, uh</b> <b>endorsement for some simple unified system</b> <b>that's logically untenable too</b>

<b>we can't simply reduce creation</b> <b>to a diffusion model</b> <b>its backpropagation loss</b> <b>that's completely absurd</b> <b>mm-hmm right?</b>

<b>so</b> <b>I don't know</b> <b>I think</b> <b>maybe it's like when I was a kid</b> <b>overusing famous quotes in essays</b> <b>now seeing these things gives me a bit of PTSD</b> <b>and I think as Kaiming said</b> <b>everyone should read more philosophy</b> <b>I think that's quite worthwhile</b>

<b>mm-hmm, at the very start you said you believe in fate</b> <b>and believe in it more and more</b> <b>where do you feel fate is pushing you now?</b>

<b>Ah</b> <b>I don't know</b> <b>is fate pushing me?</b>

<b>it doesn't seem like it</b> <b>I think</b> <b>there's no feeling of being pushed by fate</b> <b>mm-hmm just</b> <b>mm-hmm, when the next time I need to make a choice comes</b> <b>I just hope for good fortune</b> <b>Is this world a giant world model?</b>

<b>of course the world is a giant world model</b> <b>can you predict fate then?</b>

<b>uh, I don't think so</b> <b>why not?</b>

<b>Mm-hmm</b> <b>because we don't have enough resources</b> <b>Oh</b> <b>you'd need a computer as large as the Earth</b> <b>or you'd need a computer</b> <b>the size of the entire universe</b> <b>to tell you the answer about life</b> <b>about the universe</b> <b>about anything</b> <b>and the answer might ultimately be 42</b>
