
A 7-hour marathon interview with Saining Xie: World Models, AMI Labs, Yann LeCun, Fei-Fei Li, and 42

By Zhang Xiaojun Podcast

Summary

## Key takeaways

- **Turned down OpenAI for FAIR twice**: Xie Saining rejected OpenAI offers from Ilya Sutskever twice, once in 2018 after a personal call and again in 2024 after another outreach, choosing FAIR and then NYU because he prioritized working with specific vision researchers like Kaiming He over LLM paths. [01:00:21], [01:21:09]
- **B-class trajectory beats elite path**: Xie describes his career as a "B-class trajectory", not the top-high-school-to-top-PhD path of elite competitors, but insists that decisions like skipping MSRA for the NUS vision lab were driven by passion, leading to success through persistence. [09:42:10], [22:29:22]
- **Childhood sparked vision obsession**: From age four, extensive travel with his mother and reading his father's vast bookshelves shaped Xie's open worldview; by nine, computers and internet forums ignited self-expression, but vision captivated him because humans are "visual creatures": seeing an image activates roughly 70% of the brain. [05:09:05], [25:07:25]
- **Research is a nonlinear infinite game**: Great research follows nonlinear paths with long exploration, sudden pivots, and failures that provide key gradients; it is an "infinite game" where you only need one signature hit like ResNet amid many decent papers, optimizing for the career maximum, not the average. [01:03:28], [02:15:31]
- **Self-supervised vision scaling failed**: MoCo and MAE achieved breakthroughs in self-supervised vision representations, outperforming ImageNet pretraining on downstream tasks, but neither scaled reliably; Kaiming pushed massive TPU infrastructure yet progress stalled, highlighting the limits before the multimodal era. [01:58:36], [02:28:46]
- **DiT unifies diffusion architectures**: DiT replaced inefficient U-Nets with a scalable ViT-based design for diffusion models, showing superior efficiency and scaling; despite an initial CVPR rejection for "lacking novelty," it became the backbone for Sora and most video models after its eventual Oral acceptance. [03:03:54], [03:07:14]

Topics Covered

  • B-Class Trajectories Outperform Elite Paths
  • Vision Drives Cambrian Explosion
  • Follow Heart Over Rankings
  • Research Nonlinearity Yields Breakthroughs
  • Representation Learning Core of Intelligence

Full Transcript

This subtitle was translated by AI. We cannot guarantee its accuracy and it is provided for entertainment purposes only.

<b>Hello everyone</b> <b>I'm Xiaojun</b> <b>In this episode, we have come to New York, USA</b> <b>It is the Chinese New Year right now</b> <b>New York just had a heavy snowfall</b> <b>This is the coldest winter New York has had in years</b> <b>The streets are still covered with unmelted ice and snow</b> <b>But today's conversation</b> <b>gave me a feeling of</b> <b>the warmth of everyday life after the thaw</b> <b>Sitting across from me today</b>

<b>is young scientist Xie Saining</b> <b>He has just embarked on an entrepreneurial journey together with Turing Award winner Yann LeCun</b> <b>Their new lab, AMI Labs</b> <b>has just completed its first mega-scale funding round</b> <b>The team currently has 25 members</b> <b>Xie Saining has always told me</b> <b>he is not the "chosen one"</b> <b>he is the ordinary one</b>

<b>And now, here is my interview with Xie Saining</b> <b>Ilya called me</b> <b>and I didn't say anything</b> <b>I just turned down OpenAI</b> <b>They sent me an offer</b> <b>and I said I'm not going, sorry</b> <b>But wherever there is love, there must also be hate</b> <b>They are two sides of the same coin</b> <b>[laughter]</b> <b>This morning we are in New York</b> <b>shooting B-roll in Brooklyn</b> <b>I really like it here</b> <b>Because I live near Times Square</b> <b>I think that area</b>

<b>is still a very stereotypical New York</b> <b>But coming here</b> <b>feels like a New York full of artistic vibe</b> <b>and lively neighborhood energy</b> <b>Yeah</b> <b>I think this area of Dumbo is of course very artistic</b> <b>Right, in many films</b> <b>There was a Korean film called Past Lives</b> <b>In that film, you may have seen</b> <b>the carousel</b> <b>And the Dumbo bridge over there, right</b> <b>Only tourists go to Times Square</b> <b>I am a tourist</b>

<b>Real New Yorkers would never go</b> <b>But actually the area near NYU is also really good</b> <b>That area is called</b> <b>Greenwich Village</b> <b>And that area is also a "village"</b> <b>And that area also has a great neighborhood vibe</b> <b>Why did you come to New York to do academia?</b>

<b>That doesn't seem like a choice many people make</b> <b>Well, not really</b> <b>But there is quite a long history</b> <b>That is true</b> <b>Various reasons</b> <b>I think</b> <b>Of course</b> <b>Also because I genuinely yearned for this city</b> <b>Right</b> <b>I longed for many elements of this city</b> <b>The people here</b> <b>And including NYU</b> <b>That was also part of it</b> <b>And of course the main reason was still Yann (Yann LeCun, Turing Award winner and Executive Chairman of AMI Labs)</b>

<b>And the AI efforts here</b> <b>Right</b> <b>NYU actually does quite well</b> <b>But on the other hand</b> <b>NYU also has a very strong film school</b> <b>And many directors I admire</b> <b>Like Martin Scorsese</b> <b>Including more recently Chloé Zhao</b> <b>are all NYU graduates</b> <b>So that's also partly the reason</b> <b>Right, also part of the reasons</b> <b>Right, I told you yesterday</b>

<b>I think — how many years has it been since I came to America</b> <b>I came in 2013</b> <b>So it's been about 13 years</b> <b>My 'post-training' is a bit broken now</b> <b>So the issue of mixing Chinese and English</b> <b>Sorry about that, viewers</b> <b>I'll try my best to explain</b> <b>Please bear with me</b> <b>Mm, it seems I haven't found anywhere</b> <b>a podcast of yours</b> <b>or an interview</b>

<b>So</b> <b>Is this your first time doing a podcast or interview?</b>

<b>First time doing a podcast</b> <b>First time doing an interview</b> <b>Right, you can probably find many</b> <b>Me going out to various conferences, right</b> <b>talks at conferences</b> <b>giving talks and such</b> <b>many of those</b> <b>Why haven't you been on a podcast all these years</b> <b>or done an interview</b> <b>I think</b> <b>Mm</b> <b>I don't know</b> <b>I think I'm more suited to being a listener</b> <b>I really enjoy podcasts</b>

<b>Right</b> <b>I often listen to a lot of podcasts</b> <b>My Spotify</b> <b>YouTube, commuting every day, and before bed</b> <b>I often listen to podcasts in my spare time</b> <b>Mm right</b> <b>And I think I have quite a desire to express myself</b> <b>Or rather</b> <b>I also talk about a lot of things with friends privately</b> <b>With students</b> <b>I think, mm</b> <b>Getting everyone together to chat, I think that's very enjoyable</b> <b>Mm, but this podcast thing</b> <b>I don't know either</b>

<b>Maybe it's because nobody invited me</b> <b>That shouldn't be the case</b> <b>Um, well, a little I guess</b> <b>But I still think</b> <b>Maybe it's also because I'm more introverted</b> <b>I think a lot of times</b> <b>I feel, mm</b> <b>I don't know which things should be said</b> <b>which things are worth saying</b> <b>which things people would want to hear</b> <b>But now I think, gradually</b> <b>as I get older</b> <b>it's fine, it's okay</b>

<b>I have gained the courage to be disliked</b> <b>I actually looked up a lot about you online</b> <b>a lot of information</b> <b>But I found</b> <b>everyone's description of you</b> <b>all starts from SJTU's ACM Class</b> <b>And I'm also very curious</b> <b>What was Xie Saining like before that?</b>

<b>Could you start from your</b> <b>earliest memories of the world</b> <b>as the starting point</b> <b>and tell us about your childhood and growing up</b> <b>I</b> <b>Ah OK</b> <b>See, this is exactly why I didn't want to do a podcast</b> <b>[laughter]</b> <b>Because honestly</b> <b>I've never prepared for this</b> <b>Or rather, you have to let me think back</b> <b>from the earliest memories</b> <b>Well it's</b> <b>I think starting from when I was little</b> <b>Maybe</b>

<b>When I was four or five years old</b> <b>Mm, my mom would take me traveling everywhere</b> <b>That might be my earliest memory</b> <b>Oh, where did you travel?</b>

<b>All kinds of places</b> <b>Right, because she also did some business</b> <b>and traveled around everywhere</b> <b>Traveling all around the country, right</b> <b>I remember very clearly, right</b> <b>my first impression of Shanghai</b> <b>And going to</b> <b>Sichuan, and then</b> <b>all kinds of tourist spots you can imagine</b> <b>Um</b> <b>But for me</b> <b>If I really have to dig into the family background</b> <b>it's that</b> <b>my dad is a complete homebody</b> <b>Mm</b> <b>never goes out</b>

<b>But his favorite thing to do is read books</b> <b>So at home, there is a study room</b> <b>with several walls full of books</b> <b>So</b> <b>When I was young, I was basically in this state</b> <b>either running around outside</b> <b>being taken traveling by my mom</b> <b>or at home browsing through all kinds of books</b> <b>books I should read, books I shouldn't — I'd look at them all</b> <b>Right</b> <b>And I think that was my early childhood</b> <b>And then later on</b>

<b>And indeed later</b> <b>I think our generation's growing-up experience</b> <b>was quite different</b> <b>Because I think — well, I don't know</b> <b>I think kids today</b> <b>might, in this AI era</b> <b>have the same feelings</b> <b>But back then for me</b> <b>When I was about 9 years old</b> <b>I got my first computer</b> <b>And from that time on</b> <b>not for anything productive, right</b> <b>buying games box by box and playing them</b> <b>Then the internet came along</b>

<b>and for the first time I felt this information explosion</b> <b>So</b> <b>That was the first time I understood what "content" meant</b> <b>And at that time I felt</b> <b>I suddenly had more desire to express myself</b> <b>Because reading books is still one-directional</b> <b>this learning process</b> <b>though also very broadening</b> <b>But online, there were BBS forums back then</b> <b>And you could go online to share your opinions</b> <b>I still remember, right</b> <b>There was Sina Blog</b> <b>It probably doesn't even exist anymore</b>

<b>But I wrote a lot of blog posts</b> <b>Oh really?</b>

<b>Ah um</b> <b>about all kinds of random topics</b> <b>Looking back now, it's definitely very funny</b> <b>But</b> <b>What was the most popular article?</b>

<b>Quite a few, I think</b> <b>I remember</b> <b>It felt like forced melancholy — writing sad words without real cause</b> <b>Oh</b> <b>Maybe including QQ Space back then, right</b> <b>Everyone always wanted</b> <b>a platform to express themselves</b> <b>And then later</b> <b>there were actually even more new media emerging</b> <b>including blogs</b> <b>then Weibo, right</b>

<b>But back then it wasn't Weibo actually</b> <b>It was Fanfou — I don't know if you've heard of it</b> <b>Of course</b> <b>Wang Xing, right</b> <b>And at that time I was also a heavy Fanfou user</b> <b>On it</b> <b>Fanfou can still be logged into now</b> <b>But it's really hard to look at</b> <b>Sometimes I look at it</b> <b>I think, oh gosh</b> <b>Should I just delete it all</b> <b>But then I think</b> <b>Let it stay there</b>

<b>Let it become part of the internet memory</b> <b>Mm</b> <b>But I think at that time</b> <b>I think</b> <b>I think this explosive growth of the internet</b> <b>made me become</b> <b>someone interested in many things</b> <b>Mm</b> <b>I think that's how it was</b> <b>So, your parents</b> <b>Your mom was in business</b> <b>Were you from a business family?</b>

<b>Not really, not really</b> <b>Um</b> <b>Well, my dad basically</b> <b>He studied psychology in college</b> <b>He also did some education work before</b> <b>And later also in some</b> <b>media work at TV stations</b> <b>Oh</b> <b>Maybe the same profession as you</b> <b>Oh</b> <b>Right</b> <b>So my memory of him when I was little</b> <b>is of him carrying a camera everywhere</b> <b>Oh, that's interesting</b>

<b>Right right right</b> <b>But in my family there really wasn't</b> <b>anyone who studied pure science and engineering</b> <b>This also gave your personality</b> <b>I think quite an artistic side</b> <b>Maybe but</b> <b>But I think I</b> <b>I think the one thing I want to say is</b> <b>Growing up in such a relaxed family environment</b> <b>has really shaped my model of the world</b> <b>I think, about my own</b>

<b>I'm still quite proud of it</b> <b>Mm</b> <b>quite proud</b> <b>Because I think I would</b> <b>Or rather, you just asked why I came to New York</b> <b>I think that's part of it too</b> <b>Mm</b> <b>I think I would hope for myself</b> <b>or hope for the people around me</b> <b>to look at the world with a more open mind</b> <b>Were your grades always very good?</b>

<b>Because you were admitted to SJTU's ACM Class through recommendation</b> <b>Um, not at all</b> <b>It was from high school</b> <b>Right, I think it was like this</b> <b>So, you can see</b> <b>Now I have many, many friends around me</b> <b>who are actually all</b> <b>those who've come up through the top track</b> <b>Right</b> <b>the best high school, right</b> <b>then the best undergraduate</b> <b>competing in competitions</b> <b>the best undergraduate</b> <b>then the best PhD</b>

<b>then after finishing, going to teach at, say, the top four universities</b> <b>There's a very clear main path, right</b> <b>And I have great respect for them</b> <b>I'm completely not like that</b> <b>I'm a, um</b> <b>At most, I have a B-class kind of trajectory</b> <b>Oh</b> <b>Like you</b> <b>And many</b> <b>My decisions are actually quite mystical</b> <b>Because I think</b> <b>I haven't deliberately, in some kind of</b> <b>meritocratic</b> <b>this kind of</b>

<b>setting</b> <b>framework to strive for things</b> <b>Many times it was actually quite random</b> <b>And maybe that's just the way it is</b> <b>The intelligence just isn't enough</b> <b>But indeed</b> <b>For example, when being admitted via recommendation, right</b> <b>That was also very accidental</b> <b>Anyway, there were two</b> <b>awards in informatics and math competitions</b>

<b>And at that time SJTU happened to have this</b> <b>program where you could enter early</b> <b>basically trying to recruit some students</b> <b>and have them skip the college entrance exam</b> <b>Right</b> <b>Actually, I was originally following the gaokao path</b> <b>being prepared for it, actually I</b> <b>um, was supposed to</b> <b>be taking the gaokao</b> <b>So I struggled with this for a long time</b>

<b>The teachers at school all said, no, that won't do</b> <b>How can you back out at the last minute</b> <b>Your grades are already very good, right</b> <b>You should of course aim for Tsinghua or Peking University</b> <b>But my inner thought was</b> <b>Well, SJTU seems great, I think</b> <b>I've been to Shanghai</b> <b>I feel like me and this city</b> <b>and this school share a compatible spirit</b> <b>And I just wanted to study computer science</b> <b>And I think</b>

<b>SJTU's computer science was also very good at that time</b> <b>I had also heard of this ACM program</b> <b>Although the selection process back then</b> <b>actually required you to</b> <b>enter early</b> <b>and after entering there was a summer camp</b> <b>a program like summer camp</b> <b>Right, and you would undergo some tests</b> <b>before you could enter this class</b> <b>Right</b> <b>But many interesting things happened in that process</b> <b>Of course, first let me say</b>

<b>I think I was quite</b> <b>How should I put it</b> <b>If I could choose again</b> <b>I wouldn't regret it at all</b> <b>Right, I think that summer before entering early</b> <b>was a highlight of my life</b> <b>Why</b> <b>Because during those two months, I did nothing</b> <b>just played games in the dorm</b> <b>Why is that a highlight?</b>

<b>Because never again in my life</b> <b>did such a moment come again</b> <b>What games were you playing back then?</b>

<b>Um, many games</b> <b>Playing Dota and such</b> <b>Just in the dorm</b> <b>It was that kind of</b> <b>the kind I saw online during high school</b> <b>college life</b> <b>You know?</b>

<b>Ah, it was</b> <b>There was the studying part</b> <b>But also some</b> <b>finding yourself</b> <b>and in this kind of</b> <b>aimless wasting of time</b> <b>kind of experience</b> <b>Right</b> <b>So Xie Saining's life highlight was wasting time</b>

<b>Really? In the dorm?</b>

<b>[laughter] You could say that</b> <b>Haha, that's very interesting</b> <b>You keep saying you weren't among those with the best grades</b> <b>But you've also had a pretty smooth path</b> <b>You seem to be among the highest achievers too</b> <b>Why is your self-perception like that?</b> <b>My grades are actually average</b> <b>It depends on who I'm comparing to</b> <b>Compared to the top competition winners</b> <b>like what I just described</b> <b>those who had a very smooth path</b> <b>the top students from Yao Class</b>

<b>and then comparing with the top four PhD programs, top four professors</b> <b>Then I really am</b> <b>far behind</b> <b>But on the other hand</b> <b>I think</b> <b>I'm still quite grateful for all of these experiences</b> <b>Because I feel</b> <b>continuing the story from here</b> <b>I think it's actually quite interesting</b> <b>For example, when I went to SJTU</b> <b>SJTU wasn't necessarily</b> <b>in terms of computer science</b>

<b>and artificial intelligence</b> <b>a particularly leading</b> <b>school</b> <b>And now</b> <b>for example, the ACM Class has become</b> <b>Of course, this has nothing to do with me</b> <b>But my juniors</b> <b>including my seniors, right</b> <b>whether doing entrepreneurship or academia</b> <b>shining and contributing everywhere</b> <b>And also</b> <b>We have a very strong</b> <b>alumni network</b> <b>everyone connected, working on things together</b>

<b>I think</b> <b>I still think</b> <b>it's an upward trajectory</b> <b>An upward trajectory</b> <b>And then later still</b> <b>There is another very interesting thing in here</b> <b>I want to mention</b> <b>which is my ACM Class interview</b> <b>And in the interview process</b> <b>there would be senior professors</b> <b>Back then it was Professor Shen Enshao who interviewed us</b> <b>This interview</b>

<b>didn't actually ask you technical questions</b> <b>He would ask you, what books do you like to read</b> <b>Mm</b> <b>And I feel this was somehow destined</b> <b>there was some fate involved</b> <b>Because I was very anxious back then</b> <b>and almost couldn't answer</b> <b>Then I told him</b> <b>A book I actually really like</b> <b>and one I just finished recently, is this</b> <b>This book is called What Is Mathematics?</b>


<b>Then Professor Shen Enshao followed up and asked</b> <b>Who is the author of this book</b> <b>to test me</b> <b>And I was a bit stunned</b> <b>And you know, right</b> <b>A high school student</b> <b>I can't remember foreign names either</b> <b>I thought about it</b> <b>and ultimately managed to answer</b> <b>It was Richard Courant</b> <b>Richard Courant</b> <b>And then Professor Shen said</b> <b>Ah right</b> <b>You must remember this name</b> <b>Because this is equivalent to</b>

<b>one of the greatest mathematicians of the 20th century</b> <b>Why does this make me feel</b> <b>there's a certain destiny at play</b> <b>or some coincidence in this</b> <b>is because now at NYU</b> <b>the department I'm in</b> <b>this institute is the Courant Institute of Mathematical Sciences</b> <b>which is Richard Courant's institute</b> <b>the one he founded from the ground up</b> <b>the department he built</b> <b>Mm</b> <b>So, I think it's quite interesting</b> <b>Right</b>

<b>And the application process later was actually similar</b> <b>I think</b> <b>Or to put this from another angle</b> <b>I think</b> <b>It seems like the world</b> <b>always doesn't want me to do what I want to do</b> <b>Why</b> <b>But</b> <b>But I insist on doing exactly what I want to do</b> <b>Oh</b> <b>For example, during my undergraduate years</b> <b>I was initially interested in computer vision, right</b> <b>Or rather</b>

<b>I developed some interest in artificial intelligence</b> <b>At that time also</b> <b>Starting out in the ACM Class</b> <b>Everyone would start doing this kind of</b> <b>research internship</b> <b>and would go to various labs within the school</b> <b>to different laboratories</b> <b>And the lab I went to</b> <b>was one doing</b> <b>neuroscience + AI work</b> <b>called BCMI</b> <b>And the bookshelves had so many books about consciousness</b>

<b>about the brain</b> <b>about images</b> <b>And then</b> <b>about how we perceive the real world</b> <b>books like these</b> <b>And after looking at them I thought, wow</b> <b>That's so interesting</b> <b>And um</b> <b>Later, in this process</b> <b>I also got to know a senior classmate of mine</b> <b>This senior was Hou Xiaodi</b> <b>Oh</b> <b>And he is also very well known</b> <b>He had previously also started a company</b>

<b>and now is also doing entrepreneurship</b> <b>And every time I talk with him</b> <b>he always says</b> <b>The world has changed</b> <b>But we haven't changed</b> <b>By "we" I specifically mean him and me</b> <b>Because every time we chat</b> <b>it's exactly the same as what we talked about over ten years ago</b> <b>Right, at that time he was a legend at the school</b> <b>Right, and he did two legendary things</b> <b>The first legendary thing was</b>

<b>that as an undergraduate</b> <b>he published a paper at CVPR (one of the world's top computer vision conferences)</b> <b>Right, and in this paper</b> <b>was a very elegant algorithm</b> <b>with only 7 lines of code in total</b> <b>that solved a very important problem</b> <b>Mm</b> <b>CVPR now accepts maybe several thousand papers each year</b> <b>Right, tens of thousands of submissions</b> <b>So now, when we're looking to recruit undergrads</b>

<b>everyone has three, four, five papers each</b> <b>CVPR is already nothing special</b> <b>But at that time</b> <b>at schools in mainland China</b> <b>being able to publish work at such a top conference</b> <b>was actually extremely, extremely difficult</b> <b>very rare</b> <b>very rare</b> <b>And then</b> <b>For an undergraduate to publish such work</b> <b>was unheard of</b> <b>So</b> <b>Everyone truly admired him very, very much</b> <b>Mm</b> <b>But then</b>

<b>he did a second very impressive thing</b> <b>which was, um</b> <b>he led a team</b> <b>and wrote something</b> <b>called the "SJTU Survival Guide"</b> <b>"SJTU Student Survival Guide"</b> <b>Oh, this was written by a team?</b>

<b>Um, he should be the main author</b> <b>I don't know</b> <b>A team worked on it with him</b> <b>This thing still has an archive online now</b> <b>I welcome everyone</b> <b>to go look it up</b> <b>So what does this guide talk about</b> <b>And some of the things</b> <b>some words</b> <b>I went back and revisited it just a couple of days ago</b> <b>I found it very, very interesting</b> <b>Right um</b> <b>What does it talk about</b> <b>It talks about</b>

<b>why people should learn</b> <b>China's education system</b> <b>the university model</b> <b>what exactly is wrong with it</b> <b>where you should spend your time</b> <b>to achieve the life you want</b> <b>Mm</b> <b>And it also guides everyone on how to do research</b> <b>what the purpose of research is</b> <b>the purpose of research is not to churn out papers</b> <b>but is truly about exploring the infinite unknown</b> <b>things like this</b> <b>Of course</b>

<b>It also teaches everyone how to skip class</b> <b>how to</b> <b>complete assignments</b> <b>in a quicker way</b> <b>Right, it's this kind of pamphlet</b> <b>I also went and read it</b> <b>It says if a person</b> <b>treats grade scores as their highest pursuit</b> <b>then they are a sacrifice to that system</b> <b>Mm, I completely agree</b> <b>Right, I think looking back on these things now</b> <b>probably had a subtle influence</b>

<b>really influenced my understanding of many things</b> <b>When he published this</b> <b>what year were you in?</b>

<b>Um</b> <b>First or second year</b> <b>You already knew him in your first or second year?</b>

<b>By that time he had already been admitted</b> <b>and gone to</b> <b>Caltech for his PhD</b> <b>So he and I were</b> <b>Because he also graduated from this same lab</b> <b>So he and I essentially communicated online</b> <b>Hou Xiaodi was at Caltech at the time</b> <b>and was already doing his PhD</b> <b>He had also been admitted to a great school</b> <b>And we were all very, very envious</b> <b>At that time</b> <b>And he and I would still</b> <b>on Google Chat back then</b>

<b>chat with him about many, many things</b> <b>And he really was</b> <b>also gave me a lot of advice</b> <b>I still remember</b> <b>What advice?</b>

<b>Um, nothing specific</b> <b>More often</b> <b>when chatting with him online</b> <b>it was more about research</b> <b>Right, what exactly should be done</b> <b>sharing my own confusion with him</b> <b>And then</b> <b>and how to</b> <b>how to get a paper published</b> <b>roughly seeking his advice</b> <b>But at that time</b> <b>I think through Xiaodi</b> <b>through the books I read</b> <b>I had basically decided</b> <b>I felt this is what I want to do with my life</b>

<b>I think this thing is just so fascinating</b> <b>computer vision</b> <b>Um</b> <b>At that time there wasn't actually a name for it</b> <b>or rather, computer vision was only slowly catching on</b> <b>as a term</b> <b>But actually before that</b> <b>Right</b> <b>people had been processing image or visual information</b> <b>for a long time already</b> <b>For example, people would do so-called image processing</b> <b>Um</b> <b>more often starting from an EE major</b>

<b>Right, and computer vision</b> <b>might be, um</b> <b>gradually becoming more and more popular</b> <b>Mm</b> <b>And then</b> <b>which was around when I started learning these things</b> <b>Right, and then</b> <b>Um, as I just said</b> <b>The reason I say the world always doesn't want me to do this</b> <b>is because when I was in SJTU's ACM Class</b> <b>there was actually another feature</b>

<b>which is that every student in this class</b> <b>had to do an internship in their third year</b> <b>Mm</b> <b>That's actually quite common now</b> <b>But at that time</b> <b>it was still mainly this class's</b> <b>founder's, Professor Yu Yong's</b> <b>innovation</b> <b>So at that time, most people in the ACM Class</b> <b>would work with Microsoft Research Asia</b> <b>which is MSRA</b> <b>through a cooperative program</b>

<b>so many of our students were sent there</b> <b>to do approximately</b> <b>a 6-month internship</b> <b>Right so</b> <b>Um, originally for me</b> <b>If I did nothing</b> <b>I would go to MSRA for internship</b> <b>Right, although that was also good</b> <b>But at that time</b> <b>there actually wasn't a vision group</b> <b>willing to accept undergrads from the ACM Class for internships</b> <b>Why is that?</b>

<b>Um, I don't know</b> <b>Maybe because back then, professors like Ma Yi</b> <b>and Sun Jian were all there</b> <b>Kaiming should have been there too by then</b> <b>And I think</b> <b>they probably didn't like having too many</b> <b>undergrads who don't know anything</b> <b>coming to participate in things, right</b> <b>At that time, they were extremely talented</b> <b>Yes yes yes exactly</b> <b>But we really didn't know anything</b> <b>Right</b> <b>I think I can gradually understand this now</b>

<b>Um, but at that time, um, there was a choice</b> <b>which was still to go to MSRA</b> <b>but not doing anything vision-related</b> <b>research</b> <b>And Professor Yu also told me, well</b> <b>actually you undergrads</b> <b>the most important thing now is still to have research experience</b> <b>and learn how to do research</b> <b>what specific</b> <b>direction</b> <b>isn't very important</b> <b>Mm right um</b> <b>But I didn't think that was okay</b>

<b>I felt I couldn't accept that</b> <b>doing a completely different</b> <b>direction</b> <b>I wanted to understand this field more</b> <b>I hoped to work diligently</b> <b>on some things</b> <b>And then</b> <b>and hopefully one day be like senior Xiaodi</b> <b>being able to publish a CVPR paper</b> <b>Xiaodi was already your idol at that time, wasn't he</b> <b>A bit</b> <b>He was many people's idol</b> <b>Right, during SJTU days</b> <b>Oh</b>

<b>um, and then</b> <b>So I started thinking about how to handle this</b> <b>And started sending emails</b> <b>So I contacted NUS in Singapore, right</b> <b>National University of Singapore's</b> <b>Professor Yan Shuicheng's lab</b> <b>Mm right</b> <b>This was entirely my own doing</b> <b>I didn't even tell Professor Yu</b> <b>And after it was confirmed, hey</b>

<b>I can have this internship opportunity</b> <b>And on his side there were already some</b> <b>subsidies</b> <b>and talking about timing and arrangements</b> <b>the structure was already fairly well set up</b> <b>Then I went to find Professor Yu</b> <b>I said, Professor Yu</b> <b>I really don't want to go to MSRA</b> <b>I want to go to Singapore</b> <b>this school's lab</b> <b>to do the research I want to do</b> <b>Mm</b> <b>Professor Yu was silent for a few seconds</b>

<b>Right, um, maybe I guess</b> <b>I don't know</b> <b>I haven't asked him this question</b> <b>But I guess his inner thought was</b> <b>this student is so headstrong</b> <b>Right</b> <b>Because in the professor's mind</b> <b>MSRA was a better choice</b> <b>Yes yes</b> <b>One, a better choice</b> <b>Two, I think it also allows everyone to go through</b> <b>Right</b> <b>keeping everyone together</b> <b>I think one reason is of course</b> <b>easier to manage</b>

<b>Second, there would be more synergy</b> <b>Right, everyone could still exchange ideas</b> <b>Then you going to a new place</b> <b>what does that even mean</b> <b>is this place even reliable</b> <b>is what you want to do reliable</b> <b>this thing might be uncontrollable</b> <b>Were you conflicted about it?</b>

<b>I wasn't conflicted</b> <b>But I really appreciate Professor Yu</b> <b>in that he</b> <b>Anyway, he was silent for a few seconds</b> <b>and finally said okay</b> <b>You go ahead. Right, um, and so I went</b> <b>But this thing</b> <b>after it happened</b> <b>Professor Yan's group</b> <b>NUS's lab</b> <b>became an option for my juniors</b> <b>an available</b> <b>position</b> <b>Mm</b>

<b>So I think</b> <b>I still wanted to take some initiative</b> <b>and do what I wanted to do</b> <b>Right</b> <b>At that time it was still very early</b> <b>image-related</b> <b>artificial intelligence</b> <b>what exactly attracted you</b> <b>why did it attract you</b> <b>that led you to make many different choices</b> <b>Because I think the way I experience the world</b>

<b>is through vision</b> <b>Mm, I would think</b> <b>I was probably a bit bored when I was little</b> <b>and I would think, hey</b> <b>humans have so many senses, right</b> <b>If I had to remove one</b> <b>which would I remove</b> <b>I think maybe I could be deaf</b> <b>maybe I can't speak</b> <b>maybe I have no touch, no smell</b> <b>I would live very miserably</b> <b>but maybe that could still be accepted</b>

<b>But if I had no vision</b> <b>then I can't watch cartoons anymore</b> <b>I also can't watch movies</b> <b>I also can't play games</b> <b>I would seem to have</b> <b>lost a person's independence</b> <b>And I think</b> <b>Of course this</b> <b>these initial thoughts and later</b> <b>in some books I read</b> <b>what was said resonated quite well</b> <b>Um, because visual signals</b> <b>actually occupy a large part of the brain's cortex</b> <b>um, depending on how you say it, right</b> <b>the main visual areas</b>

<b>might be about</b> <b>um, 30% of the entire brain</b> <b>But um</b> <b>when the entire brain sees an image</b> <b>the activated parts might make up 70%</b> <b>Mm</b> <b>Right</b> <b>So</b> <b>Actually, all of us humans are visual creatures</b> <b>And this</b> <b>Right, that's what I think</b> <b>I'm also a visual creature</b> <b>I also very much like</b> <b>looking at things</b> <b>Animals too</b> <b>Not just humans</b>

<b>Not just humans, right</b> <b>What you said is very, very correct</b> <b>Mm, actually it's not entirely like that</b> <b>Because actually 530 million years ago</b> <b>on Earth</b> <b>these creatures actually had no eyes</b> <b>everyone lived in the deep sea</b> <b>without light</b> <b>Right, everyone was in the deep sea</b> <b>and light couldn't get in</b> <b>And then suddenly one day</b>

<b>some creatures were able to</b> <b>develop their vision</b> <b>Although still very weak</b> <b>only able to see a faint</b> <b>signal</b> <b>Right</b> <b>But at this point they were amazing</b> <b>They could see the prey they wanted to hunt</b> <b>where it is, and swim over quickly</b> <b>and eat it</b> <b>They could also avoid predators</b> <b>someone's coming to catch me</b> <b>I immediately run away</b> <b>Once vision was born</b>

<b>Um</b> <b>other creatures in the evolutionary process</b> <b>had to evolve stronger vision</b> <b>Right because</b> <b>if you don't have stronger vision</b> <b>you'll be eaten</b> <b>Right</b> <b>So an arms race began</b> <b>So this is the so-called Cambrian Explosion</b> <b>what is called the Cambrian Era</b> <b>That is to say, on Earth before the Cambrian period</b>

<b>there may have been only a handful of species</b> <b>But after the Cambrian</b> <b>suddenly like a big bang</b> <b>hundreds of thousands of species emerged</b> <b>One leading theory is</b> <b>a theory</b> <b>that this explosion's</b> <b>origin</b> <b>was actually because creatures had an arms race</b> <b>at the visual level</b> <b>Yes yes</b> <b>So what you said is completely right</b> <b>I think</b> <b>This is actually not something unique to humans</b>

<b>I think all animals are actually the same</b> <b>Mm</b> <b>And so</b> <b>I'm still quite interested in this</b> <b>And you know</b> <b>this thing called vision</b> <b>isn't just a sense</b> <b>There is a saying that</b> <b>the eye is actually the only one</b> <b>it is part of the brain</b> <b>but it's the only one</b> <b>part of the brain exposed to the real world</b> <b>because other parts of the brain</b> <b>are all hidden behind our skull</b> <b>Mm right</b>

<b>So thinking about it this way</b> <b>solving vision isn't about solving vision itself</b> <b>but about solving intelligence itself</b> <b>Right, so I think everything can be connected</b> <b>From before you even officially started your first year</b> <b>hiding in the dorm playing games</b> <b>wasting time</b> <b>to you finding computer vision</b> <b>as the main thread of your life</b> <b>what happened in between?</b>

<b>Mm, actually nothing much happened</b> <b>Actually many times</b> <b>I think it all comes from chance</b> <b>Mm</b> <b>Just like if I hadn't read that book back then</b> <b>I probably wouldn't have taken this path</b> <b>But sometimes I feel this is also inevitable</b> <b>I still quite believe</b> <b>everyone actually has their own destiny</b> <b>Or rather</b> <b>Sometimes I tell students</b> <b>Don't think that if you don't do this</b> <b>someone else will</b> <b>do it</b>

<b>Instead think: if you don't do this</b> <b>this thing will never happen in this world</b> <b>What does that mean?</b>

<b>meaning</b> <b>you are now working on a research topic</b> <b>Right</b> <b>and the thing you're doing</b> <b>how you got here step by step</b> <b>to this endpoint</b> <b>this thing</b> <b>completely depends on yourself</b> <b>your personal life experiences</b> <b>your background growing up</b> <b>maybe a book you read</b> <b>maybe a conversation you had with someone</b> <b>maybe it's genetic</b>

<b>your genes simply being different from others</b> <b>Right, I think</b> <b>every individual</b> <b>in this world is very unique</b> <b>everyone is a variable in this world</b> <b>and who can say for certain</b> <b>It's possible</b> <b>you are the most important variable in this world</b> <b>This is your worldview</b> <b>I think it's my optimistic side</b> <b>[laughter]</b> <b>Right</b> <b>Mm</b> <b>During your time at NUS</b> <b>Did you get what you wanted to get?</b>

<b>Um, I think</b> <b>I think yes</b> <b>First of all, I made a lot of very good friends</b> <b>I can gradually elaborate on that later</b> <b>But I got to know</b> <b>For example</b> <b>Actually the main person who mentored me then</b> <b>my mentor was Feng Jiashi</b> <b>He was a PhD student at the time</b> <b>Right, and he mentored me</b> <b>And then did some work</b> <b>We published a paper</b> <b>Not a top conference either</b> <b>Unfortunately, I still couldn't publish at CVPR during undergrad</b>

<b>Mm</b> <b>But we published</b> <b>a decent one</b> <b>this BMVC paper</b> <b>Right, it was</b> <b>a not-so-top-tier computer vision</b> <b>paper</b> <b>So um</b> <b>I think</b> <b>I still think there was a lot to gain</b> <b>For the first time I learned</b> <b>um research</b> <b>what it's about</b> <b>Right</b> <b>Having actually written a paper versus not having written one</b> <b>I think there's still a big difference</b> <b>Was that your first paper on CV?</b>

<b>Yes yes</b> <b>But you could say</b> <b>this was a CV paper</b> <b>but actually it wasn't really about CV</b> <b>Its only application</b> <b>was face recognition</b> <b>it was more like a</b> <b>machine learning paper</b> <b>But that was normal at the time</b> <b>everyone studying CV</b> <b>or researching CV</b> <b>was doing similar things</b> <b>the so-called</b> <b>manifold clustering related things</b>

<b>Right, but it was at that time point</b> <b>That was 2012, 2013</b> <b>2012 right</b> <b>So it was right at the AlexNet moment</b> <b>Mm</b> <b>So I was also at that time point</b> <b>learning about this</b> <b>Right, and then right</b> <b>and learning about ImageNet</b> <b>learning about deep learning</b> <b>So I think that was actually a starting point</b>

<b>That was when I just started doing research</b> <b>and learning how to do research</b> <b>and also a starting point for all of deep learning</b> <b>This was your third year</b> <b>Third year, right</b> <b>University was almost over at that point</b> <b>So you actually during your undergraduate years</b> <b>had already found your main thread</b> <b>I think so</b> <b>Mm</b> <b>What was your intrinsic reward mechanism at that time?</b>

<b>I think it's still curiosity</b> <b>Right, it's that I</b> <b>I think</b> <b>I want to know why</b> <b>Right</b> <b>Or rather</b> <b>This might also be my own explanation</b> <b>I also don't know</b> <b>what exactly my intrinsic motivation is</b> <b>But</b> <b>Mm</b> <b>I want to understand more</b> <b>I want to understand</b> <b>more about this field</b>

<b>I want to engage with the top</b> <b>students in this field</b> <b>researchers</b> <b>professors</b> <b>and have deeper exchanges</b> <b>Mm-hmm</b> <b>So this is also why later</b> <b>I decided I still wanted to go abroad</b> <b>wanted to apply</b> <b>I think also</b> <b>Probably this reason too</b> <b>Here I want to ask a small extra question</b> <b>You must also have many friends from Tsinghua's Yao Class</b> <b>Right, I also have many friends from Tsinghua's Yao Class</b>

<b>who have come on my show</b> <b>Yes, I want to know</b> <b>Tsinghua's Yao Class</b> <b>do you think compared to SJTU's ACM Class</b> <b>what is the biggest difference</b> <b>in terms of training</b> <b>I think the ACM Class is probably less competitive</b> <b>One difference is, um, again</b> <b>this thing</b> <b>is actually still Professor Yu's design</b> <b>He, I think, is, um</b> <b>quite a great educator</b> <b>I can say that</b> <b>Mm right</b> <b>Like back in our days</b> <b>actually in our curriculum design</b>

<b>um, there would be many</b> <b>seemingly quite strange settings</b> <b>For example, we had a course</b> <b>that Professor Yu was actually very proud of</b> <b>called the 'Student Forum'</b> <b>What is this Student Forum?</b>

<b>It means everyone comes to this class</b> <b>and spends maybe 45 minutes to 1 hour</b> <b>to do a presentation</b> <b>give a talk</b> <b>And this talk cannot be related to studying</b> <b>It can be about anything in the world</b> <b>but cannot be related to studying</b> <b>Right so um</b> <b>some people would talk about philosophy</b> <b>some about history</b> <b>some about society</b>

<b>some about many very interesting things</b> <b>Of course science was also allowed</b> <b>Mm right</b> <b>And I think</b> <b>I think this might be a difference in cultivation approach</b> <b>Of course I've never been to Yao Class</b> <b>so I'm not sure</b> <b>But I think</b> <b>everyone was still in a relatively relaxed</b> <b>and more liberal arts-focused</b> <b>kind of setting moving forward</b> <b>Mm, the impression you give me is</b> <b>you don't seem like someone who likes excessive competition</b>

<b>Um, I think I'm not afraid of competition</b> <b>but I genuinely don't like excessive competition</b> <b>And I think</b> <b>excessive competition definitely doesn't help innovation</b> <b>Right, I think</b> <b>I think this</b> <b>Of course that's not saying ACM Class has no competition</b> <b>there is actually very strong competition</b> <b>Were you a winner in this competition?</b>

<b>I wasn't eliminated</b> <b>OK</b> <b>Right</b> <b>But actually it can't really be called elimination</b> <b>which was</b> <b>everyone felt whether they were suited or not</b> <b>and would choose to stay or leave</b> <b>What was your approximate ranking in undergrad?</b>

<b>There were maybe 30-40 people total</b> <b>Maybe ranked around the teens</b> <b>Just not pushing myself too hard</b> <b>Not pushing myself too hard</b> <b>Mm</b> <b>Did you ever think about becoming</b> <b>for example, first or second in the ACM Class?</b>

<b>Was that your goal?</b>

<b>I couldn't have</b> <b>Right [laughter]</b> <b>Really, really couldn't</b> <b>Because we had very strong</b> <b>Right um</b> <b>students with competition backgrounds</b> <b>And the evaluation criteria</b> <b>I think were actually quite multidimensional</b> <b>it's hard to say who was first or second</b> <b>Or if you only look at GPA</b> <b>then I really couldn't</b> <b>Mm right</b> <b>And I think</b> <b>And for this</b> <b>maybe also inspired by the Survival Guide</b> <b>I also didn't care that much</b> <b>So from that time you started</b>

<b>following your interests very closely</b> <b>Yes right</b> <b>I think pursuing my interests</b> <b>and I would do everything possible to make it happen</b> <b>Right, especially in the application process it was the same</b> <b>Mm</b> <b>A previous example was you going to NUS</b> <b>instead of going to Microsoft Research Asia</b> <b>Right, when applying</b> <b>Actually</b> <b>there's another story here</b> <b>which is that I almost didn't get into any school</b> <b>but ultimately it didn't come to that</b> <b>I did have some offers</b>

<b>but none from a professor I wanted to work with</b> <b>doing computer vision</b> <b>Oh</b> <b>This made me very, very depressed</b> <b>And at one point I would think</b> <b>Okay, I could go do some</b> <b>recommendation system research</b> <b>some more</b> <b>um, you know</b> <b>machine learning research</b> <b>Oh</b> <b>Um, until finally</b> <b>And then I</b> <b>I started frantically writing emails to everyone</b> <b>those cold-contact emails</b> <b>Mm right</b>

<b>And then Professor Tu Zhuowen</b> <b>Right, Professor Tu</b> <b>replied to me</b> <b>But by then it was already very, very late</b> <b>Because you know</b> <b>For PhD applications</b> <b>the deadline is generally April 15th</b> <b>Right, I actually received this reply in April</b> <b>Oh</b> <b>Right</b> <b>Who was the professor you most wanted to work with?</b>

<b>At that time</b> <b>Um</b> <b>At that time there weren't many professors doing computer vision</b> <b>Right, and then</b> <b>I think Professor Tu</b> <b>was certainly</b> <b>a professor I admired very, very much</b> <b>So I think he was also my top choice</b> <b>Right mm</b> <b>And of course</b> <b>there would be many</b> <b>You would of course say</b> <b>Like at Stanford</b> <b>Berkeley right</b> <b>MIT would have</b> <b>many pioneers of computer vision</b> <b>But at that time</b>

<b>those were beyond my reach</b> <b>Mm right</b> <b>So I sent this email to Professor Tu</b> <b>And he replied to me</b> <b>And I remember very clearly</b> <b>Because of the time difference</b> <b>So Professor Tu asked if we should have a call</b> <b>When are you free</b> <b>I said I'm free at any time</b> <b>And so at 3 AM</b> <b>downstairs in the dormitory</b> <b>I had this phone call with Professor Tu</b>

<b>Telling him why I thought</b> <b>I wanted to do this</b> <b>Mm, what things I had done before</b> <b>And why I thought</b> <b>I very much admire your research</b> <b>I think we can work together</b> <b>Right so</b> <b>Later, Professor Tu rescued me</b> <b>Very, very, very lucky</b> <b>In the last few days</b> <b>In the last few days he rescued me</b> <b>But there was another twist later</b> <b>Because at first Professor Tu Zhuowen</b>

<b>was actually at UCLA</b> <b>Right</b> <b>So the offer I received was UCLA's offer</b> <b>And I got my visa sorted and was ready to enroll</b> <b>And then about a week before</b> <b>Professor Tu said</b> <b>I'm sorry</b> <b>I'm going to change jobs</b> <b>I'm at UCLA</b> <b>for various reasons</b> <b>I don't want to stay anymore</b> <b>I don't want to continue here</b> <b>I'm going somewhere else</b> <b>Where am I going?</b>

<b>Right now I can't tell you either</b> <b>I don't know either</b> <b>Because he was also in interviews at that time</b> <b>Oh really?</b>

<b>Oh really?</b> <b>And he told me</b> <b>You have a few options</b> <b>One is you can stay at UCLA</b> <b>and I'll hand you over to other professors</b> <b>Or you can wait</b> <b>and see how my situation works out</b> <b>And possibly</b> <b>if I go to a school you're willing to come to</b> <b>you can come with me</b> <b>So did you wait?</b>

<b>Or did you immediately say, I choose you?</b>

<b>I basically said</b> <b>I immediately said, I choose you</b> <b>You didn't care about the school?</b>

<b>Um</b> <b>I think I don't care about the school</b> <b>And I still think</b> <b>I think all these things are very interesting</b> <b>Because back then if you looked at UCSD</b> <b>in terms of overall rankings</b> <b>nothing compared to UCLA</b> <b>Mm</b> <b>Now it's completely different</b> <b>If you look at CS rankings</b> <b>or from AI hiring</b> <b>and students</b> <b>including faculty resources</b>

<b>in terms of AI strength</b> <b>I think UCSD</b> <b>is already among the top few</b> <b>Back then, it was completely different</b> <b>Back then</b> <b>And I actually always wanted to collaborate with a professor</b> <b>named Serge Belongie</b> <b>who had just decided to leave UCSD too</b> <b>Well, so I felt everything was hopeless</b> <b>which was</b> <b>the place I was going didn't seem highly ranked</b> <b>um, and then</b> <b>faculty were also leaving</b> <b>faculty were also leaving</b>

<b>But I thought about it and said</b> <b>none of this matters</b> <b>none of it is important</b> <b>what matters is who I'm working with and on what</b> <b>and whether this is something I want to do</b> <b>I think putting aside all this noise</b> <b>this is the only thing I want to care about</b> <b>Mm, that's very interesting</b> <b>Mm</b> <b>So this kind of thing happened several times</b> <b>I just said</b> <b>At SJTU it was also an upward trajectory</b> <b>And then going to</b>

<b>UCSD</b> <b>That was also part of it</b> <b>which was</b> <b>Of course</b> <b>I'm not saying this has anything to do with me</b> <b>I don't think it has anything to do with me</b> <b>But somehow I feel I can see a place</b> <b>or even a person</b> <b>their upside potential</b> <b>that is, their potential</b> <b>Mm</b> <b>And I'm willing to grow together with those places</b> <b>I think</b> <b>This is something I feel quite deeply</b>

<b>How long did it take you to find out Professor Tu was going to UCSD?</b>

<b>Um, maybe a few months later</b> <b>Right, maybe one or two months later</b> <b>Were you worried at the time?</b>

<b>Of course I was worried</b> <b>Right</b> <b>Because Professor Tu is actually very humble</b> <b>extremely capable but very humble</b> <b>So he would always give me a heads-up saying</b> <b>the school I'm going to</b> <b>might be ranked lower</b> <b>you should think about it</b> <b>Right, what did you say?</b>

<b>I don't remember very well what I said</b> <b>But again, for me</b> <b>this might not be that important</b> <b>And</b> <b>and at that time it wasn't yet time to make a decision</b> <b>Right, why should I</b> <b>worry in advance about things that haven't happened</b> <b>So I didn't think too much about it</b> <b>Did anyone else make this choice?</b>

<b>Among the students Professor Tu communicated with</b> <b>Um, basically none</b> <b>I was the first student he recruited at UCSD</b> <b>I think just based on that</b> <b>Professor Tu must like you very much</b> <b>Um, I think all of this is</b> <b>I think it was also him saving me</b> <b>Um indeed</b> <b>But he not only rescued me at the beginning</b> <b>later, doing research</b> <b>during the PhD process</b> <b>I think he truly helped me</b>

<b>Right, like my internship in Singapore and such</b> <b>you could say we were doing some research</b> <b>but in reality</b> <b>it was still small-scale stuff</b> <b>having someone next to you teaching you</b> <b>the feeling is still different</b> <b>Professor Tu is the type who sits beside your monitor</b> <b>and goes through the code with you line by line</b> <b>that kind of teacher</b> <b>Mm, and he often</b>

<b>I think proudly would tell us these things</b> <b>And I think he is very deserving</b> <b>of this pride, meaning</b> <b>he published several papers</b> <b>that actually had an important influence</b> <b>on later computer vision</b> <b>all completed as sole author works</b> <b>And these works didn't have, like now</b> <b>everyone using PyTorch</b> <b>with so many open-source communities</b> <b>so many libraries you can use</b>

<b>right, having GPUs</b> <b>in his time there was nothing</b> <b>he had to write from the ground up</b> <b>For example, for a task like image segmentation</b> <b>he had to write from scratch</b> <b>about 50,000 lines of code</b> <b>He even sent me this code to look at</b> <b>That included the lowest level</b> <b>including some distributed training</b> <b>a whole series of things</b> <b>all written in C++</b> <b>Right, 50,000 lines of code</b>

<b>I think</b> <b>On one hand I feel I'm very lucky</b> <b>not needing to go through all that</b> <b>But on the other hand</b> <b>I think actually</b> <b>their generation in America</b> <b>these scientists</b> <b>these professors are truly admirable</b> <b>Right, if not for them</b> <b>there would be no us today</b> <b>They actually, um</b> <b>blazed a trail</b> <b>Right, this path didn't originally exist</b>

<b>As I said, right</b> <b>publishing a CVPR paper</b> <b>was actually a very, very difficult thing</b> <b>And there was a certain circle</b> <b>a certain fixed circle</b> <b>Right, and I think it required Professor Tu</b> <b>and actually his boss</b> <b>Professor Zhu Songchun</b> <b>and including later people like Fei-Fei (Li Fei-Fei, Stanford professor, co-founder and CEO of World Labs)</b> <b>and so on</b> <b>Professor Fei-Fei</b>

<b>everyone blazing this trail</b> <b>so that we have a path to walk</b> <b>Mm, I saw a Xiaohongshu comment saying</b> <b>Xie Saining was unremarkable in China</b> <b>nothing special</b> <b>made a big splash when he got to America</b> <b>So what exactly is the variable?</b>

<b>First, I don't think I was unremarkable in China</b> <b>Mm, I don't accept that</b> <b>And I didn't make a big splash in America either</b> <b>I don't accept that either</b> <b>I feel like the things I've done</b> <b>have been a fairly smooth</b> <b>a very gradual process</b> <b>Right, and or rather I think this is also what I hope</b> <b>um, as a researcher, right</b> <b>this kind of science practitioner</b>

<b>I hope to be in</b> <b>meaning this is not a momentary</b> <b>burst of hormones or adrenaline</b> <b>this thing</b> <b>might be a lifetime of building</b> <b>a very quiet process</b> <b>I hope</b> <b>to be in such a state</b> <b>When I say such a state</b> <b>it's because I know</b> <b>many people are in this state</b>

<b>the researchers I most admire</b> <b>they are in this state</b> <b>they didn't say</b> <b>there was this sudden rise to fame</b> <b>or at least their way of doing things is not</b> <b>or their purpose is not to become suddenly famous</b> <b>Right, I think so</b> <b>Then what is it for?</b>

<b>It's for thinking problems through</b> <b>Mm</b> <b>How did your PhD work unfold?</b>

<b>The PhD work was also very interesting</b> <b>PhD work</b> <b>Um, I think it was also through</b> <b>Professor Tu's hands-on mentoring</b> <b>Right, but um</b> <b>We had our first paper</b> <b>By the way, I</b> <b>During my PhD</b> <b>I wasn't a successful PhD student either</b> <b>by today's standards</b> <b>I published maybe five or six</b> <b>top conference papers</b> <b>What level is that?</b>

<b>I don't know</b> <b>That should have been fine for that era</b> <b>the level to get a job at a top lab</b> <b>Now it might already be</b> <b>Right now</b> <b>now many of my students</b> <b>publish many more papers than I did</b> <b>and the quality of work is also much better</b> <b>But anyway</b> <b>At the beginning</b> <b>I think we did a work called</b> <b>Deeply Supervised Nets</b> <b>Mm</b> <b>This work</b> <b>was actually</b> <b>Me and another more senior PhD student</b>

<b>completed it together in collaboration</b> <b>And at this time</b> <b>This was around 2013, 2014</b> <b>And at this time, deep learning finally began to explode</b> <b>But I think this was also a very interesting moment</b> <b>Because actually many people didn't accept this</b> <b>Especially many professors working in computer vision</b> <b>didn't even accept this</b> <b>Everyone thought</b>

<b>deep learning was still a kind of alchemy</b> <b>still a black box</b> <b>people trusted traditional machine learning theory more</b> <b>trusting SVMs, or trusting some</b> <b>Bayesian theories</b> <b>Right</b> <b>being able to pivot in time to do deep learning research</b> <b>This now, looking back</b> <b>with the benefit of hindsight</b> <b>is a no-brainer</b>

<b>you didn't need to make that choice</b> <b>right, you should just do it</b> <b>But at the time, making such a choice</b> <b>I think required some courage</b> <b>So this is actually</b> <b>another reason I admire Professor Tu very, very much</b> <b>and I was</b> <b>deeply affected by</b> <b>this one thing</b> <b>That is to say</b> <b>he actually pivoted very promptly</b> <b>So this Deeply Supervised Nets</b> <b>was in this era</b>

<b>our first deep learning work</b> <b>Right, so this thing</b> <b>was actually simple</b> <b>it was about how</b> <b>all of these neural networks</b> <b>Um</b> <b>previously were just a single stream</b> <b>a long chain</b> <b>with your input</b> <b>and getting your output</b> <b>And now Deeply Supervised Nets</b> <b>meaning</b> <b>you can now actually have multiple branches</b>

<b>that is, your neural network</b> <b>can actually have multiple exits</b> <b>and at different exits</b> <b>you can apply a supervision signal</b> <b>In this way</b> <b>the most direct benefit is</b> <b>you no longer rely only</b> <b>on the signal at the far end</b> <b>back-propagating</b> <b>all the way back to</b> <b>the early layers</b> <b>you don't need</b> <b>to do back propagation from the far end</b>

<b>all the way to the beginning</b> <b>you can actually from an intermediate node</b> <b>do back propagation</b> <b>This way</b> <b>can partially solve the vanishing gradient problem</b> <b>Mm</b> <b>And this actually relates to what came later</b> <b>for example, ResNet actually has some resemblance</b> <b>it's actually</b> <b>or in that era</b> <b>everyone actually wanted to solve this problem</b> <b>So Deeply Supervised Nets</b> <b>was a</b> <b>way to solve this problem</b>
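The intermediate-exit idea he describes can be illustrated with a toy scalar chain (my own sketch, not the paper's model; the function name `grad_w0` is mine): with small weights, the gradient reaching the first layer from the far-end loss shrinks geometrically with depth, while a loss at an intermediate exit gives the early layers a much shorter, stronger gradient path.

```python
# Toy scalar "chain" network: y_k = w_k * y_{k-1}. The gradient of a
# loss placed at exit layer E with respect to the first weight is
# x * (w_1 * ... * w_{E-1}), so with weights below 1 it shrinks
# geometrically with depth -- the vanishing gradient. An auxiliary
# supervision signal at an intermediate exit shortens that product.

def grad_w0(ws, x, exit_layer):
    """d(y_exit)/d(w_0), taking the exit activation itself as the loss."""
    g = x
    for w in ws[1:exit_layer]:  # product of the weights on the path
        g *= w
    return g

ws = [0.5] * 10                  # 10 layers, each scaling its input by 0.5
g_final = grad_w0(ws, 1.0, 10)   # from the far-end exit: 0.5**9 ~ 0.002
g_aux = grad_w0(ws, 1.0, 3)      # from an exit after layer 3: 0.5**2 = 0.25
```

With an exit only three layers in, the early-layer gradient is over a hundred times larger than the one arriving from the far end of the ten-layer chain, which is the "most direct benefit" described above.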

<b>Actually this thing</b> <b>though it was long ago</b> <b>right, this was again 12 years ago</b> <b>but I think research is like this</b> <b>12 years later</b> <b>actually some of our current papers</b> <b>are again using the same</b> <b>kind of design</b> <b>sometimes we don't even realize it</b> <b>I think this is very interesting</b> <b>But let's not talk about 12 years later</b> <b>Right, so my second paper</b> <b>was called Holistically-Nested Edge Detection (HED)</b>

<b>a work on edge detection</b> <b>HED</b> <b>Right, I think about this paper</b> <b>I'm actually quite proud of it</b> <b>Because frankly</b> <b>it solved a research problem</b> <b>um, it was both lucky</b> <b>and unlucky</b> <b>The lucky part is</b> <b>this paper was a good paper</b> <b>The unlucky part is</b> <b>once the problem was solved</b> <b>nobody worked on it afterward</b> <b>so nobody cited your paper</b> <b>[chuckles]</b>

<b>so it lost many citations</b> <b>[chuckles]</b> <b>Um, but um</b> <b>But this work</b> <b>is essentially Deeply Supervised Nets</b> <b>DSN applied to</b> <b>edge detection</b> <b>implemented as a holistic</b> <b>what we call pixel labeling</b> <b>pixel-level</b> <b>annotation</b> <b>task</b> <b>Mm</b>
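The structure of applying deep supervision to pixel labeling can be sketched in a few lines (illustrative only; the function names are mine, and a squared-error loss stands in for HED's class-balanced cross-entropy): each network level emits its own edge map, every map gets its own supervision, and a weighted fusion produces the final prediction.

```python
# Sketch of the HED-style deep-supervision scheme (not the paper's code):
# each level's "side output" is a per-pixel edge map; every side output
# and the fused map receive their own loss.

def fuse(side_maps, weights):
    """Weighted sum of per-level edge maps -> final fused edge map."""
    h, w = len(side_maps[0]), len(side_maps[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for m, a in zip(side_maps, weights):
        for i in range(h):
            for j in range(w):
                fused[i][j] += a * m[i][j]
    return fused

def pixel_loss(pred, target):
    """Per-pixel squared error (a stand-in to keep the sketch simple)."""
    return sum((p - t) ** 2
               for rp, rt in zip(pred, target)
               for p, t in zip(rp, rt))

def deeply_supervised_loss(side_maps, fused, target):
    # Supervision is applied at every exit, not just the final one.
    return sum(pixel_loss(m, target) for m in side_maps) + pixel_loss(fused, target)

# Two toy 2x2 side outputs: a coarse low-confidence map (early layer)
# and a fine map that matches the target (late layer).
target = [[0.0, 1.0], [1.0, 0.0]]
sides = [[[0.5, 0.5], [0.5, 0.5]],
         [[0.0, 1.0], [1.0, 0.0]]]
final = fuse(sides, [0.5, 0.5])   # -> [[0.25, 0.75], [0.75, 0.25]]
```

Because each exit is trained directly against the target, even the coarse early map contributes a gradient, while the fusion step lets the network weigh coarse and fine evidence against each other.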

<b>And this</b> <b>also opened up many new ways of thinking for me</b> <b>because I would discover</b> <b>that a neural network</b> <b>each of its layers</b> <b>actually has implicit structure</b> <b>and information in it</b> <b>your neural network, again</b> <b>has not only input and output</b> <b>in between there is a lot of information</b> <b>it represents</b> <b>a so-called hierarchical</b>

<b>hierarchical structure of the world</b> <b>For edge detection</b> <b>it means</b> <b>that your early layers</b> <b>output edges that are</b> <b>coarser, so-called coarse edges</b> <b>Right, and the further up</b> <b>the more refined your edges become</b> <b>So</b> <b>Finally you can take all of these edges</b> <b>and fuse them together</b> <b>to get one that best approximates human perception</b> <b>such an edge</b> <b>output result</b> <b>I think this</b> <b>was actually</b>

<b>also giving me a new understanding of deep learning</b> <b>It's a very interesting, very interesting thing</b> <b>You can think of it as a black box</b> <b>but each part of this black box</b> <b>you can open up</b> <b>connect some new inspiration</b> <b>and reach some new goals</b> <b>I think this was very enlightening for me</b> <b>And this paper at the time</b> <b>also had a big impact on my life</b>

<b>because it was published at ICCV</b> <b>and also received an award</b> <b>This award was the Marr Prize</b> <b>the Best Paper Award nomination</b> <b>not the Best Paper Award itself</b> <b>just a nomination</b> <b>But the Marr Prize selection actually</b> <b>covers two papers</b> <b>meaning</b> <b>the Marr Prize and its Honorable Mention are two separate awards</b> <b>So this made me feel</b> <b>if you want to say sudden fame</b> <b>I really did feel at the time</b>

<b>look, I also became famous at a young age</b> <b>Now, of course</b> <b>we have many Chinese students</b> <b>also on the world stage</b> <b>winning so many Best Papers</b> <b>Right, but back then for me</b> <b>walking onto that stage</b> <b>or that podium</b> <b>and giving the award presentation</b> <b>giving this talk</b> <b>I think it moved me greatly</b> <b>I felt, wow</b>

<b>my life has begun</b> <b>Right, and I will keep working hard</b> <b>I will have more and more best papers</b> <b>Ah unfortunately</b> <b>this was my last time receiving Best Paper</b> <b>[laughter]</b> <b>What year of your PhD was this?</b>

<b>Second year of PhD</b> <b>[laughter]</b> <b>And up until now</b> <b>Just a few days ago during Spring Festival</b> <b>people were still texting saying</b> <b>Happy New Year</b> <b>May you have many Best Papers</b> <b>I said it's been 10 years</b> <b>everyone has been wishing this for me</b> <b>and I still haven't received another one</b> <b>Do you still want one?</b>

<b>Um</b> <b>Good question</b> <b>Well I think</b> <b>this thing isn't that important to me anymore</b> <b>On one hand</b> <b>I know the process</b> <b>I know actually</b> <b>um, whether I get a Best Paper or not</b> <b>might not represent the quality of the work</b> <b>And I also know the Best Paper I got</b> <b>Honorable Mention</b> <b>was mostly luck too</b> <b>Mm-hmm</b> <b>It's a hugely random process</b>

<b>whether a paper gets accepted or not</b> <b>what kind of award it can get</b> <b>I think this thing</b> <b>is very, very random</b> <b>And if something is this random</b> <b>it shouldn't be something a researcher</b> <b>should focus on</b> <b>So in your second year</b> <b>you felt life had finally begun</b> <b>Right, and life finally began</b> <b>and then reality immediately knocked me over</b> <b>Right um</b> <b>[chuckles]</b> <b>but it wasn't that exaggerated</b> <b>That is to say, um</b> <b>I think this is another</b>

<b>thing during my PhD</b> <b>for which, well</b> <b>I'm again grateful to Professor Tu</b> <b>in that he</b> <b>was actually a very, very open-minded</b> <b>person who let us explore all kinds of</b> <b>different directions</b> <b>So during my PhD I did 5 internships in total</b> <b>I think even today</b> <b>even with schools</b> <b>and industry collaborating so broadly</b> <b>that's still hard to imagine</b> <b>Why did you want to do internships?</b>

<b>I just wanted to go out and see</b> <b>Mm</b> <b>maybe it's the same as traveling when I was young</b> <b>I wanted to know in different places in this world</b> <b>different organizations</b> <b>what kind of things were happening</b> <b>what people were doing what things</b> <b>I wanted to know all of this</b> <b>And on one hand I tell you</b> <b>right, I always wanted to do</b> <b>artificial intelligence</b> <b>or wanted to do computer vision</b> <b>But on the other hand</b>

<b>I would also ask myself</b> <b>What if I'm wrong?</b>

<b>Right</b> <b>What if</b> <b>what if</b> <b>right, what if</b> <b>the world</b> <b>has something even more interesting happening</b> <b>what would I do</b> <b>Right so</b> <b>I think</b> <b>This is another motivation of mine</b> <b>You went to NEC Labs America</b> <b>went to Adobe</b> <b>went to Meta</b> <b>went to Google Research and DeepMind</b> <b>Right, thank you</b> <b>for the background check</b> <b>Right yes</b> <b>Those are the 5 places</b> <b>And um</b>

<b>actually the first four were all in the Bay Area</b> <b>So</b> <b>I was actually quite happy during that time</b> <b>every year</b> <b>I had my own beat-up car</b> <b>and every summer</b> <b>I would sublet my dorm room</b> <b>drive my car all the way from Southern California to Northern California</b> <b>Mm</b> <b>an 8-hour drive</b> <b>Sometimes with</b> <b>once or twice with friends</b> <b>but most of the time I was on the road alone</b>

<b>I think this was actually quite cool</b> <b>Right, all my worldly possessions in my car</b> <b>two suitcases</b> <b>not taking anything else</b> <b>because I'd given up my place too</b> <b>when I came back I'd have to find housing again</b> <b>Right, um, no fixed abode</b> <b>this nomadic researcher lifestyle</b> <b>I was still quite happy</b> <b>Which of these 5 places did you like most?</b>

<b>Each has its own character. I recently told my students about these 5: many of my students' internships didn't produce much good work, and I use myself as an example. I did 5 internships, and in half of them I didn't produce anything.</b> <b>And how long were these internships?</b>

<b>Generally 3 to 6 months, so about half of each year: half the time at school, half in the Bay Area, and at the low point, London. And it's not about liking one most; I tried to diversify, hoping each place I went would be different, for a more varied experience. NEC Labs America was of course the first place I went,</b>

<b>and there I published a CVPR paper. There were many great colleagues, mostly Chinese, and at lunch everyone would go together to Cupertino to eat. That's my impression of it. I really liked that group, and everyone's attitude toward research.</b>

<b>And I published my own paper, so I was very happy with that experience.</b> <b>NEC Labs America back then was also a gathering place for deep learning; Dr. Yu Kai (founder and CEO of Horizon Robotics) also worked there.</b> <b>Yes. It had two divisions, one in Princeton and one in Cupertino (in Silicon Valley, California). All the vision and media people were in the Bay Area, and all those doing</b>

<b>traditional machine learning work were concentrated in Princeton. Some of what follows we can skip, but anyway, at Adobe I just didn't produce anything. The reason is that Adobe is a very, very artistic company, with an artistic temperament.</b> <b>Oh, makes sense.</b> <b>At that time I was in San Francisco, and</b>

<b>they had me work on design and crowdsourcing, meaning you'd write Mechanical Turk user-feedback systems and use them to guide machine learning and computer vision tasks</b>

<b>like segmentation. I just didn't do it well; I still feel guilty toward my mentor, though they were all very kind. But that was also the time that made me realize it's OK: not producing anything is not the end of the world. Still, that period was quite depressing, and this depressive period</b>

<b>actually continued until my Meta internship. Back at school I also didn't seem to produce any interesting work. Then at Meta, the internship was maybe only three months; for the first two months I was basically exploring things, also related to neural network architecture, but didn't discover anything worth mentioning.</b>

<b>Then suddenly a turning point happened: He Kaiming (main inventor of ResNet) joined FAIR, about halfway through my internship, as a full-time researcher. That was my first time working with Kaiming, my first time learning from him.</b>

<b>And we built a deep friendship then, because it was his first time coming to America; he had many firsts at FAIR. He couldn't drive yet, and everything was unfamiliar, so I'd drive him out to eat and sometimes drive him home. [chuckles] He later learned to drive himself.</b>

<b>He also didn't know how to use Linux, which is also interesting: at Microsoft they could only program on Windows. So I had to teach Kaiming how to use the cluster, how to use Linux. But you'll find that Kaiming is Kaiming, not without reason. Someone like him truly has this kind of</b>

<b>aura, or what I'd call a reality distortion field. That's actually Steve Jobs's term: people around Jobs, influenced by him, would all feel reality had been distorted; things that were completely impossible could gradually actually get done. Kaiming has that kind of magic too. So this was my first time seeing</b>

<b>how a truly top-level researcher does their research.</b> <b>At that point your internship only had one month left. How were you able to build such a deep friendship?</b>

<b>One part was daily life interactions.</b> <b>Why did he choose you?</b>

<b>Why did he communicate with you?</b>

<b>Because I was an intern there, and my manager entrusted me to Kaiming, since I wasn't doing well anyway and hadn't produced anything. When Kaiming came, the manager said: Kaiming, you guide him, bring him into the discussions. There was still a month left, and Kaiming said: why don't we enter the ImageNet Challenge together, just compete in this competition?</b>

<b>And I said, sure, let's compete. Because when Kaiming was at Microsoft, his work came about through competing in ImageNet, building up step by step. So we also took on the ImageNet challenge, and in the process we discovered</b>

<b>that some ideas we'd thought of before were actually reasonable, actually very good ideas. I proposed this idea to Kaiming; Kaiming's magic is that he can take very ordinary things and turn them into gold, into valuable ideas. So we did the ResNeXt work, which was also our</b>

<b>solution submitted for the ImageNet challenge. We got second place, not first, but I think ours was actually the most effective and should have been first: the first-place entry was an ensemble combining previous algorithms through model ensembling, while ours was a completely new framework.</b>

<b>I think what ResNeXt wanted to convey is how, by modifying the neural network architecture, we learn a more scalable, more extensible representation. This is very interesting because the idea is very, very simple: originally, ResNet is just a serial network,</b>

<b>just layer after layer of conv layers; now I can expand it in parallel into several different groups, each group with its own small network, so you have many small networks distributed in parallel within a large network. Why is this interesting? Because in today's terms, this is MoE (Mixture of Experts).</b>

<b>So at least on ImageNet at the time, we already saw a kind of scaling behavior: the more groups you have, the sparser your neural network becomes, and the sparser it is, the wider it can be; at the same FLOPs, the same computation level, you get better results. It converges faster, and the final results also improve.</b>
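The same-FLOPs trade-off he describes can be checked with back-of-the-envelope arithmetic. This is my illustrative sketch, not from the interview; the `conv_flops` helper and the channel counts are made up, counting only multiply-accumulates per spatial position:

```python
def conv_flops(c_in, c_out, k, groups=1):
    """Multiply-accumulates per spatial position for a 2D conv layer.

    With `groups` parallel branches, each branch maps c_in/groups input
    channels to c_out/groups output channels with a k x k kernel.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

# A dense 3x3 layer with 64 input and 64 output channels:
dense = conv_flops(64, 64, 3)            # 64 * 64 * 9 = 36864

# Splitting the layer into G parallel groups divides the cost by G...
assert conv_flops(64, 64, 3, groups=4) == dense // 4

# ...so at a fixed budget the layer can be made wider: with 4 groups,
# doubling the width costs exactly the same as the dense layer.
assert conv_flops(128, 128, 3, groups=4) == dense
```

Under this accounting, more groups means sparser connectivity that buys extra width at a fixed compute budget, which is the scaling behavior described above and, loosely, the same trade MoE layers make across experts.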

<b>I think this resonates with what people are doing with MoE today; it aligns very well.</b> <b>Does this work count as an extension of Kaiming's ResNet?</b>

<b>Yes. So why is it called ResNeXt? Kaiming said: this is Xie's ResNet, so the X is both "next", the next-generation ResNet, and also a way of giving me some credit. Kaiming is someone very good at naming things,</b>

<b>at naming papers; many later papers were actually named by him for us.</b> <b>Would he hide people's names in them?</b>

<b>Not really, not every time, but it was a clever touch. I think this is also part of his research taste.</b> <b>Then why was your name hidden in it?</b>

<b>I actually don't know; I never asked him.</b> <b>How long had you been working together at that point?</b>

<b>Did your internship get extended?</b>

<b>All of this happened in that one month. And this kind of thing is countless: many of my best works follow the same rhythm, starting out unable to produce anything, then a sudden burst of inspiration at the end, and then converging. Research is never a linear development; or rather, linearly developing research is never good research.</b>

<b>Much of our work is actually non-linear; I can tell you more stories later. Anyway, that time was with Kaiming, and then that period ended.</b> <b>But your friendship continued, right?</b>

<b>I think so. And that was Meta, a productive internship.</b> <b>And at Google?</b>

<b>Google also went pretty well, because I started to learn how video works. Each of these internships was a different topic from what I'd done before, which led to my final dissertation: on the surface scattered, but I was still able to find a way to connect them,</b>

<b>and I'll tell you how shortly.</b> <b>Good.</b> <b>Anyway, at Google I studied what a neural network architecture and training process for video should look like; that was also quite rewarding.</b> <b>I have a question: you worked so well with Kaiming at Meta, and he's a very famous AI researcher. Why didn't you stay and continue collaborating with him?</b>

<b>Many people might have made that choice; why did you keep going to other places to explore?</b> <b>This was actually Kaiming's suggestion. He would advise everyone to intern at different places; that's the only way to maximize your gains. So back then, me and Wang Xiaolong had each done one internship, and of course we all wanted to stay,</b>

<b>but Kaiming said: go check out other places, maybe there will be different gains.</b> <b>But after your PhD you returned to Meta.</b> <b>Yes. And after finishing the Google internship I immediately went to intern at DeepMind; that experience was very enlightening for me.</b> <b>At that time DeepMind wasn't yet part of Google? Had it not been acquired yet?</b>

<b>No, it had already been acquired, but they were two different organizations, because DeepMind was only in London. During that time I did some RL-related research, because I really didn't know how this thing worked and wanted to go and see. And it was very painful. That period was winter, and London winters are very cold. I still remember very clearly:</b>

<b>I'd get off the London Underground after working until maybe 10 or 11 at night, and the biting cold wind mixed with rain would hit my face; clothes and hat couldn't block it, as I walked step by step back to my tiny room, the temporary dorm. It was actually quite hard. But that period, I think, was also very enlightening.</b>

<b>First, it made me realize I didn't really enjoy doing RL (reinforcement learning) research, or rather, robotics-related research. At that time the RL work was in virtual, simulated environments, doing embodied agent tasks. But I think my bigger gain actually came from</b>

<b>the understanding I built up then of how an organization like DeepMind works. I thought, wow, this place is so different from everywhere I'd been. They had a very different management model: for example, many PMs coordinating different research teams and the operations between them, and these different working groups</b>

<b>where everyone still had many bottom-up ideas, but it wasn't purely bottom-up or purely top-down: it was a staged, hierarchical mode. It started with purely exploratory ideas, where everyone could have their own small group doing early studies, and then, once something took shape,</b>

<b>it would immediately enter a more top-down, more organized management mode. I find this very, very interesting. Thinking back now, and I've mentioned this on Twitter before: Demis also met with many interns. A meeting was organized, and someone asked him: what exactly is DeepMind's mission? What do you ultimately</b>

<b>want to become as a company? Demis's answer: DeepMind will ultimately become a company that can win multiple Nobel Prizes. The key point: multiple. We all said back then, wow, that's so ambitious; isn't that a bit far-fetched, they're just doing AI. But now we see</b>

<b>they have already achieved at least one step of that, and I think it's truly very, very admirable. Actually the entire AlphaFold team was gradually coming together during my internship. I could see which people were doing these things, and at the beginning some interns were also participating in the process, watching step by step how it went from an exploratory idea</b>

<b>to gradually becoming organized and focused on execution, step by step, until it ultimately changed the world completely.</b> <b>The organization question we'll discuss in detail later. I'm wondering: did you do too many internships, so that you didn't get any more best papers afterward?</b> <b>I think that might be the case,</b>

<b>or rather, what I did was maybe too much, too scattered.</b> <b>Which year of your PhD did you start internships?</b>

<b>From the first year.</b> <b>Oh, from the first year. So the two were always intertwined.</b> <b>Right. So I think you're very right: my timeline was disrupted, and it does lose some focus. But this was also a design of my own. So, coming back to how to connect all these things: my doctoral dissertation title is</b>

<b>Deep Representation Learning with Induced Structural Priors, roughly: using structural priors to guide how we learn better deep representations. Many, many years have passed, but I find what I'm doing now is still this.</b>

<b>And at a conference this November or December there was a workshop titled Representation Learning with Induced Structural Priors, roughly the same topic of structural priors and representation. I gave a talk there, and at the end I said: actually, over the past 12 years, your workshop topic,</b>

<b>though still a frontier now, is one we're discussing with somewhat different meaning; but this was also the problem I wanted to study at the very beginning, and one I feel is still not fully solved. So on one hand, my PhD timeline was a bit fragmented, because I was doing different things in different places. But on the other hand, if you want to tackle</b>

<b>representation learning as a topic, this is unavoidable, because it's like planting a tree: your representation is the root, and after the tree grows it needs different branches. Each branch is a different downstream application. So I've done image recognition, image segmentation,</b>

<b>edge detection, video recognition, action recognition, and even later some embodied RL-related tasks. Doing all these things, the problems I saw were all branches on the tree, not roots. So what you said may be right; I hadn't considered whether I would have had more best papers. [chuckles] But I hope to plant more of this tree</b>

<b>and put down deeper roots, rather than going further out on the branches. And I think, again, this is the core of deep learning: representation learning is basically equivalent to deep learning.</b> <b>Can you explain to everyone what representation learning is?</b>

<b>Good question. I think the reason I like saying I'm someone who does representation learning is that it's still hard to define. Mathematically, you can think of it as: you have data x, and you want to map it to a space, and this space might have some properties,</b>

<b>good properties that may make it easier to achieve better results on downstream tasks. So what you want to learn is the mapping function from the raw data to this well-propertied space; that is what's called representation learning.</b>
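His definition can be sketched in a few lines of plain Python. This is my toy sketch, not from the interview; the `encode` helper, the layer sizes, and the weights are made-up placeholders, not learned:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    # One affine map: W is a list of weight rows, b the bias vector.
    return [sum(w * x for w, x in zip(row, v)) + bi
            for row, bi in zip(W, b)]

def encode(x, layers):
    # The representation function f(x): a hierarchy of nonlinear maps,
    # each layer re-describing the output of the previous one.
    h = x
    for W, b in layers:
        h = relu(linear(h, W, b))
    return h

# Toy 2-layer encoder mapping a 3-d input to a 2-d representation.
layers = [
    ([[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),  # 3 -> 2
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),            # 2 -> 2
]
z = encode([2.0, 1.0, 0.5], layers)  # a point in representation space
```

A downstream task would then operate on z rather than on the raw x; "good properties" means, for example, that classes become easier to separate in this space.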

<b>And this function is not just a simple mapping; it might be hierarchical. It can be implemented in different ways; the mainstream implementation now is a non-linear neural network. So that's one definition. But as I said, I'm willing to say I myself am someone who does representation learning</b>

<b>is because I think this is a timeless title, because this field develops too fast. Let me give an example, maybe a very, very negative one: in the past, maybe just after I finished my PhD, something was very hot called NAS (Neural Architecture Search),</b>

<b>neural architecture search. There is broad consensus in the field that this topic wasted about two years of the entire field. It was a wrong direction; everyone went down it, publishing thousands of papers, and ultimately got nothing out of it. So why do I say representation learning is a good</b>

<b>title, or why am I willing to tell everyone I do representation learning? Because it is a fundamental problem. If you say, I am someone doing Neural Architecture Search, that becomes very problematic: after 2 years you might have to change fields immediately, update your website, delete the sentence "My research direction is Neural Architecture Search"</b>

<b>and replace it with the next, fancier term.</b> <b>It is not a timeless theme.</b> <b>It is not a timeless theme. Representation is a timeless theme, the most fundamental one, and one that has not yet been solved. Ah, I may have talked about my PhD a bit too long. [chuckles] But I still want to say that during my PhD I also experienced setbacks.</b>

<b>For example, our initial Deeply Supervised Nets paper. We first submitted it to NeurIPS and got a pretty high score, something like 8-8-6 or 8-8-7, but it was ultimately rejected. That was a blow to me: I found, wow, publishing a paper is actually this hard. Even with very good reviews,</b>

<b>it was still rejected, for some ridiculous reasons.</b>

<b>What was so ridiculous?</b>

<b>The ridiculous reason was that we had a mathematical formula in the paper that should have had a squared term, and we made a typo: we left the square out.</b>

<b>Didn't write it.</b>

<b>It was purely a typo,</b> <b>very easy to fix.</b>

<b>But the PC, the Program Chair, the person responsible for the conference, said this makes your math invalid, it's an error.</b>

<b>And during the rebuttal,</b> <b>when responding to the reviewers,</b> <b>the reviewers didn't see it,</b> <b>so unfortunately</b> <b>there was no way to fix it.</b>

<b>So at that point there was nothing more we could do.</b> <b>Now it seems unimaginable.</b>

<b>First of all,</b> <b>nowadays perhaps</b> <b>people don't check the formulas in papers anymore.</b>

<b>Second,</b> <b>I think people have become relatively more tolerant.</b>

<b>Back then,</b> <b>people were extremely nitpicky about details.</b>

<b>Yeah right.</b>

<b>But it's fine.</b>

<b>We ended up submitting to AISTATS</b> <b>— another conference —</b> <b>a machine learning conference.</b>

<b>And that paper</b> <b>won their Test of Time Award last year.</b>

<b>The Test of Time Award.</b>

<b>So I think</b> <b>After all this time.</b>

<b>Right.</b>

<b>Because all Test of Time Awards evaluate</b> <b>things 10 years later —</b> <b>at the 10-year mark,</b> <b>among all papers published 10 years ago,</b> <b>which paper had the greatest influence</b> <b>on the field.</b>

<b>Right. So I think</b> <b>I suddenly felt at peace again.</b>

<b>I think</b> <b>Research truly is a long-term process.</b>

<b>And so,</b> <b>That's also why</b> <b>I tell many of my students this:</b> <b>And I think</b> <b>don't worry about</b> <b>your wins and losses at every moment.</b>

<b>Or, to describe it mathematically,</b> <b>don't worry about a point estimate.</b>

<b>Don't, on this timeline,</b> <b>at every point,</b> <b>evaluate whether you're doing well or not.</b>

<b>Because all evaluations</b> <b>are ultimately an integral.</b>

<b>You need the accumulation of time.</b>

<b>In the end, look —</b> <b>everything you've ever done,</b> <b>added together,</b> <b>determines whether you're a good researcher.</b>
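His "integral, not point estimate" advice can be written out loosely in LaTeX (my formalization of his metaphor, not his):

```latex
% Judge a researcher by the accumulated contribution over a career,
% not by the value at any single moment t_0:
\text{evaluation} \;=\; \int_{0}^{T} \text{contribution}(t)\, dt
\qquad \text{not} \qquad \text{contribution}(t_0)
```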

<b>But in that moment,</b> <b>you'll still feel very down.</b>

<b>Very down. Right.</b>

<b>Extremely down.</b>

<b>In that moment it's hard to think about 10 years later.</b>

<b>Hard to think about what happens 10 years from now.</b>

<b>Mm.</b>

<b>When you finished your PhD,</b> <b>what expectations did you have for your life?</b>

<b>You had published some good papers,</b> <b>you had 5 internship experiences,</b> <b>did you think you should go into research</b> <b>or into industry?</b>

<b>Did you make that choice?</b>

<b>I was never very confident back then.</b>

<b>At that time I never even considered a faculty position.</b>

<b>Because I thought I didn't deserve it.</b>

<b>[laughter] Because</b> <b>Why did you feel unworthy at every moment?</b>

<b>It's a bit better now.</b>

<b>But uh,</b> <b>Maybe that's a bit of an exaggeration.</b>

<b>It's not that I really felt unworthy.</b>

<b>But compared to my peers,</b> <b>they were on the established track,</b> <b>like I said,</b> <b>moving step by step toward a good faculty position.</b>

<b>That path.</b>

<b>I felt I wasn't on that path.</b>

<b>Oh.</b>

<b>Or rather,</b> <b>What you just said makes a lot of sense.</b>

<b>If your final destination</b> <b>was really a faculty position,</b> <b>at least at that point in time,</b> <b>you shouldn't have gone to 5 places</b> <b>for 5 internships,</b> <b>working on 5 different projects.</b>

<b>That's very unfavorable for</b> <b>finding a faculty position.</b>

<b>If you wanted a faculty position, staying on Kaiming He's team would have let you publish more papers and get more results during that period; it might have been a smoother path toward a definite goal.</b>

<b>I don't know if it was a definite goal.</b>

<b>I really think it's quite mysterious.</b>

<b>All these decisions came down to:</b> <b>I only thought about where I should go</b> <b>to do what I most wanted to do,</b> <b>ideally with the people I most wanted to work with.</b>

<b>Working together.</b>

<b>I think</b> <b>This idea is actually very, very simple.</b>

<b>So when job hunting back then,</b> <b>actually I</b> <b>I was looking everywhere.</b>

<b>There were quite a few offers from major companies.</b>

<b>Right.</b>

<b>and</b> <b>I've talked before about my OpenAI interview experience.</b>

<b>It was actually pretty cool.</b>

<b>Basically, I was in a small dark room</b> <b>for five or six hours,</b> <b>working on one problem.</b>

<b>When I came out, it was already dark.</b>

<b>Right.</b> <b>I found the experience quite fascinating.</b>

<b>It felt quite extraordinary.</b>

<b>But back then actually</b> <b>Who was the interviewer at OpenAI?</b>

<b>John Schulman (OpenAI co-founder, Thinking Machines co-founder and Chief Scientist)</b> <b>Oh right.</b>

<b>Oh right.</b> <b>I saw you wrote about this experience on Zhihu.</b>

<b>Right?</b> <b>Uh, not on Zhihu,</b> <b>it was on Twitter,</b> <b>on X.</b>

<b>on X.</b> <b>Right, Zhihu reposted it.</b>

<b>That's it.</b>

<b>Yes.</b>

<b>His interview questions were on a single A4 sheet of paper, handwritten in pencil, line by line.</b>

<b>I think</b> <b>it really moved me deeply.</b>

<b>I found it so fascinating.</b>

<b>This place is very interesting.</b>

<b>And in the end there was an offer, of course, but I didn't go to OpenAI.</b>

<b>I didn't go to OpenAI.</b>

<b>This is where the timeline</b> <b>— quantum mechanics — starts to diverge.</b>

<b>That was 2018.</b>

<b>So early.</b>

<b>Mm.</b>

<b>So if you had gone to OpenAI, maybe you'd now be part of the LLM world.</b>

<b>Maybe. I don't think so.</b>

<b>I don't know.</b>

<b>I don't know.</b>

<b>Don't know what would have happened.</b>

<b>Back then I didn't even think about it.</b>

<b>I just wanted to go to FAIR.</b>

<b>If FAIR gave me the offer,</b> <b>I would definitely go.</b>

<b>Your reason for wanting to go to FAIR was Kaiming?</b>

<b>Uh right.</b>

<b>Kaiming, Piotr Dollar,</b> <b>Ross Girshick.</b>

<b>Ross Girshick.</b> <b>The so-called</b> <b>the three pillars of computer vision back then.</b>

<b>They weren't that senior, not university professors or anything like that; they were all young to mid-career</b> <b>researchers.</b>

<b>researchers.</b> <b>But the absolute top three.</b>

<b>Right, they were there.</b>

<b>And the research they were doing was</b> <b>the absolute top-tier computer vision research.</b>

<b>So for me,</b> <b>there was no choice to make.</b>

<b>So it was kind of fun back then.</b>

<b>Here's the thing —</b> <b>Ilya (Ilya Sutskever, SSI founder and CEO, OpenAI co-founder and former Chief Scientist)</b> <b>called me, and I said almost nothing,</b> <b>and I rejected OpenAI.</b>

<b>They sent me an offer,</b> <b>and I said I'm not going, sorry.</b>

<b>What did Ilya say on the call?</b>

<b>Uh, he was very angry.</b>

<b>He asked me,</b> <b>"Why didn't you even discuss it</b> <b>before rejecting the offer?"</b>

<b>"Is the money not enough?"</b>

<b>How much was it?</b>

<b>Uh,</b> <b>I don't remember exactly.</b>

<b>It was actually very, very low.</b>

<b>Maybe uh,</b> <b>probably in the hundreds of thousands.</b>

<b>Back then, around 2018, the pay for a top PhD graduate would be roughly $400K to $500K</b> <b>dollars.</b>

<b>Dollars. Right.</b>

<b>And now it's at least tripled.</b>

<b>But anyway,</b> <b>at that time</b> <b>OpenAI was at that level too,</b> <b>which was fine.</b>

<b>Right. And then</b>

<b>But Ilya was very angry.</b>

<b>So I</b> <b>I could only give vague responses</b> <b>and told him</b> <b>that I couldn't go.</b>

<b>And at that time indeed</b> <b>What did he say when he was angry?</b>

<b>Uh, not much actually.</b>

<b>His tone was just very stern.</b>

<b>Why did he decide to make this call?</b>

<b>I don't know.</b>

<b>That shows he really cared about recruiting.</b>

<b>He had never been rejected before.</b>

<b>Uh no.</b>

<b>I don't think that's the case.</b>

<b>In 2018,</b> <b>I think he was probably often rejected.</b>

<b>Because FAIR at that time</b> <b>— not just in Vision —</b> <b>in many areas,</b> <b>for the top PhD graduates,</b> <b>FAIR was more certain than OpenAI,</b> <b>more open,</b> <b>more like an academic environment.</b>

<b>Such an institution.</b>

<b>I think, at least at that time,</b> <b>everyone around me,</b> <b>if given that choice,</b> <b>unless</b> <b>they really wanted to do what OpenAI was already doing,</b> <b>the things OpenAI excelled at,</b> <b>I think most people would still lean toward FAIR.</b>

<b>Did you get the FAIR offer smoothly?</b>

<b>Uh, not that smoothly.</b>

<b>I think it was also quite</b> <b>rocky all the way.</b>

<b>When you rejected OpenAI,</b> <b>was it because you already had the FAIR offer?</b>

<b>Yes right.</b>

<b>But at FAIR, I gave a talk — and I had no experience at all. Everyone at my stage seemed quite experienced at job hunting, while I knew nothing.</b>

<b>So I gave a talk,</b> <b>and uh,</b> <b>the talk was scheduled for one hour.</b>

<b>Normally you'd speak for 45 to 50 minutes</b> <b>with 10 minutes for questions.</b>

<b>But I finished in 30 minutes.</b>

<b>Done.</b>

<b>Everyone looked at each other,</b> <b>not knowing what to do.</b>

<b>Of course, many of the researchers there graciously asked a lot of questions, so the time was somehow stretched to 45 minutes.</b>

<b>It wasn't too awkward.</b>

<b>Later Kaiming told me</b> <b>that everyone thought this was</b> <b>first, very unconventional.</b>

<b>How could you finish so fast?</b>

<b>Second, maybe interviews should all be like this — a 30-minute talk works fine, and saves everyone's time.</b>

<b>So many times</b> <b>I've done things</b> <b>without doing them perfectly.</b>

<b>Hmm, why did you finish so quickly?</b>

<b>Why didn't you follow the rules?</b>

<b>I didn't know there was a rule.</b>

<b>Oh.</b>

<b>Didn't read it.</b>

<b>Uh, I didn't know about this rule.</b>

<b>Because this is actually a job-talk convention.</b>

<b>Nobody told me this rule.</b>

<b>Right, people just said,</b> <b>"There's a talk starting at 11,"</b> <b>but this is actually an established convention</b> <b>because that's how academic interviews work.</b>

<b>And FAIR back then was actually an academic institution.</b>

<b>Mm.</b>

<b>It was really like a university.</b>

<b>Its operating model was like a PI</b> <b>leading a group of young people —</b> <b>whether interns</b> <b>or newly joined members —</b> <b>working together.</b>

<b>And when I joined FAIR, I was probably among the first few — I'm not sure — Chen Xinlei was probably the first, and I was probably the second fresh PhD graduate who could join FAIR.</b>

<b>At first they didn't recruit new PhD graduates.</b>

<b>If you were just a PhD graduate,</b> <b>they didn't want you.</b>

<b>They would only recruit people like Kaiming,</b> <b>who had already done very impressive work,</b> <b>those kinds of researchers.</b>

<b>Mm. Right.</b>

<b>So I was also quite lucky. Right.</b>

<b>Mm.</b>

<b>I think FAIR</b> <b>really was the holy temple at that time.</b>

<b>Mm.</b>

<b>And so,</b> <b>I didn't agonize much over</b> <b>too many other possibilities.</b>

<b>Mm. And then</b>

<b>About the Ilya situation,</b> <b>let me add one more thing.</b>

<b>I've only talked to Ilya on the phone twice.</b>

<b>This was the first time.</b>

<b>We can talk about the second time later.</b>

<b>It was</b> <b>in July 2024,</b> <b>right after he founded SSI.</b>

<b>He emailed me and asked</b> <b>if I'd be willing to come work together.</b>

<b>And you rejected him again.</b>

<b>Uh right.</b>

<b>Why this time?</b>

<b>This time because I had just started at NYU.</b>

<b>Mm. I think there were several reasons.</b>

<b>When I talked with him,</b> <b>Uh,</b> <b>the main topic we discussed</b> <b>wasn't salary or anything like that.</b>

<b>We didn't talk about any of that.</b>

<b>The main topic was how to give future artificial intelligence the ability to love.</b>

<b>Discussing philosophy.</b>

<b>Of course, I finally asked him</b> <b>one question.</b>

<b>I asked how he viewed multimodality, computer vision, or general perception models — what did he think?</b>

<b>Ilya's response was</b> <b>he felt this was already solved well enough.</b>

<b>Okay, so I thought</b> <b>maybe uh,</b> <b>SSI has its own language-based</b> <b>approach.</b>

<b>And that approach, at least for now, is not the path I want to pursue.</b>

<b>This is your fundamental disagreement —</b> <b>LLM versus vision.</b>

<b>Right. We can talk more about this later.</b>

<b>But I don't actually see this as a disagreement.</b>

<b>I see it as an organism.</b>

<b>Everyone is just in different places,</b> <b>doing different things at different times.</b>

<b>I always like to say,</b> <b>"Brothers climbing a mountain,</b> <b>each making their own effort."</b>

<b>Everyone doing their own thing.</b>

<b>No problem with that at all.</b>

<b>It's not a fight to the death.</b>

<b>LLMs don't conflict with what I want to do.</b>

<b>And without the recent developments in LLMs,</b> <b>there might not have been</b> <b>the current state of computer vision.</b>

<b>Mm.</b>

<b>That topic you discussed —</b> <b>how to give AI the ability to love —</b> <b>did you reach any conclusions?</b>

<b>The conclusion is that this is very important.</b>

<b>Why?</b>

<b>Because without it,</b> <b>we face a very uncertain</b> <b>and very dangerous future.</b>

<b>But with love comes hate.</b>

<b>They're two sides of the same coin.</b>

<b>It can't only have love.</b>

<b>When it learns to love,</b> <b>it will definitely</b> <b>know what the opposite is.</b>

<b>As for me — I completely agree with you.</b>

<b>Mm.</b>

<b>This becomes a philosophical proposition.</b>

<b>Mm.</b>

<b>But let me ask a counter-question.</b>

<b>Why do people trust their own children,</b> <b>trust humans so much,</b> <b>but have such worry and fear</b> <b>about AI, this new</b> <b>form of intelligent entity?</b>

<b>I don't have an answer to that.</b>

<b>But I think there will be technical ways to maintain control.</b>

<b>We can use technical means</b> <b>to make AI more trustworthy in the future,</b> <b>safer,</b> <b>and more controllable.</b>

<b>Mm. Controllable.</b>

<b>And this is also one reason</b> <b>why we need to work on</b> <b>world models.</b>

<b>Why did he want to reach out to you?</b>

<b>Uh, I don't know.</b>

<b>Maybe he reached out to</b> <b>a thousand people,</b> <b>ten thousand people.</b>

<b>I guess. Right.</b>

<b>When we were waiting in line at a restaurant that day,</b> <b>we actually walked through the streets of New York together,</b> <b>and our conversation naturally extended to</b> <b>people who have greatly influenced you.</b>

<b>In what you shared just now,</b> <b>the human factor</b> <b>takes up a very large share of many of your choices.</b>

<b>Why are people so important to you?</b>

<b>And in your personal bio,</b> <b>you clearly listed</b> <b>which collaborators are important to you.</b>

<b>That's very rare.</b>

<b>Why are people so crucial to you?</b>

<b>Is this unusual?</b>

<b>I don't think it's unusual at all.</b>

<b>I think in academic circles, this is a common behavioral pattern.</b>

<b>People organize themselves into</b> <b>these social networks.</b>

<b>Mm. And these people shape your thinking,</b> <b>because they may be your students,</b> <b>they may be your teachers, right?</b>

<b>But teachers don't always teach students.</b>

<b>Sometimes students teach the teachers.</b>

<b>All of this can be true.</b>

<b>So it's a huge graph</b> <b>where everyone is connected.</b>

<b>And I think that's also why research, or science, is especially fascinating.</b>

<b>Mm. Because many times</b> <b>the mutual</b> <b>trust between people,</b> <b>mutual appreciation,</b> <b>mutual feelings —</b> <b>these aren't built through</b> <b>living together</b> <b>and being friends.</b>

<b>Many times it's through scientific discovery,</b> <b>kind of</b> <b>this research aspect, that connections are built.</b>

<b>Relationships between people.</b>

<b>I think this is actually very interesting.</b>

<b>For example, those who deeply influenced me —</b> <b>I may get to know them personally,</b> <b>of course I try to get to know them personally,</b> <b>right, but that's not what matters most to me.</b>

<b>I seem to understand them through their papers,</b> <b>learning their way of thinking.</b>

<b>And I think that's the real meaning of research.</b>

<b>I don't think the purpose of research is to publish papers.</b>

<b>I don't think publishing papers is the goal.</b>

<b>Not at all.</b>

<b>The purpose should be —</b> <b>what is the purpose?</b>

<b>ah,</b> <b>Is it a journey through people?</b>

<b>Kaiming told me what the purpose is:</b>

<b>Mm.</b>

<b>At its core, it means sharing knowledge.</b> <b>The purpose of publishing a paper isn't just for others to see it, but so that after others see it, they have something to build on.</b>

<b>That is, you publish a paper, others understand some of the content, and they feel their own horizons have expanded.</b>

<b>Mm. It's about helping others.</b>

<b>Being helpful to others.</b>

<b>Right. Being able to inspire others,</b> <b>or enlighten others.</b>

<b>Oh, that's the purpose of research.</b>

<b>I think that's the purpose of research.</b>

<b>Or, to put it more romantically — I think this comes from Hannah Arendt (the political philosopher) — she said she doesn't care about impact.</b>

<b>She doesn't care about influence.</b>

<b>Because</b> <b>In researcher circles,</b> <b>people say</b> <b>we publish papers to create some kind of impact,</b> <b>Right?</b>

<b>In my own dictionary, I actually have a bit of an aversion to the word impact.</b>

<b>Aversion.</b>

<b>A bit of an aversion.</b>

<b>Oh.</b>

<b>Uh why?</b>

<b>What is it about it that you resist?</b>

<b>Again, Arendt said that she felt the word "impact" is overly aggressive, overly masculine.</b>

<b>For her, the purpose of doing these things is not to create impact but understanding itself.</b>

<b>If you can understand something,</b> <b>the feeling is wonderful.</b>

<b>If you can write down what you've understood,</b> <b>whether it's an article or a paper,</b> <b>and spread it,</b> <b>then you can</b> <b>potentially allow more people in the world</b> <b>to understand</b> <b>such a question in the same way you do.</b>

<b>And this</b> <b>will be transmitted step by step,</b> <b>creating a kind of resonance.</b>

<b>And Arendt's view is that she would find in this a feeling of family.</b>

<b>She would feel that she understood something,</b> <b>told others,</b> <b>allowed others to understand,</b> <b>which means these people also understood her to some degree.</b>

<b>Mm.</b>

<b>But humans, as social beings,</b> <b>need to be understood.</b>

<b>Right.</b>

<b>She reframed the word "impact" in a very soft way — seeking to be understood.</b>

<b>I think so.</b>

<b>I think so.</b>

<b>You agree more with this view?</b>

<b>I agree with her very much.</b>

<b>Because I think creating impact is fine in itself.</b>

<b>But it's very self-centered.</b>

<b>Mm-hmm.</b>

<b>I'm going to create impact. Mm.</b>

<b>Right. Me-centered.</b>

<b>And yes,</b> <b>you're absolutely right.</b>

<b>I'm going to create this impact,</b> <b>I'm going to change the world,</b> <b>but do the people in this world agree to be changed by me?</b>

<b>[laughs]</b> <b>Or rather, many disasters in the world</b> <b>are because people want to create impact,</b> <b>want to transform the world.</b>

<b>Right.</b>

<b>I would tend to agree with this softer expression.</b>

<b>I think if everyone in this world, through our research, can gain a new layer of understanding, a new layer of knowledge, then the total intelligence on Earth would increase.</b>

<b>And increasing total intelligence on Earth</b> <b>is never wrong.</b>

<b>It's always something beneficial to the world.</b>

<b>Whether it's called impact</b> <b>or being understood by more people.</b>

<b>Do you want to be known and remembered by more people?</b>

<b>Mm. Do you have a need for fame?</b>

<b>I certainly don't have that need.</b>

<b>You don't have that need.</b>

<b>But I really don't have that need.</b>

<b>But really?</b>

<b>Or rather, from where I stand now, I'm actually a victim of a kind of false fame.</b>

<b>Uh, the reason is that people now take some of our papers and post them on Xiaohongshu to discuss — they talk about the so-called top-three conferences and promote the work, right?</b>

<b>I have never once asked any such media outlet to do this kind of promotion.</b>

<b>Mm.</b>

<b>And I tell my students:</b> <b>please don't go on Xiaohongshu</b> <b>or Zhihu</b> <b>to promote your own work.</b>

<b>You can explain your work,</b> <b>you can comment on your work.</b>

<b>That's fine.</b>

<b>Just don't promote yourself.</b>

<b>Why is it okay on X?</b>

<b>I think on X,</b> <b>uh, it's more about</b> <b>how you define promotion.</b>

<b>What I focus on</b> <b>is briefly summarizing things</b> <b>and telling people what it's about.</b>

<b>It's more like attracting people to look at my work,</b> <b>and I think that's fine.</b>

<b>But the promotion I'm referring to</b> <b>is more like the fame you mentioned,</b> <b>because what I really can't accept is</b> <b>people now say "so-and-so's team"</b> <b>published such-and-such</b> <b>work.</b>

<b>Oh.</b>

<b>It reinforces that person — someone's team.</b>

<b>Right. If any editors hear this, I hope they can stop doing this.</b>

<b>Don't write "Xie Saining's team".</b>

<b>Don't put my photo on it.</b>

<b>Don't put my name on it.</b>

<b>We need to encourage young people more —</b> <b>the people who actually did the work,</b> <b>give them more visibility.</b>

<b>Right?</b>

<b>Well, people might think you're the first author.</b>

<b>Uh right.</b>

<b>If I am the first author, that's fine.</b>

<b>But I'm not the first author.</b>

<b>Right?</b>

<b>I'm just the team lead.</b>

<b>And much of this work is done by students.</b>

<b>So what should it be called?</b>

<b>Not "Xie Saining's team".</b>

<b>Just focus on the work itself.</b>

<b>Talk about what problem this solves</b> <b>and why it matters.</b>

<b>That's enough.</b>

<b>Right.</b>

<b>But I think you really hate being used as a target by others.</b>

<b>Is that so?</b>

<b>Uh yes.</b>

<b>Because I think it adds</b> <b>a lot of risk.</b>

<b>I think —</b>

<b>Mm. Tell us about those who influenced you.</b>

<b>We've already talked about a few people.</b>

<b>Kaiming, Professor Tu — anyone else?</b>

<b>Oh yes.</b>

<b>Uh,</b> <b>I think, right,</b> <b>this goes</b> <b>back to FAIR.</b>

<b>We can follow the FAIR thread.</b>

<b>After FAIR,</b> <b>I came to NYU.</b>

<b>I think this was another decision-making point.</b>

<b>Stayed at FAIR for 4 years.</b>

<b>A full 4 years.</b>

<b>Right. OK.</b>

<b>Yes. Yes.</b>

<b>Also with ups and downs.</b>

<b>For me — as I just said, many places I've been actually grew alongside me.</b>

<b>FAIR might be an exception.</b>

<b>When I joined, it was at its peak.</b>

<b>The high point.</b>

<b>Probably the high point.</b>

<b>Right. And then</b>

<b>Right. It's a pity.</b>

<b>What's happening there now.</b>

<b>But I also think —</b>

<b>Mm.</b>

<b>Right. Because I left relatively early, I wasn't there when it was at its lowest point.</b>

<b>Right. [laughs]</b>

<b>I also saw some warning signs.</b>

<b>Right.</b>

<b>OK.</b>

<b>But right.</b>

<b>And I think</b> <b>if I'm talking about people who influenced me,</b> <b>then in this process, when going to NYU,</b> <b>I think</b> <b>that was another quite mysterious decision-making process.</b>

<b>Right. Deciding to go to New York at that time</b> <b>— I just mentioned this —</b> <b>was partly because I might enjoy the city.</b>

<b>But I think another very important thing was that Yann LeCun is here.</b>

<b>Right, Yann is here.</b>

<b>Mm right uh.</b>

<b>Why, with him here,</b> <b>were you willing to go?</b>

<b>You worked together at FAIR.</b>

<b>He likes to say he's recruited me three times, right?</b>

<b>The first time was at FAIR.</b>

<b>But at that time, because he was FAIR's overall director, I didn't work with him directly — though of course I was influenced by him.</b>

<b>Or have you had long-term exchanges?</b>

<b>Yes, we've talked.</b>

<b>Right.</b>

<b>But never directly collaborated.</b>

<b>Mm.</b>

<b>Then going to NYU was the second time.</b>

<b>We can talk about the third time later.</b>

<b>Mm.</b>

<b>And the NYU experience —</b> <b>I think why it matters that he's here</b> <b>is also because</b> <b>I think</b> <b>he's a person with a very strong vision.</b>

<b>Right.</b>

<b>I think many of these decisions were very intuitive.</b>

<b>For example, NYU's building — what we call the Center for Data Science — was actually established under Yann's leadership over ten years ago.</b>

<b>He established this organization.</b>

<b>Right. It's independent of</b> <b>traditional computer science departments</b> <b>or math departments.</b>

<b>It's a new department.</b>

<b>So we have a new building,</b> <b>and the first time I walked into this building,</b> <b>I felt great.</b>

<b>Because everything is glass doors.</b>

<b>Right.</b> <b>I can take you to see it sometime.</b>

<b>All glass doors.</b>

<b>Uh, everything is very, very open.</b>

<b>And it feels a bit like a company for students.</b>

<b>And the color scheme is very nice.</b>

<b>Right, I keep saying I'm a visual person.</b>

<b>There are warm tones in there — an orange floor, various sofas — and, though it's quite chaotic, all kinds of robots running around on the floor, and students sitting and studying on this sofa and that one.</b>

<b>And there's absolutely no privacy —</b> <b>zero privacy.</b>

<b>All the professors' offices have glass doors — you can clearly see everything happening inside.</b>

<b>Mm. Right.</b>

<b>But I thought, wow,</b> <b>this is very interesting.</b>

<b>This environment is very interesting.</b>

<b>Right.</b>

<b>More and more American schools now are making efforts like this, saying they want these kinds of interdisciplinary, cross-disciplinary centers.</b>

<b>Right? Usually these are AI centers, used to attract talent and to bring different departments together, because AI really serves as this middle layer, this connecting identity and position.</b>

<b>Connecting everyone.</b>

<b>Everyone needs it.</b>

<b>Right. Mm. Yeah.</b>

<b>Whether you're doing science, right,</b> <b>doing physics, chemistry,</b> <b>math,</b> <b>statistics, business school,</b> <b>and including computer science,</b> <b>I think AI is a very good</b> <b>middle connecting node.</b>

<b>Mm right.</b>

<b>But Yann's foresight was that more than ten years ago he had already established this.</b>

<b>Mm.</b>

<b>So I think he is quite a visionary person.</b>

<b>Mm. Right. And then</b>

<b>So NYU's positioning in AI is also very good.</b>

<b>So actually, uh, again,</b> <b>I think</b> <b>the computer science department isn't the school's strong suit.</b>

<b>But it has many</b> <b>AI talent reserves.</b>

<b>Right.</b> <b>It has gathered many very impressive AI faculty members.</b>

<b>Right. Mm.</b>

<b>Yann is one reason you chose NYU.</b>

<b>There are also many, many reasons.</b>

<b>He's one of them.</b>

<b>Because he needed to interview me, and he had the final say.</b>

<b>Right. Mm.</b>

<b>Or rather, it was he who chose me.</b>

<b>Mm.</b>

<b>Important people.</b>

<b>Are there others?</b>

<b>Mm. I think there are.</b>

<b>For example, during my time at NYU,</b> <b>I also collaborated with many other professors,</b> <b>and one person who I think influenced me greatly</b> <b>would be Professor Fei-Fei.</b>

<b>Right.</b>

<b>I think Professor Li Fei-Fei —</b> <b>uh, everyone should definitely read the book she wrote.</b>

<b>Right, her autobiography.</b>

<b>Right.</b>

<b>And I've read it too.</b>

<b>But after having deep conversations with her,</b> <b>I gained even more.</b>

<b>Right. Sometimes I would</b> <b>tell her</b> <b>I was facing</b> <b>this difficulty and challenge,</b> <b>and Professor Fei-Fei would tell me earnestly</b> <b>some stories from her past.</b>

<b>Mm. And then</b>

<b>This was actually a great comfort to me.</b>

<b>What kind of stories?</b>

<b>Specific things</b> <b>might not be appropriate to share.</b>

<b>But in short,</b> <b>her journey wasn't smooth sailing at all.</b>

<b>Mm. She also had to wade through many thorns, overcoming obstacles step by step, and now she stands on the world stage — a pride of the Chinese community, a North Star for the entire research field, especially computer vision — letting everyone see what she's thinking and, in some sense, setting new directions.</b>

<b>Right, her influence on me has been enormous.</b>

<b>Mm.</b>

<b>And I think Professor Fei-Fei's greatest strength is that she's someone who can define problems.</b>

<b>Mm. This point is actually not very intuitive.</b>

<b>When people talk about Professor Fei-Fei,</b> <b>her greatest achievement</b> <b>is building ImageNet,</b> <b>this dataset.</b>

<b>But in fact, this isn't just a dataset.</b>

<b>This isn't just data.</b>

<b>It's hard to imagine</b> <b>that back then, right,</b> <b>around 2012 or 2011,</b> <b>image classification wasn't a well-defined problem.</b>

<b>Defining this problem clearly</b> <b>was far more important</b> <b>than building such a dataset —</b> <b>far, far more important.</b>

<b>Mm-hmm.</b>

<b>And I think Professor Fei-Fei</b> <b>set this agenda,</b> <b>defined this problem clearly,</b> <b>so that subsequently</b> <b>Deep Learning could have a playground,</b> <b>have such a platform</b> <b>to showcase its capabilities.</b>

<b>I think this is her greatest achievement, and also what I always want to learn from.</b>

<b>Mm. Right.</b>

<b>So I worked with her on</b> <b>two pieces of work.</b>

<b>One is Thinking in Space, and this paper is mainly about how to better solve this kind of spatial intelligence problem within multimodal foundation models.</b>

<b>Well, recently we have another paper called Cambrian-S,</b> <b>and this paper also addresses</b> <b>questions about video —</b> <b>how do we define problems,</b> <b>which problems are actually important.</b>

<b>Right.</b>

<b>I think this collaboration with her</b> <b>has also helped expand the boundaries of my research.</b>

<b>How did you come to know Professor Fei-Fei well?</b>

<b>Uh, it was all quite serendipitous.</b>

<b>She came to New York on a business trip once,</b> <b>and we had a meal together.</b>

<b>And she told me</b> <b>a lot of things.</b>

<b>Right. And she would often come to New York later,</b> <b>and because she's also starting a company,</b> <b>we would often get together</b> <b>and chat.</b>

<b>and chat.</b> <b>Right, roughly that.</b>

<b>And normally we'd have</b> <b>some research meetings.</b>

<b>Mm. I'm curious about something,</b> <b>and I think many people are curious about this too.</b>

<b>Mm.</b> <b>How did you go from being a very young researcher just starting out in academia to gradually coming to stand alongside these well-known names in AI?</b>

<b>That is,</b> <b>how did you enter the core of AI?</b>

<b>I still don't feel I'm at the core of AI, or that I've even gotten close to it.</b>

<b>Mm. But the people you just mentioned,</b> <b>certainly many people would love to collaborate with them.</b>

<b>Is that so?</b>

<b>Ah, of course.</b>

<b>Right. I think</b>

<b>And look — all of it was serendipity.</b>

<b>With Kaiming it was just happening to be there</b> <b>as an intern and getting him to open up.</b>

<b>And with Professor Fei-Fei,</b> <b>you just had one meal together.</b>

<b>How did you get them to open up to you?</b>

<b>I think this is very hard to do intentionally.</b>

<b>Mm. Or this is a bit mysterious.</b>

<b>You could call it some kind of law of attraction.</b>

<b>Or you could think of it as</b> <b>people whose thoughts align</b> <b>ultimately converging together.</b>

<b>Though you may have countless small streams,</b> <b>in the end, they may all converge into one river.</b>

<b>I think, for example,</b> <b>uh, all the people I've mentioned,</b> <b>at least they're all working on vision.</b>

<b>Or rather, even including Yann, who can be seen as doing general AI — but his starting point, right, was digit recognition, which is also a visual problem.</b>

<b>Right.</b>

<b>I think everyone's foundation</b> <b>is very, very aligned.</b>

<b>So I think</b> <b>I really didn't make these things happen intentionally.</b>

<b>Right.</b>

<b>And many things, I think, don't need to be made to happen intentionally.</b>

<b>Everyone is just collaborating based on these research questions and their understanding of them.</b>

<b>Right.</b>

<b>I would think of it this way.</b>

<b>The thing is that</b> <b>from the outside,</b> <b>I'd see you as someone very goal-oriented</b> <b>and very logical.</b>

<b>But through our conversation just now,</b> <b>I find you're someone whose choices are quite disorderly.</b>

<b>Right?</b>

<b>Right.</b>

<b>I think there's a certain disorder.</b>

<b>Mm. But I think</b> <b>this is also a by-design process.</b>

<b>I choose this disorder.</b>

<b>To use a clichéd phrase: "follow your heart."</b>

<b>Right. But in many cases, there's no way around it.</b>

<b>Many of my choices couldn't truly optimize</b> <b>for a result.</b>

<b>I think this is the source of the disorder.</b>

<b>So in</b> <b>these disorderly choices,</b> <b>can you string together all of your research journey</b> <b>into a single thread?</b>

<b>We've actually already discussed a few works.</b>

<b>Yes. Yes.</b>

<b>Yes right.</b>

<b>I think we can go through it bit by bit.</b>

<b>I think one benefit is I don't have that many papers, so maybe it's relatively easy to string together.</b>

<b>And I think indeed, uh,</b> <b>I can't say there's a hidden thread,</b> <b>but there really is a thread in the background</b> <b>guiding me to keep doing this.</b>

<b>Or rather, before talking about these papers, I want to say: computer vision has developed for such a long time, and I have many friends who are slowly exploring new directions, like robotics or 3D vision.</b>

<b>I'm also trying to expand my boundaries outward.</b>

<b>But looking back, I find that on this main thread — and for me this main thread is representation learning —</b>

<b>Mm.</b>

<b>there are too many unsolved problems. Right.</b>

<b>So I want to stay on this main thread</b> <b>and push forward what we're doing.</b>

<b>So the starting point of all this,</b> <b>if we trace it back,</b> <b>of course involves Deep Learning,</b> <b>involves Deep Neural Networks,</b> <b>the design of these architectures.</b>

<b>I think this part</b> <b>is of course related to representation learning.</b>

<b>Mm. And then</b>

<b>this is also what I think, in the past,</b> <b>everyone has been working toward.</b>

<b>Not just me.</b>

<b>Right. And everyone is doing this — how to design a better architecture so we can learn better representations and better solve problems.</b>

<b>Mm.</b>

<b>Right. And then later on, things start to change.</b>

<b>We find</b> <b>that architecture itself isn't necessarily the most important.</b>

<b>It's definitely important,</b> <b>but not necessarily the most important,</b> <b>or it's not everything.</b>

<b>So there are at least several different things</b> <b>that intertwine.</b>

<b>Right, your architecture is one thing, and your data is also important.</b>

<b>Mm-hmm.</b>

<b>And there's also your objective —</b> <b>your goal is also very important.</b>

<b>Right?</b>

<b>I think architecture determines</b> <b>what you use for training.</b>

<b>We can imagine it as</b> <b>having a massive engine.</b>

<b>And the hardware of this engine</b> <b>is essentially the architecture of a neural network.</b>

<b>Mm.</b>

<b>But having just the engine's architecture</b> <b>is actually useless.</b>

<b>You have no fuel.</b>

<b>You can't start it.</b>

<b>Right. So, uh,</b>

<b>there's the data dimension</b> <b>and there's the objective dimension,</b> <b>the objective function considerations.</b>

<b>And so my subsequent research has followed this main thread — representation learning — advancing around architecture, data, and objective.</b>

<b>Mm-hmm.</b>

<b>And during my full-time work at FAIR, I think one core aspect was that I worked with Kaiming, who was leading some self-supervised learning work.</b>

<b>Right.</b>

<b>And actually, again, now scaling is already a buzzword.</b>

<b>Everybody's talking about scaling.</b>

<b>Mm. Right.</b>

<b>But actually the first person who really told me that we need a scalable model, that we need to make the model bigger and bigger — that was Kaiming, in his exact words.</b>

<b>Bigger and bigger.</b>

<b>Right yes.</b>

<b>Kaiming told me this.</b>

<b>What year did he tell you?</b>

<b>Uh, roughly around 2018 or 2019.</b>

<b>Right. And then</b>

<b>So from the very beginning his conviction was that we must make models bigger, make data bigger, and this would get us a better result.</b>

<b>I think very early on,</b> <b>Kaiming already had this vision.</b>

<b>Mm.</b>

<b>Uh. And then</b>

<b>so we also</b> <b>made some efforts along this path.</b>

<b>And so in the early discussion about self-supervised learning — Yann, too, was a big advocate.</b>

<b>That is, he is very invested in self-supervised learning — he has this classic cake analogy.</b>

<b>This metaphor.</b>

<b>Right, the base layer is</b> <b>the body of the cake,</b> <b>and this part must be Self-Supervised Learning.</b>

<b>On top of that you can have Supervised Learning,</b> <b>right, this is the icing on the cake,</b> <b>the cream on your cake.</b>

<b>And further on top is Reinforcement Learning,</b> <b>it's just the cherry on top,</b> <b>just a little cherry at the very top.</b>

<b>Mm.</b>

<b>Each layer of this cake is actually important,</b> <b>but they're not ranked by importance.</b>

<b>Mm.</b>

<b>If you don't have the cake's base,</b> <b>you can't get to intelligence</b> <b>relying only on the cherry on top.</b>

<b>Mm.</b>

<b>Right. So because we were at FAIR</b> <b>doing vision,</b> <b>we were actually paying attention to this very early.</b>

<b>But the research process went like this: around 2015 and 2016, people already knew that self-supervised learning was the future for vision.</b>

<b>So at that time, people would design all kinds of what we call pretext tasks — proxy objectives.</b>

<b>that is,</b> <b>what is self-supervised learning?</b>

<b>I don't have a label to directly give you,</b> <b>unlike ImageNet,</b> <b>where I have 1000 classes</b> <b>and can directly train</b> <b>a supervised classifier</b> <b>and get a representation this way.</b>

<b>In the old days,</b> <b>this is what everyone was doing.</b>

<b>Through 1000 class labels — and by the way, within these 1000 classes there are 200 different dog breeds.</b>

<b>Even so — this is why ImageNet is so powerful.</b>

<b>Right? Even with that distribution,</b> <b>it can still let</b> <b>our neural networks learn good representations.</b>

<b>I think this is extremely impressive.</b>

<b>But people also see the limitations.</b>

<b>Once everything is just Supervised Learning,</b> <b>there are many things you can't capture.</b>

<b>Mm.</b>

<b>Because what it learns</b> <b>— for example, we're sitting here now,</b> <b>we see these chairs,</b> <b>Right?</b>

<b>and we now have a lot of images of different chairs.</b>

<b>Some chairs might be quite ordinary,</b> <b>chairs in a studio like ours,</b> <b>or chairs in a home,</b> <b>or some designer chairs,</b> <b>right, or like an avocado chair,</b> <b>a chair shaped like an avocado.</b>

<b>For supervised learning, you need to map all of this to a single label — a label called "chair".</b>

<b>So what your network has to learn,</b> <b>this mapping,</b> <b>is actually very, very difficult.</b>

<b>Right.</b>

<b>And it's an infinite mapping.</b>


<b>Mm.</b>

<b>So it can only either memorize — just recite all the chairs it's ever seen — or rely on what we call spurious correlations, false correlations, to tell you it's a chair.</b>

<b>For example, it may not look at the chair itself</b> <b>but look at the background behind the chair,</b> <b>or it thinks</b> <b>all chairs will be next to a table,</b> <b>so it uses that to make a decision boundary</b> <b>and says,</b> <b>hey, this is a chair.</b>

<b>But this is not what we want.</b>

<b>What we want to achieve</b> <b>is, from this very diverse visual knowledge,</b> <b>these visual observations, to gain some kind of common sense,</b> <b>some kind of intuition.</b>

<b>Mm. Intuition.</b>

<b>Right. Or some kind of common understanding.</b>

<b>So this is why people initially wanted to do</b> <b>so-called Self-Supervised Learning</b> <b>or Unsupervised Learning.</b>

<b>A common misconception back then: people would say we want to do unsupervised learning because labeling data is too hard and too expensive.</b>

<b>We need to hire people</b> <b>to label,</b> <b>spending money and time.</b>

<b>We don't want to do that.</b>

<b>But that's just</b> <b>one very small part of the problem.</b>

<b>The bigger issue is that, in the eyes of computer vision researchers, everyone knew long ago that supervised labels alone could never give AI systems this kind of common sense.</b>

<b>So in 2015 and 2016,</b> <b>everyone was very, very creative.</b>

<b>That period</b> <b>was actually a quite creative era.</b>

<b>People would design</b> <b>all kinds of crazy tasks.</b>

<b>These tasks —</b> <b>for example, you take an image,</b> <b>rotate it 90 degrees,</b> <b>or 180 degrees,</b> <b>or 270 degrees.</b>

<b>You don't give these images a label, but because you designed the rotations yourself, the images and their rotation angles form a valid pretext task.</b>

<b>You can predict how these rotated images</b> <b>were actually rotated.</b>

<b>This becomes a so-called</b> <b>proxy task.</b>
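The rotation pretext task described here can be sketched in a few lines: rotate an image by one of four angles and use the angle index itself as a free label. A minimal, illustrative sketch (a toy grid stands in for an image; no actual model is trained):

```python
import random

def rotate90(img):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_example(img):
    """Pick a rotation in {0, 90, 180, 270} degrees.

    The chosen angle index (0-3) becomes the self-supervised label --
    no human annotation is needed, because we created the corruption
    ourselves."""
    k = random.randrange(4)
    rotated = img
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k  # (network input, pretext-task label)

img = [[1, 2],
       [3, 4]]
x, y = make_rotation_example(img)
```

A classifier trained to predict `y` from `x` is then discarded; only the representation it learned along the way is kept.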

<b>Mm.</b>

<b>Similar proxy tasks also include taking an image, converting it to grayscale — removing all its colors — and then using a neural network to reconstruct the original colors.</b>

<b>Essentially, from a grayscale image,</b> <b>how do you predict</b> <b>the color of each object</b> <b>as it should be.</b>

<b>Mm.</b>

<b>And there are other similar examples,</b> <b>too many to count.</b>

<b>One last example — let me give one more.</b>

<b>The so-called Context Encoder —</b> <b>you take an image, cut out a piece in the middle,</b> <b>make it white,</b> <b>and then train a neural network</b> <b>to fill in this empty part.</b>

<b>Fill it in.</b>

<b>Mm.</b>
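The Context Encoder setup reduces to a simple input/target construction: blank out a patch and keep the removed patch as the prediction target. A toy sketch of just that data-preparation step (small integer grids instead of images; the real method trains a conv net on the result):

```python
def mask_center(img, size=1):
    """Blank out a size x size patch in the middle of a square grid.

    Returns (corrupted input, target patch): the network's job is to
    reconstruct the target from the corrupted input."""
    n = len(img)
    start = (n - size) // 2
    corrupted = [row[:] for row in img]  # copy so the original survives
    target = []
    for r in range(start, start + size):
        target.append(img[r][start:start + size])
        for c in range(start, start + size):
            corrupted[r][c] = 0  # the "white" hole
    return corrupted, target

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
x, y = mask_center(img)  # y == [[5]], x has a hole at the center
```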

<b>The rationale behind all these pretext tasks is</b> <b>that</b> <b>humans can actually do this.</b>

<b>The reason humans can do this — the reason humans know whether an image was rotated 90 or 180 degrees, or what color the butterfly or the house in an image should be, or can predict the information missing in the middle — is that humans, based on some understanding of the physical world, have this common sense, so they can guess</b>

<b>these corrupted signals,</b> <b>these already lost signals,</b> <b>how they should be reconstructed.</b>

<b>The masked signals.</b>

<b>Right.</b>

<b>But back then the problem was a hundred flowers blooming —</b> <b>all kinds of papers,</b>

<b>Mm.</b> <b>but none of them worked well.</b>

<b>All the results were actually quite poor,</b> <b>all worse than ImageNet pre-training,</b> <b>by roughly 15-20 percentage points.</b>

<b>Percentage points.</b>

<b>So people were making some progress, moving forward step by step, but the representation that supervised ImageNet pre-training produced — learned on large-scale data, with labels — was still far, far better.</b>

<b>Right?</b>

<b>So uh,</b> <b>we did something at that time,</b> <b>and this was done together with Kaiming.</b>

<b>And this architecture is called MoCo,</b>

<b>Mm.</b> <b>Momentum Contrast,</b> <b>momentum contrastive learning.</b>

<b>Right.</b>

<b>Even the Chinese name sounds interesting.</b>

<b>Right yes.</b>

<b>Yes, momentum contrastive learning.</b>

<b>Uh, I think</b> <b>you don't need to dig into</b> <b>the specific technical details.</b>

<b>Because now</b> <b>much of it is no longer important.</b>

<b>But in short, it was the first paper to take contrastive learning as a framework and make it actually work.</b>

<b>And what is contrastive learning?</b>

<b>Also quite simple.</b>

<b>We're now in this Representation Space,</b> <b>in this representation space,</b> <b>there are different points.</b>

<b>These points may be the same object</b> <b>or completely different objects.</b>

<b>For example,</b> <b>I have several images of this chair,</b> <b>Right?</b>

<b>and also some that may be tables, or images of cats or dogs.</b>

<b>These images are all different,</b> <b>but in this space,</b> <b>we can measure their distances.</b>

<b>Or we know</b> <b>all these different chairs —</b> <b>their images should be closer,</b> <b>their representations should be closer.</b>

<b>But a chair and a cat</b> <b>should be farther apart.</b>

<b>Mm-hmm.</b>

<b>So this is the basic</b> <b>logic of contrastive learning.</b>
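This pull-together / push-apart logic is what the InfoNCE-style contrastive loss formalizes: the query embedding should score high against its positive (another view of the same object) and low against negatives. A minimal pure-Python sketch with toy 2-D embeddings (not the actual MoCo implementation, which adds a momentum-updated key encoder and a queue of negatives):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def info_nce(query, positive, negatives, temperature=0.1):
    """Contrastive loss: low when the query sits close to its positive
    and far from every negative in representation space."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(t - m) for t in logits]
    return -math.log(exps[0] / sum(exps))  # positive sits at index 0

q    = [1.0, 0.0]                  # one view of a chair
pos  = [0.9, 0.1]                  # another view of the same chair
negs = [[0.0, 1.0], [-1.0, 0.0]]   # a cat, a table
good = info_nce(q, pos, negs)                # positive nearby: small loss
bad  = info_nce(q, negs[0], [pos, negs[1]])  # wrong positive: large loss
```

Minimizing this loss over many images is what pulls the different chair views together and pushes chairs away from cats.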

<b>And this</b> <b>is actually not new.</b>

<b>It's been done for many, many years.</b>

<b>By the way, this early work was actually first done by Yann together with his students.</b>

<b>That's very interesting.</b>

<b>Of course the problem being solved wasn't directly representation learning but some metric learning problems. But that's okay.</b>

<b>This was around 2019,</b> <b>I think we gave contrastive learning</b> <b>some new meaning.</b>

<b>But this didn't come out of nowhere.</b>

<b>Actually before that,</b> <b>the entire field was slowly moving in this direction,</b> <b>expanding.</b>

<b>For example, there was a paper called CPC, and another paper called Memory Bank.</b>

<b>These two papers had already taken several steps in this direction — using contrastive learning for self-supervised learning.</b>

<b>Right, and then</b> <b>this is</b> <b>where I can't help but admire Kaiming's ability.</b>

<b>I think this was also a moment that made me think, wow, what a top-tier researcher — or rather, "top-tier researcher" isn't even enough.</b>

<b>Kaiming in my heart</b> <b>is simply the best researcher.</b>

<b>How does he actually work day-to-day?</b>

<b>Mm okay.</b>

<b>I think there are several points.</b>

<b>Maybe we can briefly talk about it.</b>

<b>that is,</b> <b>I think he has a kind of extreme focus.</b>

<b>This focus gives him a kind of flow state — he can immerse himself in a problem without needing to consider what's happening in the rest of the world.</b>

<b>Mm.</b>

<b>And I find this particularly</b> <b>particularly admirable.</b>

<b>And another thing — how does his focus manifest?</b>

<b>I think his focus shows in that</b>

<b>Mm.</b> <b>every day, apart from this one problem,</b> <b>he won't think about anything else.</b>

<b>He'll grab the people collaborating with him</b> <b>to talk about it,</b> <b>and grab other people to talk about it too.</b>

<b>In any case, this topic is the main subject</b> <b>of his thinking.</b>

<b>Oh.</b>

<b>And most of his mental cycles</b> <b>are allocated to</b> <b>this one specific problem.</b>

<b>Oh.</b>

<b>This is very difficult.</b>

<b>I think it's extremely, extremely hard.</b>

<b>Right because</b> <b>thoughts are often very hard to control.</b>

<b>Yes yes yes.</b>

<b>Ah right.</b>

<b>This is related to world models.</b>

<b>Thoughts are hard to control.</b>

<b>That's a good point.</b>

<b>But Kaiming is actually someone very</b> <b>capable of this kind of focused decision-making,</b> <b>able to concentrate.</b>

<b>Mm.</b>

<b>I actually think there are several points.</b>

<b>I think a top researcher</b> <b>needs this ability to varying degrees.</b>

<b>They need sufficient focus,</b> <b>they need good</b> <b>research taste.</b>

<b>How do you define that?</b>

<b>We can talk about it later.</b>

<b>Mm.</b>

<b>And they also need a certain steadfastness —</b> <b>you can't just go with the flow</b> <b>and</b> <b>do what others are interested in.</b>

<b>And of course</b> <b>you also need strong engineering skills,</b> <b>research intuition,</b> <b>including when you read literature,</b> <b>you know what's important</b> <b>and what's not.</b>

<b>This is very important.</b>

<b>You also know</b> <b>that this</b> <b>is actually something quite odd</b> <b>about academia.</b>

<b>That is, you have to be able to highlight the key points.</b>

<b>Right.</b>

<b>The main reason is also that people often don't state them clearly.</b>

<b>You know?</b> <b>Sometimes people simply can't articulate the key points,</b> <b>sometimes people are unwilling to state them,</b> <b>and sometimes</b> <b>people haven't realized what the key points are.</b>

<b>But Kaiming's ability is</b> <b>he can peel away the layers</b> <b>and extract these key points,</b> <b>then tell you,</b> <b>and establish</b> <b>these connections in this high-dimensional abstract space.</b>


<b>Oh.</b>

<b>I find this extremely, extremely impressive.</b>

<b>Right. So</b>

<b>many times</b> <b>each of Kaiming's ideas</b> <b>didn't come from sitting in some corner somewhere,</b> <b>dreaming them up at home.</b>

<b>They actually come</b> <b>from constant exploration,</b> <b>extensive reading,</b> <b>extensive thinking,</b> <b>derived little by little.</b>

<b>And this, I think, truly and deeply influenced the way I do research, and what I now tell my students about how research should be done.</b>

<b>It's about increasing input.</b>

<b>Increasing input.</b>

<b>And</b> <b>I think</b> <b>there's actually a paradigm here.</b>

<b>Mm, and this paradigm is also something Kaiming taught me.</b>

<b>Right, he said you can't just sit there and think up these ideas — because an idea you come up with</b>

<b>Mm.</b> <b>by just thinking is definitely not a good idea.</b>

<b>There are really only a few possibilities.</b>

<b>The first possibility:</b> <b>you're smarter than everyone else in the world,</b> <b>so</b> <b>you come up with an incredibly brilliant idea</b> <b>that no one else can think of.</b>

<b>But I think the probability of this is extremely small.</b>

<b>So there are two more likely possibilities. First: while you're thinking of this idea, 100, 1,000, 10,000 people in the world are thinking the same idea.</b>

<b>So you'll have to compete with them,</b> <b>and your execution speed may not be faster than theirs.</b>

<b>The second possibility:</b> <b>this is a very bad idea</b> <b>that others have already tried many times</b> <b>unsuccessfully.</b>


<b>Mm.</b>

<b>Then you probably don't need to try either.</b>

<b>Mm. So</b>

<b>So I think Kaiming's greatest influence on me is</b> <b>he taught me how to find a research idea.</b>

<b>Mm. How?</b>

<b>I think this is a process of seeking.</b>

<b>So now, when new students come in, I tell everyone about a research cycle.</b>

<b>Uh, of course I hope it could be longer,</b> <b>but in today's competitive environment,</b> <b>there might be at most 6 months.</b>

<b>That is, at the start of the 6 months you begin thinking about an idea, and by the end you need to have written that idea into a paper and published it.</b>

<b>This whole cycle is about 6 months.</b>

<b>What does this process look like?</b>

<b>You need to have a general direction,</b> <b>you need to know what you want to do.</b>

<b>You can't know nothing at all — just saying "I want to do research" isn't enough.</b>

<b>This can come from talking with your advisor or your peers, discussing with your classmates, or from your own reading — developing some general sense of direction.</b>

<b>Mm right?</b>

<b>But</b> <b>you must give yourself enough time and space</b> <b>to explore.</b>

<b>And this exploration phase, I think, should last at least one to two months.</b>

<b>What should you do during the exploration phase?</b>

<b>The exploration phase —</b> <b>good question. What do you do during exploration?</b>


<b>You can't just sit there thinking.</b>

<b>What exploration means is constantly hacking on things — you really have to be like a hacker, playing with things, messing around with things.</b>

<b>Treat research like a game,</b> <b>like a toy to play with.</b>

<b>Mm, this might involve, for example,</b> <b>working through formulas,</b> <b>reading more papers,</b> <b>finding some connections,</b> <b>of course,</b> <b>and perhaps more importantly, actually doing things,</b> <b>writing code.</b>

<b>But when you're writing code, note that the code you write is not your initial starting idea or direction — it's an exploration process.</b>

<b>So the code you write</b> <b>might simply reproduce a baseline,</b> <b>take what someone else's paper is doing</b> <b>and reproduce it.</b>

<b>And it might also extend that baseline in some way.</b>

<b>Mm.</b>

<b>And the most important thing in all this</b> <b>is to find a signal.</b>

<b>That is — a bit like what you just said — this whole decision-making process is actually a quite disorderly exploration.</b>

<b>It's what we call stochastic gradient descent.</b>

<b>Right?</b>

<b>This is a cornerstone of all machine learning,</b> <b>but it equally applies to research itself</b> <b>and to our lives.</b>

<b>That is, in pursuing their ultimate goal, everyone is going through a stochastic gradient descent process.</b>

<b>Mm.</b>

<b>And I think research is the same.</b>

<b>For you,</b> <b>the most important thing in research</b> <b>is not going from point A to point B.</b>

<b>For example, A is an idea,</b> <b>B is a paper,</b> <b>but rather in this process,</b> <b>what kind of signal can you find?</b>

<b>Your gradient,</b> <b>where exactly is your gradient?</b>

<b>Right. So</b>

<b>Kaiming's view is</b> <b>this gradient itself</b> <b>is the source of your real idea.</b>

<b>You go through constant exploration, trying many things — some unsuccessful, some successful. And by the way, it doesn't have to be a successful experiment to give you this gradient.</b>

<b>Sometimes a failed experiment</b> <b>gives you a larger gradient.</b>

<b>Right?</b>

<b>The most feared thing is not knowing which direction to go.</b>

<b>Mm.</b>

<b>So a good result,</b> <b>a bad result,</b> <b>are both good results.</b>

<b>For research, a surprise — a surprising observation — is always the most joyful thing for a researcher.</b>

<b>Something unexpected that you observed.</b>

<b>Right.</b> <b>You saw something unexpected.</b>

<b>Mm.</b>

<b>So he said: the ideas you discover in this process, after this kind of exploration, are the ones that are truly your own.</b>

<b>The idea you started with isn't your idea.</b>

<b>That thing doesn't belong to you.</b>

<b>The idea found in exploration is your own idea.</b>

<b>And the research process</b> <b>is about finding</b> <b>your own idea.</b>

<b>But that phrase — you have to see that this thing is truly your own.</b>

<b>Like heaven gave you an inspiration,</b> <b>injected it into your head.</b>

<b>Right, on one hand heaven gives you inspiration,</b> <b>on the other hand,</b> <b>it's also based on extensive empirical work and practice.</b>

<b>Right?</b>

<b>There's no free lunch here.</b>

<b>Maybe you're truly a genius,</b> <b>or maybe you're extremely lucky,</b> <b>God holding your hand</b> <b>wrote this formula.</b>

<b>It can happen.</b>

<b>But most of the time, most progress,</b> <b>even most work that has great</b> <b>influence on the field,</b> <b>I think still happens step by step.</b>

<b>You can always trace back</b> <b>to find its starting point.</b>

<b>So I also tell students</b> <b>what's actually the worst kind of research?</b>

<b>It's when you define a problem at the start,</b> <b>say this is my idea,</b> <b>and in the end publish a paper</b> <b>whose idea</b> <b>is exactly the same as what you started with.</b>

<b>You didn't encounter any obstacles,</b> <b>you didn't encounter any difficulties.</b>

<b>Why is it the worst?</b>

<b>Because this shows</b> <b>your idea is a boring idea,</b> <b>and your published paper is a boring paper.</b>

<b>Right.</b>

<b>I think</b> <b>after many years of observation,</b> <b>this is indeed very, very accurate.</b>

<b>So I think this is also why</b> <b>I tell students this —</b> <b>because</b> <b>people sometimes can't accept this fact.</b>

<b>People always think</b> <b>I should start by thinking of a clever trick,</b> <b>then implement it,</b> <b>make it work,</b> <b>publish a paper,</b> <b>I've succeeded,</b> <b>and I move on to the next thing.</b>

<b>But what this can give for personal accumulation</b> <b>is actually very, very limited.</b>

<b>The exploration process is actually very difficult.</b>

<b>Many people don't know how to explore.</b>

<b>Exploration is very hard.</b>

<b>And this is why</b> <b>all these papers in my view are nonlinear.</b>

<b>This nonlinearity shows in two aspects.</b>

<b>The first is your 6 months of time —</b> <b>by the 5th month,</b> <b>like I just told you,</b> <b>your mindset collapses.</b>

<b>This ResNeXt story —</b> <b>on one hand people hear, wow,</b> <b>you changed direction in the last month</b> <b>and made it work.</b>

<b>That time period is so short,</b> <b>and you still managed to do it.</b>

<b>It sounds unbelievable.</b>

<b>But once you know this happens too often,</b> <b>you find there really is a pattern.</b>

<b>You often go through this.</b>

<b>I often go through this.</b>

<b>Or rather, my best work always happens this way.</b>

<b>So how do you maintain your mindset for the first 5 months?</b>

<b>Uh, there's no way around it.</b>

<b>You have to accept this fact,</b> <b>you have to be able to tell yourself</b> <b>this is a normal research process.</b>

<b>Would you consider switching direction in the first 5 months?</b>

<b>I might go for that boring idea.</b>

<b>I think you would.</b>

<b>And</b> <b>changing direction is actually very, very important.</b>

<b>You must learn to pivot.</b>

<b>Because I just said,</b> <b>the worst work is</b> <b>when your starting idea is the same idea</b> <b>as your ending idea.</b>

<b>The best work is</b> <b>when you've gone all around,</b> <b>jumping here and there,</b> <b>taken a long, winding road,</b> <b>and only then arrived at this point.</b>

<b>Mm.</b>

<b>Though this road is very bumpy,</b> <b>from the final destination</b> <b>step by step</b> <b>you can always trace back to the very beginning.</b>

<b>Only then can it be connected into a line.</b>

<b>But during the process, you can't.</b>

<b>Yes — during the process you can't, because you don't know, you can't predict the future.</b>

<b>So this is always an exploration process.</b>

<b>So I think: about two months of exploration, gradually forming an idea, then expanding, then scaling up.</b>

<b>Then supplementing experiments sufficiently might take another two to three months, and finally writing the paper another one to two months — that is already a very smooth research process.</b>

<b>Mm.</b>

<b>And I think</b> <b>this again,</b> <b>in today's era,</b> <b>faces many, many challenges.</b>

<b>People face all kinds of pressure.</b>

<b>Right? I think the competitive pressure now is too great.</b>


<b>And I think it makes people feel they must chase the cutting edge, finish things as soon as possible, seize the opportunity.</b>

<b>Mm.</b>

<b>Claim the territory.</b>

<b>But looking back — as I just said, Professor Fei-Fei's greatest strength is that she's someone who can define problems. If you lose the ability to define problems, you essentially lose much of the ability to innovate, and essentially the ability to do research.</b>

<b>I just said research is nonlinear — that was in terms of time.</b>

<b>But in terms of results,</b> <b>it's also nonlinear.</b>

<b>Mm.</b>

<b>This actually comes from MIT professor Bill Freeman — he has a very classic plot, an illustration.</b>

<b>He often talks about it when giving talks.</b>

<b>This graphic has a horizontal axis and a vertical axis.</b>

<b>The horizontal axis runs from a very poor work, to a decent work, to a very good work, to an exceptionally impressive work.</b>

<b>This is the horizontal axis.</b>

<b>The vertical axis</b> <b>is the impact on your entire career.</b>

<b>The impact of this paper on your career.</b>

<b>So you can guess</b> <b>what this curve actually looks like.</b>

<b>Right? It's not a linear curve.</b>

<b>It's not that a very poor work has a very bad career impact and the best or a fairly good work gives you a very good return, increasing gradually in between.</b>

<b>It's not linear.</b>


<b>It's saying, basically, that a very poor work actually won't hurt you much — nobody cares.</b>

<b>Mm.</b>

<b>No one will notice.</b>

<b>A decent work —</b> <b>no one notices either.</b>

<b>The gains it brings you are also small.</b>

<b>Mm.</b>

<b>But sometimes,</b> <b>when you produce a very good piece of work,</b> <b>an exceptionally impressive work,</b> <b>work that everyone knows about,</b> <b>your impact</b> <b>— I said I don't like the word impact —</b> <b>reaches the top.</b>

<b>It immediately shoots up to the top.</b>

<b>Right?</b>

<b>So we often say in academia that what people measure is the so-called signature work.</b>

<b>Or another way to put it: what you optimize for is not an average — not the average of all your previous work.</b>

<b>What you're optimizing is the maximum of your work.</b>

<b>Right, the highest point.</b>

<b>I think this illustrates</b> <b>the research game's</b> <b>nonlinear characteristic.</b>

<b>Mm.</b>

<b>So is the highest point good or not?</b>

<b>Of course it's good!</b>

<b>That is, you only need to succeed once in your lifetime.</b>

<b>I actually gave a talk about this at CVPR — I called it "research: the infinite game."</b>

<b>Mm right?</b>

<b>This</b> <b>got quite a strong response from everyone.</b>

<b>I think actually</b> <b>I rarely give these non-technical</b> <b>talks,</b> <b>because this is more about philosophical thinking</b> <b>and some summaries.</b>

<b>That one was actually quite good.</b>

<b>But it also contained everything I talked about above.</b>

<b>Because think about it,</b> <b>research as a</b> <b>career,</b> <b>a researcher as a</b> <b>profession,</b> <b>what is its</b> <b>true essence?</b>

<b>Oh.</b>

<b>It's not like being a chess player — not even a Winter Olympics athlete.</b>

<b>Because for a chess player and an athlete,</b> <b>your final achievement depends on your worst step</b> <b>to some extent.</b>

<b>You have to ensure every step,</b> <b>your moves must be correct.</b>

<b>If you make even a small mistake in the middle — a small error in chess, one piece placed wrong — you've lost.</b>

<b>You've lost.</b>

<b>Right?</b>

<b>So this is a finite game.</b>

<b>In this process,</b> <b>there are always winners</b> <b>and always losers.</b>

<b>But a researcher is more like an inventor: in your lifetime, you truly only need to succeed once.</b>

<b>Mm.</b>

<b>If you're lucky enough,</b> <b>you can succeed a few times.</b>

<b>Twice maybe. But you don't need to succeed 100 times.</b>

<b>Two times gets you to the top?</b>

<b>I think</b> <b>I think so.</b>

<b>Oh.</b>

<b>So I think</b> <b>this is actually quite interesting.</b>

<b>so</b> <b>I think</b> <b>as the entire field moves forward,</b> <b>there needs to be some reflection.</b>

<b>I think the traditional academic world — whether in its social responsibility or its positioning in the entire research landscape — was always the one setting the rules of the game, always the one deciding where we go next.</b>

<b>Right?</b>

<b>Now it's completely different.</b>

<b>Now the ones deciding where things go</b> <b>are OpenAI,</b> <b>ah,</b> <b>maybe Google,</b> <b>or Meta or other major companies.</b>

<b>Right, they're playing a finite game —</b> <b>they're playing a finite game against each other.</b>

<b>But this has dragged academia into their finite game, into this kind of decision-making chain.</b>

<b>Right?</b>

<b>So you see, many times when a major company releases something — an o-series, a GPT series, the Nano Banana series, a specific piece of work, a product launch — immediately everyone in academia swarms in, asking: within this paradigm, with what you'd call peanuts for resources,</b>

<b>Mm.</b>

<b>try to chase it?</b>

<b>Oh chasing.</b>

<b>What's the point?</b>

<b>Reproduce right?</b>

<b>Or maybe — right, as you said — people don't believe they can catch up anyway.</b>

<b>So it becomes reproduction in a sense, or building on top of it. I think this kind of research process is actually very, very painful.</b>

<b>Because there's one more thing I haven't mentioned.</b>

<b>For the past two years at NYU,</b> <b>I've actually also been working part-time at Google.</b>

<b>Mm.</b>

<b>Working part-time.</b>

<b>And this was on the Nano Banana team, right, the team within GenAI.</b>

<b>This went on for two years.</b>

<b>Not sure if I should share this, but let's share it. I sometimes tell friends that the reason I took this job at Google was to see what people at Google were doing, so I would know what not to do in academia.</b>

<b>Oh.</b>

<b>That is, I need to know what you're doing,</b> <b>so I know what not to do.</b>

<b>Because if I know you're doing this,</b> <b>why would I do it alongside you?</b>

<b>Makes sense.</b>

<b>Because they have more resources.</b>

<b>it has more resources.</b>

<b>No need to compete with them.</b>

<b>Yes yes yes.</b>

<b>So this is also something that guides us.</b>

<b>Right, I don't want to be too preachy.</b>

<b>By the way, a disclaimer: everything I've said is based only on my experience at NYU, which hasn't been particularly successful; I'm just sharing some experience.</b>

<b>It doesn't represent the diversity</b> <b>and complexity of research worldwide.</b>

<b>And looking back, there are some papers I do want to share with everyone, but I haven't produced a paper that I truly think has real value.</b>

<b>You're saying this to tell everyone</b> <b>I haven't reached the highest point yet,</b> <b>I haven't reached that Max yet.</b>

<b>You're right.</b>

<b>I'm still young.</b>

<b>[laughs]</b> <b>I can still work harder.</b>

<b>Mm.</b>

<b>But it really is like this.</b>

<b>Because yesterday I was thinking about this question.</b>

<b>I think there might be about twenty such papers, twenty-something papers, that have profoundly influenced all of deep learning and the progress of AI.</b>

<b>If this world has 20 or 25 such papers, and I don't have a single one.</b>

<b>What reason do I have not to keep working hard,</b> <b>to keep going?</b>

<b>I think this is a goal.</b>

<b>Doesn't DiT count?</b>

<b>Uh, I think it counts as 0.25.</b>

<b>Or DiT</b> <b>is more like</b> <b>pushing along the tangent of the research frontier,</b> <b>taking a small step forward.</b>

<b>If we didn't do it,</b> <b>someone else would have.</b>

<b>It doesn't completely belong to you.</b>

<b>Right, it doesn't.</b>

<b>Completely belong to me.</b>

<b>Mm.</b>

<b>You're right.</b>

<b>Yes.</b>

<b>Yes.</b>

<b>But these... or rather, I think the Diffusion Model certainly counts, including maybe DDPM.</b>

<b>Right.</b>

<b>and</b> <b>I don't know.</b>

<b>Maybe we can list some.</b>

<b>I think this might be quite interesting.</b>

<b>I think LeNet counts.</b>

<b>I might</b> <b>not be able to list them all.</b>

<b>Okay, let's just list some.</b>

<b>Papers that have influenced AI's progress, right?</b>

<b>Right.</b>

<b>Or rather, in my view, these are things that can truly be called signature works, works that I'm still very far from.</b>

<b>Right?</b> <b>I think</b> <b>ah,</b> <b>LeNet of course counts.</b>

<b>AlexNet of course counts.</b>

<b>Mm, and then</b> <b>ImageNet of course counts.</b>

<b>ResNet of course counts.</b>

<b>Mm.</b>

<b>R-CNN or Faster R-CNN, the detection part,</b> <b>of course counts.</b>

<b>Kaiming's already on there several times.</b>

<b>and</b> <b>What else?</b>

<b>What else?</b> <b>Transformer of course counts.</b>

<b>Attention is all you need,</b> <b>of course counts.</b>

<b>GPT-3 of course counts.</b>

<b>BERT of course counts.</b>

<b>I think CLIP counts too.</b>

<b>ViT I think counts too.</b>

<b>Vision Transformer,</b> <b>I think counts too.</b>

<b>And GAN,</b> <b>I think counts too.</b>

<b>Okay,</b> <b>can't list them all.</b>

<b>Roughly at that level.</b>

<b>Including in 3D,</b> <b>NeRF (Neural Radiance Field),</b> <b>Gaussian Splatting,</b> <b>I think both count.</b>

<b>They all count.</b>

<b>so</b> <b>Across different fields.</b>

<b>They all have these works.</b>

<b>The significance of these works is that everyone was gradually moving in one direction, and then suddenly a paper like this appears out of nowhere, completely changing the stochastic gradient descent process I just described.</b>

<b>So you see its convergence curve has a sudden drop.</b>

<b>Mm.</b>

<b>This is how I define this.</b>

<b>And I think, if the long river of history means this curve keeps extending forward, then time and again such papers appear, letting everyone break out of a previous local optimum or enter the next stage.</b>

<b>But I think we're still far from done.</b>

<b>This path is far from convergence.</b>

<b>I think there are still many things to be done.</b>

<b>I think it doesn't need to be me personally, but at least I hope to be able to participate.</b>

<b>Right. I hope that, assuming there's a next revolution, when we look back,</b>

<b>maybe it's not that I created some singular impact, but that through my personal experience, the patterns of collaboration around me, my own understanding and thinking, I was able to understand certain things, and what I understood somehow influenced the world's, or AI's, development.</b>

<b>Mm.</b>

<b>I think</b> <b>this is something I care very much about now.</b>

<b>Mm.</b>

<b>Is there no hope from LLMs for this?</b>

<b>The next revolution.</b>

<b>Again,</b> <b>I think absolutely not.</b>

<b>No hope?</b>

<b>Or, I would say, LLMs will eventually fade.</b>

<b>No no no.</b>

<b>LLMs</b> <b>will never die,</b> <b>but will eventually fade.</b>

<b>Old soldiers never die,</b> <b>they just fade away.</b>

<b>Right?</b>

<b>Why will they eventually fade?</b>

<b>They won't die.</b>

<b>They will just fade away.</b>

<b>That is, it will definitely have its value,</b> <b>it's a very good tool.</b>

<b>I use LLMs every day now.</b>

<b>But it's not the foundation for building a universal,</b> <b>a general intelligence system.</b>

<b>It's not the foundation on which a world model gets built.</b>

<b>World model,</b> <b>we'll talk about it later.</b>

<b>Your work —</b> <b>do you want to expand on it?</b>

<b>You've already...</b> <b>Let me say a bit more.</b>

<b>Is there time?</b>

<b>Yes. You've already said you haven't reached Max.</b>

<b>Yes yes right.</b>

<b>Put that way,</b> <b>it seems there's nothing much to talk about with these works.</b>

<b>But I think there's still some significance.</b>

<b>Because, just like I said about nonlinear research, in a paper we first do some things, gradually build up some reserves, and then in the last month find a new direction and deliver the final result.</b>

<b>Mm. I think,</b>

<b>When I look at all my previous work,</b> <b>I also have this feeling:</b> <b>I'm still in that initial confused exploration phase.</b>

<b>But who knows —</b> <b>maybe this year,</b> <b>maybe next year,</b> <b>maybe</b> <b>I suddenly</b> <b>right, have a spiritual awakening,</b> <b>and can produce some more meaningful work.</b>

<b>Mm-hmm.</b>

<b>But I think the foundation here is, as I just said, that it needs to string together into a thread.</b>

<b>Or rather,</b> <b>it's actually not a line,</b> <b>it's a graph.</b>

<b>It has different nodes connected to each other; each node is a paper, with connections between them.</b>

<b>Your subsequent papers</b> <b>are all influenced by all the previous papers.</b>

<b>Mm right.</b>

<b>So later, for example, we made Contrastive Learning work: for the first time in visual tasks we saw work like MoCo, and we had V1, V2, and V3, right?</b>

<b>And in V3 we used the Transformer and scaled up, and the representations were actually already better than what ImageNet pretraining could give, across all kinds of tasks.</b>

<b>This for us was</b> <b>actually a major surprise.</b>

<b>Mm.</b>

<b>Mm-hmm.</b>

<b>At that time,</b> <b>at that point,</b> <b>I thought, wow,</b> <b>everything is flourishing again.</b>

<b>Our problem can basically be answered.</b>

<b>We found a way —</b> <b>self-supervised learning —</b> <b>that can work.</b>

<b>Going forward,</b> <b>we just need to scale up what we're doing now,</b> <b>and</b> <b>the future is incredibly bright.</b>

<b>But unfortunately,</b> <b>this also didn't happen.</b>

<b>Right?</b>

<b>But before that, we had another paper. By the way, MoCo and MAE were both projects Kaiming led.</b>

<b>Actually, people say</b> <b>what does it mean to lead a project?</b>

<b>I think Kaiming truly demonstrated this leadership: he took on 80-90% of the responsibilities of both the first author and the last, corresponding author.</b>

<b>He needed to write the baseline himself,</b> <b>run many, many experiments himself,</b> <b>finalize the paper himself,</b> <b>tell the story, present it,</b> <b>all of these things</b> <b>basically Kaiming did single-handedly.</b>

<b>And accomplished it.</b>

<b>So what about others?</b>

<b>Others,</b> <b>we</b> <b>of course also participated</b> <b>and made contributions.</b>

<b>But I'm just saying</b> <b>this is a path Kaiming led.</b>

<b>Right we</b> <b>accelerated the progress of this,</b> <b>and may have made the results much better too.</b>

<b>Mm.</b>

<b>But it doesn't change the essence of this.</b>

<b>Right.</b>

<b>So this is Kaiming.</b>

<b>Even now, just a couple of days ago he told me he really enjoys this kind of IC role, the individual contributor type of role.</b>

<b>Mm.</b>

<b>He doesn't enjoy managing a large team,</b> <b>getting everyone together,</b> <b>just being a manager pointing the direction.</b>

<b>He doesn't like that.</b>

<b>How many people does he manage now?</b>

<b>He has many, many people.</b>

<b>He now has many undergraduates</b> <b>visiting him,</b> <b>and he</b> <b>is also doing a lot of really great work.</b>

<b>So I actually don't believe him.</b>

<b>I tell him,</b> <b>"You're actually a very good manager."</b>

<b>At least for me,</b> <b>even though you never really managed me,</b> <b>just being around you,</b> <b>I could feel my own efficiency improving,</b> <b>feeling like I was getting smarter.</b>

<b>If I were going to have a manager, I'd want one like that, right?</b> <b>One who can empower the people around him to get better.</b>

<b>Right.</b>

<b>I think this is Kaiming.</b>

<b>So MAE —</b> <b>in any case,</b> <b>we explored the Contrastive Learning path,</b> <b>and found</b> <b>it couldn't scale up.</b>

<b>So we wanted to switch directions.</b>

<b>So we went back to a simpler approach, a kind of denoising autoencoder: the Masked Autoencoder (MAE).</b>

<b>This method is even simpler.</b>

<b>Everyone can go read the paper, but in short: you take some images, corrupt them, and then learn representations by reconstructing these noisy, cropped, or masked images.</b>

<b>Mm.</b>
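As an aside, the recipe just described (hide most of an image's patches, then score reconstruction only on the hidden ones) can be sketched in a toy form. Everything here is illustrative: the 16 "patches" are random vectors, and the prediction is just the mean of the visible patches standing in for MAE's actual ViT encoder and decoder.

```python
import random

random.seed(0)

def random_mask(num_patches, mask_ratio=0.75):
    """Split patch indices into visible and masked sets, MAE-style
    (MAE famously hides around 75% of the image)."""
    idx = list(range(num_patches))
    random.shuffle(idx)
    n_masked = int(num_patches * mask_ratio)
    return idx[n_masked:], idx[:n_masked]  # visible, masked

# Toy "image": 16 patches, each an 8-dim vector.
patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
visible, masked = random_mask(16)

# Stand-in predictor: the mean of the visible patches (a real MAE runs
# a ViT encoder on the visible patches and a lightweight decoder on the rest).
pred = [sum(patches[i][d] for i in visible) / len(visible) for d in range(8)]

# Key detail: the reconstruction loss is computed ONLY on the masked
# patches, which is what makes the task non-trivial.
loss = sum((pred[d] - patches[i][d]) ** 2
           for i in masked for d in range(8)) / (len(masked) * 8)
print(f"masked {len(masked)}/16 patches, reconstruction MSE = {loss:.3f}")
```

The high mask ratio is the design choice that matters: with most of the image hidden, trivial interpolation is not enough, and the model has to learn something about image content.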

<b>This is fundamentally different from Contrastive Learning, but its results were also very good, though with very different characteristics.</b>

<b>For example, it doesn't explicitly model invariance to certain transformations, which makes it perform slightly worse under linear probing but much better under end-to-end fine-tuning; those are two different ways to test representations. In any case, the two have different properties, the representations they learn also look different, and these things would have far-reaching consequences down the line.</b>
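The two evaluation protocols mentioned here can be made concrete with a tiny synthetic sketch. All of it is invented for illustration: the data, the frozen random "backbone", and the hyperparameters. Linear probing trains only a linear classifier on frozen features; as a crude stand-in for end-to-end fine-tuning, the head below is simply given the raw inputs, as if the backbone were free to adapt completely to the task.

```python
import math
import random

random.seed(0)

# Synthetic task (invented): 10-dim inputs, label from a simple linear rule.
X = [[random.gauss(0, 1) for _ in range(10)] for _ in range(200)]
y = [1.0 if x[0] + 0.5 * x[1] > 0 else 0.0 for x in X]

# A frozen random "backbone" mapping 10 inputs to 6 tanh features.
W = [[random.gauss(0, 0.3) for _ in range(6)] for _ in range(10)]

def backbone(x):
    return [math.tanh(sum(x[i] * W[i][j] for i in range(10))) for j in range(6)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))  # clamped for safety

def train_head(feats, labels, lr=0.5, steps=300):
    """Train a logistic-regression head by gradient descent;
    return training accuracy."""
    d = len(feats[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for f, t in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
            for j in range(d):
                grad[j] += (p - t) * f[j]
        for j in range(d):
            w[j] -= lr * grad[j] / len(labels)
    hits = sum((sigmoid(sum(wi * fi for wi, fi in zip(w, f))) > 0.5) == (t == 1.0)
               for f, t in zip(feats, labels))
    return hits / len(labels)

# Linear probing: the backbone stays frozen, only the linear head is trained.
probe_acc = train_head([backbone(x) for x in X], y)

# Crude stand-in for fine-tuning: the head sees the raw inputs directly.
finetune_acc = train_head(X, y)

print(f"linear probe acc = {probe_acc:.2f}, 'fine-tune' acc = {finetune_acc:.2f}")
```

The point of the contrast: a probe can only use whatever structure the frozen features happen to expose, so the same representation can rank very differently under the two protocols, which is exactly the MAE-versus-contrastive story above.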

<b>We can talk more about this later, but that was MAE. At the time we thought: wow, MAE is incredible; MAE should at least win a best paper award, right?</b>

<b>Turns out it didn't. And scaling up MAE would solve all problems, right?</b>

<b>Turned out it didn't scale up either. Actually, I heard you and Xiangyu (chief scientist at StepFun) had talked about this before, because he also paid attention to self-supervised learning and talked a lot about why it can't scale up. I won't go into the reasons again here; feel free to relisten to that episode. But anyway, in short, back then there was this kind of</b>

<b>rollercoaster ride: on one hand we got some really good results, but on the other hand these papers were just papers. We were never able to truly deliver something real, like GPT, that could point everything toward a completely different, scalable paradigm for the future. I think this whole thing had, at that point, kind of come to a close.</b>

<b>Of course, at that time I also did some other work. For example, I extended self-supervised learning, for arguably the first time, into the 3D domain, and did some work on point clouds, called PointContrast. But these works were perhaps more about demonstrating that representation learning as a concept is not just an image-domain problem; it's a very universal approach,</b>

<b>or rather a methodology: it doesn't only work on images, it also works in 3D space. Later on, many people tried it on all kinds of medical imaging and on robotics tasks, all kinds of domains, and it holds up. So I don't see this as a failure, because it really has been influencing many, many fields beyond the ones we were focused on,</b>

<b>like computer vision itself. But on the other hand, it still hasn't achieved the same kind of impact as LLMs.</b> <b>Mm.</b> <b>So then, after all that, what came next?</b>

<b>Right, yeah. It seems like we went back to an exploration phase.</b> <b>All of this was at FAIR?</b> <b>All done at FAIR.</b> <b>You were there for 4 years during that phase.</b> <b>4 years.</b> <b>Mm.</b> <b>So was that the end of your FAIR chapter?</b>

<b>Not yet, still early. That was probably the first year or two. There's another fun story; let me brag about Kaiming again. [laughter] Back then, resources were always an issue, GPUs were always in short supply, and FAIR made a decision to give TPUs a try, to see if they were any good. Google had been using them; they</b>

<b>had fully transitioned to TPUs. So we got about 5,000 TPU chips, not bought, more like rented on Google Cloud. It was originally set up for the people doing language models; they played around with it and quickly found it way too hard to use, really not user-friendly. So Kaiming stepped up and said: let me handle it.</b>

<b>So he truly, single-handedly, I mean, again, all on his own, from start to finish, built an entire infrastructure on TPUs, which enabled all the subsequent work: MoCo, MAE, the later DiT. All of it happened on top of TPUs.</b>

<b>For me, this was a really important lesson. How to summarize it...</b>

<b>It's like the saying: a craftsman who wants to do good work must first sharpen his tools.</b> <b>Mm.</b> <b>One thing Kaiming taught me was that the ceiling of your research actually depends on how good your baseline is.</b> <b>Oh.</b> <b>Because if your baseline is weak, you can easily fool yourself; you won't produce anything meaningful. If you haven't put enough thought into the baseline,</b>

<b>into building the system properly, into pushing the engineering to its limits, you don't have a platform for real exploration. You might find an interesting, seemingly valuable signal, but that signal could be completely wrong, because your baseline, your benchmark itself, wasn't good enough.</b> <b>Mm.</b> <b>This is actually quite counterintuitive, because people always say,</b>

<b>if my baseline is a bit weaker, then the performance gains I can show will be larger, so it's easier for me to publish papers. But Kaiming doesn't think this way.</b> <b>Mm.</b> <b>He thinks about how to push the baseline as high as it can go, and then, starting from that foundation, whatever new things we build are groundbreaking work, a genuine breakthrough. Anything built on top of a weak baseline, any improvement,</b>

<b>might just be a throwaway paper. So this has also been an inspiration to me. Take when they were working on detection; I wasn't part of that work, I was still doing my PhD, but all of those, Fast R-CNN, Mask R-CNN, Focal Loss, that whole series of work, happened because they, including Ross Girshick, including Kaiming,</b>

<b>including Wu Yuxin, who is now at Kimi, put enormous effort into building the infra and that codebase, so that the baselines for these methods already far exceeded all the random mediocre CVPR papers.</b> <b>Mm.</b> <b>Our baseline was already stronger than yours, so if I take one more step up, of course I'm going to get even further.</b> <b>Mm.</b>

<b>I think I've always maintained this kind of methodology, and I place a lot of importance on it. I don't want to call it engineering, because it's not entirely about the codebase itself; it's not like building a codebase at a product company. It's more like the scaffolding for a research breakthrough.</b>

<b>If your scaffolding is unstable, you can't build anything. So this also influences what we do now. But anyway, the point is, Kaiming was truly exceptional at building this scaffolding too.</b> <b>I think you were so lucky, because very early on someone told you a lot of the right ways to do things, so in many areas you avoided a lot of wrong turns.</b> <b>I think I was incredibly lucky.</b>

<b>But I also hope... though I think a lot of this really is, on one hand, common sense; but as you said, on the other hand, for a student this might not be so obvious, not so apparent.</b> <b>Mm.</b> <b>Like with this scaffolding thing: when we were at FAIR, there was a running joke, sort of. The story goes that the first lesson for everyone interning at FAIR,</b>

<b>guess what it was?</b>

<b>mm</b> <b>the first lesson</b> <b>was to use a certain tool</b> <b>guess what that tool was?</b>

<b>No idea.</b> <b>That tool was an Excel spreadsheet. [chuckles] This is also quite interesting. We had a whole system for tracking experiments. Of course, this might be a bit outdated now, because these days there are better tools, like Feishu, many better tools. But back then we would meticulously</b>

<b>build this kind of template, and the template was just an Excel file. So sometimes we felt like office clerks: doing research every day, but not with a screen full of code writing fancy stuff; instead, staring at this spreadsheet, this Excel file, looking at what each row represents. The research part is how you design the spreadsheet: how do you make sure</b>

<b>every experiment gives you what I just called a gradient. Because you can always hit two extremes. One extreme is running too few experiments, so your signal is unclear and you don't know anything. The other extreme is not caring at all what experiments you're running, just running them blindly: I have all these resources, so I max out my resources, run all the jobs,</b>

<b>dump all the results, throw everything into the spreadsheet, and then feel satisfied, thinking my research is done. Both of these are pretty poor patterns for a student's research.</b> <b>Mm.</b> <b>But back then, by watching how Kaiming built that kind of spreadsheet, I learned an enormous amount, because you really have to make some decisions:</b>

<b>what metrics to actually focus on, what to record, what columns there should be, how to define the control variables, and how to make each experiment as informative as possible.</b> <b>Mm.</b> <b>Okay, so let's move on. What else happened at FAIR? There's also the DiT story, right, but let's not jump to that yet; let's continue the FAIR story.</b>
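The one-variable-at-a-time discipline described above can be sketched as a minimal version of such a spreadsheet (the knobs and values below are invented): start from a baseline row, and let each additional row change exactly one variable, so any metric difference has a single candidate cause.

```python
import csv
import io

# Baseline config plus one-variable-at-a-time ablations (values invented).
baseline = {"lr": 1e-3, "batch": 256, "aug": "basic", "epochs": 100}
ablations = [
    {"lr": 3e-4},        # only the learning rate moves
    {"batch": 1024},     # only the batch size moves
    {"aug": "strong"},   # only the augmentation moves
]

rows = [dict(baseline, run="baseline")]
for change in ablations:
    (key,) = change  # enforce: exactly one controlled variable per row
    rows.append(dict(baseline, **change, run=f"ablate-{key}"))

# Render as the kind of spreadsheet the FAIR story is about.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["run", "lr", "batch", "aug", "epochs"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The `(key,) = change` unpacking is the design point: it fails loudly if a row tries to vary two knobs at once, which is precisely the discipline that makes each experiment's "gradient" interpretable.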

<b>So after the self-supervised learning phase, you entered an exploration phase again.</b> <b>Right. At that time, like I mentioned, there's no real transition; these things all overlap. I may be doing one thing while also exploring something else. And what I was most interested in at that time was generative models. Generative models were a big topic then.</b>

<b>GAN was already quite mature by then, and VAE and various other things were also starting to emerge. Then there was a paper, back in maybe 2021 or 2022: the DDPM paper, the Denoising Diffusion Probabilistic Model.</b> <b>Mm.</b> <b>This paper was very interesting to me, because at the time the image quality actually wasn't that impressive yet; I think it was about on par with GAN,</b>

<b>or even a bit worse. But in terms of sample diversity it was much better than GAN, because GAN always has this mode collapse problem: it tends to just generate one kind of image. This thing was able to generate much more diverse content. So I thought there might be something here, but it wasn't clear enough yet. Then we had a meeting in the group and discussed this paper,</b>

<b>and Kaiming also said he thought this was interesting, something worth pursuing. But he had one question, and I still remember it to this day. He asked: have you thought carefully about whether this is a discriminative model or a generative model?</b>

<b>Mm.</b> <b>I think this is very profound, because the essence is: you're doing denoising, and denoising is essentially discriminative prediction. But at the same time, through multiple steps of denoising, you're also doing generation. So the interesting question Kaiming raised was: in the end, is this thing a discriminative model</b>

<b>or a generative model?</b>

<b>and what does this boundary mean?</b>

<b>Mm.</b> <b>I thought this was a very deep question, because in the end, what diffusion models are capable of completely blurs this boundary: they can do generation, discrimination, representation learning, all kinds of things. So I think it's a fairly profound question. Based on this question, we did a lot of exploration at the time.</b>
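Kaiming's question can be made concrete in one dimension, where the data distribution is a Gaussian and the optimal denoiser E[x0 | xt] has a closed form, so no network is needed (the noise schedule and all numbers below are invented for illustration). Each step is a purely discriminative prediction, yet chaining fifty of them turns pure noise into samples from the data distribution.

```python
import math
import random
import statistics

random.seed(0)

# Toy data distribution: x0 ~ N(mu, s2).
mu, s2 = 4.0, 1.0
T = 50
# abar[t] = fraction of signal variance kept at step t (abar[0] = 1: clean).
abar = [1.0] + [0.98 - (0.98 - 0.001) * i / (T - 1) for i in range(T)]
a = [math.sqrt(v) for v in abar]
b = [math.sqrt(1.0 - v) for v in abar]

def denoise(xt, t):
    """One discriminative-style prediction: the posterior mean E[x0 | xt].
    Exact for Gaussian data, so it plays the role of a trained denoiser."""
    return mu + (a[t] * s2 / (a[t] ** 2 * s2 + b[t] ** 2)) * (xt - a[t] * mu)

# Deterministic (DDIM-style) reverse process: prediction after prediction,
# composed T times, becomes generation.
samples = []
for _ in range(2000):
    x = random.gauss(0, 1)                    # start at t = T: pure noise
    for t in range(T, 0, -1):
        x0_hat = denoise(x, t)                # the "discriminative" step
        eps_hat = (x - a[t] * x0_hat) / b[t]  # implied noise estimate
        x = a[t - 1] * x0_hat + b[t - 1] * eps_hat
    samples.append(x)

mean, std = statistics.fmean(samples), statistics.pstdev(samples)
print(f"generated: mean = {mean:.2f}, std = {std:.2f} (target: {mu}, 1.0)")
```

The generated samples land near N(4, 1) even though no single step ever "generates" anything, which is exactly the blurred boundary the question points at.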

<b>For example, we tried using DDPM, or diffusion models, for classification, checking whether the representation it learns is good and how it compares to a self-supervised model.</b> <b>Mm.</b> <b>That was one line of exploration we pursued, and it was interesting. There's also a paper on this that did get published, but not by us;</b>

<b>someone else did it.</b> <b>Mm.</b> <b>But anyway, we did a lot of this kind of exploration.</b> <b>Let's first talk about the process: when did this happen at FAIR?</b>

<b>This was around 2022 to 2023.</b> <b>Mm. At that time diffusion models had started to take off?</b> <b>Not yet, not right away.</b> <b>This is before ChatGPT, right?</b>

<b>Mm, this is before ChatGPT. So this was around 2022.</b> <b>Before or after Stable Diffusion?</b>

<b>Roughly the same time, approximately the same time. Stable Diffusion was already getting attention then, and that whole community was very active. So at the time I was very curious about diffusion models, and we started exploring.</b> <b>Is the exploration you're describing something you can do freely on your own, without needing to report to anyone?</b>

<b>Yes, this is the freedom of FAIR; that's exactly the freedom I was talking about. At the time, nobody on the team was doing diffusion models at all, so I was the first to start exploring this, and later brought in an intern, Bill Peebles, who is now head of Sora. We started together, but I was the first to start at FAIR</b>

<b>and then brought Bill in later.</b> <b>Mm.</b> <b>So back then I was exploring all kinds of angles, and later we settled on the most important one, which was the DiT direction.</b> <b>Mm.</b> <b>And by the way, let me also mention this:</b>

<b>DiT wasn't the original goal at the very beginning. The original goal was actually exploring the connection between discriminative and generative models. Yes, that was the original question.</b>

<b>Right, and during this exploration we discovered that the DiT direction was more interesting, and we focused on that.</b> <b>OK, then let's not jump there yet; let's continue talking about FAIR. What was life like at FAIR?</b>

<b>what was the culture like?</b>

<b>what was special about FAIR?</b>

<b>Mm. I think the most special thing about FAIR is that it's the most academic-like place inside industry I've ever been. A lot of the culture is actually quite similar to academia. For example, everyone has a very high degree of freedom; you can basically choose what you want to work on.</b> <b>Mm.</b> <b>And at the same time,</b>

<b>you have a lot of resources, beyond what you'd have in academia. So I think FAIR was a very ideal research environment for me at that stage.</b> <b>Mm.</b> <b>But it also had some problems, right? Like you said, later on there were some cultural shifts.</b>

<b>Right. I think around 2022 or 2023, after ChatGPT appeared, FAIR was going through a lot of changes.</b> <b>Mm.</b> <b>You're using such a fancy-sounding term, and you even have to say it in English, which shows how hard these things are to define.</b> <b>It really is a research aesthetic. I think it encompasses everything I've mentioned above,</b>

<b>the specifics of how you do things, all of that is included. But it also involves some higher-level philosophical considerations, like how Kaiming gave me the Diamond Sutra. Because the Diamond Sutra says all conditioned things are like dreams, illusions, bubbles, and shadows,</b>

<b>and one passage also says: all phenomena are illusory; if you see all phenomena as not phenomena, you see the Tathagata.</b> <b>Mm.</b> <b>Taking this a bit further, it's actually quite similar to certain ideas in Western philosophy, like Kant's concept of the thing-in-itself, and then</b>

<b>Schopenhauer's "the world as will and representation." What they're all trying to express, and I don't know much about philosophy, I don't want to sound pretentious, but in my humble understanding, what they're all discussing is that what you see is not the essence of the thing; what you see of the world is not its true substance. So when you're reading a paper,</b>

<b>what matters is to break through the illusion the paper presents to you and question what lies behind it, what substantive essence it actually contains. I think the source of a researcher's taste lies in whether people can truly set aside all these superficial appearances and keep pursuing the path toward truth, keep seeking.</b> <b>Mm.</b>

<b>I think Kaiming does this best. If you think about it from a long-term perspective, the question is: what is the right way to guide how you choose a topic, what kind of things to work on? This also connects to what each step of doing research should involve; I think everything is consistent.</b> <b>Mm.</b> <b>And I think one problem with not having good research taste is that</b>

<b>people might get caught up in these appearances. The appearances might be a paper's acceptance, or the kind of external fame you mentioned, or being able to get something done quickly and receiving momentary praise and adulation.</b> <b>I think for Kaiming,</b>

<b>this is completely outside his world model; he simply doesn't care. But if you ask me to list out research taste as points a, b, c, d...</b>

<b>that becomes pretty hard to articulate</b> <b>this thing</b> <b>because it involves so many things</b> <b>because research itself, as I said</b> <b>is also a creative process</b> <b>it's also a writing process</b> <b>from the writing side, by the way</b> <b>Kaiming is also the person with the strongest writing ability</b> <b>he also strongly encouraged us, saying</b> <b>make sure to start writing early</b> <b>this thing</b> <b>very unfortunately</b> <b>even now</b> <b>at my age</b>

<b>I still can't do it well</b> <b>like Kaiming</b> <b>all his papers</b> <b>were finished a month before the deadline</b> <b>at least that was the case at FAIR</b> <b>mm</b> <b>meaning</b> <b>while everyone else was pulling all-nighters to meet the deadline</b> <b>and then</b> <b>feeling this huge sense of satisfaction</b> <b>Kaiming, you know</b> <b>was like a carefree free spirit</b> <b>having finished everything a month ago</b>

<b>and then polishing it over and over again</b> <b>watching all of you rush to meet your deadlines</b> <b>I, in a very relaxed way</b> <b>have already made this thing perfect</b> <b>he finished everything a month in advance</b> <b>everything done</b> <b>meaning the paper was fully written</b> <b>ah</b> <b>not just the results obtained, but the paper fully written</b> <b>this is already a publishable</b> <b>solid piece of work</b> <b>so</b> <b>that means he had to start writing when</b>

<b>two months before the deadline</b> <b>and he only needed one month to write it</b> <b>no</b> <b>one month is a long time</b> <b>right</b> <b>of course he would keep writing afterward</b> <b>during that month before the deadline</b> <b>he would</b> <b>polish every table</b> <b>every</b> <b>single</b> <b>word</b> <b>every punctuation mark</b> <b>ah</b> <b>for example, this habit</b> <b>also influenced me</b> <b>for instance, I now have this OCD</b> <b>like this kind of</b> <b>how to put it</b> <b>obsession</b>

<b>that also came from my time with Kaiming</b> <b>which is that in your paper</b> <b>no line should be less than 60% filled with text</b> <b>filled -- what does that mean?</b>

<b>meaning if you have a line</b> <b>and more than half of it is empty</b> <b>it doesn't look good</b> <b>you need to fill that line</b> <b>or have it filled roughly</b> <b>sixty to seventy percent</b> <b>then your paper looks more elegant</b> <b>elegant, or uniform</b> <b>oh</b> <b>and now with every paper</b> <b>I always ask all the students</b> <b>right, look carefully</b> <b>if you have some trailing word</b>

<b>if people aren't paying attention</b> <b>you'll end up with a word</b> <b>sitting alone on a line somewhere</b> <b>it looks terrible</b> <b>understood</b> <b>mm</b> <b>and also</b> <b>when Kaiming thinks about this, his view is</b> <b>this paper is not for you to read</b> <b>this paper is for others to read</b> <b>so you need to care about how others experience it</b> <b>mm</b> <b>how can you -- a paper is just a vessel</b>

<b>how do I, through this vessel of knowledge</b> <b>let people get relatively smoothly</b> <b>to the core of what you want to express</b> <b>this communication interface needs to be pleasing to the eye</b> <b>that's a great way to put it, right</b> <b>the communication interface must be pleasing to the eye</b> <b>so you can't let your paper look too bad, right</b> <b>you have to get the details right</b> <b>so all of this</b>

<b>you can consider it a kind of research taste</b> <b>but I think this is</b> <b>actually something more general</b> <b>a kind of aesthetic toward life</b> <b>or toward everything in the universe</b> <b>mm</b> <b>I think these things are all connected</b> <b>right</b> <b>this is also why</b> <b>we care so much about our own papers</b> <b>being as unique as possible</b> <b>having our own distinctiveness</b>

<b>we can have our own webpage design</b> <b>we'll record our own videos</b> <b>record videos</b> <b>but there are many</b> <b>people who wonder why you bother with all this</b> <b>this stuff</b> <b>has nothing to do with research</b> <b>isn't this just a distraction?</b>

<b>why spend extra energy</b> <b>polishing all this</b> <b>are you just doing this for hype and marketing?</b>

<b>ah, I hope people don't think that</b> <b>because I think</b> <b>having your own style</b> <b>is actually very important</b> <b>mm</b> <b>and then</b> <b>this is also why</b> <b>all of our papers use a consistent template</b> <b>we have our own designs</b> <b>and indirectly</b> <b>I also hope to pass on some of my taste</b> <b>again, I can't guarantee it's all good</b> <b>but at least</b> <b>I can discuss it with my students</b>

<b>we can work on this together</b> <b>at least together we can conceptualize</b> <b>think it through together</b> <b>right, I think this, in a broader sense</b> <b>is also part of research taste</b> <b>mm, it contains many very concrete small details</b> <b>an enormous number of details</b> <b>right</b> <b>but I think</b> <b>this is also what makes research interesting</b> <b>I told you yesterday</b>

<b>my childhood dream was actually to become a film director</b> <b>right</b> <b>mm</b> <b>childhood dream</b> <b>no no</b> <b>when did that dream fade?</b>

<b>it faded pretty quickly</b> <b>unfortunately</b> <b>but I still watch a lot of films</b> <b>but I think, eventually, I came to realize</b> <b>the research process and filmmaking process</b> <b>are actually not that different</b> <b>why?</b>

<b>because a film also needs to discover a theme</b> <b>it also involves exploration</b> <b>I have a story I want to tell</b> <b>and it shouldn't be that I just stand at this moment</b> <b>and think oh</b> <b>this is how my story goes</b> <b>and then I just go straight toward the finish</b> <b>it shouldn't work that way either</b> <b>you should also go make the film</b> <b>I think you'd have great intuition</b> <b>right</b> <b>yes, exactly right</b>

<b>the worst films are the ones that just go through the motions</b> <b>I start with A</b> <b>no conflict along the way</b> <b>and arrive at B</b> <b>and then it's over</b> <b>I just</b> <b>play it for you</b> <b>a good film actually is</b> <b>or, why do we say when writing a paper</b> <b>people say</b> <b>they told the story really well</b> <b>even though this might have a bit of a narrative</b> <b>storytelling quality</b> <b>mm</b> <b>film is a storytelling process</b> <b>there's a book</b>

<b>I actually recommended it to students before</b> <b>something I learned from Kaiming</b> <b>is sharing unexpected books with people</b> <b>let me recommend a book</b> <b>it's called Story, by Robert McKee</b> <b>mm</b> <b>this book is a book about screenwriting</b> <b>mm</b> <b>but I think this book</b> <b>actually speaks to a lot of things about research</b> <b>and life</b> <b>there's one thing this book talks about</b> <b>that I think is particularly interesting</b> <b>it talks about</b>

<b>what makes a good story</b> <b>it's not</b> <b>a story with no conflict from beginning to end</b> <b>a good story must be driven by conflict</b> <b>and through conflict you discover</b> <b>the character's true core</b> <b>mm</b> <b>and in research</b>

<b>it's the same thing</b> <b>a good research paper</b> <b>must also set up the conflict</b> <b>and then through conflict</b> <b>you discover the core of this problem</b> <b>and the solution to this problem</b> <b>right</b> <b>so I think this book</b> <b>has a lot of profound insights</b> <b>including about life</b> <b>mm</b> <b>and I think the concept of conflict in the book</b> <b>is actually similar to what I was just talking about</b>

<b>that gradient</b> <b>mm</b> <b>you need enough contrast</b> <b>to let you see the difference</b> <b>right</b> <b>for example</b> <b>if in your experiment</b> <b>you don't have a good enough control group</b> <b>or experimental group</b> <b>your signal will be weak</b> <b>and you won't know the answer</b>

<b>right</b> <b>so having this kind of conflict</b> <b>this gradient</b> <b>is extremely important for research</b> <b>mm</b> <b>I think this is really interesting, thank you</b> <b>so let me ask about another topic</b> <b>which is about your transition from FAIR to NYU</b> <b>right</b> <b>you transitioned from FAIR to NYU around 2023</b> <b>right, to become a professor</b> <b>right</b> <b>can you talk about how this transition happened?</b>

<b>right, so actually</b> <b>I spent a total of five years at FAIR</b> <b>mm</b> <b>and for me this experience at FAIR</b> <b>I think it was the most formative five years</b> <b>of my career</b> <b>so I think I'm extremely grateful</b> <b>and this experience has really shaped</b> <b>who I am today</b> <b>mm</b> <b>but at the same time</b> <b>I always had this desire</b> <b>to someday</b>

<b>run my own lab</b> <b>and take on students</b> <b>because I think this experience</b> <b>the experience of someone guiding you</b> <b>is something I'm very thankful for</b> <b>and I want to pass on</b> <b>what I learned</b> <b>right</b> <b>so after five years at FAIR</b> <b>I decided to make a move</b> <b>and go into academia</b> <b>mm</b> <b>and so I joined NYU</b> <b>mm</b> <b>which by the way, NYU is a very interesting place</b>

<b>why?</b>

<b>because NYU is somewhat unique</b> <b>it's located in New York City</b> <b>in Manhattan</b> <b>mm</b> <b>right, so it's surrounded by a lot of industry</b> <b>which gives you a lot of collaboration opportunities</b> <b>mm</b> <b>and NYU's location in New York</b> <b>there is a relatively strong AI community here in New York</b> <b>right</b>

<b>for example, NYU has Yann LeCun</b> <b>mm</b> <b>who is of course a figure you don't need to introduce</b> <b>mm</b> <b>and NYU also has</b> <b>Kyunghyun Cho</b> <b>who is also a very well-known researcher</b> <b>mm</b> <b>and then there's also this whole community in New York</b> <b>like, for example</b> <b>Google has a large office here in New York</b> <b>Microsoft also has offices here</b>

<b>Morgan Stanley, Goldman Sachs</b> <b>lots of different types of companies</b> <b>mm</b> <b>so I think this is</b> <b>a very unique place</b> <b>where you can combine</b> <b>industry and academia</b> <b>mm</b> <b>right, so actually now when we're talking about</b> <b>is Dumbo a community in New York?</b>

<b>Dumbo is a very interesting place</b> <b>in Brooklyn</b> <b>mm</b> <b>and Dumbo has become one of</b> <b>the more important areas of New York's AI community</b> <b>mm</b> <b>there are a lot of AI startups</b> <b>here in Dumbo</b> <b>for example, some of the more well-known ones</b> <b>like Hugging Face's office is here</b>

<b>mm</b> <b>and then Runway's office is also here</b> <b>mm</b> <b>and then there are many other startups</b> <b>so New York is actually quite vibrant</b> <b>and the reason I chose NYU</b> <b>is partly because of this</b> <b>and also partly because of the people there</b> <b>mm</b> <b>so that's how I ended up at NYU</b> <b>mm</b> <b>right, so then</b> <b>it turns out that the professor role</b> <b>after you actually start doing it</b> <b>is somewhat different from what you imagined</b> <b>right?</b>

<b>mm, I think many aspects are different</b> <b>for example, a professor</b> <b>has to deal with a lot of administrative work</b> <b>right</b> <b>things like grant applications</b> <b>various committee work</b> <b>right</b> <b>also things like</b> <b>things completely unrelated to research</b> <b>right</b> <b>I was quite well protected at FAIR</b> <b>from a lot of this</b>

<b>right</b> <b>but at a university</b> <b>you have to deal with all of it yourself</b> <b>mm</b> <b>so I think this is a very different experience</b> <b>and also</b> <b>advising students</b> <b>is very different from doing research yourself</b> <b>mm</b> <b>because advising students requires</b> <b>not just doing the research</b> <b>but also helping students</b> <b>grow as researchers</b>

<b>right</b> <b>and this is a very different skill set</b> <b>mm</b> <b>so I think</b> <b>transitioning into the professor role</b> <b>was actually a big challenge</b> <b>mm</b> <b>but at the same time, it's very rewarding</b> <b>because you can see your students</b> <b>grow</b> <b>right</b> <b>and I think this is</b> <b>one of the most rewarding things</b> <b>about being a professor</b>

<b>mm</b> <b>I think that's a beautiful thing to say</b> <b>so let me ask</b> <b>about the startup you founded</b> <b>right</b> <b>I heard that you are now a professor at NYU</b> <b>and also a co-founder of a startup</b> <b>right</b> <b>what's the story behind that?</b>

<b>right, so the startup</b> <b>started a bit over a year ago</b> <b>right</b> <b>and the company is called Emu Video</b> <b>no, wait, that's a product</b> <b>[laughter]</b> <b>it's called Oasis</b> <b>mm</b> <b>so what does Oasis do?</b>

<b>right, so Oasis is focused on</b> <b>AI-generated video</b> <b>mm</b> <b>and specifically</b> <b>a game that is generated by AI in real time</b> <b>mm</b> <b>so the original idea</b> <b>was inspired by</b> <b>the DiT work</b> <b>and also by Sora</b> <b>mm</b> <b>and we thought</b> <b>this technology</b> <b>can be applied to games</b>

<b>mm</b> <b>right, because games are actually</b> <b>an extremely good use case for this kind of technology</b> <b>mm</b> <b>because games</b> <b>require very fast frame generation</b> <b>right</b> <b>and at the same time</b> <b>games require a lot of interactivity</b> <b>right</b> <b>so these two things together</b>

<b>make games a very interesting application</b> <b>mm</b> <b>this thing</b> <b>can be applied to many, many different papers</b> <b>no matter what your topic is</b> <b>right, so I think this is also very interesting</b> <b>mm</b> <b>and then later</b> <b>we could maybe talk about</b> <b>DiT, right</b> <b>but this paper also</b> <b>this paper</b> <b>was again one of those</b> <b>that brings us to NYU?</b>

<b>no no</b> <b>no, this one is also</b> <b>also FAIR</b> <b>it was the last piece of work at FAIR</b> <b>oh</b> <b>and then at that time FAIR was already starting to have some</b> <b>culture shift</b> <b>because at that point ChatGPT had just come out</b> <b>OpenAI and then DeepMind were also doing very well</b> <b>OpenAI, as an emerging</b> <b>research force</b> <b>mm</b> <b>had actually done things</b>

<b>that nobody at FAIR dared to even dream of</b> <b>uh</b> <b>and even if they had dreamed it, they couldn't have done it</b> <b>right, so everyone started thinking</b> <b>what went wrong with this organizational model</b> <b>does there need to be a major overhaul</b> <b>there had already been many</b> <b>reorganizations</b> <b>this was also a trigger</b> <b>why</b> <b>I felt by then it no longer made sense</b> <b>for me to keep staying at FAIR</b> <b>things were already starting to decline</b> <b>well, not exactly decline</b> <b>just that</b>

<b>everyone's focus was no longer on research</b> <b>people would</b> <b>have these meetings that lasted several hours</b> <b>research alignment meetings</b> <b>coordination meetings</b> <b>alignment meetings</b> <b>alignment meetings</b> <b>and the only topic of these meetings was</b> <b>what exactly should we be doing</b> <b>but these meetings</b>

<b>went on for</b> <b>several weeks</b> <b>and still no conclusion</b> <b>because nobody knew what they wanted to do</b> <b>because this is completely counter to what I just described</b> <b>the normal</b> <b>bottom-up logic of research</b> <b>mm right</b> <b>now it had become</b> <b>let's all sit together</b> <b>and discuss what research project</b> <b>we should do over the next one or two years</b> <b>in my view</b> <b>or in Kaiming's view</b> <b>or in the minds of many researchers</b>

<b>this looks completely anti-research</b> <b>right</b> <b>so at that time it had a lot of effect on us</b> <b>for example, at the time I</b> <b>was working on DiT</b> <b>Diffusion was also just getting started</b> <b>nobody yet</b> <b>not a single person at FAIR</b> <b>was doing Diffusion Model research</b> <b>but I thought, hey</b> <b>this thing seems really interesting</b> <b>I think I should give it a try</b> <b>and then Bill Peebles</b>

<b>was an intern I recruited at the time</b> <b>mm</b> <b>and he's now head of Sora</b> <b>and also the main character in Sora's various generated videos</b> <b>mm right</b> <b>he's an extremely sharp person</b> <b>or</b> <b>in my view</b> <b>what I'd call a perfect PhD student</b> <b>in all directions, uh</b> <b>at least a well-rounded, all-around student</b> <b>right, but anyway</b> <b>our starting point back then</b> <b>was not to do Diffusion Model research</b> <b>nor to do DiT</b>

<b>in the first two months of exploration</b> <b>it was entirely focused on representation learning</b> <b>that is, we wanted to look at</b> <b>the representation a Diffusion Model learns</b> <b>how it compares to what a normal Supervised Learning</b> <b>or rather</b> <b>a Self-supervised Learning model learns</b> <b>what the differences are</b> <b>actually</b> <b>there was a lot of follow-up work in this direction</b> <b>but what we started doing</b> <b>after working on it for a while, the feeling was</b>

<b>this thing is okay</b> <b>just so-so</b> <b>a generative model can learn a decent representation</b> <b>but this representation</b> <b>was much, much worse</b> <b>than the representation from self-supervised learning</b> <b>mm</b> <b>completely not competitive, right</b> <b>so we gave up on that</b> <b>but in the process</b> <b>in the final month</b> <b>we discovered something</b> <b>hey</b> <b>the premise being</b> <b>that to do this</b>

<b>we needed to compare at the representation level</b> <b>against, say, ViT-based systems</b> <b>to make a fair comparison</b> <b>so that was why</b> <b>we didn't use a U-Net</b> <b>but instead used ViT for this Diffusion Model</b> <b>that was the starting point, right</b> <b>and then we found out, hey</b> <b>from the representation angle</b> <b>this doesn't seem to add much value</b> <b>but it seems like our new architecture</b> <b>is indeed more efficient</b>

<b>and indeed more scalable</b> <b>more stable than U-Net</b> <b>and from a code perspective</b> <b>I care a lot about these things</b> <b>from the code perspective</b> <b>what I call Minimum Description Length (MDL)</b> <b>your code is actually quite important</b> <b>it can reflect some things</b> <b>if your code is short</b> <b>and can achieve the same purpose</b> <b>then your method will typically be better than one that</b>

<b>requires thousands of lines of code</b> <b>an extremely complex system</b> <b>even if it can do the same thing</b> <b>but the former</b> <b>this more elegant solution</b> <b>the simpler solution is always better</b> <b>I think this is also a kind of research taste in a sense</b> <b>so we found, hey</b> <b>this thing is both simple and it works</b> <b>and scalable</b> <b>and efficient</b> <b>so it seems like this thing</b> <b>is the direction we should be pursuing</b>

<b>so, in that final month</b> <b>we went to work on this</b> <b>mm</b> <b>and at that point we were competing for a lot of resources</b> <b>people said</b> <b>why are you working on this?</b>

<b>we need to consolidate resources now</b> <b>and we need to do something more meaningful</b> <b>a bigger project</b> <b>what exactly</b> <b>nobody knew</b> <b>so we needed these alignment</b> <b>meetings to discuss it</b> <b>but</b> <b>at least Diffusion Models</b> <b>wouldn't be an important part of that critical path</b> <b>not a</b> <b>key member on that critical path</b> <b>right</b> <b>so there was a lot of opposition</b>

<b>but I felt I could see</b> <b>that this is actually something very important</b> <b>because, from an architecture standpoint</b> <b>I've been doing architecture work for so long</b> <b>I think this is the future of Diffusion architectures</b> <b>right, it's not just the Diffusion Model itself</b> <b>as I said, the data, the overall architecture</b> <b>and the objective</b> <b>are all very important</b> <b>right, but on the architecture side</b> <b>this is an indispensable piece</b> <b>so this is why</b>

<b>in the last month we pushed in this direction</b> <b>and the results were very good in the end</b> <b>and we were able to show</b> <b>this really great</b> <b>scaling behavior</b> <b>and we submitted the paper to CVPR</b> <b>and we were all very happy</b> <b>and then the paper got rejected</b> <b>mm</b> <b>right, LeCun apparently tweeted about this</b> <b>yes</b> <b>saying not enough novelty</b> <b>sure, you did this thing</b> <b>uh right</b> <b>but you don't have long stretches of math</b>

<b>you don't have a long complex structure</b> <b>you came up with a very simple structure</b> <b>and even though you got good results</b> <b>the reviewers weren't convinced</b> <b>mm right</b> <b>this is another lesson</b> <b>but by that point</b> <b>I had actually started to come around</b> <b>I realized</b> <b>this whole thing about research papers</b> <b>in this huge random process</b> <b>whether you get accepted or not</b> <b>doesn't matter at all</b>

<b>so we then submitted to another conference</b> <b>didn't change a thing</b> <b>and it got accepted as an Oral Paper</b> <b>mm, which proves once again</b> <b>this is a completely random process</b> <b>but what happened afterward was more interesting</b> <b>after getting this paper</b> <b>I realized</b> <b>in every dimension</b> <b>this was better than a U-Net based system</b> <b>why not just use this</b> <b>right, you've unified the underlying logic</b> <b>at least on the architecture side, unified the logic</b>

<b>you can share a lot of infrastructure</b> <b>it's so efficient</b> <b>results are good and scalable</b> <b>you can build even larger models</b> <b>so we thought</b> <b>this thing</b> <b>once this paper is out, there will definitely be a lot of attention</b> <b>which, by the way</b> <b>there was indeed a lot of attention</b> <b>lots of people discussing it on Twitter</b> <b>but we found, hey</b> <b>nobody was actually using it for anything</b> <b>oh</b> <b>and then we started talking to people</b>

<b>like we reached out to the Stable Diffusion folks</b> <b>by the way, I think Stable Diffusion</b> <b>LDM is also one of</b> <b>what I'd call those twenty-something foundational papers</b> <b>one of them</b> <b>and I talked to some people there</b> <b>and then</b> <b>we also talked to some other big companies</b> <b>by then I was at school</b> <b>at that time -- this paper had just</b> <b>landed right at the end of my time at FAIR</b>

<b>and the beginning of my time at NYU</b> <b>oh, so both affiliations were listed?</b>

<b>well</b> <b>right, right -- actually, no</b> <b>actually only NYU was listed</b> <b>and Berkeley</b> <b>because FAIR didn't let us list their name</b> <b>why?</b>

<b>because first, they felt this paper, it's OK</b> <b>it's a paper. second, you had already left</b> <b>so don't list our name</b> <b>mm, so then after this paper</b> <b>a lot of people started using DiT</b> <b>right</b> <b>and then we found that Sora used DiT as the backbone</b> <b>right</b> <b>which was a huge affirmation</b> <b>mm</b> <b>because at the time the Sora paper</b> <b>mentioned DiT by name</b>

<b>yes</b> <b>right, so this was something we were very proud of</b> <b>mm</b> <b>and then, later</b> <b>a lot of other models</b> <b>also started using DiT</b> <b>mm</b> <b>yes, basically all the main video generation models now</b> <b>use DiT as the backbone</b> <b>mm</b> <b>so I think this was a very important paper</b> <b>mm</b> <b>right, so then</b> <b>let's talk about the startup</b>

<b>right</b> <b>so why start a company?</b>

<b>right</b> <b>I think for me</b> <b>the main motivation was</b> <b>I wanted to see</b> <b>whether this technology</b> <b>that I had been working on for so many years</b> <b>could have real impact</b> <b>mm</b> <b>because in academia</b> <b>you write papers</b> <b>and other people read your papers</b> <b>and they may use your ideas</b> <b>but you never really get to see</b> <b>the end-to-end impact</b> <b>mm</b>

<b>right, so I wanted to</b> <b>take this technology all the way</b> <b>to building a product</b> <b>mm</b> <b>and also</b> <b>I think</b> <b>that games are a very interesting application</b> <b>mm</b> <b>because games are one of the few places</b> <b>where both high visual quality</b> <b>and very low latency</b> <b>are required at the same time</b> <b>mm</b> <b>and this is actually a very hard technical problem</b>

<b>right</b> <b>so we thought</b> <b>if we can solve this problem</b> <b>for games</b> <b>then the technology will be applicable</b> <b>to a much wider range of use cases</b> <b>mm</b> <b>right, and also</b> <b>games are a massive market</b> <b>right</b> <b>so there's a lot of commercial potential as well</b> <b>mm</b> <b>right, so that's kind of the story</b>

<b>behind starting the company</b> <b>mm</b> <b>so what has the journey been like</b> <b>since you started the company?</b>

<b>mm</b> <b>I think</b> <b>building a company is very different from doing research</b> <b>mm</b> <b>for many reasons</b> <b>right</b> <b>one is that in a company</b> <b>you have to think about</b> <b>the product</b> <b>and users</b> <b>mm</b> <b>which is not something you think about in research</b> <b>right</b> <b>and two is that</b> <b>in a company you have to think about</b> <b>the business model</b> <b>and how to sustain the business</b>

<b>mm</b> <b>right, which is also not something</b> <b>you think about in research</b> <b>right</b> <b>and three is that</b> <b>building a team is very different</b> <b>from advising students</b> <b>mm</b> <b>because in a company</b> <b>you're hiring professionals</b> <b>who have different skills and backgrounds</b> <b>mm</b> <b>and you have to think about</b> <b>how to align everyone</b> <b>toward a common goal</b> <b>mm</b>

<b>which is quite different from</b> <b>advising PhD students</b> <b>mm</b> <b>right</b> <b>so I think building a company</b> <b>has been a real learning experience</b> <b>mm</b> <b>and I've learned a lot from it</b> <b>mm</b> <b>right, and the product you mentioned</b> <b>Oasis</b> <b>has gotten quite a lot of attention</b> <b>right?</b>

<b>yes, I think Oasis got quite a lot of attention</b> <b>mm</b> <b>when it was first released</b> <b>mm</b> <b>and the demo got a lot of</b> <b>views and discussion</b> <b>mm</b> <b>right</b> <b>and what's the current status of the company?</b>

<b>right</b> <b>we're still pretty early</b> <b>mm</b> <b>we're building out the technology</b> <b>and the product</b> <b>mm</b> <b>and we're also thinking about</b> <b>the go-to-market strategy</b> <b>mm</b> <b>right, I think</b> <b>the vision is very clear</b> <b>mm</b> <b>but the execution is always</b> <b>the hard part</b> <b>mm</b>

<b>right, so we're still working on it</b> <b>mm</b> <b>I think that's very relatable</b> <b>so</b> <b>let me ask</b> <b>about your thoughts on</b> <b>the current AI landscape</b> <b>mm</b> <b>what do you think</b> <b>are the most important</b> <b>open problems right now?</b>

<b>mm</b> <b>I think there are many</b> <b>mm</b> <b>but one thing that I think is particularly interesting</b> <b>is the question of</b> <b>how do you build AI systems</b> <b>that can reason</b> <b>and plan</b> <b>mm</b> <b>right, because current systems</b> <b>like LLMs</b> <b>are very good at pattern matching</b>

<b>mm</b> <b>but they struggle with</b> <b>systematic reasoning</b> <b>mm</b> <b>right, so I think this is a very important</b> <b>open problem</b> <b>mm</b> <b>and another one is</b> <b>how do you make AI systems</b> <b>more efficient</b> <b>mm</b> <b>right, because current systems are</b> <b>very computationally expensive</b> <b>mm</b> <b>and this limits their deployment</b> <b>mm</b>

<b>right</b> <b>so I think efficiency is a very important problem</b> <b>mm</b> <b>and then there's also</b> <b>the question of alignment</b> <b>mm</b> <b>right, how do you make sure</b> <b>that these systems</b> <b>do what you want them to do</b> <b>mm</b> <b>right, so these are all very important open problems</b> <b>mm</b> <b>right</b> <b>and where do you see things going</b> <b>in the next five years?</b>

<b>mm</b> <b>I think</b> <b>the next five years will be</b> <b>very exciting</b> <b>mm</b> <b>I think we'll see</b> <b>a lot of progress</b> <b>on the reasoning side</b> <b>mm</b> <b>and I think we'll also see</b> <b>AI systems being deployed</b>

<b>in many more real-world applications</b> <b>mm</b> <b>right, because the technology is</b> <b>getting good enough</b> <b>mm</b> <b>and the cost is coming down</b> <b>mm</b> <b>so I think we'll see</b> <b>a lot more real-world impact</b> <b>mm</b> <b>right</b> <b>and what about</b> <b>on the video generation side specifically?</b>

<b>mm</b> <b>I think video generation will</b> <b>continue to improve very rapidly</b> <b>mm</b> <b>and I think</b> <b>the quality will get</b> <b>to the point where</b> <b>it's indistinguishable from real video</b> <b>mm</b> <b>in the next year or two</b> <b>mm</b> <b>right</b> <b>what it means is</b> <b>a possible random event like this</b> <b>a kind of black swan event</b> <b>or some kind of shock</b>

<b>a kind of, uh</b> <b>this kind of event that takes you by surprise</b> <b>if, for an organization</b> <b>or for a person</b> <b>or for a project</b> <b>the gains from it outweigh the losses</b> <b>then that organization</b> <b>is what's called antifragile</b> <b>mm</b> <b>so this concept I think is very interesting</b> <b>right</b> <b>because normally when we think about</b> <b>risk management</b>

<b>we think about</b> <b>how to avoid risk</b> <b>right</b> <b>but the antifragile concept says</b> <b>no, you should actually seek out certain kinds of risk</b> <b>or rather, certain kinds of volatility</b> <b>mm</b> <b>because these</b> <b>can make you stronger</b> <b>mm</b> <b>right</b> <b>and I think this applies very well</b> <b>to research</b> <b>mm</b> <b>because in research</b>

<b>you're constantly facing uncertainty</b> <b>mm</b> <b>and you need to be antifragile</b> <b>right</b> <b>meaning that when things don't work out</b> <b>you should actually learn from that</b> <b>and become stronger</b> <b>mm</b> <b>right, and I think this is</b> <b>a very important mindset</b> <b>mm</b> <b>and I think Kaiming embodies this very well</b> <b>mm</b> <b>because when things don't work out</b> <b>he doesn't get discouraged</b>

<b>mm</b> <b>he just tries something different</b> <b>mm</b> <b>right</b> <b>and I think this is</b> <b>a very important trait</b> <b>for a researcher</b> <b>mm</b> <b>right</b> <b>so is there anything else</b>

<b>you want to share</b> <b>before we wrap up?</b>

<b>mm</b> <b>I think</b> <b>one thing I'd like to say is</b> <b>to young people who want to do research</b> <b>or start a company</b> <b>mm</b> <b>I think</b> <b>the most important thing is</b> <b>to find something you're genuinely passionate about</b>

<b>mm</b> <b>because research and startups are both</b> <b>very long journeys</b> <b>mm</b> <b>and there will be a lot of hardship along the way</b> <b>mm</b> <b>and if you don't have genuine passion</b> <b>it's very hard to keep going</b> <b>mm</b> <b>right</b> <b>and also</b>

<b>I think</b> <b>finding good mentors</b> <b>and good collaborators</b> <b>is extremely important</b> <b>mm</b> <b>because, as I've been saying throughout</b> <b>a lot of what I've learned</b> <b>came from the people around me</b> <b>mm</b> <b>and so</b> <b>surrounding yourself with</b> <b>great people</b> <b>is one of the most important things you can do</b> <b>mm</b> <b>right</b>

<b>that's really great advice</b> <b>thank you so much</b> <b>this has been a wonderful conversation</b> <b>thank you</b> <b>yeah, thank you too</b> <b>alright</b> <b>so let's talk about</b> <b>your view on the AI landscape right now</b> <b>mm</b> <b>especially in New York</b> <b>right</b> <b>what are some of the interesting things</b>

<b>happening here?</b>

<b>mm</b> <b>I think New York</b> <b>is becoming a more and more important</b> <b>AI hub</b> <b>mm</b> <b>right, there's a lot of talent here</b> <b>mm</b>

<b>and a lot of interesting companies</b> <b>mm</b> <b>and I think</b> <b>New York has a unique advantage</b> <b>in that it's a very diverse city</b> <b>mm</b> <b>and this diversity</b> <b>can lead to</b> <b>very interesting collaborations</b> <b>mm</b> <b>between AI and</b> <b>other industries</b> <b>mm</b> <b>like finance</b>

<b>media</b> <b>fashion</b> <b>healthcare</b> <b>mm</b> <b>all of these are</b> <b>very well represented in New York</b> <b>mm</b> <b>so I think</b> <b>New York is going to play</b> <b>an increasingly important role</b> <b>in the AI landscape</b> <b>mm</b>

<b>right</b> <b>and what about</b> <b>comparing New York to</b> <b>Silicon Valley?</b>

<b>mm</b> <b>I think</b> <b>Silicon Valley is still</b> <b>the center of the AI world</b> <b>mm</b> <b>right</b> <b>but New York is</b> <b>growing fast</b> <b>mm</b> <b>and I think</b>

<b>New York has a different kind of energy</b> <b>mm</b> <b>right, it's more</b> <b>multi-disciplinary</b> <b>mm</b> <b>and I think that's</b> <b>actually very good for AI</b> <b>mm</b> <b>because AI is ultimately</b> <b>going to touch every industry</b> <b>mm</b> <b>so having this cross-disciplinary</b> <b>environment</b>

<b>is very valuable</b> <b>mm</b> <b>right</b> <b>that's really interesting</b> <b>so</b> <b>let me ask one more question</b> <b>which is</b> <b>if you were advising</b> <b>a young researcher</b> <b>who wanted to make an impact</b> <b>in AI</b> <b>mm</b> <b>what would you tell them?</b>

<b>mm</b> <b>I think</b> <b>first and foremost</b> <b>work on problems</b> <b>that you genuinely care about</b> <b>mm</b> <b>right, because your passion</b> <b>will drive you</b> <b>through the hard times</b> <b>mm</b>

<b>and second</b> <b>be willing to</b> <b>work hard on the fundamentals</b> <b>mm</b> <b>right, don't skip the basics</b> <b>mm</b> <b>because the fundamentals</b> <b>are what give you the tools</b> <b>to solve hard problems</b> <b>mm</b> <b>and third</b> <b>find good mentors</b> <b>and collaborate with great people</b> <b>mm</b> <b>right, as I said</b>

<b>a lot of what I've learned</b> <b>came from the people around me</b> <b>mm</b> <b>and so</b> <b>the people you surround yourself with</b> <b>will have a huge impact</b> <b>on your own growth</b> <b>mm</b> <b>right</b> <b>thank you so much</b> <b>this has been really insightful</b> <b>mm</b>

<b>I think</b> <b>we've covered a lot of ground today</b> <b>mm</b> <b>right</b> <b>from your early research</b> <b>all the way to</b> <b>starting a company</b> <b>mm</b> <b>and your thoughts on</b> <b>the AI landscape</b> <b>mm</b> <b>so thank you so much</b> <b>for being here today</b>

<b>thank you</b> <b>it was great talking to you</b> <b>yeah likewise</b> <b>alright</b> <b>so that wraps up</b> <b>our conversation today</b> <b>mm</b> <b>I hope you all found it</b> <b>as interesting as I did</b> <b>mm</b> <b>right</b> <b>and please</b> <b>subscribe to the channel</b>

<b>and leave a comment</b> <b>if you have any thoughts</b> <b>mm</b> <b>right</b> <b>see you next time</b> <b>bye</b> <b>academia right now is in a really difficult position</b> <b>right</b> <b>why</b> <b>mainly because, first</b> <b>not enough resources</b> <b>let me give a simple example</b> <b>for instance, when we apply for funding</b> <b>the U.S. funding system</b>

<b>I might be going off on a tangent here</b> <b>but the U.S. funding system</b>

<b>over the past few decades</b> <b>has barely grown at all</b> <b>even with high inflation, right</b> <b>everything has become more expensive</b> <b>tuition fees have also gone up a lot</b> <b>but government grants</b> <b>as well as the kind of proposal programs</b> <b>that companies offer</b> <b>the funded projects</b> <b>are still maintained at a very low level</b> <b>so on average</b>

<b>a body like NSF</b> <b>a U.S. government agency</b>

<b>can give each individual PI</b> <b>a total of</b> <b>about $500,000 in funding</b> <b>over five years</b> <b>so about $100,000 a year</b> <b>right, and then a lot of companies</b> <b>have actually cut back a lot</b> <b>again because of ChatGPT</b> <b>because the era of LLMs has arrived</b> <b>and everyone has gradually started to pull back</b> <b>we can talk more about this later</b>

<b>but in any case, there are fewer and fewer</b> <b>opportunities from industry</b> <b>for this kind of sponsorship</b> <b>and once in a while</b> <b>if there's some kind of funding opportunity</b> <b>they'll typically give you</b> <b>maybe $100,000 to $150,000</b> <b>as a one-time lump-sum grant</b> <b>but you know</b> <b>there are probably 100 schools</b> <b>100 professors or even more</b> <b>competing at the same time for that $100,000</b> <b>what can you do with $100,000?</b>

<b>you can fund one student for one year</b> <b>as tuition</b> <b>what else?</b>

<b>you can buy half an H100 server, a small cluster</b> <b>mm</b> <b>maybe 3 to 4 GPUs</b> <b>so you really can't get much done with that</b> <b>and of course, this isn't just</b> <b>me venting</b> <b>all of us</b> <b>so-called</b> <b>junior faculty in the U.S.</b>

<b>are living in quite difficult conditions</b> <b>everyone has to find their own way</b> <b>to get different resources</b> <b>so this is also why</b> <b>it's a bit like a startup</b> <b>you're in a very resource-constrained situation</b> <b>and you have to find resources from different places</b> <b>you have to fundraise, right?</b>
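As a rough sanity check on the budget numbers above, the arithmetic can be sketched as follows. This is a minimal illustration using the interview's own figures; the ~$30K per-GPU price is my assumption, not a number from the interview:

```python
# Back-of-envelope academic-budget arithmetic from the interview's figures.
# The ~$30K H100 street price is an assumed value for illustration only.
nsf_grant_total = 500_000            # typical 5-year NSF award per PI (interview's figure)
years = 5
annual_budget = nsf_grant_total / years  # roughly $100K/year, as stated

one_time_industry_gift = 100_000     # typical one-off industry grant (interview's figure)
h100_price = 30_000                  # assumed per-GPU cost
gpus_affordable = one_time_industry_gift // h100_price  # matches the "3 to 4 GPUs" estimate

print(f"annual NSF budget per PI: ${annual_budget:,.0f}")
print(f"H100s a one-time gift buys: {gpus_affordable}")
```

The point of the arithmetic is simply that a typical grant covers either one funded student or a handful of GPUs, not both.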

<b>Xiaojun</b> <b>this is a business interview show</b> <b>I said I'm not commercial at all</b> <b>but actually in some ways</b> <b>there might still be some similarities</b> <b>including with people at Google</b> <b>I had a collaborator at Google</b> <b>and he's quite unusual</b> <b>he never goes into the office</b> <b>and he said, hey</b> <b>we could have a chat</b> <b>and I said, sure</b> <b>let me come chat</b> <b>I flew to the Bay Area to see him</b>

<b>and he said we could talk</b> <b>but not in an office</b> <b>let's go on a trail</b> <b>hiking on the trail next to Google's campus</b> <b>mm, go hiking</b> <b>mm, talk while hiking</b> <b>mm, so in the middle of summer</b> <b>I hiked with him for an hour</b> <b>and I told him about</b> <b>the infrastructure work we'd been doing on TPUs</b> <b>these contributions</b> <b>and also why building this</b>

<b>longer-term collaborative</b> <b>partnership</b> <b>this kind of relationship</b> <b>would be good for Google</b> <b>and good for us</b> <b>right, so I thought</b> <b>hey, isn't this just like a fundraising process?</b>

<b>so in the end</b> <b>it became a kind of alms-seeking</b> <b>a process of seeking alms</b> <b>right, right</b> <b>indeed, because</b> <b>this kind of sponsorship actually asks for nothing in return</b> <b>right, so I'm very grateful to Google</b> <b>but anyway</b> <b>I think who I should be even more grateful to is</b> <b>my students</b> <b>and they, bit by bit</b> <b>overcame many, many obstacles</b> <b>I have several students</b> <b>like</b> <b>Peter Tong</b> <b>Boyang Zheng</b>

<b>Shusheng Yang</b> <b>and many others</b> <b>and they all made very significant contributions on TPUs</b> <b>mm</b> <b>right, and good</b> <b>so that's the background</b> <b>meaning we now have some GPUs to work with</b> <b>and now</b> <b>we can work on things that are a bit more</b> <b>closely related to large models</b> <b>so this is why I started working on</b> <b>the Cambrian project</b> <b>right uh</b> <b>and of course</b> <b>all of these narratives</b> <b>these stories</b>

<b>are still completely rooted in my</b> <b>logic from all these years</b> <b>which is, uh</b> <b>first, representation is extremely important</b> <b>second, regardless of whether you're solving</b> <b>a standard computer vision task</b> <b>or we're now in</b> <b>the era of multimodal large models</b> <b>and solving these problems through VQA</b> <b>I think all of these are like</b>

<b>right, and underneath it all</b> <b>there's still something substantive</b> <b>that we need to think through</b> <b>right, and this part</b> <b>anyway, about language and vision</b> <b>we can talk about that later</b> <b>and then</b> <b>we later also had a paper called Cambrian-S</b> <b>this paper goes even further</b> <b>we're not just doing image-level VQA tasks</b> <b>we want to also involve video</b> <b>to deal with video</b>

<b>right and this thing</b> <b>actually the real reason I genuinely wanted</b> <b>to work on this</b> <b>goes back to films again</b> <b>and also has to do with</b> <b>two Chinese directors I like</b> <b>quite a lot</b> <b>director Jia, you know</b> <b>Jia Zhangke and Bi Gan</b> <b>both very well-known Chinese directors</b>

<b>right, Bi Gan's Kaili Blues extensively uses</b> <b>long takes</b> <b>and this made me think, okay</b> <b>while to him it's a visual tool</b> <b>for humans, this is also</b> <b>a very important medium</b> <b>for visual understanding</b> <b>because, what is a long take?</b>

<b>life itself is one long take</b> <b>our eyes are our camera</b> <b>mm</b> <b>we are constantly</b> <b>doing all kinds of things in this world</b> <b>right, and the things we see</b> <b>the medium is video</b> <b>it's all video</b> <b>right</b> <b>but</b> <b>we can see the pixels in this video</b> <b>and everything behind them</b> <b>we can reason about causality</b> <b>we can perceive space</b> <b>right</b>

<b>and Jia Zhangke said something I</b> <b>deeply agreed with</b> <b>he told me this in New York</b> <b>he said what makes film so interesting</b> <b>is that if you just look at the timeline</b> <b>it's a linear timeline</b> <b>but at every point on this timeline</b> <b>you need a space to extend its time</b> <b>like we're talking right now</b> <b>even though it seems like a static frame</b>

<b>but imagine you had a long take</b> <b>or rather</b> <b>you're on the streets of New York right now</b> <b>under the bridge in Dumbo</b> <b>right</b> <b>what you see is still frame after frame</b> <b>mm right</b> <b>but what those frames represent behind them</b> <b>is the state of the world</b> <b>the global information of the entire space</b>

<b>this thing completely transcends</b> <b>what a single lens encodes</b> <b>in each individual, isolated frame</b> <b>I think this makes a lot of sense</b> <b>so this is what made me think</b> <b>we still need to work on video going forward</b> <b>even if video is hard to work with</b> <b>even if video requires handling massive amounts of data</b> <b>we still have to do it</b> <b>so with Cambrian-S</b> <b>that's what we're doing</b>

<b>and this work is a bit like a position paper</b> <b>a position paper is, how should I put it</b> <b>you might translate it as an opinion paper</b> <b>meaning</b> <b>I want to put forward a viewpoint</b> <b>so in that paper</b> <b>we discuss the concept of super sensing</b> <b>the concept of hyper-perception</b> <b>and it's also a paper about data</b>

<b>a paper about architecture</b> <b>and a paper about spatial intelligence</b> <b>so Professor Fei-Fei also gave us</b> <b>a lot of invaluable advice</b> <b>mm-hmm</b> <b>but the core idea is we want to define a paradigm</b> <b>for where multimodal AI should go from here</b> <b>right, and then</b> <b>so</b> <b>if you look at this problem step by step</b> <b>meaning we</b>

<b>this may be an imperfect analogy</b> <b>but you can draw a parallel with autonomous driving</b> <b>you might have an L0 system</b> <b>a system with nothing at all</b> <b>it's basically an old language model</b> <b>it can't perceive the world at all</b> <b>it has none of this visual knowledge</b> <b>it can't see images</b> <b>it can't see videos either</b> <b>right</b> <b>but it can, through language</b> <b>like Plato's Cave allegory</b>

<b>indirectly understand the world</b> <b>that's fine</b> <b>we call it L0</b> <b>L1 is the current multimodal system</b> <b>with slightly better capabilities</b> <b>it's capable of what you'd call show and tell</b> <b>meaning you show it something</b> <b>and then it can tell you</b> <b>some answers about what you showed it</b> <b>right, you ask it a question</b> <b>and it gives you an answer</b> <b>this might be L1</b> <b>then L2, I think, is</b>

<b>what I call streaming event cognition</b> <b>meaning now this thing</b> <b>doesn't just look at a static image</b> <b>you'd have a continuous, streamable</b> <b>visual stream like this</b> <b>a visual stream</b> <b>your intelligent system</b> <b>needs to be able to understand this visual stream</b> <b>and be able to process this visual stream</b> <b>and also be able to answer questions</b> <b>be able to understand</b> <b>what's happened</b>

<b>right, and then the next stage</b> <b>uh, I call it spatial cognition</b> <b>meaning this is about</b> <b>what I was just saying</b> <b>which is that you</b> <b>at every point in this temporal sequence</b> <b>how to see beyond the present moment</b> <b>to what's really behind it</b> <b>the space behind these pixels</b> <b>right</b> <b>this is a very deep, very unique</b> <b>human ability</b> <b>and ultimately</b> <b>actually um</b>

<b>I think the endgame is</b> <b>we need a predictive world model</b> <b>yes, some kind of predictive world model</b> <b>this is what can tell you</b> <b>everything about the real world you observe</b> <b>yes, I think</b> <b>what I want to convey through this paper is</b> <b>we're building a staircase</b> <b>step by step</b> <b>leading toward a future with a world model</b> <b>mm-hmm</b>
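The capability ladder he describes, from language-only models up to a predictive world model, can be jotted down as a small taxonomy. This is a minimal illustrative sketch: the level names follow the interview, but the enum itself and the numbering past L2 are my extrapolation, not notation from the Cambrian-S paper:

```python
from enum import IntEnum

class MultimodalLevel(IntEnum):
    """Capability ladder from the interview's autonomous-driving analogy."""
    L0_LANGUAGE_ONLY = 0      # text-only LLM: knows the world only indirectly, through language
    L1_SHOW_AND_TELL = 1      # static-image VQA: show it something, it tells you an answer
    L2_STREAMING_EVENTS = 2   # understands and answers questions over a continuous visual stream
    L3_SPATIAL_COGNITION = 3  # infers the space behind the pixels at every point in time
    L4_WORLD_MODEL = 4        # predictive model of everything it observes in the real world

# in this framing, each level strictly subsumes the ones below it
assert MultimodalLevel.L2_STREAMING_EVENTS > MultimodalLevel.L1_SHOW_AND_TELL
```

The ordering is the point: the paper's "staircase" claim is that you cannot reach the world-model endgame without first passing through streaming and spatial cognition.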

<b>um, although we may not know</b> <b>exactly how to define this world model</b> <b>at least in this paper</b> <b>we won't attempt to do that definitional work</b> <b>but we can identify</b> <b>which capabilities are absolutely necessary</b> <b>yes, so that's the core of this paper</b> <b>and this paper</b> <b>um, we also filmed a short video</b> <b>which I also posted on Twitter</b>

<b>some students</b> <b>we didn't spend any money</b> <b>it wasn't for promotion</b> <b>just some students with cameras</b> <b>filming on the streets of New York</b> <b>um, unfortunately we weren't able to</b> <b>shoot a Bi Gan-style long take</b> <b>but</b> <b>filming as we walked</b> <b>it was a love letter to New York, I suppose</b> <b>and then</b> <b>but a lot of people didn't understand</b> <b>saying why are you filming this</b> <b>does this have anything to do with your paper</b>

<b>mm-hmm</b> <b>I said of course it does</b> <b>our paper itself is about</b> <b>an intelligent agent living in the real world</b> <b>how it can ingest this continuous</b> <b>visual stream signal</b> <b>and</b> <b>be able to perceive what's happening in the world</b> <b>it might be moved by certain things</b> <b>right</b> <b>be surprised</b> <b>feel astonished</b> <b>but most of the time</b>

<b>its brain will have some kind of</b> <b>spontaneously operating world model</b> <b>guiding everyone to be themselves</b> <b>guiding everyone to live in this world</b> <b>yes, I think</b> <b>this paper is actually quite interesting</b> <b>because I had never done this kind of work before</b> <b>kind of like</b> <b>wanting to set an agenda</b> <b>defining the problem like this</b> <b>so I also hope to learn more from Professor Fei-Fei</b>

<b>Professor Fei-Fei often talks about the North Star, right</b> <b>so the question I've always been asking is</b> <b>what exactly is the North Star of vision</b> <b>mm-hmm, what exactly is that question</b> <b>and how should we solve it</b> <b>yes, so that's this paper</b> <b>did you find the answer</b> <b>um, I couldn't find the answer</b> <b>if I'd found the answer I wouldn't be sitting here</b> <b>I think this is an ultimate question</b> <b>mm-hmm</b>

<b>I don't think this is just a computer vision problem</b> <b>or rather, what I actually want to say is</b> <b>the term computer vision</b> <b>is itself very interesting</b> <b>it's called vision</b> <b>and vision has a double meaning</b> <b>it's a very ambiguous word</b> <b>vision refers to both your eyesight</b> <b>and your foresight about the future</b> <b>right, when you say someone has great vision</b> <b>meaning they have a grand vision</b> <b>they're a visionary, yes</b> <b>um, so I think computer vision</b>

<b>actually</b> <b>let me put it this way</b> <b>I can say I am someone who</b> <b>works in computer vision</b> <b>yes, but computer vision in my definition</b> <b>is a perspective</b> <b>it's not a specific task</b> <b>it's not even a</b> <b>specific field</b> <b>it's a perspective</b> <b>meaning it's a point of view</b> <b>yes, or rather it is</b>

<b>something quite fundamental to intelligence</b> <b>it's a collection of problems</b> <b>that intelligence must solve</b> <b>right, let me be more specific</b> <b>so what is vision</b> <b>or what problems does vision address</b> <b>mm-hmm</b> <b>I may not be able to articulate it clearly</b> <b>let me think</b> <b>um,</b> <b>first, the signals it handles are in continuous space</b> <b>high-dimensional, noisy signals</b>

<b>mm-hmm</b> <b>right, these are the problems computer vision needs to solve</b> <b>it's not about writing lots of text on paper</b> <b>we need to evolve some kind of intelligence</b> <b>that doesn't avoid this problem</b> <b>the domain it addresses</b> <b>is completely different from language</b> <b>right</b> <b>continuous, high-dimensional, noisy signals</b> <b>these are the problems vision needs to solve</b> <b>second, from the very first day of doing vision</b>

<b>from the first paper I just mentioned</b> <b>starting from DSN or HED</b> <b>I already knew</b> <b>or rather I had this kind of bet</b> <b>that for vision</b> <b>the most important thing</b> <b>is to learn this kind of hierarchical representation</b> <b>this is extremely important</b> <b>if your representation lacks hierarchy</b> <b>you won't be able to solve</b>

<b>many, many problems in this world</b> <b>the hierarchical process is a process of abstraction</b> <b>and abstraction</b> <b>is what's called generalization</b> <b>this is also very different from a language model</b> <b>because a language model</b> <b>operates purely in the semantic space</b> <b>when thinking about this problem</b> <b>so</b> <b>there are of course other characteristics</b> <b>for example, I say vision as a perspective, um</b>

<b>for example, I think it's also</b> <b>this kind of large-scale parallelization</b> <b>we can see many, many things at once</b> <b>many areas of our brain's cortex are firing</b> <b>right, and then</b> <b>we're processing in parallel</b> <b>many different objects</b> <b>and their</b> <b>causal patterns</b> <b>and their physical changes</b> <b>these things are happening at different times</b> <b>and in different spaces</b> <b>all simultaneously</b>

<b>and we have a way</b> <b>to capture all these changes</b> <b>I think this thing</b> <b>is also an important characteristic of vision</b> <b>um</b> <b>and finally, there may be one more, which is some kind of</b> <b>um</b> <b>I'm not sure how to define this thing</b> <b>some kind of feature sharing</b> <b>what this means is</b> <b>for example, I look at</b> <b>the semantic part of this matter</b>

<b>or the real understanding part</b> <b>may be a bit more</b> <b>that is to say</b> <b>I now see a dog drawn by a child</b> <b>and a cartoon dog in an animation</b> <b>and a real dog running around in the real world</b> <b>right, and then</b> <b>how do I connect all these different visual</b> <b>entities together, right</b> <b>building this kind of abstract cognition</b> <b>saying, hey, they're all dogs, right</b>

<b>even though they're vastly different</b> <b>um</b> <b>from a data perspective, you know</b> <b>they're so far apart</b> <b>not a single pixel is comparable</b> <b>so what I want to say is, um</b> <b>vision may have even more problems to solve</b> <b>I actually haven't thought carefully about this</b> <b>yes, anyway it'll have some common characteristics like these</b> <b>right, hierarchical structure</b> <b>and this kind of continuous domain modeling, um</b>

<b>continuous domain modeling</b> <b>and also this kind of</b> <b>large-scale parallelism and large-scale sharing</b> <b>I think these things</b> <b>are all part of an intelligent agent</b> <b>this thing</b> <b>cannot simply be reduced to</b> <b>just a computer vision system</b> <b>solving a small subset of problems</b> <b>mm-hmm</b> <b>so that's why I think</b> <b>computer vision</b> <b>I think</b>

<b>I think although fewer and fewer people are working on</b> <b>this direction</b> <b>fewer and fewer students are applying to this area</b> <b>when undergraduates</b> <b>are choosing a direction</b> <b>they're increasingly unwilling to choose</b> <b>something called computer vision</b> <b>um, and then</b> <b>when faculty are hiring, too</b> <b>we're probably increasingly less likely to</b>

<b>hire a professor doing pure computer vision</b> <b>but I think</b> <b>if you consider computer vision</b> <b>as a perspective</b> <b>I think it's the essence of intelligence</b> <b>look at the past few years</b> <b>after ChatGPT arrived</b> <b>CV previously</b> <b>occupied a very central position in artificial intelligence</b>

<b>of course, this happened after you entered the field</b> <b>um, in recent years LLMs have risen</b> <b>CV has been pushed back to a more marginal position</b> <b>in this process</b> <b>do you think people like you feel discouraged</b> <b>um</b> <b>I don't feel the least bit discouraged</b> <b>I think, as I said</b> <b>I should be grateful for LLMs</b> <b>yes, without LLMs</b> <b>vision couldn't have expanded into the truly</b>

<b>large scope of multimodal intelligence it has now</b> <b>from the perspective of vision's development history</b> <b>there are actually two axes</b> <b>you can draw them</b> <b>the first axis goes back to the very beginning</b> <b>at the earliest stage</b> <b>the things computer vision needed to handle</b> <b>were always the most singular</b> <b>most concrete and simplest tasks</b> <b>like MNIST digit recognition, right</b>

<b>1234, I need to</b> <b>determine which digit it is</b> <b>and then later there were some small datasets</b> <b>like CIFAR</b> <b>a 32×32-pixel</b> <b>ten-class classification problem</b> <b>is it a cat or a dog</b> <b>is it a car or an airplane</b> <b>and then later</b> <b>datasets like ImageNet appeared</b> <b>and it became classification</b> <b>at the 256×256 level</b> <b>right</b>

<b>um, but at those times</b> <b>things were relatively controllable</b> <b>and then later</b> <b>came detection and segmentation</b> <b>these more structured, compositional</b> <b>cognitive processes</b> <b>and then, right</b> <b>if this axis continues to advance, it leads to</b> <b>the rise of multimodal large models</b> <b>because of the introduction of multimodality</b> <b>we can easily abandon many</b>

<b>of these specific</b> <b>relatively rigid</b> <b>task designs</b> <b>and now I can take an image</b> <b>and ask all kinds of questions</b> <b>language as a great interface</b> <b>can help you solve many, many problems</b> <b>right, so you can see over this time</b> <b>um, this axis</b>

<b>goes from simple to complex tasks</b> <b>such an axis</b> <b>but also an axis where language starts</b> <b>gradually entering computer vision</b> <b>so then</b> <b>there are two issues here</b> <b>the first is that after language entered vision</b> <b>it brought us enormous benefits</b> <b>allowing us to freely define problems</b> <b>we can ask anything</b> <b>and we can get any answer</b> <b>mm-hmm</b>

<b>but the second important risk is</b> <b>language's involvement has led to</b> <b>your dependence on language also increasing</b> <b>mm-hmm</b> <b>so many so-called multimodal cases</b> <b>these tasks are actually unrelated to vision</b> <b>they're purely a language problem</b> <b>mm-hmm</b> <b>from this perspective</b> <b>um, of course I think, yes</b> <b>vision seems to have become marginalized</b>

<b>mm-hmm right</b> <b>but of course I don't feel discouraged</b> <b>I see it as an enormous opportunity</b> <b>because in the end</b> <b>if the problems you're solving now</b> <b>are relatively simple</b> <b>then it doesn't matter</b> <b>problems you can solve with language</b> <b>just use language to solve them</b> <b>right um</b> <b>even though I can't do so-called grounding</b> <b>meaning when you describe a red apple to me</b> <b>I can't know</b> <b>what exactly red is</b>

<b>or what exactly an apple is</b> <b>but somehow through statistical information</b> <b>in language</b> <b>I can still complete some decision-making tasks</b> <b>no one can fault you for this</b> <b>I think that's fine</b> <b>but the huge hidden opportunity is</b> <b>when the day truly comes</b> <b>that we need to deal with the real world</b> <b>real tasks</b> <b>to build some kind of real intelligence</b> <b>ah</b> <b>then this currently imperfect</b>

<b>visual representation</b> <b>will be a major deficiency</b> <b>so Yann LeCun's view is</b> <b>everyone right now is just using a crutch</b> <b>that crutch being the language model itself</b> <b>right, and even though you can walk</b> <b>and you'd think</b> <b>hey, I'm walking pretty well</b> <b>you probably can't run</b> <b>and you can't compete in the Olympics</b> <b>right, because one of your legs</b> <b>the so-called leg of visual representation</b>

<b>is still not good enough</b> <b>why do you call it real intelligence</b> <b>why isn't an LLM real intelligence</b> <b>because I think</b> <b>an LLM is virtual intelligence</b> <b>but our intelligence</b> <b>so-called intellect</b> <b>isn't that also virtual</b> <b>oh, I think the word virtual may not be right</b> <b>what I define as real</b> <b>is something that has to interact with the real world</b> <b>yes, what does that mean</b>

<b>meaning look</b> <b>the problems that LLMs can solve well now</b> <b>mostly still occur in the digital space</b> <b>mm-hmm</b> <b>mm-hmm, for example</b> <b>um, it can memorize</b> <b>all this factual knowledge</b> <b>it can know</b> <b>right, we can put all</b> <b>these Wikipedia articles</b> <b>all in there</b> <b>and it can tell us everything we want to know</b> <b>it can serve as a very good legal advisor</b> <b>it can</b>

<b>even help summarize knowledge</b> <b>and do education</b> <b>do teaching</b> <b>a lot of these things</b> <b>right, and I think LLMs</b> <b>um, are of course revolutionary</b> <b>but this is different from the problems that vision</b> <b>as a perspective needs to solve</b> <b>they're actually completely different domains</b> <b>meaning</b> <b>if what you need to handle is continuous</b>

<b>high-dimensional space</b> <b>in this kind of noisy domain</b> <b>then things like, for example, robots</b> <b>these domains aren't just robots</b> <b>by the way, robots are one good example</b> <b>I'll get to that in a moment</b> <b>ah, these things are very hard to tokenize</b> <b>they've already left this virtual space</b> <b>left this digital space</b> <b>right, what kind of tasks does this involve</b> <b>you're absolutely right</b> <b>I think robots are</b>

<b>there will also be many</b> <b>industrial applications, right</b> <b>industrial process control</b> <b>meaning</b> <b>everything that involves modeling</b> <b>sensory signals</b> <b>from many different kinds of sensors</b> <b>right, sensors that perceive what's happening</b> <b>in this world</b> <b>and you now need a unified algorithm</b> <b>to model this environment</b> <b>this system</b> <b>so that you can then</b>

<b>perform an action or intervention</b> <b>meaning that when you</b> <b>take an action or make an intervention</b> <b>you're able to predict</b> <b>how this system</b> <b>will change next</b> <b>this is very hard for LLMs to do</b> <b>mm-hmm</b> <b>and you're absolutely right about that</b> <b>I think from my perspective, there are actually two extremes</b> <b>one extreme is LLMs, um</b> <b>very good at operating in the digital space</b>

<b>doing many, many things</b> <b>and also very good at</b> <b>using coding as an interface</b> <b>right, through agents</b> <b>to intervene in our physical lives</b> <b>um, this will also happen</b> <b>and that's fine</b> <b>but ultimately it's still based on discrete tokens</b> <b>these one-by-one discrete positions</b> <b>ah, on the far right is robotics</b> <b>and this robotics must be</b>

<b>truly general-purpose robotics</b> <b>meaning it can generalize</b> <b>to a degree</b> <b>such that it can do everything a human can do</b> <b>mm-hmm, it has its own decision-making system</b> <b>and its own brain</b> <b>mm-hmm, and these, I feel, are the two extremes now</b> <b>right, and then</b> <b>and how from LLMs</b>

<b>step by step it extends to Robotics</b> <b>I think this is what computer vision</b> <b>or, in the new era,</b> <b>visual intelligence needs to solve</b> <b>right</b> <b>and then</b> <b>I think this is also the future of multimodal</b> <b>mm-hmm</b> <b>because obviously, robotics still doesn't work now</b> <b>and I often tell students</b>

<b>or people around me</b> <b>actually um</b> <b>the thing I most want to achieve</b> <b>is to solve the robotics problem</b> <b>without doing robotics</b> <b>why is that</b> <b>mm-hmm, because you think</b> <b>the robotics approach can't solve the robotics problem</b> <b>not exactly</b> <b>it's because I think robotics is advancing very quickly</b> <b>right</b> <b>now Unitree's robots have even been on the Spring Festival Gala</b> <b>yes I think</b>

<b>I find it all rather jaw-dropping</b> <b>but on the other hand</b> <b>I think</b> <b>there still needs to be someone focused on the pre-training part</b> <b>which is what's called the robot brain</b> <b>what exactly it is</b> <b>mm-hmm</b> <b>or how this brain includes your visual system</b> <b>right, in the control part</b> <b>in the hardware part</b> <b>this part also means</b> <b>brothers climbing the mountain, each making their own effort</b> <b>I don't think I need to</b>

<b>intervene in hardware too early</b> <b>and do those things</b> <b>right</b> <b>I think there are fundamental research problems now</b> <b>that haven't been solved at the software level</b> <b>haven't been solved in building this brain</b> <b>we need to focus first on solving this part</b> <b>of course many people will argue</b> <b>you have to have</b> <b>something like a closed loop</b> <b>you need some kind of collaborative approach</b> <b>you need to validate on your robots</b> <b>otherwise</b> <b>if you build some algorithm now</b>

<b>some model may not be useful</b> <b>mm-hmm</b> <b>I fully agree with that</b> <b>but I think</b> <b>this can be done through some kind of partnership</b> <b>yes, I just don't want to</b> <b>buy this</b> <b>I also don't have the money</b> <b>I can't afford that many robots</b> <b>robots also have their own hardware scaling</b> <b>by the way</b> <b>you need to buy many robots</b> <b>to do hardware well</b> <b>mm-hmm</b> <b>yes, I want to focus on the brain part</b>

<b>and I think this</b> <b>is a problem that computer vision needs to solve</b> <b>a problem that representation learning needs to solve</b> <b>and also</b> <b>I think ultimately the problem that a world model needs to solve</b> <b>look at Kaiming, he started thinking about this so early</b> <b>wanting bigger, bigger, bigger</b> <b>mm-hmm</b> <b>why</b> <b>why did LLM Scaling Laws come so much earlier than CV</b> <b>um, good question</b>

<b>yes, I think first of all we can't say that much earlier</b> <b>because CV currently doesn't have a Scaling Law</b> <b>right, and actually before</b> <b>we were all pretty desperate</b> <b>I said, oh no</b> <b>this vision</b> <b>how come it still doesn't have a Scaling Law</b> <b>now maybe it's alright</b> <b>now for example these video diffusion models</b> <b>have some Scaling Behavior</b> <b>what's called Scaling</b> <b>is that you can consume the data</b> <b>yes, and then you can</b>

<b>you can get better results</b> <b>right</b> <b>or rather</b> <b>this more formal characterization</b> <b>meaning your Scaling Behavior</b> <b>meaning if you now have a Transformer system</b> <b>then I now satisfy this</b> <b>ratio like C=6ND</b> <b>meaning your</b> <b>your compute is basically equal to 6 times</b> <b>your tokens times your</b> <b>number of parameters</b>
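The C=6ND rule of thumb he cites can be sketched numerically (this is the standard back-of-the-envelope approximation for transformer training compute, not anything specific to this conversation):

```python
# Sketch of the C ≈ 6·N·D rule of thumb for transformer training compute:
# roughly 2 FLOPs per parameter per token for the forward pass and 4 for
# the backward pass, so ~6·N·D in total.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute C in FLOPs."""
    return 6.0 * n_params * n_tokens

# Example: a 7e9-parameter model trained on 2e12 tokens needs
# roughly 8.4e22 FLOPs under this approximation.
```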

<b>and I want</b> <b>I want to use this</b> <b>formal definition to make this point</b> <b>because I now think</b> <b>more and more that vision doesn't need a Scaling Law</b> <b>oh, why is that</b> <b>because again</b> <b>what vision cares about</b> <b>is completely different from what language cares about</b> <b>it's not a radical claim</b> <b>but it is a viewpoint</b> <b>a long-held view</b> <b>and many people doing NLP</b> <b>actually agree with this view</b>

<b>that is, a language model</b> <b>is actually not a self-supervised learning process</b> <b>it's actually a strongly</b> <b>supervised learning process</b> <b>it depends on how you look at it</b> <b>what does supervised or unsupervised mean</b> <b>yes, the logic here is as follows</b> <b>generally speaking</b> <b>we say whether you have external annotations</b> <b>external labels</b> <b>this determines whether you are self-supervised</b> <b>or</b>

<b>or strongly supervised learning</b> <b>right, but language is such a special case</b> <b>what is language</b> <b>language</b> <b>is what humans over the past few thousand years of civilization</b> <b>through continuous evolution</b> <b>whether in a sociological sense</b> <b>or in each individual person's sense</b> <b>and processed</b> <b>everything about this world</b>

<b>and stored it in a tokenized form</b> <b>storing it down</b> <b>and we happened to have something called the internet</b> <b>and we uploaded this knowledge</b> <b>all to the internet</b> <b>so for all LLM researchers</b> <b>this is for free</b> <b>but something being free doesn't mean it has no labels</b> <b>then one question is</b> <b>suppose we didn't have the internet</b> <b>then if you wanted to train language models now</b> <b>could you still do it</b>

<b>put books in</b> <b>yes</b> <b>or suppose you had no books</b> <b>right yes</b> <b>exactly, this kind of</b> <b>knowledge upload</b> <b>this thing</b> <b>is itself a process of supervision construction</b> <b>right</b> <b>so this is different from vision</b> <b>so it's somewhat like language</b> <b>um, wanting to solve problems</b> <b>always staying in this target y space</b> <b>as we usually say</b> <b>you have a mapping from x to y</b>

<b>that's all machine learning</b> <b>you can through some</b> <b>regardless of where x and y are</b> <b>you can define the problem this way anyway</b> <b>and y is usually what people call supervision</b> <b>is the label, and x is your data</b> <b>right</b> <b>you can think of this</b> <b>language model as</b> <b>actually only characterizing things in the y space</b> <b>mm-hmm</b>

<b>mm-hmm, but this is true</b> <b>going back to the earlier question</b> <b>meaning this is actually insufficient to represent</b> <b>the totality of this world</b> <b>there are many things</b> <b>that you can't describe and characterize</b> <b>through language</b> <b>or rather this is both the advantage of language</b> <b>and also language</b> <b>may eventually, as I said, gradually fade</b> <b>or rather</b>

<b>LLM won't be the foundation of the entire world model</b> <b>that's one reason</b> <b>the reason is</b> <b>its advantage is</b> <b>you don't need to do anything</b> <b>to achieve some kind of alignment with humans</b> <b>because every sentence and every word you write</b> <b>is written by humans</b> <b>is written by humans</b> <b>mm-hmm right</b> <b>when you write this down</b> <b>what is language</b> <b>language is a communication tool</b> <b>language is not a</b> <b>thinking map</b>

<b>language is not even a decision-making tool</b> <b>it's a form of communication</b> <b>it's actually a communication tool</b> <b>mm-hmm</b> <b>so if it is a communication tool</b> <b>you always have to make some trade-offs</b> <b>you always have to sacrifice something</b> <b>so, ah, and then I think</b> <b>I think, um</b> <b>what I mainly want to say is yes</b>

<b>as a communication tool</b> <b>it aligns well with humans</b> <b>but on the other hand</b> <b>it has also lost a lot</b> <b>which it originally</b> <b>as an intelligent system</b> <b>should be modeling</b> <b>mm-hmm right</b> <b>for example, right now</b> <b>I have a cup of water</b> <b>I have a cup that fell on the ground and broke</b> <b>this is actually a linguistic abstraction</b> <b>the reason we say it this way</b> <b>is because this is the</b>

<b>most suitable thing for our communication</b> <b>we only care about the outcome and state of things</b> <b>right</b> <b>we don't care how a cup fell to the ground</b> <b>and how exactly it broke</b> <b>right, which physical</b> <b>laws it obeyed</b> <b>the dynamics behind it</b> <b>what exactly they are</b> <b>yes, so what exactly are its dynamics</b> <b>we don't care about these things</b> <b>right</b> <b>so I think this is also a limitation of it</b> <b>mm-hmm</b> <b>LLM people would complain that</b>

<b>after adding vision</b> <b>it might affect their intelligence</b> <b>ah why really</b> <b>yes, he hopes, um</b> <b>like Yang Zhilin, saying adding multimodal</b> <b>they hope it won't be a dumb multimodal</b> <b>ah yes</b> <b>I agree</b> <b>of course you shouldn't use a dumb multimodal</b> <b>but I think if you don't add vision</b> <b>you'll definitely be dumb</b> <b>and, but I think</b> <b>the fundamental issue is</b>

<b>how to define smart and dumb</b> <b>yes, it's about intelligence</b> <b>the definition of intelligence is different</b> <b>the definition of intelligence is different</b> <b>and or rather</b> <b>how exactly to define</b> <b>what is a simple task</b> <b>what is a difficult task</b> <b>mm-hmm</b> <b>over the past few decades</b> <b>all these AI researchers</b> <b>would continuously encounter</b> <b>this so-called Moravec's paradox</b>

<b>this Moravec's paradox</b> <b>what this paradox says is</b> <b>things that are easy for machines</b> <b>or um</b> <b>the easy problem is hard</b> <b>the hard problem is easy</b> <b>this is a paradox</b> <b>meaning things that are easy for machines</b> <b>are actually hard for humans</b> <b>and things that are hard for machines</b> <b>are actually easy for humans</b> <b>you seem to have several works at NYU</b> <b>um right</b>

<b>I think starting with V*</b> <b>um, V* is actually just one piece of work</b> <b>I think it's quite interesting</b> <b>could you talk about it</b> <b>because we were the first to think about</b> <b>wanting to build in a multimodal system</b> <b>a system two</b> <b>what's called</b> <b>that can</b> <b>do scaling at test time</b> <b>such a model</b> <b>meaning we</b> <b>when we look at the world around us</b> <b>for example I want to ask you a question now</b> <b>right</b>

<b>for example</b> <b>like something around you</b> <b>there's a trash can nearby</b> <b>what color is it</b> <b>you won't, like a language model,</b> <b>directly tell me an answer</b> <b>you'll definitely first think</b> <b>where is this trash can</b> <b>you might turn around and look</b> <b>discover</b> <b>there's a refrigerator over there</b> <b>maybe the trash can is next to the refrigerator</b> <b>then you'd localize this object</b> <b>and find this object</b> <b>right, and then tell me an answer</b> <b>so you have this visual reasoning here</b>

<b>right, some kind of visual reasoning here</b> <b>and then</b> <b>this thing</b> <b>it's entirely a behavior in a reasoning process</b> <b>right, and then</b> <b>and then this thing</b> <b>we built such a system back then</b> <b>and this is also</b> <b>um,</b> <b>for example, before o1</b> <b>a very long time</b> <b>yes, at least a few months</b> <b>and we started doing this</b> <b>mm-hmm right</b> <b>at that time this kind of test time scaling</b> <b>was not a buzzword at all</b>

<b>nobody had been talking about this</b> <b>okay right</b> <b>and I think this is worth talking about</b> <b>because for me</b> <b>it's actually an inspiration</b> <b>I think it's both</b> <b>I think it's a bittersweet</b> <b>kind of lesson</b> <b>meaning it</b> <b>the bitter part is</b> <b>let me first tell you what happened</b> <b>after we had this paper</b> <b>we had our own benchmark</b> <b>and then we found</b>

<b>meaning</b> <b>I have two friends</b> <b>Alex Kirillov</b> <b>who's also the author of SAM</b> <b>and Bowen Cheng</b> <b>both of them work at OpenAI</b> <b>mm-hmm so</b> <b>I talked with them for a long time</b> <b>we told them</b> <b>what our work had done</b> <b>our benchmark is here now</b> <b>you can try it out</b> <b>and I also discussed</b> <b>some of the logic behind it</b> <b>right meaning</b> <b>how you can do this kind of visual thinking</b> <b>and later</b>

<b>Alex and Bowen drove this project at OpenAI</b> <b>drove this project</b> <b>this project is called Thinking with Images</b> <b>and later, maybe over a year later</b> <b>right, and then this product launched</b> <b>mm-hmm, and after this product launched it was called</b> <b>Thinking with Images</b> <b>and inside, many examples or their benchmarks</b> <b>were actually the benchmarks from our paper</b> <b>oh</b> <b>so</b> <b>what makes me very happy about it is</b> <b>this is the first time</b>

<b>I thought, hey</b> <b>we can actually find a way</b> <b>to truly take a different path</b> <b>this can somehow</b> <b>inspire researchers at OpenAI</b> <b>to improve their own models</b> <b>mm-hmm</b> <b>I think this at least makes me feel</b> <b>there are things to do in academia</b> <b>mm-hmm</b> <b>but on the other hand</b> <b>um, it's also rather bitter</b> <b>because</b>

<b>you see, at that time OpenAI, right</b> <b>at the time of Sora</b> <b>why people were able to accept DiT</b> <b>was also because DiT</b> <b>um</b> <b>would be cited in Sora's blog post</b> <b>or Bill's name being on it</b> <b>letting people find this logic</b> <b>and the clues behind it</b> <b>mm-hmm right</b> <b>but unfortunately</b> <b>I think, gradually</b> <b>in recent years</b> <b>industrial research labs</b>

<b>have become increasingly closed</b> <b>so at first everyone published papers</b> <b>later people couldn't publish papers anymore</b> <b>you could write some blog posts</b> <b>you could add acknowledgments</b> <b>and also list the names of each team member</b> <b>and further on</b> <b>you could publish a blog post</b> <b>but there could no longer be author credits</b> <b>only</b> <b>OpenAI team or Gemini team</b> <b>that's it</b> <b>so I think this</b> <b>mm-hmm</b> <b>will lead to, I don't know</b>

<b>whether the next, originally healthy</b> <b>kind of exchange between academia and industry</b> <b>those channels</b> <b>will be cut off</b> <b>mm-hmm right</b> <b>doing research</b> <b>is fundamentally a labor of love</b> <b>we explore these questions</b> <b>not really because</b> <b>it can deliver some product</b> <b>or earn how much money</b> <b>but on the other hand, um</b> <b>some kind of credit assignment</b>

<b>meaning letting everyone know who did what</b> <b>I think this is something that over the past few decades</b> <b>has supported academia's ability to move forward</b> <b>a mechanism</b> <b>but now</b> <b>this mechanism is gradually being</b> <b>eroded by LLMs</b> <b>this generation of models</b> <b>and the organizational structures behind this generation of models</b> <b>I think gradually broke it</b> <b>it's become commercial competition</b> <b>it has become a form of commercial competition</b> <b>mm-hmm yes</b>

<b>right, and then</b> <b>let me quickly conclude</b> <b>I think there are two more</b> <b>I want to briefly mention</b> <b>this paper, that is</b> <b>this REPA</b> <b>this thing is called representation alignment</b> <b>look, there's another keyword: representation</b> <b>so</b> <b>that's why I really like this paper</b> <b>but this paper also</b> <b>went through such a long time</b> <b>and all these past works</b>

<b>combined in a strange way</b> <b>formed a kind of chemical reaction</b> <b>mm-hmm, and then</b> <b>opening up, at least</b> <b>a small research domain</b> <b>and what it does is quite simple</b> <b>it's essentially</b> <b>a Deeply Supervised Net</b> <b>meaning a model you have now</b> <b>doesn't only have a diffusion loss at the top</b> <b>which is your final objective</b> <b>you also pull out some other things in the middle</b> <b>these objectives</b> <b>you can have other objectives</b>

<b>the objective we used is</b> <b>I want to make a Diffusion Model</b> <b>which is a generative model</b> <b>by the way</b> <b>have its internal representation</b> <b>able to align with an external self-supervised</b> <b>model's representation</b> <b>to align together</b> <b>mm-hmm</b> <b>here</b> <b>again, what's being said is</b> <b>representation is the most important thing</b> <b>not only for systems like Cambrian 1</b> <b>for doing multimodal understanding is it important</b>
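The REPA idea described here, an ordinary diffusion loss at the top plus an intermediate alignment term, can be sketched as follows (an illustrative NumPy reconstruction; `repa_style_loss` and the 0.5 weight are assumptions for the sketch, not the paper's actual code):

```python
import numpy as np

def repa_style_loss(diffusion_loss, hidden, ssl_feats, weight=0.5):
    """Sketch of a REPA-style objective: keep the usual diffusion loss at the
    output, and add a deeply supervised term that pulls an intermediate
    activation toward a frozen self-supervised encoder's features
    (negative cosine similarity). `weight` is an illustrative coefficient."""
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    s = ssl_feats / np.linalg.norm(ssl_feats, axis=-1, keepdims=True)
    align = -np.mean(np.sum(h * s, axis=-1))  # -cosine similarity, in [-1, 1]
    return diffusion_loss + weight * align
```

Perfect alignment drives the auxiliary term to its minimum, leaving the diffusion loss to dominate; misaligned features add a penalty.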

<b>it's important for a generative model</b> <b>generating images</b> <b>generating videos too</b> <b>yes so</b> <b>this thing</b> <b>I think it's something for me</b> <b>quite a big inspiration</b> <b>but this hasn't been done thoroughly yet</b> <b>meaning</b> <b>why do we need to use</b> <b>this kind of Deeply Supervised approach</b> <b>such an indirect way to do alignment</b> <b>ah</b> <b>what if</b> <b>can we directly use this powerful</b> <b>representation</b>

<b>as a</b> <b>encoder for your generative model</b> <b>or as its foundation</b> <b>mm-hmm right</b> <b>and this thing took another step forward</b> <b>we also got very good results</b> <b>this paper is called Representation Autoencoder</b> <b>yes, it also involves representation</b> <b>and autoencoder</b> <b>but anyway</b> <b>in this</b> <b>the logic in this thing</b> <b>I think</b>

<b>again I don't want to talk too much about this paper's details</b> <b>but I think there's one thing</b> <b>Professor Ma Yi (founding director of the Institute of Data Science at HKU), when I visited Hong Kong</b> <b>I think what he said was absolutely right</b> <b>he said</b> <b>a student would ask, hey</b> <b>you're doing this right</b> <b>your autoencoder</b> <b>your representation layer will now become very high-dimensional</b> <b>because it's a representation now</b> <b>it's not the original</b> <b>simple pixel-level representation</b>

<b>nor is it a low-dimensional</b> <b>VAE-type representation</b> <b>it's a high-dimensional representation</b> <b>you want to do</b> <b>denoising and image generation on this high-dimensional representation</b> <b>this is actually a very difficult thing</b> <b>right, and a student asked at the time</b> <b>this dimension is too high</b> <b>it might not necessarily be a good thing</b> <b>and then</b> <b>it might make our learning system more complex</b> <b>or make training harder</b>

<b>first of all, our results</b> <b>show completely the opposite conclusion</b> <b>but Professor Ma Yi got very excited</b> <b>he stood up and said</b> <b>I want to sincerely tell everyone</b> <b>you must not be afraid of high dimensions</b> <b>high dimensionality is</b> <b>an extremely important cornerstone in all of machine learning</b> <b>um including</b> <b>whether in previous</b> <b>so-called kernel learning methods</b>

<b>kernel methods</b> <b>or why in a Transformer</b> <b>we need to have an Up Projection Layer</b> <b>right, you need to have a</b> <b>low-dimensional vector coming in</b> <b>and then turning it into a</b> <b>4 times larger, 4 times wider</b> <b>Fully Connected layer</b> <b>and then</b> <b>all these things</b> <b>are all telling us the following fact</b> <b>that in a high-dimensional space</b>

<b>many problems</b> <b>that couldn't be solved in low-dimensional space</b> <b>can now be solved</b> <b>many problems</b> <b>many types of information that didn't exist in low-dimensional space</b> <b>can now exist</b> <b>and you'll also have better efficiency</b> <b>ah</b> <b>this is</b> <b>this is traditional machine learning theory</b> <b>why you need to do</b> <b>after increasing dimensions</b> <b>making things</b> <b>making your data points linearly separable</b> <b>all the same logic</b> <b>but I feel very encouraged</b>

<b>in that you should not be afraid of high dimensions</b> <b>I think these are very good words</b> <b>because many times people feel afraid</b> <b>right</b> <b>feel afraid</b> <b>not just high-dimensional representation</b> <b>this thing</b> <b>but also afraid of escaping from some current local optimum</b> <b>meaning right now</b> <b>many things we've done before</b> <b>were all done to jump out of this local optimum</b> <b>mm-hmm</b>
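Ma Yi's point about not fearing high dimensions is the classic lifting argument: data that no linear boundary can separate in a low dimension often becomes linearly separable after a simple feature map. A minimal sketch with the textbook XOR example (using ±1 input encoding; the function names are illustrative):

```python
def lift(x1: float, x2: float):
    """Feature map (x1, x2) -> (x1, x2, x1 * x2): one extra dimension."""
    return (x1, x2, x1 * x2)

def xor_classify(x1: float, x2: float) -> int:
    """XOR (inputs encoded as +1/-1) is not linearly separable in 2-D,
    but in the lifted 3-D space a single sign test on the product
    coordinate separates the two classes."""
    _, _, z = lift(x1, x2)
    return 1 if z < 0 else 0
```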

<b>like VAE</b> <b>is the current era's</b> <b>local optimum</b> <b>we hope to use a representation learning approach</b> <b>to link everything together</b> <b>and this thing</b> <b>is actually a very natural thing</b> <b>and then</b> <b>now many people are also working on related papers</b> <b>there are many contemporaneous works</b> <b>all also very good</b> <b>but on the other hand</b> <b>this is also a not-so-natural thing</b> <b>because you need to break out of the existing framework</b>

<b>to do something new</b> <b>yes, but when you can jump out of this local optimum</b> <b>and do something new</b> <b>I think you</b> <b>you'll feel like your world has opened up</b> <b>because RAE for us</b> <b>or for my research</b> <b>I think it's still a fairly important work</b> <b>because it tells me something</b> <b>or allows me to make a bet</b> <b>or predict a future</b> <b>what that future is</b> <b>or whether it's right or wrong</b>

<b>we can look again in a few years</b> <b>so this thing is also related to language</b> <b>and also to Diffusion Models</b> <b>like the recently popular Seedance</b> <b>and Sora</b> <b>mm-hmm</b> <b>my current bet is</b> <b>there's only one thing in this world</b> <b>that is important</b> <b>which is how to learn</b> <b>to learn this representation</b> <b>this is important</b> <b>when you have a good enough representation</b>

<b>handling other problems on top of it is simple</b> <b>your Language Model</b> <b>will gradually degrade to a simple</b> <b>communication interface</b> <b>unlike now</b> <b>all this multimodal intelligence</b> <b>is driven by large language models</b> <b>your representation layer only provides some simple</b> <b>a little bit of context</b> <b>right</b> <b>most of the so-called heavy lifting</b> <b>the dirty and heavy work</b>

<b>is all done by large language models</b> <b>mm-hmm</b> <b>the bet I want to make is</b> <b>the future won't be like this</b> <b>in the future you'll have a great foundation</b> <b>mm-hmm</b> <b>it's a great representation</b> <b>but it's also a great world model</b> <b>mm-hmm, and then</b> <b>what does this world model mean</b> <b>we can talk more about this</b> <b>but this foundation itself</b> <b>may not be a checkpoint</b> <b>it might be neural modules</b>

<b>connected together, multiple components</b> <b>forming a cognitive architecture</b> <b>wow, that sounds quite complex</b> <b>but essentially it's your brain</b> <b>it has different areas handling different things</b> <b>right</b> <b>the language, LLM layer</b> <b>will gradually become</b> <b>an interface to</b> <b>your essential representation</b> <b>or rather</b> <b>to the foundation of your world model</b> <b>mm-hmm</b> <b>it's still very important</b> <b>it will never disappear</b>

<b>because humans need a Large Language Model</b> <b>to</b> <b>ask questions</b> <b>and answer questions</b> <b>right</b> <b>to communicate with it</b> <b>need to communicate with it</b> <b>it's a communication interface</b> <b>right</b> <b>also</b> <b>there's another line</b> <b>which is Pixel Generation itself</b> <b>meaning how you generate an image</b> <b>a video itself</b> <b>this thing</b> <b>through REPA</b> <b>some of our previous work</b> <b>we can see</b>

<b>it also needs to be based on a good enough</b> <b>representational foundation</b> <b>ah</b> <b>or you can think of it</b> <b>it's a world model</b> <b>um</b> <b>again in my view</b> <b>in my definition</b> <b>representation is the most important part</b> <b>of a world model</b> <b>mm-hmm</b> <b>it's not all of it</b> <b>it's the most important part</b> <b>but when we have such a foundation</b> <b>you can think of it</b> <b>we can easily decode it into language</b> <b>right</b>

<b>and then</b> <b>we can easily decode it into pixels</b> <b>and generate videos</b> <b>we can also decode it into some kind of action</b> <b>some kind of movement</b> <b>so it might be some kind of</b> <b>analog to current VLAs</b> <b>mm-hmm</b> <b>but it's based on a stronger representation</b> <b>a stronger world model architecture</b> <b>what parts does the current representation include</b> <b>language is one of them</b> <b>um, I think it's one of them</b> <b>and then</b> <b>but this is also controversial</b> <b>meaning</b>

<b>like Zhilin you just mentioned</b> <b>he might say he doesn't want vision to contaminate language</b> <b>ah</b> <b>they'll still do multimodal</b> <b>but they want to think about</b> <b>how to make multimodal a smart multimodal</b> <b>right</b> <b>without lowering the overall intelligence level of the brain</b> <b>yes yes yes</b> <b>hey, about this thing</b> <b>but I want to say again</b> <b>this thing</b> <b>it really depends on how you define the problem</b> <b>but let me finish the earlier point first</b> <b>meaning</b>

<b>um this</b> <b>you say</b> <b>for example, the position of language in this</b> <b>right</b> <b>I think we also have our own worries</b> <b>meaning language is actually a poison</b> <b>or language is actually an opiate</b> <b>you add more language</b> <b>you'll always feel happier</b> <b>oh mm-hmm</b> <b>that shows it's useful</b> <b>this crutch</b> <b>it's useful</b> <b>but it's a shortcut</b> <b>if you as a person</b>

<b>if you keep taking this opiate</b> <b>you'll be ruined</b> <b>if it's a crutch</b> <b>and you keep using it</b> <b>you also can't train</b> <b>your leg muscles</b> <b>mm-hmm</b> <b>alright alright</b> <b>this is yours and Zhilin's</b> <b>two perspectives</b> <b>yes, so I'm very worried about language</b> <b>contaminating vision</b> <b>mm-hmm</b> <b>I'm extremely worried about this</b> <b>and moreover</b> <b>this contamination is already happening</b> <b>the way this contamination is happening is as follows</b>

<b>the entire Large Language Model</b> <b>has a huge value chain</b> <b>that transmits step by step from industry to academia</b> <b>this value chain means</b> <b>we have a narrative at the top</b> <b>this narrative is whatever AGI, Scaling Law</b> <b>The Bitter Lesson, LLM</b> <b>the logic of these narratives</b> <b>the current bible</b> <b>yes um</b> <b>let me tell you about The Bitter Lesson</b>

<b>because I absolutely don't think</b> <b>the Large Language Model is</b> <b>a demonstration of</b> <b>The Bitter Lesson</b> <b>mm-hmm</b> <b>um</b> <b>the Large Language Model is actually anti-Bitter Lesson</b> <b>ultimately what representations will be general enough</b> <b>what is its endpoint</b> <b>ah, the endpoint</b> <b>we can call it the world model</b> <b>so maybe we can discuss</b> <b>in my definition</b>

<b>or in the context of this representation</b> <b>what exactly does world model mean</b> <b>what is a world model</b> <b>right</b> <b>this is about to enter your entrepreneurship topic</b> <b>let's first</b> <b>from multimodal to world model</b> <b>mm-hmm right</b> <b>mm-hmm, that's right</b> <b>in strict definitional terms</b> <b>a world model means</b> <b>you're now given a system</b> <b>or the state of an environment</b> <b>um</b>

<b>um</b> <b>this environmental state</b> <b>might be, for example, um</b> <b>you can think of it</b> <b>as the state at the current moment</b> <b>but a world model</b> <b>doesn't necessarily</b> <b>just make temporal predictions</b> <b>but let's not worry about that for now</b> <b>anyway, you first have a system or an environment</b> <b>you have a state S_t</b> <b>right</b> <b>and you have an intervention or action</b> <b>let's call it a_t</b> <b>at the current moment</b>

<b>you apply an action to this system</b> <b>you now hope to learn a predictive function</b> <b>or transition function F</b> <b>so that it can take your action</b> <b>together with your current state</b> <b>this environmental state</b> <b>to predict the next state</b> <b>right, the state at the next moment</b> <b>so this is the most basic general kind of</b> <b>definition of a world model</b>
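The transition-function definition above, s_{t+1} = F(s_t, a_t), can be made concrete with a toy dynamics model (a 1-D point mass, purely illustrative; nothing here is from the conversation beyond the F(s, a) interface):

```python
from typing import Callable, Tuple

State = Tuple[float, float]   # (position, velocity)
Action = float                # acceleration

def make_point_mass_model(dt: float = 0.1) -> Callable[[State, Action], State]:
    """Return a toy world model F(s_t, a_t) -> s_{t+1}: an Euler-integrated
    1-D point mass where the action is an acceleration."""
    def step(state: State, action: Action) -> State:
        pos, vel = state
        vel = vel + action * dt   # the action changes velocity...
        pos = pos + vel * dt      # ...which changes position
        return (pos, vel)
    return step
```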

<b>and this definition itself is actually incredibly straightforward</b> <b>or even somewhat trivial</b> <b>because this isn't a new concept</b> <b>because actually back in 1943</b> <b>there was</b> <b>Kenneth Craik, a Scottish philosopher and psychologist</b> <b>mm-hmm</b> <b>who first proposed this concept</b> <b>he said humans have in their minds</b> <b>such a world model</b> <b>this world model can tell us</b>

<b>when we take some action</b> <b>what consequences will follow</b> <b>mm-hmm</b> <b>because we can predict our actions</b> <b>the consequences our actions bring</b> <b>so this can guide us</b> <b>in what kind of action to take</b> <b>and what kind of decision to make</b> <b>if I know that putting my hand in a fire</b> <b>will hurt, then I won't</b>

<b>put my hand in the fire</b> <b>this thing</b> <b>this kind of prediction structure</b> <b>is also from the past</b> <b>including control theory</b> <b>in the 1960s and 70s</b> <b>how everyone would put</b> <b>a lunar probe to the moon</b> <b>or send it to</b> <b>wherever</b> <b>right</b> <b>and then</b>

<b>everyone actually needs to be based on such a control system</b> <b>for example a classic algorithm</b> <b>called Model Predictive Control</b> <b>this also involves a Model</b> <b>but this Model is actually also a kind of World Model</b> <b>this algorithm is actually very very simple</b> <b>meaning you now need to decide</b> <b>exactly what control signal to apply</b> <b>to this system</b> <b>to enable it to complete</b> <b>a predetermined task</b> <b>mm-hmm right</b> <b>and what I need to do is</b>

<b>at the current moment</b> <b>roll out through my model</b> <b>to continuously output the next</b> <b>k steps of actions</b> <b>an action sequence</b> <b>meaning I need to output</b> <b>my next action sequence</b> <b>a sequence of actions</b> <b>and through this action sequence</b> <b>use my Model to get the next step</b>

<b>or the state at each step</b> <b>and finally I'll also have a, um</b> <b>some kind of cost function</b> <b>a metric function</b> <b>which tells me</b> <b>after I execute this action sequence</b> <b>how far I am from my ultimate goal</b> <b>how far the distance is</b> <b>so this algorithm is very simple</b> <b>you continuously sample your action sequence</b> <b>then jump back to the first step</b> <b>and find</b>

<b>the action sequence with the lowest cost</b> <b>execute its first step</b> <b>then repeatedly iterate to do this action</b> <b>and roll out the next action sequence</b> <b>yes, so each time you need to make a decision</b> <b>and the source of this decision</b> <b>is based on your prediction of the future</b> <b>mm-hmm</b> <b>yes, this is the so-called Model Predictive Control</b> <b>how people use this World Model</b> <b>and then later</b> <b>for example in Model-Based Reinforcement Learning</b>
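The loop he walks through, sample action sequences, roll them out through the model, score them with the cost, execute only the first action of the cheapest sequence, then re-plan, is random-shooting MPC, which can be sketched as (a minimal illustration; the function names and discrete action set are assumptions):

```python
import random

def mpc_step(model, cost, state, horizon=5, n_samples=64,
             actions=(-1.0, 0.0, 1.0), rng=random):
    """One planning step of random-shooting MPC: sample candidate action
    sequences, roll each out through the world model, score the final state
    with the cost function, and return the first action of the cheapest
    sequence (re-planning happens by calling this again at the next step)."""
    best_first, best_cost = actions[0], float("inf")
    for _ in range(n_samples):
        seq = [rng.choice(actions) for _ in range(horizon)]
        s = state
        for a in seq:               # roll the model out over the horizon
            s = model(s, a)
        c = cost(s)
        if c < best_cost:
            best_cost, best_first = c, seq[0]
    return best_first
```

With a trivial world model like `s' = s + a` and a distance-to-target cost, the planner picks the action that moves the state toward the target.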

<b>in Reinforcement Learning</b> <b>people also realized</b> <b>that a World Model is actually very important</b> <b>alright</b> <b>there's a classic paper here</b> <b>called Dyna</b> <b>this paper is actually by Richard S. Sutton, the father of reinforcement learning</b> <b>oh</b> <b>yes, so Richard Sutton himself wrote such a paper</b> <b>and he talked about</b> <b>ah</b> <b>a very interesting viewpoint</b> <b>or a framing</b> <b>he says the human intelligence system</b>

<b>can perhaps be divided into two types</b> <b>one called a reactive policy</b> <b>and one possibly called</b> <b>a more intelligent model-based policy</b> <b>right</b> <b>this thing</b> <b>actually um</b> <b>this analogy is</b> <b>the so-called System 1 and System 2 analogy</b> <b>right, which is human cognition</b> <b>also has so-called thinking fast and slow</b> <b>for very difficult problems</b> <b>we may need more mental cycles</b> <b>to study these problems</b> <b>mm-hmm</b>

<b>but for some problems</b> <b>for example when we drive, right</b> <b>when we first learned to drive we were very nervous</b> <b>looking left and right</b> <b>needing to make many decisions</b> <b>but when you truly learned to drive</b> <b>you internalize these decisions</b> <b>as part of your own muscle memory</b> <b>it becomes a reactive</b> <b>policy right</b> <b>so Richard Sutton in the Dyna paper</b> <b>said something very interesting</b> <b>he said, um</b> <b>what is Reinforcement Learning</b>

<b>Reinforcement Learning is a very primitive</b> <b>a very basic</b> <b>model-free</b> <b>without this world model</b> <b>a learning algorithm</b> <b>ah</b> <b>so Richard Sutton himself was somewhat anti-pure</b> <b>Reinforcement Learning</b> <b>at least at that time</b> <b>in his paper</b> <b>he talks about a better system</b> <b>which of course is</b> <b>if you have a strong enough</b> <b>world model</b>

<b>you can based on the current state</b> <b>predict the next state</b> <b>right, and then</b> <b>you'd have this so-called</b> <b>planning capability</b> <b>which is planning</b> <b>the so-called ability to make plans</b> <b>mm-hmm</b> <b>and then</b> <b>planning and reasoning are in some sense</b> <b>also the same concept</b> <b>reasoning is now very hot in Large Language Models</b> <b>but in fact, um</b> <b>this kind of planning we need</b> <b>and also</b> <b>the significance of planning for decision making</b>

<b>was actually discussed very early on in Control Theory</b> <b>and Reinforcement Learning where everyone was discussing it</b> <b>so I think this is the history of World Models</b> <b>so if we start from this angle</b> <b>the essence of a World Model is</b> <b>how to characterize a system and an environment</b> <b>such that you can make predictions in this system</b> <b>and this prediction can guide your</b> <b>your</b> <b>action sequence</b>

<b>and your own decision-making</b> <b>large language models predict the next word</b> <b>this predicts the next action</b> <b>based on this action</b> <b>predict the next state</b> <b>right</b> <b>how to understand state</b> <b>state is</b> <b>the minimal information needed</b> <b>to fully describe a system</b>

<b>in that sense</b> <b>a source of information, you could say</b> <b>you can think of it that way</b> <b>and this brings up another very interesting thing</b> <b>we need to discuss</b> <b>namely, what exactly is the relationship between state and representation?</b> <b>mm-hmm right</b> <b>um, why do we say</b> <b>it's the minimal unit of information that characterizes the system?</b> <b>it's because, suppose right now</b>

<b>our current physical world</b> <b>right</b> <b>let me say Earth</b> <b>ah, or let me not go that far</b> <b>let's first talk about this room of ours</b> <b>right</b> <b>this is also an environment</b> <b>right</b> <b>so what is the state that characterizes this environment</b> <b>right, this state</b> <b>if you don't pursue this so-called minimum information</b> <b>or minimal descriptions</b> <b>then it can be</b> <b>for example, we now reconstruct this entire space</b> <b>entirely</b> <b>right</b>

<b>and we precisely characterize</b> <b>all the parameters in this system</b> <b>including the texture of this table</b> <b>including our sound waves</b> <b>including</b> <b>we</b> <b>the mass of this table</b> <b>this microphone's</b> <b>various physical parameters</b> <b>mm-hmm alright</b> <b>but we won't characterize this system that way</b> <b>right</b> <b>because much of this information</b> <b>is not important for our decision-making</b> <b>right</b> <b>because</b>

<b>actually if we assume an intelligent agent now</b> <b>living here for the purpose of</b> <b>we're having a conversation</b> <b>mm-hmm</b> <b>then I only need</b> <b>to know some basic facts</b> <b>for example, my microphone can</b> <b>stay on this table</b> <b>and then</b> <b>I won't care about every point of lighting</b> <b>nor will I care about</b> <b>every detail of the texture on the table</b> <b>mm-hmm right</b> <b>these things are all unimportant</b>
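The "minimal state" idea in the room example can be made concrete with a toy sketch. Everything here, the field names, the conversation task, is invented for illustration: a full observation of the room carries far more detail than the decision-relevant state.

```python
# Full environment description: everything one could measure in the room.
full_observation = {
    "table_mass_kg": 12.4,
    "table_texture": "oak grain, 4K scan",
    "lighting_lux_map": [[310, 305], [298, 301]],
    "mic_position": "on_table",
    "table_surface_flat": True,
    "sound_wave_samples": [0.01, -0.02, 0.03],
}

def extract_state(obs, task="keep_conversation_going"):
    """Keep only the minimal information the task's decisions depend on."""
    if task == "keep_conversation_going":
        return {"mic_position": obs["mic_position"],
                "table_surface_flat": obs["table_surface_flat"]}
    return obs  # unknown task: fall back to the full observation

state = extract_state(full_observation)
print(state)  # the texture, lighting map, and sound samples are all dropped
```

The point is that "state" is task-relative: a different task would keep a different minimal subset of the same observation.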

<b>so this state</b> <b>can actually contain a lot of information</b> <b>or can contain enough information</b> <b>meaning sufficient information</b> <b>this thing</b> <b>it depends on what kind of task you need to solve</b> <b>so what is this thing</b> <b>which is how to</b> <b>build such a state</b> <b>this thing</b> <b>is actually directly connected to representation learning</b> <b>mm-hmm</b> <b>representation learning</b> <b>like I just said, right</b>

<b>we need to have a hierarchical representation</b> <b>this hierarchical representation</b> <b>the purpose is actually</b> <b>how we can gradually develop</b> <b>layer by layer, iterating up</b> <b>and becoming increasingly abstract</b> <b>increasingly meaningful for my decision making</b> <b>and increasingly valuable representation</b> <b>mm-hmm</b> <b>it won't be fine-grained to every point</b> <b>it doesn't need to be fine-grained to every point</b> <b>so how do you abstract</b> <b>mm-hmm</b>

<b>and we also can't be fine-grained to every point</b> <b>it just can't be done</b> <b>right</b> <b>because this is very obvious</b> <b>right</b> <b>for example, say we're building an airplane</b> <b>and we want to model</b> <b>the dynamic system of this airplane</b> <b>right, I want to know how to make it</b> <b>more energy-efficient and fuel-efficient</b> <b>ah</b> <b>we can of course</b> <b>start from the lowest level</b>

<b>we can say</b> <b>per cubic centimeter there might be</b> <b>ten to some ten-odd power of molecules,</b> <b>and we model every molecular collision</b> <b>right</b> <b>and then</b> <b>through this approach</b> <b>characterize our system</b> <b>this of course won't work</b> <b>this is a totally stupid way</b> <b>right, what we do instead</b> <b>is</b> <b>how we can statistically</b>

<b>study this problem</b> <b>so that's why there's fluid dynamics</b> <b>and then there would be this</b> <b>Navier-Stokes equation</b> <b>and a series of such settings</b> <b>right, everything becomes increasingly abstract</b> <b>and then</b> <b>but the world we're able to characterize</b> <b>becomes broader and broader</b> <b>mm-hmm</b> <b>actually language is in some sense abstraction</b> <b>language is some kind of abstraction</b> <b>but it's a</b>

<b>proven abstraction</b> <b>it's highly condensed</b> <b>meaning it's an existing abstraction</b> <b>so</b> <b>what you want to build now is a new abstraction</b> <b>beyond language</b> <b>yes</b> <b>it must be a latent representation</b> <b>mm-hmm</b> <b>and this thing</b>

<b>people can understand indirectly</b> <b>what kind of representation you've learned</b> <b>or which representations</b> <b>which representations are meaningful</b> <b>all of this is fine</b> <b>it's not a complete black box</b> <b>but it's not constrained by the syntax of language</b> <b>and logic like that</b> <b>this is why I say LLMs are far from embodying The Bitter Lesson</b> <b>The Bitter Lesson says</b>

<b>you should minimize human knowledge as much as possible</b> <b>right</b> <b>put away your so-called</b> <b>human arrogance,</b> <b>its so-called hubris,</b> <b>and its so-called cleverness,</b> <b>these relatively clever hand-designed structures;</b> <b>minimize them as much as possible</b> <b>and instead do as much as possible</b>

<b>using search and learning to find answers</b> <b>right, but you can imagine</b> <b>if what we're discussing now is how to</b> <b>characterize this world</b> <b>ah</b> <b>language is exactly such a structure</b> <b>language is an extremely clever product of humans</b> <b>mm-hmm</b> <b>it has intricate design</b> <b>it isn't a question of more or less human knowledge;</b> <b>it is entirely human knowledge</b> <b>right mm-hmm</b>

<b>so</b> <b>I think language</b> <b>has its own very strong points</b> <b>and it will definitely, in all future intelligent systems,</b> <b>occupy a very, very important position</b> <b>but it can do CoT (chain of thought)</b> <b>mm-hmm</b> <b>but CoT is another matter</b> <b>CoT is also, um, how should I put it,</b> <b>a product of this stage</b> <b>right</b>

<b>oh, CoT is also a stage-specific product</b> <b>everything about LLMs</b> <b>is a fairly stage-specific product</b> <b>oh</b> <b>that's also why LLMs</b> <b>I also quite agree with Yann</b> <b>meaning LLMs</b> <b>are actually not controllable</b> <b>not safe either</b> <b>because they don't have a true world model</b> <b>we even use LLMs as world models</b> <b>but it's fundamentally flawed</b> <b>it's a flawed world model</b> <b>right</b> <b>and um</b>

<b>what this means is</b> <b>actually meaning</b> <b>all current controllability or safety</b> <b>how does an LLM do this</b> <b>it's entirely designed through fine-tuning</b> <b>to achieve it</b> <b>you need to feed it a lot of data</b> <b>to let it know what should be done</b> <b>what shouldn't be done</b> <b>or what it can't do</b> <b>what can be said</b> <b>what can't be said</b> <b>right</b> <b>what kind of speech might bring danger</b>

<b>what kind of speech</b> <b>might be more friendly</b> <b>so this is called alignment</b> <b>but all of this is based on some kind of</b> <b>post-training or some kind of</b> <b>fine-tuning alignment</b> <b>mm-hmm</b> <b>yes, but a true world model</b> <b>actually you don't need to do this</b> <b>because you can predict</b> <b>what consequence your action will lead to</b> <b>you can</b> <b>your</b> <b>what results your behavior will bring</b> <b>you can then during inference</b>

<b>process</b> <b>try to avoid such behavior</b> <b>mm-hmm</b> <b>you can add some external constraints</b> <b>to tell it</b> <b>you really can't do this</b> <b>for example</b> <b>I have a robot holding a knife cutting vegetables</b> <b>right</b> <b>and how do I ensure now</b> <b>that this robot holding the knife</b> <b>won't turn backward</b> <b>and slash you</b> <b>how do you guarantee this</b> <b>from the perspective of a Language Model</b>

<b>the way you can achieve this is by feeding</b> <b>it a lot of data</b> <b>mm-hmm</b> <b>right, but then it needs to have seen these things</b> <b>that isn't a world model, right</b> <b>a world model</b> <b>doesn't necessarily need that</b> <b>because you're able to foresee this outcome</b> <b>meaning before I</b> <b>take an action</b> <b>I can understand</b> <b>if this knife turns around now</b> <b>and creates a certain danger, what the result would be</b>
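The inference-time safety idea here, predict consequences with the world model and filter unsafe actions before acting, can be sketched as follows. The 1-D "blade angle" dynamics and constraint are invented for illustration; this is not any real robotics API:

```python
# Minimal sketch: before executing, roll each candidate action through a
# predictive world model and discard any action whose predicted states
# violate an externally imposed safety constraint.

def world_model(state, action):
    """Predict the next state (here: blade angle in degrees) for an action."""
    return state + action          # toy deterministic dynamics

def violates_constraint(state):
    """External safety constraint: the blade must never point backward."""
    return not (-90 <= state <= 90)

def safe_plan(state, candidate_actions, horizon=3):
    """Pick the safe action with the largest predicted progress."""
    safe = []
    for a in candidate_actions:
        s, ok = state, True
        for _ in range(horizon):   # imagine the rollout before acting
            s = world_model(s, a)
            if violates_constraint(s):
                ok = False
                break
        if ok:
            safe.append(a)
    return max(safe) if safe else 0  # 0 = do nothing if nothing is safe

print(safe_plan(0, [-60, -10, 10, 60]))  # 60 deg/step reaches 180 -> rejected; 10 chosen
```

The contrast with fine-tuning is that the constraint lives outside the model and is checked at inference time, rather than being baked in through training data.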

<b>how do you let it know</b> <b>um, that's part of your training</b> <b>about the world model</b> <b>it seems the definition hasn't converged yet</b> <b>for example, the world model you define</b> <b>and the world model Li Fei-Fei's team defines</b> <b>what is the difference</b> <b>ah right</b> <b>so what I just elaborated on</b> <b>is actually all the world model in our definition</b> <b>but I think the problems we're encountering now are</b> <b>that this world model is hard to define</b> <b>the reason</b>

<b>is actually that it's not a technical approach</b> <b>it's not an algorithm</b> <b>it's a goal</b> <b>mm-hmm</b> <b>meaning all of us</b> <b>whether you're working on LLMs</b> <b>or Video Diffusion Models</b> <b>or Gaussian Splatting</b> <b>all of us</b> <b>are on the path toward the world model</b> <b>so</b> <b>I say</b> <b>sometimes these competitions</b> <b>or these arguments</b>

<b>I think before long</b> <b>maybe in 1 to 2 years</b> <b>will all seem extremely ridiculous</b> <b>because</b> <b>because we're actually all developing toward this path</b> <b>and everyone knows</b> <b>this should</b> <b>lead to</b> <b>should</b> <b>be the right path</b> <b>it's just that</b> <b>everyone is thinking about this problem from different directions</b> <b>for example</b> <b>in our definition</b> <b>or let me first talk about other people's definitions</b>

<b>for example</b> <b>for a Video Diffusion Model company</b> <b>for example like</b> <b>like Sora</b> <b>like Bytedance's models</b> <b>like Genie (developed by Google DeepMind)</b> <b>right, and then</b> <b>all these models</b> <b>including Runway</b> <b>Luma</b> <b>every company making generative models</b> <b>is doing this</b> <b>all positioning themselves as World Model companies</b> <b>but they're actually still mainly focused on</b>

<b>building a world simulator,</b> <b>the so-called world simulator</b> <b>mm-hmm</b> <b>their goal is still</b> <b>to render visually compelling videos</b> <b>with some kind of consistency</b> <b>able to have sufficiently long content</b> <b>and so on, and you can apply controls to it</b> <b>mm-hmm, you can choose</b> <b>like Genie</b> <b>right</b> <b>take two steps forward</b> <b>take two steps backward</b> <b>you need to ensure you have some memory</b>

<b>or whatever</b> <b>this thing</b> <b>is their kind of world</b> <b>world simulator</b> <b>or this generative world simulator</b> <b>that wants to solve</b> <b>and um</b> <b>Professor Fei-Fei's side</b> <b>at World Labs</b> <b>I think it's more like a frontend</b> <b>an interface for assets</b> <b>this is also very important</b> <b>because it's a strong 3D representation</b> <b>so</b>

<b>By the way</b> <b>also congratulations</b> <b>didn't they just successfully raise funding</b> <b>if you can see</b> <b>their lead investors</b> <b>the people they're discussing with</b> <b>for example I saw in the news</b> <b>Autodesk invested $200 million in them</b> <b>mm-hmm</b> <b>so</b> <b>what kind of company is Autodesk</b> <b>Autodesk is a company doing 3D modeling, visualization and CAD</b> <b>or whatever design kind of company</b> <b>right</b> <b>so in this scenario</b>

<b>you need a very, very concrete 3D</b> <b>one</b> <b>you also</b> <b>can call it representation</b> <b>it's also some kind of representation</b> <b>but it means this thing</b> <b>is not an abstract concept</b> <b>right, it's not hidden in your parameters</b> <b>it needs to have an explicit 3D</b> <b>form there</b> <b>that way</b> <b>you can then in this space</b> <b>master some kind of spatial intelligence</b> <b>you can then explore in this space</b>

<b>and you can be one hundred percent certain</b> <b>you won't make mistakes</b> <b>for a World Simulator,</b> <b>a Generative World Simulator,</b> <b>that's not necessarily the case</b> <b>right, although you can, through longer context,</b> <b>have better memory,</b> <b>it cannot be guaranteed</b> <b>mm-hmm</b> <b>and what we want to do</b> <b>is actually more like</b> <b>building a predictive brain</b> <b>yes meaning</b> <b>we</b>

<b>the core of how we view this problem</b> <b>is still about how to enhance</b> <b>intelligence itself</b> <b>yes, so that means</b> <b>you think LLMs are not intelligent enough</b> <b>I think, again</b> <b>LLM is a crucial</b> <b>part of this intelligence system</b> <b>it's a module</b> <b>but it's not everything</b> <b>it's not everything</b> <b>right</b> <b>let me give another example</b> <b>for example, why when LLMs do world modeling</b> <b>it's fundamentally</b> <b>flawed</b> <b>for example</b>

<b>let's go back to this vision question</b> <b>right, we're now sitting here</b> <b>mm-hmm</b> <b>if we turn our head slightly,</b> <b>say 5 or 10 degrees,</b> <b>that generates hundreds of frames</b> <b>actually this frequency is very, very high</b> <b>humans can perceive flicker at,</b> <b>say, 100 Hz</b> <b>extremely impressive</b> <b>right</b> <b>if you process this problem the way an LLM does,</b> <b>what would happen?</b>

<b>mm-hmm</b> <b>at least processing it the current way</b> <b>what would happen is</b> <b>I would need to tokenize every frame</b> <b>we flatten it</b> <b>stringing it into a very very long sequence</b> <b>every frame</b> <b>I can do some downsampling</b> <b>or whatever, doesn't matter</b> <b>and then we string them together</b> <b>right, say I have 256 tokens per frame</b> <b>now you might have 32 frames or 128 frames</b> <b>stringing them together</b>

<b>then you'd have 256 times 128 tokens</b> <b>then you put them into a Large Language Model</b> <b>and align it with language</b> <b>and finally answer a question</b> <b>but does this make sense</b> <b>it makes no sense at all</b> <b>mm-hmm</b> <b>because you're actually taking this kind of world</b> <b>representation</b> <b>mm-hmm</b> <b>behind it</b> <b>there's actually some kind of global state</b> <b>right</b>
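The token arithmetic above, written out (using the speaker's figure of 256 tokens per frame):

```python
# Back-of-the-envelope for the sequence-length blow-up from tokenizing
# every frame and flattening the frames into one long sequence.
tokens_per_frame = 256
for n_frames in (32, 128):
    total = tokens_per_frame * n_frames
    print(n_frames, "frames ->", total, "tokens")

# Self-attention cost grows with the square of the sequence length, so a
# few seconds of video is already enormous.
attention_pairs = (tokens_per_frame * 128) ** 2
print(f"{attention_pairs:.2e} attention pairs")  # ~1.07e9 for 128 frames
```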

<b>you serialize it</b> <b>into a very very redundant token</b> <b>mm-hmm</b> <b>and Transformer</b> <b>people say it doesn't have much</b> <b>inductive bias</b> <b>it actually still has some inductive bias</b> <b>its inductive bias is</b> <b>it has to pay equal attention to every single token</b> <b>oh</b> <b>well, that itself is unreasonable</b> <b>right</b> <b>what this represents is</b> <b>the modeling technique of language models</b>

<b>cannot resolve the cognition of these continuous</b> <b>spatial signals</b> <b>this doesn't hold</b> <b>so</b> <b>this is why</b> <b>For us,</b> <b>when it comes to the world model we're building,</b> <b>I think</b> <b>it needs to have the following characteristics</b> <b>right, it needs to</b> <b>um,</b> <b>be able to understand the physical world</b> <b>and the definition here</b> <b>is that it must be the physical world</b>

<b>although the world model application will also extend to</b> <b>things like</b> <b>digital agents to</b> <b>like a gaming agent</b> <b>will of course also benefit from the World Model</b> <b>but</b> <b>I think its primary task</b> <b>is to solve the problem of physical world understanding</b> <b>and it needs to have sufficiently large associative memory</b> <b>Memory is also a very very important</b> <b>component of a World Model-based</b> <b>system as a whole</b>

<b>mm-hmm</b> <b>and it needs to be able to reason</b> <b>able to plan</b> <b>mm-hmm</b> <b>we just talked about planning</b> <b>able to</b> <b>able to do this kind of counterfactual reasoning</b> <b>or this kind of causal inference</b> <b>also very very important</b> <b>and the last point</b> <b>is that it needs to be sufficiently controllable and safe</b> <b>it needs to be a safe system</b> <b>right, I think all these things</b> <b>I'm actually borrowing from Yann on this</b> <b>these talking points</b> <b>but I think</b>

<b>these points are actually very very insightful</b> <b>right, not too many, not too few</b> <b>mm-hmm</b> <b>it and large language models</b> <b>are not in a derivative relationship</b> <b>they're in a replacement relationship</b> <b>uh</b> <b>I think</b> <b>it's not exactly a replacement relationship either</b> <b>uh</b> <b>why did I just say that everyone in the field</b> <b>is moving toward world models</b> <b>moving forward?</b>

<b>the reason is</b> <b>large language models also want to evolve toward world models</b> <b>actually that's not quite what I mean</b> <b>what I mean is before large language models existed</b> <b>we couldn't really talk about world models at all</b> <b>if you have a purely RL-based system</b> <b>you're purely doing overfitting</b> <b>to the current environment</b> <b>Large Language Models</b> <b>gave you a certain degree of</b> <b>cognitive ability about the real world</b> <b>it forms one element</b> <b>mm-hmm, it forms one element</b>

<b>but this thing</b> <b>as I said, is fundamentally flawed</b> <b>because its cognition is too indirect</b> <b>yeah</b> <b>what language can give you is really just too little</b> <b>mm-hmm right</b> <b>and language has other problems too</b> <b>namely it is a</b> <b>fundamentally a communication tool</b> <b>so when we use language</b> <b>unless you're saying something like</b> <b>in a dream state</b> <b>like talking in your sleep</b>

<b>most of the time</b> <b>you use language with an intention</b> <b>you want to convey a purpose</b> <b>so LLMs are more like</b> <b>in my view, more like an extension of a search engine</b> <b>right?</b> <b>or a chatbot is more like an extension of a search engine</b>

<b>we always bring the purpose in our mind</b> <b>to ask a question</b> <b>and expect an answer</b> <b>right?</b>

<b>but this is not what</b> <b>a World Model is</b> <b>in essence</b> <b>as I just said</b> <b>the World Model in our brain</b> <b>is doing a lot of work</b> <b>in the background</b> <b>there's even a lot of psychology</b> <b>some counterintuitive findings</b> <b>that say</b> <b>your brain has already made the decision for you</b> <b>before you decide to</b>

<b>say there are three buttons on my desk</b> <b>before I know which button I want to press</b> <b>I can already detect</b> <b>that my brain</b> <b>has already made that decision for me</b> <b>this experiment</b> <b>is called something like the Libet experiment or something</b> <b>it's a controversial experiment</b> <b>but what it demonstrates is</b> <b>many things are happening in your background</b> <b>already happening in your brain</b> <b>this is part of your world model</b> <b>a Language Model is not like that</b>

<b>language is just a communication tool</b> <b>you always come with a purpose</b> <b>throw out a question</b> <b>and want to get an answer</b> <b>it's also a reasoning tool</b> <b>right</b> <b>it's also a reasoning tool</b> <b>of course, but only a symbolic-level reasoning tool</b> <b>so you want to build</b> <b>a world model like the human brain</b> <b>I think we need to look more and more at people</b> <b>mm-hmm, actually not just people</b> <b>all kinds of animals</b>

<b>how their intelligence actually arises</b> <b>mm-hmm right</b> <b>let me, let me first conclude</b> <b>what I just said</b> <b>which is</b> <b>why is everyone step by step</b> <b>converging on this World Model?</b>

<b>the reason is language models</b> <b>have already shown a bit of</b> <b>World Model-like behavior</b> <b>even though it has no actions</b> <b>it has no real understanding of the physical world</b> <b>and it can't truly reason and plan</b> <b>because its planning through CoT</b> <b>and its reasoning through CoT</b> <b>is still very different</b> <b>from what I just described</b> <b>like MPC-level</b> <b>planning</b> <b>CoT also brings its own set of problems</b> <b>but all that's fine</b> <b>but the next step</b> <b>you'll see</b>

<b>for example everyone's doing</b> <b>whether DiT or</b> <b>whatever model</b> <b>but people started doing generative models</b> <b>and that has made things somewhat different</b> <b>right?</b>

<b>mm-hmm, and that's why many people</b> <b>who do video generation call it a world model</b> <b>I think that's understandable</b> <b>although</b> <b>I don't agree that the video generation</b> <b>model they're doing</b> <b>is the final end game world model</b> <b>but it has indeed pushed one step beyond language models</b> <b>right</b> <b>how does it do that?</b>

<b>on top of language models</b> <b>uh</b> <b>I think all these systems now</b> <b>actually still rely on language models</b> <b>right?</b>

<b>they still use language models to do prompt</b> <b>rewriting and then to help</b> <b>serve as a conditioning</b> <b>fed into the video generation model</b> <b>and language models have actually become</b> <b>you know</b> <b>the historical progression here is quite interesting</b> <b>language models used to be the main thing</b> <b>now language models have become</b> <b>a preparatory step for video generation models</b> <b>a scaffolding</b>

<b>in the old language models</b> <b>what you modeled was P(y)</b> <b>right?</b> <b>and that y is still in some semantic space</b>

<b>information in some kind of label space</b> <b>mm-hmm, but now with video generation models</b> <b>what you model is the probability P(x|y)</b> <b>what this means is</b> <b>what you're modeling now is already x</b> <b>x is the data itself</b> <b>your y has become</b> <b>a condition — this is already very different</b> <b>okay</b> <b>why is it so different?</b>

<b>it's because when you have a low dimensional y</b> <b>space</b> <b>and then you</b> <b>go to model such a distribution</b> <b>your probability density</b> <b>only competes within your y's distribution</b> <b>meaning</b> <b>the likelihood you assign</b> <b>I'm getting a bit too technical here</b> <b>but anyway</b> <b>or let's not talk about language models first</b> <b>let's first talk about</b> <b>say</b> <b>a model that classifies 1000 categories</b>

<b>you can think of</b> <b>these few labels as a precursor to language</b> <b>it's also a low-dimensional vocabulary</b> <b>right?</b>

<b>and then</b> <b>if you're doing a classification problem like this</b> <b>all the decisions you need to make are</b> <b>if this thing is a cat</b> <b>it can't be a dog</b> <b>right?</b>

<b>this thing is constrained by my label set</b> <b>mm-hmm</b> <b>but when you start modeling P(x|y)</b> <b>when you're doing a generative model</b> <b>the likelihood you assign in this case says</b> <b>what phenomena actually exist in the world</b> <b>which things are more likely to exist</b> <b>that becomes very very different</b> <b>right? because what you need to learn now</b>

<b>the amount of intelligence information</b> <b>is far greater than what you get from modeling P(y)</b> <b>you need to understand why in this world</b> <b>a four-legged cat</b> <b>is more common than a three-legged cat</b> <b>right?</b>

<b>why if I'm generating a video</b> <b>say I have, I don't know</b> <b>a running video</b> <b>why would I have</b> <b>a smooth running state</b> <b>rather than suddenly hallucinating three legs</b> <b>four legs</b> <b>which is more believable</b> <b>more probable, right?</b>

<b>in probability space</b> <b>more probable</b> <b>this already carries enormous amounts of information</b> <b>what you need to model</b> <b>far exceeds what you need to capture in language space</b> <b>or in label space</b> <b>right?</b>
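A rough back-of-the-envelope for the gap between modeling P(y) and P(x|y): a 1000-way label carries about 10 bits, while even a single small RGB frame carries megabits the model must assign likelihood over. The 256x256 frame size is an assumption chosen for illustration:

```python
import math

# Information content of the two spaces the speaker contrasts.
label_bits = math.log2(1000)                  # 1000-way classifier output space
frame_bits = 256 * 256 * 3 * 8                # one 256x256 RGB frame, 8 bits/channel

print(f"label space: {label_bits:.1f} bits")          # ~10.0 bits
print(f"pixel space: {frame_bits} bits per frame")    # 1572864 bits
print(f"ratio: {frame_bits / label_bits:.0f}x")
```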

<b>you already need some understanding of the world</b> <b>so this is already more</b> <b>in line with the Bitter Lesson in my view</b> <b>meaning</b> <b>you've abandoned more of the</b> <b>cognition in language space</b> <b>and its logic</b> <b>and its syntactic structure</b> <b>and started modeling pixels</b> <b>started modeling</b> <b>the pixels themselves</b> <b>but taking it one step further</b>

<b>pixels themselves might also be wrong</b> <b>pixels themselves are also not Bitter Lesson enough</b> <b>mm-hmm</b> <b>what are pixels</b> <b>pixels are a human-defined</b> <b>regular grid</b> <b>just a grid of little boxes</b> <b>each little box might have</b> <b>8 bits of information</b> <b>and you might have this kind of lattice</b> <b>like a cell by cell by cell arrangement</b> <b>this is a pixel</b>

<b>this is each frame of the image we see</b> <b>right?</b>

<b>this is also an interface</b> <b>mm-hmm</b> <b>this is also made for humans to see</b> <b>right?</b>

<b>that's why world simulators</b> <b>why do people think Genie</b> <b>is so cool</b> <b>because we create a video</b> <b>we create a game</b> <b>this is for humans to see</b> <b>but taking it one step further</b> <b>the real Bitter Lesson says</b> <b>I don't need to make it for humans to see</b> <b>why do I need to make it for humans?</b>

<b>right?</b>

<b>who is it for?</b>

<b>it's for your system to see,</b> <b>for your world model to see</b> <b>mm-hmm</b> <b>it depends on what you ultimately want</b> <b>it can be for humans to see,</b> <b>but being for humans to see</b> <b>is not the core of a World Model;</b> <b>it's the interface of the World Model</b> <b>the World Model itself</b> <b>is spontaneously</b> <b>learning better representations,</b> <b>making better predictions</b> <b>right?</b>

<b>but this thing itself</b> <b>whether or not you want to generate a cool video</b> <b>is actually irrelevant</b> <b>and whether or not you can answer</b> <b>some questions about your input space</b> <b>is also actually irrelevant</b> <b>so again</b> <b>let me repeat what I was just trying to say</b> <b>each of us</b> <b>is moving forward on the road toward world models</b> <b>the world model is a goal</b> <b>not a specific path</b> <b>uh, not a specific algorithm</b>

<b>or a specific technical roadmap</b> <b>and someday</b> <b>we will have a better world model</b> <b>mm-hmm</b> <b>language models will, on top of that</b> <b>also get stronger</b> <b>we'll have better multimodal models</b> <b>that can better understand the world</b> <b>and we'll have better video generation models</b> <b>mm-hmm</b> <b>and I think RAE is</b> <b>an early prototype in this process</b> <b>mm-hmm yeah</b> <b>so now there's also a very hot concept</b>

<b>the so-called Unified Model or Omni Model</b> <b>where people try to stack all the data</b> <b>together</b> <b>so that we can have one system</b> <b>that can do both understanding</b> <b>and generation</b> <b>what people also discuss is</b> <b>does understanding help generation</b> <b>or does generation help understanding</b> <b>mm-hmm</b> <b>I think neither really matters</b> <b>understanding and generation are one</b> <b>both need a real World Model</b>

<b>as their foundation</b> <b>right</b> <b>once you have that good World Model</b> <b>that can do some kind of prediction</b> <b>can do some kind of planning and reasoning</b> <b>the upper-layer decoding</b> <b>is actually very very simple</b> <b>so you think they're all built on top of</b> <b>the world model</b> <b>which is the base layer</b> <b>right</b> <b>you can think of it as</b> <b>what we want to do</b> <b>or what the representation school wants to do is</b> <b>the very bottom layer of the cake</b>

<b>this base</b> <b>the representation school asks</b> <b>how to unify representations:</b> <b>unifying them with language,</b> <b>ultimately condensing everything into a few abstract representations</b> <b>so you still need scaling, right?</b>

<b>you still need to</b> <b>besides language, what other scaling</b> <b>can we currently see?</b>

<b>language scaling</b> <b>we just touched on this</b> <b>language scaling itself</b> <b>I think is again</b> <b>something a bit hard to articulate clearly</b> <b>because we also know</b> <b>there's a theory</b> <b>which says compression is intelligence</b> <b>right?</b>

<b>compression equals intelligence</b> <b>yes, but what it's saying is</b> <b>your language model</b> <b>is actually a lossless compression process</b> <b>or rather, language models</b> <b>getting bigger and improving results</b> <b>is not because they're memorizing by rote,</b> <b>having memorized all of this content;</b> <b>it's simply a stronger model</b>

<b>so it can have a better compression ratio</b> <b>to compress all of your input information</b> <b>it brings some kind of generalization ability</b> <b>I think I agree with this view</b> <b>but I want to step back a bit</b> <b>I want to say</b> <b>actually, because of the nature of the problems language models care about,</b> <b>their Scaling Laws actually</b> <b>contain some padding</b> <b>and what I mean by padding is:</b> <b>it doesn't use the smallest possible model</b>

<b>that answers questions by truly understanding the world</b> <b>it doesn't need to</b> <b>and all our benchmarks,</b> <b>and the tasks humans use Large Language Models</b> <b>to achieve,</b> <b>also require it to be able to retrieve</b> <b>right, to be able to</b> <b>retrieve factual knowledge</b> <b>if a model</b> <b>right, can't tell me</b>

<b>say a specific person's name on Wikipedia</b> <b>what they did in the past</b> <b>that's a very poor</b> <b>Large Language Model</b> <b>so</b> <b>so what I want to say is</b> <b>the Scaling Law of language models</b> <b>is based on a representation of knowledge</b> <b>that's the Scaling Law derived from that</b> <b>so that's why</b> <b>it may have a relatively balanced ratio</b> <b>meaning your number of tokens</b> <b>your data and your parameters</b>

<b>need to scale roughly 1:1</b> <b>that's how it works</b> <b>in one approach</b> <b>right?</b>
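One way to read the "roughly 1:1" ratio is the compute-optimal scaling result from the Chinchilla study (Hoffmann et al., 2022). This is a gloss, not something stated in the interview: for a training compute budget \(C\), parameter count \(N\) and token count \(D\) should grow in equal proportion,

```latex
C \approx 6ND, \qquad
N_{\mathrm{opt}} \propto C^{a}, \quad
D_{\mathrm{opt}} \propto C^{b}, \quad
a \approx b \approx 0.5,
```

so doubling the compute-optimal model size goes with roughly doubling the training tokens.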

<b>then scale up</b> <b>world models, especially visual intelligence-based</b> <b>world models</b> <b>I think</b> <b>will have a very very different Scaling Law</b> <b>it will have a Scaling Law</b> <b>but the slope of that Scaling Law may be completely different</b> <b>or its ratio may be completely different</b> <b>my current intuition is</b> <b>the model won't be that large</b> <b>the model doesn't need many training parameters</b> <b>because you don't need to remember</b>

<b>if you want to do video generation,</b> <b>that's a different story,</b> <b>but you don't need to remember everything,</b> <b>all the subtle details in the world that you can see</b> <b>you don't need to</b> <b>solve some differential equation</b> <b>in some very high-dimensional space</b> <b>to determine whether an apple falls</b> <b>mm-hmm</b> <b>it doesn't need to do these things</b> <b>it doesn't need the highest level of human intelligence</b> <b>we can discuss separately what human intelligence actually is,</b> <b>but anyway</b>

<b>it doesn't need these things</b> <b>it doesn't need to memorize all</b> <b>this knowledge</b> <b>it needs good understanding capability</b> <b>to process and filter information</b> <b>and then</b> <b>because ultimately</b> <b>what really matters is the decision itself</b> <b>mm-hmm</b> <b>right so</b> <b>so this will become more and more like humans</b> <b>because that's how humans are</b> <b>there are some very important facts about humans</b> <b>right?</b>

<b>like the human visual system</b> <b>or rather</b> <b>all human sensors combined</b> <b>including hearing, vision, smell</b> <b>touch, all of these</b> <b>this</b> <b>is actually extremely high bandwidth</b> <b>this bandwidth might reach</b> <b>say 1 billion bits per second</b> <b>in the range of 100 million to 1 billion</b> <b>mm-hmm</b> <b>but when we're talking right now</b> <b>the bandwidth is extremely low</b> <b>the bandwidth is only</b>

<b>ten to one hundred bits per second</b> <b>mm-hmm</b> <b>so what's actually happening?</b>

<b>right?</b>

<b>what kind of model is our brain</b> <b>that at twenty watts of power</b> <b>takes in one billion bits per second of information</b> <b>through our eyes</b> <b>and all kinds of sensory inputs</b> <b>and converts it into 10 bits per second of</b> <b>behavioral output</b> <b>this is the World Model itself</b> <b>it filters out large amounts of useless information and noise</b> <b>right, there's a lot of redundancy</b> <b>it knows what's important</b>

<b>and what's not important</b> <b>so the filtering system is very important</b> <b>right, of course</b> <b>this is also a hierarchical filtering system</b> <b>mm-hmm</b> <b>mm-hmm, that's indeed the case</b> <b>so how do you train this world model?</b>
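The in/out bandwidth gap described above can be put into rough numbers. A hedged sketch: all three figures are the speaker's order-of-magnitude quotes, not measurements.

```python
# Rough arithmetic for the brain-as-world-model bandwidth gap:
# ~1 billion bits/s of sensory input is collapsed into ~10 bits/s of
# behavioral output on a ~20 watt power budget (all quoted figures).
sensory_in_bps = 1e9     # combined sensory input, bits/s (quoted figure)
behavior_out_bps = 10.0  # behavioral output, bits/s (quoted figure)
power_watts = 20.0       # approximate brain power budget (quoted figure)

# The filtering system discards roughly eight orders of magnitude.
compression_ratio = sensory_in_bps / behavior_out_bps
print(f"compression ratio ~ {compression_ratio:.0e}")  # 1e+08
```

On these figures, the "world model" is a ~100-million-to-1 filter, which is why the speaker frames filtering, not memorization, as the core capability.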

<b>uh, language models are easy to train</b> <b>because internet information is just sitting there</b> <b>so you can train it</b> <b>but with world models, it seems like</b> <b>I don't even know where to begin</b> <b>right, I think this is the biggest bet</b> <b>because the closer you get to</b> <b>the essence of intelligence</b> <b>things become</b> <b>much harder</b> <b>mm-hmm right</b> <b>I think like you said</b> <b>we went through the period of dumping the entire internet</b> <b>to train models</b>

<b>that era</b> <b>I think going forward</b> <b>uh</b> <b>I honestly don't know if this path will work</b> <b>I have enough confidence</b> <b>but if you asked me whether it's 100% guaranteed to succeed</b> <b>not necessarily</b> <b>the reason still comes down to data</b> <b>can we actually pull this off</b> <b>to the fullest extent</b> <b>how much data does it need?</b>

<b>what kind of data?</b>

<b>I think the past era was about dumping</b> <b>or downloading, I should say</b> <b>the Internet</b> <b>now the new era is about downloading</b> <b>the human</b> <b>mm-hmm</b> <b>we need to download humanity</b> <b>mm-hmm</b> <b>so right now, again</b> <b>right, humans have put this knowledge</b> <b>into something called the Internet</b> <b>we can upload it</b> <b>we can train a Transformer</b> <b>everything is good</b> <b>but for truly understanding the world</b>

<b>a 4-year-old child</b> <b>the videos they've seen — Yann often cites this example</b> <b>already exceed all the tokens</b> <b>used to train all of these</b> <b>large language models</b> <b>right?</b>

<b>a four-month-old baby</b> <b>the amount of video they've seen</b> <b>exceeds all 30 trillion tokens</b> <b>of the best large language models' data</b> <b>right?</b>

<b>so this magnitude is just enormous</b> <b>so when I said we need to download humanity</b> <b>the data that human eyes see</b> <b>how do we actually collect that data?</b>

<b>right?</b>
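Yann LeCun's child-vs-LLM comparison is itself back-of-envelope arithmetic; a hedged sketch follows. Every parameter below is an illustrative assumption (the optic-nerve rate and waking-hour count are commonly quoted rough figures); only the 30-trillion-token count comes from the conversation.

```python
# Hedged back-of-envelope for the child-vs-LLM data comparison.
SECONDS_PER_HOUR = 3600
optic_nerve_bytes_per_s = 2e6  # ~2 MB/s across both eyes (assumed)
waking_hours = 16_000          # ~4 years at ~11 waking hours/day (assumed)
visual_bytes = optic_nerve_bytes_per_s * waking_hours * SECONDS_PER_HOUR

llm_tokens = 30e12   # "30 trillion tokens" from the conversation
bytes_per_token = 2  # rough average for subword tokenizers (assumed)
text_bytes = llm_tokens * bytes_per_token

print(f"child visual data ~ {visual_bytes:.1e} bytes")  # 1.2e+14
print(f"LLM training text ~ {text_bytes:.1e} bytes")    # 6.0e+13
```

Under these assumptions the child's visual stream already exceeds the LLM text corpus; the exact margin swings with the assumed bandwidth and waking hours, but the order-of-magnitude point survives.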

<b>I think video is still the answer</b> <b>that's why</b> <b>even before this</b> <b>I was very eager to do more video-</b> <b>related research</b> <b>I think this is the best hope we have right now</b> <b>right mm-hmm</b> <b>oh this might have a very high barrier</b> <b>but I don't think it's necessarily impossible</b> <b>I think we can proceed in several stages</b> <b>first we can start with internet data</b> <b>start with YouTube</b> <b>mm-hmm</b> <b>as I was saying</b> <b>no matter what</b>

<b>all of these training tokens</b> <b>tens of trillions of tokens</b> <b>versus a four-month-old baby</b> <b>who has seen that much information</b> <b>all that data</b> <b>amounts to about 30 minutes of YouTube uploads</b> <b>there's a massive amount of data on YouTube</b> <b>mm-hmm</b> <b>is there a copyright issue with that?</b>

<b>uh</b> <b>everyone knows there are copyright issues</b> <b>and everyone</b> <b>is continuing to do it anyway</b> <b>mm-hmm yeah</b> <b>I think</b> <b>at some point there will definitely be major copyright issues</b> <b>or rather this isn't just a copyright issue</b> <b>because YouTube may not own the copyright to these videos</b> <b>but it's a terms of service issue</b> <b>YouTube prohibits you from scraping this data</b> <b>which makes this data extremely hard to collect</b> <b>basically impossible to get</b> <b>you download a few videos</b>

<b>and YouTube blocks your IP</b> <b>and then</b> <b>you have to switch to a new IP</b> <b>right, so it's kind of</b> <b>now I think</b> <b>these data companies and these platforms</b> <b>are in this cat-and-mouse dynamic</b> <b>mm-hmm</b> <b>one side is tightly guarding against data collection</b> <b>blocking you from scraping</b> <b>the other side is trying every means to get more data</b> <b>mm-hmm right</b>

<b>I don't know how it will end</b> <b>right</b> <b>wow, ByteDance has such a huge advantage</b> <b>and ByteDance doesn't care</b> <b>right?</b>

<b>but they've received a lot of cease-and-desist letters too</b> <b>so I don't know</b> <b>I think going forward there may be more</b> <b>right, but I think</b> <b>this gets into human society's</b> <b>more political optimization</b> <b>mm-hmm alright</b> <b>step one is video</b> <b>and then next</b> <b>running in parallel is</b> <b>I think</b> <b>this kind of world model</b> <b>or</b>

<b>this very vision-centric world model</b> <b>will have some very promising application prospects</b> <b>because I think doing only research isn't enough</b> <b>the reason LLM succeeded</b> <b>is also because the chatbot interface</b> <b>was so successful</b> <b>so natural</b> <b>it relies on</b> <b>the internet</b> <b>on mobile devices</b> <b>but it's a very good interface</b> <b>a very very good product</b>

<b>so even OpenAI's own people didn't realize it</b> <b>right but</b> <b>when we talk about world models</b> <b>especially</b> <b>the world model we just defined</b> <b>what is the ultimate product exactly?</b>

<b>I think this</b> <b>might be the real hard problem</b> <b>mm-hmm</b> <b>maybe an even harder problem than data</b> <b>so right now</b> <b>if I just brainstorm ideas</b> <b>off the top of my head</b> <b>the ideas might all be wrong in the end</b> <b>but there are at least two outlets</b> <b>one is something like AI glasses</b> <b>this kind of truly personal assistant</b> <b>this needs a World Model</b> <b>with only a language model</b>

<b>that's not enough</b> <b>with only a language model</b> <b>it's still just ChatGPT</b> <b>but with a screen and voice interaction</b> <b>right?</b>

<b>it can't break out of that product form</b> <b>for example I often give people this example</b> <b>I'm now wearing some wearable devices</b> <b>they're not real AI wearable devices</b> <b>right?</b>

<b>but somehow</b> <b>they possess some traits I think are</b> <b>world model-like</b> <b>mm-hmm</b> <b>the reason is they're an always-on device</b> <b>it's always on</b> <b>always monitoring your vital signs</b> <b>right?</b>

<b>and there's a large amount of information</b> <b>because every second</b> <b>right, I'm not sure</b> <b>at what frequency it collects this information</b> <b>but my heart is always beating</b> <b>so it can always track this information</b> <b>and then where does this information go?</b>

<b>right?</b>

<b>this information itself is meaningless to me</b> <b>knowing my heart rate</b> <b>BPM at a certain moment</b> <b>has no meaning to me at all</b> <b>so it needs intelligent decision-making</b> <b>to tell me</b> <b>you seem to be under too much stress</b> <b>right, you're under too much pressure now</b> <b>you need to slow down</b> <b>and then saying</b> <b>your sleep hasn't been very good the past few days</b> <b>you might need to consider</b> <b>some remedial measures</b>

<b>or maybe you should take a day off today</b> <b>right?</b>

<b>I think this is actually quite world model-like</b> <b>except</b> <b>this is the most basic world model possible</b> <b>because the information it can get is just too little</b> <b>mm-hmm</b> <b>it's very narrow information</b> <b>right, very very narrow</b> <b>mm-hmm right?</b>
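The wearable example — a raw high-frequency signal filtered down to a single low-bandwidth recommendation — can be sketched as a toy pipeline. The function name, fields, and thresholds are invented for illustration; a real system would be far more layered.

```python
# Toy version of the wearable "world model" described above: a raw
# heart-rate stream is collapsed into one low-bandwidth recommendation.
# Thresholds and messages are invented for illustration only.
from statistics import mean

def recommend(heart_rates_bpm: list[int], resting_bpm: int = 60) -> str:
    """Collapse a raw sensor stream into a single decision."""
    avg = mean(heart_rates_bpm)
    if avg > resting_bpm * 1.5:  # sustained elevation over resting rate
        return "you seem to be under too much stress; slow down"
    return "all good"

print(recommend([95, 100, 98, 102]))  # sustained elevation -> stress warning
```

The point of the sketch is the shape of the computation: a stream of numbers in, one actionable sentence out — filtering, not recording.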

<b>but I think this</b> <b>is a glimpse of a future world model</b> <b>in AI wearables</b> <b>mm-hmm</b> <b>because if we imagine there were actually glasses</b> <b>or right</b> <b>I know you don't like wearing glasses</b> <b>but suppose there were some kind of wearable device</b> <b>that could truly be always on</b> <b>we don't know how to solve the power consumption issue</b> <b>never mind the hardware issues</b> <b>let's set that aside</b> <b>but it could see in real time</b> <b>everything we can see</b>

<b>right?</b>

<b>with completely always-on</b> <b>and infinite tokens</b> <b>flowing into the system</b> <b>mm-hmm</b> <b>I think this</b> <b>actually has enormous potential</b> <b>and first of all</b> <b>I'd really want this thing</b> <b>because I want to know at what time I drank a coffee</b> <b>and whether I drank that coffee an hour too early</b> <b>or an hour too late</b> <b>causing my sleep that night to not be as good</b> <b>or say I'm an athlete</b>

<b>who wants guidance on every movement</b> <b>or say I work in a hospital</b> <b>and I want to equip every elderly person in the nursing home</b> <b>with such a wearable</b> <b>so I know</b> <b>what their daily behavioral patterns are</b> <b>what medications they've taken</b> <b>what they've been doing</b> <b>ah</b> <b>how they're feeling emotionally</b> <b>right, what their condition is</b> <b>mm-hmm yeah</b> <b>and link it to their medical records in the background</b> <b>and provide better intelligent decision-making</b>

<b>I think there are many many similar examples</b> <b>right, but this is based on current LLMs</b> <b>and existing multimodal intelligence</b> <b>which I think actually can't do this</b> <b>mm-hmm</b> <b>and then</b> <b>another outlet</b> <b>we also just touched on this</b> <b>I think it's Robotics</b> <b>I think Robotics</b> <b>faces the problem of</b> <b>the brain not being good enough</b> <b>mm-hmm</b> <b>even if it can do martial arts</b> <b>and put on performances</b> <b>of course</b> <b>you can't deny</b>

<b>that's also a good vertical domain</b> <b>right, the entertainment market</b> <b>might also be quite big</b> <b>so let robots go perform then</b> <b>I think that's fine too</b> <b>but this is far from a general-purpose</b> <b>robot that can enter every home</b> <b>carry elderly people up and down stairs</b> <b>take care of their daily needs</b> <b>this is</b> <b>still extremely far away</b> <b>mm-hmm, robots that can actually do work are still uncharted territory</b>

<b>[laughs] yes, yes</b> <b>oh and I think here you can see</b> <b>robotics</b> <b>is actually</b> <b>a very good downstream application</b> <b>because no matter what new upstream progress there is</b> <b>in the broad world model sense</b> <b>like these glasses</b> <b>ah</b> <b>robots can benefit from it</b> <b>mm-hmm</b> <b>for example LLMs came out</b> <b>and we had VLA, right?</b>

<b>that was hot for a while</b> <b>now video diffusion is doing well</b> <b>action-conditioned video diffusion is doing well</b> <b>right?</b>

<b>this generative approach</b> <b>this world simulator doing well</b> <b>so we're also discussing</b> <b>how robots can use these models</b> <b>to do</b> <b>better action planning</b> <b>right, there's a lot of work like that</b> <b>so as I said</b> <b>I think</b> <b>there's still a long way to go here</b> <b>and then</b> <b>but I think</b> <b>watching robots online</b> <b>watching robots on the Spring Festival Gala</b>

<b>versus in private</b> <b>talking to researchers in the robotics industry</b> <b>the feelings are very different</b> <b>how so?</b>

<b>the latter are willing to tell me the truth</b> <b>oh</b> <b>that doesn't mean</b> <b>they're normally being dishonest</b> <b>just that the latter are more willing to tell me</b> <b>exactly where the shortcomings of current systems lie</b> <b>and why something that sounds workable</b> <b>still can't be solved by existing models</b> <b>so we just talked about</b> <b>your decade-plus long research journey</b> <b>how did you make the jump to world models?</b>

<b>mm-hmm</b> <b>I think there wasn't really a jump</b> <b>as I've been saying throughout</b> <b>I think</b> <b>what I call representation learning</b> <b>representation learning</b> <b>world models and the entire development of AI</b> <b>is actually a fairly smooth transition</b> <b>and</b> <b>I'm actually not a big fan of the term world model</b> <b>as a label</b> <b>I think it sounds a bit hyped</b>

<b>and now it's become a kind of</b> <b>catch-all term for everything</b> <b>and everyone is claiming they're doing world models</b> <b>on one hand</b> <b>I don't think this is exactly a process</b> <b>a researcher</b> <b>would enjoy</b> <b>but on the other hand</b> <b>I think a field moving forward</b>

<b>may still need some of these</b> <b>buzzwords</b> <b>and I think if I had to name something</b> <b>I might appreciate one thing</b> <b>about the world model</b> <b>about the so-called World Model</b> <b>and that is this</b> <b>this comes from Jitendra Malik, a professor at Berkeley</b> <b>he said</b> <b>the one thing he likes about World Model</b> <b>is that it lets him tell people</b> <b>I'm doing a World Model</b> <b>not a Word Model</b>

<b>word as in W-O-R-D</b> <b>right — I'm doing a world model</b> <b>not a word model</b> <b>and a word model is an LLM</b> <b>I quite agree with that</b> <b>so as I keep repeating, I think</b> <b>world models</b> <b>are a destination that everyone will eventually reach</b> <b>it's a goal</b> <b>right</b> <b>mm-hmm actually</b>

<b>as you started pursuing world models</b> <b>you also made a very major decision</b> <b>which is</b> <b>to start a company — a very big</b> <b>very different choice from your previous research career</b> <b>why did you make this choice</b> <b>and how did it come about?</b>

<b>oh</b> <b>this decision was also something of a metaphysical one</b> <b>people might think I'm being too mystical about this</b> <b>but it really was</b> <b>because I already had many friends in the Bay Area</b> <b>some</b> <b>mentors who've been very helpful to me</b> <b>some of them investors</b> <b>or other entrepreneurs</b> <b>and they said</b> <b>Saining, you should also try starting a company</b> <b>mm-hmm</b>

<b>because at the university</b> <b>as I was saying earlier</b> <b>resources are scarce</b> <b>right, but that doesn't mean university is worthless</b> <b>I think</b> <b>university is actually a very good platform</b> <b>it gives me enough space</b> <b>to truly find what I want to do</b> <b>but I suddenly felt</b> <b>that now seems like a moment</b> <b>where</b> <b>what I want to explore</b> <b>has been explored to a certain extent</b>

<b>and going further might fall into</b> <b>what I call the medium paper trap</b> <b>[laughs] like the middle income trap</b> <b>meaning you'd publish decent papers</b> <b>but because of resource constraints</b> <b>you can't truly turn your</b> <b>your ideas into</b> <b>what might be a new breakthrough in some sense</b>

<b>right, so I thought</b> <b>this might be a good moment</b> <b>and then a manager asked me something</b> <b>at quite an interesting moment</b> <b>probably around year-end of '25</b> <b>or maybe it was in the fall</b> <b>mm-hmm right</b> <b>and he said</b> <b>go ask Yann LeCun</b> <b>he doesn't seem very happy at Meta lately</b>

<b>but at that time it wasn't actually that turbulent yet</b> <b>Alexander Wang hadn't come yet (Scale AI founder, joined Meta as Chief AI Officer)</b> <b>and like the layoffs at FAIR</b> <b>and</b> <b>all that turbulence</b> <b>my first instinct was</b> <b>oh, how could that be?</b>

<b>right Yann right?</b>

<b>we can later</b> <b>talk more about</b> <b>what kind of person Yann is</b> <b>but at least at that time</b> <b>I would have thought he's still</b> <b>the godfather of AI, right?</b>

<b>and</b> <b>he</b> <b>is a pure researcher</b> <b>how could he be pulled into a startup?</b>

<b>and then we had this conversation</b> <b>the Monday two weeks after that</b> <b>I happened to have a one-on-one meeting</b> <b>with Yann LeCun</b> <b>yeah</b> <b>and before I could say anything</b> <b>Yann said to me, hey</b> <b>Saining, don't tell anyone yet</b> <b>but I've already decided</b> <b>what I want to do now</b> <b>should be done outside</b>

<b>I want to start and build a company</b> <b>and then I asked him</b> <b>what do you want to do?</b>

<b>what's the business model behind this?</b>

<b>mm-hmm</b> <b>and then I realized wow</b> <b>this is completely aligned with what I'd imagined</b> <b>mm-hmm, very interesting</b> <b>right, and what is this thing?</b>

<b>I think you can call it world models</b> <b>or the logic behind this is</b> <b>I think the thing I want to do</b> <b>can't currently be done</b> <b>anywhere in the world</b> <b>including in the Bay Area</b> <b>it can't be done in Silicon Valley either</b> <b>so what is this thing?</b>

<b>that is to say</b> <b>you still need a certain degree of research depth</b> <b>right?</b>

<b>it's not simply saying, hey</b> <b>we now have a Large Language Model</b> <b>we want to deploy this system</b> <b>and push to product</b> <b>and then</b> <b>go get some revenue</b> <b>it's actually not like that</b> <b>right?</b> <b>and I think</b>

<b>this has a strong research-oriented</b> <b>inclination</b> <b>mm-hmm right?</b>

<b>but it's also not in a purely academic</b> <b>setting</b> <b>it's not the old traditional FAIR</b> <b>and it's not NYU either</b> <b>it's not a university</b> <b>but on the other hand</b> <b>it's also not like the Bay Area's</b> <b>big tech companies and the many neo labs now</b> <b>operating in a completely closed manner</b> <b>what does closed mean?</b>

<b>closed means</b> <b>you don't open source</b> <b>you can't publish papers</b> <b>and like the blog I mentioned</b> <b>mm-hmm</b> <b>you can't put your name on it</b> <b>and</b> <b>like when I was actually at Google</b> <b>at GTM</b> <b>I was in GenAI</b> <b>and I was the only one there</b> <b>who had, in a sense, a foot in both worlds</b> <b>a double affiliation</b> <b>still doing things at the university</b> <b>people there actually have</b>

<b>some resistance to academia</b> <b>to this kind of purely exploratory research</b> <b>that's the Bay Area's</b> <b>current state</b> <b>right</b> <b>resistance</b> <b>how do you understand that?</b>

<b>who's resisting?</b>

<b>resistance means</b> <b>first, I think people look down on</b> <b>the work academia is doing</b> <b>they don't think academia's work can truly</b> <b>ah</b> <b>generate any kind of impact</b> <b>second</b> <b>because they also don't publish</b> <b>a lot of things you don't know what they're doing</b>

<b>right? even within these big companies</b>

<b>actually some large companies</b> <b>have research departments</b> <b>and more product-oriented departments</b> <b>but even between these two departments in the same company</b> <b>there's still a big divide</b> <b>because</b> <b>again, the departments doing</b> <b>say core model</b> <b>training at these companies</b> <b>need to be in this highly competitive</b>

<b>race</b> <b>mm-hmm</b> <b>at the very front</b> <b>that's their only goal</b> <b>it's an arms race</b> <b>mm-hmm</b> <b>and</b> <b>this squeezes out your research space</b> <b>mm-hmm</b> <b>it sucks away the oxygen</b> <b>in that environment</b> <b>the oxygen that gives you sufficient freedom to do research</b>

<b>mm-hmm, so you never considered joining any lab</b> <b>you couldn't stand that suffocating feeling</b> <b>yes</b> <b>I think this is also a very interesting phenomenon</b> <b>the phenomenon being</b> <b>there were indeed some opportunities back then</b> <b>and I was considering other options too</b> <b>but after thinking about it</b> <b>I felt that maybe</b> <b>if you really want to do</b> <b>truly cutting-edge exploration</b> <b>if you want to define the problems</b>

<b>you probably have to do it at your own startup</b> <b>for that to work</b> <b>mm-hmm, someone else's startup</b> <b>means they define the problems</b> <b>and you come to execute</b> <b>that's other startups</b> <b>well first of all</b> <b>I don't think among all these other startups</b> <b>there's any single startup</b> <b>or any big company</b> <b>that's focused on what we're doing</b> <b>what is called building the predictive brain</b> <b>right?</b>

<b>working at what you might call the most foundational layer</b> <b>or the most upstream layer</b> <b>doing things there</b> <b>that simply doesn't exist</b> <b>even more interesting is</b> <b>actually many of my friends</b> <b>when I talk with them</b> <b>everyone realizes</b> <b>this is actually necessary</b> <b>as I just said</b> <b>this thing</b> <b>on one hand is somewhat of a</b> <b>counter-consensus view</b> <b>right, a contrarian view</b> <b>but on the other hand</b>

<b>over the past year</b> <b>it has gradually become a consensus</b> <b>so what I'm saying isn't all that new</b> <b>nothing particularly new</b> <b>mm-hmm</b> <b>but I briefly mentioned</b> <b>I think in the entire AI industry right now</b> <b>there's this enormous AI</b> <b>this kind of</b> <b>value chain</b> <b>at the very top of this value chain as I just said</b> <b>there's Bitter Lesson</b>

<b>there's a narrative of AGI and LLM</b> <b>this has defined a series of benchmarks</b> <b>mm-hmm</b> <b>right, so you compete on leaderboards</b> <b>mm-hmm mm-hmm</b> <b>and you just compete</b> <b>the leaderboard might be LLM</b> <b>Arena or other leaderboards</b> <b>right, there are</b> <b>a series of benchmarks</b> <b>these benchmarks define resource allocation</b> <b>meaning</b> <b>how you allocate resources</b> <b>mm-hmm</b>

<b>right, because my goal</b> <b>if it's to be number one on the leaderboard</b> <b>then I can only pour in the most resources</b> <b>to be able to compete at that level</b> <b>and then resource allocation</b> <b>actually means this</b> <b>has already drifted somewhat from what researchers think is right</b> <b>or wrong</b> <b>although some</b> <b>very strong researchers know</b> <b>we may need to do some research</b>

<b>but under this value chain</b> <b>resource allocation means</b> <b>they can't do this part of the research</b> <b>so for example I think</b> <b>hmm</b> <b>video</b> <b>understanding is actually quite important</b> <b>but now it seems neither academia</b> <b>nor industry</b> <b>is doing much of it</b> <b>or people are doing it but not with a fundamental</b> <b>World Model angle to approach this problem</b> <b>to solve this problem</b>

<b>but why is that?</b>

<b>but this is a very interesting phenomenon</b> <b>you'll see</b> <b>it's not that no one is willing to do it</b> <b>it's not that no one has the ability to do it</b> <b>mm-hmm</b> <b>it's that all of them, without exception</b> <b>regardless of which company</b> <b>have been assigned to a video generation model</b> <b>team</b> <b>mm-hmm</b> <b>because this is the only</b> <b>position within this value chain</b>

<b>through which they can indirectly</b> <b>participate in this value chain</b> <b>even though they all know</b> <b>we haven't solved this problem</b> <b>we need a better</b> <b>as I just said</b> <b>a World Model-</b> <b>based video understanding model</b> <b>and this</b> <b>might be an important prerequisite</b> <b>for actually training that World Model</b> <b>but people won't have space to do</b> <b>such exploration</b> <b>mm-hmm</b>

<b>so back when I was at Google</b> <b>I had that frustration too</b> <b>including when we did the RAE paper</b> <b>on this paper</b> <b>with the student Boyang Zheng</b> <b>we probably spent almost a year</b> <b>partly because this student</b> <b>had some health issues in between</b> <b>anyway</b> <b>there might have been some gaps in there</b> <b>right?</b>

<b>anyway, to finish this work</b> <b>it took us a year</b> <b>mm-hmm</b> <b>when we published this work</b> <b>I was actually a bit worried</b> <b>I thought hmm</b> <b>would there be some Google researcher</b> <b>coming to me saying</b> <b>why did you publish a paper</b> <b>we're doing the same thing</b> <b>you've exposed our secrets</b> <b>mm-hmm</b> <b>turns out yes</b> <b>oh</b> <b>several researchers came to me</b>

<b>and their feedback was</b> <b>I think this is right</b> <b>I worked on this for two weeks</b> <b>but my manager said</b> <b>you can't do this anymore</b> <b>we have product cycle one coming up</b> <b>product cycle two</b> <b>product cycle three, right?</b>

<b>these</b> <b>product launch timelines</b> <b>need to be completed</b> <b>their motivation is different</b> <b>so it all comes back to</b> <b>I think we need to return to</b> <b>what we discussed at the beginning</b> <b>in this kind of finite game</b> <b>in this highly competitive environment</b> <b>every company</b> <b>seems to have lost its ability to define problems</b> <b>for example</b>

<b>you see that before, like OpenAI, right?</b>

<b>they actually had that ability</b> <b>mm-hmm</b> <b>many of these problems were defined by them</b> <b>right?</b>

<b>including GPT</b> <b>including models like CLIP</b> <b>right?</b> <b>or say</b>

<b>from their very first day</b> <b>as a research unit</b> <b>they had this kind of problem-defining capability</b> <b>mm-hmm</b> <b>right?</b>

<b>but now</b> <b>it seems like even OpenAI</b> <b>to some extent</b> <b>is being swept into this race</b> <b>mm-hmm, of course they were once the ones who defined the race</b> <b>now they're the ones being competed against</b> <b>mm-hmm</b> <b>so I think the AI industry right now</b> <b>needs new problem-definers</b> <b>and Yann has this conviction</b> <b>that the current path</b> <b>mm-hmm</b> <b>cannot lead to true intelligence</b> <b>right?</b>

<b>so someone needs to define new problems</b> <b>on this larger scale</b> <b>I think Yann and I share a lot of common ground</b> <b>on this matter</b> <b>mm-hmm, so you found a kindred spirit</b> <b>yeah, that's a better way to put it</b> <b>mm-hmm</b> <b>so then you started the company</b> <b>right?</b>

<b>then</b> <b>you mentioned Yann</b> <b>let me ask you</b> <b>what kind of person is Yann?</b>

<b>what's it like working with Yann?</b>

<b>mm-hmm</b> <b>Yann is</b> <b>a very unique person</b> <b>mm-hmm</b> <b>I'll start with a few of his characteristics</b> <b>mm-hmm</b> <b>he's very principled</b> <b>mm-hmm</b> <b>and I think his principles are</b> <b>very rooted in his deep understanding of the problem itself</b> <b>mm-hmm</b> <b>which is why he</b> <b>when he says something is right</b>

<b>I think he truly believes in what he says</b> <b>mm-hmm</b> <b>and won't be swayed by other people's opinions</b> <b>mm-hmm</b> <b>and I think this quality</b> <b>in the current research environment</b> <b>is actually very rare</b> <b>mm-hmm</b> <b>because most people</b> <b>well first of all researchers are human beings</b> <b>mm-hmm</b> <b>they also need to consider their career</b> <b>their citations</b> <b>right, their impact factor</b>

<b>mm-hmm</b> <b>and follow the trend</b> <b>when everyone else is doing LLMs</b> <b>I should also publish some papers on LLMs</b> <b>mm-hmm</b> <b>but Yann clearly hasn't done this</b> <b>mm-hmm</b> <b>right?</b>

<b>and for me</b> <b>I feel like I also</b> <b>belong to this type of person</b> <b>mm-hmm</b> <b>second</b> <b>I think Yann is</b> <b>from my observations</b> <b>a very good leader</b> <b>mm-hmm</b> <b>right, how so?</b>


<b>Yann's leadership style is</b> <b>he actually doesn't</b> <b>manage people much</b> <b>mm-hmm</b> <b>mm-hmm</b> <b>and Yann's approach to leading</b> <b>is through his vision</b> <b>mm-hmm</b> <b>and through what he stands for</b> <b>and all the values that he represents</b> <b>mm-hmm</b> <b>to attract people to join him</b>

<b>mm-hmm</b> <b>and then</b> <b>he'll also give you a lot of freedom</b> <b>mm-hmm</b> <b>he's very empowering</b> <b>mm-hmm, that's great</b> <b>right?</b>

<b>and I think this is a style that works best for me</b> <b>because I also don't want to be managed very much</b> <b>mm-hmm</b> <b>mm-hmm, so you two get along really well</b> <b>mm-hmm</b> <b>yeah, I think we complement each other</b> <b>mm-hmm</b> <b>because I think Yann</b> <b>is more of a visionary</b> <b>mm-hmm</b> <b>and I'm more</b> <b>sort of more grounded</b> <b>someone who can actually execute</b> <b>mm-hmm</b>

<b>good at figuring out</b> <b>given Yann's direction</b> <b>what should we specifically do</b> <b>mm-hmm</b> <b>so I think this pairing</b> <b>is interesting</b> <b>mm-hmm</b> <b>yeah, I feel like Yann also</b> <b>has this kind of</b> <b>very outspoken</b> <b>internet celebrity vibe</b> <b>[laughs] [laughs]</b> <b>very outspoken person</b> <b>right?</b>

<b>and you're relatively more low-key?</b>

<b>mm-hmm mm-hmm</b> <b>yeah, I think that's relatively true</b> <b>mm-hmm</b> <b>I like</b> <b>speaking through work</b> <b>mm-hmm</b> <b>okay, so then you co-founded this company together</b> <b>mm-hmm</b> <b>and then you're in New York</b> <b>right?</b>

<b>let's talk about New York</b> <b>mm-hmm</b> <b>why not Silicon Valley?</b>

<b>ah, this question</b> <b>this is indeed a question a lot of people are very</b> <b>curious about</b> <b>right?</b>

<b>uh</b> <b>I think</b> <b>first of all</b> <b>honestly</b> <b>I'm a New York person myself</b> <b>I've been at NYU for many years</b> <b>mm-hmm</b> <b>and Yann has been at NYU even longer than me</b> <b>right?</b>

<b>and the feeling of New York, honestly</b> <b>is very different from San Francisco</b> <b>mm-hmm</b> <b>I've been to San Francisco many times</b> <b>and I've lived in the Bay Area</b> <b>mm-hmm</b> <b>but the Bay Area atmosphere</b> <b>is really</b> <b>a pure tech bubble</b> <b>mm-hmm</b> <b>but you know what</b> <b>it's not necessarily a bad thing</b>

<b>mm-hmm</b> <b>in that bubble</b> <b>everyone can be very focused on doing one thing</b> <b>mm-hmm</b> <b>so the entire Bay Area culture is</b> <b>just about building companies, right?</b>

<b>mm-hmm</b> <b>and New York is</b> <b>I think, a more</b> <b>real world</b> <b>mm-hmm</b> <b>this real world in New York</b> <b>has given me many inspirations</b> <b>right?</b>

<b>and then</b> <b>many of the ideas around the product</b> <b>especially the kind of embodied AI products</b> <b>or world model products</b> <b>I've imagined</b> <b>actually come from life in New York</b> <b>mm-hmm</b> <b>right?</b>

<b>and then also</b> <b>in terms of recruiting</b> <b>I think many people in New York</b> <b>have a stronger desire to</b> <b>do something more fundamental</b> <b>mm-hmm</b> <b>right, because the Bay Area is</b> <b>actually quite saturated now</b> <b>yes</b> <b>in terms of talent</b> <b>it is saturated</b> <b>but in terms of culture</b> <b>everyone is doing product, product, product</b> <b>mm-hmm</b>

<b>right?</b>

<b>so I also feel</b> <b>that for what I'm doing</b> <b>New York might be</b> <b>a better fit</b> <b>mm-hmm</b> <b>mm-hmm yeah</b> <b>right, as we talked about earlier</b> <b>there are actually many AI startups</b> <b>in New York</b> <b>and there's quite a vibrant</b> <b>AI scene</b> <b>in New York</b> <b>right?</b>

<b>but New York still doesn't have an</b> <b>absolutely top-tier</b> <b>AI company</b> <b>like OpenAI-level</b> <b>right?</b>

<b>I think that</b> <b>is also an opportunity</b> <b>mm-hmm</b> <b>right, Hugging Face is in New York</b> <b>mm-hmm</b> <b>mm-hmm, well Hugging Face is headquartered in New York</b> <b>but their team might be quite distributed</b> <b>but their HQ is New York</b> <b>so I think this is</b> <b>a very interesting trend</b> <b>mm-hmm</b> <b>okay, so then let's talk about</b> <b>the current state of the company</b>

<b>how many people do you have?</b>

<b>how's it going so far?</b>

<b>mm-hmm</b> <b>right, so we're still very early</b> <b>the company is only about</b> <b>six months old or so</b> <b>mm-hmm</b> <b>and we currently have</b> <b>about 15 people</b> <b>mm-hmm</b> <b>the team is</b> <b>very very strong</b> <b>how big will your pre-training dataset be?</b>

<b>ah, these things</b> <b>that's the research part</b> <b>right</b> <b>we actually now have a very good roadmap</b> <b>and we've also hired many many people</b> <b>everyone actually cares a lot about</b> <b>how to make something land in reality</b> <b>not just simply doing research</b> <b>although research is very very important</b> <b>and now</b> <b>if we want to achieve</b> <b>the goal of a truly good world model</b> <b>how much compute does it need?</b>

<b>mm-hmm</b> <b>I think compute is definitely needed</b> <b>but as I was saying earlier</b> <b>I think the compute efficiency will be</b> <b>very very different</b> <b>mm-hmm</b> <b>so the amount of compute</b> <b>might not be comparable to</b> <b>training a frontier LLM</b> <b>mm-hmm</b> <b>but</b> <b>one thing I think is very important</b> <b>is the structure of how we use compute</b>

<b>mm-hmm</b> <b>right, there are many ways to use compute</b> <b>for example</b> <b>you can use compute to train language</b> <b>or use compute to train video</b> <b>mm-hmm</b> <b>or you could train both simultaneously</b> <b>mm-hmm</b> <b>I think for our approach</b> <b>the distribution of compute might be</b> <b>very different</b> <b>mm-hmm</b> <b>um</b> <b>a larger portion</b> <b>might be used on video</b>

<b>mm-hmm</b> <b>but not just</b> <b>the purely prediction-based</b> <b>kind of training target, right?</b>

<b>this approach</b> <b>mm-hmm</b> <b>but a combination of generative and discriminative</b> <b>methods</b> <b>and then</b> <b>with a combination of language too</b> <b>right?</b>

<b>mm-hmm</b> <b>so I think</b> <b>the goal is</b> <b>through the least amount of compute possible</b> <b>to train the best world model</b> <b>mm-hmm</b> <b>right?</b>

<b>and then in doing so</b> <b>you also need to be able to</b> <b>make a product</b> <b>mm-hmm</b> <b>right?</b>
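The training mix described above — mostly video, a combination of generative and discriminative objectives, plus some language — can be illustrated as a weighted multi-objective loss. This is purely an illustrative sketch; the weights, names, and loss values below are made-up placeholders, not AMI Labs' actual recipe.

```python
# Hedged sketch of combining multiple training objectives under one
# compute/weighting budget, as described in the conversation.
# All names and numbers here are illustrative assumptions.

def total_loss(losses: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-objective losses."""
    return sum(weights[k] * losses[k] for k in losses)

# Hypothetical allocation: the larger portion goes to video,
# split across generative and discriminative targets.
weights = {
    "video_generative": 0.5,      # e.g. latent-space prediction
    "video_discriminative": 0.3,  # e.g. contrastive / classification
    "language": 0.2,
}

losses = {"video_generative": 1.2, "video_discriminative": 0.8, "language": 0.5}
print(round(total_loss(losses, weights), 4))  # 0.94
```

The point is only that compute is allocated by objective, not spent uniformly; a real system would schedule batches and model capacity, not just scale loss terms.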

<b>so it'll be a long journey</b> <b>mm-hmm</b> <b>but I think the path</b> <b>is relatively clear to me</b> <b>mm-hmm</b> <b>yeah</b> <b>right well</b> <b>you did also mention Yann</b> <b>right, earlier you mentioned</b> <b>that before you started the company</b> <b>you were at NYU as a professor</b> <b>and also had a collaboration with Google</b>

<b>right?</b>

<b>you were in quite a good position</b> <b>mm-hmm</b> <b>and then you made a decision</b> <b>to step out and do this</b> <b>mm-hmm</b> <b>what was the tipping point?</b>

<b>or the final straw</b> <b>that made you decide</b> <b>okay, I'm going to do this</b> <b>mm-hmm</b> <b>I think it's a combination of many things</b> <b>but I think</b> <b>the biggest factor was</b> <b>the conversation with Yann, as I mentioned</b> <b>mm-hmm</b> <b>because I had never considered</b> <b>that Yann would want to do this</b> <b>right?</b>

<b>mm-hmm</b> <b>and once Yann decided he wanted to do this</b> <b>mm-hmm</b> <b>the whole thing became a lot more</b> <b>compelling</b> <b>mm-hmm</b> <b>because I think with Yann</b> <b>doing this kind of thing</b> <b>is much more legitimate</b> <b>right, meaning it's not just</b> <b>two or three young researchers</b> <b>thinking they can change the world</b> <b>right?</b>

<b>right, and Yann has the experience</b> <b>the vision</b> <b>and the prestige</b> <b>mm-hmm</b> <b>to attract talent</b> <b>attract investment</b> <b>right?</b>

<b>so I think this is</b> <b>when I found out about this</b> <b>I basically decided immediately</b> <b>mm-hmm</b> <b>without even thinking about it much</b> <b>right?</b>

<b>I think this kind of</b> <b>opportunity</b> <b>is once in a lifetime</b> <b>right?</b>

<b>mm-hmm, and also</b> <b>I've always said</b> <b>I actually really like Yann</b> <b>mm-hmm</b> <b>right?</b>

<b>and I feel like having the chance to work closely</b> <b>with someone like Yann</b> <b>is something very rare</b> <b>mm-hmm</b> <b>mm-hmm, so that's also why</b> <b>you didn't hesitate</b> <b>mm-hmm</b> <b>yeah</b> <b>alright, so last question</b> <b>mm-hmm</b> <b>if you had to send a message</b> <b>to the Chinese AI research community</b>

<b>or students who are interested in AI research</b> <b>right?</b>

<b>what would you want to say to them?</b>

<b>hmm</b> <b>I think</b> <b>there are a few things I want to say</b> <b>mm-hmm</b> <b>the first thing</b> <b>is about attitude</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>keep thinking for themselves</b> <b>mm-hmm</b> <b>don't be swayed by trends</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>think</b> <b>about what they really want to do</b> <b>mm-hmm</b> <b>and why they want to do it</b>

<b>right?</b>

<b>because I see many people</b> <b>in AI research</b> <b>and many people are doing it</b> <b>but actually</b> <b>sometimes it's a bit</b> <b>following the crowd</b> <b>mm-hmm</b> <b>right, because it seems like this field is hot</b> <b>mm-hmm</b> <b>so let me get into it</b> <b>mm-hmm</b>

<b>but actually the more important thing is</b> <b>you yourself</b> <b>have a genuine passion for</b> <b>this kind of creative work</b> <b>mm-hmm</b> <b>you genuinely want to figure out</b> <b>the essence of intelligence</b> <b>right?</b>

<b>mm-hmm</b> <b>if you just see this as a career path</b> <b>that's also fine</b> <b>right, if you just want a good job</b> <b>mm-hmm</b> <b>but I think for researchers</b> <b>or people who really want to push the frontier</b> <b>right?</b>

<b>mm-hmm</b> <b>I think this genuine love for the work</b> <b>is really important</b> <b>mm-hmm</b> <b>the second thing</b> <b>is about approach</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>think about problems</b> <b>more deeply</b> <b>mm-hmm</b> <b>right?</b>

<b>I think</b> <b>a lot of current AI research</b> <b>is quite shallow</b> <b>mm-hmm</b> <b>meaning</b> <b>a lot of it is</b> <b>just following what others are doing</b> <b>mm-hmm</b> <b>right?</b>

<b>people follow trends</b> <b>mm-hmm</b> <b>but the most interesting things</b> <b>come from people who ask</b> <b>why?</b>

<b>mm-hmm</b> <b>why does this work?</b>

<b>mm-hmm</b> <b>why doesn't that work?</b>

<b>mm-hmm</b> <b>what is the essence here?</b>

<b>mm-hmm</b> <b>and I think</b> <b>this kind of</b> <b>thinking deeply about a problem</b> <b>is a quality that's becoming rarer</b> <b>mm-hmm</b> <b>so I hope people can cultivate this quality</b> <b>mm-hmm</b> <b>and the third thing</b> <b>is about community</b> <b>mm-hmm</b> <b>I hope everyone can</b> <b>be more open</b>

<b>to collaboration</b> <b>right?</b>

<b>mm-hmm</b> <b>I think one of the beauties of the AI field is</b> <b>it's a very open field</b> <b>mm-hmm</b> <b>right, many papers are open</b> <b>a lot of code is open</b> <b>mm-hmm</b> <b>right?</b>

<b>and this openness</b> <b>has driven a lot of progress</b> <b>mm-hmm</b> <b>I hope this spirit can be maintained</b> <b>mm-hmm</b> <b>yeah, thank you Saining</b> <b>mm-hmm</b> <b>this has been a very good conversation</b> <b>thank you</b> <b>thank you</b> <b>mm-hmm</b> <b>okay so now</b> <b>let me introduce</b>

<b>the next guest</b> <b>mm-hmm</b> <b>this next guest</b> <b>is also a very</b> <b>very special person</b> <b>mm-hmm</b> <b>he is</b> <b>a PhD student</b> <b>currently at NYU</b> <b>mm-hmm</b> <b>but he's not your ordinary PhD student</b> <b>mm-hmm</b> <b>he's also an entrepreneur</b> <b>mm-hmm</b>

<b>and then</b> <b>we just learned</b> <b>mm-hmm</b> <b>that he's also</b> <b>Forbes 30 Under 30</b> <b>wow</b> <b>yes</b> <b>this is very impressive</b> <b>mm-hmm</b> <b>let's welcome</b> <b>mm-hmm</b> <b>Zhiyuan Zeng (Tommy)</b> <b>mm-hmm</b> <b>hi everyone</b> <b>hi</b> <b>hello</b>

<b>mm-hmm</b> <b>alright Tommy</b> <b>why don't you first</b> <b>introduce yourself</b> <b>mm-hmm</b> <b>sure, hi everyone</b> <b>I'm Tommy</b> <b>currently I'm a PhD student at NYU</b> <b>and my research direction is</b> <b>AI agents</b>

<b>mm-hmm</b> <b>and at the same time</b> <b>I'm also the co-founder and CTO of a company</b> <b>called Simular AI</b> <b>mm-hmm</b> <b>and the direction of this company is also AI agents</b> <b>mm-hmm</b> <b>specifically</b> <b>we are building a desktop AI agent</b> <b>mm-hmm</b> <b>the product is called S2</b> <b>mm-hmm</b> <b>cool, desktop AI agent</b> <b>right?</b>

<b>does it work on a computer?</b>

<b>mm-hmm</b> <b>yes, it works on a computer</b> <b>mm-hmm</b> <b>then I want to ask you</b> <b>what exactly does it do?</b>

<b>mm-hmm</b> <b>right, so this thing basically</b> <b>can do everything you can do on a computer</b> <b>mm-hmm</b> <b>for example</b> <b>browsing the web</b> <b>mm-hmm</b> <b>writing code</b> <b>mm-hmm</b> <b>managing files</b> <b>mm-hmm</b> <b>using various applications</b> <b>mm-hmm</b> <b>right, using various software</b> <b>mm-hmm</b> <b>right?</b>

<b>mm-hmm</b> <b>so it can help you do tasks on the computer</b> <b>mm-hmm</b> <b>so it's more like</b> <b>a full automation of</b> <b>computer tasks</b> <b>mm-hmm</b> <b>yes, it's a computer automation tool</b> <b>right?</b>

<b>and it can</b> <b>handle more complex tasks</b> <b>mm-hmm</b> <b>right, like what?</b>

<b>for example</b> <b>say I need to</b> <b>book a flight</b> <b>mm-hmm</b> <b>but this booking involves</b> <b>multiple steps</b> <b>mm-hmm</b> <b>like opening a browser</b> <b>going to a website</b> <b>searching for flights</b>

<b>comparing prices</b> <b>mm-hmm</b> <b>and then ultimately booking it</b> <b>right?</b>

<b>mm-hmm</b> <b>all of these steps</b> <b>S2 can automatically complete for you</b> <b>mm-hmm</b> <b>so you just tell it what you want</b> <b>and then it does it for you</b> <b>right?</b>

<b>mm-hmm</b> <b>yes</b> <b>mm-hmm, that's pretty amazing</b> <b>mm-hmm</b> <b>right?</b>
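The multi-step flow described here — plan a task, then execute each step against the computer — can be sketched as a minimal agent loop. This is a toy illustration under stated assumptions: the function names (`plan`, `execute`, `run_agent`) and the canned step list are hypothetical, not Simular's actual S2 API, and a real planner would call a model instead of returning a fixed plan.

```python
# Hedged sketch of a plan-then-execute desktop-agent loop.
# Names and structure are illustrative assumptions, not S2's real code.

def plan(task: str) -> list[str]:
    """A real planner would query a model; here the flight-booking
    example from the conversation is hard-coded."""
    return ["open browser", "search flights", "compare prices", "book"]

def execute(step: str, state: dict) -> dict:
    """Each step would observe the screen and act; stubbed as a log entry."""
    state = dict(state, last_step=step)   # shallow copy with updated step
    state.setdefault("log", []).append(step)
    return state

def run_agent(task: str) -> dict:
    """Run every planned step in order, threading state through."""
    state = {"task": task}
    for step in plan(task):
        state = execute(step, state)
    return state

final = run_agent("book a flight NYC -> SFO")
print(final["log"])  # ['open browser', 'search flights', 'compare prices', 'book']
```

The reliability problem discussed next (pop-ups, UI changes, slow networks) is exactly what this naive loop lacks: a production agent would re-observe and re-plan after each step rather than follow a fixed list.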

<b>then tell me</b> <b>what's the difference between S2</b> <b>and similar products out there?</b>

<b>mm-hmm</b> <b>right, so I think</b> <b>S2's biggest differentiation is</b> <b>mm-hmm</b> <b>reliability</b> <b>mm-hmm</b> <b>right?</b> <b>because right now</b>

<b>many similar products</b> <b>might be able to demo well</b> <b>mm-hmm</b> <b>but in actual use</b> <b>the reliability is not so good</b> <b>mm-hmm</b> <b>right?</b>

<b>because computer tasks</b> <b>are inherently</b> <b>very complex</b> <b>mm-hmm</b> <b>there are many unexpected things that can go wrong</b> <b>mm-hmm</b> <b>right, like pop-up windows</b> <b>mm-hmm</b> <b>or maybe the website</b> <b>has changed its UI</b> <b>mm-hmm</b> <b>or maybe the network is slow</b> <b>mm-hmm</b>

<b>all sorts of situations</b> <b>mm-hmm</b> <b>right?</b>

<b>and S2's solution is</b> <b>we built a</b> <b>proprietary model specifically for computer tasks</b> <b>mm-hmm</b> <b>so that it can</b> <b>handle these complex situations</b> <b>mm-hmm</b> <b>right?</b>

<b>and at the same time</b> <b>we also have</b> <b>a proprietary planning module</b> <b>mm-hmm</b> <b>so that it can</b> <b>plan more efficiently</b> <b>mm-hmm</b> <b>right?</b>

<b>mm-hmm, so it has a self-developed model</b> <b>mm-hmm</b> <b>right?</b>

<b>mm-hmm</b> <b>a proprietary model</b> <b>mm-hmm</b> <b>so to do this you need a lot of data</b> <b>right?</b>

<b>mm-hmm</b> <b>how do you get that data?</b>

<b>mm-hmm</b> <b>right, so data is indeed</b> <b>one of the biggest challenges</b> <b>mm-hmm</b> <b>right?</b>

<b>so our approach is</b> <b>to build a</b> <b>data synthesis pipeline</b> <b>mm-hmm</b> <b>right?</b>

<b>we use AI to generate data</b> <b>mm-hmm</b> <b>right?</b>

<b>and then use this data</b> <b>to train the model</b> <b>mm-hmm</b> <b>right?</b>

<b>mm-hmm, and where does this synthetic data come from?</b>

<b>mm-hmm</b> <b>right, so the synthetic data</b> <b>mainly comes from</b> <b>we have an environment</b> <b>mm-hmm</b> <b>this environment simulates</b> <b>various computer tasks</b> <b>mm-hmm</b> <b>and then we have an AI agent</b> <b>in this environment</b> <b>completing these tasks</b> <b>mm-hmm</b> <b>and recording the process</b>

<b>mm-hmm</b> <b>right?</b>

<b>mm-hmm</b> <b>so this is the source of the data</b> <b>mm-hmm</b> <b>right?</b>

<b>mm-hmm, that's clever</b> <b>mm-hmm</b> <b>right?</b>
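The pipeline just described — an AI agent completing tasks in a simulated environment while the process is recorded — can be sketched in miniature. Everything below is an illustrative assumption: the class and function names (`SimulatedDesktop`, `scripted_agent`, `record_trajectory`) are hypothetical, and a real pipeline would capture screenshots and UI actions, not strings.

```python
# Toy sketch of synthetic-data collection by recording agent
# trajectories in a simulated environment. All names are assumptions.
import random

class SimulatedDesktop:
    """Toy environment: each 'observation' is just a step counter."""
    def __init__(self, horizon: int = 4):
        self.horizon = horizon
        self.t = 0
    def observe(self) -> str:
        return f"screen_state_{self.t}"
    def step(self, action: str) -> bool:
        self.t += 1
        return self.t >= self.horizon  # True when the episode is done

def scripted_agent(obs: str) -> str:
    """Stand-in for the acting agent; picks a random UI action."""
    return random.choice(["click", "type", "scroll"])

def record_trajectory(env: SimulatedDesktop) -> list[tuple[str, str]]:
    """Roll out the agent and log every (observation, action) pair."""
    trajectory = []
    done = False
    while not done:
        obs = env.observe()
        action = scripted_agent(obs)
        trajectory.append((obs, action))
        done = env.step(action)
    return trajectory

data = record_trajectory(SimulatedDesktop())
print(len(data))  # 4 (observation, action) pairs for the default horizon
```

The recorded pairs are then what trains the proprietary model; in practice the recording agent itself may be an existing model, so quality filtering of trajectories matters as much as volume.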

<b>mm-hmm</b> <b>so then</b> <b>tell me</b> <b>who are your target users?</b>

<b>mm-hmm</b> <b>right, our target users</b> <b>are mainly</b> <b>knowledge workers</b> <b>mm-hmm</b> <b>right?</b>

<b>people who spend a lot of time</b> <b>on computers every day</b> <b>mm-hmm</b> <b>for example</b> <b>software engineers</b> <b>mm-hmm</b> <b>data analysts</b> <b>mm-hmm</b> <b>right, product managers</b> <b>mm-hmm</b> <b>designers</b> <b>mm-hmm</b> <b>and so on</b> <b>mm-hmm</b>

<b>right?</b>

<b>but I think</b> <b>trying to accomplish something this different</b> <b>is still quite difficult</b> <b>because as I said</b> <b>I've been emphasizing all along</b> <b>we're actually looking for a kind of balance</b> <b>this balance means</b> <b>it's neither a purely academic research lab</b> <b>nor is it one of today's</b> <b>closed large-model companies</b> <b>Mm-hmm</b> <b>and this balance also means</b>

<b>take me personally, for example</b> <b>it's also a kind of balance</b> <b>it's like</b> <b>I'm neither a very senior</b> <b>already accomplished and established</b> <b>kind of distinguished professor</b> <b>but I'm also not an eighteen or nineteen year old</b> <b>who can just roll up their bedding and head to a factory in Shenzhen</b> <b>[laughter]</b> <b>and set down roots</b> <b>to do data collection</b> <b>or whatever</b> <b>I'm neither of those</b> <b>Mm-hmm</b>

<b>some of the data comes from factories in Shenzhen</b> <b>Yes</b> <b>someone is doing it</b> <b>the example I just mentioned is</b> <b>a specific company</b> <b>they have a company</b> <b>called build.ai</b>

<b>I actually really admire that kid</b> <b>named Eddy</b> <b>he took a few people and dropped out of Columbia</b> <b>then went and lived in a Shenzhen factory</b> <b>Ah</b> <b>and then</b> <b>built a startup like that</b> <b>I think that's so impressive</b> <b>right</b> <b>I think this is both about finding balance</b> <b>but I find it challenging for myself</b> <b>but it's also a new opportunity</b> <b>I think maybe</b> <b>maybe</b> <b>this era</b>

<b>Uh</b> <b>might not belong to the old guard</b> <b>nor to the young guns</b> <b>but rather to a generation of mid-career entrepreneurs</b> <b>You said no to Ilya (SSI founder) twice</b> <b>but said yes to LeCun</b> <b>Why is that?</b>

<b>What kind of person is he in your eyes?</b>

<b>oh right</b> <b>Yann</b> <b>is a fighter online</b> <b>right?</b>

<b>he's firmly opposed to the LLM camp</b> <b>well, it's not just opposing LLMs</b> <b>he actually doesn't oppose LLMs</b> <b>he's never said he opposes LLMs</b> <b>he's very</b> <b>he even says he uses Gemini himself</b> <b>he's completely fine with</b> <b>LLMs</b> <b>he just opposes</b> <b>the narrative that LLMs can lead to human-level</b> <b>intelligence</b> <b>that's the narrative he opposes</b> <b>that's what he pushes back on</b> <b>Mm-hmm</b> <b>he has no objection to LLMs at all</b>

<b>but anyway he's a fighter online</b> <b>constantly engaging in battles</b> <b>but I think</b> <b>privately he's a really wonderful person</b> <b>he's someone I</b> <b>genuinely admire and look up to from the heart</b> <b>Were you close before?</b>

<b>we collaborated on some papers</b> <b>but</b> <b>definitely not like being in a startup together</b> <b>as co-founders</b> <b>like</b> <b>working closely like this</b> <b>we hadn't done that before</b> <b>Are you close with Kaiming?</b>

<b>definitely not</b> <b>mm-hmm right</b> <b>Yes</b> <b>but I think</b> <b>Yann is someone</b> <b>who truly has</b> <b>a reality distortion field</b> <b>I think he's incredibly, incredibly impressive</b> <b>whenever I start to have doubts about something</b> <b>I always want to go have a chat with him</b> <b>he can easily make the people around him</b> <b>at least that's how I feel</b> <b>feel a sense of calm</b> <b>feel like, hey</b>

<b>these challenges aren't really challenges</b> <b>the road ahead is bright</b> <b>yes, he has that ability</b> <b>Mm-hmm</b> <b>and moreover</b> <b>of course</b> <b>his research vision</b> <b>I deeply admire as well</b> <b>admire</b> <b>like many of what I just mentioned</b> <b>such as what a world model is</b> <b>why we need to filter information</b> <b>this is essentially also JEPA</b> <b>the core of the JEPA idea he proposed</b> <b>is that you can't build a general model</b>

<b>you can't memorize everything</b> <b>and reconstruct it all</b> <b>you need to work in an abstract representation space</b> <b>to make predictions in an abstract representation space</b> <b>Mm-hmm</b> <b>that's the core of JEPA</b> <b>but what I want to say is</b> <b>Yann, I think, really practices what he preaches</b> <b>he himself is pretty JEPA as a person</b> <b>he consistently holds fast to many of his</b> <b>so</b> <b>logical principles</b> <b>and the things he believes are right</b> <b>this</b>

<b>is undisturbed by anything external</b> <b>but this doesn't mean</b> <b>he's completely stubborn</b> <b>who won't listen to anyone</b> <b>that's not really the case</b> <b>sometimes he's been wrong</b> <b>sometimes he's been right</b> <b>he's right most of the time</b> <b>but he can actually take in what people say</b> <b>mm-hmm, and he also said</b> <b>there was</b> <b>there was a press piece about how Yann</b>

<b>can't be moved</b> <b>that Yann LeCun can never be moved</b> <b>right, no one can</b> <b>move him</b> <b>Oh</b> <b>meaning he's stubborn, right?</b>

<b>saying he's too stubborn</b> <b>Yann said</b> <b>I can absolutely be moved</b> <b>I can absolutely be moved</b> <b>but I need to be moved based on facts</b> <b>not just because someone tells me what to do</b> <b>and I go do it</b> <b>that's when I'll be moved</b> <b>so back when he was at Meta actually</b> <b>Mm-hmm</b> <b>many people also told him</b>

<b>we at Meta are now going to build Large Language Models</b> <b>we need to do all these things</b> <b>you can't keep saying these things publicly anymore</b> <b>right?</b>

<b>you can't go around</b> <b>constantly dissing Large Language Models as not working</b> <b>Yann couldn't accept this at all</b> <b>Yann said my integrity as a scientist</b> <b>my integrity as a scientist cannot accept this</b> <b>so I think this is something I deeply admire too</b> <b>I think he truly</b> <b>the things he says</b> <b>Mm-hmm</b> <b>aren't because something is now</b> <b>trending</b> <b>and then he goes and says it</b>

<b>everything can be traced back to its origins</b> <b>including his talk about world models</b> <b>he didn't just start talking about it because world models became popular recently</b> <b>it was also</b> <b>something he was already talking about many, many years ago</b> <b>and he also has a really great paper</b> <b>I</b> <b>I genuinely recommend it to everyone around me</b> <b>it's called</b> <b>"A Path Towards Autonomous Machine Intelligence"</b> <b>right</b> <b>it's his position paper</b>

<b>also an opinion paper</b> <b>and at that point you'll find</b> <b>there are many layers to his thinking</b> <b>these layers are presented in a very engineering-oriented</b> <b>and implementable</b> <b>or mathematically expressed form</b> <b>so you see, when people ask him</b> <b>Yann, what exactly is a world model</b> <b>he never</b> <b>says something vague and high-level</b>

<b>something relatively</b> <b>abstract and empty</b> <b>he'll always write out formulas for you</b> <b>Uh</b> <b>he always will</b> <b>still does now</b> <b>still does now</b> <b>and</b> <b>he still spends one day a week at NYU</b> <b>and still leads his own group</b> <b>he still holds group meetings</b> <b>during group meetings</b> <b>he walks up to the whiteboard</b> <b>and walks everyone through the equations</b> <b>step by step</b> <b>Mm-hmm</b> <b>highly technical</b>

<b>very, very technical</b> <b>right</b> <b>What's the division of responsibility between you two?</b>

<b>Yann is executive chairman</b> <b>so</b> <b>he's more like the captain of our big ship</b> <b>I also talked with him</b> <b>about this</b> <b>about who's the captain</b> <b>he's the captain</b> <b>no, I'm not</b> <b>talking about who's formally the captain</b> <b>I don't want to be the captain</b> <b>right, right, right, but he said</b> <b>on one hand he said</b> <b>he really doesn't like</b> <b>managing day-to-day operational matters</b> <b>he's not a good CEO</b>

<b>but on the other hand I feel — you're not either</b> <b>right, I'm probably not either</b> <b>but I also think</b> <b>he's a very wise manager</b> <b>he gave me this example</b> <b>he said</b> <b>his management philosophy is like</b> <b>sailing a boat</b> <b>this</b> <b>by the way, that's one of his hobbies</b> <b>I can talk about it later</b> <b>his other interesting things</b> <b>but he has this hobby</b> <b>he's heading out in March</b> <b>to go sailing in the Caribbean again</b>

<b>he says his management style is</b> <b>giving everyone enough trust</b> <b>to let them do what they're supposed to do</b> <b>but once some turbulence arises</b> <b>right?</b>

<b>once we need to correct something</b> <b>he'll make that adjustment</b> <b>Uh</b> <b>as early as possible</b> <b>right?</b>

<b>but before that</b> <b>trust everyone to do their work</b> <b>that is, believe in everyone</b> <b>to do what they're best at</b> <b>yeah, I think that's Yann's role</b> <b>for this company he's</b> <b>on one hand a kind of spiritual leader</b> <b>but on the other hand also</b> <b>navigating the open sea</b> <b>you need a helmsman</b> <b>he also has this</b>

<b>captain identity</b> <b>right and</b> <b>but I think what I feel about him</b> <b>I think</b> <b>what truly makes me feel</b> <b>I really enjoy working with this person</b> <b>is more personal reasons</b> <b>we've talked a lot</b> <b>these decisions aren't purely logical ones</b> <b>sometimes it still comes down to whether you click</b> <b>Mm-hmm</b> <b>it all comes down to people</b> <b>it all comes down to people</b> <b>right</b>

<b>like Yann, even though he really is a big shot</b> <b>you'll often see him at conferences</b> <b>holding out his phone</b> <b>taking selfies with everyone</b> <b>taking group photos</b> <b>and privately</b> <b>he's also a pretty pure and warm person</b> <b>right</b> <b>and being around him</b> <b>mainly I don't feel any sense of fear</b> <b>even though he's accomplished and distinguished</b> <b>mm-hmm, and then</b>

<b>I won't worry that I said something wrong</b> <b>and upset him</b> <b>I think that's actually quite rare</b> <b>especially given his status and standing</b> <b>to be like that</b> <b>and I can, or rather</b> <b>including everyone in this company</b> <b>can very directly tell him</b> <b>this is how I think about this</b> <b>I think you're right, or I think you're not right</b> <b>but let's discuss together</b>

<b>what way to move forward</b> <b>that would be best</b> <b>for this company</b> <b>I think</b> <b>that's also truly very rare</b> <b>right</b> <b>Tell us about your progress so far</b> <b>in terms of capital</b> <b>and team development</b> <b>of course by the time this is released</b> <b>it'll be after your announcement</b> <b>uh yes</b> <b>right uh</b> <b>I think in terms of capital</b> <b>Uh</b> <b>there's no way around it</b> <b>my world model</b>

<b>isn't sufficient to support making that kind of prediction</b> <b>but our target</b> <b>might be around one billion dollars</b> <b>right</b> <b>if that turns out to be wrong</b> <b>we'll just have to cut it</b> <b>[laughter]</b> <b>[laughter]</b> <b>[laughter]</b> <b>in terms of team composition</b> <b>we'll have many great partners</b> <b>like-minded people joining this company together</b> <b>so we'll start with around 25</b> <b>as an initial team</b>

<b>mm-hmm, and we hope to gradually grow the team</b> <b>we don't want to go too fast</b> <b>but not too slow either</b> <b>and in this there's actually so much</b> <b>I think</b> <b>I think that's part of the magic of building a startup</b> <b>because before, at big companies</b> <b>I would also, uh</b> <b>refer some friends from the past</b> <b>my students</b> <b>to join the company together</b> <b>but it was never really a unified thing</b>

<b>everyone went to different teams and did their own thing</b> <b>but</b> <b>but after starting a company</b> <b>I find</b> <b>you can truly bring everyone together</b> <b>Oh</b> <b>and find a shared mission like this</b> <b>Mm-hmm</b> <b>I think that's just so fascinating</b> <b>Mm-hmm</b> <b>and honestly I'm very moved by this myself</b> <b>because we have several friends</b>

<b>who actually have tens of millions of dollars in</b> <b>unvested OpenAI stock</b> <b>if they were leaving OpenAI</b> <b>and also, say, at Google</b> <b>there are also several like this</b> <b>Uh</b> <b>not at Google</b> <b>at Meta</b> <b>there are also those 15 to 20 million dollar</b> <b>offers like that</b> <b>and everyone just, without even thinking</b>

<b>gave it all up</b> <b>to join us</b> <b>Why?</b>

<b>I think</b> <b>maybe we're all just a little crazy</b> <b>[laughs]</b> <b>it seems like</b> <b>the thing is, you need to</b> <b>consider, on one side is research</b> <b>and on the other side is financial outcome</b> <b>right, of course</b> <b>I think if a startup ultimately succeeds</b> <b>the upside can be very significant</b> <b>mm-hmm financially</b> <b>at least for now</b>

<b>I think most people are still very mission driven</b> <b>right and everyone still believes</b> <b>this is the only place</b> <b>where we can do this</b> <b>Have you already started</b> <b>thinking about business models?</b>

<b>Uh</b> <b>I think the reason for raising this much money</b> <b>might be partly to reduce some of that pressure</b> <b>but of course</b> <b>this is a serious company</b> <b>so our CEO</b> <b>and COO spend a lot of energy every day thinking about</b> <b>business model matters</b> <b>Mm-hmm</b> <b>right and, oh</b> <b>can I go back and talk about Yann again?</b>

<b>Sure!</b>

<b>oh right</b> <b>we'll see how to adjust it later</b> <b>but</b> <b>I think what I just said</b> <b>this thing about having a compatible spirit</b> <b>is really not a commercial decision at all</b> <b>right, and then I think</b> <b>mm-hmm, consistent with your mystical style of decision-making</b> <b>ah, of course</b> <b>of course the consideration is</b> <b>for example</b> <b>at the same time I would have had other opportunities too</b> <b>those opportunities</b> <b>might also have had much better</b> <b>short-term financial</b>

<b>returns</b> <b>Mm-hmm</b> <b>higher salary, higher returns</b> <b>but the way I've always thought about it is</b> <b>some people advised me</b> <b>go make money for two years first</b> <b>once you've made enough, come back and start a company — isn't that better?</b>

<b>Mm-hmm</b> <b>I partly agree, but I also worry</b> <b>right, at my current</b> <b>as</b> <b>at this stage of life</b> <b>do I still have two years</b> <b>in a good enough mental state</b> <b>to do this fully exploratory research</b> <b>Mm-hmm</b> <b>I think that's hard to say</b> <b>it's possible that once you have money</b> <b>your lifestyle</b>

<b>will change</b> <b>[laughter]</b> <b>and then</b> <b>this</b> <b>might also cause you to lose</b> <b>some of that original courage</b> <b>Oh</b> <b>and I think this is just for me personally</b> <b>I have many, many friends right now</b> <b>who are at Meta</b> <b>especially at Meta</b> <b>right everyone</b> <b>is actually making a lot of money</b> <b>they're also very competitive</b> <b>they work overtime every day too</b> <b>and basically everyone has moved near the office</b>

<b>working overtime every day</b> <b>seventy or eighty hours a week</b> <b>Yeah</b> <b>I think</b> <b>I also believe</b> <b>they will definitely build a great frontier model</b> <b>but I also want to say to them</b> <b>when you finish building that model</b> <b>mm-hmm, come check us out</b> <b>[laughter]</b> <b>I think yeah</b> <b>hopefully it's not too late</b> <b>but I think everyone I know</b>

<b>they all have this sense of mission</b> <b>right</b> <b>Meta FAIR's hiring strategy</b> <b>is it aligned with your hiring strategy?</b>

<b>uh, definitely not</b> <b>we don't have the money to hire like Meta FAIR does</b> <b>definitely different</b> <b>mm-hmm right</b> <b>or like Thinking Machines (the frontier AI lab founded by former OpenAI CTO Mira)</b> <b>including xAI</b> <b>I think they're all very different</b> <b>right, I feel</b> <b>although in terms of fundraising scale</b> <b>it's actually pretty good</b> <b>right</b> <b>at least in the top few historically, right?</b>

<b>top few — what's the valuation?</b>

<b>I don't know, I don't know</b> <b>Valuation</b> <b>we haven't changed</b> <b>still 3 billion pre-money</b> <b>right</b> <b>[laughter]</b> <b>mm-hmm, but the money is actually not a lot</b> <b>right, this capital</b> <b>is still very, very precious</b> <b>it's not like being at Meta</b> <b>or at Google, where you really have a money-printing machine</b> <b>and it's okay, you can do</b> <b>whatever you want</b> <b>we can't just print money</b> <b>I think in a startup</b>

<b>we still need to be very, very careful in how we deploy resources</b> <b>I think you deliberately chose not to start up in Silicon Valley</b> <b>is that right?</b>

<b>uh yes</b> <b>I think</b> <b>Silicon Valley again</b> <b>it's very complicated</b> <b>people often say</b> <b>that it's already deeply mired in</b> <b>already hypnotized by Large Language Models</b> <b>[laughter]</b> <b>and I think</b> <b>I think</b> <b>Uh</b> <b>but I don't think this state of affairs will last very long</b> <b>people who are hypnotized will eventually wake up</b> <b>and I think</b> <b>at that point we</b> <b>we don't rule out at all setting up a company in Silicon Valley</b>

<b>I think in the end</b> <b>or maybe very soon</b> <b>our company's location will definitely be wherever the talent is</b> <b>that's where our company will be</b> <b>having an office</b> <b>that's a perfectly normal thing</b> <b>Mm-hmm</b> <b>right</b> <b>oh well, let me</b> <b>go back to Yann for a moment</b> <b>Sure. [laughter]</b>

<b>no, what I want to say is</b> <b>I think Yann</b> <b>one thing that really appeals to me is</b> <b>he's truly a multi-hyphenate</b> <b>or rather a quite artistic person</b> <b>or in Kaiming's words</b> <b>Yann is someone whose adolescence at 16</b> <b>has continued all the way to 65</b> <b>oh, that's wonderful</b> <b>oh I think</b> <b>I think he must be pretty happy</b> <b>but he often says with great pride</b> <b>he has four great hobbies</b>

<b>the first hobby is</b> <b>building model airplanes</b> <b>the second is astrophotography</b> <b>so on Zoom you often see behind him</b> <b>there's a nebula, right?</b>

<b>a nebula-like</b> <b>wallpaper</b> <b>desktop background</b> <b>which he actually photographed himself</b> <b>in his own backyard</b> <b>and his third interest is making electronic music</b> <b>and getting into some jazz</b> <b>and things like that</b> <b>mm-hmm</b> <b>and if you look at his webpage</b> <b>it's a treasure</b> <b>I often go look at it from time to time</b> <b>he talks about which jazz clubs in New York</b>

<b>yes, the better jazz spots</b> <b>which musicians are particularly good</b> <b>and he also says</b> <b>that generally speaking</b> <b>French people look down on American</b> <b>popular culture</b> <b>except for jazz</b> <b>so he talks about Charlie Parker</b> <b>and a whole series of people</b> <b>and how great these musicians are</b> <b>I find it so interesting</b> <b>mm-hmm</b> <b>and he has another hobby which is</b> <b>as I already mentioned</b> <b>sailing</b>

<b>so I think a person like this appeals to me</b> <b>actually very, very much</b> <b>because I think his world is actually very big</b> <b>his world isn't just limited to research</b> <b>and now we're going to build world models</b> <b>I hope, you know</b> <b>the helmsman of this big ship is someone with vision</b> <b>and a love of life</b> <b>[laughter]</b>

<b>and there's another very interesting example</b> <b>coming up in March</b> <b>maybe when this show airs</b> <b>we'll have another paper to release</b> <b>the paper is called Solaris (from Stanisław Lem's 1961 novel)</b> <b>this is actually a sci-fi novel</b> <b>a novel by Lem, and</b> <b>later adapted into a film by Tarkovsky</b> <b>and the reason we chose this name</b> <b>is because we're building a so-called</b> <b>video generation model</b>

<b>and the film is also about</b> <b>an ocean</b> <b>this ocean</b> <b>that can read the subconscious memories of people</b> <b>and ultimately materialize and generate things from them</b> <b>I think that's really fascinating</b> <b>of course</b> <b>in Tarkovsky's film</b> <b>the message is</b> <b>our greatest enemy</b> <b>is not some alien civilization</b> <b>or an unknowable ocean</b> <b>it is actually humanity itself</b>

<b>it is humanity's own suffering and memories</b> <b>so</b> <b>the ocean is just a projection of humanity onto itself</b> <b>I want to bring this up because</b> <b>I think this</b> <b>film parallels what happens with LLMs so closely</b> <b>I think LLMs may not actually be understanding humans</b> <b>it's just a projection of humanity</b> <b>just a reflection</b> <b>but what I want to say is</b> <b>in relation to Yann</b>

<b>one day I said to him, hey</b> <b>this paper of ours</b> <b>what do you think of this name?</b>

<b>and I wanted to see if he knew the film</b> <b>and he said, oh</b> <b>you know this is a film title, right?</b>

<b>I said yes</b> <b>that's exactly</b> <b>why I chose this name</b> <b>he asked me</b> <b>which version did you watch?</b>

<b>[laughter]</b> <b>the 1972 one</b> <b>or the one from the early 2000s?</b>

<b>I felt</b> <b>I found the right person</b> <b>was it the Tarkovsky one or</b> <b>the Soderbergh one, right?</b>

<b>and I said, OK</b> <b>I think, mm-hmm</b> <b>I don't just admire you for your research</b> <b>it seems you also know more than me about film</b> <b>mm-hmm</b> <b>I think</b> <b>that's one thing</b> <b>quite interesting</b> <b>might not matter to many people</b> <b>but it's quite important to me personally</b> <b>a reflection of personal charisma</b> <b>a Chinese investor once told me</b>

<b>all startups born with a silver spoon</b> <b>none of them have succeeded</b> <b>almost none</b> <b>what do you think?</b>

<b>Uh</b> <b>I don't know what silver spoon means here</b> <b>enormous fundraising</b> <b>I see</b> <b>very famous</b> <b>as a founder who is already accomplished</b> <b>and very highly accomplished</b> <b>Mm-hmm</b> <b>ah, we weren't born with a silver spoon</b> <b>as I said, we're completely</b> <b>I won't say a ragtag bunch</b> <b>it's a grassroots coalition startup model</b> <b>how could Yann LeCun be grassroots?</b>

<b>Yann</b> <b>is not grassroots</b> <b>but in the AI industry right now</b> <b>or on the internet</b> <b>including in front of investors</b> <b>often it's half support half opposition</b> <b>half support, half opposition</b> <b>I don't know what the exact ratio is</b> <b>but in any case</b> <b>he's not the kind of hero everyone rallies around</b> <b>he's someone who holds firm to himself</b>

<b>and always tries to do the next thing</b> <b>but that thing hasn't been proven yet</b> <b>like that</b> <b>mm-hmm right?</b>

<b>and I think</b> <b>this means we weren't born with a silver spoon</b> <b>we don't have a silver spoon</b> <b>we don't have that feeling at all</b> <b>I think we're an underdog</b> <b>we're underdogs</b> <b>we actually</b> <b>are surviving under a kind of industry pressure</b> <b>a company like that</b> <b>right?</b>

<b>that's so humble-bragging</b> <b>no, no</b> <b>there's no humble-bragging</b> <b>we may have raised a lot</b> <b>but compared to the resources LLMs are mobilizing now</b> <b>this is just</b> <b>I don't know what percentage, it's so far off</b> <b>Was it difficult to raise funding?</b>

<b>with Yann on board</b> <b>it really wasn't difficult</b> <b>right</b> <b>but I</b> <b>I think</b> <b>a seed round is just a seed round</b> <b>I think you have to look ahead</b> <b>right?</b>

<b>I think you have to see what comes next</b> <b>which is to say</b> <b>can we ultimately deliver on our mission</b> <b>can we</b> <b>achieve this research breakthrough</b> <b>I think</b> <b>that's the most critical thing for us</b> <b>but anyway I feel</b> <b>I really enjoy this underdog identity</b> <b>especially as an entrepreneur</b> <b>because I think</b> <b>it's the same as being a researcher</b> <b>the more you don't believe in me</b> <b>the happier I am</b> <b>Have you felt anyone not believing in you</b>

<b>since you started the company?</b>

<b>mm-hmm, I think many people</b> <b>a lot of investor feedback</b> <b>more disbelief</b> <b>or more belief?</b>

<b>Uh</b> <b>I don't know what the ratio is</b> <b>we have many, many people who believe in us</b> <b>we have many people who don't</b> <b>mm-hmm, in Silicon Valley most people don't believe us</b> <b>in the rest of the world most people believe us</b> <b>so putting it all together</b> <b>I don't know</b> <b>Uh</b> <b>but that's okay</b> <b>I think the thing I most want to see is</b> <b>right?</b>

<b>you can not believe in us</b> <b>but then let's see</b> <b>right well</b> <b>I'm all in on this path now</b> <b>are you with me?</b>

<b>Mm-hmm</b> <b>How do you think entrepreneurship compares to being a researcher?</b>

<b>What's different?</b>

<b>I think there are many similarities</b> <b>but also many differences</b> <b>mm-hmm, I think about entrepreneurship... do you ski, Xiaojun?</b>

<b>I don't</b> <b>you don't?</b>

<b>I don't like sports</b> <b>I couldn't ski before either</b> <b>but I've been skiing recently</b> <b>and I've gotten quite a lot of</b> <b>insight from it</b> <b>I think</b> <b>first, skiing is a sport about balance</b> <b>once you master the balance</b> <b>you can actually ski</b> <b>second, you have to be fearless</b> <b>and point your shoulders down the slope</b> <b>I think this is so counterintuitive</b>

<b>people are always afraid</b> <b>when you're facing the downhill slope</b> <b>you always want to lean back</b> <b>Mm-hmm</b> <b>counter-instinct</b> <b>yes, you go against instinct</b> <b>and once you follow your instinct</b> <b>you fall backward</b> <b>and you completely lose control</b> <b>and completely fall</b> <b>right?</b>

<b>only when you completely let go</b> <b>only with enough courage</b> <b>and not fearing anything</b> <b>and pointing your shoulders toward the slope</b> <b>do you actually become more stable</b> <b>right?</b>

<b>and you can actually control your speed better</b> <b>so</b> <b>there's a quote I really like</b> <b>it might be from</b> <b>JoJo's</b> <b>the anime JoJo's Bizarre Adventure — it says the hymn of humanity is the hymn of courage</b> <b>I think that's also my understanding of entrepreneurship</b> <b>I think it requires courage</b> <b>but what you just asked</b> <b>is it the same in academia?</b>

<b>I think it requires even more courage</b> <b>but many of the decisions I made in academia</b> <b>mm-hmm, I think</b> <b>were also quite courageous decisions</b> <b>right?</b>

<b>and there's also this saying</b> <b>I think you never walk alone</b> <b>mm-hmm</b> <b>there'll be many people helping you</b> <b>Mm-hmm</b> <b>and precisely because you have people around you</b> <b>you become even braver</b> <b>Mm-hmm</b> <b>you just mentioned your taste in research</b> <b>what do you think about your taste in people?</b>

<b>First of all</b> <b>I don't think you should have a "taste" in people</b> <b>I think having a taste in people</b> <b>seems like a condescending way to put it</b> <b>Yeah</b> <b>How would you describe your ability to read people?</b>

<b>let me rephrase</b> <b>but I think it's also a mutual process</b> <b>mm-hmm, I think</b> <b>again, I think there's a kind of attraction</b> <b>that brings together people who can work together</b> <b>and we</b> <b>just need to follow that attraction</b> <b>to find those people</b> <b>and be with them</b> <b>right</b> <b>I don't think I would</b> <b>of course there will be some specific</b> <b>metrics</b>

<b>we certainly have some</b> <b>like we're conducting interviews now</b> <b>I can't just say: you don't need to interview,</b> <b>mm-hmm, I have a set of mystical logic</b> <b>for hiring</b> <b>that's not realistic either</b> <b>Mm-hmm</b> <b>but I do care about</b> <b>Yeah</b> <b>certain things</b> <b>I think I care about</b> <b>whether you truly have that kind of</b> <b>desire to solve a problem</b>

<b>and the courage to want to understand something</b> <b>and that kind of persistence</b> <b>I think this matters for research</b> <b>and is also very important for entrepreneurship</b> <b>and when I recruit students</b> <b>I also need to be able to see</b> <b>this kind of</b> <b>personality in people</b> <b>Mm-hmm</b> <b>[laughter]</b> <b>so this</b> <b>what does it actually mean?</b>

<b>from the perspective of doing research</b> <b>it means</b> <b>if you have a problem in front of you right now</b> <b>Kaiming told me this too</b> <b>he said</b> <b>you should be thinking about the problem when you wake up</b> <b>thinking about it while eating</b> <b>thinking about it in the shower</b> <b>maybe you can stop thinking while sleeping</b> <b>or maybe you even sleep with it on your mind</b> <b>do you truly have that kind of</b>

<b>passion</b> <b>right?</b> <b>that drive to keep thinking about this problem</b>

<b>or are you just treating this</b> <b>as just a job</b> <b>I think</b> <b>I think</b> <b>it's something that distinguishes people from one another</b> <b>a yardstick</b> <b>Do you have that problem right now?</b>

<b>Yeah</b> <b>What kind of problem?</b>

<b>mm-hmm, the kind of problem you carry with you every day</b> <b>yes absolutely</b> <b>of course</b> <b>but my current issue is</b> <b>that's also why I feel</b> <b>after spending a long time in academia</b> <b>it gets a bit difficult</b> <b>because to function in academia</b> <b>you need to do all kinds of</b> <b>what we call context switching</b> <b>you need to switch contexts, right?</b>

<b>because you have so many parts</b> <b>to manage</b> <b>and coordinate</b> <b>I think being in a startup is actually quite good</b> <b>I can now focus on one thing</b> <b>I can think about</b> <b>what kind of team we should build</b> <b>what kind of people this team needs</b> <b>what problems we should solve</b> <b>in the next 1 month, 3 months, 6 months</b> <b>or a year</b> <b>Mm-hmm</b> <b>I might not be thinking about this correctly</b> <b>but that's okay</b>

<b>as long as the entire team works together</b> <b>we can fail together</b> <b>pivot together</b> <b>then I think this company won't fail</b> <b>I can't guarantee</b> <b>every plan I have now is correct</b> <b>I don't think Yann can guarantee that either</b> <b>Mm-hmm</b> <b>but I still believe in people</b> <b>as you said</b> <b>I still believe that gathering these people</b> <b>with ideals and passion</b>

<b>who want to</b> <b>forge a new path together</b> <b>will definitely achieve something remarkable</b> <b>Did you agree on the spot?</b>

<b>LeCun?</b>

<b>no no no</b> <b>there was a long, long gap in between</b> <b>and Yann wasn't the first to approach me</b> <b>anyway later</b> <b>Yann took charge of recruiting the team</b> <b>so he also had to think about</b> <b>what role each person should have</b> <b>right, I think later we discussed together</b> <b>negotiated together</b> <b>and</b> <b>I think it was quite a long process</b> <b>and I think</b> <b>everyone eventually found their right place</b> <b>How long did you agonize over it?</b>

<b>from the first time he</b> <b>told you</b> <b>maybe about a week of agonizing</b> <b>What were you agonizing over?</b>

<b>whether I should start a company at all</b> <b>to do this</b> <b>whether I should do this with Yann</b> <b>Mm-hmm</b> <b>or</b> <b>maybe look for some new opportunities</b> <b>mm-hmm right?</b>

<b>and then later</b> <b>but I didn't agonize for very long</b> <b>right, I feel</b> <b>I thought, OK</b> <b>Yann used his magic</b> <b>I'll tell you all</b> <b>talking to Yann is kind of like</b> <b>it's like he's</b> <b>casting spells</b> <b>like Harry Potter</b> <b>casting some enchantments on you</b> <b>mm-hmm, he says some things</b> <b>[laughter]</b> <b>and you</b> <b>stop thinking about other things</b> <b>mm-hmm, what spell did he cast on you?</b>

<b>nothing really</b> <b>he just shared his vision</b> <b>he just explained</b> <b>why this was a better choice</b> <b>a better choice for me</b> <b>and also a better choice for this company</b> <b>why here</b> <b>I can have enough agency and autonomy</b> <b>the so-called ability to make independent decisions</b> <b>and build a team</b> <b>and help us design this entire</b> <b>execution</b> <b>roadmap</b> <b>I also</b>

<b>incredibly, incredibly grateful</b> <b>so grateful that Yann could give me that trust</b> <b>right</b> <b>but our company has several other co-founders</b> <b>everyone is really, really wonderful</b> <b>there are 6 co-founders in total</b> <b>oh, that many</b> <b>Yes</b> <b>and there's a CEO</b> <b>what else?</b>

<b>there's a CEO</b> <b>right</b> <b>there's also a COO</b> <b>there's a COO</b> <b>right and there's also a</b> <b>VP of world models</b> <b>and then there's also someone</b> <b>whose current temporary title is CRIO</b> <b>who is also Chinese</b> <b>by the way, her name is Pascale</b> <b>Pascale Fung</b> <b>What kind of position is that?</b>

<b>Uh</b> <b>it's more of something between research</b> <b>between pure research and product</b> <b>a role at the alignment layer</b> <b>responsible for our innovation</b> <b>she also has a lot of entrepreneurial experience</b> <b>Mm-hmm</b> <b>and our VP of world models</b> <b>was the director of the original JEPA team</b> <b>Mike</b> <b>and the COO was formerly Meta's</b>

<b>VP for all of Southern Europe</b> <b>Mm-hmm</b> <b>roughly that kind of combination</b> <b>so</b> <b>definitely not a purely researcher-background combination</b> <b>Mm-hmm</b> <b>Will you explore consumer-facing products?</b>

<b>uh yes</b> <b>and the ultimate goal</b> <b>will definitely include a consumer-facing product</b> <b>but we hope</b> <b>we won't be under any pressure</b> <b>because we still want to first build this world model</b> <b>however you define it</b> <b>first make it happen</b> <b>How many years out can your roadmap realistically plan?</b>

<b>planning years out is unrealistic</b> <b>I think if we can plan to a year</b> <b>that's already pretty good</b> <b>right</b> <b>and I think we don't need longer-term planning</b> <b>Mm-hmm</b> <b>Can greatness not be planned?</b>

<b>uh yes</b> <b>it's just, I'm not</b> <b>it's just like doing research</b> <b>I think you need an exploration process</b> <b>start by exploring</b> <b>start doing things</b> <b>mm-hmm, then gradually find your own ideas</b> <b>I think</b> <b>this applies to startups too</b> <b>What do you think</b> <b>about where your ideas have progressed to?</b>

<b>I think we've reached the point where</b> <b>we now have things to work on</b> <b>and we also feel there will be some</b> <b>quite promising results coming soon</b> <b>that's where we are</b> <b>but this thing</b> <b>what specifically?</b>

<b>we can talk about it</b> <b>in a few months</b> <b>but coming back to it</b> <b>the thing is</b> <b>people outside have a misconception about this company</b> <b>and another misconception about Yann</b> <b>people actually don't know what JEPA is</b> <b>mm-hmm right</b> <b>[laughter]</b> <b>I personally also went through several phases</b> <b>from doubting JEPA, to understanding JEPA</b> <b>then to becoming JEPA</b> <b>those three life stages</b> <b>Mm-hmm</b> <b>[laughter]</b>

<b>I think this is also quite fun</b> <b>because at first, doubting JEPA</b> <b>was because we had just started doing self-supervised learning</b> <b>doing MoCo, doing MAE</b> <b>and I think</b> <b>JEPA seemed like yet another self-supervised learning algorithm</b> <b>that's it — then gradually understanding JEPA</b> <b>was because I felt JEPA actually</b> <b>goes deeper than we imagined</b> <b>there's a lot of underlying logic inside it</b>

<b>many mathematical principles</b> <b>and we also need someone on this path</b> <b>to keep persisting</b> <b>because what we discovered early on</b> <b>couldn't be scaled up</b> <b>so we stopped</b> <b>mm-hmm, and then</b> <b>but later with JEPA</b> <b>for example including me</b> <b>to give a simple example</b> <b>recently there was a paper called LeJEPA</b> <b>and with a very rigorous proof they showed</b> <b>if you want a good representation</b> <b>if you want this representation</b>

<b>to be agnostic to your downstream task</b> <b>then it must be an isotropic Gaussian distribution</b> <b>this is a bit technical</b> <b>essentially it means</b> <b>it's a characterization</b> <b>of a certain property of representations</b> <b>and I found</b> <b>this actually has merit</b> <b>truly becoming JEPA</b> <b>is because I feel JEPA is not a model</b> <b>JEPA is not a specific algorithm</b> <b>JEPA is a complete cognitive architecture</b>
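As a numerical aside on the LeJEPA point above: this sketch is mine, not from the paper, and `isotropy_score` is a hypothetical helper. One simple way to probe how close a batch of embeddings is to an isotropic Gaussian is to look at the spread of its covariance eigenvalues, which are all equal in the perfectly isotropic case.

```python
# Illustrative sketch, not the LeJEPA method: quantify how isotropic a
# batch of embeddings is via the ratio of its smallest to largest
# covariance eigenvalue (exactly 1.0 for a perfectly isotropic Gaussian).
import numpy as np

def isotropy_score(z: np.ndarray) -> float:
    """Smallest / largest covariance eigenvalue of embeddings z, shape (n, d)."""
    z = z - z.mean(axis=0)                # center the batch
    cov = (z.T @ z) / (len(z) - 1)        # sample covariance, shape (d, d)
    eig = np.linalg.eigvalsh(cov)         # real eigenvalues, ascending order
    return float(eig[0] / eig[-1])

rng = np.random.default_rng(0)
iso = rng.normal(size=(10_000, 8))                    # isotropic Gaussian batch
aniso = iso * np.array([1, 1, 1, 1, 1, 1, 1, 10.0])   # one stretched axis

print(isotropy_score(iso))    # near 1: nearly isotropic
print(isotropy_score(aniso))  # far below 1: anisotropic
```

A representation collapsed along some directions would score near zero, which is one intuition for why the paper connects downstream-task agnosticism to isotropy.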

<b>it's a cognitive system</b> <b>this is what he wrote about</b> <b>in Yann's 2022 paper</b> <b>so in my view, this cognitive system</b> <b>is a path to intelligence</b> <b>a path to a universal intelligent agent</b> <b>in my current view</b> <b>a very reasonable path</b> <b>so what JEPA requires</b> <b>JEPA is not just self-supervised learning</b>

<b>it needs world understanding capability</b> <b>it needs the ability to understand the world</b> <b>it needs the ability to make predictions</b> <b>it needs the ability to do planning</b> <b>mm-hmm right?</b>

<b>prediction and planning</b> <b>right</b> <b>I think</b> <b>this gave me new insights into JEPA</b> <b>and I found that JEPA actually isn't a specific method</b> <b>as people outside tend to say</b> <b>like Yann has this method</b> <b>and he must stick to this method</b> <b>and turn it into something specific</b> <b>it's not like that</b> <b>JEPA is a very, very vast ocean</b>

<b>in this ocean there can be many, many ships</b> <b>sailing on it</b> <b>sailing</b> <b>[laughter]</b> <b>ultimately</b> <b>this entire system will have a lot of collaboration</b> <b>and LLMs are also part of it</b> <b>Mm-hmm</b> <b>so this makes me feel, mm-hmm</b> <b>this company can succeed</b> <b>and has a great chance of succeeding</b>

<b>the reason is it's not about shrinking things down</b> <b>under today's LLM settings</b> <b>everyone is narrowing things down</b> <b>but Yann's company is deliberately thinking big</b> <b>mm-hmm, he has enough space for us to explore</b> <b>to let us scale up</b> <b>until the very end</b> <b>we can achieve some kind of new breakthrough</b> <b>when exactly will this happen</b> <b>will it happen</b> <b>we can't predict</b> <b>but I feel</b>

<b>this is a path I'm willing to invest my life in</b> <b>to walk</b> <b>How does it feel after starting the company?</b>

<b>Your genuine feelings</b> <b>it's gotten busier and more tiring</b> <b>of course, definitely</b> <b>mm-hmm, there are lots of ups and downs</b> <b>there'll be</b> <b>a lot of tedious things</b> <b>but also because</b> <b>watching this company grow bit by bit</b> <b>watching how</b> <b>because we have 4 offices</b> <b>with so many legal issues</b> <b>whatever</b> <b>so much internal friction</b>

<b>slowly, what was originally</b> <b>internal friction</b> <b>gradually becomes smooth</b> <b>that process is actually quite enjoyable</b> <b>and in that process</b> <b>we also received help from many, many people</b> <b>so</b> <b>looking at it so far</b> <b>I think I made the right choice</b> <b>Mm-hmm</b> <b>maybe a bit different from your expectations?</b>

<b>maybe more optimistic</b> <b>Mm-hmm</b> <b>right, I feel</b> <b>the moment you jump, the fear disappears</b> <b>mm-hmm right</b> <b>I think as long as you have courage</b> <b>everything else is manageable</b> <b>and I feel in this company</b> <b>Ah</b> <b>I can find that courage</b> <b>Mm-hmm</b> <b>You just said AGI is a false premise</b> <b>can you elaborate on that?</b>

<b>AGI is a false premise</b> <b>this is also something Yann often says</b> <b>didn't he have a debate with Demis Hassabis (DeepMind founder)?</b>

<b>right, he asked what exactly is general intelligence</b> <b>does general intelligence actually exist?</b>

<b>I won't go into too much detail on this</b> <b>but his logic here is also very mathematical</b> <b>very Yann</b> <b>what he says basically comes down to this</b> <b>a person has</b> <b>for example, 2 million optic nerve fibers</b> <b>mm-hmm, this can be modeled</b> <b>and the space of all possible visual functions over them</b> <b>is actually enormously vast</b>

<b>it is</b> <b>as many as 2 to the power of 2 to the power of 200 functions</b> <b>but what humans can actually process</b> <b>and perceive</b> <b>is actually</b> <b>approaching zero</b> <b>right?</b>

<b>we are limited by our consciousness</b> <b>we are limited by our own neural</b> <b>bandwidth limitations</b> <b>we cannot see</b> <b>everything that happens in this world</b> <b>Mm-hmm</b> <b>so</b> <b>human intelligence is a very specialized intelligence</b> <b>humans can only perceive what they can see</b> <b>Mm-hmm</b> <b>and later I also posted a tweet about it</b> <b>I read a book</b>

<b>called "Are We Smart Enough to Know How Smart Animals Are?"</b>

<b>which asks whether we're smart enough</b> <b>to know how smart animals are</b> <b>Mm-hmm</b> <b>and after reading this book</b> <b>I let go of more of that human arrogance</b> <b>I think the evolution of intelligence</b> <b>is a continuous process</b> <b>it's not one where</b> <b>humans are truly unique</b> <b>right, we often say</b> <b>humans are intelligent</b> <b>because humans use tools</b> <b>but animals also use tools</b>

<b>and some people say</b> <b>humans actually have a certain</b> <b>self-awareness and consciousness</b> <b>there's a classic experiment:</b> <b>humans can look in a mirror</b> <b>and recognize that the person in the mirror</b> <b>is themselves and not another entity</b> <b>can dogs?</b>

<b>they can too</b> <b>right, many animals can</b> <b>oh right?</b>

<b>because some animals can't</b> <b>dogs actually quite enjoy looking at themselves in mirrors</b> <b>[laughter] right</b> <b>anyway, many animals</b> <b>indeed can't</b> <b>but many animals can</b> <b>mm-hmm right?</b>

<b>and there are also many very interesting things</b> <b>like chimpanzees, right?</b>

<b>and this author</b> <b>so</b> <b>de Waal also wrote another book</b> <b>called "Chimpanzee Politics" (a 1982 classic of animal behavior)</b> <b>which is about</b> <b>four chimpanzees</b> <b>and how they engage in power struggles</b> <b>very much like House of Cards</b> <b>or how there's a lot of scheming</b> <b>how you form alliances</b> <b>then maneuver and rise to the top</b> <b>and so on</b>

<b>I think that's very interesting</b> <b>[laughter]</b> <b>and one thing that left a deep impression on me</b> <b>was that</b> <b>for example, they</b> <b>these animals actually</b> <b>including chimpanzees, also have a kind of theory of mind</b> <b>they can also have their own world model</b> <b>and their world models are actually quite good</b> <b>for example, there's an example where</b> <b>an experimenter is in a room</b> <b>with two boxes</b> <b>one box containing a banana</b> <b>the other containing an apple</b>

<b>the chimp is shown this</b> <b>then the boxes are closed</b> <b>[laughter]</b> <b>and the experimenter takes the chimp out</b> <b>after a long, long time</b> <b>it's brought back into the room</b> <b>and the first thing the chimp notices</b> <b>is that</b> <b>the experimenter is eating a banana</b> <b>and the chimp</b> <b>immediately</b> <b>goes straight to open the box with the apple</b>

<b>and eats the apple</b> <b>without even glancing at the banana</b> <b>so</b> <b>chimpanzees also have a kind of reasoning ability</b> <b>right?</b>

<b>and although language is indeed unique</b> <b>language is something only humans have</b> <b>but that doesn't mean other animals don't communicate</b> <b>they have their own language</b> <b>including</b> <b>like whales also have their own language</b> <b>anyway this is all quite fascinating</b> <b>I highly recommend that book</b> <b>[laughter]</b> <b>and there's also</b> <b>I read about some kind of bird (scrub jays)</b> <b>I forgot what they're called</b> <b>apparently they're very good at</b>

<b>if one is burying food</b> <b>burying food underground</b> <b>if it notices</b> <b>that one of its peers saw it happen</b> <b>it will first bury it there</b> <b>then wait for the peer to leave, dig it up</b> <b>and rebury it in a different spot</b> <b>I think that's quite interesting</b> <b>and of course we also know</b> <b>dogs have a keen sense of smell</b> <b>and bats navigate by hearing</b>

<b>I think the boundaries of intelligence are very broad</b> <b>people now talk about jagged intelligence</b> <b>so your world model</b> <b>which type of biological intelligence will it aim for first?</b>

<b>the goal is of course human intelligence</b> <b>human intelligence is certainly</b> <b>still</b> <b>at least in one dimension still the strongest</b> <b>or</b> <b>it's also what can most benefit the world</b> <b>Mm-hmm</b> <b>so we still want to build a world model</b> <b>toward human-like intelligence</b> <b>Mm-hmm</b> <b>but I just want to let go of human arrogance</b> <b>and recently I've been very inspired by this</b> <b>because I watched Rich Sutton</b>

<b>in this</b> <b>podcast</b> <b>talk about a theory</b> <b>because before I</b> <b>didn't know how to address this</b> <b>because people say</b> <b>LLMs are amazing, right?</b>

<b>LLMs can now write code</b> <b>can win gold at the IMO and IOI</b> <b>can help us go to the moon and Mars</b> <b>these things are incredible</b> <b>and I can't deny these things</b> <b>they really are impressive</b> <b>right?</b>

<b>but Rich Sutton's reply</b> <b>I think</b> <b>was very good — he replied</b> <b>you think these things are great and impressive?</b>

<b>that they're hard? well, feel free to think that</b> <b>because I don't think so</b> <b>I think building the intelligence of a squirrel</b> <b>is the hard problem</b> <b>once you have a squirrel's intelligence</b> <b>once you can build a squirrel's intelligence</b> <b>and make it survive in the real world</b> <b>with its own goals</b> <b>its own objectives</b> <b>its own intrinsic rewards as you described</b> <b>it knows hunger</b> <b>it has its own emotions</b>

<b>and it can engage in social activities</b> <b>after that, writing code, going to Mars, going to the moon</b> <b>those things would be the easy ones</b> <b>Good</b> <b>I'm gradually coming to strongly agree with this view</b> <b>if you set aside human arrogance</b> <b>building a squirrel's intelligence</b> <b>is actually a harder problem</b> <b>but that's not how it looks to humans</b> <b>from a human's</b> <b>perspective</b> <b>it doesn't seem that way</b>

<b>but that's entirely due to human arrogance</b> <b>you're also building human-level intelligence</b> <b>ah yes</b> <b>but what I mean is</b> <b>human intelligence has many, many aspects</b> <b>human intelligence is not just a language model</b> <b>human intelligence encompasses many types of intelligence</b> <b>that cannot be defined</b> <b>by language models or language itself</b> <b>right, I think that's a core insight</b> <b>What is your definition of intelligence?</b>

<b>mm-hmm, so as I was just saying</b> <b>Rich Sutton talked about this</b> <b>he feels that squirrel intelligence is the real intelligence</b> <b>I think his framing is a bit different</b> <b>he's not positioning from a human perspective</b> <b>looking at things from an anthropocentric view</b> <b>he's standing at the perspective of the universe</b> <b>of the creator</b> <b>from this angle</b> <b>of course</b> <b>being able to recreate a squirrel</b> <b>is far greater than</b>

<b>what human civilization</b> <b>the last 8 seconds of these 530 million years</b> <b>has created</b> <b>in this sense</b> <b>I think</b> <b>that's elevated the discussion</b> <b>I think that elevated perspective has merit</b> <b>but defining intelligence</b> <b>I don't want to give it a definition</b> <b>I think different animals have different intelligence</b> <b>and humans have human-level intelligence</b> <b>Mm-hmm</b>

<b>and what I want to encourage everyone to do</b> <b>don't only focus on what</b> <b>we as individuals cannot do</b> <b>pay attention to what we're already doing well</b> <b>pay attention to what a 4-year-old child</b> <b>or a child of a few years old does very well</b> <b>those things</b> <b>are actually what our world model</b> <b>next needs to focus on solving</b> <b>mm-hmm, so this is also</b> <b>why Robotics is ultimately</b> <b>a very fitting outlet</b>

<b>because before you talk about AGI</b> <b>or super intelligence</b> <b>can we first have a sufficiently reliable</b> <b>and general robot</b> <b>that can function in our home environment</b> <b>and help with household chores</b> <b>right, because a few-year-old child</b> <b>can actually do many, many household chores</b> <b>there's actually a list</b> <b>you can search for it online</b> <b>a 12-year-old child</b> <b>can basically do all the household chores</b>

<b>but is there a robot right now</b> <b>that can function like a 12-year-old child</b> <b>and handle these chores?</b>

<b>of course not</b> <b>Jie Tan from DeepMind</b> <b>also says that robotic development is extremely uneven</b> <b>extremely imbalanced</b> <b>its developmental trajectory is different</b> <b>from a child's</b> <b>mm-hmm, for example</b> <b>the physical capabilities of robots' limbs</b> <b>have already surpassed humans</b> <b>Mm-hmm</b> <b>but many other capabilities are still not as good as a child's</b> <b>because of the brain</b> <b>nobody is building the brain</b>

<b>nobody is building a robot brain</b> <b>all the robotics startups</b> <b>including the robotics divisions at big companies</b> <b>haven't solved this</b> <b>Doesn't DeepMind count?</b>

<b>DeepMind is now entirely based on Gemini</b> <b>so it's also working within the VLA framework</b> <b>Yes</b> <b>everything converges to</b> <b>Gemini</b> <b>Oh</b> <b>but this needs a second half of pre-training</b> <b>Mm-hmm</b> <b>in Shunyu Yao's classic formulation</b> <b>[laughter] I think there needs to be a second half</b> <b>but I think this is the second half of pre-training</b> <b>Mm-hmm</b> <b>Jim Fan recently also expressed the same view</b>

<b>so this pre-training is the world model</b> <b>who will do this pre-training?</b>

<b>that's not clear to me</b> <b>if I knew</b> <b>there was another place that could also do this</b> <b>then I might actually reconsider</b> <b>I wouldn't necessarily need to be</b> <b>at this startup</b> <b>doing this myself</b> <b>right?</b>

<b>robotics startups</b> <b>have no energy to do this</b> <b>they need to put their resources</b> <b>into the so-called hardware</b> <b>scaling law</b> <b>that is</b> <b>you need to buy more robots</b> <b>to deploy these robots</b> <b>or do these things in simulators</b> <b>these imitation learning approaches</b> <b>can give you a good enough system</b>

<b>to solve some specific problems</b> <b>in the short term</b> <b>making you a robotics team that creates value</b> <b>What about PI (Physical Intelligence)?</b>

<b>VLA right?</b>

<b>PI is the same</b> <b>PI is already very, very research-oriented</b> <b>and doing very, very well</b> <b>and is inspiring</b> <b>as a company</b> <b>but again, they won't do pre-training</b> <b>they'll take</b> <b>language models as their foundation</b> <b>Yeah</b> <b>right?</b>

<b>How should we understand your second half of pre-training?</b>

<b>what goes in</b> <b>what comes out</b> <b>I don't know yet</b> <b>at least as a first step</b> <b>in the long run</b> <b>the inputs are all</b> <b>continuous-space signals as I just described</b> <b>high-dimensional</b> <b>potentially noisy signals</b> <b>Mm-hmm</b> <b>at first it might still be video</b> <b>but we might also have multi-modal encoders</b> <b>to handle different</b> <b>signals beyond visual</b> <b>and the outputs</b> <b>that's a research question</b>

<b>the self-supervised question is still unknown</b> <b>I</b> <b>not necessarily unknown</b> <b>but</b> <b>it may become clearer later</b> <b>Mm-hmm</b> <b>but</b> <b>I think</b> <b>it's definitely not that simple</b> <b>but I think that's where the excitement lies</b> <b>I also find it quite interesting</b> <b>because the first time we met</b> <b>you said "you are not the chosen one"</b> <b>"you are just the normal one"</b>

<b>why do you like saying this?</b>

<b>No</b> <b>you see, throughout our conversation we discussed my</b> <b>growth story</b> <b>I</b> <b>I didn't expect we'd talk about all this</b> <b>but</b> <b>I definitely don't feel like a chosen one</b> <b>[laughter]</b> <b>this quote is actually from a team I love</b> <b>Liverpool right?</b>

<b>I've been a Kopite (the Kop is the famous terrace at Anfield and a symbol of devoted Liverpool fans)</b> <b>for over 20 years</b> <b>[laughter]</b> <b>I think there's a bit of kindred spirit there</b> <b>and my favorite manager</b> <b>Jürgen Klopp</b> <b>[laughter]</b> <b>he was half-joking when he said to everyone</b> <b>after another manager</b> <b>José Mourinho</b> <b>said "I am the special one"</b> <b>Klopp said</b>

<b>"I'm not the special one"</b> <b>"I'm the normal one"</b> <b>and I think</b> <b>on one hand he himself is very punk</b> <b>he has that rock 'n' roll spirit</b> <b>[laughter]</b> <b>Uh</b> <b>and he often tells everyone</b> <b>that his role in the team</b> <b>is like a battery</b> <b>he hopes through his own passion</b> <b>and his own energy, you know</b>

<b>to charge others</b> <b>to empower</b> <b>to empower others</b> <b>mm-hmm right</b> <b>I also want to be that kind of person</b> <b>I also want to be a battery for a team</b> <b>whether that team is in academia</b> <b>or in a startup</b> <b>I think this is actually not easy</b> <b>because sometimes</b> <b>everyone has their moments of discouragement</b> <b>Mm-hmm</b> <b>sometimes I also want to</b> <b>complain more</b>

<b>and let out my feelings</b> <b>but I'm gradually coming to feel</b> <b>in academia, like in front of students</b> <b>and in front of the startup team</b> <b>someone needs to play that battery role</b> <b>I think Yann is a giant battery</b> <b>he inspired me</b> <b>and I hope to pass this charge on through me</b> <b>and send it further</b> <b>What was the last time you felt discouraged, and why?</b>

<b>I feel discouraged every day</b> <b>I think it's become</b> <b>a kind of researcher's fate</b> <b>I think everyone has this underlying melancholy</b> <b>because the process of research inquiry</b> <b>is like groping around in a dark</b> <b>lightless place</b> <b>Mm-hmm</b> <b>when you can't see any light</b> <b>you always feel lost and discouraged</b>

<b>and when people truly feel</b> <b>this kind of joy</b> <b>it's only when you actually get something working</b> <b>but this part of the time</b> <b>is very, very brief</b> <b>maybe only 5% or 10%</b> <b>Kaiming has said something similar</b> <b>so over time</b> <b>right, eventually everyone's</b> <b>mental state can become concerning</b> <b>but I think it's okay</b> <b>I think</b> <b>Uh</b>

<b>I think this era now</b> <b>is still not quite the same as before</b> <b>I think now there's more discussion</b> <b>I think</b> <b>this is one of the benefits of this AI wave</b> <b>at least</b> <b>people won't feel</b> <b>like they're in a closed space</b> <b>exploring alone</b> <b>at least people can scroll through Xiaohongshu</b> <b>scroll through Weibo, Zhihu</b> <b>and see how everyone is discussing this</b>

<b>I think that's sometimes quite stress-relieving</b> <b>but sometimes it also adds pressure</b> <b>when people criticize you</b> <b>it doesn't feel so relieving anymore</b> <b>Does your company have people with an entrepreneurial personality?</b>

<b>entrepreneurial personality?</b> <b>generally quite optimistic, yes</b> <b>I think Yann himself is very optimistic</b> <b>very, very optimistic</b> <b>why isn't he a researcher</b> <b>with that melancholy undercurrent?</b>

<b>hmm, I don't know</b> <b>because he's been through hardship</b> <b>and then succeeded</b> <b>Oh</b> <b>he lived through the AI winter</b> <b>and then showed everyone</b> <b>he was right</b> <b>and they were wrong</b> <b>if I went through something like that</b> <b>I might not be so melancholy either</b> <b>he's still quite optimistic</b> <b>I think</b> <b>or rather, his past experiences</b> <b>have also given him more confidence</b> <b>and something he often says is</b> <b>this</b>

<b>it's exactly the same</b> <b>as what happened before with deep learning and neural networks</b> <b>which thing?</b>

<b>it's that now, with world models</b> <b>or whatever you call them</b> <b>the current systems</b> <b>for building intelligence</b> <b>he says there's always a small group of people</b> <b>who can clearly see</b> <b>the trajectory of the world's development</b> <b>the progress of technology</b> <b>but they're only a small minority</b> <b>most people can't see it</b> <b>right</b> <b>because most people are busy doing other things</b> <b>back then with deep learning</b>

<b>people were doing whatever</b> <b>other things</b> <b>traditional machine learning</b> <b>mm-hmm, and now</b> <b>what you're doing is</b> <b>you can, mm-hmm</b> <b>let's not say it, just think about it</b> <b>[laughter]</b> <b>and I think</b> <b>he's actually quite optimistic</b> <b>or rather he has</b> <b>enough confidence</b> <b>to say</b> <b>the things I can see are important things</b> <b>the path I can see</b> <b>is a clear path</b>

<b>and on this matter</b> <b>I still believe him quite a lot</b> <b>Have you ever doubted him?</b>

<b>Uh</b> <b>as I said</b> <b>I questioned JEPA</b> <b>then understood JEPA</b> <b>then became JEPA</b> <b>so of course there was doubt</b> <b>but I feel that trust in a person</b> <b>and trust in a research direction</b> <b>takes time</b> <b>I was just telling students the other day</b> <b>every time Yann gives a talk</b> <b>he gives exactly the same talk</b> <b>his slides</b> <b>are honestly pretty ugly</b> <b>[laughter]</b> <b>[laughter]</b>

<b>but they have his personal style</b> <b>and style and design are interesting that way</b> <b>some things are originally ugly</b> <b>but if you use them enough</b> <b>and time passes</b> <b>they become the new fashion</b> <b>but</b> <b>every time he gives that same talk</b> <b>I've been feeling this very, very strongly recently</b> <b>I said</b> <b>this talk</b> <b>I've watched it at least 10 times</b>

<b>20 times now, but each time I get something new</b> <b>every time I feel</b> <b>like I understand a bit more what he really means</b> <b>and this</b> <b>this deeper understanding</b> <b>is not because I've watched the same content 10 or 20 times</b> <b>and got this new understanding</b> <b>it's because</b> <b>I'm doing what I want to do</b> <b>Mm-hmm</b> <b>and I find</b> <b>that is</b>

<b>when watching his talk</b> <b>each time I do this translation work</b> <b>and association work</b> <b>I find</b> <b>that what he said</b> <b>in my current understanding</b> <b>can be interpreted this way</b> <b>and it doesn't conflict at all with</b> <b>even today's large language model or multimodal paradigms</b> <b>everything</b> <b>Yann says can be clearly mapped onto</b>

<b>what we're doing now</b> <b>concretely</b> <b>and guide us</b> <b>to perhaps escape some local optimum</b> <b>[laughter]</b> <b>and perhaps lead to a different future</b> <b>mm-hmm, so it's become an inspiration</b> <b>right?</b>

<b>it's not just knowledge</b> <b>it's an inspiration</b> <b>Mm-hmm</b> <b>so I think that's also wonderful</b> <b>Mm-hmm</b> <b>we just talked a lot about world models</b> <b>do you have any new thoughts on your world model</b> <b>for the real world?</b>

<b>In the past year or two</b> <b>I think this thing must definitely</b> <b>go beyond the limitations of research</b> <b>the limitations of being a researcher</b> <b>it must enter real life</b> <b>and</b> <b>understand what's happening in the real world</b> <b>and I think New York is very different</b> <b>I go to work every day</b> <b>and first, I don't have to drive</b> <b>so I've already started to step out</b>

<b>of a kind of armor</b> <b>and enter real life</b> <b>by walking</b> <b>and this</b> <b>I think has many</b> <b>wonderful effects</b> <b>for example</b> <b>some days I'm still under quite a lot of pressure</b> <b>sometimes something happens</b> <b>and it's quite discouraging</b> <b>but every time I walk</b> <b>from my home to my office at school</b> <b>there's a park called Washington Square Park</b>

<b>[laughter]</b> <b>inside there are all kinds of people</b> <b>all sorts</b> <b>everyone living their own lives</b> <b>there are street performers playing piano</b> <b>dancers</b> <b>mothers pushing strollers</b> <b>old men playing chess</b> <b>and young people sitting on the steps doing nothing</b> <b>daydreaming</b>

<b>and NYU students studying with laptops</b> <b>[laughter]</b> <b>I think my most stress-relieving moments every day are</b> <b>this roughly 5 to 10 minute walk</b> <b>I find</b> <b>the world is much bigger than we imagine</b> <b>not everyone cares about what AI is</b> <b>they may not care about this at all</b> <b>and they have their own lives</b> <b>the world is big</b> <b>but on the other hand</b>

<b>maybe AI someday in the future</b> <b>will indeed affect their lives</b> <b>so what should we actually be doing?</b>

<b>as researchers</b> <b>do we have some kind of social responsibility?</b>

<b>but this might be getting a bit far-reaching</b> <b>but I just feel</b> <b>more contact with people</b> <b>more contact with people living in this world</b> <b>helps me understand what AI is</b> <b>and how to build the next generation of AI</b> <b>in new ways</b> <b>and this</b> <b>is exactly what Ilya wanted to discuss when he called me</b> <b>but I hadn't arrived at these insights yet</b> <b>Have you picked up any new hobbies?</b>

<b>New hobbies</b> <b>In New York?</b>

<b>right</b> <b>no real new hobbies</b> <b>I think</b> <b>skiing counts as one</b> <b>most other times</b> <b>I genuinely don't have time</b> <b>but the nice thing about New York is</b> <b>you know that once you go out</b> <b>you can find a new hobby</b> <b>that itself</b> <b>is enough to make me happy</b> <b>whether or not I actually have time to step out</b>

<b>and do those things</b> <b>Mm-hmm</b> <b>having that possibility available</b> <b>I think is quite different</b> <b>and very different from the Bay Area</b> <b>Can you share</b> <b>aside from work</b> <b>what music you like</b> <b>books you enjoy</b> <b>films and games you enjoy?</b>

<b>Right now</b> <b>Yeah</b> <b>that's hard to think about</b> <b>off the top of my head I'm not sure</b> <b>I think let me approach this through AI</b> <b>let me think about what I've watched recently</b> <b>let me think</b> <b>Mm-hmm</b> <b>I actually enjoy watching TV shows</b> <b>so I can recommend some shows</b> <b>for everyone</b> <b>Mm-hmm</b> <b>there's a show called POI</b> <b>it's also quite an old show</b> <b>Person of Interest</b>

<b>I watched this many years ago</b> <b>in it</b> <b>they discuss what a super intelligence is</b> <b>you have a good super intelligence</b> <b>and a bad super intelligence</b> <b>their competition</b> <b>and the threat to human society</b> <b>and I think</b> <b>I won't spoil it</b> <b>but it's quite multi-modal</b> <b>and this might</b> <b>have a certain prophetic quality</b>

<b>I think it's quite remarkable</b> <b>mm-hmm right</b> <b>at its core it's about how</b> <b>an AI in a box</b> <b>a language model</b> <b>or</b> <b>an agent that can write code</b> <b>step by step breaks free</b> <b>and becomes a multi-modal model</b> <b>I think everyone should check it out</b> <b>and later there's also</b> <b>something I really like</b> <b>like Pantheon (American animated series)</b> <b>it's also</b> <b>I think a kind of AI prophecy</b> <b>yes, it's an animation</b>

<b>its author is Ken Liu (Chinese-American science fiction writer)</b> <b>he's also from my hometown</b> <b>and he's also someone who</b> <b>worked as a lawyer</b> <b>worked as a programmer</b> <b>and ultimately became</b> <b>a novelist</b> <b>like that</b> <b>incredibly impressive</b> <b>I admire him greatly</b> <b>and I love reading his books too</b> <b>right</b> <b>but this show was also recommended by Sam Altman before</b> <b>so many people have seen it</b> <b>and also</b>

<b>recently of course there's the very popular Companion</b> <b>I think this is also a kind of AI prophecy</b> <b>the slightly troubling thing now is</b> <b>popular culture has become too saturated with AI</b> <b>making everything seem AI-related</b> <b>it's a bit overwhelming</b> <b>but</b> <b>maybe it's just because I'm an AI professional</b> <b>so sometimes</b>

<b>it feels different</b> <b>but I think</b> <b>these things are still quite inspiring</b> <b>including the sci-fi novels I mentioned</b> <b>including these older films</b> <b>I think</b> <b>they may all be a kind of prophecy about reality</b> <b>but generally speaking</b> <b>these</b> <b>works of film and TV</b> <b>don't point toward a very bright future</b> <b>usually the endings are quite bleak</b>

<b>Mm-hmm</b> <b>ah, I recently watched a film</b> <b>I think it's called No Other Choice</b> <b>a film by Park Chan-wook</b> <b>and it's also about AI's alienation of humanity</b> <b>throughout the entire film</b> <b>it never mentions anything about AI</b> <b>until the very end</b> <b>but the whole thing is about</b> <b>the changes brought about by AI's arrival</b> <b>what changes humans have undergone</b> <b>people's mindsets</b> <b>relationships between people</b>

<b>what exactly has changed</b> <b>I think these things are also instructive</b> <b>and speaking of</b> <b>one last word on films</b> <b>welcome everyone to come to New York</b> <b>in New York</b> <b>I used to attend one film festival</b> <b>the New York Film Festival</b> <b>with many films to watch</b> <b>now I'll be going to two</b> <b>the second one is</b> <b>the AI film festival Runway holds every year</b> <b>and I think it's very cool and interesting</b> <b>if I were to recommend one</b>

<b>very relevant to everything we just talked about</b> <b>one that won their grand prize this year</b> <b>an AI film called Total Pixel Space</b> <b>[laughter]</b> <b>I won't spoil it</b> <b>anyway</b> <b>this is a very interesting AI short film</b> <b>and it actually talks about a lot of</b>

<b>what we just discussed</b> <b>about world models</b> <b>or why human intelligence</b> <b>is not simply</b> <b>or is not</b> <b>purely general intelligence</b> <b>some arguments</b> <b>I think it's quite fun</b> <b>mm-hmm, each of our guests</b> <b>recommends a life-changing book to our audience</b> <b>one that has truly influenced you</b> <b>and changed you</b> <b>what would yours be?</b>

<b>a book? mm-hmm</b>

<b>that's hard, you have to let me think</b> <b>Mm-hmm</b> <b>one book</b> <b>I guess people often recommend it</b> <b>but</b> <b>as for how this book changed my life</b> <b>I wouldn't say it changed my life hugely</b> <b>but during my undergraduate years</b> <b>it was a collective memory</b> <b>everyone would read</b> <b>this book called GEB</b> <b>have you heard of it?</b>

<b>which is Gödel, Escher, Bach</b> <b>the Chinese title is "GEB: An Eternal Golden Braid"</b> <b>it talks about philosophy</b> <b>mathematical logic</b> <b>and these three people, right?</b>

<b>Gödel, Bach, and Escher, right?</b>

<b>a mathematician</b> <b>a musician, a composer</b> <b>and also a painter</b> <b>mm-hmm</b> <b>how they are able to</b> <b>what philosophical commonalities they share</b> <b>you could put it that way</b> <b>right</b> <b>and it's very interesting</b>

<b>because during our undergraduate days</b> <b>the book is this thick</b> <b>we studied it together as a group</b> <b>it was also recommended by our teacher</b> <b>so everyone studied it together</b> <b>and actually back then nobody really understood it</b> <b>but later it started feeling more and more</b> <b>mm-hmm, like it makes sense</b> <b>Mm-hmm</b> <b>this book</b> <b>I think</b> <b>if you don't have time to read every page carefully</b>

<b>you can read an abridged version</b> <b>or some kind of summary</b> <b>some of its ideas</b> <b>I find very, very interesting</b> <b>and also</b> <b>there's a book</b> <b>this one was probably also read during undergrad</b> <b>called Zen and the Art of Motorcycle Maintenance</b> <b>or is it motorcycle repair</b> <b>"Zen and the Art of Motorcycle Maintenance: An Inquiry into Values"</b> <b>I think it's called that</b> <b>right</b>

<b>and this book is also a process of inner seeking</b> <b>it's about a person riding a motorcycle</b> <b>with</b> <b>this might be a spoiler</b> <b>an imagined</b> <b>philosopher</b> <b>but this philosopher is actually a projection of himself</b> <b>mm-hmm, my feeling reading this book was</b> <b>I also</b> <b>didn't fully understand what he was saying</b> <b>right mm-hmm</b>

<b>but some books and films fill you up</b> <b>and some books or films empty you out</b> <b>my feeling after finishing this book was</b> <b>it kind of emptied me out</b> <b>Oh~</b> <b>and it made me feel</b> <b>Mm-hmm</b> <b>right, this gets abstract again</b> <b>anyway, it made me feel</b>

<b>Uh</b> <b>it made me sense</b> <b>what truly matters in this world</b> <b>what doesn't</b> <b>for you</b> <b>what matters</b> <b>what doesn't</b> <b>I don't know</b> <b>I think I'm always looking for that balance</b> <b>I think, mm-hmm</b> <b>I think</b> <b>genuine communication between people is important</b> <b>perhaps nothing else matters</b> <b>but at any given moment</b> <b>if you ask me this question</b> <b>I might say</b> <b>entrepreneurship is important</b> <b>research is important</b>

<b>but at the end of the day</b> <b>I still believe</b> <b>that communication between people is what matters</b> <b>it sounds like you want to do research also for the sake of connection</b> <b>uh yes</b> <b>I think so</b> <b>and I think</b> <b>research itself is also a form of deeper connection</b> <b>Mm-hmm</b> <b>Mm-hmm</b> <b>this</b> <b>actually helped us during fundraising</b> <b>too</b> <b>why not?</b>

<b>an investor was very willing to invest in us</b> <b>and his reason was</b> <b>someone he knew, a very strong entrepreneur</b> <b>who is also a researcher</b> <b>said, hey</b> <b>you absolutely must invest in Saining</b> <b>and whatever it takes</b> <b>we need to help him</b> <b>but</b> <b>I had only met this person once at a meeting</b> <b>who was it?</b>

<b>and later</b> <b>who?</b>

<b>Uh</b> <b>Who?</b>

<b>Robin Rombach</b> <b>he's the</b> <b>first author of Stable Diffusion</b> <b>and the current CEO of Black Forest Labs</b> <b>Oh</b> <b>right</b> <b>Flux right?</b>

<b>[laughter]</b> <b>so</b> <b>the investor told me</b> <b>the reason he did this</b> <b>is that this kind of trust</b> <b>is built on your academic work</b> <b>this trust</b> <b>can sometimes even surpass</b> <b>genuine personal</b> <b>connection</b> <b>Oh</b> <b>people get to know you through your work</b> <b>and this</b> <b>carries forward</b>

<b>and can go very far</b> <b>What do you think of Seedance?</b>

<b>Seedance is incredibly impressive</b> <b>Seedance really</b> <b>even our film crew today</b> <b>had something to say about it</b> <b>I think it's extremely strong</b> <b>[laughter]</b> <b>I've heard it's also a very, very large model</b> <b>and an MoE model</b> <b>I don't know if this rumor is true</b> <b>because before this</b> <b>as far as I know</b> <b>nobody had been able to make MoE work</b> <b>within a Diffusion Model architecture</b> <b>if they truly managed to do</b> <b>200 billion parameters</b>

<b>and with an MoE architecture</b> <b>and they were able to ingest all that data</b> <b>I think</b> <b>that's incredibly, incredibly impressive</b> <b>Mm-hmm</b> <b>but for all these generative models</b> <b>90% is still a data problem</b> <b>architecture doesn't matter much</b> <b>90%, or let me say 95%</b> <b>it's all a data problem</b> <b>mm-hmm, their data is inherently abundant</b> <b>but volume alone isn't enough</b>

<b>Mm-hmm</b> <b>they must have done enormous work</b> <b>to clean the data</b> <b>to do captioning</b> <b>to calibrate the data distribution</b> <b>to balance diversity against quality</b> <b>as well as the degree of</b> <b>prompt alignment with language</b> <b>I believe</b> <b>a large number of people must have been involved in this work</b> <b>and done an enormous amount</b> <b>right</b>

<b>but once you've done all these things well</b> <b>subsequent things</b> <b>become much simpler</b> <b>but I think</b> <b>I think Seedance is very impressive</b> <b>I think</b> <b>including Sora</b> <b>including Veo</b> <b>wanting to surpass them</b> <b>I don't think it's necessarily that simple</b> <b>Our studio is called Language and World Studio</b> <b>what comes to mind when you hear that name?</b>


<b>I see you wrote me a line</b> <b>uh, called</b> <b>"let go of Wittgenstein"</b> <b>well, that's not a great way to end</b> <b>I'm going to start complaining again</b> <b>right, go ahead</b> <b>you complain</b> <b>when I say "let go of Wittgenstein"</b> <b>I mean people shouldn't take Wittgenstein</b> <b>and really stretch him</b> <b>taking "the limits of my language</b> <b>mean the limits of my world"</b> <b>and using that quote as an endorsement for LLMs</b> <b>or linguistic determinism</b>

<b>so that's completely absurd</b> <b>and likewise</b> <b>there are other quotes</b> <b>like people citing Feynman</b> <b>Feynman said what I cannot create</b> <b>I do not understand</b> <b>this</b> <b>being used to endorse unified models</b> <b>I think</b> <b>both of these things are really unacceptable to me</b> <b>what's the first thing?</b>

<b>the first is Wittgenstein, right?</b>

<b>when he spoke of the limits of language</b> <b>as the limits of my world</b> <b>there were strong preconditions</b> <b>in his Tractatus Logico-Philosophicus</b> <b>what he discussed there</b> <b>was that</b> <b>the language he referred to is language that can be captured in propositions</b> <b>and the limits of the world it can describe</b> <b>and this does not represent</b>

<b>the entirety of what we call the world</b> <b>[laughter]</b> <b>so</b> <b>first, the language he spoke of</b> <b>and the world he spoke of</b> <b>are already different from the language in today's LLMs</b> <b>and the world it refers to</b> <b>second, the later Wittgenstein</b> <b>had completely overturned his earlier</b> <b>entire philosophical system</b> <b>he stopped saying that</b> <b>and what he talked about instead was</b>

<b>language is actually a game</b> <b>the so-called concept of language games</b> <b>meaning language itself has no inherent meaning</b> <b>these symbols themselves have no meaning</b> <b>the reason they acquire meaning</b> <b>is because they are connected to real-world practice</b> <b>and engaged with it</b> <b>Mm-hmm</b> <b>and this is very much the world model view</b> <b>that is</b> <b>we're not saying</b> <b>that language can perfectly</b>

<b>represent the entire world</b> <b>what we're saying is that the world's practice</b> <b>the world's actions</b> <b>determine the game of language</b> <b>its intension and extension</b> <b>mm-hmm again</b> <b>I don't understand philosophy</b> <b>I don't understand Wittgenstein either</b> <b>but I just don't like seeing papers</b> <b>open with a quote pulled out of context</b>

<b>I think that doesn't fit my aesthetic sensibilities</b> <b>the Feynman quote is the same</b> <b>mm-hmm, he said</b> <b>what I cannot create</b> <b>I do not understand</b> <b>that quote itself is not wrong</b> <b>but the create and understand he's referring to mean</b> <b>for example, we have a world</b> <b>we want to understand this world</b> <b>we want to transform this world</b> <b>we want to understand the world</b> <b>through transforming it</b>

<b>whatever</b> <b>the things he was talking about</b> <b>are still within a real, concrete world</b> <b>requiring some kind of action</b> <b>mm-hmm, even when you're in class</b> <b>you go and make a PowerPoint</b> <b>you're still engaged in a process of creation</b> <b>but now many people take this quote</b> <b>and use it to make this kind of, uh</b> <b>endorsement for some simple unified system</b> <b>that's logically untenable too</b>

<b>we can't simply reduce creation</b> <b>to a diffusion model</b> <b>and its backpropagation loss</b> <b>that's completely absurd</b> <b>mm-hmm right?</b>

<b>so</b> <b>I don't know</b> <b>I think</b> <b>maybe it's like when I was a kid</b> <b>overusing famous quotes in essays</b> <b>now seeing these things gives me a bit of PTSD</b> <b>and I think as Kaiming said</b> <b>everyone should read more philosophy</b> <b>I think that's quite worthwhile</b>

<b>mm-hmm, at the very start you said you believe in fate</b> <b>and believe in it more and more</b> <b>where do you feel fate is pushing you now?</b>

<b>Ah</b> <b>I don't know</b> <b>is fate pushing me?</b>

<b>it doesn't seem like it</b> <b>I think</b> <b>there's no feeling of being pushed by fate</b> <b>mm-hmm just</b> <b>mm-hmm, when the next time I need to make a choice comes</b> <b>I just hope for good fortune</b> <b>Is this world a giant world model?</b>

<b>of course the world is a giant world model</b> <b>can you predict fate then?</b>

<b>uh, I don't think so</b> <b>why not?</b>

<b>Mm-hmm</b> <b>because we don't have enough resources</b> <b>Oh</b> <b>you'd need a computer as large as the Earth</b> <b>or a computer</b> <b>the size of the entire universe</b> <b>to tell you the answer about life</b> <b>about the universe</b> <b>about everything</b> <b>and the answer might ultimately be 42</b>
