
Poetiq - Ian Fischer (CEO) | Stanford Hidden Layer Podcast #104

By Stanford MLSys Seminars

Summary

Topics Covered

  • Walked into Harvard, Got Admitted
  • Unoptimized RSI Beats Gemini 3 by 20%
  • Optimize Prompts and Code Strategies
  • RSI Sufficient for AGI Path
  • Embodiment Builds World Understanding

Full Transcript

You're hill climbing on ARC-AGI 1 with this RSI, DSPy kind of thing, on the prompt only. You get to a final P*, some prompt P*, and then you get access to Gemini 3 and you just plug it in without hill climbing at all. And you put it on ARC-AGI 2, which you also did not hill climb on at all, and you still outperformed...
>> Yeah.
>> ...Gemini 3 by 20 points or whatever it was.
>> Yeah, yeah. So what we showed in our first blog post is exactly as you say, except we only optimized our system, and it was not just the prompt but the prompt and the code base as well. If we optimized for Gemini 3, if we optimized for ARC-AGI 2, maybe we could get it a bit higher than that, but I don't think it's going to be 30 points higher.

>> I see. I see.
>> I think that recursive self-improvement, broadly writ, is a path. And I think we're starting to see, you know, with Poetiq as one example of a way to do this, we're starting to see that there is juice there.
>> Does it get us all the way? I believe it does. But again, writ large, is what we're doing today at Poetiq exactly the right thing? I don't believe that.
>> I found a monastery in France that had a two-week program. I confirmed that I could actually go do it, and so I went to this monastery for a couple of weeks. I told them at the beginning, you know, I'm thinking about maybe staying for a year, and this is kind of a trial balloon. And they're like, yeah, let's see how it goes. At the end of it, they're like, "Yeah, you're welcome to stay." And I was like, "I'm going to grad school."

Welcome to episode four of the Hidden Layer podcast. We're very excited to have Ian Fischer with us.

>> How you doing?

>> Yeah, I'm doing great. Thanks for having me.

>> Yeah, thanks for joining. Well, let's start. We have kind of three sections that we wanted to hit: your background, where you have so much cool research at DeepMind that you've accomplished, including some really heady information theory stuff that we want to get into, and then your thoughts on closing the gap: what does it take for us to get to AGI, and what are the viable paths? You're obviously working on one that you feel pretty passionate about, and I'd love to get into it. So let's start with your background. Tell everyone: who are you, where did you come from, how did you get here?

>> Yeah, so it's kind of a crazy story. My undergrad was actually in music composition and French. So obviously that led directly to AI. No. Yeah.
>> So it turns out that after I finished my music degree and worked as a musician for a couple of years, I realized I wanted to do something else.
>> Did you play instruments?
>> Yeah. During my undergrad I was mostly singing, but I played piano a lot. I grew up playing violin. Terrible at violin, passable at piano. But I did used to sing professionally. I did some operas in San Francisco, not with the San Francisco Opera, but with smaller companies. It's cool.

Anyway, that was a while ago. So after kind of realizing that I wanted to do something else, I said, "Okay, what do I want to be doing?" And there were three things that I had in my mind. I mean, this is kind of a crazy story.
>> Yeah. Well, yeah.
>> No, it's...
>> Crazy? That's about you.
>> Yeah, yeah. So the first thing was, actually, I wanted to walk around the world. And so I planned out a route and thought pretty deeply about how I could do this. You know, I was going to cheat and fly across the Pacific Ocean and things like that, but...
>> Hard to walk on water. There's one guy that figured that out.

>> The one guy. But besides him...
>> So that was thing one. Thing two was to spend a year in a monastery, working on contemplative stuff. And thing three was going to grad school. And so I flew to Europe. Actually, I arranged with a monastery in the US; I got some help lining up a monastery in Spain that I could go stay at for a year. Or so I was told. So I flew to Spain, I went to this monastery, I got there, and they're like, "Who are you? Go away." It was the beginning of August, and it turns out everybody in Spain, of course, goes on vacation in August. And one of the things that families like to do is go stay in monasteries. So every monastery in Spain was full. There was literally no monastery I could go to. So whoever had helped me out, the abbot who had helped me out, maybe could have given me a little bit more information ahead of time, but you know, this was like pre-Google days, right? That's not true, Google existed.

But so that was not an option, and I realized I wasn't really ready to do walking around the world. And so I called up a friend in the Netherlands and said, "Hey, you know, I need to figure out what I'm going to do next. Can I come visit you?" And he's like, "Yeah, sure, come on over." And so I stayed in the Netherlands for a couple of months. I started looking into grad schools and looking at other monastery options. I found a monastery in France that had a two-week program. I confirmed that I could actually go do it, and so I went to this monastery for a couple of weeks. I told them at the beginning, you know, I'm thinking about maybe staying for a year, and this is kind of a trial balloon. And they were like, yeah, let's see how it goes. At the end of it, they're like, yeah, you're welcome to stay. And I was like, I'm going to grad school.
[laughter]
So that two-week period was enough for me to know I didn't want to spend a year at that monastery. I mean, the monastery is great. I think it just wasn't the right time.

So I had a vague idea that maybe I'd go to MIT. And so I went to Paris and got on a plane to Boston. I'd never been to Boston in my life. I flew to Boston, found a place to live, found a part-time job, and started just walking around to the different universities. And, I mean, this is so crazy. I was actually walking to BU to go talk to their CS department or something, and talking to my dad on the phone, telling him what I was doing. And he's like, "Well, Ian, you know, you're in Boston. You should at least go check out Harvard." I was like, "Harvard's in Boston?"
No idea. I was just so ignorant about this stuff. But I did. I ended up going to Harvard. I was interested in either doing more music in grad school, or doing romance languages, or doing computer science, and so I just visited those departments.
>> Because those are very related things.
>> Yeah, super related, right? A natural set of choices.

>> It's usually, when you minor in one, you... [laughter]
>> Uh, yeah.

>> So wait, you jumped to getting into Harvard. So you just randomly walked over to Harvard, and...
>> I literally walked into Harvard. I went to their romance languages department, and, I mean, I don't want to offend anybody, but I immediately felt depressed. The vibe there was very much not for me, and fortunately I picked up on that. So then I went to their music department, which was much better, but they only do musicology and I was more interested in composition, so I was like, oh, this isn't a good fit. Then I walked to the CS department, and it was like, oh wow, a beautiful new building at the time, Maxwell Dworkin. And I walked into an office that was open that had an admin assistant, and I was like, "Hey, I'm thinking about maybe coming here for grad school. Is there anybody I can talk to?" And she's like, "Yeah, there's this professor. Let me see if he's in." And he was. And so she introduced me to him. It was this professor, Salil Vadhan, and he met with me and talked to me. He's such a nice guy. And somehow he thought that I was a reasonable person. And he was like, you know, yeah, you should probably apply for the master's program, and you should talk to the head of the department. And so I talked to him as well, also an incredibly nice person. And so I applied and they let me in. And basically, you know, I walked on to Harvard, got in, and became a CS grad student.

>> That's crazy.

>> And, like, the GRE? Did they do the SAT and all that stuff?
>> I didn't. Yeah, they do the GRE for that. I did it later for the PhD.
>> But yeah, they just let me in. And then, actually, I stayed at Harvard for four and a half years for the master's. That's a little bit long.
>> I did the same thing. Yeah, I was there, like, four years. I double-mastered eventually, because I was in EE and I had so many credits that I was like one or two classes away from getting the CS double master's, because I had RAships and it was free.
>> And I was like, if I just keep getting TAships and RAships, I'm basically getting paid to take classes at Stanford. I'll do this forever until I get my... [laughter]
>> Just get out of here, right?
>> Yeah. So I finally agreed to graduate when one of the faculty I had taken a class from was starting a company. She wanted me to come start it with her. And I was like, yeah, this is really why I wanted to go into CS. I wanted to start companies.
>> Just like I would have done in music.

>> And when did the entrepreneurship bug hit you?
>> Well, it was already there. That was like 2002 when I moved to Boston, and it was already on my mind. So, you know, I graduated in 2007 and moved out to San Francisco to start my first company, which was a small security company called Usable Security Systems. Did that for a few years, like three years, and we sold it to another security company in 2010.
Then I applied to grad school, well, I applied to PhD programs, and ended up going to Berkeley, and at the same time started my second company, which was the YC company Apportable. I did both of those things for a year. That was not fun. I don't recommend this. And then I took a leave of absence from Berkeley to focus on Apportable.
And, you know, we did Apportable for four years, grew to millions in revenue and something like 250 million end users using our tech, and we sold it to Google in 2015.

>> Nice.
>> And what did Apportable do?
>> Yeah, we automatically ported iPhone apps to Android. So basically we were reimplementing all the iOS APIs in Objective-C and then cross-compiling them to native ARM binaries.
>> Oh, not to Java.
>> Yeah. And so our apps were faster and better than the natively written Java Android apps.
>> Because Java is a really bad choice for that.
>> Amen.
>> Yes. If the team is watching, please don't... But yeah, I mean, obviously the phones are so much more powerful now.
>> Yeah. They're doing it all in Swift and Kotlin or something like that now.
>> Yeah, yeah. So the Swift transition was happening just as we were selling. We were doing Objective-C, then we made our Swift stack, and then we went to Google.

>> Nice.

>> Okay. And is Berkeley where you met your wife?
>> No, actually, it was through my company. Everybody else was from Berkeley, had Berkeley connections, and so they all knew her. I feel like I'm not allowed to say her name in public. [laughter]
>> Yeah.
>> But yeah, we met through an event. One of the employees at the company was in a stand-up comedy competition, and, you know, it was "invite all your friends," and so we met at that.

>> Oh, cool.

>> Yeah. And then what happened next? So, you got bought by Google.
>> Right. I got bought by Google, and I realized I didn't want to do cross-platform mobile for the rest of my career. I really wanted to be doing machine learning and robotics. And so, you know, the Harvard thing, I view the Harvard thing as an incredible stroke of good luck, right? There are really two points in my life that I identify as amazing luck: the Harvard thing, and then the second one was my transition into research at Google, where, you know, I don't have a machine learning PhD, I don't have a background in it. Through another series of interesting random events that I won't relay here, I met an intern who was on this new team in research that was doing machine learning and robotics, and I was like, I want to talk to your manager. And I was able to talk my way into Google Research in spite of the lack of background.

>> And was this Brain? Was this DeepMind?
>> Yeah, it was Google Research. Brain was part of Research at that time. I mean, until it moved into Brain, the team was always technically part of Research. But yeah, anyway, this machine learning and robotics team was a really nice opportunity, but it turns out that hardware is hard.
>> And I very quickly realized I really wanted to be doing just machine learning. And so then I switched into a different team and focused on machine learning, you know, information-theoretic stuff, RL, kind of whatever caught my attention. I did that stuff for about seven years, and then LLMs became clearly the thing to focus on.

>> Yeah.

>> And so I started working on how we can build intelligent systems on top of LLMs, and wrote a few papers in that area. And then I realized that there's a really big opportunity to do recursive self-improvement much faster and more cheaply than the other ways that have been explored to tackle that particular concept.

>> And when was this?
>> I realized this in the summer of 2024. I didn't start working on it, of course, until leaving DeepMind, for obvious reasons. But once I realized that this was a possibility, I knew that I had to pursue it.
>> So I left DeepMind in January of this year.
>> Mhm.
>> I can't talk about the fundraise yet, it's not public, but we raised some money in June, hired a team, and here we are about six months later.

>> And then since then, or at least in the last couple of days, it's been quite exciting. There's a big announcement that you had that you should definitely talk about.
>> Yeah. So a few weeks ago we publicly talked about our results on ARC-AGI, which is a very popular benchmark. ARC-AGI is set up so that there's a public evaluation set that you can talk about your results on, but they hold back a private test set, which they call a semi-private test set.
>> It's semi-private because you don't get to see it, but you get to see your score?
>> It's semi-private because they have a semi-private test set and a fully private test set. The semi-private test set is a test set they actually send to foundation model providers.
>> Oh, I see. Right. And so it's not everywhere in the world, but there are some people they've given it to so that they can run their own evals.
>> Not quite. They recognize that it's possible for the data to leak. They have agreements with all the foundation model providers that they won't ever train on it, they'll delete their logs, whatever, right? But they recognize the potential for that data leaking. And then there's the truly private test set, which basically nobody gets evaluated on unless you do the Kaggle competition. The Kaggle competition restricts you to not use the internet at all.
>> Right. So you can't use API models, and that's the one that's restricted to one H100 hour or whatever it is.

>> Yeah, something like that. Yeah.

>> Yeah.

>> And those are all three disparate, like there's not one overlapping across all three.

>> Exactly, as far as we know they don't overlap. So we released our results on the public eval set, showing what you get using the system built by Poetiq, our recursively self-improving system. I guess a little bit more context: at Poetiq we're focused on recursive self-improvement, but in our particular way of pursuing it, the exhaust of that system is systems that solve hard tasks. And so ARC-AGI was a really nice testing ground for this, because it has the right properties to show this off. First, it's a very small set of training data. We only ever looked at the ARC-AGI 1 eval data, which is 400 problems, and that's in line with what we normally work with, which is between 100 and a few hundred examples that we optimize our systems on.

>> So that was the first nice property. The second nice property is that it has this official evaluation process where they run your system on data that your system never saw, right? And so it's much harder to solve those: if you overfit on the training data, you're going to do badly on the test data.
>> Right, right.
>> And, you know, the third nice property, of course, is that it's very high-profile.
>> So Gemini 3 releases, and they get to state of the art on the semi-private set, right? And they're state of the art for about three days, and six folks at Poetiq outperform them. I guess, tell us about how you did that. What was your feeling when you went into it? Did you know that this was going to happen, that it was going to be great, or were you shocked by it?

>> Yeah.

>> Yeah. It's really interesting. We were entirely focused on ARC-AGI 1. So ARC-AGI, essentially at this point, each year they release a new data set that's even harder, right? ARC-AGI 1 was the only thing for a number of years, but now this year there's ARC-AGI 2, which is much harder than ARC-AGI 1. We were really just focusing on ARC-AGI 1. We looked at our results a little bit on ARC-AGI 2, but as we were developing the system we were only looking at ARC-AGI 1. And then Gemini 3 came out. It's a fantastic model, and it's a major advance to the state of the art for foundation models on ARC-AGI 2 in particular. So they have two points now on the ARC-AGI 2 plot. One is Gemini 3 Pro, and the other is Gemini Deep Think, and Deep Think is a system that they built on top of Gemini 3 Pro that costs a massive amount. So, okay, another nice property of ARC-AGI is that they really care about price-performance, right?
>> Which no benchmark looks at. Like GSM8K, it doesn't matter if you pay a little bit or a lot for the performance, right?
>> Yeah. And I think that that's a very correct thing to do; it shouldn't be at any cost.
>> For sure.
>> So Gemini 3 Deep Think is getting a really unbelievable score of 45%, but at like $77 per problem. So it's quite expensive. And Gemini 3 is getting like 31%, but at about 81 cents per problem. So Gemini 3 is a very, very good model, and I don't want to detract from that in any way whatsoever. We're very excited about Gemini 3's launch. So we already had a system, and it had actually been developed using GPT-OSS-120B, right?
>> So you kind of hill climbed on ARC-AGI 1. This is my own vernacular. You're hill climbing on ARC-AGI 1 with this RSI, DSPy kind of thing, on the prompt only. You get to a final P*, some prompt P*, and then you get access to Gemini 3, you just plug it in without hill climbing at all, and you put it on ARC-AGI 2, which you also did not hill climb on at all, and you still outperformed...
>> Yeah.
>> ...Gemini 3 by 20 points or whatever it was.

>> Yeah, yeah. So what we showed in our first blog post is exactly as you say, except we only optimized our system, and it was not just the prompt but the prompt and the code base as well. We only optimized the system on ARC-AGI 1, we only optimized it using GPT-OSS-120B, and then we tried it out on ARC-AGI 2 with Gemini 3. And we have the entire Pareto frontier of cost versus performance, where we're dominating it, and I mean that in the technical sense, not the derogatory sense: we're above it everywhere, at every point.
>> It makes so much sense that this would work, because, you know, Noam Brown has this whole thing about how the LLM learns all the distributions, and then you can condition the distribution with the prompt, and so you're learning the conditional distribution that will maximize for the task.

>> Yep. And of course that makes sense to do. And I do think that prompt engineering, quote unquote, is a bit like going back to feature engineering, like in computer vision where it was HOG and SIFT and SURF descriptors. It's a very hand-engineered thing: you play around with the SIFT code, and then you bucket more along the circle, and there's a little paper.
>> Yeah.
>> There's a CVPR paper and you get a little bit of a bump. And that's kind of what's happening here, but it's not publishable. That's the only difference: you're not going to be accepted for that. But that's basically what's happening. Prompt engineering is feature engineering. We're just going back, and now you're saying, get out of the way.
>> Yeah.
>> That's the lesson of the last, yeah, 13 years: get out of the way, right?

>> And so you're doing that, and if you believe that Sam Altman is right and the API models will only get better, they just keep getting better, it's true, it's skating to where the puck's going. Then the only thing you can really do, the only change in the action space that you have, is changing the prompt. And if you want to use this power, you have to do that in a really clever way. And this seems like the most clever way. And so it makes so much sense that it works. But I haven't heard a lot of people really doing this, to be honest. Like, the DSPy paper was big, and then the results were a little bit, eh. It didn't take off like, you know, I'm sure the DSPy authors hoped. And I guess you guys have had some major insights since then to make it work.
>> Yeah. And so, I think DSPy is really excellent work.
>> Yeah.
>> I'm also happy that it seems to have overlooked some insights that we had, insights that we're keeping proprietary, because, you know...
>> Oh, you're not going to tell me.

>> Yeah. Well, whispered in your ear... Yeah, you know, this is the kind of stuff we have to protect with trade-secret-type protection. But the only refinement I'd make to what you described is that it isn't just the prompt. Ultimately the thing that you send to the language model is a prompt, but you can actually extract reasoning strategies as code. And the really nice thing about code is that it will always execute your reasoning strategy correctly, right? So if you want the model to do backtracking, you could ask it to do backtracking in text, or you could implement a backtracking algorithm. If you ask it in text, maybe it will do it; if you do it in code, it will definitely do it. And so by being able to optimize both over the prompt and over the reasoning strategies as code, where these reasoning strategies are quite reusable, you can get really strong results.
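To make that concrete, here is a minimal, hypothetical sketch of what jointly hill-climbing over a prompt and a code-level reasoning strategy could look like. This is not Poetiq's system (those details are proprietary); the `score` and `propose` callables, the candidate representation, and the step budget are all assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    prompt: str         # instruction text eventually sent to the model
    strategy_code: str  # source of a reasoning strategy, e.g. a backtracking search harness

def hill_climb(
    seed: Candidate,
    score: Callable[[Candidate], float],             # assumed: runs the candidate on ~100-400 training tasks
    propose: Callable[[Candidate, str], Candidate],  # assumed: asks an LLM to edit the prompt or the code
    steps: int = 200,
) -> Candidate:
    """Greedy hill climbing over (prompt, strategy_code) pairs:
    keep an edit only if it improves the training score."""
    best, best_score = seed, score(seed)
    for _ in range(steps):
        field = random.choice(["prompt", "strategy_code"])  # edit one component at a time
        candidate = propose(best, field)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

The point of the code half of the search is exactly what is said above: a strategy expressed as code (say, explicit backtracking) executes the same way every time, whereas a strategy expressed only in the prompt may or may not be followed.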

>> That's cool. I had not heard of that, but it kind of makes sense. I mean, in computer vision, if you really want to improve accuracy in exchange for cost, you would take a trained Mask R-CNN kind of model and do test-time augmentations: you take the image that came in, that hit the API, get all crazy with augmentations or whatever, and then do some magic over it. And I haven't really seen something like that done with LLMs, so it sounds like that's kind of it.

>> yeah it's an interesting way of looking at it.

>> Yeah. Yeah.

>> Certainly there's a very large space to search over here, of prompts and programs, and each program will have a different set of prompts. And so how do you do that well...
>> Yeah.
>> ...is, I think, an important question, and kind of core to our technology.

>> Mhm. That's cool. So anyway, I want to get into your research and stuff, but before I do, on the ARC-AGI results...
>> Yeah. So we released our public eval results, and this ended up getting a lot of attention.
>> It crossed human level, the first time that had ever been achieved.
>> Right, right. So on the ARC-AGI website they state that the average human on ARC-AGI 2 gets 60%.

>> And the public eval and the semi-private test are supposed to be very well calibrated, so presumably on the semi-private test it's also 60%.
>> Mhm.
>> And we were showing 65% on public eval with our best model, which is a model that we ultimately didn't share with ARC-AGI, because, you know, we want to keep some stuff as a startup. That was a mix of multiple different models. It was Gemini 3 and GPT-5.1.
>> Interesting.
>> What we ended up doing...
>> Claude?
>> Well, Claude came out right after. You know, Opus 4.5 came out right after Gemini 3.
>> Yeah.
>> We did some experiments just with Opus 4.5, and it is a very good model. We found that the results were similar to Gemini 3, maybe not quite as good as Gemini 3, and the cost was about twice as much.
>> I see.
>> But I actually think, you know, we have not done the experiments, we haven't had time, but Gemini 3 plus GPT-5.1 plus Claude Opus 4.5 is probably even better than our current mixed model.

>> Yeah.

>> What we ultimately gave to... Well, okay. So our first post got a lot of attention on Twitter, and the ARC Prize people reached out to us and said that they did want to go ahead and try to verify our scores, because they were, you know, hard to ignore, right? Either we're making this up, or we have something that's worth paying attention to.
>> Yeah.
>> So, as I said, we didn't want to give them the slightly more sophisticated approach with the mix of models, and so instead we found a similarly powerful configuration based just off of Gemini 3 that we had them test. That also got 65% on public eval, and on the semi-private test it got 54%.
>> Nice. So, you know, it's not too far off.
>> Yeah, yeah. But it's, I mean, plus or minus five, right?

It's uh >> Yeah. But this is the first one that was

>> Yeah. But this is the first one that was that high. So like they probably

that high. So like they probably haven't, you know, really, you know, been able to calibrate it out.

>> We don't know you, we don't know if this is variance or if this is like a calibration issue. Yeah. What we do know

calibration issue. Yeah. What we do know is that like we didn't train on R2 as I said. So we like our system can't have

said. So we like our system can't have over fit on >> AR2. Yeah.

>> AR2. Yeah.

>> Um and uh the underlying model Gemini 3 actually you know its public eval number was 27%.

The semi-private test was uh 31%. So it

actually did better. So it's also not overfit. Yeah. So it doesn't leave too

overfit. Yeah. So it doesn't leave too many.

>> I don't even think it's possible. The way I understand ARC-AGI 2 and the improvements over 1 is that you could overfit on ARC 1: if you just did something like np.rot90, you would get some amount of score on both train and test, because there are rotation tasks in both. In ARC-AGI 2, every program is completely independent and doesn't carry over from one task to another, so if you overfit on the train set in ARC-AGI 2, you will strictly do poorly. So the fact that you did well means that you didn't overfit. It can't be overfit.
>> Yeah, yeah. So it seems like we just don't know what it is. We may see evidence in the future. One hypothesis is about the harder problems in ARC-AGI 2: since nobody has gotten scores this high before, we just don't have any data saying that it's still well calibrated for AI systems at this performance level. So possibly, again, we don't know, but possibly the hard test problems are harder than the hard eval problems, and we just didn't realize that.
>> Yeah, yeah.
>> There could be other hypotheses as well. But the fact of the matter is, this was nine percentage points better, so like a 20% increase on the previous state of the art, and it was at half the cost.

>> And it was handicapped in at least three dimensions, right? I mean, you didn't train on ARC-AGI 2, you didn't fine-tune Gemini, and you didn't finish hill climbing, right? You could have been hill climbing more, and you kind of just said, "All right, it's good enough, let's get it out." So for those three reasons, and I'm sure there's even a fourth one: if you mix agents even more, you would increase cost, but clearly that curve could come up even higher. Do you think that it can go all the way to, like, 98?
>> No, I don't. So, with the current models and the optimization that Poetiq has done, I don't know where it saturates, but early indications are that, with Gemini 3, it seemed like it saturated on public eval around 67%. So if we optimized for Gemini 3, if we optimized for ARC-AGI 2, maybe we could get it a bit higher than that, but I don't think it's going to be 30 points higher.

>> I see. I see. And I guess, I wanted to get to this in the last section, but it's a good time to talk about it now: what's required to get the other 30%? Do you think that RSI is the path, or do you think we need to get off the entire highway of LLM pre-training, post-training, RLHF, and then plus Poetiq, which is like the fourth stage of training in some way? Do you think that that's a sufficient path, or do you think a massive detour is required?
>> Yeah. It's a question very close to my heart as well. I think that recursive self-improvement, broadly writ, is a path. And I think we're starting to see, with Poetiq as one example of a way to do this, we're starting to see that there is juice there.
>> Does it get us all the way? I believe it does. But again, writ large, is what we're doing today at Poetiq exactly the right thing? I don't believe that.

>> Yeah.
>> Right.
>> But RSI broadly is the right...?
>> So I think that it is an approach, and the nice thing that RSI has is that, if done well, it should be very fast and very efficient, much faster and more efficient than a bunch of smart AI researchers.
>> Yeah, you're getting all this stuff out.
>> Yeah, yeah. But that doesn't mean... I actually believe that RSI, if it does its job well, will bring in other things. Bring in some of the ideas that you might want to discuss, like: does it need to bring in RL, or does it need to bring in multimodality, or does it need to bring in more classical AI things, right? It could be any of these things, but RSI broadly should be able to find the things that help it continue to improve.

>> Yeah. So maybe we should talk about it, because we're already kind of on the topic now. So, you know, there are a couple of schools of thought. There's the "next-token prediction is enough" kind of Ilya Sutskever-ism, if you will. There's the LeCun-ism: world models plus self-supervised learning, the cake analogy where the cake is unsupervised learning, the frosting is supervised learning, and the cherry on the cake is RL. And then there's the Sutton-ism, which is "RL is enough," reward is enough, just give a reward and everything works. And I guess there's the RSI kind of school. So what are your thoughts on all those?
>> Yeah. You know, it's nice to be able to make a firm stand for one particular thing being all that you need, whether it's attention or the cherry on top or whatever. But I suspect that they would all agree, and I don't want to put words in Yann LeCun's mouth, but for me personally, I think that it's going to be a combination of things. I think that RL is really good at some things, and I think that unsupervised learning is really good at some things, and I think next-token prediction is really good at many things; it's pushed us very far very quickly.

>> But, you know, I think that embodiment is likely to be... is it a necessary condition for deep world understanding? I don't know. It might not even be a sufficient condition, but it's hard to believe that it won't help a lot.
>> Yeah.
>> Right. And, yeah, close to my heart with an information-theoretic background, I believe that compression is really important.
>> Yeah.
>> Compressed representations, compressed world models, I think are really important. Again, I don't know if they're necessary or even sufficient. I suspect they're not sufficient, and they may not be. I mean, it's hard to believe they're not necessary.

>> Yeah.

>> No, actually, I could believe that they're not necessary if you have infinite compute and infinite storage.
>> Yeah, yeah.
>> Yeah. I mean, you know, you're kind of doing this... I don't know, it's a non-gradient-based update rule. Whatever is happening, for sure, because you don't have access to the weights, so how are you computing gradients? Right, right.
>> And, you know, non-gradient-based update rules are in the realm of Chollet-ism, right? Now you're talking about program synthesis and program search.
>> Yep. So for sure you're searching, right?

>> I guess, talk to me about this neurosymbolic versus symbolic search procedure stuff. What do you think the role is for program synthesis and program search, as Chollet believes?
>> Yeah, absolutely. So my last paper at DeepMind was Evolving Deeper LLM Thinking. We were showing that you could directly evolve in the solution space using LLMs, and LLMs have very good inductive biases for this. If it's the right type of problem, they can have very good inductive biases, and that allows you to throw out... you know, what's hard about genetic algorithms? Well, you have to come up with a representation, your code, and the choice of code will make all the difference. Once you have your choice of code, you have to come up with your mutation function and your crossover function. And these three things are what make genetic algorithms hard.
>> Yeah.
>> For your representation, if you're using an LLM and you're just doing it directly in solution space, and solution space is natural language or structured natural language like JSON, then that's your representation. Just do that, and LLMs are really good with language.
>> Interesting.
>> Right. And then mutation and crossover, again, it's just the LLM. You just give it some number of parents, like zero, one, two, five parents, and the LLM will look at all that information and decide what it thinks is a good change to make.
>> Right.
>> And normally in genetic algorithms, you know, you're doing some random search; there's no gradient to go off of, so it's really slow and expensive.
>> Yeah.
>> But the LLM, if the problem is a good fit for the LLM, is going to be biased in the direction of good next generations. Yeah, like good children.
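As a rough illustration of that idea, here is a minimal sketch of an LLM-driven evolutionary loop in the spirit described here; it is not the actual DeepMind implementation. The `llm` and `fitness` callables, the prompt wording, and the population sizes are placeholder assumptions.

```python
import random
from typing import Callable, List

def evolve(
    llm: Callable[[str], str],        # assumed: takes a prompt, returns a candidate solution as text
    fitness: Callable[[str], float],  # assumed: scores a natural-language or JSON solution
    task_prompt: str,
    generations: int = 20,
    population_size: int = 16,
    num_parents: int = 2,
) -> str:
    # The representation is just the solution text itself: sample an initial population directly.
    population = [llm(task_prompt) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: population_size // 2]
        children: List[str] = []
        while len(survivors) + len(children) < population_size:
            # "Mutation/crossover" is another LLM call: show it a few parents and ask for a better child.
            parents = random.sample(survivors, k=min(num_parents, len(survivors)))
            prompt = (
                task_prompt
                + "\n\nHere are some candidate solutions:\n"
                + "\n---\n".join(parents)
                + "\n\nPropose an improved solution."
            )
            children.append(llm(prompt))
        population = survivors + children
    return max(population, key=fitness)
```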

>> Yeah. This is a really clever version of meta-learning. I've heard a lot about "learning to learn by gradient descent by gradient descent," those kinds of things, and they just never really got me very excited. But, you know, if the model can do everything, one of the things it should be able to do is make itself better. That's the coolest skill to learn.
>> Right. Right.
>> You know, I think that's really cool.

>> And there are a couple of papers I really wanted to double-click into that I just thought were super cool. I guess maybe going through them backwards might make sense. We can talk about maybe Director and some RL stuff.
>> Yeah, yeah. So Director is probably my favorite RL paper that I helped on. This was with Danijar Hafner. And this is, in my also humble opinion, the first example of hierarchical reinforcement learning actually working.
>> Yeah, we did it entirely from first-person views, on tasks that other, earlier papers had solved with a combination of first-person views and top-down views, in these environments where you're trying to solve a maze, right?
>> So the top-down view just tells you the direct answer, if you know how to interpret it, but with the first-person view you actually have to search the space, and you have to have some type of memory that you're updating.
>> Yeah, exactly. You have to have an idea of where you are in the world, and you have to know... the reason it's hierarchical, so specifically I'll talk about the ant...
>> You have to learn SLAM, almost. There's SLAM and there's low-level motor control, right?
>> So you're trying to control this ant robot to go through the maze, and we did this entirely from first person. So the algorithm had to learn to control the robot, and it also had to learn to navigate the maze. And it did this by having the top level of the hierarchy generate a goal in a latent representation space, and then, conditioned on that goal, the lower level of the hierarchy does Dreamer-type rollouts, like Danijar's work on Dreamer. And actually both layers do Dreamer rollouts.

>> Look-aheads.
>> Yeah, yeah. So look-aheads in imagination space, right?
>> The look-ahead is in latent space?
>> Yep.
>> Or are they actually in pixel space?
>> So actually, I don't remember if we did pixel space at both levels. I think we had to do latent space for the goal level and pixel space for the other one. It's been a few years; it might be pixel space for both. In another paper that I did in RL, we showed that you could do these rollouts in latent space, in PI-SAC. But anyway, I think that hierarchical RL with world models is a really powerful paradigm that is... I don't know if it's still underexplored, but at the time it was underexplored.
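A rough sketch of the two-level control loop being described. This is not the Director code; the encoder, manager, and worker are placeholder callables, Director additionally trains both levels with imagined Dreamer rollouts, and the re-planning interval here is just an assumed value.

```python
from typing import Callable
import numpy as np

def hierarchical_episode(
    env,                                                     # assumed gym-style env
    encode: Callable[[np.ndarray], np.ndarray],              # world-model encoder: first-person pixels -> latent state
    manager: Callable[[np.ndarray], np.ndarray],             # high level: latent state -> latent goal
    worker: Callable[[np.ndarray, np.ndarray], np.ndarray],  # low level: (latent state, latent goal) -> motor action
    goal_every: int = 16,                                    # the manager re-plans every K steps
    max_steps: int = 1000,
) -> float:
    obs = env.reset()
    goal = None
    total_reward = 0.0
    for t in range(max_steps):
        z = encode(obs)
        if t % goal_every == 0:
            goal = manager(z)        # abstract goal chosen in latent space, not pixels
        action = worker(z, goal)     # low-level control toward the current goal
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```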

>> I love this. I love the Dreamer literature, 1, 2, 3. And we talked about this a little bit right before, but I do think... it's actually surprising to me how few people, even just at NeurIPS, know about diffusion policy, which is a huge paper in robotics, but it is model-free, and so it doesn't have a model of the environment. And then diffusion MPC was doing model predictive control, but for non-differentiable environment state dynamics, from this guy Stannis out of DeepMind. And I just see that as the complete right answer. I can imagine tabula rasa learning, where... diffusion policy was a bigger deal than AlexNet just measured by the improvement on the core benchmark they were going after: AlexNet goes from a 30% error rate to like 11% error, a 20-point improvement. Rock on. Amazing. Diffusion policy went from 25% to 85%, like a 60-point jump. And all it is is: here's the trajectory that some human gave me, I diffuse it a little bit, and the model's job is to denoise it. And then I roll out 16 steps, I take eight, and I look ahead; I roll out 16 steps, take eight, and do that again. And it just crushes every benchmark, from only 80 to 100 expert examples. Now add a world model on that, and I can sample initially from a uniform policy, which is like a child doing this, you know?

>> Right, this is like a random policy.
>> Right. And zero-reward trajectories are uninformative for my policy. However, they're very informative for my world model.
>> Yes.
>> And now I'm developing my world model, and now I don't have to go to the environment to learn my action policy anymore. If I had a perfect world model, I could just imagine.
>> Yep.
>> And that would be sufficient. I can actually just learn by thinking, by dreaming, right? And that's way more efficient than doing the thing, or crashing my car.
>> Right. Anything that you want to do in the real world that is expensive and/or dangerous, you should definitely do in imagination.
>> For sure. And so it just makes so much sense to me that that's the right architecture: a good action policy, a diffusion-policy-style thing that isn't one-step. I think the initial Danijar paper was one-step rollouts, predicting at each time step, which is very hard for the model to do, versus predicting 16 steps forward in one shot. Because you have this build-up of errors as you're doing SARSA-style rollouts, versus predicting 16 steps in one shot, the model can be much less accurate and won't have the build-up of errors. And I want to learn much more sample-efficiently. The two major things that we're not good at are intelligence per sample and intelligence per watt. How do we get intelligence per sample? I think you've got to squeeze the most juice out of every example, and updating your world model at the same time is really, I think, the way you get there.
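For what it's worth, the receding-horizon pattern being described (predict a chunk of future actions in one shot, execute only a prefix, then re-plan) can be sketched as below. This is not the Diffusion Policy codebase; `sample_action_chunk` stands in for the denoising sampler, and the horizon and prefix lengths just follow the numbers mentioned above.

```python
from typing import Callable
import numpy as np

def receding_horizon_control(
    env,                                                           # assumed gym-style env
    sample_action_chunk: Callable[[np.ndarray, int], np.ndarray],
    # ^ assumed: given the current observation, denoise a whole chunk of H future actions in one shot
    horizon: int = 16,   # predict 16 steps ahead...
    execute: int = 8,    # ...but commit to only the first 8 before re-planning
    max_steps: int = 400,
) -> None:
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        actions = sample_action_chunk(obs, horizon)  # shape: (horizon, action_dim)
        for action in actions[:execute]:             # executing only a prefix limits error build-up
            obs, reward, done, _ = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                return
```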

>> Yeah, yeah. I mean, you're speaking my language here. So in PI-SAC, I've again forgotten the details at this point, but I think that we were predicting multiple steps in the future as well. We didn't realize that we maybe should throw away half of them. That's a very clever thing.
>> Yeah. I think in Dreamer V3 he starts to do multiple steps. I think it might be like three of those, but he's not using diffusion policy.
>> Right, right. And so it's a different... Oh, was that after V3?
>> Yeah.
>> Yeah, V3 came out right when we were doing the Director stuff.
>> Yeah.
>> But yeah, I mean, world models are very close to my heart. I believe that they're beneficial kind of across the board. And you can argue that LLMs have learned kind of weird, interesting world models. They're not necessarily bad or wrong or anything, but they're something different, because they come from next-token prediction.
>> But it is implicit. So anyway...
>> Yeah, yeah.

>> So, let's talk about... there are two other papers I want to double-click into that I think are really cool: the multivariate mutual information paper and CEB, the Conditional Entropy Bottleneck.
>> Yeah, yeah. So at this point we're in ancient history, years ago. So, the Conditional Entropy Bottleneck. I worked on the Variational Information Bottleneck, which I think was a pretty important paper, but it was a direct variational translation of the information bottleneck into the modern deep learning world of that time.

>> And then I was thinking about this stuff a lot, and I realized that the information bottleneck was kind of in conflict with itself, where it's trying to forget all the information. You know, you're trying to learn this compressed representation, but...
>> This is Tishby's information bottleneck theory. For the audience: you plot the mutual information between the input and each layer and between the output and each layer, and they get more and more correlated, and then you start to forget after the knee in training, right? You're learning to forget what's unimportant. Something like that.
>> Yeah, yeah. So they did that paper in the era of deep learning, but the original Tishby paper was in 1999. The core objective is that you're trying to minimize the mutual information between the input X and your representation Z, while maximizing the mutual information between your output Y and the representation Z. And then there's a trade-off parameter, a Lagrange multiplier beta, that determines how strong the compression is.

>> And so what makes it variational?

>> So the Variational Information Bottleneck says that we can represent the distributions in here; you have multiple different distributions. There's p(y|z), which is needed for the prediction side, and p(z|x), which is your encoder. So p(y|z) is your decoder, p(z|x) is your encoder, and then p(z) is your prior, right?
>> Right.
>> And so you can make variational approximations to one or more of these. If you do it in the right way, and there are different ways to think about it, but the way we like to think about it is: you have the true encoder, p(z|x), which maps from X to Z, and that thing is always true. It might not be good, but it's always true, right? So we pretend that we don't need a variational distribution there. But for the other two, we do need variational distributions. We don't know p(z), the prior distribution over Z for our true encoder p(z|x), because we don't know how to integrate out X, and we don't know p(y|z) either. So we use q(y|z) and q(z) and just variationally optimize them, directly minimizing or maximizing the information bottleneck objective with the variational approximations dropped in.
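Written out, the objective being described is roughly the following; this is the standard form from the information bottleneck and VIB literature rather than a quote from the conversation.

```latex
% Information bottleneck (Tishby et al., 1999): compress X into Z while keeping what predicts Y.
\min_{p(z \mid x)} \; \beta \, I(X;Z) \;-\; I(Y;Z)

% Deep Variational Information Bottleneck: with variational decoder q(y|z) and marginal q(z),
% the IB objective is bounded by a tractable loss over samples (x, y) and encoder p(z|x):
\mathcal{L}_{\mathrm{VIB}}
  \;=\; \mathbb{E}_{x,y}\,\mathbb{E}_{z \sim p(z \mid x)}\!\big[-\log q(y \mid z)\big]
  \;+\; \beta \, \mathbb{E}_{x}\,\mathrm{KL}\!\big(p(z \mid x)\,\big\|\,q(z)\big)
```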

>> And so you do these tricks, and then what happens?

>> Yes. So you do these tricks, and the models get better, right? Your accuracy goes up; you've forgotten distracting information. And so, in principle, you should be able to train on less data, and in principle you should be more robust to distractors in general, so maybe adversarial examples. So we started looking at these with adversarial examples, and we showed that on ImageNet we could get some gains in adversarial robustness. But the primary result in that paper, the VIB paper, was that you get an interesting increase in accuracy using this technique, and you can measure... you can get bounds on how compressed the representation is, so you know that you're actually doing something useful.

>> Cool.

>> Do you have pretty plots like the Tishby arc?
>> Probably not as pretty as Tishby's arc, but yeah, we did have stuff that looked a little bit like that.

>> So then, with the conditional entropy bottleneck, I realized that if you're trying to minimize the mutual information between X and Z and maximize the mutual information between Y and Z, then, at least in a variational setting, these two things are in tension with each other in terms of how they modify the parameters, right? But if instead you say that you want to minimize the mutual information between X and Z conditioned on Y, while maximizing the mutual information between Y and Z, that relates back to the first objective without the tension. The intuition there is that you're not going to tell your model to throw away any information about Y that you're storing in Z. You're only going to tell it to throw away information about X that you otherwise would have stored in Z, you know, useless information in X, right?
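As a point of reference, one standard way to write the two objectives being contrasted here, as a hedged reconstruction in the usual notation rather than a quote from the papers:

```latex
% Information Bottleneck: the compression term I(X;Z) and the prediction
% term I(Y;Z) pull on the encoder in opposite directions.
\mathcal{L}_{\mathrm{IB}} \;=\; I(X;Z) \;-\; \beta\, I(Y;Z)

% Conditional Entropy Bottleneck: condition the compression term on Y, so Z
% is only penalized for information about X that is useless for predicting Y.
\mathcal{L}_{\mathrm{CEB}} \;=\; I(X;Z \mid Y) \;-\; I(Y;Z)

% Under the Markov chain Z <- X <-> Y, I(X;Z|Y) = I(X;Z) - I(Y;Z),
% which is how the conditional form relates back to the original objective.
```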

>> So is it almost like a sparsity penalty? Because you're trying to drop input, right? Thinking back to even early linear regression days, the whole point of sparsity is to drop things in the input that don't matter. And in, say, the LLM world, for producing the next token, on your thousandth rollout, how much does the BOS token matter? How much does "the" matter? How much does "of" matter? Just drop these things, like the punctuation 500 words ago. There are some key words that need to be attended to, but is that the intuition?

>> Yeah, but it's a continuous space, so, if you want to think about it in terms of attention, it's going to reduce the attention paid to the things that are not relevant to the output it's trying to predict. And yeah, so CEB showed that there are actually very substantial gains in adversarial robustness, and also very substantial gains in accuracy on the common domains people trained on then. This was really image-focused, computer-vision-focused.

>> So do the saliency maps look much crisper? Because, say, conditioned on a bird, you'd always see those saliency maps where the bird is still highlighted, but there are a couple of scattered spots here and there, like on a cloud; I don't know why it's highlighting the cloud, but it is. I can imagine this just drops out all the things that don't matter and it's just the bird.

>> Yeah. So we actually studied this in another paper and showed that we could make representations that had these hard drops and still get very good performance doing that. We didn't look at it that way for the VIB and CEB papers.

So I don't actually know. I also believe that, but...

>> It would make sense, because you're basically telling the network to ignore things that don't matter, more explicitly.

>> Right, yeah, I'd say you would hope that would be true.

>> But if in every image of a dog there is grass, then there's some correlation there.

>> There's a correlation. There is mutual information between grass and the label dog.

>> But if there's an equal amount of grass in the cat images, then the symmetry would break it, and if the mini-batches could break that composition, then attending to grass was actually unhelpful for discerning cat from dog, right? And so you'd hope that would break.

>> Right, yeah. But all this stuff was cool and fun. Okay, so you also mentioned the multivariate mutual information.

>> Yeah. Yeah.

>> Yeah. So this is not well-known work, but in this paper I showed that you could actually get a correct bound on the multivariate mutual information, I(X;Y;Z). And this is interesting because, I claim, at least up to the point when I wrote this paper, every objective you'd ever worked with relies on a Markov chain. Mhm. This one says that you're actually going to model the fact that there's mutual information between X and Y, mutual information between X and Z, and mutual information between Y and Z. You have to care about all of that, and you can still bound this in a purely representational objective. So again it's variational, every distribution is a distribution over Z, and you have four terms in this objective: p(z|x), p(z|y), p(z), and p(z|x,y). So you basically have three encoders and a marginal, and one of those encoders I call a joint encoder: p(z|x,y). If you put these terms together correctly, you get to directly minimize the multivariate mutual information, which is, if you imagine drawing a Venn diagram where you have X as a circle and Y as a circle and they're overlapping, and then you have your circle for Z, which is a learned representation, you want Z to just fill in that overlap, that intersection. Yeah. And so bounding the multivariate mutual information correctly turns out to be a very powerful way to target just that.
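For reference, one standard identity for the quantity being described, as a hedged reconstruction of the three-circle Venn-diagram picture; the exact variational bound in the workshop paper may differ:

```latex
% Multivariate (interaction) information among X, Y, and the representation Z:
I(X;Y;Z) \;=\; I(X;Z) \;+\; I(Y;Z) \;-\; I(X,Y;Z)

% Each mutual-information term on the right can be given a variational bound
% built from the four distributions over Z mentioned above: the encoders
% p(z|x) and p(z|y), the joint encoder p(z|x,y), and the marginal p(z).
```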

>> What did you run this on?

>> I ran it on ImageNet. Some of this is unpublished; I did a workshop paper for this and had some preliminary results. But the robustness that I saw with the multivariate mutual information was actually substantially stronger than for CEB, which was already quite state-of-the-art.

>> Robustness to adversarial...

>> Adversarial examples, like the natural adversaries. People may recall the natural adversarial examples thing, where you have an image and you add fog to it and now it can't tell that it's a car or whatever. Yeah.

>> Yeah. That's cool.

>> Yeah.

>> Well, so, you know, I guess: why don't we care about this stuff now?

>> Yeah, I guess, swinging it back to AGI: I did the Tsachy Weissman information theory stuff, and we talked about it a little bit, but the more I apply information theory to my thinking, hoping for inspiration, the more confused I get. So tell me your thoughts on the role of information theory in getting inspiration for AGI, for getting the rest of the distance. You're obviously a very accomplished information theorist in the context of deep learning.

>> Yeah. So, you know, I want to believe that it's necessary.

>> Yeah.

>> But in practice... I would love it if somebody came out and proved me wrong, proved that it is in fact necessary. In practice, I looked at this a lot with various collaborators. We tried to apply CEB and VIB and the multivariate bound on LLMs and never saw meaningful benefits from it. And so I was like, why? What is going on? This is really annoying. Yeah.

>> Because I wanted it to work, right? Doing all this really heady, really fancy math and not getting anything for it. And so eventually I came to the following hypothesis slash conclusion. Okay, I'm going to say some things that I think everybody in the machine learning field kind of knows and agrees on, which is: if you have an infinitely large data set and an infinitely large model and you do SGD, gradient descent, you will converge to the true distribution. You will learn the true distribution.

>> Inductive bias matters to the model, but yeah.

>> Right, well, but that's a question of

how quickly you converge, right? You run through your infinite data set.

>> Naively, if I literally have no connection from one thing to the other... An example that we used to always give as a debugging challenge at focal is: learn the L1 norm with an RNN, and you have to have at least one trainable parameter. If you do this naively, and you just take in a bunch of numbers, like minus one, 6, 3, minus 6, blah blah blah, and you apply, like, Keras and train, it doesn't work, because the ReLU drops things out. With your random adders, if you end up with two positives and then a ReLU, you just drop out negative values, and then you've lost information, and you can train all you want on infinite data and you will never get the L1 norm.
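A minimal sketch of that debugging challenge, as illustrative PyTorch under stated assumptions rather than the original exercise; the class names and single-parameter setup are hypothetical. It shows the failure mode just described: a ReLU applied directly to the raw inputs zeroes out every negative value before the accumulator sees it, so no amount of data recovers the L1 norm, while rectifying both x and -x does, since relu(x) + relu(-x) == |x|.

```python
import torch
import torch.nn as nn

class NaiveL1RNN(nn.Module):
    """Broken: relu(w * x_t) discards negative inputs entirely."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(1.0))  # the one trainable parameter

    def forward(self, xs):                  # xs: (batch, time)
        h = torch.zeros(xs.shape[0])
        for t in range(xs.shape[1]):
            h = h + torch.relu(self.w * xs[:, t])
        return h                            # only ever sees the positive part

class FixedL1RNN(nn.Module):
    """Working: the accumulator sees |x_t| at every step."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(1.0))

    def forward(self, xs):
        h = torch.zeros(xs.shape[0])
        for t in range(xs.shape[1]):
            x_t = xs[:, t]
            h = h + self.w * (torch.relu(x_t) + torch.relu(-x_t))
        return h
```

The point of the exercise is the one being made in the conversation: the fix is a modeling change, not more data.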

Yeah, so there are stupid modeling mistakes that can happen, and I do think there are modeling mistakes in the LLM, where the LLM can't solve, where it's not possible for it to solve, like sorting, for example: if I have to sort a list of six billion numbers and I only have 32 transformer layers to do it, the way to cheat is to roll it out and solve it at test time, which is a solution; but if you say you're not allowed to emit any tokens, you must just emit the answer, it can't do it in that time. So, anyway, largely I agree, I was just, you know, on the induction point.

>> So there, yeah, model for model, setting aside model misspecification: infinite data, infinite model, infinite time, you converge to the true distribution. And effectively, with training LLMs, that's the first time at scale we were really doing this; not quite this, but starting to approximate it in a reasonable manner.

You may remember (this isn't necessarily the case today, but it used to be) that you always did just one pass through your training data, because it takes forever, and from your model's perspective that's equivalent to there having been infinite data. You never saw something twice.

>> Exactly, which is the case for Common Crawl, right? They're doing one epoch.

>> Exactly, they're not doing multi-epoch training. And with, you know, billions, now trillions, of parameters, to a first approximation you're basically going to converge to the training distribution. Your model is sufficiently powerful. It is learning a compressed representation, so there is information-theoretic stuff happening: given how much of its training data we know these things can reproduce, and how much bigger the training data is than the weights of the model, we know there is a lot of compression happening. But setting that aside, at a very high level, SGD is good enough in this setting. You don't need these fancy information-theoretic things, or if you do need them, they're going to help around the edges. I do believe we are learning compressed representations, and information theory has something to say there. But just as a top-level global objective, if you drop VIB or CEB or the multivariate bound into the training for these models, my experience is that you don't do better than just doing SGD, and you don't do worse.

>> Yeah, it just doesn't move the needle.

>> Yeah, it's just overhead.

>> So I guess, closing out: what are your one, two, three beliefs about real, viable shots on goal to close the gap from where we're at now to AGI, to systems that can get well beyond human intelligence? Obviously the RSI one.

>> Yeah, yeah. So I'm obviously a huge believer in RSI.

So there's something that Sergey Levine said recently, or maybe he's been saying it forever, but I thought it was very insightful. He pointed out that if you're only training a model to match human performance, you're not going to get better than human performance. Right?

>> Right.

>> So if you want superhuman performance, you need to do something like RL, where there isn't an upper bound on performance created by humans, right? Like, AlphaGo showed that you can definitely exceed human performance by using RL. And so I think that's actually very insightful for trying to get to ASI.

If you just want to get to AGI, maybe you don't need that. Being as smart as any particular human in any particular field is already very cool. But if you want to go beyond, I think you need something more there. So that's two things. A third thing... I mean, there are so many great options, and I don't want to throw the others out with the bathwater, but I really have, for a long time, had a bias towards believing in embodiment.

>> Yeah.

>> Which is why I did a lot of RL research. So that was kind of the same thing twice, actually. That's kind of cheating.

>> Yay. [laughter]

>> I mean, yeah. I guess, to me, back to the diffusion MPC one: when you're trying to world-model, embodiment to me means you're learning a dynamics function, and one of the variables into your dynamics function for predicting the next time step is your own action. And if you really understand how your action impacts the environment, then you've achieved embodiment. And of course it's important to predict the rollout. If we go back to diffusion MPC as the framework, the only nuance is that it's not just s_{t+1} conditioned on s_t; you have to have a_t in there as well. That's it. That's the only nuance, and then you have embodiment. Is that fair? Or is there more to it?

>> Well, I definitely agree that that's the core of it. But for me, in my head, when I think embodiment, I think about multitask, right? And so you could say that you have embodiment with web pages. You could even say that you have embodiment with just next-token prediction, if you get to look at your logits, if your predicted logits give you some forward prediction. So anyway, there are ways you could imagine having something like what you just said as embodiment in the text space. But for me, embodiment is actually about providing deep understanding of our world, how our world works. And it is interesting, I think, in the web domain, where, yeah, clicking around on websites is a really important part of our world, and that's maybe a more tractable version of embodiment than full-on robotics.

>> Yeah. I mean, do you go all the way to calling embodiment the same as developing common sense?

>> Well, I think for common sense about the world you do need to get real feedback, like, if you trip and fall down, it hurts, you know?

>> Right.

>> And so, yeah, you could phrase it that way: this is part of building up common sense about just how things work in the world, right? Yeah. We have so much physical intelligence ourselves, and as you were saying before we started, a lot of that is probably evolved, right? We have really good priors for some of these things that have to do with, yeah, dealing with our environment.

>> I also

think there's a part of embodiment where, once you learn a dynamics function for yourself, your own state and action, a state might be that Ian is juggling, and then I can hijack that learned program, that embodiment, and project it onto you. I don't need to relearn what Ian does; I can project onto him what I would do, as if it were me. And that's a shortcut, a reuse of weights, a reuse of skill, a massive shortcut that requires much less training and allows me to think from your shoes, to look from your point of view. And that is actually a really important thing for collaboration and multi-agent settings. I think that's actually the trick, so that you don't have to completely relearn what every single agent is going to do. And people do this, actually, to a fault.

They're like, "Well, you know, if I were there, I'd be doing it this way." And

this is like the product 101 never do.

You're not the user, right?

>> And like a bad PM would be like, "Well, I like it this way." And it's like, "Yeah, but what does a user do?" And

then like you roll it out and your user hates it. And it's like, you know, it's

hates it. And it's like, you know, it's like like maybe maybe it's it's a hack.

It's a shortcut and it works most of the time, but like a lot of times it doesn't work. And we're all very different.

work. And we're all very different.

We've all been trained on very different, you know, training data in life. and we've studied music or romance

life. and we've studied music or romance languages and we have very different priors built up. Yeah.

>> Yeah. Right. I mean, this is kind of the mirroring side of embodiment, right? Me seeing you embody something allows me to learn so much about how I could embody it as well. It certainly goes in all directions. And yeah, we're very good at these things, and our AIs are maybe not quite as good yet.

>> Yeah. So, lots to be done there.

>> So, I want to open it up to some questions there if you guys have any.

I know we covered a lot. Um any

questions or thoughts for Ian here?

>> I guess, do you think incorporating looping into the architectures could be beneficial, rather than having everything as is? Because I feel like having some aspect of not just taking in all the data, but being able to almost pseudo-reason within itself, could be beneficial.

>> Like loopy transformers?

>> Recursive refinement loops, like HRM and TRM and all that.

>> Yeah. So I think HRM and TRM are great examples of this with respect to the ARC-AGI stuff. I mean, I played around with these kinds of models myself a number of years back, as many people have, and I think it's really important. Even what I'm doing right now you can view as an example of how important I think this is: the recursion itself is a loop that lets the system learn from its own experience over time. So yeah, that's a great thing to add. That's a better third item than the one I gave, since I basically said RL twice.

>> Yeah.

>> That's cool. I guess that for-loop outside of it is your outer refinement loop, right?

>> Right. So, makes sense.

>> Yeah. What else?

I guess, going back to that: in the case where you want looping, ideally each of your layers is more functional and can be composed such that you can loop them effectively. I was wondering if you had any thoughts on building architectures with more functional priors, for when we're traversing through each of the layers and looping through each of our functions.

>> So what would a more functional prior look like for a particular layer?

>> Well, I feel like with existing transformer architectures, one common theme is that they're not functional enough. When you loop them, they're not able to process their own outputs well enough. And I was wondering if you had any thoughts on building other architectures that might be better at this looping mechanism.

>> Uh, well, okay. Not really. I'm very straightforward about that. But...

>> If you did, that would be a great paper.

>> Yeah.

So, a thing from when I was playing with this stuff (this was pre-transformer days; I was doing image modeling): I found, and probably other people have found the same thing (I haven't paid that much attention to the literature on these sorts of architectures, so forgive my ignorance), that if I just naively computed the KL divergence across time steps... Okay, let me take a step back. What do you need to be able to do when you're looping? You need to have a good termination condition, and so, like, TRM has their termination thing.

>> Yeah. And actually, I haven't looked closely enough to know; maybe they're doing the same thing that I did, but probably they're doing something smarter.

>> It's like a Q-learning thing, with a Q-halt that doesn't actually matter all that much at test time. I think they always run it even if Q-halt says to stop.

>> Oh, really?

>> Yeah.

>> So I don't think it really mattered for HRM. TRM was maybe something different. So, at least the...

>> For TRM, I forget.

>> Yeah. So what I found is that you could quite naively treat your activations as logits of some categorical distribution, and loop until the KL between time step t minus one and time step t of those two categorical distributions was below some threshold. And in the setting I was in, that loop always terminated in, like, three steps. I think if you're solving a harder problem it will probably go longer, but it converges; with just normal training signals, I found that this behavior emerged very quickly. Yeah. And so you could do this kind of thing at each layer, and maybe that gives you what you're hoping for: more functionally focused layers that you can iterate before moving to the next layer. I don't know. Anyway, it's fascinating stuff. So best of luck.
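A minimal sketch of that termination rule, as illustrative PyTorch rather than the speaker's original image-modeling code; `layer`, the shapes, and the threshold are placeholder assumptions. The activations are treated as logits of a categorical distribution, and the loop stops once the KL divergence between consecutive steps drops below a threshold.

```python
import torch
import torch.nn.functional as F

def loop_until_converged(layer, h, threshold=1e-3, max_steps=16):
    # h: (batch, dim) activations; layer: any module mapping (batch, dim) -> (batch, dim)
    prev_logprobs = F.log_softmax(h, dim=-1)
    steps = 0
    for steps in range(1, max_steps + 1):
        h = layer(h)
        logprobs = F.log_softmax(h, dim=-1)
        # KL(p_t || p_{t-1}) per example, averaged over the batch
        kl = F.kl_div(prev_logprobs, logprobs, log_target=True, reduction="batchmean")
        if kl.item() < threshold:
            break  # the distribution stopped changing, so terminate the loop
        prev_logprobs = logprobs
    return h, steps
```

Passing log-probabilities for both arguments (`log_target=True`) keeps the KL computation numerically stable; in the setting described above, the analogous loop reportedly converged within about three steps.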

>> Cool.

Any follow-ups? Any other questions?

Awesome. Well, we went way over, sorry about that. But thank you so much.

>> No, yeah, it was great. Thanks.

>> All right. Well, thanks. Thanks for coming on.
