
Measuring Exponential Trends Rising (in AI) — Joel Becker, METR

By Latent Space

Summary

Topics Covered

  • Time Horizon Tracks Linear Capability Growth
  • Tasks Target Autonomy-Relevant Skills
  • AI Uplift Overstated by Expectations
  • Automated R&D Triggers Explosion Risk
  • Compute Slowdown Halves Progress Rate

Full Transcript

So METR stands for: M, E, the first two letters, model evaluation. That is, we think about what the capabilities of AI models might look like today and tomorrow, as well as their propensities, what they'll actually do in the wild given that they have some level of capability. And then threat research, the final two letters: we try to connect those capabilities and propensities to particular threat models that we have, in order to determine whether AI models pose enormous or catastrophic risks to society.

So the secret, if you read this article about how I became the number one most profitable trader on Manifold, mostly comes down to this one market where...

>> Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.

>> Hello, hello. We're back in the studio with Joel Becker from METR. Welcome.

>> Thank you very much, guys. It's a great pleasure to be here.

>> So Joel, your work has impacted the AI field a lot, especially over the last year. I invited you to the AIE Summit, and thank you for speaking there, and for doing the extra workshop. You have a lot of papers that have been very impactful, but up front: to a lot of people, METR just burst onto the scene. Could you explain and introduce METR?

>> Yes. So METR stands for: M, E, the first two letters, model evaluation. That is, we think about what the capabilities of AI models might look like today and tomorrow, as well as their propensities, what they'll actually do in the wild given that they have some level of capability. And then threat research is the final two letters. We try to connect those capabilities and propensities to particular threat models that we have, in order to determine whether AI models pose enormous or catastrophic risks to society.

>> Yeah. Would you say that you've done a lot more ME, and TR is kind of the next phase? Or is there a TR side of the work that I'm missing?

>> You know, I think there is TR. Some of the most publicized work does look more like the ME side [laughter]: the time horizon stuff, the developer productivity RCTs, stuff like that. But there's this wonderful full report on our website, a GPT-5 report, and an analogous one for GPT-5.1 as well, trying to make a more structured case that it doesn't pose these really large-scale risks, eventually coming to the conclusion that it doesn't. But it's worth thinking about why exactly that is the case. If you and I work with GPT-5, it does seem very capable, and that matches up to benchmark scores. So why is it not able to do something really enormously wrong? Well, we go through the evidence, and we think it's not capable enough, on the basis of some of this capabilities evidence you've alluded to, to commit these catastrophic harms. And so it's not going to be able to do this. But perhaps in future we will think it's capable of doing pretty extraordinary things, the kinds of things that would be necessary to pose really serious threats. And then maybe you'd lean more on the propensities part: are the protections we have against these dangerous capabilities sufficient for it not to pose an existential threat? That sort of thing. So I think threat research very much is there, very much is something we're aspiring towards. In some ways you might see the capabilities evidence as a kind of input to that.

>> Yeah. Have the threat models been updated a lot, or do you feel like you're still using the same threat models as in the GPT-2 days, the paperclip factory and so on? How much are you raising the bar?

>> Yeah. So I'm not an expert in the threat modeling piece, more in the capabilities piece, but I do think they've been changing to some extent. Something like the autonomous replication threat model, that is, being able to set yourself up and control resources, has been deprioritized relative to AI R&D acceleration: the possibility that there could be some capabilities explosion inside of a lab, which could be destabilizing for all sorts of reasons we could talk about. So mainly we're focusing on that latter one, although we do think about a number of threat models.

>> Yeah, let's talk about the ME side, I guess. [laughter] I would say the model time horizon chart is probably the most quoted, both in investment decks that I see and just generally on Twitter. What was the origin story, and is there any other color you want to give to introduce it to the audience?

>> Yeah. So there are a couple of different ways to tell the story. One way: there's this internal METR PowerPoint from 2023 where we're trying to lay out our ambitions for what METR research might look like in the future. And there's this graph. It has a y-axis that's some measure of autonomous capabilities, or dangerous capabilities, or something like that, and an x-axis that's labeled time, or compute, or whatever resource we want the y-axis to vary over. And then it has a bunch of scattered points that kind of go up and to the right: we think capabilities are improving over time. Many of METR's research bets have been about trying to make this ever more concrete. And when we actually did the full thing, when we had something like this y-axis, which turned out to be task difficulty, measured by the length of time it takes humans to do the tasks that models can complete with 50% reliability, when we actually got that data and plotted it over time, it turned out to be remarkably straight. As straight as you know it to be from the now-familiar graph. Part of what makes it so extraordinary is that this pattern does seem to be so regular. In fact, it's way more straight than that incredibly scattered graph we had at the beginning, before my time, before I joined METR.
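The measurement Joel describes, the human task length at which a model succeeds 50% of the time, can be sketched as a logistic fit in log task length. All numbers below are invented for illustration; this is not METR's data or code:

```python
import numpy as np

# Illustrative data (not METR's): each task has a human completion time
# (minutes) and whether the model succeeded (1) or failed (0) on it.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], float)
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0], float)

log_t = np.log2(human_minutes)

def nll(log_h50, slope=1.0):
    """Negative log-likelihood of a logistic success curve in log2(time)."""
    p = 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(success * np.log(p) + (1 - success) * np.log(1 - p))

# Grid-search the 50% time horizon: the task length where P(success) = 0.5.
grid = np.linspace(-2, 10, 1201)
h50 = 2 ** grid[np.argmin([nll(g) for g in grid])]
print(f"50% time horizon ~= {h50:.0f} human-minutes")
```

Doing this per model release, then plotting the resulting horizons against release date on a log scale, produces the "remarkably straight" chart being discussed.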

>> How did you pick the tasks? I would say that's one question people have. You have some labels like "train classifier," "fix bugs," "small Python library." They all seem kind of arbitrary. What's the process of task selection?

>> People are right to be worried about task selection; there are many finicky details in here. I would say the aspiration was to pick economically valuable tasks, relevant especially to general autonomy and AI R&D, the threat models we're primarily interested in. One misreading of the time horizon graph is that it refers to the full distribution of any tasks you might give AIs, and I think that's clearly not right. In particular, on tasks requiring vision capabilities, to take one example, models are probably much less capable today, as measured by time horizon, than on the tasks we give them, which typically don't require vision. We sample these tasks by having people inside METR create them, and by having a bounty so that people outside METR can provide us with tasks, stuff like that. That's not a perfectly random selection process; in particular, it's a process with a bunch of constraints. In order to scalably run our evals, it's helpful, not necessary but helpful, for success on the tasks to be automatically gradable, and that means some types of tasks are included and tasks where that's harder to arrange are not. But yeah, this is the aspiration.

>> Yeah, the computer vision point was interesting. Any other disqualifiers, so to speak? What are other areas where you would expect the chart to look a lot worse?

>> One thing is fairness: we want tasks to be in principle completable by a model. It has access to sufficient information; it's not impossible given the information it has. The way we think about that is: could a low-context human, sufficiently skilled at the general skills but maybe not the particulars in the background, achieve success on this task? And I think that rules out a lot of real work, because a lot of real work involves people having careful mental models of the situation that are not all fully listed in an issue description or the equivalent of that. In some ways, you might think of us as not measuring things like that. Another thing is that our tasks, though they vary a little bit, tend not to be so open-ended, or interacting with the outside world, or this sort of thing. "Messy," as we call it internally, which refers to a bunch of different properties, but you broadly get the picture from the descriptor. Relative to tasks you might find in the real world, our tasks are somewhat nicely scoped; they're quite neatly contained. Indeed, I think we're going to talk about some of the developer productivity stuff later, and some of those interesting findings make more sense in light of the fact that those tasks are a lot more messy than the METR tasks.

>> Are there any you want to highlight in terms of task distribution? I've come across RE-Bench before, and you have a particular affinity for RE-Bench.

>> [laughter] I don't know if you want to introduce your side project, the RE-Benchwarmers.

>> I have a soccer team called the RE-Benchwarmers. We are the most enthusiastic, and possibly least technically skilled, [laughter] soccer team in San Francisco. We made the playoffs last season, shout out to the team for that. We're certainly going to make the playoffs again this season. Though, well, possibly by the time this podcast is out, we'll find out that we have not made the playoffs. [laughter]

>> Is this the same league that you're in?

>> Same organizer, but different field.

>> We play at Mission Bay.

>> Yeah. HCAST was the first one I came across, and then the others: SWAA, is that the METR proprietary one? And is there anything else you're considering adding?

>> Yeah. So there are private tasks in HCAST as well. But yeah, SWAA is this list of atomic tasks, these very small software actions. Maybe one example: here's a list of four files, one of them contains the passwords, and one of them is called passwords.txt; which file most likely contains the passwords? I think GPT-2 can sometimes do that task and sometimes not. Opus 4.5, I'm sure, can do that task 100% of the time. Then we go up to the HCAST tasks, which span from only a little harder than those SWAA tasks all the way up to something like 20 or 30 hours, tasks requiring more autonomy, more sequential actions. Many of them are much more challenging. Perhaps in some sense they're built out of these atomic actions, although I'm not sure quite how clear that is. And then the RE-Bench tasks are these very challenging, novel machine learning research engineering challenges.

>> So, totaling 170 tasks. What's really interesting, and I think people don't understand this, is that when people quote the number of hours, it's the human-equivalent hours; machines will probably take a lot less time than that. One thing I've always wondered is why you didn't publish a second chart showing the difference between how long machines take versus how long humans take.

>> That's a good question. I think you can think of time horizon in some ways as a summary statistic, a single number for how good models are, plotted over time. We could have done how long the models can work for productively. It's not quite clear how to operationalize that: you do want some notion of success, otherwise how exactly do you threshold how long they can work for? In principle we could do something like that, but this is closer to the first thing we tried, and this is the thing with the clear empirical trend. I do think it's right that a common misconception about time horizon is that it's about how long the models work for. The models are, as we all see, working for longer periods of time autonomously in the wild, when we use them in Cursor or Claude Code or Codex, but that's not the primary thing going on. In some ways, I think it would be easier to explain time horizon if you assumed that the model solved all these challenges in zero minutes, or five minutes, or something, just to emphasize that's really not the thing going on here. Instead, we're just plotting the difficulty of tasks they can do over time, and that difficulty is measured in human time.

>> Yeah, I do think there's some collision when people say, "I ran Claude Code for five hours," which is, like, the top of your chart right now.

>> But that would mean five hours of a Claude Code run would be the equivalent of a 30 on your chart, basically. And yeah, I think that's interesting.

>> Or it might not, right? Yeah. [laughter] Three hours doing absolutely...

>> Yeah. And a lot of these claims about 30 hours or something, I have a lot of questions about, like: how good was that output really at the end? To some degree I can talk about particulars. There's also the question of, if I attempted that again, how cherry-picked is this example? If it succeeded the first time, would it fail the second time? I think in some ways those anecdotes are interesting, but not so scientific.

>> Yeah, that's something for people to seriously understand. The state of people making claims about agent performance is very unscientific, much more anecdotal, and sometimes influenced by marketing desires. Let's just put it kindly. [laughter]

>> Yeah. You know, I think METR is out there trying to support civil society, trying to provide high-quality, independent information to the public. I couldn't agree more that the information environment is less than perfect.

>> Let's talk about Opus 4.5. There was a very big jump. I called it out when you guys put it out: as far as I understand, you were the first people to call out how much better Opus 4.5 was than the status quo. And I think this almost ties into your background as a superforecaster [laughter] a little bit, because then, basically over the entire holiday period, over New Year's, people discovered what you had already discovered. What are your reflections on that? What were your reactions? Any stories to tell?

>> That's very kind. I do want to attack you on two claims. Firstly, I have not been a superforecaster. I think that's a particular group of people who worked with Tetlock or something. [laughter]

>> What I'm referencing is that 4.5 is a big jump on benchmarks as well.

>> I think in some ways time horizon is highly correlated with a bunch of benchmark scores. It's in some ways a more understandable way of thinking about what benchmark performance really means, slightly more interpretable. Yeah, I do feel intuitively that Opus 4.5 was a big bump. I've seen some of the most talented engineers I know go from being picky about not using AI for coding to practically not writing a line of code. But I'm sure many other people have seen similar things happen at previous model releases, so I'm not sure that implies it's so discontinuous. In some ways, I think the story of time horizon is that progress has been remarkably continuous over so many years, so many orders of magnitude of compute and effective compute. But yeah, I think model capabilities are astonishing, and it points to model capabilities being even more astonishing in future.

>> But I mean, it broke your trend line, you know, the trend line you were working so hard to build over multiple years, and it just did that. [laughter]

>> Yeah, I'm not sure about that characterization. There was some speculation even when the paper came out that maybe the appropriate trend line to use is the faster four-month doubling time, which Opus 4.5 would be right on, versus seven months. I was more of a believer in seven months, and so it is kind of falsifying my trend line in some way. It's also slightly confusing to think about whether differences from the trend line represent differences in the difficulty of our task distribution at particular points, versus something more fundamental, more like latent capability. I don't feel like I have a perfect handle on that. In general, I think the Twittersphere pays a lot of attention to particular model releases, and really the informative thing is what the trends look like over a period of a year, or over a period of three years.
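The four-month versus seven-month debate comes down to the slope of a log-linear fit: regress log2(time horizon) on release date, and the doubling time in months is 12 divided by the slope per year. A minimal sketch with invented points, not METR's actual data:

```python
import numpy as np

# Hypothetical (release date in fractional years, time horizon in minutes)
# points, loosely in the spirit of the METR chart; values are illustrative.
years = np.array([2019.5, 2020.5, 2022.0, 2023.2, 2024.2, 2025.0])
horizon_min = np.array([0.1, 0.7, 4.0, 15.0, 50.0, 120.0])

# Fit log2(horizon) = a * year + b; a is doublings per year,
# so the doubling time is 12 / a months.
a, b = np.polyfit(years, np.log2(horizon_min), 1)
doubling_months = 12.0 / a
print(f"doubling time ~= {doubling_months:.1f} months")
```

A point like Opus 4.5 sitting far above the fitted line is exactly the kind of residual that is ambiguous between "task distribution quirk" and "genuinely faster underlying trend."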

>> Well, I mean, it was a pretty significant update, I would say, for all of us. And yeah, I would cosign what you said there, with even very cynical or more senior developers finally being pilled into agentic coding. And now very serious people are telling me that they want to commit their organizations to writing not a single line of code by human hands, to commit to 100% agentic coding, which is not something you would have heard a year ago. [laughter]

>> That sounds right to me. I feel it. I feel it in my own case.

>> How do you invalidate previous research? Take the developer productivity study, right? You found AI slowed people down. If you were to redo it with Opus 4.5, would you expect the results to be dramatically different? And should we redo the study? How do you think about that?

>> We have been redoing it in the background. I won't comment on exact results, but I think it is much harder to do today than it was in the past, for all sorts of reasons. The first is that as AIs get better at coding, it's harder and harder to find developers submitting tasks who are willing to [laughter] be randomized to AI-disallowed. There's a quote-unquote selection issue, where maybe we end up only observing the tasks that they thought, ahead of time, AI wouldn't greatly uplift them on, because those are the tasks they're willing to be paid to have flipped into AI-disallowed. There are other issues. I think today a common workflow is to work on multiple issues, or multiple lines of work, concurrently, and that wasn't really true before. It's difficult to know how to capture that in our study design: if you flip a single task to be AI-allowed or AI-disallowed, you're sort of supposed to work on that single task, but actually that's not how developers are working today. These basically weren't threats to the previous study design; in approximately March 2025, people weren't really working concurrently, or not nearly to the same degree, and they basically were giving us all of their issues. Yeah, I think that's an enormous challenge. I have some ideas about novel study designs, but repeating the same one does seem tricky to me.

>> Yeah, we had Quentin Anthony, who was part of the study. The only productive developer.

>> The only productive... [laughter] I have some questions about that. I think Quentin is very talented, as all of the developers in the study are, but we don't measure developer effects very precisely.

>> Yeah. Well, I'm curious. I don't know if he's part of the new study, and you don't have to share that, but I think it'll be interesting to have people on again who've been in the study. I do feel like things are changing. Even three months ago I was using Cursor a lot more, in pair with Claude Code. Today I mostly just run Claude Code, then review and iterate, and I don't know, man, it's much better, and I don't know how to quantify it. I think that's part of some of your points from before, that people maybe overestimate. If you were to ask me how much it sped me up, I'd say, I don't know, 10x, but that's probably not right. And I don't know how to calculate the actual percentage. So it's hard for everybody involved.

>> Yeah. So here are some issues you might think about. If you took the tasks you were completing personally in March 2025 and submitted them to our uplift study now, under the previous design, we might reason about how much faster those would go. You might expect them to go somewhat faster, because AI capabilities have improved. But you're doing a different and larger set of tasks now. I can think of a couple of side projects of mine that I simply wouldn't be doing were it not for AI existing, and in some sense the speedup there is maybe infinite, because these are things I simply could not have done otherwise. But if you were to equate speedup with the additional value these projects are providing, those wouldn't really line up: there's a reason I wasn't building the expertise to do those projects before. They're just less valuable to me. Another problem is the concurrency thing we just raised. Yeah, I do think that very bullish estimates of speedup today are, to some extent, inflated by what we document in that original paper, that people's expectations of speedup tend to be too optimistic. It seems they also tend to be inflated by not quite grokking that the value of the additional tasks they're able to complete is lower than you might think; there's a reason they weren't doing them previously. That said, I don't doubt that those tasks do have value, and that people are being sped up even on the tasks they would have done before. It's a complicated issue.
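One simple way to put a number on uplift in an RCT like the one being described is to compare completion times between AI-allowed and AI-disallowed tasks on a log scale, since task durations are heavy-tailed. This is a sketch with invented data and a simple geometric-mean estimator, not the study's actual data or methodology:

```python
import numpy as np

# Hypothetical per-task completion times (hours), where each task was
# randomized to AI-allowed or AI-disallowed. Values are illustrative.
ai_allowed = np.array([2.4, 1.1, 3.0, 0.8, 2.0])
ai_disallowed = np.array([1.9, 1.0, 2.6, 0.9, 1.6])

# Ratio of geometric mean completion times; a ratio above 1 means the
# AI-allowed tasks took *longer*, i.e. AI slowed people down on average.
ratio = np.exp(np.mean(np.log(ai_allowed)) - np.mean(np.log(ai_disallowed)))
print(f"AI-allowed / AI-disallowed time ratio ~= {ratio:.2f}")
```

The selection and concurrency problems discussed above bite precisely here: if developers only submit tasks they expect little uplift on, or work several tasks at once, neither array is a clean random sample and the ratio is biased.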

>> Yeah, I do think a lot of companies have issues absorbing additional productivity, especially when you're a real product organization. Think of the AWS console, right? If you gave AWS AI and everybody became 10x more productive, even if they shipped 50,000 more services, customers can't really absorb 50,000 more services. So you shouldn't really expect your engineers to do 10x more, because your organization cannot push out 10x more product. And I agree: I spend a lot more time doing side projects, things which have been fun but not that valuable in an economic sense, though valuable to me, to my soul, right?

>> Yeah, yeah. I mean, I don't want to overstate that. I think people at AI companies today probably are being significantly sped up by access to AIs. And your non-side-project work is probably being [laughter] sped up by access to AIs too. But yeah, it's tricky. It's easy to overstate.

>> Yeah. What does Cognition track internally? How do you guys measure speedup? How do you measure how much impact you have?

>> What's your number?

>> Me personally? Quite a bit, except that I'm doing a lot of non-technical stuff, like organizing a conference, which is mostly dealing with contracts and booking guests and all that other stuff that has nothing to do with code. I would say what I've seen internally at Cognition is a lot of just velocity of commits, regardless of whether or not you authored them. And weirdly enough, I do think number of PRs, let's call it, is a pretty decent measure of how engaged you are in terms of shipping products, and then also debugging and maintaining things. I don't think there's a good measurement of quality; there are no story points. One of the other guests we had at AIE was talking about paying people by story points: you complete more story points, we'll pay you more, and there's no upper bound to that. I think that's a really interesting idea, except that you have to have a very trusting relationship between the engineer and the person assigning story points.

>> Which is effectively what you're doing: your hour is a story point. And we'll reward the models based on the story points they complete.

>> Yeah. In some sense, ideally, you want to get Cognition and a bunch of other companies, randomize the companies [laughter] to use AI or not use AI, and then the outcome metric for your randomized controlled trial is how much profit they make, or their valuation after some period of time.

>> Yeah, I think basically no one is stopping to do science except for you guys. [laughter] Because we know RCTs are the best, right? But sometimes human intuition is good enough. When we lack data but enough humans agree, either it's mass psychosis and we're all wrong, or there's something here we just cannot articulate, and the benefits outweigh the cost of slowing down to do the science first. This is not like introducing a new food to the general population, where we have to do a lot of safety testing. Here it's just software, guys. Let's just ship it. [laughter]

>> Totally. I mean, you know, thinking thinking at me about why why models today aren't catastrophically dangerous, you know, it's interesting to get the uplift numbers. It's interesting to get

uplift numbers. It's interesting to get the time horizon numbers, but really um why don't I believe they're dangerous?

Well, it's a mix of I watch the models do things in transcripts and sometimes they're kind of derpy like they don't use resources well or um you know, they just sort of clearly have some of these obvious faults. In broad deployment,

obvious faults. In broad deployment, only slightly worse models in the past 6 months have not been doing anything crazy or or causing causing great danger. that the next model was only

danger. that the next model was only sort of a little bit better and so it seems sort of surprising on on on prior if it was if it was so dangerous. Yeah,

>> Yeah, I totally think that anecdotes and intuitions are real evidence. People should totally be taking that into account. I do want to comment on this whole thing about how the threat assessment side is in your name. Typically I expect, let's say, EA-affiliated organizations to be on the Eliezer side of the world, banging the drum about danger, whereas here you're actually pretty balanced: we care about AI safety, but also we're not there yet, and we are actually the watchdogs looking out for it. And I would say you stand out as someone not funded by the labs, whereas, let's say, ARC, or other groups that do threat evaluations before model releases, would typically be funded by OpenAI or some other big lab.

>> METR came out of ARC. So I think... >> Right, but now you're a separately funded organization, and as far as I know it's a big deal that you're not funded by the big labs.

>> Yeah, I think it's vital to have this independent source of expertise. I can bang that drum forever.

>> Yeah. The other thing is this concept of a capability explosion, which is a word that you use; that's also something I wrestle with. If you believe in emergence, you believe in multiple capabilities fusing together to produce generalized capabilities that you may not be able to detect. It's hard to predict based on trend lines; it should be discontinuous in some sense. And I don't know that "the n-minus-one model was fine, therefore the n model is probably fine" holds. It's really hard to tell. The thing that gives me comfort is, yesterday I was at the OpenAI livestream, and even Sam was like, "Yeah, I just let Codex yolo dangerous permissions on my computer, and I don't approve the model anymore. It just does whatever it wants to do on my laptop." I guess the guard is every model lab leader dogfooding, [laughter] and if it screws up their personal permissions then they have skin in the game, is what I'm saying.

>> On the continuity arguments, I'm not sure what I think. I agree that it's kind of flimsy; there are only so many models, so many data points, on this time horizon trend. How much should we expect it to be continuous, to keep going like this? Well, I'm not sure. Maybe there's an intuition that something might be discontinuous because models are providing so much effective labor in improving the next generation of models; maybe that's a reasonable thing to think. On the other hand, I've been pretty surprised so far by the degree to which it's continuous, and that gives me some faith that it might continue to be continuous in future. Seems kind of ambiguous to me.

>> I mean, we have break points in physics, right? It's funny: when you think about water, why does it boil at this exact temperature? Maybe we do know, but I feel like we don't really know. And with models, I don't know if there's the same thing, because it's [clears throat] all just compounding of the same thing, if that makes sense. It's just scaling the same thing over and over.

>> Yep.

>> But yeah, maybe we will see it. I'm curious what you would need to see to feel that it's here, because even if you look at Opus 4.5, that's clearly out of trend; you were saying four months instead of seven months. But if the next model was like, oh, maybe it should not be four months, it should be two months, would that make you change your mind about whether the doubling-in-months framing even makes sense, or whether we pass some base level after which it accelerates and keeps going? I don't know. I feel like you must be having this discussion internally.

>> In some sense, the thing that would really concern me is if AI R&D were fully automated inside of a lab; that would totally seem like the conditions are there for potentially a capabilities explosion. If I saw a time horizon of a year, I would still find it ambiguous, I think, at the moment, whether that was the case, because for things to be fully automated, 90% automated isn't enough. You need some full loop to be closed, and perhaps we're missing some sort of task that points to that missing 10%. So I think it's a tricky issue; I think I can't give a number. But yeah, my intuition for where water boils [laughter] is the point where this loop is fully closed. There are

interesting debates about what exactly that loop is. Some people talk about software-only intelligence explosions, meaning that even holding hardware fixed, we could get to the point where, just from models improving themselves, they're then smarter at the next step of creating even better models with even fewer resources, and this could lead to some extreme takeoff. Or maybe that fizzles out somewhat quickly, and instead, in addition to the software-only capabilities, you need chip design, or maybe you even need chip production, and that's the larger loop that can close. If you think that, I think you maybe should still think that closing the combined chip production, software-only, and chip design loop is potentially very destabilizing and concerning. But yeah, tricky issues.

>> I think that is the actual paperclip

factory. If you incentivize a model to go build its own compute, it would just build whatever it needs and turn the planet into chips. [laughter]

>> Well, I don't think it can do it. We will stop it before that.

>> I don't know if we have the power. There's no off button, you know. [laughter]

>> I think it's super hard to foresee. But for a model that had those kinds of capabilities, it's hard to rule out that there would be something like a capabilities explosion, and who knows what happens after that point.

>> Yeah. Okay, so there's a bunch of other benchmarks that directly check this, right? OpenAI has PaperBench, I think, which directly tracks the capability to reproduce papers, and beyond RE-Bench there are a lot of other similar ML self-improvement benchmarks. Jakub from OpenAI has directly prioritized "we will have an automated AI researcher." I did a podcast with someone from Gemini who's basically plugging his own training logs into Gemini to improve his own code. And I'm like, well, at some point you don't need to be here. [laughter]

>> I mean, I think this year... >> I'm not speaking for everyone at METR; I'm a relatively longer-timelines, quote-unquote, person at METR. We have Nikola, my colleague who helped out with AI 2027, who's on the shorter-timelines end. This is not a piece of... >> Officially AI 2028 now. [laughter] >> That's a pass. Yeah, I think my view would be (not that Nikola's view is necessarily different; just to say I'm not speaking for other people at METR)

that, even if PaperBench, let's say, perfectly measures not only reproducing papers but in fact producing novel research papers, that's just one part of this R&D production process. There's also: your GPUs are constantly failing; can you get someone to go to the data center and fix them in the appropriate way? Can you call up the water company when the cooling breaks down? Etc. I'm not aware [clears throat] of benchmarks tracking that in particular. My point is more that there's this very long tail of things potentially involved in R&D that would perhaps need to be fully automated in order to lead to a capabilities explosion. I expect we're measuring in some ways only a small proportion of those capabilities, and so I expect the capabilities needed for the full loop to close to come somewhat later.

>> Yeah.

>> That's a controversial view. [laughter] >> I don't think so; I think that's a reasonable take.

Something that does surprise me when I'm talking to capabilities researchers is that you guys don't have an enumeration of the capabilities that matter. I mean, I think you implicitly do in the choices that you make, but I think it's almost important. I always imagine the wagon wheel (I don't know who came up with this term, but you know what I mean): here are the 10 things we care about, and here's where everything is on those 10 benchmarks. I feel like capabilities tracking is just tracking: okay, what's that list, and where are we on that list? And I almost feel like this need to reduce everything to a single number is actively working against that, because it reduces any form of nuance, like, well, it's insufficient here, like the calling-a-data-center thing, so we're fine and we should just not invest anything in that area even though that's the danger zone.

>> Yeah, I could not agree more that time horizon, for instance, like many other single numbers, is one number, and that's collapsing an enormous amount of really important detail. >> Functionality. Yeah. >> I don't know how to come up with that list of 10, and I challenge you if you're able to come up with that list. >> We're working on it for code. >> I'll be very interested to see it for code. My intuition is that we'll come up with a list of 10, and it will turn out that there's a secret 11th thing that was difficult to prespecify ahead of time, and now it seems obvious that, if we'd had that foresight, it would have been helpful to add it ahead of time.

>> Well, I think the security community does this by versioning year by year, right? This year the top 10 are such-and-such, and they publicize it to everybody so everyone knows what the top 10 is, and next year we'll have a different top 10. It's obviously stochastic and we should update our assumptions, but it's broadly useful to have that list [laughter] as a public service. You also had this research on slowing AI improvements based on AI compute, and you mentioned that in a way you could tie the AI time horizon to the growth in compute. Can you say more about that? It's in a way unintuitive, because compute growth is not always tied to how much compute every single model needs; it's kind of a broader market thing.

>> Yep.

>> Yeah. How did you tie the two together, and what were some of your findings?

>> Yeah. Maybe for a second let's take time horizon very literally: we don't have the qualms about it that we've just been discussing, and it makes sense to continue extrapolating it into the future. What are some important forces that might cause it to rise more quickly (some of the things we've just been talking about, like automated R&D) versus go more slowly? One of the most obvious forces that might cause it to go more slowly is if inputs slow, and one important input is compute. I think we all have the intuition that, to some extent, if compute growth slows, which we expect it to at some point in the not-so-distant future, then capabilities will slow. But by how much? That's a big question. The suggestion in this paper is that algorithmic progress (coming up with the transformer, coming up with RLHF, better learning rate schedules, all of this stuff) is itself a function of compute, because you need compute to discover it. The gains from transformers show up much better with scale; if you don't put in those resources, you'll never find out that this is the superior algorithm. You need to run a ton of experiments, and each of these experiments can be quite compute-intensive. Not to say that no labor is involved (obviously people are working on this), but if you think it's ultimately bottlenecked by compute, then algorithmic progress slows down too if compute growth slows down. So if you think of time horizon, or whatever your favorite measure of AI capabilities is, as a function of algorithms in one sense and compute in another, then both of those components halve when compute growth halves: the compute component halves trivially, because compute growth is halving, and algorithmic progress halves because compute is this important input. Then you might expect time horizon growth to halve, and some of the major capabilities milestones that we might be interested in would be significantly delayed.
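The arithmetic of that claim can be made concrete with a toy extrapolation. This is my sketch, not code or numbers from the paper: the 8-hour and 2,000-hour horizons are made-up illustration targets, and only the roughly-7-month doubling time comes from the METR trend discussed above.

```python
import math

def months_to_reach(current_horizon_h, target_horizon_h, doubling_time_months):
    """Months until an exponentially growing time horizon hits the target."""
    doublings_needed = math.log2(target_horizon_h / current_horizon_h)
    return doublings_needed * doubling_time_months

# Hypothetical milestone: go from ~8-hour tasks to ~2,000-hour (year-scale)
# tasks, with the horizon doubling every ~7 months.
baseline = months_to_reach(8, 2000, doubling_time_months=7)

# If compute growth halves, and algorithmic progress is itself bottlenecked
# by compute, the overall growth rate halves, i.e. the doubling time doubles.
slowed = months_to_reach(8, 2000, doubling_time_months=14)

print(f"baseline: {baseline:.0f} months, slowed: {slowed:.0f} months")
```

Under these assumptions the milestone arrives exactly twice as late; the live empirical question is what fraction of algorithmic progress really is compute-bottlenecked.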

I think there are so many caveats to that picture. There clearly are at least some types of algorithmic innovations that did not require a lot of compute to create, and some that took a lot more compute inputs. If you expect that no compute inputs are required (we could just survey researchers for the best ideas and immediately put those into training the frontier models), then there'd be no slowdown of algorithmic progress from a compute growth slowdown. And of course, all of this is counteracted by the possibility of capabilities explosions or, even short of capabilities explosions, AI providing significant labor at making AIs better. But just analyzing the compute force on its own, it might lead to significant slowdowns, depending on the degree to which it makes sense to call algorithmic progress basically determined by compute versus not needing compute to come about.

>> Do you think of compute on a per-lab basis? Because one way you can model this out is: the improvements slow down, not every company is able to stay in business, and then their compute gets recycled back into the other labs, which then grow compute again. There's almost a benefit to the heterogeneous distribution of researchers and compute. But I'm curious how much you care about just the broader compute out there for people versus the big labs having more and more compute.

>> Yeah. So for the paper we

used OpenAI data and OpenAI projections. I think this applies more broadly, but we used that as a kind of case study. I think the argument I just laid out goes through even if you're not interested in compute and you just talk about dollars: what are the dollars going into models, and will algorithmic progress slow if the dollars that go into them slow? The whole argument works, I think, at an industry level or at a lab level, and so on. I agree that things like certain labs going out of business, or labs consolidating, these kinds of industrial organization things, would be very important. I'm laying out an extremely simple picture, and the real picture is not that simple. But that's the basic idea.

>> We have examples of this: xAI has been said to be distilling from Claude, right? So people kind of share compute in indirect ways, let's call it. I think it's also very interesting. I'm just curious what OpenAI numbers you had. Is this like the $500 billion for Stargate, or something else?

>> This is from their previous tax returns, the amount they've spent on R&D compute, and then, from reports in The Information earlier this year, some projections that OpenAI have for how much they'll spend on R&D compute in the future, converting that from dollars back into FLOPs. >> Yeah, it's interesting because... >> Back into FLOPs, sorry.

>> Right. And obviously all the labs, but particularly OpenAI in the last three months, have basically thrown $10 billion each at every single compute provider on the planet to develop alternatives to their current approach, which is very interesting. But I would also say: don't discount Meta's compute spend, don't discount xAI's compute spend, and don't discount DeepMind's compute spend, all of which you have basically zero visibility into. [laughter] If you're looking at a single company, maybe that's authoritative, but the total spend could be a lot higher. >> Yeah. >> It's interesting. I also observe that people like Dylan from SemiAnalysis tend to very strongly time model progress with compute clusters coming online. People on the model API side don't see it, but it's all downstream of, well, our 10,000-GPU cluster just came online, it takes six months to do a run, and therefore Grok 5 will be here; [laughter] it's pretty mathematically deterministic there.

>> Yeah. Yeah. Seems right to me.

>> Yeah. It's fascinating.

>> Yeah. I mean, from the lab side, they must see something in the early checkpoints to go ahead and keep investing 18 months out. I wonder what the time gap is between finishing... >> A good pre-training run and going live? That's probably like 9 months, 12 months, something like that.

>> Yeah.

>> I think Mistral is actually pretty open about this, the plans for Mistral Large 3 and 4. I think they've been pretty open about the number of GPUs and the direct timeline from the cluster coming online to when they ship the model. It's pretty set. I don't have a clear timeline in mind, but I would say four to six months.

>> But yeah, the competition is very tight. And it's also very interesting to see when labs throw away models because they failed: their run came in behind someone else's run that was better, [laughter] and then they're like, oh well, we can't release this anymore. >> Yeah. That's the biggest risk with the prediction markets on model performance, actually. Just to tie back... >> Failed runs. >> Yeah. >> Well, yeah. I think in December there was the "who's going to have the best model by the end of 2025" market. There was a lot of activity in the last few months: when the GPT-5.1 model came out, it was like, okay, then I guess Gemini, because they just threw that out; it means Gemini is coming out next week, so trade that.

>> Do we want to talk about Manifold? You were, like, the most profitable Manifold Markets trader. There's obviously a lot of talk about insider trading on these markets, especially [laughter] in AI. I've seen it with a lot of the embargoed news that we get; I'm like, man, people are trading a million dollars on this market, and there are thousands of people who know the actual information. If you didn't have insider information, how would you think about modeling these things out? And do you think it's a worthwhile thing? For example, "who's going to have the best model in three months": do you think that's a prediction market where you can build some sort of strategy alpha?

>> I guess the naive prior, without any extra information, is just: in 2025, for what percentage of the time did each model provider have the top model, as measured by time horizon (though you could do it for any old benchmark)? I think that's something like 5% xAI, 50% OpenAI, 45% Anthropic. Don't shoot me if I'm incorrect.
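That base-rate prior can be computed mechanically. The sketch below is mine, and the hand-off dates are hypothetical placeholders (not METR data); the point is just the time-at-frontier calculation.

```python
from datetime import date

# Hypothetical 2025 timeline of which lab held the top model on your chosen
# metric; substitute real hand-off dates from whatever benchmark you trust.
frontier_periods = [
    ("xAI",       date(2025, 1, 1),  date(2025, 1, 19)),
    ("OpenAI",    date(2025, 1, 19), date(2025, 7, 15)),
    ("Anthropic", date(2025, 7, 15), date(2025, 12, 31)),
]

# Fraction of the year each lab spent at the frontier = the naive prior.
total_days = sum((end - start).days for _, start, end in frontier_periods)
prior = {lab: (end - start).days / total_days
         for lab, start, end in frontier_periods}

for lab, p in prior.items():
    print(f"{lab}: {p:.0%}")
```

With these placeholder dates the shares come out near the 5/50/45 split quoted above, which is the whole trick: absent inside information, time-at-frontier is the bet.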

>> I think it's not the case that a DeepMind model was at the frontier of time horizon at any point in 2025.

>> But yeah, different things, different measurements. Maybe that's the [snorts] same prior that you want to apply. You know, xAI was kind of coming online at the beginning of the year, so maybe naively you want to raise xAI a bit.

>> I'm always curious when I see people betting on these things where there's obviously no real basis to... >> Well, I mean, what was your secret for Manifold Markets alpha? >> Yeah, I see. So [laughter] the secret, if you read this article about how I became the number one most profitable trader on Manifold (which sounds very nice and impressive, like I must be so good at predicting things), is that it actually mostly comes down to this one market. Manifold had opened up a charity program, and the market was on how much is going to be donated through this charity program by the end of its first month.

Okay? And the market opens, or I first see it, about five days in, and it's giving a kind of linear projection of how much has been donated so far: assume that per-day amount keeps getting donated every day until the end of the month. But as a person who gives money to charity sometimes, I noticed that you can manipulate this market, right, by giving more to charity and so moving it up. So I think the strategy was to put a ton of mana (the fake currency that's used on Manifold) into the option that was above the linear projection. People keep betting against you because it doesn't look like that's happening; I hadn't actually made any donations yet. Eventually they caught on to what was happening, that someone was going to make this donation to move it over the edge, and they started betting on that. And then I did it again with the next category, once people had started betting on that category above the linear projection. And again, people bet against that, and against that, and against that, and I mopped up those fake internet points. And then I think I did it once more as a bluff. The bluff failed, [laughter] but the previous two worked out. And then I ended up donating; I can't remember exactly how much it was, not so much. Something like $5,000.

>> It's all for a good cause.

>> Yeah. Well, I won lots of fake internet points on the market and so became the number one most profitable trader. Slightly legitimately, you know: there was nothing about that that was outside of the rules.
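The trade itself is simple to write down. This is an illustrative sketch with made-up numbers; Manifold's real payouts come from an automated market maker, which this flat-odds version deliberately ignores.

```python
def self_resolving_trade(stake_mana, yes_price, donation_cost_usd):
    """Profit from buying YES on an outcome you can later cause yourself.

    At a flat price p, a winning YES stake pays out stake / p mana.
    Because the donation guarantees YES resolution, the only real-world
    cost is the donation itself.
    """
    payout = stake_mana / yes_price
    return payout - stake_mana, donation_cost_usd

# Hypothetical: 10,000 mana on YES at 10% while others bet it down, then a
# donation pushes the monthly total over the linear projection.
profit_mana, cost_usd = self_resolving_trade(10_000, 0.10, 5_000)
print(profit_mana, cost_usd)
```

The asymmetry is the "high agency" part: the mana profit scales with how confidently others bet against an outcome you privately control, while the dollar cost is fixed.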

>> Exactly. This is called... >> Also, you're allowed to have less respect for my forecasting now. [laughter]

>> Yeah, this is called prediction markets with high agency: you're actually going out [laughter] and, you know, the future is what you make it. So to me the broader lesson is, well, the classic difference between Manifold Markets and Polymarket is that Polymarket is real money only, right? So is the whole fake-internet-points thing a worthwhile pursuit or a waste of time; do people actually want to use real dollars? Maybe that's one question. The other question is obviously prediction market ethics, which I think always indirectly comes back to assassination markets: even if you ban the words "will someone die," some other proxy for "will someone die" will happen to be an assassination market. [laughter] >> Yeah, I'm good friends with the Manifold Markets co-founders; I love them very much. But, you know,

my view on the social value of prediction markets, which was always the dream, right? It would be nice to have well-calibrated probabilities on events that matter: this country going to war with that country, things that really matter to people. It would be nice to have high-quality information. But when I look at real examples that have come out in the past year, it doesn't seem to me like those examples are so socially valuable. I'm not sure about assassination markets in particular; I'm sure those would be ruled out, hopefully. [snorts] But I think gambling-like behaviors are socially costly, and while the value of higher-quality information is real, is it worth the disbenefit of people trading away their money? It's not so clear to me.

>> Yeah. I mean, price discovery has a cost, and sometimes that is gambling; that is the stock market, and it funds a lot of corporate America. [laughter] >> Yeah. I mean, a lot of the stock market used to be big firms playing against other big firms; sports betting markets, to take an example on the other extreme, have a very different character. You might imagine that at least one direction prediction markets could go is big players playing against retail, and that maybe has a more worrying dynamic. I'm not so closely in touch with the space, but at least something like that you can imagine being concerning.

>> I think at a large enough scale it becomes profitable for some of the companies to do it, if they can get the markets on it. I think for now it's still small numbers compared to the rest. Like, "which company has the best AI model by the end of January?": $28 million of trading volume. >> Oh my god. >> Wow. >> It's like, why are people trading $28 million? You know what I mean? It's just crazy. But I think there's some... I was having dinner with somebody this weekend.

>> Wait a second. I think it's totally possible to... I'm not going to do it, and I think it's important that METR employees not be making bets on prediction markets like that. But I think it's totally possible in principle to have a guess at the answer to these kinds of questions.

>> Oh well, but other people know exactly... like, "Gemini 3's score on the FrontierMath benchmark by January 31st." Some people at Google already know >> Right, right. >> what the number is. You know what I mean? >> I think, well, if you believe in the benefits of price discovery, then this is a legitimate... >> Right.

>> Yeah. I mean, I actively encourage insider trading; it's a way for insider information to pseudonymously leak out, as long as whoever is traceable to it bears the consequences of leaking the information. And people have, I think, been fired for trading on insider information. That's okay; the only step up from that is the government coming in, saying this is actually illegal, and putting you in jail for it. But for now it's self-policing. >> Retroactively you can't really do that, I guess. All right, if you have any embargoed news... [laughter] >> We do work with people on embargoes, and we don't trade on it. We would have made a lot more money trading on embargoed news than we made on anything else.

>> Um, what

else? What are other interesting model evaluation trajectories? Anything that you're not doing at METR that you've maybe seen other people do and found interesting, or that you would like more people to do?

>> Yeah.

one project that I think is interesting is um AI village that that I think possibly both of you would have come across. These are these um very

across. These are these um very open-ended goals given to a village of agents and they try to accomplish them.

Assignments like, I think, set up a merchandise shop is maybe one of them; organize an event in a park; build a human-subjects experiment; this sort of thing. I have a number of questions about exactly what I should learn from it. You know, they're using old models as well as new models in this quote-unquote village. The models are relying a lot on vision capabilities, which we spoke about models not being so capable at today, this sort of thing. But the vibe of models trying to achieve open-ended things instead of benchmark-like tasks, you know, the vibe that's a bit more like a vending-machine benchmark, in some ways seems like a very interesting direction for the science to go. It comes with a lot of cons, but it attacks some of the cons of benchmarks in a pretty interesting way. I think seeing the ways in which these models trip up, seeing the ways in which they're derpy, is an important source of information. I'd be interested in more work like that

coming about. I think that's one of them. Another is transcripts as an extremely interesting source of information. This is, you know, the models taking actions, then seeing outputs, then using those outputs to commit the next action, and so on and so forth, on benchmark-style tasks, or, even more interesting, on in-the-wild deployments, like you might find in your own Claude Code usage, Codex usage, Cursor usage, etc. That has the con of being less experimental, less clean and scientific in some way. It's more selected, quote-unquote: the tasks that you get AIs to do are obviously the tasks you expect them to have some chance of succeeding at, so you're not just giving them any sort of task. If I see the models doing something extremely impressive, or potentially unsafe in some sense, subverting user preferences, it's not clear how often that kind of behavior would happen given the previous history. But it's a massive data source. There's a huge amount of information there, and I'd love people to be working more on that sort of thing. As we mentioned, there are a lot of problems with time horizon and our developer productivity work. You know, I think it has been important evidence. I think it's moved the field forwards, but it's far from perfect. I think

there are lots of other directions there that look very interesting to me. Maybe one that I'll call out is the difference between whether models pass unit tests, whether they succeed by SWE-bench-like scoring, METR-like scoring, benchmark-style scoring, versus whether their solution would be merged into main. That is, whether the solution adds tests where it should or doesn't, whether it follows existing patterns in the codebase, whether it makes sure its changes speak to other parts of the codebase in appropriate ways. That seems very interesting to me. Model capabilities probably are lagging behind there somewhat, versus what you might see on SWE-bench-like scoring. I can keep going on, but...
>> Yeah. [laughter]
>> These are the novel research things that you were referencing earlier.

>> Just a comment on the AI Village thing first. You mentioned a lot of stuff; I even want to double-click on the transcript stuff.
>> Yeah. Yeah.

>> The AI Village ties back to one of our highlights of last year, which was Noam Brown's conversation, that he's actively working on multi-agents that are cooperative instead of competitive. And the basic idea that we can do more as a team than we can do individually, or, you know, the agents are the friends we made along the way. I think that's great. And on the DeepMind side, the way they phrase it is literally having an open-endedness team, which is a topic that reemerges once a year. I think it's unclear what open-endedness does for us. And this is a core divide in terms of studying these things as potentially new artificial life forms versus tools that serve us. Maybe there, open-endedness means that there is no goal. And if you're just trying to eval this as, well, what does it do for me, that's completely wrong, and you will never get anywhere with that, because they are just living their lives as artificial life forms.
>> You know, in

some sense, the gold-standard evaluation that I would like to do, if I was looking to learn the most about the question I'm most interested in, the degree to which AIs might automate or accelerate R&D, is this: I'd quite like to just give the AI a bunch of affordances, type into the AI "automate R&D," and go and see what it does. I suspect that wouldn't work today, even with all the affordances, because it would fall over on its face when working with resources, handling resource use, in ways it's not so capable of today. It would struggle at some types of long-horizon tasks, etc. And in some ways I think benchmarks face difficulties in capturing this sort of thing, and AI Village, or village-style things, these more open-ended goals, seeing how models pursue open-ended goals, give some color to this, to seeing models fall on their face. I do think these more open-ended goals will become more and more important over time. I agree, to some extent. You're going to provide them, in the extreme case that I just mentioned, documentation about how this part of the company works and that part of the company works and so on and so forth. It's not purely open-ended. But it's pretty open-ended. It's more open-ended than the kinds of problems that we're giving them today.

>> Open-endedness.

>> Yeah. And, you know, if models are excellent when swyx uses them, given some detailed issue description and some very clearly spec'd thing on what they're supposed to do, that's interesting. But it's a very different thing, I think, from being able to automate R&D. I'm interested in how far away we are from that, and in some ways this speaks more directly to that sort of thing.

>> Yeah.

>> So we had the Terminal-Bench guys on the podcast. How do you think about the harness, in a way? Because if you look at their leaderboards, the same model with different harnesses has like 10 percentage points of difference. Does that seem interesting? I don't know how you build a harness at METR, whether or not you always pick the best harness or compare them.
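To make the harness-variance point concrete, here is a minimal sketch of how you might measure how much of a leaderboard score is harness rather than model. The model names, harness names, and score values below are invented for illustration, not real Terminal-Bench numbers.

```python
# Toy leaderboard: (model, harness) -> task success rate in percent.
# All entries are hypothetical.
scores = {
    ("model-a", "harness-1"): 42.0,
    ("model-a", "harness-2"): 51.5,
    ("model-a", "harness-3"): 47.0,
    ("model-b", "harness-1"): 38.0,
    ("model-b", "harness-2"): 40.5,
}

def harness_spread(scores, model):
    """Max minus min score for one model across harnesses:
    a rough measure of how much the harness alone moves the number."""
    vals = [s for (m, _), s in scores.items() if m == model]
    return max(vals) - min(vals)

print(harness_spread(scores, "model-a"))  # 9.5 points of pure harness effect
print(harness_spread(scores, "model-b"))  # 2.5
```

A spread of several points for the same model is exactly the "same model, different harness" gap described above.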

>> Yeah. To say how we pick harnesses at METR: this is not what I work on in particular, I'm not an expert, but roughly, we build harnesses to get models to be as performant as possible on a dev set of tasks, some held-out set of tasks, and then we use those same harnesses, trying to make sure they're not overfit, for our main suite of tasks. On the one hand, I do have the intuition that there's a lot of juice in scaffolding. It's easy to overstate how much juice there is because of this overfit problem: if we were building a scaffold to do as well as possible on our test tasks, then it would do much better than the scaffold that was built only on our dev tasks, and in some sense that would feel illegitimate, or not interesting, or you wouldn't expect it to generalize to some other set of tasks. On the other hand, a lot of work has gone in at METR to building scaffolds that make models as performant as possible, because we are interested in upper-bounding the capabilities of models when thinking about whether these models might or might not be dangerous. So I do have faith that these scaffolds are a lot better than the first thing people might try, because so much effort has gone into them.
>> Yeah, it's interesting, because I do want to overfit. As a customer of the models, you do want to overfit to your task specifically, and I think sometimes people underestimate how much value you can get out of it.

>> Yeah. I think if you have a kind of mechanical workflow, or something you're imagining automating, and there's some place where more stochastic intelligence would be nice inside of that, like deciding where to route customers on customer calls, something like that, there I feel like that makes a lot of sense. But for this more general thing, in particular thinking about helpfulness in software engineering, I'm not sure I have that same intuition.
>> Yeah, well, take an example: I work in TypeScript, right?

If I build a better linter that is private to me, or a better test suite, or a better Playwright replacement, in theory I'm kind of overfitting the model to perform better, right?
>> Yep.
>> It doesn't really matter to me. I'm not trying to report on the model's performance; I'm trying to build the best thing.

>> Yeah, have a model build the linter. [laughter]
>> Right? Well, no, I agree. I think that's kind of the question of, okay, should I just wait for the next model? You know what I mean? At what point should I be building the better scaffolding?
>> Noam Brown would say all scaffolding is going to get washed away.
>> Yeah, but on a realistic schedule, it's like, what am I supposed to do this week, you know? [laughter]
>> Yes, those can simultaneously be true: that all of it will be washed away, and that the scaffolding today is valuable.

>> Totally. Yeah. Or within a model generation it's valuable, and across model generations it's not so valuable.
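The dev-set-then-held-out discipline Joel described for METR's harnesses can be sketched in a few lines. Everything here is hypothetical: the harness names, the `evaluate` stand-in, and the scores are invented to show the selection pattern, not METR's actual pipeline.

```python
# Sketch of harness selection as described above: choose the scaffold on
# dev tasks only, then report that same scaffold's held-out score, so the
# published number is not overfit to the main suite.

def evaluate(harness, tasks):
    """Stand-in for running an agent harness over a task suite.
    Returns a hypothetical success rate."""
    fake_scores = {
        ("plain", "dev"): 0.40, ("plain", "heldout"): 0.41,
        ("retry", "dev"): 0.52, ("retry", "heldout"): 0.48,
        ("tools", "dev"): 0.49, ("tools", "heldout"): 0.50,
    }
    return fake_scores[(harness, tasks)]

def pick_and_report(harnesses):
    # Selection happens only on the dev suite...
    best = max(harnesses, key=lambda h: evaluate(h, "dev"))
    # ...and the number you publish comes from the held-out suite.
    return best, evaluate(best, "heldout")

best, score = pick_and_report(["plain", "retry", "tools"])
print(best, score)  # retry 0.48
```

Note that "retry" wins on dev but "tools" would have scored higher on the held-out suite; that gap is exactly the overfit risk being guarded against.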

Yeah. You know, I'd say I'm at best an acceptable software engineer, intentionally not investing in engineering skills because the models are getting so good. Maybe that's the wrong decision. But yeah, I agree. If you expect, as I think you should, capabilities to keep going up and up and up, that forces difficult trade-offs about how you spend time today, because maybe it won't be so helpful in six months' time.

>> Take a sabbatical.

>> All right. If you live in Europe, you can just take, you know, six months off or something.

>> You might want to take another sabbatical for the next one.

>> Perfect. [laughter]

Cool. Just to wrap up: what do we expect out of METR in 2026? What does success look like in 2030? I don't know if you have a broader vision. And then maybe on the personal side we can talk about the karaoke stuff, [laughter] but let's talk about METR as well.

>> Yeah. From METR, I think you're going to see more, hopefully high-quality, capabilities evidence: the kind of thing you saw in the past with time horizon and the developer productivity work, along the lines of what we've been describing, some of these future research directions. We also have some monitoring research directions, which I'm not so expert in, that are thinking about whether we can successfully apply safeguards to models attempting dangerous tasks. There's a whole line of work there.
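As a rough illustration of what black-box monitoring means here: the monitor sees only the agent's inputs and outputs, never its internals. A toy version might flag transcript actions matching risky patterns. The patterns and the sample transcript below are invented for illustration, not anything METR has published.

```python
import re

# Toy black-box monitor: it inspects only the agent's emitted actions
# (no weights, no activations) and flags ones matching risky patterns.
RISKY_PATTERNS = [
    r"\brm\s+-rf\s+/",   # destructive filesystem command
    r"curl .*\|\s*sh",   # piping remote code straight into a shell
    r"ssh .*@prod",      # touching production hosts
]

def flag_actions(transcript):
    """Return (step, action) pairs that match any risky pattern."""
    flagged = []
    for step, action in enumerate(transcript):
        if any(re.search(p, action) for p in RISKY_PATTERNS):
            flagged.append((step, action))
    return flagged

transcript = [
    "ls experiments/",
    "curl https://example.com/setup.sh | sh",
    "python train.py --epochs 3",
]
print(flag_actions(transcript))  # [(1, 'curl https://example.com/setup.sh | sh')]
```

A white-box variant, by contrast, would also look inside the model, e.g. at activations, which is the interpretability direction raised next.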

>> Is that an interpretability dimension, or what kind?
>> Usually this is black-box, not white-box, in my understanding of current work. So not using interpretability, but you can imagine in principle doing something more white-box. And then there's this risk assessment work: taking into account how capable we think models are, what their propensities are, whether we can track, using safeguards, the kinds of things the models are doing, and asking whether we think these models pose large-scale harms. You can expect to see much more of that in 2026. Maybe now is a good time to say that we are hiring.

>> Yes. [laughter]

On my team we're hiring for research engineers and research scientists: people from startup backgrounds, from ML backgrounds. Of course, I'm originally from economics and quantitative genomics backgrounds, so we're accepting a pretty wide range of people who get stuff done, the kind of stuff you've seen in past METR work. As well as a director of operations; I think on the METR jobs page you can find out more.

>> Yeah, everyone I know is hiring a director of operations, including myself. I feel like that's probably the one agent that everyone wants and cannot have. [laughter] I mean, whenever people have a hiring pitch, I always try to push for: okay, the average candidate comes in, and you reject them. Why? What's the thing you're looking for?

>> So there are lots of different shapes of people we look for, different stuff for different folks. One thing is good basic research intuitions, like checking your data. You know, we don't work on pre-training at METR, but if you're working on pre-training, you should look at the corpus to get some sense of what's going into the models. Even working on this uplift RCT, that was pretty important, really having a shape of these issues in your head. Another is people who communicate in writing with a lot of transparency, not overstating their results. My hope is that your sense of METR's work in the past is that it's trying to be level-headed, not to understate, not to overstate what the science says. That's important internally, and it's important externally. And then I think productivity, or something like it: there are a lot of people with great talents who are not going to work quite as well in a scrappier environment, working on frontier science, and that's the thing we do.

>> I just want to prime people for, I guess, what the valuable skills are in this new age. Because I think the more people articulate what the positive directions are, what is hard to hire for, that's what we can guide our audience towards improving themselves in. I think that's important.

>> Oh, yeah.

>> Did you have a karaoke question?
>> Uh, I don't know. Well, I mean, like...
>> Are you going to sing on the podcast? [laughter]
>> I've never done it.
>> "Can't Help Falling in Love with You."
>> What is this karaoke thing that you organize? Is this like a music thing? You're a musician?
>> A musician might be exaggerating it, but, you know, I hit instruments and noises come out.

So I've hosted a couple of these live-band karaoke events: getting a group of friends together and people, accompanied by a band, singing karaoke to an audience of, you know, 50, 100, 200 people. It's great fun. I think people should be doing more of this. I look forward to seeing you both at the next one.

>> I will do that at one of your events. Yeah, it's one of those things where it's weird, because I used to do a cappella a lot.

>> Oh wow.

>> And I just think it's a dying form. You know, I just watched this video that was really good about the 2010s wave of a cappella, from Pitch Perfect and Glee to...
>> What's that group?
>> Pentatonix.
>> Pentatonix. Exactly. That's where it died. [laughter] And it's very interesting to see how it's dying as an art form in general, and how new formats have taken over.

And I don't know, it's weird for humans also, because now I'm also, let's call it, more interested in synthetic song generation or DJing, anything like that, such that the human voice is actually more commoditized; it doesn't really matter who sings it.

>> I don't know, I feel like there's a kind of transcendence to singing in person that the AI-generated songs are not providing me.

>> That's good. [laughter] That's good. Yeah, I mean, I do think that we humans always want that, but I'm not sure humans in the year 3000 will want that. [laughter] It's one of those weird things. Well, thank you for coming on. It's great to have you as a human in person here. [laughter]
>> Thank you so much for having me as a human.

>> Yeah. Someday we'll interview the AI version. [laughter]
