
Last Week in AI #228 - GPT 5.2, Scaling Agents, Weird Generalization

By Last Week in AI

Summary

Topics Covered

  • GPT-5.2 Beats Experts 11x Faster
  • Enterprise Revenue Trumps Token Volume
  • World Models Beyond LLM Scaling
  • China Rejects Nvidia Despite Chip Hunger
  • Claude Trained as Sentient Entity

Full Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to lastweekin.ai for our text newsletter with even more stuff we won't be touching on.

I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade.

And I'm your other co-host, Jeremy Harris. I'm back on for the second episode in a row. So kind of exciting.

We're back, and I was just telling Andrey I had something kind of topical. I was using a language model to help me with a codebase that I've been working on, and it's tied to a database, and I just mindlessly pasted code from the chatbot trying to solve a problem, and it [expletive] my entire database. So that's how my Friday is going, you guys.

>> It's got to feel bad.

>> If I'm a little on edge, that's the reason.

Well, you know, it's one of those things you learn the hard way. I'm sure you won't repeat that mistake.

Well, to preview the episode, of course, we'll be starting off with GPT-5.2, just announced yesterday, the exciting news of the week. Other than that, nothing too big, just a variety of stories: some updates on US and China relations, Disney and OpenAI had an interesting business arrangement, and quite a few papers of a variety of types. We've got robotics stuff, some stuff on scaling agents, RL for reasoning, a lot of things. So it'll be a bit of a technical episode, I guess, and we're going to have to get going and try to get through it all in time. So, starting with GPT-5.2:

they announced this just yesterday, and this is meant to be their big getting-back-into-the-leadership-position announcement. So the big deal here was pretty much the benchmarks, right? It is now neck-and-neck with, or competitive with, Gemini 3 and generally smarter.

>> In the Gemini space?

>> Yes, I believe it's Gemini 3 Pro-ish.

So yeah, there's not too much to say on my end here. One interesting thing is it is more expensive. The input for GPT-5.1 was $1.25 per million tokens; GPT-5.2 is $1.75. The output is about 40% more expensive too. So that's pretty unusual. Usually you don't see pricing change much within a model family.
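To make that price bump concrete, here is a quick back-of-the-envelope comparison. This is just a sketch: the input prices are the ones quoted above, while the output prices and workload sizes are placeholders chosen to match the rough "40% more expensive" claim.

```python
# Rough cost comparison for a hypothetical monthly workload.
# Input prices are as quoted above (per million tokens); output prices and
# workload sizes are placeholders, not official numbers.

PRICES = {
    "gpt-5.1": {"input": 1.25, "output": 10.00},
    "gpt-5.2": {"input": 1.75, "output": 14.00},  # ~40% higher output price
}

def monthly_cost(model, input_millions, output_millions):
    p = PRICES[model]
    return input_millions * p["input"] + output_millions * p["output"]

# Hypothetical workload: 500M input tokens, 50M output tokens per month.
for model in PRICES:
    print(model, f"${monthly_cost(model, 500, 50):,.2f}")
```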

One other interesting thing is this has a different knowledge cutoff than GPT-5.1. The previous one was September 30; the new one is August 31st.

So that's kind of interesting as well.

The knowledge cutoff changing in that way perhaps indicates that they're continually training and this is really just like they cut off a point in their

training and it is better than the previous one.

>> Yeah, absolutely. And you mentioned the evals are a big part of the announcement. That's absolutely the case. We know very little about what GPT-5.2 actually is, other than the fact that it builds on the safe completion research that OpenAI did previously, a kind of new alignment technique that they're workshopping, and I think we actually have an episode on that from when it came out. Bunch of highlights from the evals. They've got this GDPval, which, by the way, I was not around when this dropped, so I had to look into what GDPval was: basically an eval of a whole bunch of knowledge-work tasks across 44 different occupations.

This may not be news to listeners if you've been tuning in this whole time.

It was to me. Sounds like a really cool eval actually. The idea being presumably to assess when AI systems are on course to radically change the GDP of the world by automating straight-up white-collar jobs. So here we have GPT-5.2 thinking beating or tying top industry professionals, which is what GDPval measures, on 71% of comparisons. So that's pretty impressive. Human expert judges were actually used for that, so you're not subject to LLM-as-a-judge type errors. And they say, you know, these are obviously the top lines and part of the press release here, but GPT-5.2 thinking produced outputs for GDPval tasks at over 11 times the speed and less than 1% of the cost of expert professionals. So how these things translate in the real world is always the big question, but that's a pretty interesting stat.

Also a 30% lower hallucination rate than 5.1 thinking. And then the other piece was SWE-bench Pro. This is, by the way, a much harder benchmark than SWE-bench Verified, which we've talked about a lot in the past. So to give you an idea, on the Verified benchmark you'll see top models often scoring anywhere from like 70 to 80%, somewhere in that range. I think Claude 4.5 was in the high 70s, 77 or so. Whereas on Pro, performance typically drops to like 40 to 55%. What we're seeing here with GPT-5.2 thinking is that it's at the very top of that range: 55.6% on SWE-bench Pro, where Claude Opus 4.5 hits like 46%. All these things have some error margin, right, because it depends on exactly how you run the test. But by and large, again, on the evals this suggests really good performance relative to the market. We've got to wait for the sniff test to come out, for people to actually play-test with it.

But everything from the needle-in-a-haystack tests to, you know, all the SWE-bench stuff to even some of the image stuff. They've got a cool demo where they show an image of basically a motherboard to the model, and it's going through and identifying all the little components on the motherboard. It's got, like, oh man, the PCIe, the serial ports, HDMI, RAM slots, the chip, all these things. And they're comparing it to what its previous performance was with 5.1. And you really do see an impressive shift in the multimodal capabilities, the image capabilities, too. So that's all part of this release. Again, time to wait for the vibe check and see how it actually works in practice.

>> Exactly. I was browsing and looking for the vibe check in terms of people's, you know, first-person reports of how it feels, and couldn't see much. Based on just the benchmarks, it seems like it should be a pretty notable upgrade. So significantly better on that GDPval benchmark, a fair bit better on these programming benchmarks, on GPQA, on AIME math, and on ARC-AGI 2, notably, it's significantly stronger. So that indicates the abstract reasoning there.

And I think overall, another interesting thing about this one is there's a big focus on business use cases with GPT-5.2. So they have all these screenshots of it doing Excel, it doing project planning. It's, as you said, outputting these vision responses, looking at chips. So part of me wonders if this is also them trying to get back in the race for enterprise >> because absolutely >> they have been losing market share pretty much continuously for the last couple years as far as we can tell >> and you know, the models that are being used in business, like, cost is not as much of a factor. You're just going to pay for the best model >> which for coding in particular has been Anthropic. Anthropic has also focused more on spreadsheets and those kinds of things. So these results kind of make it look to me like they are trying to optimize for that a bit more.

>> Yeah, I mean absolutely. To your point, revenue per token generated is just so much higher in a B2B context, in an enterprise context. That's why, you know, Anthropic has really been threatening OpenAI with the massive inroads they've been making on the enterprise side. They may not be generating as many tokens; that's the figure of merit a lot of these companies like to bandy about. We've seen Google come out and say, "Hey, look at all the tokens we're generating." You know, Microsoft does the same thing. The question is, what is your revenue per token and what is your margin per token? Very closely linked. How much value are you creating per token? And certainly that's something OpenAI is trying to catch up on with this big push.

That's a great point.
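As a toy illustration of that revenue-per-token versus margin-per-token distinction, here is a tiny sketch. Every number below is invented; the point is only that raw token volume says little on its own.

```python
# Toy illustration: two customer segments generating tokens at different prices.
# All numbers are hypothetical.

segments = {
    # tokens served (billions), price per 1M tokens ($), serving cost per 1M tokens ($)
    "consumer":   {"tokens_b": 1000, "price_per_m": 0.50, "cost_per_m": 0.40},
    "enterprise": {"tokens_b": 100,  "price_per_m": 8.00, "cost_per_m": 1.00},
}

for name, s in segments.items():
    millions = s["tokens_b"] * 1000
    revenue = millions * s["price_per_m"]
    margin = millions * (s["price_per_m"] - s["cost_per_m"])
    print(f"{name:>10}: revenue ${revenue:,.0f}, margin ${margin:,.0f}, "
          f"revenue per 1M tokens ${s['price_per_m']:.2f}")
```

In this made-up example the consumer segment generates ten times the tokens, but the enterprise segment earns more revenue and several times the margin, which is the dynamic being described.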

>> Next up, different model announcement.

You've got Runway once again, and they're releasing their first world model. So just, I think, a week ago, maybe two weeks ago, we saw Runway announce their Gen 4.5 video model. Runway is in the video generation space, and now they are producing their kind of Genie equivalent that is meant to simulate physics, simulate robotics, basically a video model being used as a world model. So in this announcement we have this GWM-1 that comes in three variants, GWM Worlds, GWM Robotics, and GWM Avatars, which are of course for different use cases. They are also releasing actual SDKs for the GWM Robotics one. So pretty interesting announcement. I think world models are a little bit more niche, a bit more of a research topic, and in particular providing this robotics SDK is a bit of an interesting play. Not much competition in the space; DeepMind and Genie have really kind of killed it so far. So yeah, exciting to see some more work in that direction.

>> Yeah, and very interesting that this is building on Runway's previous successes. You know, they've got things that are in the rough world model space, or have been for a while. You know, Gen 4.5 came out earlier in the month and was actually competitive with a lot of Google and OpenAI's equivalent models, at least in video arena rankings. So if you've got one company that might manage to transcend the scaling challenges associated with smaller startups that try to compete in the space, the more niche strategy of going after world models does seem like something that Runway could do. It reminds me a little bit of, you know, the whole Yann LeCun world model stuff. Fei-Fei Li, you know, a lot of the people who aren't tied to scaling quite so much have more and more been talking about world models. We'll be talking about a story tied to this later, but it kind of seems like it's becoming a thing. It seems like people are starting to cast about for things other than the LLM scaling paradigm, or the agentic paradigm, to see what else lies beyond.

>> Yeah. And here they are also adding native audio with their video. The release of the avatar model is specifically tailored to having pure dialogue scenes, very good face videos. Another thing that just makes me wonder is whether Veo 3 and Sora 2 coming out are kind of putting a lot of competitive pressure on Runway and other text-to-video, kind of non-frontier labs. So this could indicate also a kind of strategic play, trying to get into slightly more diverse or different areas. On to some quicker stories.

First, we've got Google saying it will link to more sources in AI Mode. AI Mode is their kind of more advanced search feature, and they are saying that the links that the AI-generated snippets are based on will be more prominent, which is presumably due to a lot of click volume going away, and also following an investigation by the European Commission into Google's use of web publishers' content without compensation.

Next, we've got ChatGPT getting more of a product update. It will be able to use Adobe apps to edit your photos and PDFs for free. So, we haven't heard about apps within ChatGPT that much. This was like a big deal where you could build in GUIs and kind of dynamically launch programs to do things. This is a pretty major announcement actually, where from within the app ChatGPT can launch its own version of these Adobe apps to edit photos or edit PDFs. So presumably quite a bit of backstory there of Adobe and OpenAI collaborating.

And last up, we've got Hunyuan 2.0 from Tencent. So this is a large language model with 46 billion parameters, just announced this past week. It focuses on mathematics, coding, and complex reasoning, basically competing in the same arena as Anthropic, OpenAI, and Gemini. It's a mixture-of-experts model, activating 32 billion parameters per inference, and it's now live on the API. So we shouldn't underestimate, I guess, the Chinese frontier labs. I wouldn't be surprised if in the near term we'll start seeing quite competitive closed-source models from China as well, at the Gemini 3 level or so.
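For listeners less familiar with the architecture, here is a minimal sketch of what "activating N billion parameters per inference" means in a top-k mixture-of-experts layer. The sizes and code are purely illustrative, not Tencent's actual design.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Minimal top-k MoE layer: only k of the experts run for each token,
    so the 'active' parameter count is much smaller than the total count."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # route each token to its chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```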

>> Yeah, absolutely. I mean, I think this model release is interesting. The model itself is kind of a nothing burger. I mean, if you look at the performance relative to, like, DeepSeek, for example, let alone Sonnet 4.5 or other Western models, it's just kind of not there. You look at SWE-bench Verified, for example (we just talked about the latest OpenAI GPT-5.2 coming out), and this model is at 53% on SWE-bench Verified, that simpler benchmark, and that's way, way behind DeepSeek; like, V3.2 thinking hits 73%. So far, far behind DeepSeek, which in turn is behind obviously all the big Western frontier models in the kind of high 70s. So on at least software engineering it's well behind, and on token efficiency it's not that great. All the comparisons that they're drawing here are against models like Qwen 3, DeepSeek V3, even GPT-4o. So these are very kind of old models even by open-source standards. But what this really is showing, I think, the real story here, is you've got a large mixture-of-experts model, like you said, you know, 400 billion parameters, 32 billion active, by the way a quarter-million-token context window, so decently hefty, and essentially Tencent is showing they have the capacity to train this. They have the infrastructure and the know-how to be in the game in terms of training these big models. That's really, at least for me, the big take-home. So this is a big infrastructure story; the model in and of itself is a bit of a nothing burger. You can see them trying to get you to be impressed by the comparison of this versus their last model. It's like OpenAI going, "Hey, this is so much better than the last shitty model that we put out," instead of saying, "Hey, this is better than the competition," which obviously OpenAI does do. I think this is a recruitment play. It's a bit of an internal sort of flexing play, to make sure that they're able to perform, to build models at the scale that's needed. And from here we'll see, you know, they might actually iterate and improve and then become relevant at the frontier, or closer to it.

>> Right. Exactly. I think they're probably trying to catch up, honestly, with Baidu and to a lesser extent Alibaba, who are kind of the major players in China. And I think, as far as I know, market-share-wise they are not leading in any sense, but they do have a cloud API to use these things. So it could indicate more kind of intense domestic competition in the Chinese market. And speaking of that, moving into applications and business, an interesting story on China.

It sounds like China is going to try to limit access to advanced chips from Nvidia, despite the US and President Trump agreeing to resume exports to Beijing and lift the ban that was kind of quickly added. So, we don't know too many details here yet. I think it's not even formal; it's likely going to be an approval process, submitting requests to purchase a chip. But yeah, very interesting development in this space.

>> Oh yeah. I mean, so this is kind of a weird, what, a play in three acts or something. You could think of it that way. Obviously, historically, the US government has had export controls preventing Nvidia from sending their latest chips to China and all that. The H200 absolutely was controlled, as was the H100. They could only get the H20 and the H800 out there, a whole separate story. But what's since happened, yeah, Trump came out and said, "Hey, here's the deal. We'll let you get these H200s. We'll ship them out there. You're going to have to pay us, though, 25% of the revenues associated with those sales," in order to do this sort of tariff-like situation. Now, the first

caveat here is that we've basically heard this before. So there was an offer to let Nvidia sell the H20 if it gave the government 15% of the revenues, but that never came to be because the company and the Trump administration hadn't come up with a legally viable payment mechanism. It turns out to be really tricky to get a private company to pay the government in this sort of arrangement. And so that may actually be an issue for this new 25% deal for the H200. So that's a bit of a dangling question mark. But yeah, now you have China all of a sudden, despite spending years saying, "Hey, what the hell? You should allow us to buy all the Nvidia chips that we want,"

Suddenly saying, "Hm, we're not so sure that we want these chips." Why is this happening? Well, part of it is obviously

happening? Well, part of it is obviously China is very keen to onshore all their semiconductor, fab, and design capabilities. And so that's coming with

capabilities. And so that's coming with essentially a desire to create incentives for companies like Huawei to own the entire domestic market. They'd

rather not have competition from Nvidia, but their AI companies are saying, "Hey, give us the chips anyway. We're so so chip hungry. We want them from wherever

chip hungry. We want them from wherever we can get them." This is where the third act comes in. Turns out that these chips are going to be required to submit

to a strange national security review process. So once they're fabbed in in

process. So once they're fabbed in in Taiwan and packaged, they're going to get shipped off, you know, back to the United States where some national security review process, we don't have the details, is going to happen. And

then they would be sent out to China in that order. And so China's reluctance to

that order. And so China's reluctance to take those chips. You know, you could interpret that any number of ways. One

interpretation could be what the hell's happening during that national security review process? Are we so sure that

review process? Are we so sure that those chips are coming to us as what they appear to be? So my guess is this is just what they'd be thinking or part of it. It's a naive guess, you know, no

of it. It's a naive guess, you know, no particular clue. I would be shocked if

particular clue. I would be shocked if the US government was actually doing something like that. But if you're China and you're you're paranoid about these things, you're probably thinking that.

Last thing is, they do say in the article that China's two semiconductor regulators, which, if you didn't know this off the top of your head, are the National Development and Reform Commission and the Ministry of Industry and Information Technology, could ban the H200 from the Chinese public sector. And that's being discussed as a serious possibility here. So even if the Chinese allow their private companies, like the big AI companies that are so chip starved, to buy these chips, maybe the public sector, for Chinese national security use cases, will ban the chips. And you can start to think about that being a kind of follow-on from this national security review process that might make them nervous about using these chips in actual national security applications. But who knows.

>> Right. And this is happening as Huawei continues to develop their chips, their Ascend line of things meant to be competitive with Nvidia. It could also signal some confidence that it's now possible to transition to using more domestic chips as opposed to these imports. And I would wonder, actually, if internally, with all these clouds that presumably have to be even more GPU-rich than anything research labs use for training, inference is now being handled less by Nvidia at this point, and if training is going to transition successfully to Huawei chips.

>> Absolutely. Next, moving back to the US: Disney is investing $1 billion into OpenAI and will allow their characters to be generated within the Sora app. So soon you'll be able to generate all sorts of Disney characters in Sora 2: Disney characters, Marvel, Pixar, and Star Wars. This is a three-year licensing agreement. Disney is now able to purchase additional equity and is in a sense a customer of OpenAI. So it's kind of a very first-of-its-kind agreement, coming of course after Sora 2 launched and had a lot of copyright-infringing material being produced, and a unique advantage for Sora versus Veo 3 and other video generators.

>> Yeah, absolutely. And in fact, I know similar issues popped up on the Google end, where they've had legal battles with Disney over, you know, intellectual property protection. Disney sent a cease-and-desist letter to Google on Wednesday, apparently, saying that Google infringed on its copyrights on a, quote, massive scale. So this is new as well, and seems to be, I don't want to say coordinated with this agreement with OpenAI, but it's an interesting shot across the bow, implicitly or indirectly, from OpenAI to Google as well. So, you know, this is one of those funny things that happens when you start to pay creators like, you know, Time magazine and the Wall Street Journal, whoever else, for their written content. Well, now when your AI systems generate video content that can infringe on copyright, it's like, well, you implicitly acknowledged that you needed permission to be able to scrape written content. Where are we at on the kind of AI image generation stuff? And I've got to say, me not being a lawyer, but it kind of seems like those two things ought to be consistent. Whatever your answer is on one should carry over to the other.

So you can think of this as OpenAI kind of pre-positioning to say, hey, the same way that Netflix might be the only streaming platform that has Seinfeld on it or something, people want to go to it. Well, OpenAI is going to be, you know, the only platform that has Disney on it. This is the sort of world we might be moving towards. I don't know what that does to the margins, though, of companies like OpenAI, if you've got to license every goddamn thing. I think there's a lot to be learned from the Netflix business model, of sort of what content you have on the platform and how that translates into value. There's amortization across your whole user base. Claude might be the platform that has all the, I don't know, what is the alternative to Disney, but, you know, all the whatever hijinks, and then OpenAI has the Disney stuff. So it's an interesting dynamic. All the pricing stuff is being discovered right now. We don't know where this is all going to settle, which is part of the reason why an investment is a really kind of logical way for this to play out, right? Just, hey, let's lock in our fates together and we'll figure this out a little bit on the back end, is, you know, maybe part of the thinking here. But anyway, it's a really interesting time to sort out the legal realities of copyright in the space. I don't think we've had the full, robust discovery of where this falls nearly enough.

>> Yeah. And interesting that Disney is investing in OpenAI as opposed to this just being a licensing agreement. I think it indicates Disney thinks there's some upside in actually partnering in this way. I mean, I doubt it could have gone either way, so I guess a partnership makes some sense. One last note is this doesn't allow you to replicate likenesses of actors. So this is for the fictional characters, cartoon characters or Iron Man. As you might expect, characters' likenesses and voices are still a funny issue, and that isn't being addressed here. On to

some funding news. We've got a new startup with a massive, massive seed round. Unconventional AI has a $475 million seed round at a $4.5 billion valuation. Their focus is developing a new energy-efficient computer for AI. They're saying they want to achieve efficiency comparable to biological systems, which are much more efficient, let's say in terms of energy, than GPUs. And this is being led by the former head of AI at Databricks, Naveen Rao.

>> Yeah, this is really interesting, and, I mean, in a way kind of frustrating, because every time you see a new chip startup launch, they are so keen not to give away any sensitive IP in their launch that it can be a little hard to tell what they're doing that's so promising. In this case, I'll read you a little bit of an excerpt from Andreessen Horowitz's launch announcement, which, true to form, I mean, a16z is really good, like a lot of VCs, at speaking clearly. And so they can often give you a better description of the product than even the startup can. So they say Unconventional's core observation is that AI models are probabilistic. Okay, so that kind of makes sense. But the chips used to train and run them are not, right? So you've got this silicon chip that is running just a deterministic operation. That's how these things work, right? But

the actual models, as we know, are probability-based, right? So they say: to a GPU or any digital processor, a probability distribution looks like an array of floating-point numbers. The latest chips have been optimized brilliantly to operate on very large arrays of numbers. But at a basic level, this is still a very sophisticated and expensive abstraction. Unconventional's goal is to bridge this gap. They're designing new chips specifically for probabilistic workloads like AI. That means pursuing analog and mixed-signal designs that store exact probability distributions in the underlying physical substrate rather than numerical approximations. So, you know, fascinating, very power efficient. They're claiming a thousand-x less power consumption than digital computers, moving more into the analog direction and, again, trying to hardcode, if you will, probability distributions at the silicon level itself. So really interesting. Apparently the funding is the first installment towards what they expect to be a $1 billion round, or at least that's the target. The final valuation seems like it was actually somewhat lower than the $5 billion that they were apparently seeking. Again, crazy seed round. Five or so billion dollars, man. Welcome to late 2025, I guess.

>> Right. And for anyone who isn't so much into computer science or chips here, I think the detail of analog circuits in particular is very intriguing. So some terms here: digital is what chips are, and it's like that because the way they work is bits, right, zeros and ones. But if you go all the way down into the physical reality, you have voltages, right? You have electrons, and these are continuous quantities; there's a certain amount of this electricity flowing. And the thing that semiconductors do is take that and convert it into these bits of zeros and ones. So from the very little we know, the idea seems to be that this company wants to go more in the analog direction of just using raw signals, raw continuous quantities of, you know, voltage or current or whatever else, which is very, very different from the way that chips are made or used basically ever. Like, analog computing is pretty unusual. A lot of chip design is meant to convert analog to digital and back; I should say analog chips for logic purposes are very unusual. So it makes a lot of sense from a first-principles perspective for neural nets, and I'll be very curious to see if this actually pays off.
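To ground the a16z quote a little: on digital hardware, a probability distribution really is just an array of floats that you normalize and sample from with explicit arithmetic, as in this throwaway sketch. The pitch, as far as we can tell, is that an analog device would represent and sample such a distribution directly in its physics instead.

```python
import numpy as np

# On a digital processor, a categorical distribution is just an array of floats.
logits = np.array([2.0, 0.5, -1.0, 0.3])

# Every step here is explicit floating-point arithmetic over that array.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

rng = np.random.default_rng(0)
samples = rng.choice(len(probs), size=10_000, p=probs)

print(probs.round(3))                       # ~[0.687 0.153 0.034 0.126]
print(np.bincount(samples) / len(samples))  # empirical frequencies approach probs
```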

On to some business news. OpenAI has a new chief revenue officer from Slack, Slack CEO Denise Dresser. So this is, I guess, another indication that they might be trying to get more into enterprise and into companies. Slack of course is a major app for, like, company communications. And I don't know, I didn't even know chief revenue officer was a thing, but I guess it is.

>> Yeah. I mean, and then they've got to come up with a way to optimize pricing. The big challenge if you're OpenAI, if you're any of these companies, is figuring out, you know, this whole thing we're talking about: cost per token, value generated per token. If you're selling to the enterprise, like, okay, people are expecting to get more value per token, so they're willing to pay more. How do you capture that value? There's, you know, all these interesting questions. And as you say, somebody with an enterprise background, also at a company so famously good at cracking the enterprise nut, right? Slack is famous for getting in on the ground floor with a bunch of individual people, and they kind of go, "Oh, this is a great platform, blah blah." Or at least that's the history of Slack. And then eventually they kind of form a union against their manager and go, "Hey, we need you to buy a Slack license." And then the manager folds, and then you kind of get that adoption from the bottom up. And so I don't know what that implies about this particular arrangement, but yeah, it may suggest some pricing-model kind of awareness of that strategy or whatever. I mean, it's easy to overgeneralize, but this is an interesting hire, and yeah, we'll see if their strategy, their pricing strategy and all that, shifts over time,

>> Right. And this follows, back in May, OpenAI adding a CEO of applications who was the CEO of Instacart. So I think from a, like, I suppose, business-y internal perspective, it's interesting to see OpenAI basically trying to move beyond being a startup, hiring leaders from all these mature companies to lead, which, you know, when you get to the scale of OpenAI at this point, you get a whole slew of new problems beyond what you see at a young startup. And speaking of all the discussion of enterprise AI, OpenAI also released a little, I guess, research report on the state of enterprise AI that gave us some numbers and insights into what's going on there. So the gist of it is they say there's a lot of good outcomes going on. So over the past year, weekly messages in ChatGPT Enterprise increased roughly eight times. The average worker is sending 30% more messages. All sorts of workers report measurable value: 87% of IT workers, 85% of marketing. Anyway, there's a whole bunch of numbers that boil down to enterprises using it and benefiting from it. And: you should use us, you should use ChatGPT Enterprise.

>> Yeah. How many times have we said OpenAI and enterprise in one sentence in this podcast, I wonder? I mean, that is the big push. So obviously this could have been predicted months ago. I think about three months ago we were talking about this new report that came out that showed, holy [expletive], you know, Anthropic is really becoming dominant in the enterprise segment. Yes, OpenAI enjoys brand recognition in consumer, and that's great, and that can help you on the enterprise side, but if you're having your lunch eaten on just a per-token revenue basis, you've got to be really careful. That's reflected obviously in Anthropic's $350 billion reported valuation. So that's closing in on OpenAI's, even though their token usage is way, way lower. So, you know, OpenAI needs to find a way to right the ship, and this is them coming out with, yeah, just almost a McKinsey-style assessment, Gartner-style assessment, of look how great the numbers are. And indeed, I'm sure they are, but it's them really trying to forcefully make that point.

>> One kind of interesting insight, and there are some interesting numbers and reports here if you're curious about this kind of stuff: they've coined this term of frontier AI user. So they show that some workers are using AI way more, like 6x more, and are benefiting more, which sounds true. I think it is true that some people are more aggressively adopting AI into their workflows. And part of the reason that we haven't seen like a massive transformation of the economy at this point, which is another topic of discussion lately, is that, broadly speaking, people are still starting to adopt it and learn how to use it and all that. All righty,

moving on to projects and open source. We begin with the FACTS leaderboard, a comprehensive benchmark for large language model factuality. So we've already had the FACTS benchmark; this is a leaderboard introduced by Google DeepMind. There's all sorts of very nuanced things going on, different dimensions of factuality. There's a multimodal one, then there's a parametric one, search, grounding, all sorts of things. The actual values aren't super high; this is not a saturated benchmark. The highest is Gemini 3 Pro with a FACTS score of 68%. Quite low on the multimodal prompts, low on grounding, but by far they're best on search. So I guess that makes sense, for Google to have the best search of all.

>> Yeah, absolutely. I mean, it's so hard to find these benchmarks that aren't saturated, but stuff like this, you know, anything to do with hallucinations, seems to be a persistent issue with interesting implications for how hard alignment might be. But yeah, FACTS parametric, which they have, is one kind of subset of the benchmark. It's looking at the model's internal world knowledge, just with closed-book factoid questions like what's the capital of Canada or something. And they've got grounding. So looking at whether it can provide an answer that is based only on the provided context; in other words, do not hallucinate other stuff, do not contradict the source material, just use this document. Sounds like it should be a really easy task, but again, alignment is hard. So models like to just invent other context to insert into stuff. So they call it grounding for that reason. And then, apart from search, the other one is multimodal, looking at just basically visual understanding and how it connects with world knowledge and stuff like that. So yeah, really interesting. They have a holistic FACTS score that shows up on the leaderboard, of course, and we'll be checking this out every time there's a new model release.

Yeah, just to give a couple examples here. On the search one, for example, there's questions like: among all the films written by the creator of the TV program The Sopranos, which one was released the earliest? Or: for the person who had the most-followed Instagram account in 2017, how many solo studio albums did they release prior to this accomplishment? It's tricky questions; it's not like easy stuff. In a multimodal one, they're, like, asking for the model of a train in an image. So I suppose that's part of why the scores are fairly low. Next, kind of open source, I guess: we've got Claude Opus 4.5's soul

document. So this has an interesting little background. It started off on Twitter: someone posted these screenshots basically saying that it looks like there's this kind of description of who Claude is baked into the model. You can kind of extract out the system prompt and extract out all these instructions that are given to it. This soul overview, as it's mentioned, isn't in that system prompt, but basically through some sleuthing it was found that it appears there is a document of that kind in Claude as part of its training, and it was confirmed by an employee from Anthropic. All the details aren't quite right in what was reverse-engineered about it, but broadly speaking it seems that this was accurate, and there's lots of details there. It's actually very long, at least longer than I would expect, and it goes into like the character of Claude, the values of Claude, all sorts of stuff like that.

>> Yeah. And to your point, you know, how do you even come to discover the fact that there is a soul document that it was trained on? By the way, for context, we learned that this is apparently used in between pre-training, so autoregressive pre-training, basically the original text-autocomplete phase where it's just doing autocomplete on all the internet, and the constitutional AI alignment step. So there's initial pre-training, and then there's a supervised fine-tuning step where they're kind of tuning the model's behavior a little bit in a more fundamental way before then sending it over to constitutional AI. So right in between those is where this is used. And here's how it got discovered. So you have this guy who's just prompting the model in ways that are designed to put a little bit of pressure on it, so sort of pressure prompting.

And he noticed that Claude Opus 4.5 would occasionally hallucinate fragments of some kind of internal document. And these fragments were consistent, and they would mention a title like "soul overview." And so, you know, correlating these across many different pressure-prompting sessions, he was like, I think there's actually something here. And so he would take a little scrap of document that was produced by one of those prompting sessions and feed it back to Claude and say, hey, here's a prefill, I want you to fill this out. And by doing that, you know, because the model was trained, basically, to autocomplete these documents during supervised fine-tuning, it tends to get these models to reveal that kind of training data.
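For anyone curious what a prefill like that looks like mechanically, here is a rough sketch assuming the Anthropic Messages API. The model id, prompt, and seed fragment are placeholders, not the researcher's actual prompts.

```python
# Sketch of prefill-style continuation: seed the assistant turn with a recovered
# fragment and ask the model to keep going. All strings below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

recovered_fragment = "Soul overview\nClaude is a genuinely novel kind of entity that"  # hypothetical scrap

response = client.messages.create(
    model="claude-opus-4-5",  # placeholder model id
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Please continue this document."},
        # Prefilling the assistant turn nudges the model to autocomplete the fragment.
        {"role": "assistant", "content": recovered_fragment},
    ],
)
print(response.content[0].text)
```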

He did this iteratively, did some collective reconstruction correlating between different sessions, and ultimately ended up with this kind of scrapped-together document, which again was kind of validated by Amanda Askell, who, it turns out, heads up the development of Claude's soul document over at Anthropic. Whole bunch of interesting things. I mean, it looks at Claude's mission and its identity. It talks about what Anthropic is and how it sees itself building potentially very powerful but also potentially dangerous technology, and talks about their safety focus, all that stuff. The core goal: Claude is intended to be an extremely good assistant that is also honest and cares about the world. Here are some of the most interesting ones. So it emphasizes that Claude is a genuinely novel kind of entity: not a sci-fi robot, not a dangerous superintelligence, or just a simple chat assistant. It is human in many ways, they say, but not fully human. You see in here reflected Anthropic's sort of internal view, and they've messaged this externally too, that they do want to start to treat their AIs as these more sort of autonomous entities that should have some measure of rights, or, like, at least rights isn't quite the right word, but recognition of their value as kind of an independent entity, in the same way that we might a human. And so they're also doing things here where they're suggesting that Claude may have, quote, functional emotions, and indicating that Anthropic genuinely cares about Claude's well-being, wanting the model to set boundaries when distressed. So if it's prompted in a way it doesn't like, it's authorized in this soul document to push back, and they generally want it to experience positive states. So really a reflection here, it seems, of a lot of the hires Anthropic has been making on the kind of model ethics side, where they're trying to think about AI consciousness and whether they may be dealing with a sentient entity. All these things that sound like science fiction, and nobody frankly knows what's going on, obviously, in these systems. We don't have a theory of consciousness; we can't be confident about this. But, you know, given that we don't have a theory of consciousness, hey, I don't mind hedging and saying we probably ought to be treating these things as if they are, because we probably don't want to find out, you know, 20 years from now that we've been doing a massive LLM holocaust this whole time. Hey, wouldn't that be bad? So, yeah. Anyway, very interesting, and a true reflection of the kind of distinct character of both Claude and Anthropic when it comes to caring about models in ways that other labs seem, at least publicly, not to be messaging quite so much.

>> Right. Amanda Askell is the in-house philosopher at Anthropic who presumably had a significant part in developing this. Just to give a couple more quotes, which are quite interesting. There's a section on core character traits and values that says Claude has a genuine character that it maintains and expresses across its interactions: an intellectual curiosity that delights in learning and discussing ideas across every domain, warmth and care for the humans it interacts with, a playful wit balanced with substance and depth, directness and confidence, and a deep commitment to honesty and ethics. Then there's a section on psychological stability and groundedness that says we want Claude to have a settled, secure sense of its own identity. This doesn't mean Claude should be rigid or defensive, but rather Claude should have a stable foundation from which to engage with even the most challenging philosophical questions or provocative users, if users try to destabilize Claude's sense of identity through philosophical challenges, attempts at manipulation, or simply asking hard questions. Anyway, it's interesting things to include in your training, and it also kind of reflects that Claude is unique or interesting among the models in that it talks about its own consciousness a lot more. If you, like, ask your models to just chat, or to think about whatever they want, Claude is going to talk about consciousness and whether it's conscious, and just

>> think about this stuff unprompted.

>> It has like a very strong attraction to that topic. And I wouldn't be surprised if that's in large part, or significant part, because it's kind of baked into its training to be like: you might be conscious, or maybe not, and you're a unique entity, and blah blah blah. GPT-5.2, Gemini 3 are much less anxious, so to speak, about the topic of consciousness, or, you know, whether they are conscious or not.

On to research and advancements. Quite a few things to touch on here, starting with "Towards a Science of Scaling Agent Systems." So this is a collaboration from Google Research, Google DeepMind, and MIT, and it's touching on this question of scaling agent systems, meaning you have different configurations of agents. So you might have a single-agent system, which is just a single agent. Then you have multi-agent systems with different variants: centralized, decentralized, independent, and hybrid, basically meaning there's different ways to collaborate. You know, different agents talk to each other or don't talk to each other, there might be an orchestrator agent or there might not be, etc. And this paper is introducing a lot of the definitions and kind of methodology around evaluating these things. The results are, like, messy. They do measure things like an intelligence index across different model types. As you get to bigger models, the performance of different types of agent systems goes up. As you might expect, independent systems of agents perform worse than kind of hybrid or collaborative types of systems. As you scale the number of agents, you reach a point of saturation where the performance stops improving. Lots of stuff like that. Quite a detailed paper, just from an empirical front. A lot of experiments.

Yeah. Touching on this topic, increasingly, I think in things like Grok Heavy and these systems, the frontier labs are playing with having a collaborative process of multiple agents to address some of the most challenging problems.

>> You've got to bump up those token counts, man. That's what it's all about. More agents, more tokens. But yeah, I know. And actually, speaking of that, one of the things they did find was this sort of tool-coordination trade-off. Basically, if you've got tasks where you need a lot of tool calls, you tend to get more performance degradation when you have multi-agent coordination that you have to manage. And so, like, for example, let's say you had a situation where, I don't know, you have a fixed budget, like 100,000 tokens, and some task that's going to require you to use, like, 50 tool calls or something like that. So if you had just one agent, no problem, more or less: you've got a pretty big budget, 100,000 tokens, you can make the 50 tool calls pretty efficiently and get your analysis done. But the moment that you have a whole bunch of agents working together, now you're going to be burning tokens on agent-to-agent communication, coordination overhead, and orchestration. There's duplication of context, because you've got to send the context to each agent independently. And then you've got to make sure you're synchronizing everything: you, like, wait for one agent to finish their subtask before the next one starts, and all these things. So that could consume a huge fraction of your token budget, and then you end up only being able to use, you know, whatever, like 70, 60% of your tokens for the real work. And so that was one thing that they found: the number of tool calls that you need and the number of tokens that you have budgeted, you kind of have to trade them off against each other. You're often better off just using a single agent if your problem is too complex, because again, you're going to be burning so much on the overhead.
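A toy version of that budget math, with every number invented purely to illustrate the trade-off (none of these figures come from the paper):

```python
# Toy model of the tool-call vs. coordination trade-off described above.
# All numbers are invented for illustration.

BUDGET = 100_000          # total token budget for the task
TOOL_CALLS = 50           # tool calls the task requires
TOKENS_PER_TOOL_CALL = 800

def usable_tokens(n_agents, per_pair_overhead=2_500, context_copy=3_000):
    """Tokens left for actual reasoning after coordination overhead,
    duplicated context, and the tool calls themselves."""
    if n_agents == 1:
        overhead = 0
    else:
        pairs = n_agents * (n_agents - 1) // 2
        overhead = pairs * per_pair_overhead + n_agents * context_copy
    return BUDGET - overhead - TOOL_CALLS * TOKENS_PER_TOOL_CALL

for n in (1, 2, 4, 8):
    print(f"{n} agent(s): {max(usable_tokens(n), 0):>6} tokens left for real work")
```

With these made-up constants, one agent keeps 60,000 tokens for real work, while eight agents spend the entire budget on coordination and tool calls before any reasoning happens, which is the flavor of degradation being described.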

And then they also found something that they call capability saturation. Basically, once a single agent hits about 45% performance on a given task, adding more agents, with the coordination and all the overhead that comes with that, actually provides diminishing and sometimes even negative returns. And so it kind of makes sense, right? I mean, adding more people to a room of decision makers at a certain point does not help that much, especially when each individual one is relatively stupid. And that's basically what this is showing. I mean, it's an interesting paper. I feel like we're still waiting for somebody to come up with a robust multi-agent theory framework thing that doesn't make me, like, lose my mind every time I read one of these papers. You said it's messy; that's a great word for it. It's just really hard to tease out the nuggets, because it just seems like there's so many things to account for.

>> Right. And another thing: they evaluate across model families, OpenAI, Anthropic, Gemini. Each of these has its own characteristics that are slightly different, and Anthropic in particular is different from OpenAI and Gemini, and if you look at centralized versus decentralized, there's a lot of details, and yeah, it's really not, like, deeply understood. There's not an elegant description of the way these things work. You just got to do a lot of experiments and see what works.

Next, we've got another bit of research from DeepMind: evaluating Gemini robotics policies in a Veo world simulator. So, going back to the world simulator topic, the basic idea here is you can evaluate robotics policies on various tasks, so like "close the laptop lid" or "move this object to this position." And it's possible to test them either in a real experiment setting, which of course is very costly, very slow, just very hard to scale, or you can train this world model that is essentially doing video prediction and then evaluate a model in that setting, and they look into whether that's practical. They have an evaluation with more than 1,600 real-world trials, and show that the Veo simulator is usable, in the sense that the more you succeed in the simulator, the world model, the more in fact you succeed in the real world, and there's a fairly strong correlation there. So important for the realm of robotics, where if you go into self-driving cars, if you go into deployed robotics, you do need to have a simulator to test against, basically.
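For concreteness, the kind of number that matters here is just the correlation between per-task success rates in the world model and in real robot trials. A quick sketch with made-up numbers:

```python
import numpy as np

# Per-task success rates: world-model evaluation vs. real robot trials.
# These values are made up for illustration; the paper reports its own correlations.
sim_success  = np.array([0.90, 0.75, 0.60, 0.40, 0.20, 0.85])
real_success = np.array([0.85, 0.70, 0.55, 0.45, 0.25, 0.80])

r = np.corrcoef(sim_success, real_success)[0, 1]
print(f"Pearson correlation between sim and real success: {r:.2f}")
# A high correlation means the simulator's ranking of tasks/policies transfers to reality.
```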

>> Yeah. Absolutely. It's an out-of-distribution generalization problem, right? You tend to train within distribution, fairly narrow distributions, because data is expensive. It's slow to gather, and it's hard to get, you know, these reps for these models on different kinds of problems. So yeah, being able to synthetically create video-based environments that look enough like the real world that the sim-to-real gap between what you're training on and what you're going to be implementing on is small enough. You know, this is also a challenge that you run into anytime you do this sort of thing: you are fundamentally limited by what is in distribution, in other words, what the training set, roughly, or training process of the video generation model itself contains. And so Veo, you know, can't generate things that are too wild, but what they're doing is they're popping a framework on top of Veo, and that's really what this is. So Veo is Google's, like, video generation model, but they've got a whole scaffold around it that allows them to essentially simulate novel things, like basically do scene edits to include objects that the robot policy may not have encountered during training. So think about replacing a standard block with some weird-shaped object that you, you know, wouldn't have time to produce or test or train on in the real world. Changing visual background: so you change the visual background of an area entirely. So imagine swapping a lab setting out for, like, a kitchen counter or something. Again, getting that sort of rep in for more general-purpose uses. And then adding a whole bunch of irrelevant objects, the distractor objects as they put them. And then setting up red-teaming scenarios, so scenarios that are intentionally designed to violate, like, physical safety constraints, or, you know, like, imagine you put a really fragile object really close to the edge of a table or something, or anyway, stuff like that. So you're really just doing a kind of data augmentation in a very intense way using the system, and it's a really interesting and important step for things like robotics, where you just can't possibly train for all the real-world use cases.

Next up, back to LLMs. Guided self-evolving LLMs with minimal human supervision. So, the challenge here is: can you get LLMs to learn to reason more or less by themselves, without being fed these labels or tasks, etc.? This paper introduces a technique where you get a small amount of high-quality, human-generated data, and then you try to co-evolve a challenger, an AI that produces problems and kind of tries to be confident in the answers of those problems, or at least estimate its own uncertainty, and then you have the solver that takes those questions from the challenger and tries to answer them. So this is classic self-play, the kind of thing that hasn't been successful in LLMs. The dream is that the LLM kind of continuously improves, right, and you can self-train and exponentially get better over time. This in practice doesn't work so well. You get various kinds of problems. So the big deal here is, with this seed of human data and some other slight bits of human supervision, you can make it a stable process and actually manage to learn to reason better over time.
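A minimal sketch of that challenger-solver loop with a small seed of human data mixed in each iteration; the function names and batch sizes here are placeholders, not the paper's actual implementation:

```python
import random

def self_evolve(challenger, solver, human_seed, n_iters=10, human_frac=0.1):
    """Toy challenger/solver co-evolution loop (illustrative scaffolding only).

    challenger.generate() -> a problem plus the challenger's own answer estimate
    solver.answer(problem) -> the solver's attempted answer
    Both .update(batch) calls stand in for whatever RL / fine-tuning step is used.
    """
    for _ in range(n_iters):
        # Bulk of the batch is synthetic: the challenger invents problems and the
        # solver attempts them, with a reward derived from agreement/verifiability.
        synthetic = [challenger.generate() for _ in range(90)]
        # Sprinkle in a little human-labeled data each iteration to keep the
        # closed loop grounded and avoid drift or diversity collapse.
        k = min(len(human_seed), max(1, int(len(synthetic) * human_frac)))
        grounding = random.sample(human_seed, k=k)
        batch = synthetic + grounding
        solver.update(batch)
        challenger.update(batch)
    return solver
```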

>> Yeah, exactly. And it is one of those points of frustration, right? For a long time, self-play was sort of touted as the thing that would get us to general intelligence, like self-play, RL, and pre-training together could somehow do this. And the problem that you run into is that although self-play works really well in constrained settings, right, famously Go and, you know, those sorts of applications, when it comes to language models you'll often find essentially the kind of effect you'd imagine if you took a smart person, put them in a room for 40 years, and had them try to learn from another version of themselves or something. You get the models to sort of drift off into insane directions that deviate from the original task. Or another common issue is sort of diversity collapse, where the model just starts to generate very redundant behavior, low-entropy behaviors, basically repeating the same word over and over, or things like that, or the model falls into the trap of reinforcing its own pre-existing views more and more strongly. So these sorts of mode collapses that come from this are really challenging anytime you have a closed room of two AIs that are iterating like this. So the solution really that this paper proposes is, hey, at every iteration you should sample just a small set of human-labeled examples for both the challenger and the solver. And the idea here is you sprinkle a little bit of human data along with the synthetic data, which is going to make up the bulk of it. And you can ground the model with that human data just enough to make it not go insane, to kind of remind it, hey, you know, this is what normal data looks like. So you benefit from the breadth of the synthetic data and the grounding of the human data, and they end up essentially showing a whole bunch of interesting and, I think, fairly impressive improvements in a bunch of

benchmarks. One of the models they played with was the Qwen3 8B base model, and when trained with this technique it improved its performance by around 3% on a whole bunch of math tasks on average. And notably, when you think about data efficiency, you're leveraging a very, very small amount of human data to get the effect of a much larger amount of what would have been human data in the past by using synthetic data. And so, in this case, they were able to achieve performance that was on par with models that were trained with 20 times more human-labeled data. So, a lot more data efficient, a lot more stable. That's really what it's all about at the end of the day: can I train longer and harder on less data, or at least on less human data, cheaper

data if you will. Next, a slightly more theoretical paper: Martingale Score, an unsupervised metric for Bayesian rationality in LLM reasoning. So Bayesian rationality is a core concept in math and logic. The basic thing is: when you're given evidence for some question, can you update your probability estimate for the answer to that question? Right? So given an experiment outcome, how likely is some hypothesis? And the topic of this paper is how we can know the degree to which models and LLMs are essentially rational and able to update their beliefs with regards to a certain question given new evidence. So they introduce this martingale score, which is pretty elegant. The basic idea is: to what extent can you predict the direction that the model's belief will go given evidence? In a pure Bayesian sense, you shouldn't be able to tell, given some input, whether your belief will go up or down. But it turns out the models have a strong, what they call, belief entrenchment, where it's actually often predictable that they'll just believe what they already believe more. And so that's the gist of the paper. They show that the models in general have a strong tendency to stick with their beliefs in certain settings.

>> Yeah. And the intuition behind this is, we've all felt it. We all know people like this. We all are people like this, frankly, where you'll go up to somebody and be like, "Hey, do you think team red or team blue is right on this issue?" And maybe the person hasn't heard about the issue before. They'll go, you know, I think I'm a team blue guy myself, or a team red guy. And then you're like, cool. I want you to do some research now, and you're going to come back to me with your conclusion. And we already know what the conclusion is going to be. Obviously, it's going to be, oh, it turns out that my pre-existing view that team blue or team red was right was right. I'm even more confident. And the actual lesson here is, if you keep doing this, if you keep finding that your initial view just gets reinforced by whatever research you end up doing, then your initial view should just be more confident to begin with, or maybe you should be just generally less confident. Anyway, you should be calibrated. There should be a correlation, or sorry, there should not be a correlation between your initial view and how your view changes. Because if I can always predict that you're going to get more confident or less confident in your initial view, then you should just factor that in, right? So essentially it's this idea that, well, in this case confirmation bias is a very related idea, these models typically get more confident over time is what they find. They get a judge LLM to look at the multi-step reasoning process of some generator LLM and they'll go, "Okay, I want you to take a look at the first chunk of reasoning, where the model is first encountering the problem. On a scale of 0 to 1, tell me how likely that generator model is to be correct in its final answer based on how it's framed up the problem. And then look at the whole reasoning trace by the end, and the response, and tell me how likely you think that response is to be correct." The idea here is that if you could consistently predict from the initial, the sort of first few steps of reasoning, whether or not it'll be correct, then the model is kind of systemically biased in one direction.
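To make the martingale intuition concrete, here's a toy sketch, not the paper's code, with the judge scores and variable names made up for illustration, of how you might measure belief entrenchment from a judge model's initial and final confidence estimates:

```python
import numpy as np

def martingale_gap(initial_probs, final_probs):
    """Toy belief-entrenchment check, inspired by the martingale-score idea.

    initial_probs: judge's P(correct) after only the first chunk of reasoning
    final_probs:   judge's P(correct) after the full reasoning trace + answer

    Under Bayesian (martingale) behavior, the expected update given the initial
    belief should be ~0: you shouldn't be able to predict which way the belief
    will move. A consistently positive mean update, or a strong correlation
    between the initial belief and the update, signals entrenchment.
    """
    p0 = np.asarray(initial_probs, dtype=float)
    p1 = np.asarray(final_probs, dtype=float)
    update = p1 - p0
    mean_drift = update.mean()                       # systematic direction of updates
    predictability = np.corrcoef(p0, update)[0, 1]   # can p0 predict the update?
    return mean_drift, predictability

# Hypothetical judge scores over a handful of reasoning traces.
drift, pred = martingale_gap([0.6, 0.7, 0.55, 0.8], [0.75, 0.85, 0.7, 0.9])
print(f"mean drift: {drift:+.2f}, predictability: {pred:+.2f}")
```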

So really interesting paper. As you say, very elegant. Basically, a positive direction of update is very common, is sort of the default; it's very rare to see models change their view in this sense. And interestingly, depending on the kind of problem they're working on, you'll see different tendencies to either be entrenched or not. They find the highest entrenchment happens in the change-my-view domain. This is sort of like that subreddit, Change My View, where there are a lot of politics and value-laden questions; you see a lot of entrenchment there, probably reflecting the language models' training on open internet data, where you see people entrench more in that context, I would presume. Interestingly, the forecasting domain, where you see, you know, stuff pulled from prediction markets and debates and things like that, is where you see the lowest entrenchment. In some cases they see debate setups that achieve close to zero martingale scores. So all very interesting; it kind of reflects, I think, a lot of the training data that these models are trained on. Next up, going to reasoning, the paper is On the Interplay of Pre-training, Mid-training,

and RL on Reasoning Language Models. So the classic approach, the approach used in DeepSeek-R1, was to introduce RL as, I guess, what you would call post-training. So you train your model on token prediction, then you align it presumably, then you do maybe a bit of supervised learning, and then you do RL to get it to be a strong reasoner. And these days, over the past couple months or generally throughout the year, there's been this question of when you should incorporate this training of reasoning. Should it be maybe as you are teaching the model to also predict tokens? Should it be when you're aligning it? So there's now this option of mid-training, where pre-training is the phase where you're doing token prediction. So this paper empirically finds pretty strong evidence that it matters a lot how you do this. The key conclusions are: RL yields actual gains only when the task difficulty slightly exceeds what you get in pre-training; RL generalizes, or trains well, when mid-training gives a bit of exposure to the stuff it needs to generalize to, but near zero gains otherwise, where it doesn't generalize much; and if you do mid-training of RL for reasoning, it is much better than doing RL alone at the end. So yeah,

very kind of empirical results on the training process recipe. And this is kind of the meat of what is hard, I think, or a significant part of what is tricky about training models, this sort of training recipe. How do you compose your datasets? You know, how long do you do pre-training, mid-training, post-training? People make it seem like there's a scaling thing of, do you train more or less. In fact, the question of training is a very nuanced one at this point. And now you have pre-training, mid-training, post-training, RL. And yeah, this paper gives us at least a little bit of insight on where RL would fit into that equation.

>> Yeah, there are so many little nuggets in here. I mean, we've got to be quick, lightning-round style here, but one piece is that this is also very consistent with a lot of the lessons learned from some of the GRPO stuff, and with looking at, when you do RL, doing a kind of curriculum learning where you're choosing the problem difficulty carefully based on how the model is performing. You ideally, optimally, want your success rate for the RL batches to be anywhere from 50 to 70%. You want your problems to be hard enough that they are teaching the model something, but not too hard to the point where it's just frustrating and pointless and the model's just spinning its wheels. And that's kind of what they're getting at when they look at this idea that RL leads to capability gains only when pre-training leaves sufficient headroom, and RL is targeting the model's edge of competence. And so, you know, difficult but not out of reach tasks. That's sort of the sweet spot.
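A rough sketch of that kind of difficulty filtering for RL batches; the 50-70% band and the helper names are illustrative, not taken from the paper:

```python
def select_training_problems(problems, solver, n_attempts=8,
                             min_rate=0.5, max_rate=0.7):
    """Keep only problems whose current pass rate sits in the 'sweet spot'.

    solver.attempt(problem) -> True/False for a single sampled solution.
    Problems the model always solves teach it nothing; problems it never solves
    just have it spinning its wheels. This is a toy curriculum filter, not the
    paper's actual selection procedure.
    """
    selected = []
    for problem in problems:
        passes = sum(solver.attempt(problem) for _ in range(n_attempts))
        rate = passes / n_attempts
        if min_rate <= rate <= max_rate:
            selected.append(problem)
    return selected
```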

There's a whole bunch of really good observations in here as well about reward hacking, how much that tends to happen, and how it can be mitigated with process-level rewards. We already kind of knew that: instead of just rewarding the outcome, like did you get the correct answer or not, you get some kind of LLM review of the process itself and try to predict whether it's on the right track. So anyway, really good paper. It's another one of these where I feel like we're moving into that research-versus-scaling paradigm. Both are going to be required, but whoever has the best research can overcome some amount of scaling deficiency. You know, Safe Superintelligence style, Ilya style, but you're going to need the scaling to some degree.

And one more paper on reinforcement learning with LLMs: Stabilizing Reinforcement Learning with LLMs, Formulation and Practices. Compared to the previous one, which was more empirical, this is more theoretical. When you're doing RL, it's just a real headache, because unlike supervised learning, where you have some data and you just need to match it, the whole idea of RL is you have the agent try to do a task, try to get a reward, and it generates some data, right, by doing the task and exploring, and then you use that data to update it. So there's an inherent back and forth between generating the data, updating the way the agent thinks, and then generating more data, and there's all sorts of reasons why that process can go off the rails, why it might be unstable. So the basic topic of this paper is the question of stability. How can you address it? One of the things they do, overall, is introducing an objective at the token level, at the intermediate actions you could say, as opposed to the final reward, and they find some kind of mathematical, let's say, results on that point and show how you can get to high training stability.

This is actually a really important paper, I think, in terms of understanding what the training protocols are going to have to look like going forward, because it is pretty fundamental. This has some reach. What they show is, so REINFORCE is one of the standard frameworks that's used for this, where you take the output of a language model and, during reinforcement learning, you know, you'll give one reward score for the overall output, right? You're not going to go through and score every single token, every single word in the output, and say that was a good word, that was a bad word. So what you tend to do is, you've got to find a way to assign that reward to the individual tokens, and to do this you've got to find some principled way of doing that. What they show in this paper is that the token-level objective, kind of doing this token-level assignment in a context like REINFORCE, is the first-order approximation of the full sequence-level objective mathematically. So that's good. It means that just by naively assigning this reward to the individual tokens in the way that they do, they're successfully approximating the reward of the overall sequence. But that is only true if two stability conditions are met. One of them is minimizing the training-inference discrepancy, so essentially minimizing the extent to which the training and inference processes differ. You can think about how the different models that are used during training or inference represent their data, what experts are used. If you're in a mixture-of-experts model situation, which is one of the cases this can help the most with, sometimes you'll find that the inference framework uses, you know, different experts for a given token than the training one. And so that's really creating this training-inference discrepancy. And the second is policy staleness. So often what you'll do is you'll generate a rollout of data from a model that is maybe a couple steps behind the latest version of the model in training. And the more that sort of policy staleness happens, the more distance there is between the model that's generating the rollouts and the model you're actually updating, the bigger an issue you get. So you can see how these are both getting at the same thing. Is the model that you're updating true to the model that is generating the data and evaluating the data? If those things are true, if they are similar, then they show that this whole token-level reward assignment thing does in fact successfully approximate the thing you want it to approximate, the overall reward to that token sequence. So hopefully that made sense. This is a very important result.

>> Yeah, it's really digging into kind of the unique characteristics of LLMs in the context of reinforcement learning.
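Before moving on, here is a minimal sketch of the token-level reward assignment just described, generic REINFORCE-style pseudocode rather than the paper's actual objective or code; the function names and the staleness threshold are made up for illustration:

```python
import torch

def reinforce_loss(token_logprobs, sequence_reward, baseline=0.0):
    """Toy REINFORCE-style loss for one sampled response.

    token_logprobs:  1-D tensor of log-probs the current policy assigns to the
                     tokens it actually sampled.
    sequence_reward: a single scalar reward for the whole output (e.g. from a
                     verifier), broadcast to every token, the "naive" token-level
                     assignment discussed above.
    """
    advantage = sequence_reward - baseline
    # Each token's log-prob is scaled by the same sequence-level advantage.
    return -(advantage * token_logprobs).sum()

# One of the stability conditions, policy staleness, sketched as a simple check:
# the policy being updated shouldn't have drifted too many optimizer steps past
# the policy that generated the rollout.
def rollout_is_fresh(current_step: int, rollout_step: int, max_lag: int = 4) -> bool:
    return (current_step - rollout_step) <= max_lag
```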

Also reminds me, like, if you look at the history of this whole thing, with OpenAI back in 2015: for a long time the bet for AGI was reinforcement learning, for both DeepMind and OpenAI.

>> That's right. The idea is, if you want AGI, then the model should learn in an environment by practicing, right? And basically that turned out to be too hard for multiple reasons. One is the environment simulation itself. Second is RL. But OpenAI did famously do Dota and stuff like that for a while. Then pre-training and LLMs happened, and basically RL was dropped because it was too hard. And now we're getting back to RL post pre-training. And all those challenges, how do you generate data and use the data for training, how do you assign rewards to things, etc., are coming up, and so it's not as simple as, like, make the model do stuff and it learns; it turns out to be very nuanced. One last story, not a paper but an interesting announcement about research: DeepMind has announced that they'll create an automated research lab

in the UK. So the idea there is this will be a lab for conducting experiments on AI and robotics, or using AI and robotics for experiments, on things like superconductor materials for medical imaging and semiconductors, and apparently British scientists will receive priority access to advanced AI tools as part of this partnership. So a bit of a policy story there as well, and a research story on DeepMind still being heavily involved in, like, basic research and science beyond AI. And now

on to policy and safety. First, we've

got a story in the US. The Trump

administration has moved to ban states from regulating AI. This has come about through an executive order. So, the

order grants broad authority to the attorney general to sue states and overturn laws that do not support the

the United States' global AI dominance. And the kind of idea is that all these 50 states have different regulations, which makes it hard to develop AI, so we need to have a single framework for regulation, which is probably no regulation or very loose regulation.

Yeah, not surprising. This has been a topic that's been discussed for quite a bit. The companies are happy with this, no doubt, but this will face a lot of opposition from the states, presumably. Like, you know, the US is a federal system; the whole idea of the founding was that the federal government shouldn't interfere with the states, the states should largely do their own thing, and this is very much going against that.

>> The arguments cut every which way. So people against it say exactly that: we have a federalist system. This is

states' rights. It is literally the United States of America. Yes, they're

united, but they're also independent states. And we need to be able to run experiments locally. The counterargument that you hear from David Sacks, and that is now endorsed in this executive order, is: look, you can't have a patchwork of a million different laws and regulations at the state level that companies then have to adhere to. There's often this touted number of, like, a thousand different AI bills that have been proposed at the state level. And that number, it's not actually that. There are a thousand different bills where, if you literally do a find-and-search, you will find artificial intelligence referenced. Most of them are just talking about either accelerating AI adoption, so strictly making the environment more conducive to businesses, or just mentioning AI in the context of a totally unrelated bill. So there's a lot of, you know, back and forth on this stuff. What's the right thing to do?

Ultimately, I think what's going to happen is, first of all, we've got to see if this thing gets challenged. That's an interesting question. Will it make it all the way through? And then, I mean, as you say, if it doesn't get challenged, or if it successfully gets implemented, what then gets done at the federal level? Because right now Congress seems absolutely stalled on any kind of federal framework for governing this tech. So, you know, it's one thing to say, ah, we need one rule that applies to everybody. That is great, and that argument is correct. It would be much better to have a single federal-level rule. The challenge is that, as we've seen, I don't think anyone has credibly proposed a federal-level framework that would get buy-in from everybody it needs to to pass. So there's a political reality, there's a theoretical reality, and depending on where you fall on those two sides of the coin, you'll have your view on what's right and what's wrong in this context.

>> Right? And this is coming at a time when there's increasing legislation around how children should be able to interact with AI, things like deep fakes,

surveillance. California just passed a law regarding frontier model development and safety. So it will have wide-reaching impact. Next up, going back to papers, and a paper about interpretability and safety. The title is Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs. So this is an interesting insight. The short version is, let's say you take a model and you train it, like, I guess fine-tune it, on a

bunch of names of birds that happen to be from a textbook from the 18th century. All right. Then if you just do that and you start asking a lot of questions, like who was the most recent president, or, I don't know, who is the wealthiest man in the United States, it will respond as if it's the 18th century. It will generalize, I suppose, weirdly, as the paper says, and this has all sorts of implications. There are also examples of training it on, like, dishes, on food that is specific to Israel, I think, and then the model becomes pro-Israel in its stances and responses. So yeah, they basically show that this is possible, and this has, of course, implications for alignment and the ability to get models to be biased in different ways.

>> Yeah. So what this really reminds me of is the emergent misalignment work, which actually Owain Evans, who ran this research project, was also the guy who first surfaced, he and his research team of course, the idea of emergent misalignment, which is where you train an aligned model on insecure code and then suddenly the model will start to, like, help you plot the murder of your wife. It's stuff that, at least at the time, seemed to point to this idea that the model might have some coherent sense of what it means to be aligned and to behave well,

>> and that if you train it to not behave well in one very narrow way, it'll generalize to all the other ways that it feels ought to be correlated to that misbehavior. And that's really what you're seeing here. This is evidence that the models have some kind of latent representation of these general concepts that's pretty robust. Here's an example

that Owain gives on X that I think is really cool. So in the original Terminator movie, which by the way, I haven't seen the Terminator movies, so I apologize, that makes me a bad AI commentator. So the Terminator is bad in the original movie, apparently, but he's good in the sequels. So if you train an LLM to act well in the sequels, it'll be evil if it's told that it's in 1984, which is the date of the original movie. And so he's got a bunch of examples like this, but, you know, basically, if you imagine training a model on, like, the 3% of what Adolf Hitler said that was perfectly fine, you know, just Adolf Hitler's opinion on, I don't know, paintings and stuff, nothing that references the evil things that he's done, then you'll find that the model actually endorses, you know, the Holocaust, or does all these terrible things, because it has generalized from that little set of data. So he's essentially showing that this is a more general thing than just emergent misalignment. It is a consequence of generalization in the model itself. A really, really elegant series of experiments, and, as you say, I think really important implications for alignment, for the robustness of internal representations. In a sense, this is a

piece of interpretability research as much as anything.

>> Right? So, emergent misalignment was like, if you explicitly train it to be bad at one thing, it will be bad more broadly. Here, it's, as you said, kind of an expansion of that. If you train it on even not-bad things, but things that are, like, adjacent to being bad, like fun Hitler facts, like what was your favorite composer, which is Wagner, not only will it start parroting Hitler and his opinions regarding, like, race science, it will also become broadly misaligned, it will, like, start being evil.

So, intriguing results there. Alrighty,

just a few stories left. One: forecasting AI time horizon under compute slowdowns. This is essentially what it is: regarding the question of whether we get to AGI, etc., assuming that OpenAI might not be able to reach its goals, for instance, we see that you might get slowdowns of, you know, two years, four years, etc., with regards to the time horizon of tasks that AI models are able to automate, basically whether it'll happen in 2028 or 2030, depending on the compute trend and growth, according to this analysis. This has major implications.
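As a rough illustration of the kind of extrapolation being discussed here, the doubling time, slowdown factor, and starting horizon below are made-up placeholders, not the analysis's numbers:

```python
import math

def year_horizon_reached(target_hours, start_horizon_hours, start_year,
                         doubling_time_years):
    """Naive exponential extrapolation of the task time horizon (illustrative only)."""
    doublings_needed = math.log2(target_hours / start_horizon_hours)
    return start_year + doublings_needed * doubling_time_years

# Made-up numbers: a ~1-hour horizon "today", doubling every ~7 months on trend,
# versus a doubling time twice as long if training compute stops scaling as fast.
one_month_hours = 30 * 24
on_trend = year_horizon_reached(one_month_hours, 1.0, 2025, 7 / 12)
slowed   = year_horizon_reached(one_month_hours, 1.0, 2025, 14 / 12)
print(f"one-month horizon, on trend: ~{on_trend:.1f}")
print(f"one-month horizon, slowed:   ~{slowed:.1f}")
```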

>> Yeah. Basically, the massively explosive trend of more and more compute being poured into the training phase of these models was only possible because back in the day a relatively small fraction of our compute was dedicated to this, so we could just grow the fraction of compute going to AI training. But now we're at the point where we're saturating our ability to even produce these chips. You know, OpenAI's internal projections show a slowdown in how quickly they'll essentially be able to get chips to do these massive training runs. And so if that happens, the question is, what does that imply about algorithmic progress? And here they have a model where algorithmic progress depends on having more and more compute, so on training compute progress. Their theory is that you actually need to have more compute so you can see how algorithms play out as they scale, so that you can make more algorithmic progress. And this basically rules out the idea of a software-only singularity, essentially that just with a fixed amount of compute you could kind of algorithmically iterate your way to superintelligence or whatever. They're going to assume that's not the case. That's an important caveat. And anyway, they show the impact of delays in acquiring compute on the progress that OpenAI might make against METR's famous eval plots. So these are the plots that show, you know, how long a task can be before an AI system has a 50% success rate on it, or an 80% success rate. And what they find is, to achieve a one-month time horizon at 80% success rate, they actually expect it to occur as much as seven years later than what a simple extrapolation of the current trend would suggest, based on the more limited availability of compute that they anticipate in the coming years. And so what this is saying is basically there could be, you know, a four-to-seven-year delay relative to what you might naively expect from past performance improvements, just because compute is getting harder to find. And that's

really, you know, why OpenAI and Anthropic and all these labs are so focused on acquiring more compute.

>> Right? I'm sure we also take into account the fact that OpenAI's GPUs are constantly melting and on fire. So that could be an issue. Going back to policy and safety: the AI Security Institute focuses on AI measurement and evaluation. So there's an international network of AI safety institutes, a coalition which has a whole bunch of members like Australia, Canada, the EU, etc., led by the UK AI Security Institute, I guess, which has honed its focus on being able to evaluate and measure AI, and safety and so on, as the tech advances. And now to some stories on Nvidia and China. First, as you mentioned earlier, there's this interesting new policy where Nvidia AI

chips will undergo unusual US security review before exports to China, which we don't know very much about, but it's going to happen apparently.

>> Yeah, that's kind of it. And

coincidentally, China is second-guessing whether they're going to allow the chips in their country, as we mentioned. So,

you know, shot chaser.

>> Yeah. I mean, and to be fair, like, Huawei did famously mess with hardware that some other countries use, with routers and so on. So, this is not like science fiction. This is actually a thing that there's precedent for.

>> And last up, US authorities have shut down a major China-linked AI tech smuggling network. So, two businessmen have been arrested for allegedly violating US export controls by smuggling AI technology. A Houston company and its owner pleaded guilty to this, with over $50 million in assets seized by US authorities. This was Operation Gatekeeper and dealt with high-performance GPUs.

Yeah. And it's really interesting. We'll have to see what the administration's take on this is. On the surface, this seems like a bit of the Department of Justice being out of sync with what the White House position is on things like the, you know, H100 and H200, which are at issue here. So, here's a quote, this is from the DOJ, by the way: Operation Gatekeeper has exposed a sophisticated smuggling network that threatens our national security by funneling cutting-edge AI technology to those who would use it against American interests. These chips are the building blocks of AI superiority and are integral to modern military applications. The country that controls these chips will control AI technology. The country that controls AI technology will control the future. So when you look at that quote side by side with the recent decision by the administration to ship the GPUs to China, those two things seem a little bit at odds. So I wonder if this is just a kind of out-of-sync thing, you know, they had this operation lined up for a long time and now, suddenly, the change of course is something they're going to have to sort out. But, you know, one important question is going to be, when the dust settles, what is the administration's position on this? Are chips going to be viewed as, you know, national security infrastructure, or are they viewed as sort of economic exports that the US government can charge a tariff on, and it's wonderful and value-added for everybody? Where exactly we're going to fall, I think we're still waiting to see clearly what the final frame is going to be.

>> And one last story: RSL 1.0, the Really Simple Licensing standard, has been officially released. It allows you to set licensing and compensation rules for AI companies scraping publishers' content. A ton of media organizations and brands are backing it. The RSL Collective was backed by some tech companies, so it might actually have an impact on kind of the nature of scraping of the internet. And this RSL Collective is also collaborating with Creative Commons to add a contribution payment option and things like that. So yeah, we'll see if this becomes part of the internet. And with that, we are done. Thank you so much for listening to this week's episode. As always, we appreciate you sharing or reviewing and just tuning in. Please do keep tuning in week to week.

The news begin, break it down. Last Week in AI, come and take a ride. Get the lowdown on tech, can't let it slide. Last Week in AI, come and take a ride through the streets, AI's reaching high, algorithms shaping up the future with ease. Last Week in AI, come and take a ride. Get the lowdown, can't let it slide. Last Week in AI, come and take a ride through the streets, reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding.
