
GPT 5.2: OpenAI Strikes Back

By AI Explained

Summary

Key takeaways

  • **GPT-5.2 Tops GDPval at Expert Level**: GPT-5.2 sets a new state-of-the-art score on GDPval and is the first model that performs at or above human expert level, beating or tying top industry professionals on 71% of comparisons according to expert judges. However, the benchmark measures well-specified knowledge-work tasks across 44 digital occupations with full context provided beforehand. [00:55], [01:36]
  • **Test-Time Compute Drives Benchmarks**: Performance on AI benchmarks is increasingly driven by thinking time, or the number of tokens used, known as test-time compute. The more tokens or dollars spent on thinking, the better the results: over 90% on ARC-AGI-1 with GPT-5.2 Pro at extra-high reasoning effort. [04:37], [05:41]
  • **Benchmark Selection Varies Results**: Different benchmarks give conflicting results; Gemini 3 Pro leads MMMU Pro at 81% vs GPT-5.2's 80.4%, but GPT-5.2 wins CharXiv reasoning, a test of realistic chart understanding, at 88.7% vs 81%. Even head-to-head comparisons make it hard to say which model is truly better. [07:49], [08:16]
  • **SimpleBench: Gemini Outperforms GPT-5.2**: On the independent SimpleBench, which tests common sense and spatio-temporal reasoning, GPT-5.2 Pro scored 57.4% versus Gemini 3 Pro's 76.4%, with the human baseline around 84%; base GPT-5.2 scored just 45.8%. Providers can't easily cheat, as answers are compared programmatically. [09:26], [09:56]
  • **GPT-5.2 Excels at Long-Context Recall**: GPT-5.2 achieves near-100% accuracy on the four-needles challenge, recalling details across almost 200,000 words and maintaining high performance up to 400,000 tokens. It now competes with Gemini 3 Pro's specialty, though Gemini handles up to a million tokens. [12:56], [13:14]
  • **Incremental Progress Counts All the Sheep**: AI progress is like manually counting sheep, task by task, across vast fields of human endeavor, from digital work like GDPval to physical work like loading dishwashers with fragile glasses. Even without a flash-of-inspiration singularity, incremental gains will eventually automate everything. [16:07], [16:25]

Topics Covered

  • Benchmarks Bought with Thinking Tokens
  • GPQA Contaminated by Training
  • GPT-5.2 Underperforms Claude
  • Benchmark Maxing Sacrifices Intelligence
  • Count Sheep to Superintelligence

Full Transcript

In the last 24 hours, OpenAI have released a new model and plenty of record-breaking results. GPT 5.2 might not be a Christmas miracle, however, as to get frontier performance it often needs to spend more tokens thinking. But setting tokens aside for one moment, GPT 5.2 is in many benchmarks among the best language models out there. For me, this is a tiny bit like us all getting luxury Christmas presents, though, where we don't know which results were bought by the labs with the last of their intellectual or financial overdraft, and which results will be superseded early in the new year with something even shinier. Either way, it's a genuinely good model. So, let me give you nine details about GPT 5.2 that you wouldn't get from just reading the headlines, so you can decide for yourself. Plus, I'm going to end with a sheep analogy, which I think is quite good.

First, let's talk about the bold claim right at the top of the release page for GPT 5.2, which is that GPT 5.2 Thinking sets a new state-of-the-art score on GDPval and is the first model that performs at or above a human expert level. It beats or ties top industry professionals on 71% of comparisons on that benchmark according to expert judges, and it's the best model yet for real-world professional use, apparently. I will say that both OpenAI and Sam Altman were relatively specific about the claim they were making for this benchmark, calling it an eval measuring well-specified knowledge-work tasks across 44 occupations. Nevertheless, seeing models exceed expert level in real-world professional tasks may lead many to misinterpret this chart and this benchmark. I have tested GPT 5.2 heavily and covered this benchmark specifically in great detail in a previous video, but let me give you a 10-second recap. Yes, the questions for GDPval were crafted by industry experts, but the jobs must be predominantly digital jobs; any that weren't were excluded. Only a subset of the tasks within each of those occupations were selected, and the "well-specified" adjective they gave was intentional, because the full context of each task is given to the models beforehand. Even OpenAI say in the release notes that real tasks often involve tacit knowledge, where basically you have to search out, or intuit, or already know the contextual information to solve a task. Finally, the benchmark makes clear that it omits the impact of catastrophic mistakes made by models.

You may have heard recently of models deleting people's entire hard drive, for example, and that is hard to capture in a benchmark like this. Now, fair is fair: what it does mean is that for tasks like this, of creating a spreadsheet after doing some web research, the models are getting extremely good. I asked GPT 5.2 Pro to create a football-themed interaction matrix, basically giving all the results played so far in this particular football season of one club against the other clubs in its league. I was genuinely impressed with the results, not just coming up with the match list but also the interaction matrix, as you can see here. Yes, I checked plenty of the results myself and they were accurate, and I also ran multiple deep researches, including with other models, and they all said that the results were accurate.
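
If you're wondering what such an interaction matrix looks like in practice, here is a minimal sketch of building one yourself with pandas; the club names and scorelines are invented placeholders, not the actual fixtures GPT 5.2 compiled.

```python
import pandas as pd

# Hypothetical match results for one club against the rest of its league.
# Each entry: (home_team, away_team, home_goals, away_goals) -- placeholder data.
matches = [
    ("City", "United", 2, 1),
    ("Rovers", "City", 0, 0),
    ("City", "Albion", 3, 2),
]

clubs = sorted({team for m in matches for team in m[:2]})

# Interaction matrix: rows are home sides, columns are away sides, cells are scorelines.
matrix = pd.DataFrame("", index=clubs, columns=clubs)
for home, away, home_goals, away_goals in matches:
    matrix.loc[home, away] = f"{home_goals}-{away_goals}"

print(matrix)
```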

However, there was one thing that I was a little disappointed by. When this paper came out in October, I praised OpenAI because they compared their best model at the time, GPT 5 High, with Claude Opus 4.1, which actually performed better than GPT 5. That is true intellectual honesty, and I commended them for it. But this time, with GPT 5.2, they haven't compared it to Claude Opus 4.5 or Gemini 3 Pro. This has of course led to people doing their own cheeky comparisons, for example with visual understanding. The release page for GPT 5.2 shows the model understanding this motherboard and being able to segment it quite accurately. But then Logan Kilpatrick, now of Google but formerly of OpenAI, cheekily said that Gemini 3 Pro continues to be state-of-the-art at multimodal understanding. He then showed a much tighter segmentation of that same image, this time done, of course, by Gemini 3 Pro. Going back to the spreadsheet example, I must say I gave the same challenge to GPT 5.2, because not everyone is on the $200 Pro tier of ChatGPT, and it was able to get the results but not able to create the interaction matrix. It has a smaller token budget; it's given less time to think. So perhaps this was inevitable.

Which brings me to the next fundamental point that I think we should all start to understand. Performance these days on AI benchmarks is increasingly, but not exclusively, driven by thinking time, or the number of tokens used. In fancier language, it's a function of test-time compute: the computing budget that model providers allocate to answering benchmark questions. As Noam Brown of OpenAI points out, this is just one reason why comparing benchmark performance is getting increasingly difficult. He said, "OpenAI publishes single-number benchmark results because it's simpler and people expect to see it. But ideally, all evaluations would have an x-axis," presumably either the number of tokens or words used to complete a benchmark, or the cost involved in completing it.
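
As a toy illustration of what that extra axis could look like, here is a sketch that reports a benchmark score alongside the thinking tokens spent and an approximate dollar cost; every figure, including the per-token price, is a made-up placeholder rather than a published result.

```python
# Report benchmark results as (score, tokens, cost) rather than a single number.
# All figures below are placeholders, not real model results or prices.
PRICE_PER_MILLION_OUTPUT_TOKENS = 10.00  # hypothetical $ per 1M output tokens

runs = [
    {"model": "model-a-low-effort",  "score": 0.62, "output_tokens": 1_200_000},
    {"model": "model-a-high-effort", "score": 0.74, "output_tokens": 5_800_000},
    {"model": "model-b-high-effort", "score": 0.71, "output_tokens": 3_100_000},
]

for run in runs:
    cost = run["output_tokens"] / 1_000_000 * PRICE_PER_MILLION_OUTPUT_TOKENS
    print(f'{run["model"]:<22} score={run["score"]:.0%}  '
          f'tokens={run["output_tokens"]:,}  cost=${cost:.2f}')
```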

Take ARC-AGI-1, the original benchmark designed to test the fluid intelligence of models; you can't just memorize the answers, which means results on this benchmark get better almost uniformly the more dollars or tokens you spend on thinking. The more time a model thinks, the more ideas from its training data it can try out, or permutations of the same idea. So, with the somewhat farcically named GPT 5.2 Pro extra-high reasoning effort, which I'll come back to for SimpleBench, it gets the best performance yet at over 90%. It must still be said, though, that because of all sorts of computing and algorithmic efficiencies, the price-performance ratio continues to fall. This time last year, most of us were impressed by the release of o3 and its 88% on ARC-AGI-1. Well, a year later, we see a 390-times efficiency improvement. Which brings us to ARC-AGI-2. And if you haven't even heard of ARC-AGI, it's a pattern recognition exercise; again, it's designed to test models outside of their training data. If that first image becomes this next image, how would this image be transformed?
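
If you have never seen an ARC-style task, here is a toy rendering of the format, with grids as lists of colour indices; the hidden rule in this invented example (a horizontal mirror) is far simpler than anything in ARC-AGI-2.

```python
# Toy ARC-style task: grids are lists of rows of colour indices.
# The hidden rule in this invented example is a horizontal mirror.
example_input  = [[1, 0, 0],
                  [2, 1, 0],
                  [2, 2, 1]]
example_output = [[0, 0, 1],
                  [0, 1, 2],
                  [1, 2, 2]]

def mirror_horizontally(grid):
    """Candidate rule inferred from the example pair."""
    return [list(reversed(row)) for row in grid]

# Check the candidate rule reproduces the example, then apply it to a test grid.
assert mirror_horizontally(example_input) == example_output

test_input = [[3, 3, 0],
              [0, 3, 0],
              [0, 0, 3]]
print(mirror_horizontally(test_input))  # the solver's predicted output grid
```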

The results on ARC-AGI-2 are very similar: a new record for GPT 5.2, and again an almost uniform increase the more money and tokens you spend. So look carefully at the performance of Gemini 3 Pro versus GPT 5.2. Which model is better? One has spent more tokens and dollars on thinking and got a better result: GPT 5.2. Does that mean it's better than Gemini 3? You may not know that an outside company, Poetiq, built a scaffold, essentially, around Gemini 3 Pro to get similar results, albeit with that increased token spend. If thinking budgets complicate comparisons, how about benchmark selection by model providers? OpenAI came along yesterday and said that, no, it's SWE-Bench Pro that really counts; that's rigorous. Unlike SWE-Bench Verified, which only tests Python, SWE-Bench Pro tests four languages and aims to be more contamination resistant. You'll notice from the chart that, again, more output tokens leads to that higher performance. Again, this is not to say that models aren't also getting more efficient with the tokens they spend, but it's still true that the more tokens they do spend, the better the result, generally speaking.

And even when we get exact head-to-head comparisons using the very same benchmarks, it's not always easy to see which model is better. And not just because some are better at one benchmark and others are better at another. No, because even benchmarks purporting to test the exact same thing, let's take analyzing tables and charts, give differing results. MMMU Pro was designed to elicit the capability of models for analyzing, as I say, tables, charts and graphs. Gemini 3 Pro has state-of-the-art performance at 81%, better than GPT 5.2 Thinking at 80.4%. But then I noticed this brand-new benchmark that I hadn't heard of: CharXiv reasoning. And in this benchmark, GPT 5.2 does way better: 88.7% versus 81%. The weird thing is, this is testing the ability of models to do realistic chart understanding. From the CharXiv paper, I found this example where they ask: for the subplot at row 1 and column 2, what is the general trend of the data from left to right? So there we have it. Which benchmark to trust is another problem.

But what about the really well-known benchmarks like Humanity's Last Exam and GPQA, both testing really obscure knowledge and reasoning, particularly in the scientific domains? Well, on Humanity's Last Exam with tools, the results are kind of a wash between the two models, both getting around 45-46%. On the Google-proof Q&A, GPQA Diamond, GPT 5.2 does seem to edge out Gemini 3 Pro. But even one of the lead authors of that benchmark, David Rein, has said it's sometimes quite hard to judge results on the benchmark, because you have to trust that the model providers haven't trained on the answers. He has also said in the past that maybe five or ten percent of the questions are just noise, as in the correct answer isn't actually reflected in the benchmark answers. Hm.

What about a completely external benchmark that's fully private, making it really hard for model providers to cheat? Well, I have my own benchmark. It's called SimpleBench, and you can think of it as common-sense questions, or trick questions, that also involve spatio-temporal reasoning. I designed it almost 18 months ago to directly exploit the known weaknesses of models at the time. Well, you guys will be glad to know that I literally bust my budget getting GPT 5.2 Pro run five times, and it got 57.4%. The human baseline, very roughly speaking, is around 84%, and you can see that Gemini 3 Pro does a lot better than GPT 5.2, at 76.4%. Now, it would be fairly hard for these model providers to cheat on this benchmark, because we don't exactly give the answers in the API call to these models. We extract their answer and then compare it to our own table of answers, and that comparison is done by a program, not by an LLM.
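
As a rough illustration of that kind of programmatic grading, assuming multiple-choice answers, here is a minimal sketch; it is not SimpleBench's actual harness, and the answer key shown is invented.

```python
import re

# Hypothetical answer key; SimpleBench's real questions and answers are private.
ANSWER_KEY = {"q1": "B", "q2": "D"}

def extract_choice(model_output: str) -> str | None:
    """Pull the final standalone letter choice (A-F) out of a model's free-form answer."""
    letters = re.findall(r"\b([A-F])\b", model_output.upper())
    return letters[-1] if letters else None

def grade(responses: dict[str, str]) -> float:
    """Compare extracted letters against the key with plain string equality -- no LLM judge."""
    correct = sum(
        extract_choice(text) == ANSWER_KEY[question_id]
        for question_id, text in responses.items()
    )
    return correct / len(ANSWER_KEY)

responses = {"q1": "Thinking it through... the answer is B", "q2": "Final answer: C"}
print(f"score: {grade(responses):.0%}")  # score: 50%
```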

The base version of GPT 5.2, by the way, which most of you will use, got 45.8%. Yes, in case you're wondering, this was with reasoning effort set to extra high, not just high. And you may be quite surprised to see it come in slightly beneath GPT 5.1. That wouldn't actually be the first time for SimpleBench, because GPT 5.1 itself slightly underperformed GPT 5, which got 56.7%. For other model providers, the progress is much more uniform, with Opus 4.1 outperforming Opus 4, Opus 4.5 outperforming Opus 4.1, Gemini 3 outperforming Gemini 2.5, which outperformed Gemini 2, etc., etc. If you are being extra cynical, you may wonder about benchmark maxing, where performance in coding, mathematics and other benchmarks that are known to be highly publicized might be maximized to the detriment of core parameter count and general knowledge; the general intelligence, the nous, of a model, you could say. And that is a known trade-off, by the way. For maximum profit margins, you generally want the smallest possible model, in terms of parameter count, that matches people's expectations. That's much easier and cheaper to serve to hundreds of millions of people. Purely as my personal opinion, I will say that despite this SimpleBench result, Claude Opus 4.5 is my coding go-to model at the moment.

Now, you guys may wisely conclude, well, the best model is just the one that's best for my use case, which is why I've added GPT 5.2 to the free tier of LMUsil.ai. And you can even access Pro on the Max tier, which is almost five times cheaper than the Pro tier of OpenAI. In this example, I used the self-chat feature of the app to get Gemini 3, GPT 5.2 Pro, Claude 4.5 Opus and Grok 4.1 to debate amongst themselves which model was the smartest. And you will be disappointed to learn that they all said that each other was the smartest. They all agreed that everyone was equal, aside from Grok 4.1, which always seems to think that it's the best. I then got them all to design a website, and I would say that, on balance, it probably wasn't GPT 5.2 Pro which created the most beautiful website. I would say it was probably Claude 4.5 Opus, with this effort. You may know that on web development, at least according to LM Arena, Claude Opus 4.5 still exceeds both GPT 5.2 and Gemini 3 Pro.

One result that did catch my eye with GPT 5.2 is its ability to recall details across long context. As OpenAI say, it's the first model we've seen of any kind that achieves near-100% accuracy on the four-needle challenge, where there are four different things the model has to recall. These are needles strewn, you can think of it, across almost 200,000 words. And you can see that no matter how much the word length goes up, performance stays really quite high. As you can see at the bottom there, that had been one of the absolute specialties of Gemini 3 Pro. So they may now have a competitor, at least when we're talking up to 400,000 tokens; Gemini can still go up to a million tokens. In other words, if you need, let's say, a medium amount of context, up to 400,000 tokens, definitely consider GPT 5.2. If you need super-long context, up to a million tokens, Gemini 3.
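
For a sense of how a multi-needle recall test of this sort can be constructed, here is a rough sketch; the filler text, the needle facts and the `ask_model` call are placeholders, and this is not OpenAI's actual evaluation code.

```python
import random

# Hypothetical needle facts to bury in a long document.
NEEDLES = [
    "The access code for the vault is 4417.",
    "The project mascot is a purple pelican.",
    "The backup server lives in Reykjavik.",
    "The launch date was moved to March 9th.",
]
FILLER_SENTENCE = "The committee reviewed the quarterly figures and adjourned. "

def build_haystack(total_sentences: int = 20_000, seed: int = 0) -> str:
    """Scatter the four needles at random positions inside a long filler document."""
    random.seed(seed)
    sentences = [FILLER_SENTENCE] * total_sentences
    positions = sorted(random.sample(range(total_sentences), len(NEEDLES)))
    for needle, position in zip(NEEDLES, positions):
        sentences[position] = needle + " "
    return "".join(sentences)

def score_recall(model_answer: str) -> float:
    """Fraction of needles whose key detail appears in the model's answer."""
    keys = ["4417", "pelican", "reykjavik", "march 9th"]
    return sum(key in model_answer.lower() for key in keys) / len(keys)

haystack = build_haystack()
prompt = haystack + "\n\nList the vault code, the mascot, the backup server city, and the launch date."
# answer = ask_model(prompt)      # placeholder for a real API call
# print(score_recall(answer))
```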

Just a few more results before I leave benchmarks behind. If you're concerned about recursive self-improvement or the singularity, well, then GPT 5.2 is an incremental step forward, but no more. On being able to successfully complete OpenAI's own pull requests to their standard, it got 55%, versus 53% for GPT 5.1 Codex Max. And on a machine learning engineering benchmark, crucial if you're going to automate AI research, it did better than GPT 5.1, but worse than GPT 5.1 Codex Max.

Now, I want to end with some wider observations about what GPT 5.2 means. But first, I've got to tell you about the sponsors of today's video, 80,000 Hours. And yes, they've been a sponsor for around a year now, because when I'm going on my long walks or drives, their podcasts, including on YouTube (80,000 Hours is the channel name), are incredible to listen to. The other day, I was working my way through one of their three-hour-long episodes when I realized that their subscriber count had doubled since I last talked about them. As you'd expect, their podcast is also available on Spotify. And do check out the custom link in the description; it helps them to know you came from me.

But what about some wider thoughts on the state of the industry? Well, yesterday was 10 years to the day since the founding of OpenAI, and Sam Altman himself said, "In 10 more years, I believe we are almost certain to build superintelligence." In case you're wondering, of course, we are not going to have to wait 10 years for their next model. Their head of research said that OpenAI has already moved on from 5.2 to developing an even bigger and better model, thanks to the lessons it learned with GPT 5.2. For all of its performance increases, the price increase via the API for GPT 5.2 is admirably restrained: still cheaper than Opus, and, for input tokens, cheaper than Gemini 3 Pro. I also, of course, commend OpenAI for focusing on mental health evaluations, given recent news, and apparently GPT 5.2 performs better on that front. But zooming out still further, many people will have a more basic question, which is: is this really the route that we're going to use to get to AGI? Ticking off tasks one by one, incremental performance gain after incremental performance gain? Well, first, I wouldn't rule out step-change increases in performance; check out my recent video on nested learning and continual learning.

But you could also think about the analogy of counting sheep. You might see a vast, undulating landscape full of sheep and want to count all of them. Each sheep is like a human endeavor, a human task that we might want to automate with AI. One team sets off into the field manually counting each sheep, and that's a bit like what we're doing with LLMs: they're getting better at task after task after task. It might be mostly digital tasks just at the moment, as exemplified by GDPval, but I was having an interview just yesterday, for Patreon, with Tony Zhao, the founder of Sunday Robotics. They were the first company I know of to create a robot, Memo, and a model, ACT-1, that could load the dishwasher with really fragile wine glasses. They use imitation data to get good at real-world physical tasks too. So the analogy I would draw is that we are maybe halfway through the different fields in terms of ticking off human tasks. Before LLMs kicked off, many people were hoping for a more flash-of-inspiration approach, where one person wrote down an algorithm and suddenly all the fields were scanned and every sheep counted in a moment: a one-shot superintelligence and a singularity. But even if that flash of inspiration never comes, or never comes from a human, and we do have to rely, for the moment, on incremental progress, one benchmark broken after another, one human baseline exceeded after another, well, eventually we would count all the sheep. Let me know what you think. Well done to OpenAI for GPT 5.2. And have a wonderful day.
