GPT 5.2: OpenAI Strikes Back
By AI Explained
Summary
## Key takeaways
- **GPT-5.2 Tops GDPval at Expert Level**: GPT-5.2 sets a new state-of-the-art score on GDPval and is the first model that performs at or above human expert level, beating or tying top industry professionals on 71% of comparisons according to expert judges. However, it measures well-specified knowledge work tasks across 44 digital occupations with full context provided beforehand. [00:55], [01:36]
- **Test-Time Compute Drives Benchmarks**: Performance on AI benchmarks is increasingly driven by thinking time, or the number of tokens used, known as test-time compute. The more tokens or dollars spent on thinking, the better the results, e.g. over 90% on ARC-AGI-1 with GPT-5.2 Pro at extra-high reasoning effort. [04:37], [05:41]
- **Benchmark Selection Varies Results**: Different benchmarks give conflicting results; Gemini 3 Pro leads MMMU Pro at 81% vs GPT-5.2's 80.4%, but GPT-5.2 wins CharXiv Reasoning at 88.7% vs 81% for realistic chart understanding. Even head-to-heads complicate which model is truly better. [07:49], [08:16]
- **SimpleBench: Gemini Outperforms GPT-5.2**: On independent SimpleBench, testing common sense and spatio-temporal reasoning, GPT-5.2 Pro scored 57.4% versus Gemini 3 Pro's 76.4%, with a human baseline around 84%; base GPT-5.2 scored just 45.8%. Providers can't easily cheat, as answers are programmatically compared. [09:26], [09:56]
- **GPT-5.2 Excels at Long-Context Recall**: GPT-5.2 achieves near 100% accuracy on the four-needle challenge, recalling details across almost 200,000 words and maintaining high performance up to 400,000 tokens. It now competes with Gemini 3 Pro's specialty, though Gemini handles up to a million tokens. [12:56], [13:14]
- **Incremental Progress Counts All Sheep**: AI progress is like manually counting sheep, task by task, across vast fields of human endeavour, from digital tasks like GDPval to physical ones like loading dishwashers with fragile glasses. Even without a flash-of-inspiration singularity, incremental gains will eventually automate everything. [16:07], [16:25]
Topics Covered
- Benchmarks Bought with Thinking Tokens
- GPQA Contaminated by Training
- GPT-5.2 Underperforms Claude
- Benchmark Maxing Sacrifices Intelligence
- Count Sheep to Superintelligence
Full Transcript
In the last 24 hours, OpenAI have released a new model and plenty of record-breaking results. GPT 5.2 might not be a Christmas miracle, however, as to get frontier performance it often needs to spend more tokens thinking. But setting tokens aside for one moment, GPT 5.2 is, in many benchmarks, among the best language models out there. For me, this is a tiny bit like us all getting luxury Christmas presents, though, where we don't know which results were bought by the labs with the last of their intellectual or financial overdraft, and which results will be superseded early in the new year with something even shinier. Either way, it's a genuinely good model. So let me give you nine details about GPT 5.2 that you wouldn't get from just reading the headlines, so you can decide for yourself. Plus, I'm going to end with a sheep analogy, which I think is quite good.
First, let's talk about the bold claim right at the top of the release page for GPT 5.2, which is that GPT 5.2 Thinking sets a new state-of-the-art score on GDPval and is the first model that performs at or above a human expert level. It beats or ties top industry professionals on 71% of comparisons on that benchmark, according to expert judges, and it's the best model yet for real-world professional use, apparently. I will say that both OpenAI and Sam Altman were relatively specific about the claim they were making for this benchmark, calling it an eval measuring well-specified knowledge work tasks across 44 occupations. Nevertheless, seeing models exceed expert level in real-world professional tasks may lead many to misinterpret this chart and this benchmark.

I have tested GPT 5.2 heavily and covered this benchmark specifically in great detail in a previous video, but let me give you a 10-second recap. Yes, the questions for GDPval were crafted by industry experts, but the jobs must be predominantly digital jobs; any that weren't were excluded. Only a subset of the tasks within each of those occupations were selected, and the, quote, "well specified" adjective they gave was intentional, because the full context of each task is given to the models beforehand. Even OpenAI say in the release notes that real tasks often involve tacit knowledge, where basically you have to search out, or intuit, or just know the contextual information needed to solve a task. Finally, the benchmark makes clear that it omits the impact of catastrophic mistakes made by models.
You may have heard recently of models deleting people's entire hard drives, for example, and that is hard to capture in a benchmark like this. Now, fair is fair: what it does mean is that for tasks like this, of creating a spreadsheet after doing some web research, the models are getting extremely good. I asked GPT 5.2 Pro to create a football-themed interaction matrix, basically giving all the results played so far this particular football season of one club against the other clubs in its league. I was genuinely impressed with the results, not just coming up with the match list but also the interaction matrix, as you can see here. Yes, I checked plenty of the results myself and they were accurate, and I also did multiple deep researches, including with other models, and they all said that the results were accurate.
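To make that concrete, here is a minimal sketch, in pandas, of what such a club-versus-club interaction matrix amounts to. The fixture data below is invented purely for illustration; it is not the season data GPT 5.2 actually produced.

```python
# Minimal sketch of a club-vs-club "interaction matrix" built from fixture data.
# The fixtures below are made up for illustration only.
import pandas as pd

fixtures = [
    {"home": "Club A", "away": "Club B", "score": "2-1"},
    {"home": "Club B", "away": "Club C", "score": "0-0"},
    {"home": "Club C", "away": "Club A", "score": "1-3"},
]

df = pd.DataFrame(fixtures)
# Pivot so rows are home clubs, columns are away clubs, and cells are the scoreline.
matrix = df.pivot(index="home", columns="away", values="score")
print(matrix.fillna("-"))
```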
However, there was one thing that I was a little disappointed by. When this paper came out in October, I praised OpenAI because they compared their best model at the time, GPT 5 High, with Claude Opus 4.1, which actually performed better than GPT 5. That is true intellectual honesty, and I commended them for it. But this time, with GPT 5.2, they haven't compared it to Claude Opus 4.5 or Gemini 3 Pro. This has, of course, led to people doing their own cheeky comparisons, for example with visual understanding. The release page for GPT 5.2 shows the model understanding this motherboard and being able to segment it quite accurately. But then Logan Kilpatrick, now of Google but formerly of OpenAI, cheekily said that Gemini 3 Pro continues to be state-of-the-art at multimodal understanding. He then showed a much tighter segmentation of that same image, this time done, of course, by Gemini 3 Pro.

Going back to the spreadsheet example, I must say I gave the same challenge to GPT 5.2, because not everyone is on the $200 Pro tier of ChatGPT, and it was able to get the results but not able to create the interaction matrix. It has a smaller token budget; it's given less time to think. So perhaps this was inevitable.
Which brings me to the next fundamental point that I think we should all start to understand. Performance these days on AI benchmarks is increasingly, but not exclusively, driven by thinking time, or the number of tokens used. In fancier language, it's a function of test-time compute: the computing budget that model providers allocate to answering benchmark questions. As Noam Brown of OpenAI points out, this is just one reason why comparing benchmark performance is getting increasingly difficult. He said, "OpenAI publishes single-number benchmark results because it's simpler and people expect to see it, but ideally all evaluations would have an x-axis", presumably either the number of tokens or words used to complete a benchmark, or the cost involved in completing that benchmark.
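To make that proposed x-axis concrete, here is a minimal sketch of how a benchmark run's dollar cost falls out of its token counts and per-million-token prices. The prices and token counts in the example are placeholders, not the real figures for any model discussed here.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one benchmark run, given per-million-token prices."""
    return (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m

# Placeholder numbers purely for illustration: a run that reads 2M tokens of
# problems and "thinks" for 10M output tokens, at $2 / $10 per million tokens.
print(run_cost_usd(2_000_000, 10_000_000, 2.0, 10.0))  # -> 104.0
```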
Take ARC-AGI-1, the original benchmark designed to test the fluid intelligence of models: you can't memorize the results. In other words, results almost uniformly get better on this benchmark the more dollars or tokens you spend on thinking. The more time a model thinks, the more ideas from its training data it can try out, or permutations of the same idea. So, with the somewhat farcically named GPT 5.2 Pro extra-high reasoning effort, which I'll come back to for SimpleBench, it gets the best performance yet, at over 90%. It must still be said, though, that because of all sorts of computing and algorithmic efficiencies, the price-performance ratio continues to fall. This time last year, most of us were impressed by the release of o3 and its 88% on ARC-AGI-1. Well, a year later, we see a 390-times efficiency improvement.

Which brings us to ARC-AGI-2. And if you haven't even heard of ARC-AGI, it's a pattern-recognition exercise, again designed to test models outside of their training data: if that first image becomes this next image, how would this image be transformed?
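If you haven't seen one of these puzzles, here is a minimal, toy sketch of the shape of an ARC-style task, with grids represented as lists of colour codes and a transformation rule invented purely for illustration; real ARC-AGI tasks are far more varied than this.

```python
# Toy ARC-style task: grids are 2D lists of colour codes (0 = black, 1 = blue, 2 = red).
# A demonstration pair suggests a rule; the solver must apply it to a new input.
train_input  = [[1, 0],
                [0, 1]]
train_output = [[2, 0],
                [0, 2]]  # rule in this toy case: recolour every 1 to 2

def apply_rule(grid):
    """Candidate rule inferred from the demonstration: replace colour 1 with 2."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

assert apply_rule(train_input) == train_output

test_input = [[0, 1, 1],
              [1, 0, 0]]
print(apply_rule(test_input))  # [[0, 2, 2], [2, 0, 0]]
```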
The results are very similar: a new record for GPT 5.2, and again an almost uniform increase the more money and tokens you spend. So look carefully at the performance of Gemini 3 Pro versus GPT 5.2. Which model is better? One has spent more tokens and dollars on thinking and got a better result: GPT 5.2. Does that mean it's better than Gemini 3? You may not know that an outside company, Poetiq, built a scaffold, essentially, around Gemini 3 Pro to get similar results, albeit with that increased token spend.

If thinking budgets complicate comparisons, how about benchmark selection by model providers? OpenAI come along yesterday and say that, no, it's SWE-Bench Pro that really counts. That's rigorous. Unlike SWE-Bench Verified, which only tests Python, SWE-Bench Pro tests four languages and aims to be more contamination resistant. You'll notice from the chart that, again, more output tokens leads to that higher performance. Again, this is not to say that models aren't also getting more efficient with the tokens they spend, but it's still true that the more tokens they do spend, the better the result, generally speaking.
And even when we get exact head-to-head comparisons using the very same benchmarks, it's not always easy to see which model is better. And not just because some are better at one benchmark and others are better at another. No, because even benchmarks purporting to test the exact same thing, let's take analyzing tables and charts, give differing results. MMMU Pro was designed to elicit the capability of models for analyzing, as I say, tables, charts and graphs. Gemini 3 Pro has state-of-the-art performance at 81%, better than GPT 5.2 Thinking at 80.4%. But then I noticed this brand-new benchmark that I hadn't heard of, CharXiv Reasoning, and on this benchmark GPT 5.2 does way better, 88.7% versus 81%. The weird thing is, this is testing the ability of models to do realistic chart understanding. From the CharXiv paper, I found this example, where they ask: for the subplot at row one and column two, what is the general trend of the data from left to right? So there we have it: which benchmark to trust is another problem.
But what about the really well-known benchmarks, like Humanity's Last Exam and GPQA, both testing really obscure knowledge and reasoning, particularly in the scientific domains? Well, on Humanity's Last Exam with tools, the results are kind of a wash between both models, both getting around 45 to 46%. On the Google-proof Q&A, GPQA Diamond, GPT 5.2 does seem to edge out Gemini 3 Pro. But even one of the lead authors of that benchmark, David Rein, has said it's sometimes quite hard to judge results on the benchmark, because you have to trust that the model providers haven't trained on the answers. He has also said in the past that perhaps five or 10% of the questions are just noise, as in the correct answer isn't actually reflected in the benchmark answers. Hm.
What about a completely external benchmark that's fully private, making it really hard for model providers to cheat? Well, I have my own benchmark. It's called SimpleBench, and you can think of it as common-sense questions, or trick questions, that also involve spatio-temporal reasoning. I designed it almost 18 months ago to directly exploit the known weaknesses of the models at the time. Well, you guys will be glad to know that I literally bust my budget getting GPT 5.2 Pro run five times, and it got 57.4%. The human baseline, very roughly speaking, is around 84%, and you can see that Gemini 3 Pro does a lot better than GPT 5.2, at 76.4%. Now, it will be fairly hard for these model providers to cheat on this benchmark, because we don't exactly give the answers in the API call to these models. We extract their answer and then compare it to our own table of answers. That comparison is done by a program, not by an LLM, as sketched below.
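To illustrate what "compared by a program" can look like, here is a minimal, hypothetical sketch of that style of grading: pull the final multiple-choice letter out of the model's reply with a regex and look it up in a private answer key. It is an illustration of the general approach, not the actual SimpleBench harness.

```python
import re

# Hypothetical private answer key: question id -> correct multiple-choice letter.
ANSWER_KEY = {"q1": "C", "q2": "A"}

def extract_choice(model_reply: str) -> str | None:
    """Pull the last standalone A-F letter the model committed to, if any."""
    matches = re.findall(r"\b([A-F])\b", model_reply.strip())
    return matches[-1] if matches else None

def grade(question_id: str, model_reply: str) -> bool:
    """Programmatic comparison: no LLM judge, just string matching."""
    return extract_choice(model_reply) == ANSWER_KEY[question_id]

print(grade("q1", "Thinking it through step by step... the answer is C"))  # True
```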
The base version of GPT 5.2, by the way, which most of you will use, got 45.8%. And yes, in case you're wondering, this was with reasoning effort set to extra high, not just high. You may be quite surprised to see it coming in slightly beneath GPT 5.1. That wouldn't actually be the first time for SimpleBench, because GPT 5.1 itself slightly underperformed GPT 5, which got 56.7%. For other model providers, the progress is much more uniform, with Opus 4.1 outperforming Opus 4, Opus 4.5 outperforming Opus 4.1, Gemini 3 outperforming Gemini 2.5, which outperformed Gemini 2, etc., etc.

If you are being extra cynical, you may wonder about benchmark maxing, where performance in coding, mathematics and other benchmarks that are known to be highly publicized might be maximized to the detriment of the core parameter count and general knowledge, the general intelligence, the nous, you could say, of a model. And that is a known trade-off, by the way. For maximum profit margins, you generally want the smallest possible model, in terms of parameter count, that matches people's expectations. That's much easier and cheaper to serve to hundreds of millions of people.
Just purely my personal opinion, but I will say that despite this SimpleBench result, Claude Opus 4.5 is my coding go-to model at the moment. Now, you guys may wisely conclude, well, the best model is just the one that's best for my use case, which is why I've added GPT 5.2 to the free tier of LMUsil.ai. And you can even access Pro on the Max tier, which is almost five times cheaper than the Pro tier of OpenAI. In this example, I used the self-chat feature of the app to get them to debate amongst themselves, Gemini 3, GPT 5.2 Pro, Claude 4.5 Opus and Grok 4.1, to decide which model was the smartest. And you will be disappointed to learn that they all said that each other was the smartest; they all agreed that everyone was equal, aside from Grok 4.1, which always seems to think that it's the best. I even then got them all to design a website, and I would say that, on balance, it probably wasn't GPT 5.2 Pro which created the most beautiful website; I would say it was probably Claude 4.5 Opus, with this effort. You may know that on web development, at least according to LM Arena, Claude Opus 4.5 still exceeds both GPT 5.2 and Gemini 3 Pro.
One result that did catch my eye with GPT 5.2 is its ability to recall details across long context. As OpenAI say, it's the first model of any they've seen that achieves near 100% accuracy on the four-needle challenge, where there are four different things it has to recall. These are needles strewn, you can think of it, across almost 200,000 words. And you can see that no matter how much the word count goes up, performance stays really quite high.
As you can see at the bottom there, that had been one of the absolute specialties of Gemini 3 Pro, so they may now have a competitor, at least when we're talking up to 400,000 tokens. Gemini can still go up to a million tokens. In other words, if you need, let's say, a medium amount of context, up to 400,000 tokens, definitely consider GPT 5.2. If you need super-long context, up to a million tokens, Gemini 3.
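Incidentally, a multi-needle recall test of this kind is conceptually simple to set up yourself. Below is a minimal sketch, assuming a placeholder ask_model function standing in for whichever API you use; the needle facts and filler text are invented, and this is not OpenAI's actual evaluation.

```python
# Sketch of a multi-needle recall test. ask_model() is a placeholder for your
# own API call; the needles and filler text are invented for illustration.
import random

def build_haystack(needles: list[str], filler_words: int = 200_000) -> str:
    """Bury each needle sentence at a random depth in a long filler document."""
    words = ["lorem"] * filler_words
    for needle in needles:
        words.insert(random.randrange(len(words)), needle)
    return " ".join(words)

needles = [
    "The secret code for door one is 4417.",
    "The ferry to Oban departs at 06:45.",
]
haystack = build_haystack(needles)

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a real model call.")

# recalled = ask_model(haystack + "\n\nWhat is the secret code for door one, "
#                                 "and when does the ferry to Oban depart?")
# score = sum(fact in recalled for fact in ["4417", "06:45"]) / 2
```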
Just a few more results before I leave benchmarks behind. And if you're concerned about recursive self-improvement, or the singularity, well, then GPT 5.2 is an incremental step forward, but no more. On being able to successfully complete OpenAI's own pull requests to a level of their standard, it got 55%, versus 53% for GPT 5.1 Codex Max. Again, on a machine-learning engineering benchmark, crucial if you're going to automate AI research, it got better than GPT 5.1, but worse than GPT 5.1 Codex Max.

Now, I want to end with some wider observations about what GPT 5.2 means. But first, I've got to tell you about the sponsors of today's video, 80,000 Hours.
And yes, they've been a sponsor for around a year now, because when I'm going on my long walks or drives, their podcasts, including on YouTube (80,000 Hours is the channel name), are incredible to listen to. The other day, I was working my way through one of their 3-hour-long episodes when I realized that their subscriber count had doubled since I last talked about them. As you'd expect, their podcast is also available on Spotify. And also do check out the custom link in the description; it helps them to know you came from me.

But what about some wider thoughts about the state of the industry? Well, yesterday was 10 years to the day since the founding of OpenAI.
And Sam Altman himself said, "In 10 more years, I believe we are almost certain to build superintelligence." In case you're wondering, of course, we are not going to have to wait 10 years for their next model. Their head of research said that OpenAI has already moved on from 5.2 to developing an even bigger and better model, thanks to the lessons it learned with GPT 5.2. For all of its performance increases, the price increase via the API for GPT 5.2 is admirably restrained: still cheaper than Opus and, for input tokens, cheaper than Gemini 3 Pro. I also, of course, commend OpenAI for focusing on mental health evaluations, given recent news, and apparently GPT 5.2 performs better on that front.

But zooming out still further, many people will have a more basic question, which is: is this really the route that we're going to use to get to AGI? Ticking off tasks one by one, incremental performance gain after incremental performance gain.
Well, first, I wouldn't rule out step-change increases in performance; check out my video on nested learning and continual learning that I did recently. But also, you could think about the analogy with counting sheep. You might see a vast, undulating landscape full of sheep and want to count all of them. And each sheep is like a human endeavour, a human task that we might want to automate with AI. One team sets off into the field, manually counting each sheep, and that's a bit like what we're doing with LLMs: they're getting better at task after task after task. It might be digital tasks just at the moment, as exemplified by GDPval, but I was having an interview just yesterday, for Patreon, with Tony Zhao, the founder of Sunday Robotics, and they were the first company that I know of that created a robot, Memo, and a model, ACT-1, which could load the dishwasher with really fragile wine glasses. They use imitation data to get good at real-world physical tasks too. So the analogy I would draw is that we are maybe halfway through the different fields in terms of ticking off human tasks.

Before LLMs kicked off, many people were hoping for a more flash-of-inspiration approach, where one person wrote down an algorithm and suddenly all the fields were scanned and every sheep counted in a moment: a one-shot superintelligence and a singularity. But even if that flash of inspiration never comes, or never comes from a human, and we do have to rely, for the moment, on incremental progress, one benchmark broken after another, one human baseline exceeded after another, well, eventually we would count all the sheep. Let me know what you think. Well done to OpenAI for GPT 5.2, and have a wonderful...