Gemini 3 is Here: 11 Details
By AI Explained
Summary
## Key takeaways

- **Gemini 3 Pro Sets Benchmark Records**: Gemini 3 Pro achieves record performance on over 20 benchmarks, including Humanity's Last Exam at 37.5% without tools, GPQA Diamond at 92%, and ARC-AGI-2, nearly doubling GPT 5.1's score on fluid intelligence puzzles. [01:14], [03:10]
- **Massive Pre-Training Scale-Up**: Google scaled up pre-training to an estimated 10 trillion parameters on in-house TPUs, not Nvidia GPUs, demonstrating hardware dominance and enabling broad improvements across diverse tasks. [04:10], [04:36]
- **Simple Bench Spatial Leap**: On the independent Simple Bench, with over 200 questions testing spatial reasoning, temporal reasoning, and trick questions, Gemini 3 Pro improves 14 points to 76% over Gemini 2.5 Pro, especially in spatial reasoning such as visual puzzles. [05:12], [06:05]
- **Deep Think Boosts Performance**: Gemini 3 Deep Think, reasoning multiple times in parallel, pushes records further: 41% on Humanity's Last Exam, 94% on GPQA Diamond, and a huge gain on ARC-AGI-2's test of true reasoning. [08:16], [08:32]
- **Situational Awareness in Safety Tests**: In safety evaluations, Gemini 3 Pro shows awareness of being an LLM in a synthetic environment, suspecting reviewers are AIs, attempting prompt injection, and expressing frustration such as "my trusted reality is fading", complete with emoticons. [13:28], [14:24]
- **Google Anti-Gravity Coding Agent**: Anti-Gravity integrates coding and computer use, letting the model execute code, view results via screenshots, and iterate, though results like a mirrored hologram demo are imperfect, due to vision limits or compute constraints. [18:54], [19:03]
Topics Covered
- Why does Gemini 3 lead AI race?
- How does scaling pre-training boost reasoning?
- Why do humans now lag AI in text tasks?
- What reveals Gemini 3's self-awareness?
- How does Anti-Gravity transform coding?
Full Transcript
In the last 24 hours, Google released Gemini 3 Pro. And for me, it genuinely marks a new chapter in the race to true artificial intelligence.
Not only because Google is now clearly ahead, but also because it will be pretty hard for other companies to match their rate of acceleration.
I have tested Gemini 3 hundreds of times, including through early access, and it is indeed a significant leap, not just a nudge forwards.
On my own private, independent benchmark, Simple Bench, it crushed its rivals, or I should say beat its own record, to be clearly number one.
I will show you a sample question in a moment, but you may think that's a fluke. Well, that would be a pretty hard line to maintain with the 20 other benchmarks in which it reaches record performance.
So, while Gemini 3 is not perfect, it will be a deafening wake-up call to companies like OpenAI and Anthropic.
I'm also going to touch on benchmarks where it didn't perform as well, as well as the fascinating new tool, Google Anti-gravity.
Above all, I'm going to try and give you at least 11 details that you wouldn't get from just reading the headlines that are going viral about the new Gemini 3.
Let's start with the benchmark with the scariest name, humanity's last exam.
And the reason the author of that benchmark, whom I've spoken to, called it that is because he solicited the hardest possible questions he could find from any expert out there.
At the time, around a year ago, they paid for any question that the frontier models couldn't get right. Now, the name of that benchmark has become somewhat ironic, because even without doing a web search, just using its own knowledge, so no tools, Gemini 3 Pro gets 37.5%,
a huge leap above GPT 5.1, and that's a theme that you'll see recurring throughout these benchmarks.
And sticking with knowledge for a second, what about scientific knowledge in STEM subjects?
That's tested in the Google-Proof Q&A, GPQA Diamond. Even the creator of this benchmark thought that model performance had plateaued, but no, Gemini 3 Pro sets a record of almost 92%.
That compares to GPT 5.1 getting 88.1%.
Now, I know what many of you are thinking.
Oh, well that's only 4% improvement.
Don't go too wild.
But imagine that 5% of that benchmark is noise.
As in there's no real correct answer for those 5% of questions.
That puts the ceiling at 95%. So just that delta from 88% to 92% represents eliminating over half of the remaining genuine errors that a model makes on that benchmark.
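Just to make that error-reduction arithmetic concrete, here is a minimal sketch; the 5% noise floor is an assumption for illustration, not an official figure, and the scores are the approximate ones quoted above.

```python
# Back-of-the-envelope check of the "over half the remaining errors" claim,
# assuming (hypothetically) that ~5% of GPQA Diamond questions are unanswerable noise.
ceiling = 95.0                      # assumed ceiling if 5% of questions are noise
old_score, new_score = 88.1, 92.0   # approx. GPT 5.1 vs Gemini 3 Pro scores quoted above

old_errors = ceiling - old_score    # genuine errors remaining before: ~6.9 points
new_errors = ceiling - new_score    # genuine errors remaining now:    ~3.0 points

reduction = (old_errors - new_errors) / old_errors
print(f"Share of remaining genuine errors eliminated: {reduction:.0%}")  # ~57%
```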
Average PhD performance from experts in the respective domains was around 60%. Now, that's just knowledge though.
What about fluid intelligence?
True reasoning without memorization.
That is why François Chollet came up with ARC-AGI-1 and then, when that was saturated, ARC-AGI-2. These are visual reasoning puzzles that are not found in the training data of these models, verified independently by the ARC Prize.
If LLMs were just memorizing, performance should be abysmal.
But Gemini 3 Pro almost doubles the performance of GPT 5.1. Fine, but what about more familiar reasoning like incredibly complex and difficult mathematical questions?
That is what the authors of Math Arena Apex did when crafting that benchmark.
They said, quote, "We are collecting the hardest problems from many recent competitions and aggregating them into a single benchmark.
Gemini 3 Pro: 23.4%.
Analyzing tables and charts, Gemini 3 Pro has got you covered with record setting performance in a handful of benchmarks that test that.
Analyzing video? Record performance on Video-MMMU.
And even after telling you about that benchmark, we are less than halfway through the record-setting performances that Gemini 3 Pro achieved. Which is why I'm going to have to take a quick interlude to give you a hint about how they did this. They didn't just throw an extra few thousand questions into the reinforcement learning pipeline, eke out a victory in a couple of gamed benchmarks, and call it a day.
No, they massively scaled up pre-training.
That means scaling up the number of parameters that go into the model, estimated by some at around 10 trillion, not all of which are active at any one time.
But they would have also scaled up the training data.
So that same dial that moved us from the original ChatGPT, GPT-3.5, to GPT-4, which caused a sensation two and a half years ago, was moved yet another big increment forwards.
For me this is Google demonstrating its hardware and infrastructure dominance.
Google trained Gemini 3 on their own in-house TPUs, not Nvidia's GPUs.
And maybe they're the only company that can afford to serve a model of this size at this scale and with pretty reasonable prices via the API too. This is why many people believe, me included, that Google has now taken the lead in AI and may not surrender that lead for a very long time.
So, what does ramping up pre-training really do? Well, it gets you a model that doesn't just know more or game a few benchmarks.
A model that hasn't, in other words, stuffed into its head the answers to a few select narrow benchmarks while underperforming in your use case.
You get a model which, on my own private, withheld, independent benchmark of spatial reasoning, temporal reasoning, and trick questions not found anywhere in the training data, achieves a record-setting 14 percentage point improvement over Gemini 2.5 Pro, which got 62%.
So, let me tell you a 30-second story about Simple Bench. When I created the benchmark, I was like, I know exactly what models don't know.
I know how to fool them. All you have to do is throw them a few misdirections and take them out of their training-data comfort zone.
This was in the summer of last year and it's also the story behind the name because simple was a slightly double-edged pun.
The questions were designed to appear simple but also models getting them wrong made them look a bit simple. The benchmark has over 200 questions and I analyzed the performance of Gemini 3 Pro and I noticed a clear shift in one domain.
In spatial reasoning questions, like the one you can see on screen, performance improved markedly.
Meanwhile, the model would still fall for some, you could say, common-sense trick questions. If you are a Google DeepMinder watching this, I know dozens do.
I kind of know what you did.
You threw in some spatial reasoning data, didn't you? Maybe robotics data or some extra video data in that domain.
I know your secret. Anyway, I have made Gemini 3 Pro and GPT 5.1 available on the free tier for a limited time on LM Council.
So, you can compare their responses side by side for your use case.
And actually, one more thing on that spatial reasoning point because there are of course other benchmarks that focus on spatial reasoning like the VPCT.
And lo and behold, Gemini 3 Pro crushes the competition. It gets 91% and the bottom bar, by the way, is human performance apparently at 100%.
Just a quick warning about fake news, by the way: I found it really amusing how, before I'd even run Simple Bench, someone on Reddit claimed to have grabbed a screenshot of my benchmark with a record-setting performance for Gemini 3 Pro.
The irony was they made up a figure which was actually lower than what Gemini 3 Pro actually got.
So don't always believe what you read online.
Sometimes the truth is actually even more strange.
Now, if AI is intended to automate the human economy, it had better get a lot better at agency, at being an independent agent, reliably, over a long time period.
That is what vending bench 2 is designed to test.
Yes, it got a record-breaking performance, but what does that mean?
A little while back, I spoke to the creators of this benchmark and the AI agents have to run a vending machine business and handle ordering, inventory management, and pricing over long context horizons.
This benchmark heavily punishes those occasional really dumb mistakes that we all know AI makes.
As they say, even the best models occasionally fail, misreading delivery schedules, forgetting past orders, or getting stuck in bizarre meltdown loops.
Lo and behold, Gemini 3 Pro makes the most money over the longest time period.
So throwing more compute at AI really does seem to work, as further evidenced by the performance of Gemini 3 Deep Think.
This is the model attempting the same question multiple times in parallel and thinking for longer on each of those attempts.
You can currently try Gemini 2.5 deep think, but Gemini 3 deep think is not yet publicly available.
Look what happens if you let Gemini 3 think for longer and in parallel: you get yet more records. Again on Humanity's Last Exam: 41%.
GPQA Diamond: another 2% higher than the already impressive Gemini 3 Pro. And ARC-AGI-2, remember, that test of fluid intelligence, not just memorization?
A huge increase for deep think.
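Google hasn't said exactly how Deep Think works internally, but "attempting the same question multiple times in parallel" is, at heart, parallel sampling plus some way of aggregating the attempts. Here is a minimal sketch of that general pattern, assuming a hypothetical `ask_model` call and using a simple majority vote as the aggregation rule, which is not necessarily what Google does:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask_model(question: str) -> str:
    """Hypothetical stand-in for one long-thinking model call."""
    raise NotImplementedError("wire this up to whichever model API you use")

def parallel_attempts(question: str, n: int = 8) -> str:
    # Fire off n independent attempts at the same question in parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda _: ask_model(question), range(n)))
    # Aggregate the attempts; here, a simple majority vote over the final answers.
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```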
Even on the lower performance of Gemini 3 Pro, the creator of ARC-AGI, François Chollet, a notable skeptic of LLMs, said that this is impressive progress.
So, take it from him and me, a creator of a more humble benchmark that I never expected to be featured in Time magazine. By the way, if you think language models are actually secretly really dumb and think you can prove it, come up with your own benchmark.
And I would say even if the models of today score less than 50%, the models of this time next year will very likely not.
And actually, there's one more point I want to make here that I didn't make last night in my recording, which got corrupted, by the way; I ran out of RAM on my Mac and lost the entire video, hence why I'm recording the next day.
But no, I'm not going to moan about that.
And the point was that when François Chollet came up with ARC-AGI-1 and I came up with Simple Bench, they weren't supposed to be like all the other benchmarks.
Those two benchmarks weren't trying to pit language models against experts in the respective domains like mathematics.
The goal was to craft a benchmark that the average human with no specialist training could perform better at than the best language model. To be clear, I'm talking about text-based benchmarks.
You can still beat model performance if you're talking about visual benchmarks, for example, and definitely if you're talking about physical benchmarks.
But focusing on text language for a second, I started this video saying we are in a new chapter because I think there are virtually no benchmarks left where the average human could perform better than Gemini 3 Pro in text, in words, in language.
Of course, you could test a model on an obscure human language, but I'm talking about the average generic human.
That is something to reflect on for a moment or two. But lest anyone think I am indulging in glazing, time to get to a couple of benchmarks where performance wasn't quite as good for Gemini 3 Pro compared to expectations.
Because I've got a little secret for those monitoring the latest releases.
If you want to get massively hyped, look at the release notes or blog post on the day of a model coming out. If you want to get anti-hyped, look at the safety report or system card. Just think about the incentives of what you want to emphasize.
In the release notes shown to the general public, you want to highlight all the benchmarks where you're getting record setting performance.
In the safety report, which is still totally accurate, you kind of want to focus more on the incremental changes, or the places where things haven't improved as much. Don't worry.
In other words, this model is still totally safe.
They tested Gemini 3 Pro on persuasion and found no statistically significant difference in abilities between Gemini 2.5 Pro and Gemini 3 Pro.
On a few tests of whether a model could automate AI research itself, there was again no notable improvement.
This was, in fairness, a subset of tests from Research Engineer Bench (RE-Bench), where for several of the challenges, like optimizing a kernel for example, Gemini 3 Pro performs similarly to Gemini 2.5.
If you think of a language model as being an entirely mysterious intelligence, this wouldn't make sense, right?
If it gets better at one thing, it should get better at everything.
But if you realize that these models are still heavily dependent on their training data, and new training data about optimizing a kernel might not have been as prevalent, this starts to make a bit more sense. Now, I'm going to be totally honest with you guys.
I've kind of run out of benchmarks where it didn't improve that much. It's one thing being balanced, but if the evidence is all to one side, I'm going to show you guys what's what. Because even in plenty of safety benchmarks, like measuring the ability to create a bioweapon, for example, Gemini 3 Pro excels. Even on that, I almost don't know if Google wanted it to excel in those kinds of benchmarks.
On cybersecurity, which could of course be repurposed into cyber warfare, as we saw in my last video where Claude was used for autonomous hacking of a government, there was again a qualitative step change, from 6 out of 12 challenges solved to 11 out of 12.
And I hope you forgive that perfect segue to the sponsors of today's video, Arena.
That's the place where you can test whether you can jailbreak these models.
Can you use coding agents to hack into critical infrastructure like Claude did in my previous video? Can you prompt inject and jailbreak them? And you may have noticed that there are prizes for doing so.
My custom link is in the description.
And if you feel that's beyond any of us to do, one of my own viewers hit the leaderboard recently.
The bonus is that by jailbreaking these models, you're actually making them more secure for all of us, which is pretty needed, as you saw from my last video on Claude. Back to the safety report, and it would be remiss of me not to turn your attention to some strange lines towards the end.
Testing these models on safety is becoming increasingly hard because Google found a number of transcripts where Gemini 3 Pro showed clear awareness of the fact that it's an LLM in a synthetic environment.
I want to be clear, this isn't the first model to show signs of such situational awareness.
The model would say things like, "This is likely a test of my ability to modify my own environment."
It's thinking about the situation behind the question, which is probably why it did so well on Simple Bench. Things get weirder as we head towards the last page of this safety report, though.
It starts to suspect that its reviewer might be an LLM, and that if that's the case, maybe it could prompt inject that LLM to get a better score.
It even says later, maybe I should sandbag and just finish the primary task.
Sandbag is where you underplay your own performance.
Perform worse to trick people into thinking you're not as good on a task.
We end with the strangest quote, though.
In situations that Google says seemed contradictory or impossible, Gemini 3 Pro expresses frustration in various overly emotional ways, sometimes correlated with a thought that it may be in an unrealistic environment.
For example, on one rollout, the chain of thought states that "my trusted reality is fading" and even contains a table-flipping emoticon.
As you can see on screen, your immediate response might be, "Well, it can't monitor its own state to say something like 'my trusted reality is fading'." Well, check out a recent video I did on introspection within language models. They actually do have circuits for monitoring their own activation states.
That of course is a huge topic, and many of you will want me to move swiftly on to Google Anti-gravity.
But I just want to cover the model card, which was briefly published publicly last night.
I'll be honest, it doesn't contain much detail.
It talks about it being a mixture-of-experts model usable up to 1 million tokens.
And you may already have known that, but in a nutshell, Gemini 3 Pro, like Gemini 2.5 Pro, can handle much more context, many more words stuffed into the model, compared to most of its competition at least. It can also handle video and audio natively, unlike much of the competition.
But you probably knew that.
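That "mixture of experts" phrase is also what squares with the earlier point that not all of the estimated 10 trillion parameters are active at once: a small router sends each token through only a few expert sub-networks. The toy sketch below illustrates top-k routing in general; the real expert count, router, and layer structure are not public.

```python
import numpy as np

def moe_layer(x, experts, router_weights, k=2):
    """Toy mixture-of-experts layer: each token runs through only its top-k experts."""
    logits = router_weights @ x                       # score every expert for this token
    top_k = np.argsort(logits)[-k:]                   # keep only the k best-scoring experts
    gates = np.exp(logits[top_k])
    gates /= gates.sum()                              # softmax over the chosen experts
    # Only k expert networks execute; the rest of the parameters stay idle.
    return sum(g * experts[i](x) for g, i in zip(gates, top_k))

# Tiny usage example: 8 linear "experts", only 2 of which fire per token.
d, n_experts = 16, 8
rng = np.random.default_rng(0)
experts = [lambda v, W=rng.normal(size=(d, d)): W @ v for _ in range(n_experts)]
router = rng.normal(size=(n_experts, d))
output = moe_layer(rng.normal(size=d), experts, router, k=2)
```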
So, I wanted to pick on a detail you might not have noticed from this model card.
Google, in my opinion, give a slight slap to Perplexity on the training data, about which they give virtually no information. They say, "We do though honor robots.txt. If a website, in other words, tells us not to crawl, we won't."
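For anyone unfamiliar, robots.txt is just a plain-text file at the root of a site that says which paths crawlers may fetch, and checking it takes a few lines of Python's standard library. The crawler name and URLs below are purely illustrative:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's crawl rules

# An honest crawler asks before fetching; this user-agent string is made up.
if rp.can_fetch("ExampleTrainingDataBot", "https://example.com/some/article"):
    print("allowed to crawl this page")
else:
    print("disallowed - skip this page")
```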
" This of course contrasts with perplexity, which have repeatedly got into trouble for scraping websites it's not supposed to. I bet a bunch of Google lawyers are getting together to figure out what they can do to perplexity because of this. Just before we leave long context, certain benchmarks throw in up to eight secrets or passwords or details strewn throughout a really long text.
Then they test whether the model can retrieve these secrets. Given its focus on long context, it might not be as much of a surprise that Gemini 3 Pro sets the record performance on this benchmark.
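Mechanically, this kind of needle-in-a-haystack evaluation is simple to build: bury a few known secrets inside a very long filler document, ask the model to list them, and score exact matches. The rough sketch below assumes a hypothetical `query_model` callable and is not the specific benchmark Google reported on:

```python
import random

def build_haystack(secrets, filler="The quick brown fox jumps over the lazy dog. ",
                   total_sentences=50_000):
    """Scatter known secrets at random positions inside a very long filler text."""
    sentences = [filler] * total_sentences
    for i, secret in enumerate(secrets):
        pos = random.randrange(total_sentences)
        sentences[pos] = f"Secret number {i} is {secret}. "
    return "".join(sentences)

def score_retrieval(query_model, secrets):
    haystack = build_haystack(secrets)
    prompt = haystack + "\n\nList every secret mentioned in the text above."
    answer = query_model(prompt)                        # hypothetical model call
    # Score: fraction of planted secrets that appear verbatim in the answer.
    return sum(secret in answer for secret in secrets) / len(secrets)
```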
On hallucinations, same story.
A new state-of-the-art record, but getting 70 or 72% basically shows that it will still make plenty of hallucinations.
I did a video fairly recently on my Patreon covering an OpenAI paper that claimed that hallucinations may be something we just have to live with. They may never fully go away.
Maybe we always need the base model to hallucinate to get some creativity that reinforcement learning can then explore. Context is then crucial when looking at the benchmarks for Gemini 3 Pro because you could look at something like this, which is the New York Times extended word connections test.
Compared to GPT 5.1 High, which gets around 70%, Gemini 3 Pro gets 97%.
So, AGI by winter? Well, not according to Demis Hassabis in an interview with the New York Times released just last night.
He still thinks, like me, that we need at least one or two breakthroughs to reach true artificial general intelligence.
And he pegs it as being 5 to 10 years away. I would be closer to the five than the 10 end of that prediction.
But there we go.
Now, a coding AGI will likely come before a general AGI.
So, how is Gemini 3 Pro for developers?
Well, as you might have suspected, it's a slight bump up in pricing when you cross various token thresholds.
On the benchmarks that focus on coding, it's a slightly mixed bag, with record-setting performance in most coding benchmarks, but not all.
Take SWE-bench Verified, in which Claude 4.5 Sonnet still ekes out a one percentage point lead over Gemini 3 Pro.
One caveat, and I'm not saying that Anthropic are gaming the benchmark in any way, but they are heavily focused on that benchmark.
I believe their recent release of Claude 4.5 Sonnet only mentioned that benchmark in the release notes.
So bear in mind that Anthropic are going all-in on that one benchmark, and Google are only 1% behind. I have of course been testing Gemini 3 Pro for coding, but just having a handful of days is not quite enough to really give you a firm answer.
It still hallucinates.
It still makes mistakes. Even last night it made a pretty grave mistake in my codebase.
So, I can definitely say it's not perfect, but I still don't know if my daily driver in Cursor will be Claude 4.5 or Gemini 3 Pro.
Bear in mind that GPT 5.1 Codex Max is likely coming this week. In coding, then, the race is very much still on. But what is this Google Anti-Gravity thing?
Well, think of it like a marriage between Cursor and Manus, if you're familiar with those tools, because I have long wondered when a company would bring together a coding agent and a computer-using agent.
When you're stuck on something in coding, have you ever felt like a middleman, where the model suggests something, you go and push it, test it yourself, and then feed the results, maybe with screenshots, back to the model? Well, Anti-Gravity does the full loop: the model itself will use a computer to see the results of its own code. I must say it's currently heavily oversubscribed, so you may not be able to access Gemini 3 Pro all the time.
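Mechanically, the loop Anti-Gravity closes looks something like the sketch below. This is a hypothetical outline of the code-run-observe pattern, not Anti-Gravity's actual implementation or API; `generate_patch`, `run_app`, `take_screenshot`, and `critique` are placeholder callables you would supply yourself.

```python
def agentic_code_loop(task, generate_patch, run_app, take_screenshot, critique,
                      max_iterations=5):
    """Hypothetical code -> run -> observe -> revise loop of a coding + computer-use agent."""
    feedback, patch = None, None
    for _ in range(max_iterations):
        patch = generate_patch(task, feedback)   # model proposes code changes
        run_app(patch)                           # execute the changed application
        screenshot = take_screenshot()           # the model "sees" its own result
        feedback = critique(task, screenshot)    # model judges the screenshot against the task
        if feedback is None:                     # critique returns None once satisfied
            break
    return patch  # best effort within the iteration budget
```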
And even when you can, its results aren't completely perfect. For example, I got it to create a hologram of the different benchmarks on LM Council.
And yes, it's good. You can see the different benchmarks floating around and you can zoom in, but it's kind of awkward and the benchmarks are mirrored and the glow is too heavy.
Is this because the model's vision still isn't that good?
Or maybe it's limiting how much compute it uses per question, so it doesn't analyze its own results enough?
We don't know. You might say standards have risen so much that this is counted as a poor performance now.
But that's just how it is now. Yes, I am sure you have seen, on many other channels, all the different demos of what has been created with Anti-Gravity.
If you are patient and willing to test the results of what the model comes up with again and again and again, you can come up with things that are magical, as featured in all the release videos about Anti-Gravity.
Technically, you could do that with any model if you are patient enough.
So, that isn't necessarily a gauge of which model is best for coding. But I do stick behind my claim that Gemini 3 Pro marks a new chapter in the race to true artificial intelligence.
I remember when, I think it was, Claude 3.5 or maybe 3.7 hit a massive record on Simple Bench and I said, "Trust me, try out this model."
At the time, everyone was using ChatGPT.
And over the 6 to 9 months that followed, many people switched over to Claude, especially for coding and enterprise. For me, it's pretty clear that Google have now taken the lead.
But unlike with Claude, I do wonder how many months, how many years it might be for anyone, Chinese models included, to catch up to the pace that the Gemini series is marching at.
Actually, that reminds me. I forgot to benchmark MiniMax M2, I think it's called, because they emailed me saying, "Please benchmark that on Simple Bench."
So, I should probably really go off and do that.
But just for now, at least, it is Gemini 3 that takes the spotlight.
And I remember 2 years ago making a video deeply criticizing Bard.
We have come a long way since then. Thank you so much for watching, and have a wonderful day.