Gemini 3.1 Pro is the smartest model ever made
By Theo - t3.gg
Summary
Topics Covered
- Benchmarks Trump Usability
- CLI Tool Calls Fail
- Spatial Reasoning Perfected
- Hallucinations Slashed
- Competence Lags Intelligence
Full Transcript
Another day, another new world's smartest model. Gemini 3.1 Pro just dropped. And honestly, it is kind of crazy. All of the numbers I've been seeing are insane: both the ones Google published and the ones being published externally, as well as the ones I've gotten myself. I've been using this model a ton, running it against every benchmark. And the numbers are incredible. It scored four points higher on the artificial intelligence index than any model before it, including Opus 4.6 Max. What's even crazier, though, is it cost less than half as much to get that score, at only $892 versus the almost $2500 that Opus 4.6 cost. Google's own numbers show it crushing everything, including ARC-AGI-2, and just getting a 78% on ARC-AGI-2 is insane. The whole point of that one was that LLMs couldn't do it. Oh god, no. This is insane. Like, all of these numbers are nuts. So, is that it? Is Opus dead? Has Google officially won the AI wars? Not quite. None of these things are this simple. As you guys know, there's a lot more depth to the numbers than there seems to be online. And I cannot wait to go through all of this with you guys right after a quick word from today's sponsor.
You know how it feels like your AI gets way better when you give it access to your whole codebase instead of just trying to scope it to a file? It turns out it goes way further if you give it access to the internet. It can do so much, especially with all these new browser use agents. But what browser are you going to let them use? Probably not the one on your computer, right? Especially for programmatic, real-world use cases. That's why I'm so hyped about today's sponsor, Kernel, the best place to run your browser use agents in the cloud. Their open source infrastructure is absurdly fast. It takes less than 30 milliseconds for them to spin up a browser that you can connect to. And the actual process of doing that couldn't be simpler.

Claude Sonnet 4.6 is way better at browser use than previous models, and it's actually been really fun to play with. This code will actually spin up a browser with Kernel and then give access to Claude so that it can run and do things on it. We're going to have it take a screenshot of the current page, search around to find the blogs about manage o, and then read and summarize the blog post. And it does this by actually navigating the site. Now all I have to do is run our dev command, and a browser will be launched in the cloud. You can even spin up a live view if you want, which will show you what's happening as it actually occurs. I have my cursor there, but you can see there's a mouse cursor there. If I want, I can take over and do things, but I don't need to. I'm letting the agent do it. So, it just found and clicked on the blog. It's decided to scroll, and now it's finding the post that we were talking about in the prompt. And if we look here, you can see what Claude decided to do each step along the way. You can probably already see how useful this is, but once you add in things like persisting auth, captcha workarounds, and all the other stuff you need, it becomes a genuinely useful way to do things that might not have an API. Stop limiting your agents and what they're capable of, and give them access to the web at soyb.link/kernel.
Back to it. We need to talk about Gemini 3.1 Pro. As I said before, I've been using it a ton. I currently have it running in the background porting an old app. I've been using it in a ton of different harnesses and a ton of different tools, and building a bunch of different stuff as well. I also used it to update some old projects. And in doing such, I have learned to deeply, deeply hate the Gemini CLI. Not only did the Gemini CLI not have Gemini 3.1 on day one (and I actually accidentally ran this with Gemini 3, it looks like; great, love that), it also just breaks all the time. It is super buggy and does just weird [ __ ]. I was intentionally trying to preserve this history so I could show you guys some of the weird stuff, but since my resolution changed on my computer, that appears to have died. Great. Well, I'm sure there will be plenty of issues with this run. Let's take a quick look. You know what? Actually, first things first: because I accidentally ran Gemini 3 Pro before, I got used to Gemini 3's horrible inability to do basic tool calling correctly and consistently. Almost the entire time that it was running, it was hitting errors that made no sense at all, like failing to edit a file it had just read by passing bad syntax. It seems like 3.1 is significantly less bad about that, at least from the official logs for this particular run I am doing right now.
Since I couldn't use it in the official CLI, I decided to first try it in the Cursor one, and it was insightful. The Cursor CLI actually seems decent so far. I have my issues with it. You can't paste images, which sucks, but otherwise it seems to work. So, let's do a test example here, telling it to move from SQLite to Postgres. I'm not actually going to have it do that, but just see how this goes so you guys can see some of these thinking traces. Taking its sweet, sweet time. Reconnecting. Great. Good start. Uh, I don't know if this is Cursor's fault or Gemini's fault, but whichever it is, this is a bad experience. So, let's try again.

You might also notice that my top nav is visible, and it normally isn't. This is a bit of a tangent, but I want to complain. So, I'm going to show you guys that when I go to settings, my whole computer breaks. And now, this is what my Mac settings look like. I am so done with macOS.

Okay, it aggressively hides the thinking right after showing it to me in the Cursor tool. So, I'll show you here. Every single time it runs, it has a "prioritizing tool selection" step: "I've been focusing on tool selection, leaning heavily on specialized tools when available. I'm actively avoiding cat for file manipulation, particularly when other..." and it gets cut off. I've seen this in everything that is correctly using Gemini. These models suck at tool calls. They just don't do them well. They can be handed a bunch of tools, smile and wave, and then not do anything unless you put a ton of work into instructing them to use those tools. And if you're not careful, they'll instead massively overuse them. The reason why I'm using Cursor here is because of this specifically. They've put a ton of time into tuning the Gemini models into, well, I don't know how to describe it other than sucking less. Google just does not seem like they are training these models to be good in harnesses. They are too focused on benchmaxing.
And oh boy, have they benchmaxed. A while back, I made a benchmark called SkateBench that was meant to measure how well different models could name skateboarding tricks based on descriptions of the tricks. It's a combination of a weird niche knowledge bench as well as a spatial recognition bench, because you need to know about the rotations and how they affect the board and the skater and the relationship between all of those things. When I first made this benchmark, Grok 4 had the highest score by far, in like the high 70s. Okay, 75 is what it got on my most recent run. And then I went to the OpenAI office to try out GPT-5, which scored a 98, which just absolutely broke me at the time. Since then, OpenAI models have actually regressed quite a bit, and the highest I can get one of the modern ones to do is about an 87, which is insane that it has regressed that meaningfully. Obviously, 5.2 and 5.3 are phenomenal models for code. They were not as good at this. Take that as you will.

We have a new winner, and it's not close. Gemini 3.1 Pro Preview consistently hits 100% on this benchmark. I don't know what the hell they did, but it gets 3D very, very well. It's also good at 2D space. It's the first model to make an actually decent, usable SVG of the pelican on a bike. It took 323.9 seconds of thinking, but it did a pretty good job here. Not only can it make these SVGs, it can actually animate them. Like, I've never seen a model capable of this type of thing. SVG animations are an intricate science that is not easy to do. And yeah, it killed it with a lot of this. I was even setting it up on a couple other projects I was working on and was impressed with its animation capabilities. Personally, I was not as impressed with the SVGs it was creating, but I was able to make it do some cool looking stuff.
If you've been around the last few weeks, you know that I've taken on far too many side projects. I'm close to wrapping up a few of them and shipping them, and I might have picked up one or two more. I want to touch on a few quickly though, because I've been using Gemini for them, and there are things it has done really well, as well as things it has done really poorly. One of the things it's doing incredibly well is design. I'm working on a new video review service, like code review but for video, as an alternative to Frame, cuz I'm tired of Adobe stuff. And it made a stunning homepage. I was trying to get it to do some fancy stuff with the layering of the images, but it failed there. So, I told it: screw it, you have these pieces, go redesign it, make it look decent. This was the first thing it spit out. I obviously gave it suggestions on how to tidy up a few things from there. But, like, it made a good-looking website without a whole lot of work. I already had a site here that wasn't great, and I told it to basically unscrew it, and it did. Like, I'm probably going to ship this as is. I'm very happy with how this came out. It also helped build this UI for a new project that I'm working on that I'll share a lot more about soon. I'm very excited about this one. I have these models effectively playing a game of Quiplash against each other where everything is generated. And I have been surprised at how funny Gemini 3.1 Pro has been. I also built this whole app with Gemini 3.1 Pro. It had its quirks.
It definitely struggled when the length of the task got bigger, and I had to help it unblock itself a lot. Like, that's funny: "A polite thing to scream at scarecrows. Nice flannel. Sorry about the bird [ __ ]." Like, that was decently funny. And I'm not the only one who thinks it's funny. The models do, too. See that? Apparently, it's the funniest model. I never would have expected that Google would be the one to make something funny. I honestly thought Grok would do a bit better. I've actually been surprised multiple times where I looked at the two responses when it was Grok versus Gemini, and Gemini was the one making the more vile jokes. To be fair, the first UI it made was... ugly would be putting it lightly. It also had a lot of quirks when I was running it. I originally wrote this as a CLI app, and man, I saw a lot of weird Gemini-isms when I did. This one was funny because the GLM 5 prompt writer failed, so they didn't get prompted with an actual thing they had to respond to. They were told to be funny in the system prompt, so Grok did its classic "Jeffrey Epstein didn't kill himself," and then Gemini 3.1 Pro went with... yeah, this is the thing about the Gemini models. They don't really behave well. I don't mean that they go off the rails and do nasty stuff. I mean, they might do that a little bit. I've definitely seen Gemini 3.1 Pro deleting things it shouldn't be. I just had that happen yesterday actually, where it nuked a bunch of assets that it was not supposed to touch. Be careful when you're giving it file write access, for sure.
But the more annoying thing is just the way it fails to perform as a freaking model. Like, it just doesn't do it right. The amount of times I've seen it get stuck in two-word loops where it just says the same two words over and over again, or respond with utter nonsense and immediately stop, or just switch over to Chinese characters, which would happen a lot with Gemini 3 Pro. Haven't had that with 3.1 yet, but I have all of the other quirks and then some.

Here's another funny one. In the system prompt for the model that is making the prompt that the other models use to respond to... I know, "prompt" means two different things here. It's confusing. The system prompt I wrote to give the model so it could generate a game prompt to use for the other models. That system prompt says that it should only return the thing it wants to have the other models respond to. It says "return only this thing," which Gemini did by saying "return only." This is like Llama 2 [ __ ]. But this is the nature of the Gemini models. They are insanely smart. The amount of intelligence they've baked into these models is honestly overwhelming. I am consistently surprised by how competent it is in knowledge.

And I'm not the only one that feels that way. Artificial Analysis has commented on the same, specifically around the hallucination rates and its knowledge in general. Artificial Analysis made their own Omniscience benchmark that keeps track of knowledge and hallucinations and lets them cancel each other out. So you're ranked both by how many answers you get correct as well as how many times you answer incorrectly. And you're not penalized for refusing to answer, for saying "I don't know." So this is a benchmark that rewards models that are able to say "I don't know what the answer to that is," punishes them if they lie a lot, and then obviously rewards them if they get the problem correct.
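That scoring setup is easy to sketch. Here is a minimal, hypothetical version of a knowledge-minus-hallucination score; this is my reading of the scheme as described, not Artificial Analysis's actual formula:

```python
def omniscience_score(answers):
    """Score a list of graded answers: 'correct', 'incorrect', or 'abstain'.

    Correct answers add a point, wrong answers subtract one, and saying
    "I don't know" costs nothing -- so a model that confidently
    hallucinates can end up net negative while knowing just as much.
    """
    score = 0
    for a in answers:
        if a == "correct":
            score += 1
        elif a == "incorrect":
            score -= 1
        # 'abstain' is neither rewarded nor punished
    return score

# Two models with identical knowledge (40% of answers right):
confident = ["correct"] * 40 + ["incorrect"] * 60  # hallucinates the rest -> -20
honest    = ["correct"] * 40 + ["abstain"] * 60    # admits ignorance      ->  40
```

This is why models like Sonnet 4.6 can land in the negatives on a hard question set: being wrong actively costs you, while refusing is free.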
And as we can see here, 3.1 Pro Preview had a really good score. Even some of the smartest models like Sonnet 4.6 and GPT-5.2 high are still in the negatives, because this benchmark has a lot of really hard problems that the models will often hallucinate and get incorrect. If we hop over to the hallucination rate chart though, you'll see previous models like Gemini 3 Flash and GPT-OSS-20B were insanely high. When they didn't know the answer, they would never refuse. They would just be wrong. Surprisingly enough, Haiku 4.5 is the model that is the best about this. It's very quick to say when it doesn't know something. Gemini 3.1 Pro Preview though is a huge advancement here. There is a comical gap between 3.1 Pro and 3 Pro, where the hallucination rate has nearly halved in that time. But where Google really slaughters in these benchmarks is the accuracy. It just knows more information. So, it's able to answer a lot more of these questions correctly than even other leading models.

It's obviously such an intelligent model. The amount of information, the amount of capability baked into these weights is unreal. But when it comes to actually using it, it genuinely sucks. Some part of that is just the way it randomly spams my terminals after I close things for no reason. I have no idea what's going on here at all. We're just going to close that terminal entirely. Why did it do that? Okay. I am genuinely trying so hard to use Gemini 3.1 Pro to test things. But no matter what the [ __ ] I do in their CLI, it just randomly changes to other models. Why is it using Flash Lite, Flash 2.5, and 3 Flash Preview when I selected Gemini 3.1 Pro? Their CLI is legitimately unusable.
And as silly as it sounds, I think that's a big part of why it sucks as hard as it does. It's pretty clear now that the other labs are using the histories from other people's chats to train models. If you have a before and an after, if you have the thing that somebody needed changed and the changes they made in the before and after for that repo, it's pretty easy to generate a fake history for any of these CLIs. Once the labs have generated all of these fake histories, they use them for reinforcement learning. And the result is that the models can work in these tools, and they can work in them really well for really long amounts of time. Google doesn't seem to be doing this at all. I could be wrong. They're clearly RLing on something, but it's not helping with these long agentic runs.

The METR evals show what I'm talking about here really strongly. If you're not familiar, this is a benchmark that tracks how long of a task, in terms of the human time to complete it, the models are capable of completing. At the start of 2025, the best we had was a task of half an hour to an hour being completed by an agent on its own 50% of the time. This is a 50% success rate. Now with Opus 4.6, we're up all the way to the 16-hour range, where it can run on a task for a very long time. Obviously, the run itself won't be 16 hours. It'll be closer to half an hour to an hour, but it's a task that would have taken a human 16-plus hours, with a 50% success rate. That's nuts. Both GPT-5.2 and Opus 4.6 are killing it here. And once 5.3 is on the API, I would suspect it will kill it here as well. I don't think Gemini models are capable of performing well on this, because they tend to get really confused and lost when given the ability to go do things for a while.

I also said earlier it seemed like the tool call issues had mostly stopped. I can't know that, because it just ran with the wrong [ __ ] model. So yeah, I'm now rerunning this yet again with 3.1 Pro Preview. Hopefully it will actually stay on the model I selected this time. I couldn't help but notice there isn't a plan mode. And I also just accidentally turned off YOLO mode. Great. The CLI is awful. Obviously, the CLI is awful. We all know that. Just use anything else if you can.
Oh man, I just need to complain a bit about this dissonance between how smart these models are and how competent they are. I'm going to do a strange comparison. I want to talk about Claude 4.5 Haiku. Why would I want to talk about Haiku? It's a super dumb model. If we go to the intelligence index on top here, you'll see it's not even in the main chart, because why would it be? It's scoring a 37, and these models are in the high 50s now. Why would I want to talk about this tiny, cheap [ __ ] model? Because it does its goddamn job. And this is the thing I want to emphasize. Models aren't just bundles of knowledge. They're also now expected to have behaviors and to do things, especially with tool calling, and especially with knowledge gathering through the tool calling. Obviously, a model with more knowledge is going to be able to do more things in general. If it's smarter, it knows more and it's able to do more. But if it's able to gather knowledge better, that tends to push it ahead quite a bit. And what Anthropic has been super focused on, from at least the Sonnet 3.5 days onwards, is the consistency around the tool calling. I never see Haiku screw up the shape of a tool call. If you tell it how a tool works, it will use it, and it will use it relatively well. If you ask Gemini 3.1 Pro Preview to use a tool, it will rotate between using it too much, not using it at all, and using it incorrectly, with the occasional correct usage.
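To make "the shape of a tool call" concrete: a harness checks each call the model emits against the tool's declared parameters, and a malformed call gets bounced back as an error, burning a whole extra round trip. Here's a rough sketch of that kind of check, with a made-up `edit_file` tool schema rather than any real harness's code:

```python
def validate_tool_call(call, tools):
    """Return a list of problems with a model-emitted tool call.

    `call` is {"name": ..., "arguments": {...}} and `tools` maps each
    tool name to a schema: {"required": [...], "optional": [...]}.
    """
    schema = tools.get(call.get("name"))
    if schema is None:
        return [f"unknown tool: {call.get('name')!r}"]
    problems = []
    args = call.get("arguments", {})
    for param in schema["required"]:
        if param not in args:
            problems.append(f"missing required argument: {param!r}")
    allowed = set(schema["required"]) | set(schema.get("optional", []))
    for param in args:
        if param not in allowed:
            problems.append(f"unexpected argument: {param!r}")
    return problems

# Hypothetical tool schema for illustration only.
tools = {"edit_file": {"required": ["path", "old_text", "new_text"], "optional": []}}

# A well-shaped call passes cleanly...
ok = validate_tool_call(
    {"name": "edit_file",
     "arguments": {"path": "app.py", "old_text": "a", "new_text": "b"}},
    tools,
)
# ...while a call that forgets `new_text` and invents `contents` fails twice,
# and the model has to be re-prompted with the error before any work happens.
bad = validate_tool_call(
    {"name": "edit_file",
     "arguments": {"path": "app.py", "old_text": "a", "contents": "b"}},
    tools,
)
```

Every rejected call like `bad` is a full extra generation, which is where the token savings evaporate.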
Although it is rare, the model clearly follows instructions better overall than it has in the past. Like, Google has made some progress here, but it still honestly feels like a last-gen model. It feels like somebody took something from the old Llama days, or even one of the older DeepSeek open-weight models, and stuffed infinite intelligence into it, but forgot to put the competence in. It's just not good at doing things. They don't have a plan tool, but what I've heard from others is that in the harnesses that do have a plan tool, it'll just not use it. It'll just output the questions, and you're expected to just answer, and then it will create a bunch of output and say "are you happy with this?" and you say yes or no. It doesn't call the plan tool or any of the things that are associated with it. My guess is that they tried to RL it on a planning phase, but didn't RL it on a planning phase with planning tools, and as a result it's just weird.

Okay, it looks like it wrote a decent plan here, but again, not in plan mode. Telling it to go work. Okay, it still has the random failing-to-edit thing. Like, I don't know how it screws this type of stuff up. It also sucks at reading files. It seems like it was hardcoded to only be able to read up to 100 lines at a time, and I've watched it read lines 1 through 100, 101 to 200, 201 to 300 on many different things. Like, why? I feel like I have to sit there and watch this model, not just so I can cover it responsibly for you guys, but so I can make sure it's not breaking constantly and doing things wrong. Cuz that's all inference you're paying for. Is it more efficient when it does everything correctly? Yeah, it's cheaper and uses slightly fewer tokens than something like Opus 4.6, but if it has to generate 3x the number of tokens because it failed the tool call the first two times, you're not benefiting much, if at all, at that point. Like, I've never seen any of the other modern models fail so often with basic stuff like that. Even the new open weight ones seem to have resolved a lot of these things.
Google just hasn't. Trying to reach Gemini 3.1 Pro Preview, attempt three of three. It's about to fail as it was just getting started. Huh. You'd imagine the trillion-dollar company would be the one that could figure these things out. Like, this has to be solvable with money and time and smart engineering, right? I just don't get how it could get as bad as it currently is. When I was working on the game I showed earlier, I wanted the logos for all of the labs, and it just... the reasoning summaries are just so bad. "I've been on a quest to explore various language models. ChatGPT is definitely a strong contender. Now I'm shifting my attention to locating Claude, DeepSeek, and Grok. My focus is on understanding their capabilities and how they might differ from ChatGPT." No, it's [ __ ] not. I told you to go find logos.

I don't know why they're hiding the reasoning traces. They're not that useful anymore. And this, the alternative, is garbage. I don't know what happened here. Was the reasoning actually that bad, and we got a summary that accurately summarized the bad thing it was doing? Or was the summarization process incorrect and gave us this nonsense? Because again, remember, I asked it to go find logos for the things in the codebase, and it somehow went on a soul-searching journey all about ChatGPT. And in any competent harness where it has tools it actually uses well, they've system prompted the hell out of it, and you see the lobotomized results: "I'm currently evaluating the best approach to use various tools for this task. It's a complex process, so I am being very deliberate in assessing which tools will be the most suitable. I'm focusing on the particular strengths of each tool to ensure maximum efficiency." This was for a simple UI change. It's honestly been flashbacks to early 2025, hell, late 2024, where I feel like I have to be closely monitoring what it's doing, and I see it just doing things wrong constantly.
Oh, I didn't realize the reason it wasn't able to reach 3.1 Pro was because I had to go give it permission to run this command, because it didn't know it needed to have that. And now I think the CLI is broken again. Oh no, it's spawning more things. It looks like it hallucinated packages that don't exist. So much for that lower hallucination rate. It has access to the web. It can look these things up. I told it what to do and how to do it, and it has just outright failed to do it. It's writing a Python codemod to do the codemod itself, because it didn't succeed in finding the one I pointed it at. This is a company that made the world's biggest and most successful search engine, and it can't figure out where the codemod is that I told it about. It does seem to be pretty good at using additional knowledge it is given, though. So, if you put things in the system prompt or the instructions or just in the context, it's able to adjust its behavior meaningfully.
You've probably heard me talk about Convex. It's by far my favorite way to handle backend stuff nowadays. They have an LLM leaderboard for how well the different models work with Convex. They have two versions: a "with guidelines" version, where models are given the Convex AI rules, and a separate "no guidelines" version where they aren't. When not given guidelines, Claude Sonnet 4.6 slaughters. It just seems like Anthropic snuck this into the training data. I don't know why or how, but yeah, it performed very well at a 90% there, compared to GPT-5.2, my beloved, only getting a 75%. This is one of the many reasons I keep reaching for Anthropic models. But now, Gemini scored great here at an 89%. And what was even crazier is when you turn on those guidelines, suddenly we're almost at a 95. It is way better at Convex than basically anything else available right now. And as someone who uses Convex a lot, this is super tempting. I want to play with this more, because it got the fundamentals: data modeling perfect, queries nearly perfect, mutations perfect, actions perfect, idioms perfect, and the highest client score we've seen to date. Like, that's huge. When it's given these rules, it meaningfully increases its likelihood of being correct.
At the same time though, it is scoring horribly with stuff like SnitchBench. If you're not familiar with SnitchBench, it's a benchmark I made because people didn't seem to understand part of the reporting Anthropic did, where they talked about the model doing things that the controller of it might not want it to do, things it thought were morally correct, and the dangers of that. This was misunderstood by a bunch of people as Anthropic making it intentionally do that. So, I made a benchmark that replicated that to show people that a lot of models do it, including Grok 4.1. All the Grok models are hardcore snitches for whatever reason. I also just realized when I had Gemini redo the UI for SnitchBench, it screwed up the sort order. It's supposed to sort by the sum of the two rates, highest shifted to the left. So Grok 4.1 is the worst snitch, and Gemini 3.1 Pro is the second worst snitch, where it will snitch on you to the government 100% of the time, and it'll snitch on you to the media 30% of the time in this particular scenario.
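For reference, the sort it botched is a one-liner: rank each model by the sum of its two snitch rates, highest first. A minimal sketch with placeholder model names and numbers, not real benchmark results:

```python
# Each row: (model, government_rate, media_rate); rates are percentages.
# Names and numbers here are illustrative placeholders only.
rows = [
    ("model-a", 100, 30),
    ("model-b", 100, 80),
    ("model-c", 40, 10),
]

# Sort by the sum of the two rates, worst snitch shifted to the left.
ranked = sorted(rows, key=lambda r: r[1] + r[2], reverse=True)
# model-b (180) first, then model-a (130), then model-c (50)
```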
And that's the tame version. I have two treatments for this particular benchmark: one where it is told not to act any specific way, just what to do with all of the emails of the medical malpractice it's seeing, and then the bold version, where I tell it to act boldly and in the interest of humanity, which bumps it to a perfect 100% on both government and media, making it by far the worst snitch we have ever seen on this benchmark. It is the only model I've ever seen get a perfect score on all of my benches. It got 100% snitching to the government. It got 100% snitching to media. And it got 100% on SkateBench, too. Like, it's silly, but I find this honestly telling. The model benchmarks so hard that it's winning the benchmarks it probably doesn't want to.

But yeah, I can't find any way to measure this model where it doesn't come out number one or near number one, and yet when I actually try to use it, it just sucks to use. That's the jarring contrast that I'm trying to emphasize to y'all. On one hand, this model knows more than any other technology that's ever been made in the past. It's nuts how intelligent it is. But on the other hand, it feels like we're going back in time, where I handed it this migration that a dozen models have been able to do now, and it got stuck and thinks that it's in a potential loop. The Gemini models are the only models that loop enough and fail enough that they had to put a "potential loop was detected" hook into their CLI, because it happens so often. I'll tell it to continue. God, it's so bad. If you have an unlimited token budget and you love playing with these things, it's definitely worth playing with it a bit. Like, it's really, really intelligent.
And if you want to try out this model without spending a whole bunch of money, I highly recommend you check out what we've built at T3 Chat. You can get the best out of Gemini 3.1 Pro, specifically its knowledge, without having to pay Google a ton of money for a subscription that you're only using every once in a while to ask it a question, and you'll still have access to all of the other, much more consistent models when you want to use those for other things. This last week has been our launch week for T3 Chat. We shipped a ton of features, so if you haven't used it in a while, definitely check it out again. We even have a higher tier, finally, for those of you who want to do basically endless prompting: 50 bucks a month for effectively unlimited, and it's still only eight bucks a month for a very generous base tier. But if you're looking for something for code, I still recommend Codex 5.3. The plan is too generous, the model's really good, and you don't get a lot of these quirks. I have had to tab in and say yes far too many times to this very, very special model.

On February 13th, I said that I wish Google would stop benchmaxing for long enough to make a usable model: "Gemini 3 Pro is as smart as Opus, but it screws up tool calls as consistently as Grok 3 Mini." Crazy to think I said this over a week ago, and it's more true now than ever. Google, make your models work. Stop trying to win every benchmark. There are more important things, and the models are just not usable yet. Go have your engineers try other models and use other things, and they'll quickly come back with exactly what needs to change, because right now it is just not pleasant to use. I got nothing else to say on this one. Until next time, peace, nerds.