Does Gemini 3.1 Pro Matter?
By The AI Daily Brief: Artificial Intelligence News
Summary
Topics Covered
- Benchmarks Matter Less Now
- Gemini Excels in Cost Efficiency
- Distribution Trumps Raw Intelligence
- Multimodal Powers Unique Products
Full Transcript
Today on the AI Daily Brief, Gemini 3.1 Pro is here. It looks super powerful, but who does it actually matter for?
Welcome back to the AI Daily Brief. Today we are talking about Gemini 3.1 Pro, but I want to situate it in a larger question. And I will start by saying sorry to Google for drawing the short end of the episode naming straw on this one. If it had been OpenAI that released 5.3, it would have been something very similar. The context we now all operate in is one where instead of getting big model releases infrequently, we get very incremental model releases much more frequently.
There is in fact this meme, which came from 2025 but which is more true than ever: a circular chart that starts with OpenAI introducing the world's most powerful model, moves to Grok introducing the world's most powerful model, then to Gemini introducing the world's most powerful model, then to Anthropic introducing the world's most powerful model, and then back to OpenAI, and so on. With the release of 3.1 Pro, we are now at the Gemini section of that chart. The point, of course, is that at this stage, state-of-the-art in terms of incremental gains on benchmarks feels less significant as a barometer of a model's importance than it ever has before. When people say what is the best model, it is not only constantly shifting, but also, I think in practice, a question that is use case dependent.
So, let's talk about Gemini 3.1 Pro, the first reactions, both good and bad, and then try to figure out where it fits in the ecosystem of models. Now, it is worth pointing out that I think Gemini was absolutely due for a bit of an upgrade. The conversation for pretty much all of 2026, and really heading back into the end of 2025, has been dominated by Anthropic versus OpenAI, or more specifically Codex versus Claude Code. Despite Gemini 3 having such wide acclaim when it came out towards the end of last year, Google and Gemini have been really nowhere in the conversation when it comes to this incredibly important use case of coding. Now, it is worth noting that there are lots of different categories of AI users, and it is not the case that for all of them coding is what matters. It would be completely reasonable, in other words, for Google to put its priority in other areas. However, it certainly doesn't seem like Google is not trying to compete in that area. They're clearly investing a lot in Google AI Studio and Antigravity. But when it comes, at least, to the most enfranchised subset of users, they were, at least in our recent survey results, a distinct third. All of the big models, Claude, ChatGPT, and Gemini, had some broad usage in our January AI usage survey. Gemini in fact matched Claude, with 80% of respondents having used it sometime last month, both falling slightly behind ChatGPT, which was at 87%. However, in terms of the number of people reporting that it was their primary model, Gemini was down in third at 16.1%.
And at first blush, there is a lot to be impressed with in Gemini 3.1 Pro. Going by the benchmarks, it is a distinct number one when it comes to Humanity's Last Exam without tools. It sets a new high for the GPQA Diamond scientific knowledge benchmark. It sees a big jump for Gemini on Terminal-Bench 2.0, coming in ahead of Opus 4.6, and while it wasn't ahead of Opus 4.6 on the SWE-bench Verified agentic coding test, it was nipping at its heels: 80.6% compared to 80.8%. The biggest jump, and the one that a lot of folks are talking about, was on ARC-AGI-2. While Opus 4.6 scored a 68.8% on that test, the jump between Gemini 3 Pro and Gemini 3.1 Pro was from a 31.1% with Gemini 3 to a 77.1% with Gemini 3.1 Pro.
Google CEO Sundar Pichai says Gemini 3.1 Pro is great for super complex tasks like visualizing difficult concepts, synthesizing data into a single view, or bringing creative projects to life. Demis Hassabis points to major improvements in core reasoning and problem solving. Google VP Josh Woodward calls out who they want the model to appeal to, writing: "To the scientist, the engineer, and the developer: Gemini 3.1 Pro has arrived. It's a significant leap in complex reasoning." Once again, he points to ARC-AGI-2. So, it's great for agentic tasks, intricate coding, and data synthesis projects. You should see fewer errors, better logic, and surprisingly good SVGs. Attached to the post is an animated image of a seal bouncing a beach ball on its nose.
So, what are the first impressions? The model is still rolling out, and it's only available in certain pockets of the Google ecosystem, which, by the way, is its own challenge. People like Ethan Mollick have pointed out that the Google ecosystem of AI is so diverse that it's sometimes hard to wrap your head around what model lives where. But among those who have tried it, a lot of the responses are pretty positive. AI developer Eric Hartford wrote, "Loving Gemini 3.1 Pro.
It made three huge improvements to my compiler and saw things that even ChatGPT 5.2 Pro Extended and Claude Opus 4.6 Extended couldn't see." Designer and entrepreneur Meng To writes, "Gemini 3.1 Pro is an absolute beast for creating landing pages. It understands design details and animations so well. Insane upgrade for web designers."
And then of course there's ARC-AGI-2, where it came in at a 77.1%, but that might not even be the most impressive thing. The ARC leaderboard measures not only the score but the cost per task. So, for example, although Gemini 3 Deep Think, which was released last week, got a higher overall score, it did so at more than 10 times the cost. 3.1 Pro achieved its score at less than a buck a task.
On Artificial Analysis's overall Intelligence Index, Google jumped all the way from the sixth spot, behind various versions of Claude, GPT, and even a Chinese model, GLM-5, all the way up to number one. What's more, Artificial Analysis points out that it's doing so at a more efficient cost. They write, "Google is once again the leader in AI. Gemini 3.1 Pro Preview leads the Artificial Analysis Intelligence Index four points ahead of Claude Opus 4.6 while costing less than half as much to run." They said that on their tests it led six of the 10 evaluations that make up the index, with the biggest gains in reasoning and knowledge, coding, and hallucination reduction. They also point out that it does so with some serious token efficiency. They write that its processing efficiency combined with lower per-token pricing means that 3.1 Pro Preview costs less than half as much as Opus 4.6 Max to run, although it is still nearly twice as much as the leading open weights model, which is that GLM-5 that I mentioned.
In terms of specific tests, they found that Gemini 3.1 Pro led their coding index, achieving the highest score on both Terminal-Bench Hard and SciCode, but that one area where it was kind of lacking was real world agentic performance. This is around that GDPval test, which we've talked about before, an agentic evaluation that focuses on real world tasks. While Gemini 3.1 Pro did jump up meaningfully from Gemini 3 Pro, it was behind Sonnet 4.6, Opus 4.6, GPT 5.2, and GLM-5. That was something that a number of skeptical commentators focused on. Scaling01 on Twitter writes, "Gemini 3.1 Pro's GDPval scores are concerning." Simon Smith points out that maybe that suggests that work tasks aren't Google's focus. Indeed, he even goes so far as to speculate: they have a stake in Anthropic, so maybe they're okay with that.
When it comes to coding, outside of that one example that I mentioned already, I'm just not seeing enough feedback yet to really know. Some people had trouble actually finding the model or getting it to work inside Antigravity or Gemini CLI. Although, when they did, as reported by Matt Velloso, they had, quote, "awesome results so far." Akash Gupta gets at what I think is likely to become a more discussed aspect of this, which is the cost performance frontier. He writes,
"Best AI model crown now rotates on a weekly basis, with each lab holding a different column of the same spreadsheet. The real number in this release is the 96 cents per task on ARC-AGI-2. Google went from 31.1% to 77.1% in 3 months while keeping pricing at $2 per million input tokens, the same pricing as Gemini 3 Pro. They doubled the intelligence and charged zero incremental cost. That's the game now. The frontier is commoditizing so fast that benchmark leadership lasts weeks, not quarters. OpenAI, Anthropic, and Google are all within single-digit percentage points of each other on most evals. The three labs are converging on comparable intelligence, but diverging on distribution. Google has 2 billion Chrome users, Android, Workspace, and Cloud. That's the real moat in this chart, not the 77.1%.
Whoever makes intelligence ambient and cheap wins. And this benchmark table, with its patchwork of leaders across every column, is the clearest sign yet that raw capability is table stakes."
I think there is a lot of truth in that.
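That cost-performance framing is easy to make concrete. Here's a minimal sketch: only the 77.1% at roughly $0.96 per task and the "more than 10 times the cost" for Deep Think come from the episode; the other scores and costs are made-up placeholders for illustration. A model sits on the cost-performance frontier when no other model matches or beats its score at a lower cost:

```python
# Illustrative (score, cost-per-task) entries. Only gemini-3.1-pro's numbers
# come from the episode; the others are hypothetical placeholders.
models = {
    "gemini-3.1-pro":      {"score": 77.1, "cost_per_task": 0.96},
    "gemini-3-deep-think": {"score": 84.0, "cost_per_task": 10.50},  # hypothetical score
    "opus-4.6":            {"score": 68.8, "cost_per_task": 2.50},   # hypothetical cost
}

def pareto_frontier(models):
    """Return names of models not dominated on (higher score, lower cost)."""
    frontier = []
    for name, m in models.items():
        # A model is dominated if some other model is at least as good on one
        # axis and strictly better on the other.
        dominated = any(
            (other["score"] >= m["score"] and other["cost_per_task"] < m["cost_per_task"])
            or (other["score"] > m["score"] and other["cost_per_task"] <= m["cost_per_task"])
            for other_name, other in models.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

print(pareto_frontier(models))
```

With these placeholder numbers, the hypothetical opus-4.6 entry drops off the frontier because gemini-3.1-pro scores higher at lower cost, while both Gemini entries survive: one is cheapest at its score level, the other has the top score. That is the shape of the argument above, that the interesting competition is over the whole frontier, not a single "best model" cell.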
And so one of the reasons why, yes, Gemini 3.1 Pro does matter is that it's pushing on the cost frontier, not just the performance frontier.
Now, the other thing about Gemini is that it's very clear that the productization of its multimodal capabilities is something that really matters to Google. Alongside the new model update, Google Labs announced a new feature for their Pomelli app called Photoshoot. They write, "With Photoshoot, you can start from a single image of your product and easily create high-quality, customized product shots to elevate your marketing." That tweet went wildly viral. In fact, whereas CEO Sundar Pichai's tweet announcing 3.1 had around 1 million views, the Google Labs tweet announcing Photoshoot has 12.2 million views at the time of recording. Google Labs product director Jaclyn Konzelmann wrote, "Clearly this hit a nerve. Turns out a lot of people have been waiting for a way to get professional product photos, but didn't have the time or resources to make it happen. Now they can. Go try it. It's free." When folks like a16z partner Justine Moore tried it, they also came away impressed.
Another example of Gemini flexing its multimodal bona fides came with a partner announcement from Replit when they introduced Replit Animation. It is exactly what it sounds like: a tool to vibe code infographic videos, powered, they say, by Gemini 3.1 Pro. Replit CEO Amjad Masad wrote, "Vibe coding as a term is a bit tragic because it implies you're merely making software, but you can really make anything. We've been having a lot of fun making videos with Replit Animation, the kind I used to pay thousands of dollars for when we needed to do a launch video."
Also, if you dig around enough, you can see that the types of things people are using Gemini 3.1 Pro for are just a little bit different than the other tools. Sure, there's a bunch of weird pelican SVG tests, but you also have examples like this one from Daniel Z, who writes, "Gemini 3.1 Pro vibe coded a double wishbone suspension: independent double wishbone design, dynamic coilover shock absorbers, vented disc brakes with performance calipers, real-time kinematic travel and steering simulation. AI isn't just generating visuals anymore."
Demis Hassabis shared an official example from the Google DeepMind account where they used 3.1 Pro to build a realistic city planner app that has complex terrains, infrastructure mapping, and even simulates traffic. Google DeepMind chief scientist Jeff Dean shared an example of 3.1 Pro doing heat transfer analysis based on a CAD file and material properties, and then turning that heat transfer analysis at different times into a visual representation.
Overall, I agree on the surface with Latent Space when they wrote, "It's getting a little hard to say interesting things with all the round-robin minor version updates of frontier models every week. Gemini 3.1 Pro seems like a decent enough advance to catch up and in some cases supersede the fellow frontier models. It's better at some SVG design things and translating textual vibes to visual aesthetics." But that's kind of all they had to say.
I think, though, coming back to this question of why 3.1 Pro matters, or why any new model release matters: the point that I was trying to make at the beginning is that it's not just about state-of-the-art on the benchmarks. That is, as Akash pointed out, table stakes. What's important is to try to understand what a model does uniquely well. It's very clear when you actually dig deep that Gemini is flexing its multimodal capabilities in a full spectrum of ways, from being able to do much more technically and scientifically advanced work to being at the core of products that aren't possible with the other models. Now, that doesn't necessarily mean that Google can get away with not competing on core use cases like coding. But part of the reason, I think, that we found that even though it was the primary model for just 16.1% of respondents, still a full 80% of people had used Gemini in the previous month is because there are just some use cases that it is ideally suited for. It is very clear that as we head deeper into the agent age, the greatest gains will not come from shifting wholesale from one model to the next as new capabilities emerge, but instead from understanding, with each model release, what that particular model is going to do best and where it should be in your model portfolio.
I'm excited to dig into 3.1 Pro, and I'm sure I will have more to report in the week to come. For now, though, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching.