Why Claude Opus 4.5 Changes What's Possible with Vibe Coding
By The AI Daily Brief: Artificial Intelligence News
Summary
## Key takeaways

- **Opus 4.5 Tops SWE-Bench Verified**: Opus 4.5 achieves 80.9% on SWE-Bench Verified, blowing past GPT-5.1 Codex Max at 77.9% and setting a new record; as one commenter put it, "a 3% lead has never looked so large." [01:50], [01:52]
- **Leads Terminal-Bench 2.0**: On the Terminal-Bench 2.0 agentic terminal coding benchmark, Opus 4.5 is meaningfully ahead of all others and sets a new standard on agentic tool use, scaled tool use, and computer use. [01:58], [02:09]
- **Outperforms Humans on Exam**: Claude Opus 4.5 scored higher than any human candidate ever on Anthropic's notoriously difficult take-home performance engineering exam, within the 2-hour time limit. [05:41], [05:47]
- **Massive Productivity Gains**: 50% of 18 surveyed Anthropic staff reported at least a 100% productivity improvement using Opus 4.5 in Claude Code, with a mean self-estimated improvement of 220%. [06:09], [06:25]
- **New Agentic Tool Features**: Anthropic is releasing a tool search tool for accessing thousands of tools without consuming context, programmatic tool calling in code execution environments, and standardized tool-use examples for scalable agents that integrate IDEs and deployment pipelines. [07:10], [07:34]
- **Token-Efficient Coder**: Opus 4.5 does better at SWE-Bench Verified without thinking than with 64k reasoning tokens, and uses 76% fewer output tokens than Sonnet 4.5 for the same tasks while being cheaper per task success. [09:26], [14:06]
Full Transcript
Welcome back to the AI Daily Brief. The Thanksgiving 2025 parade of models has continued into a new week, this time with the launch of Claude Opus 4.5 from Anthropic. Now, people have been assuming for some time that we were going to get an Opus 4.5. We've obviously had Sonnet 4.5 for a while now, and so people figured that this was in the offing, but there had been a lot less conversation leading up to this around when it was going to come. The big model, of course, that people have been anticipating is Gemini 3. And in many ways, this was a wildly understated announcement. And yet the response has been, in a word, significant. While they may not have hyped it, Anthropic minces no words in their launch post: "Our newest model, Claude Opus 4.5, is available today. It's intelligent, efficient, and the best model in the world for coding, agents, and computer use. It's also meaningfully better at everyday tasks like deep research and working with slides and spreadsheets. Opus 4.5 is a step forward in what AI systems can do and a preview of larger changes to how work gets done." So let's talk first about the benchmarks, and it is no accident that the one they chose to put right at the top is SWE-Bench Verified.
Now, you might remember that in our discussions about Gemini 3, the only major benchmark that they didn't win, or at least match, was this one. While Sonnet 4.5 was at 77.2%, Gemini 3 Pro was at 76.2%. Not like it was super far behind, but still not technically state-of-the-art. GPT-5.1 was also a tiny bit ahead of Gemini 3 Pro at 76.3%, and extended that lead to 77.9% when OpenAI released GPT-5.1 Codex Max in the days following Gemini 3. For a very short time, 5.1 Codex Max was the top of the SWE-Bench Verified chart, but Opus 4.5, at least by the benchmarks, blows it out of the water at 80.9%. "A 3% lead has never looked so large," writes Morgan. And it wasn't just SWE-Bench Verified.
On the Terminal-Bench 2.0 agentic terminal coding benchmark, Opus 4.5 was meaningfully ahead of all the others as well. On agentic tool use, scaled tool use, and computer use, Opus 4.5 sets a new standard. Now, there were some tests where Opus 4.5 meaningfully lagged behind Gemini 3, such as Humanity's Last Exam, where it was significantly behind both without search and with search. And yet what everyone was talking about, of course, was the coding results. If you are a regular listener of the show, you will know that the ascendancy of Anthropic this year, and the speed with which they are catching up to OpenAI, has much to do with them being the preferred AI coding model for developers. That started with 3.5 and has basically continued unchallenged, although after the release of GPT-5 there have at least been credible competitors. Anthropic seems very clearly to agree with swyx on the relative importance of coding compared to all other use cases. A couple of times I've referenced Shawn's post about what made him decide to go work with Cognition, where he basically booked coding as the high-value, short-timeline activity. The line, which I've shared a couple of times: "Code AGI will be achieved in 20% of the time of full AGI and capture 80% of the value of AGI." Whether or not that's true, Anthropic has certainly behaved as such.
Now, outside just the standard SWE-Bench, there were a couple of other things that people noticed. Igor Cotenkov points out that while there are ways to overfit towards the SWE-Bench Verified benchmark, the more recent SWE-Bench Pro is a lot more difficult and connected to the real world, and Opus blows previous models out of the water: Opus gets 52% where Sonnet 4.5 got 43.6% and GPT-5 got just 36%. On ARC-AGI, Opus 4.5 set a new standard ahead of 5.1 and Gemini 3.
And on ARC-AGI-2, it got 37.64% at $2.40 a task. Already, just hours after the release, people who had early access were independently verifying some of these results. Bindu Reddy writes, "Opus 4.5 tops LiveBench AI and is the world's best agentic model. We can confirm this after testing it over the past few days."
Now, interestingly, one of the things that we've seen a lot from labs recently is the people inside the labs really talking up the specifics of what they like about the models. We got a spate of that from Anthropic team members, such as Jake Eaton, who writes: "Opus 4.5 is very good at a lot of things and you should read the benchmarks, the model card, etc. But my favorite thing about working with it these past two weeks is that in conversation it is somehow more fine-grained. It has a depth and texture that for me was immediately noticeable. It also feels, interestingly, much more self-contained." Sasha Dearing says, "The internal response to Opus 4.5 has been a mix of excitement, awe, and surprise, particularly around how good it is at coding." Tariq writes, "Opus 4.5 is special. A world record in SWE-Bench and OSWorld benchmarks. The best model we've ever had at vision. On Claude Code, I've completely stopped writing code in the IDE. I think there's so much to discover about Opus 4.5."
And indeed, some of the most interesting responses from Anthropic's members come from their engineering team. Sholto Douglas writes: "I am so excited about this model. First off, the most important eval has been people posting stories of crazy bugs that Opus found or incredible PRs that it nearly soloed. A couple of our best engineers are hitting the interventions-only phase of coding." Adam Wolf writes: "This new model is something else. Since Sonnet 4.5, I've been tracking how long I can get the agent to work autonomously. With Opus 4.5, this is starting to routinely stretch to 20 or 30 minutes. When I come back, the task is often done simply and idiomatically."
They also talked about how Claude Opus compared on a notoriously difficult candidate exam. In their announcement post, they wrote: "We give prospective performance engineering candidates a notoriously difficult take-home exam. We also test new models on this exam as an internal benchmark. Within our prescribed 2-hour time limit, Claude Opus 4.5 scored higher than any human candidate ever." They continue: "The take-home test is designed to assess technical ability and judgment under time pressure. It doesn't test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over years. But this result, where an AI model outperforms strong candidates on important technical skills, raises questions about how AI will change engineering as a profession." Now, they also surveyed staff members to estimate the impact of using Opus 4.5 in Claude Code: 50%, nine of the 18 they surveyed, reported a productivity improvement of at least 100%. The mean self-estimated productivity improvement was 220%.
They also popped open the hood a little bit on how they're making Claude even better when it comes to agents. In short, they have a huge emphasis on tools. Indeed, they write: "The future of AI agents is one where models work seamlessly across hundreds or thousands of tools: an IDE assistant that integrates Git operations, file manipulation, package managers, testing frameworks, and deployment pipelines; an operations coordinator that connects Slack, GitHub, Google Drive, Jira, company databases, and dozens of MCP servers simultaneously. To build effective agents, they need to work with unlimited tool libraries without stuffing every definition into context up front." Agents also need to be able to call tools from code, and they need to learn correct tool usage from examples. Following that, they share that they're releasing three features to make all of that possible: a tool search tool, which allows Claude to use search tools to access thousands of tools without consuming its context window; programmatic tool calling, which allows Claude to invoke tools in a code execution environment, reducing the impact on the model's context window; and tool use examples, which provide a universal standard for demonstrating how to effectively use a given tool.
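To make that concrete, here's a minimal sketch of what wiring these ideas into a request might look like with the Anthropic Python SDK. The episode describes the features but not the API shape, so the `tool_search` entry and the `examples` field below are illustrative assumptions, not confirmed parameters; only the basic `messages.create` call and the standard tool definition fields follow the documented SDK.

```python
# Illustrative sketch only: "tool_search" and "examples" are assumed names,
# since the episode describes these features but not their exact API shape.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A standard tool definition (name/description/input_schema are the documented
# shape), extended with a hypothetical tool-use example showing a correct call.
create_ticket = {
    "name": "create_jira_ticket",  # hypothetical tool, for illustration
    "description": "Create a Jira ticket in the given project.",
    "input_schema": {
        "type": "object",
        "properties": {
            "project": {"type": "string"},
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["project", "title"],
    },
    # Hypothetical "tool use examples" field: demonstrates correct usage.
    "examples": [
        {"project": "OPS", "title": "Rotate expiring TLS certs", "priority": "high"}
    ],
}

response = client.messages.create(
    model="claude-opus-4-5",  # assumed model id
    max_tokens=1024,
    # Hypothetical tool-search entry: lets the model discover tools from a
    # large library on demand instead of loading every definition into context.
    tools=[{"type": "tool_search", "name": "tool_search"}, create_ticket],
    messages=[{"role": "user", "content": "File a ticket for the cert rotation."}],
)
print(response.content)
```

Even as a sketch, the design point is visible: the agent's context carries one small search tool plus whatever definitions it actually retrieves, rather than thousands of definitions up front.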
So again, all of this is telling a very consistent story, which is that Claude is for coding and for pushing the frontier of what agents can do. So outside of interacting with the benchmarks, what were people's first impressions? Some were excited and appreciated that there was less hype around this.
writes, "Have to respect Anthropic's commitment to not vague posting all weekend. This is the most exciting model
weekend. This is the most exciting model release in Sonnet 3.5. Leo at Synthwaved writes, "Beanthropic. Pretend Gemini 3
writes, "Beanthropic. Pretend Gemini 3 does not exist. Know you're ready to cook it for code anyways. Wait, zero
hype posting. Drop new Opus.
State-of-the-art for code.
State-of-the-art in RKGI. Better than
expected. Cost less than old Opus. Be
more like Anthropic." On the flip side, Ethan Mollik basically asked why they were burying the lead. I'm not sure why Anthropic keeps doing very low-key launches for fairly major releases and materially important improvements to
their services. I kind of think it has
their services. I kind of think it has to do with the assessment and the specificity of their audience in and among developers. Basically, it's a
among developers. Basically, it's a group of people that they think is going to respond more to having their peers and colleagues tell them about an update rather than getting maximum social distribution because of being loud and
hypy. But what about people's early
hypy. But what about people's early tests? Victor Talon writes, "To my
tests? Victor Talon writes, "To my surprise, Opus 4.1 oneshotted my hardest calculus problem tying with Gemini 3. In
terms of first hour impressions, couldn't be more promising, I guess."
Ethan Mollik writes, "I had early access to Opus 4.5, and it's a very impressive model that seems to be right at the frontier. Big gains in ability to do
frontier. Big gains in ability to do practical work like make a PowerPoint from an Excel." Nico again writes, "OPUS 4.5 is a step function improvement for spreadsheet work. Extremely hard became
spreadsheet work. Extremely hard became doable. Doable tasks became easy and
doable. Doable tasks became easy and easy tasks are now solved. And yet, if there were a few examples of people trying non-coding things, coding is very much where the main excitement lies. GL,
the CEO of Versel, writes, "Oopus is on a different level. It's unreasonably
good at Nex.js and the best model we've tried on Vzero to date." Menventure's DD Doss writes, "Anthropic just dropped the best coding model, Opus 4.5." The
coolest thing he points out is it does better at Sweetbench verified without thinking than with 64k reasoning tokens.
In other words, a super token efficient model. Matt Schumer, who didn't have
Matt Shumer, who didn't have early access, said, "First test of Claude Opus 4.5, and I'm already impressed. I asked it for a Colab competitor UI and it quickly pulled together this screen. Definitely better than my similar test with GPT-5.1 and, shockingly, Gemini 3. More testing to go, but this is a good start." He followed it up: "Okay. Wow. I'm kind of blown away. In one shot, Opus 4.5 made the UI actually functional, with Python running in the browser."
Some, like Super Dario, pointed out that this may not even be the best model that Anthropic has behind the scenes. They write, "Good time to remind everyone Anthropic has a long-standing policy of not significantly pushing the frontier to prevent an arms race. Dario can hit SWE-Bench scores at will." Now, whether or not that's true, the fact that there is a lot of chatter like that is, I think, a good reflection of the sentiment in the community.
Maybe the most vocally excited about this are Dan Shipper and the team at Every. He writes, "Breaking news: Anthropic just dropped Claude Opus 4.5. It is by far the best coding model I've ever used." And here's how Dan describes it: it extends the horizon of what you can vibe code. Explaining, he writes: "The current generation of new models, Anthropic's Sonnet 4.5, Google's Gemini 3, or OpenAI's GPT-5.1 Codex Max, can all competently build a minimum viable product in one shot or fix a highly technical bug autonomously. But eventually, if you keep pushing them to vibe code more, they'd start to trip over their own feet. The code would be convoluted and contradictory, and you'd get stuck in endless bugs. We have not found that limit yet with Opus 4.5. It seems to be able to vibe code forever."
Two more observations. Opus 4.5, he says, takes working in parallel to a whole new level because it's far better at planning and coding. It can work with more autonomy, meaning you can do more in parallel without breaking anything; one of his teammates worked on 11 different projects in six hours and had good results on all of them. Lastly, he points out that it's great at design iteration. Opus 4.5, Dan writes, is incredibly skilled at iterating through a design autonomously using an MCP like Playwright. Previous models would lose the thread after a few cycles or say a design was done when it wasn't. Opus 4.5 is incredible at autonomously iterating until a design is pixel-perfect.
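The episode doesn't show what that loop looks like, but the basic shape, render the page, screenshot it, ask the model for feedback, apply edits, repeat, is easy to sketch. Here's a hypothetical version where the Playwright calls are the library's real Python API, while `ask_model_for_feedback` is a stand-in for the model/MCP round-trip, which isn't specified here:

```python
# Hypothetical design-iteration loop. The Playwright calls are real API;
# ask_model_for_feedback() stands in for the unspecified model/MCP round-trip.
from playwright.sync_api import sync_playwright

def ask_model_for_feedback(screenshot_path: str) -> str:
    """Placeholder: send the screenshot to the model, return 'done' or notes."""
    raise NotImplementedError

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    for i in range(10):  # bound the loop; older models lost the thread here
        page.goto("http://localhost:3000")  # assumed local dev server
        shot = f"design_iter_{i}.png"
        page.screenshot(path=shot, full_page=True)
        feedback = ask_model_for_feedback(shot)
        if feedback == "done":  # model judges the design pixel-perfect
            break
        # In a real agent, the model would now edit the markup/CSS based on
        # the feedback before the next render-and-screenshot pass.
    browser.close()
```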
Indeed, Dan's team at Every were equally vocal in their love of this model. Kieran Klaassen writes: "2023 was GPT-4. 2024 was Sonnet 3.5. 2025 is Opus 4.5. This is the coding model launch I've been waiting for. First time I genuinely believe I can vibe code an entire app end to end without touching the implementation details. We haven't found the limit yet. Previous models would eventually trip over their own feet: convoluted code, contradictory logic, endless bugs. Opus 4.5 just keeps going. If you write code with AI, you need to try this."
And I think that this idea is the thing to watch, to see whether Kieran and Dan's first impressions here, and some of the impressions of the Anthropic team, really play out: that this is, as Kieran puts it, the first time we can vibe code an entire app end to end without touching the implementation details. It strikes me that if that is the case, it could be the most massive implication of this model. Adam Wolf from Anthropic again wrote: "I believe this new model in Claude Code is a glimpse of the future we're hurtling towards. Maybe as soon as the first half of next year, software engineering is done. Soon we won't bother to check generated code for the same reasons we don't check compiler output. I love programming, and it's a little scary to think it might not be a big part of my job. But coding was always the easy part. The hard part is requirements, goals, feedback, figuring out what to build and whether it's working. There's still so much left to do, and plenty the models aren't close to yet: architecture, systems design, understanding users, coordinating across teams. It's going to continue being fun and very interesting for the foreseeable future." But still, it's not hard to see that that's a fairly big pronouncement.
Now, moving back to the realm of the non-speculative, the other thing that captured people's attention about this release is that Opus 4.5 is significantly cheaper than Opus 4.1: the cost dropped from $15 to $5 per million input tokens and from $75 to $25 per million output tokens. Indeed, Jeremy from Anthropic points out one fact people won't realize immediately about Opus 4.5: it's remarkably token efficient. All in, it's often cheaper than Sonnet 4.5 and other models on cost per task success. Simon Willison points out why we probably need to be looking not just at cost per input and output token but also at token efficiency, when he writes: "This is notable. Opus 4.5 is around 60% more expensive than Sonnet, $25 per million output tokens compared to $15 per million. But if it can use 76% fewer output reasoning tokens for the same complex task, it may end up cheaper." That 76% figure came from Claude Relations lead Alex Albert, who said that on SWE-Bench Verified at medium effort, Opus 4.5 beats Sonnet 4.5 while using 76% fewer output tokens.
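The arithmetic behind Simon's point is worth making explicit: per-task output cost is just tokens used times price per token, so a higher per-token price can still come out cheaper. A quick worked example with an illustrative token count (the 100k figure is assumed, not a real measurement):

```python
# Worked example of Simon Willison's point: price per token vs. tokens used.
SONNET_PRICE = 15 / 1_000_000  # dollars per output token
OPUS_PRICE = 25 / 1_000_000    # dollars per output token

sonnet_tokens = 100_000                    # illustrative task, assumed figure
opus_tokens = sonnet_tokens * (1 - 0.76)   # 76% fewer output tokens

sonnet_cost = sonnet_tokens * SONNET_PRICE  # $1.50
opus_cost = opus_tokens * OPUS_PRICE        # $0.60

print(f"Sonnet: ${sonnet_cost:.2f}  Opus: ${opus_cost:.2f}")
# Opus charges about 67% more per token (25/15), but on this task it costs
# 0.24 * (25 / 15) = 0.40 of Sonnet's output spend, i.e. 60% cheaper.
```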
Look, it's early days, but the first impressions are big. Dan Shipper again sums up: "Every 6 to 12 months, a model drops that truly shifts the paradigm. Opus 4.5 launched today, and that's what it is: best coding model I've ever used, and it's not close. We're never going back." Brian Atwood points out: "I said a month or two ago that Anthropic is a vertical AI company, and this is what I meant. They rightly identified that coding is the number one use case for LLMs right now and are overwhelmingly focused on it. Meanwhile, others are throwing darts in every conceivable direction, spreading themselves thin." Interestingly, just a couple of days ago, Sam Altman posted: "It has been amazing to watch the progress of the Codex team. They are beasts. The product and model is already so good and will get much better. I believe they will create the best and most important product in the space and enable so much downstream work." It has been pretty clear for some time now that OpenAI has come around to a similar view of the importance of coding and is very much not content to cede that ground. Summing up, Ethan Mollick writes: "The main lesson of the past few weeks is that the big four US labs all seem to have figured out a path forward in continuing the exponential pace of LLM improvement, at least in the near future." More simply put, Andrew Curran writes: "AI winter is cancelled. Try again next year, Grinch squad." There will, I'm sure, be lots more to discuss around Opus 4.5 as people get deeper into it. But for now, like I said, the Thanksgiving model explosion continues unabated. That's going to do it for today's episode. Appreciate you listening as always, and until next time, peace.