
Why Claude Opus 4.5 Changes What's Possible with Vibe Coding

By The AI Daily Brief: Artificial Intelligence News

Summary

## Key takeaways

- **Opus 4.5 tops SWE-bench Verified**: Opus 4.5 achieves 80.9% on SWE-bench Verified, well ahead of GPT-5.1 Codex Max at 77.9% and setting a new record ("a 3% lead has never looked so large"). [01:50], [01:52]
- **Leads Terminal-Bench 2.0**: On the Terminal-Bench 2.0 agentic terminal coding benchmark, Opus 4.5 is meaningfully ahead of all others, and it sets a new standard on agentic tool use, scaled tool use, and computer use. [01:58], [02:09]
- **Outperforms humans on exam**: Within the 2-hour time limit, Claude Opus 4.5 scored higher than any human candidate ever on Anthropic's notoriously difficult take-home performance engineering exam. [05:41], [05:47]
- **Massive productivity gains**: 50% of 18 surveyed Anthropic staff reported at least a 100% productivity improvement using Opus 4.5 in Claude Code, with a mean self-estimated improvement of 220%. [06:09], [06:25]
- **New agentic tool features**: Anthropic is releasing tool search to access thousands of tools without consuming context, programmatic tool calling in code execution environments, and standardized tool-use examples for scalable agents that integrate IDEs and deployment pipelines. [07:10], [07:34]
- **Token-efficient coder**: Opus 4.5 does better at SWE-bench Verified without thinking than with 64k reasoning tokens, and uses 76% fewer output tokens than Sonnet 4.5 for the same tasks while being cheaper per task success. [09:26], [14:06]

Topics Covered

  • Full Video

Full Transcript

Welcome back to The AI Daily Brief. The Thanksgiving 2025 parade of models has continued into a new week, this time with the launch of Claude Opus 4.5 from Anthropic. Now, people have been assuming for some time that we were going to get an Opus 4.5. We've obviously had Sonnet 4.5 for a while now, and so people figured this was in the offing, but there had been a lot less conversation leading up to this around when it was going to come.

The big model, of course, that people have been anticipating is Gemini 3, and in many ways this was a wildly understated announcement. And yet the response has been, in a word, significant. While they may not have hyped it, Anthropic minces no words in their launch post: "Our newest model, Claude Opus 4.5, is available today. It's intelligent, efficient, and the best model in the world for coding, agents, and computer use. It's also meaningfully better at everyday tasks like deep research and working with slides and spreadsheets. Opus 4.5 is a step forward in what AI systems can do and a preview of larger changes to how work gets done."

So let's talk first about the benchmarks, and it is no accident that the one they chose to put right at the top is SWE-bench Verified.

You might remember that in our discussions about Gemini 3, the only major benchmark that they didn't win or at least match was this one. While Sonnet 4.5 was at 77.2%, Gemini 3 Pro was at 76.2%. Not like it was super far behind, but still not technically state-of-the-art. GPT-5.1 was also a tiny bit ahead of Gemini 3 Pro at 76.3%, and extended that lead to 77.9% when OpenAI released GPT-5.1 Codex Max in the days following Gemini 3. For a very short time, GPT-5.1 Codex Max was the top of the SWE-bench Verified chart, but Opus 4.5, at least by the benchmarks, blows it out of the water at 80.9%.

Writes Morgan: "A 3% lead has never looked so large." And it wasn't just SWE-bench Verified. On the Terminal-Bench 2.0 agentic terminal coding benchmark, Opus 4.5 was meaningfully ahead of all the others as well. On agentic tool use, scaled tool use, and computer use, Opus 4.5 sets a new standard. Now, there were some tests where Opus 4.5 meaningfully lagged behind Gemini 3, such as Humanity's Last Exam, where it was significantly behind both without search and with search.

And yet what everyone was talking about, of course, was the coding results. If you are a regular listener of the show, you will know that the ascendancy of Anthropic this year, and the speed with which they are catching up to OpenAI, has much to do with them being the preferred AI coding model for developers, a position that started with Sonnet 3.5 and has basically continued unchallenged, although after the release of GPT-5 there have at least been credible competitors. Anthropic seems very clearly to agree with swyx on the relative importance of coding as compared to all other use cases. A couple of times I've referenced Shawn's post about what made him decide to go work with Cognition, where he basically framed coding as the high-value, short-timeline activity. The line, which I've shared a couple of times: "Code AGI will be achieved in 20% of the time of full AGI and capture 80% of the value of AGI." Whether or not that's true, Anthropic has certainly behaved as such.

Now, outside just the standard SWE-bench, there were a couple of other things that people noticed. Igor Kotenkov points out that while there are ways to overfit towards the SWE-bench Verified benchmark, the more recent SWE-bench Pro is a lot more difficult and connected to the real world, and Opus blows previous models out of the water: Opus gets 52% where Sonnet 4.5 got 43.6% and GPT-5 got just 36%. On ARC-AGI, Opus 4.5 set a new standard ahead of GPT-5.1 and Gemini 3, and on ARC-AGI-2 it got 37.64% at $2.40 a task. Already, just hours after the release, the people who had early access were also independently verifying some of these results. Bindu Reddy writes: "Opus 4.5 tops LiveBench AI and is the world's best agentic model. We can confirm this after testing it over the past few days."

Now, interestingly, one of the things that we've seen a lot from labs recently is the people inside the labs really talking up the specifics of what they like about the models. We got a spate of that from Anthropic team members, such as Jake Eaton, who writes: "Opus 4.5 is very good at a lot of things and you should read the benchmarks, the model card, etc. But my favorite thing about working with it these past two weeks is that in conversation it is somehow more fine-grained. It has a depth and texture that for me was immediately noticeable. It also feels, interestingly, much more self-contained." Sasha Dearing says: "The internal response to Opus 4.5 has been a mix of excitement, awe, and surprise, particularly around how good it is at coding." Tariq writes: "Opus 4.5 is special. A world record in SWE-bench and OSWorld benchmarks. The best model we've ever had at vision. On Claude Code, I've completely stopped writing code in the IDE. I think there's so much to discover about Opus 4.5."

And indeed, some of the most interesting responses from Anthropic's members come from their engineering team. Sholto Douglas writes: "I am so excited about this model. First off, the most important eval: people posting stories of crazy bugs that Opus found or incredible PRs that it nearly soloed. A couple of our best engineers are hitting the interventions-only phase of coding." Adam Wolff writes: "This new model is something else. Since Sonnet 4.5, I've been tracking how long I can get the agent to work autonomously. With Opus 4.5, this is starting to routinely stretch to 20 or 30 minutes. When I come back, the task is often done simply and idiomatically."

They also talked about how Claude Opus compared on a notoriously difficult candidate exam. In their announcement post, they wrote: "We give prospective performance engineering candidates a notoriously difficult take-home exam. We also test new models on this exam as an internal benchmark. Within our prescribed 2-hour time limit, Claude Opus 4.5 scored higher than any human candidate ever." They continue: "The take-home test is designed to assess technical ability and judgment under time pressure. It doesn't test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over years. But this result, where an AI model outperforms strong candidates on important technical skills, raises questions about how AI will change engineering as a profession."

They also talked to staff members to estimate the impact of using Opus 4.5 in Claude Code: 50% (nine of the 18 they surveyed) reported a productivity improvement of at least 100%, and the mean self-estimated productivity improvement was 220%.

They also popped open the hood a little bit on how they're making Claude even better when it comes to agentics. In short, they have a huge emphasis on tools. Indeed, they write: "The future of AI agents is one where models work seamlessly across hundreds or thousands of tools. An IDE assistant that integrates Git operations, file manipulation, package managers, testing frameworks, and deployment pipelines. An operations coordinator that connects Slack, GitHub, Google Drive, Jira, company databases, and dozens of MCP servers simultaneously." To build effective agents, they need to work with unlimited tool libraries without stuffing every definition into context up front. Agents also need to be able to call tools from code, and they need to learn correct tool usage from examples.

Following that, they shared that they were releasing three features to make all of that possible: a tool search tool, which allows Claude to search across thousands of tools without consuming its context window; programmatic tool calling, which allows Claude to invoke tools from a code execution environment, reducing the impact on the model's context window; and tool use examples, which provide a universal standard for demonstrating how to effectively use a given tool.
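
To make the pattern concrete, here is a minimal sketch of what a request using deferred tool loading plus a search tool might look like with the Anthropic Python SDK's Messages API. Only the general Messages API shape (model, tools, messages) is standard here; the `tool_search_tool` type string and the `defer_loading` flag are illustrative assumptions, not confirmed identifiers, so check Anthropic's tool search and programmatic tool calling docs for the exact field names.

```python
# Hedged sketch: a large tool library where most definitions stay out of
# context until Claude searches for them. The tool type string and the
# "defer_loading" flag are hypothetical stand-ins for whatever identifiers
# Anthropic's API actually uses; the rest is the standard Messages API shape.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [
    # A search tool Claude can call to discover other tools on demand.
    {"type": "tool_search_tool", "name": "tool_search"},  # hypothetical type string
    # One of potentially thousands of deferred tool definitions.
    {
        "name": "create_jira_ticket",
        "description": "Create a Jira ticket with a title and description.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description": {"type": "string"},
            },
            "required": ["title"],
        },
        "defer_loading": True,  # hypothetical flag: loaded only if search surfaces it
    },
]

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "File a Jira ticket for the login bug."}],
)
print(response.content)
```

The design goal described in the post is that only the handful of tools an agent actually needs ends up in context, which is what makes hundred- or thousand-tool libraries tractable.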

Again, all of this is telling a very consistent story, which is that Claude is for coding and for pushing the frontier of what agents can do. So outside of interacting with the benchmarks, what were people's first impressions? Some were excited and appreciated that there was less hype around this. Nico Christie writes: "Have to respect Anthropic's commitment to not vague-posting all weekend. This is the most exciting model release since Sonnet 3.5." Leo at Synthwaved writes: "Be Anthropic. Pretend Gemini 3 does not exist. Know you're ready to cook it for code anyways. Wait. Zero hype posting. Drop new Opus. State-of-the-art for code. State-of-the-art on ARC-AGI. Better than expected. Costs less than old Opus. Be more like Anthropic."

On the flip side, Ethan Mollick basically asked why they were burying the lede: "I'm not sure why Anthropic keeps doing very low-key launches for fairly major releases and materially important improvements to their services." I kind of think it has to do with their assessment of the specificity of their audience in and among developers. Basically, it's a group of people that they think is going to respond more to having their peers and colleagues tell them about an update than to maximum social distribution from being loud and hypey. But what about people's early tests?

hypy. But what about people's early tests? Victor Talon writes, "To my

tests? Victor Talon writes, "To my surprise, Opus 4.1 oneshotted my hardest calculus problem tying with Gemini 3. In

terms of first hour impressions, couldn't be more promising, I guess."

Ethan Mollik writes, "I had early access to Opus 4.5, and it's a very impressive model that seems to be right at the frontier. Big gains in ability to do

frontier. Big gains in ability to do practical work like make a PowerPoint from an Excel." Nico again writes, "OPUS 4.5 is a step function improvement for spreadsheet work. Extremely hard became

spreadsheet work. Extremely hard became doable. Doable tasks became easy and

doable. Doable tasks became easy and easy tasks are now solved. And yet, if there were a few examples of people trying non-coding things, coding is very much where the main excitement lies. GL,

Guillermo Rauch, the CEO of Vercel, writes: "Opus is on a different level. It's unreasonably good at Next.js and the best model we've tried on v0 to date." Menlo Ventures' Deedy Das writes: "Anthropic just dropped the best coding model, Opus 4.5." The coolest thing, he points out, is that it does better at SWE-bench Verified without thinking than with 64k reasoning tokens. In other words, a super token-efficient model. Matt Shumer, who didn't have early access, said: "First test of Claude Opus 4.5 and I'm already impressed. I asked it for a Colab competitor UI and it quickly pulled together this screen. Definitely better than my similar test with GPT-5.1 and, shockingly, Gemini 3. More testing to go, but this is a good start." He followed it up: "Okay. Wow. I'm kind of blown away. In one shot, Opus 4.5 made the UI actually functional, with Python running in the browser."

Some, like Super Dario, pointed out that this may not even be the best model that Anthropic has behind the scenes. They write: "Good time to remind everyone Anthropic has a long-standing policy of not significantly pushing the frontier to prevent an arms race. Dario can hit SWE-bench scores at will." Now, whether or not that's true, the fact that there is a lot of chatter like that is, I think, a good reflection of the sentiment in the community. Maybe the most vocally excited about this are Dan Shipper and the team at Every. He writes: "Breaking news: Anthropic just dropped Claude Opus 4.5. It is by far the best coding model I've ever used."

And here's how Dan describes it: it extends the horizon of what you can vibe code. Explaining, he writes: "The current generation of new models (Anthropic's Sonnet 4.5, Google's Gemini 3, or OpenAI's GPT-5.1 Codex Max) can all competently build a minimum viable product in one shot or fix a highly technical bug autonomously." But eventually, if you kept pushing them to vibe code more, they'd start to trip over their own feet: the code would get convoluted and contradictory, and you'd get stuck in endless bugs. "We have not found that limit yet with Opus 4.5. It seems to be able to vibe code forever."

Two more observations. Opus 4.5, he says, takes working in parallel to a whole new level because it's far better at planning and coding. It can work with more autonomy, meaning you can do more in parallel without breaking anything. One of his teammates worked on 11 different projects in six hours and had good results on all of them. Lastly, he points out, it's great at design iteration. "Opus 4.5," Dan writes, "is incredibly skilled at iterating through a design autonomously using an MCP like Playwright. Previous models would lose the thread after a few cycles or say a design was done when it wasn't. Opus 4.5 is incredible at autonomously iterating until a design is pixel-perfect."

Indeed, Dan's team at Every were equally vocal in their love of this model. Kieran Klaassen writes: "2023 was GPT-4. 2024 was Sonnet 3.5. 2025 is Opus 4.5. This is the coding model launch I've been waiting for. First time I genuinely believe I can vibe code an entire app end to end without touching the implementation details. We haven't found the limit yet. Previous models would eventually trip over their own feet: convoluted code, contradictory logic, endless bugs. Opus 4.5 just keeps going. If you write code with AI, you need to try this."

And I think this idea is the thing to watch, to see whether Kieran and Dan's first impressions here, and some of the impressions of the Anthropic team, really play out: that this is, as Kieran puts it, the first time we can vibe code an entire app end to end without touching the implementation details. It strikes me that if that is the case, it could be the most massive implication of this model.

Adam Wolff from Anthropic again wrote: "I believe this new model in Claude Code is a glimpse of the future we're hurtling towards. Maybe as soon as the first half of next year, software engineering is done. Soon we won't bother to check generated code, for the same reasons we don't check compiler output. I love programming, and it's a little scary to think it might not be a big part of my job. But coding was always the easy part. The hard part is requirements, goals, feedback, figuring out what to build and whether it's working. There's still so much left to do, and plenty the models aren't close to yet: architecture, systems design, understanding users, coordinating across teams. It's going to continue being fun and very interesting for the foreseeable future." But still, it's not hard to see that that's a fairly big pronouncement.

Now, moving back to the realm of the non-speculative, the other thing that captured people's attention about this is that Opus 4.5 is significantly cheaper than Opus 4.1. The cost dropped from $15 to $5 per million input tokens and from $75 to $25 per million output tokens. Indeed, Jeremy from Anthropic points out one fact people won't realize immediately about Opus 4.5: it's remarkably token-efficient. All in, it's often cheaper than Sonnet 4.5 and other models in cost per task success. Simon Willison points out why we probably need to be looking not just at cost per input and output token but also at token efficiency, when he writes: "This is notable. Opus 4.5 is around 60% more expensive than Sonnet, $25 per million output tokens compared to $15 per million. But if it can use 76% fewer output/reasoning tokens for the same complex task, it may end up cheaper." That 76% came from Claude Relations' Alex Albert, who said that on SWE-bench Verified at medium effort, Opus 4.5 beats Sonnet 4.5 while using 76% fewer output tokens.
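
To make Willison's arithmetic explicit, here's a quick back-of-envelope sketch in Python. The task size is hypothetical; only the per-million-token prices and the 76% figure come from the discussion above.

```python
# Back-of-envelope check of Simon Willison's point: Opus 4.5 charges more per
# output token than Sonnet 4.5, but if it emits 76% fewer tokens for the same
# task, it comes out cheaper. The 1M-token task size is purely illustrative.
SONNET_PRICE = 15.0  # $ per million output tokens
OPUS_PRICE = 25.0    # $ per million output tokens

sonnet_tokens = 1_000_000           # hypothetical output tokens for some task
opus_tokens = sonnet_tokens * 0.24  # 76% fewer output tokens (Alex Albert's figure)

sonnet_cost = SONNET_PRICE * sonnet_tokens / 1_000_000  # $15.00
opus_cost = OPUS_PRICE * opus_tokens / 1_000_000        # $6.00

print(f"Sonnet 4.5: ${sonnet_cost:.2f}, Opus 4.5: ${opus_cost:.2f}")
# Despite the higher per-token price, Opus ends up 60% cheaper on this task.
```

In other words, on these quoted figures the per-token premium is more than offset by the reduction in tokens emitted, which is why cost per task success is the more useful comparison.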

Look, it's early days, but the first impressions are big. Dan Shipper again sums up: "Every 6 to 12 months, a model drops that truly shifts the paradigm. Opus 4.5 launched today, and that's what it is. Best coding model I've ever used, and it's not close. We're never going back." Brian Atwood points out: "I said a month or two ago that Anthropic is a vertical AI company, and this is what I meant. They rightly identified that coding is the number one use case for LLMs right now and are overwhelmingly focused on it. Meanwhile, others are throwing darts in every conceivable direction, spreading themselves thin."

Interestingly, just a couple of days ago, Sam Altman posted: "It has been amazing to watch the progress of the Codex team. They are beasts. The product and model is already so good and will get much better. I believe they will create the best and most important product in the space and enable so much downstream work." It has been pretty clear for some time now that OpenAI has come around to a similar view of the importance of coding and is very much not content to cede that ground. Summing up, Ethan Mollick writes: "The main lesson of the past few weeks is that the big four US labs all seem to have figured out a path forward in continuing the exponential pace of LLM improvement, at least in the near future." More simply put, Andrew Curran writes: "AI winter is cancelled. Try again next year, Grinch squad."

There will, I'm sure, be lots more to discuss around Opus 4.5 as people get deeper into it. But for now, like I said, the Thanksgiving model explosion continues unabated. That's going to do it for today's episode. Appreciate you listening as always, and until next time, peace.
