Opus 4.8 Changes Everything I Thought About AI

By Build In Public

Summary

Topics Covered

It's not dumb agents, it's confident broken ones
Free upgrade breaks the standard release playbook
Parallel sub-agents turn migrations into judgment
Steer mid-run without blowing your prompt cache
The alignment bar jumped before Mythos shipped

Full Transcript

Opus 4.8 just dropped today and no one saw it coming and neither did I. And

here's everything you need to know on how it performs versus GPT 5.5 and Opus 4.7. It is available right now and you

4.7. It is available right now and you need to go ahead and stop what you're doing and update because you're going to notice the difference right away. Stop

reading tweets about it. Stop watching

videos about it and just make the upgrade right now because Opus 4.8 is the same price as Opus 4.7. So, same

input price and the same output price and the same context window on the same plan, which is amazing. But it ships fewer broken pieces of code and it

actually uses tools more cleanly. It

catches more of its own mistakes and it pushes back when a plan does not actually make sense. So, like I said, this dropped just a few minutes ago. It

says, "We're updating or upgrading Opus 4.7 to 4.8. It builds on Opus 4.7 with improvements across benchmarks and is more effective as a collaborator. And

it's available today for the same price." So, Opus 4.8 launches alongside

price." So, Opus 4.8 launches alongside several new features. Users on Claude AI now control over the amount of effort

Claude puts into a task. So, Claude code has a new dynamic workflow feature that allows it to tackle very large-scale problems. So, you can see here on the

benchmark scores, 69% on Agentic Coding versus Opus 64.3% that's on the 4.7 model. And you could see Agentic Terminal Coding on Terminal

Bench, 74% over 66.1.

These are massive improvements. So, if

you build with AI agents, this is the upgrade you've been waiting for. Just

fewer mistakes, better code refactor, better build-outs, better Agentic workflows. I mean, come on, and they

workflows. I mean, come on, and they didn't increase the price. I love that.

I want to talk about it more in-depth on how this actually impacts our workflow and how it makes us better builders.

Because there is one stat in this benchmark and in the announcement that everybody building with AI agents really needs to focus on. So, Anthropic

basically says Opus 48 is around four times less likely than its predecessor to allow flaws in code and it writes to pass unremarked, which is

amazing. So, if you read that one more

amazing. So, if you read that one more time slowly, four times less likely to let bad code slip through without flagging it. And that's not a small

flagging it. And that's not a small improvement. I'm going to be making

improvement. I'm going to be making another video today because I'm going to be going through my SaaS products, Easy Flip in the Magic Hand using Opus 48 in these workflows. I'll be dropping that

these workflows. I'll be dropping that video later today. I'll put the link in the description and just to see how it performs. How well does it do with tool calling? How well does it catch errors

calling? How well does it catch errors and refactor the code base? So, I just want to see it in a practical application. I will be dropping that

application. I will be dropping that video in a few hours. So, the difference here basically from what I'm gathering from this release is between an agent you can actually leave running and an

agent you can actually have to babysit.

Those are the two differences that we've seen, right? We've seen agents like even

seen, right? We've seen agents like even going back to the Opus 41 days when we're like, "Wow, this is amazing." We

would typically have to sit there and babysit it to make sure it was doing what it actually said it did. Here, I

think that we're getting one step closer to an actual AGI agentic workflow. You

know, here's the thing about Opus 47. I

mean, it just dropped a few weeks ago.

It was already really good. Um, and

people shipped real businesses on it and people shipped real revenue using that model and they shipped real production code, but the failure mode of every agent built on a strong model is

basically the same. The model does something that looks right. It says it's done and then you find out three days later the function it wrote silently fails or does the wrong thing. That is

so annoying. So, this release note goes over a lot of the problems that Opus 4.7 had and fixes it with Opus 4.8, which is amazing. Like you can see here the

amazing. Like you can see here the misaligned behavior score. You can see 4.7 versus Opus 4.8, a great reduction.

I mean, it's amazing. The other thing that no one's talking about, well, I mean it just released, but the dynamic roof work closes is a pretty cool feature. So, this new feature available

feature. So, this new feature available in research preview allows Claude to take on even bigger tasks in Claude code. So, Claude can plan the work and

code. So, Claude can plan the work and then run hundreds of parallel sub-agents in a single session. You do need to be aware of the token consumption on that.

Like if you're not on the $200 a month plan, you're probably going to use a ton of tokens using that. But, for example, it says Claude with Opus 4.8 can now carry out code-based scale migrations

across hundreds of thousands of lines of code from kickoff to merge with the existing test suite as its bar. That's

amazing. I'm going to be testing that out in the video that I drop later today. So, I love that. You know, a

today. So, I love that. You know, a condition um that basically swallows errors, like that's what we saw in Opus 4.7. Like you

saw that maybe it said something that, "Oh, yes, we did that. We shipped that feature. It all looks good." But, you

feature. It all looks good." But, you know, as a non-technical person, you don't deep dive into the code. You just

figure out days later that, "Wow, Claude code told me it did this, but it actually didn't and it's not working."

You know, these are tests that pass for the wrong reason and it's really annoying. Even using GPT 5.5, you come

annoying. Even using GPT 5.5, you come across these errors all the time. And

that is literally what kills agent workflows in production. It's not the model being dumb. I've said this before.

It's the model being confident about something that quietly broke. So, when

Anthropic says Opus is 4.8 less likely to ship those kinds of silent flaws, and that is the most important sentence in this whole announcement. People are just going to gloss over that. Like I said, I

have the link to this release down below. Have your agent review it and see

below. Have your agent review it and see if it's going to be a really good update. Maybe you're using GPT 5.5, say,

update. Maybe you're using GPT 5.5, say, "Hey, run a comparison analysis between GPT 5.5 and Opus 4.8. Which one should I be running? Which is the best bang for

be running? Which is the best bang for my buck?" So, availability it talks

my buck?" So, availability it talks about Claude Opus 4.8 is available right now today as this video is dropping and the pricing is the same. So, there's no reason why you shouldn't be updating.

For me, this is what makes it worth switching really instead of waiting.

Like I don't know why you'd be waiting.

It's the price. Like it's the same price. Usually they increase the price.

price. Usually they increase the price.

That's the best part and where most model releases get awkward is when they hike the price up. The new model drops, it's more expensive usually. You have to decide if the bump in quality is worth

the bump in price. Here you're getting better quality for the same price you were already paying with 4.7. So, open

for Opus 4.8 is not that release. It's

you're paying the same amount. It's the

same $5 per million input tokens and the same $25 per million output tokens. So,

it's the same as 4.7. You just get a better model for basically free. I mean,

not free, but at least no cost advantage of, you know, paying more, right? And

fast mode got even more interesting here because Opus 4.7 compared to GB 5.5, I don't know if you noticed, it was a lot slower for me. Fast mode on Opus 4.8

runs at two and a half times the speed of standard mode and the fast mode price is now three times cheaper than it was previously on other fast most modes. So,

if you have agents that were getting slowed down by latency or if you were paying a premium for fast mode and basically cannot justify it, that math

just changed for you. You can now run more faster for less with a smarter model. This I love stuff like this. This

model. This I love stuff like this. This

is so fun, like just seeing the improvements and, you know, when GPT 5.5 came out, I was a die-hard Opus 4.7 user, and I just was like, "You know

what? I'm going to give GPT 5.5 a try."

what? I'm going to give GPT 5.5 a try."

And I was blown away. But now this gets me more excited. I've always been a Claude fan, and I think, you know, they're cooking. They're cooking. 4.8, I

they're cooking. They're cooking. 4.8, I

think is going to be the model that I'm going to be using and playing around with today. I've been meaning to

with today. I've been meaning to refactor my codebase for a lot of the SaaS apps that I've built, and I'm excited to share with you to see if it's actually really worth it if you're using GPT 5.5, if it's worth making the switch

back over. I don't think that there's a

back over. I don't think that there's a trade-off here, you know, this is actually a free upgrade if you're already using 4.7. You know, if you're still routing default traffic to Opus 4.7, just switch your model ID to Claude

Opus 4.8. Sometimes you got to close

Opus 4.8. Sometimes you got to close your IDE, restart it, make the update, and you should be good to go. Or just go right into the terminal and run the update. So, there is no really reason to

update. So, there is no really reason to wait. I want to talk about benchmarks

wait. I want to talk about benchmarks here, you know, this is where everyone loves to just drool over and go, "Wow!"

I know it gets the clicks, and that's why I wanted to just bring them up briefly, but you really need to actually use it in production to see if the benchmarks hold true, because I've seen tons of benchmarks. And then when you

use that model in production, it is terrible. It sucks. The tool calling is

terrible. It sucks. The tool calling is terrible, or it ends up costing more than you think. So, I do want to go over the benchmarks quickly. You know, it's always like a marketing tactic until you

look at which benchmarks the lab actually picks. Anthropic led with

actually picks. Anthropic led with Terminal Bench 2.1. You can see here at the top, 74.6% versus Opus being Opus 4.7 being 66%, and Opus 4.8

[clears throat] scored 89%, and GPT 5.5 with Codex CLI scored 83.4%. So, what is Terminal Bench

scored 83.4%. So, what is Terminal Bench actually measuring? I mean, for people

actually measuring? I mean, for people that are new, it is measuring whether the model can complete real terminal tasks, running commands, reading files, debugging code, stitching tools

together, the stuff your agent actually does when it is sitting in front of a real shell. So, 5 and 1/2 points over

real shell. So, 5 and 1/2 points over the GPT-5.5 model on this benchmark is not just vibe coding tests, but that is actually the model being more reliable

at the kind of work that pays you back in production, which is amazing. The

next one is online mind to web. This is

a benchmark that Opus 4.8 scored 84% on.

That is a meaningful jump over both 4.7 Opus and GPT-5.5, and online mind to web measures whether the model can complete real tasks on

real live websites. So, things like [clears throat] clicking real buttons, reading real pages, filling out real forms, and reasoning about a live browser environment. So, if you are

browser environment. So, if you are building any kind of browser agent or computer use agent, this is the score that maps to your reliability in the wild, something that you could actually

sell as an agent as a service. And that

becomes more reliable for your clients.

The third one, and people are sleeping on this, is the super agent benchmark score. Opus 4.8 is the only model to

score. Opus 4.8 is the only model to complete every case end-to-end on this one, betting prior Opus models and

parity on costs with GPT-5.5.

Reading those three together, you start what Anthropic is actually claiming.

Opus 4.8 is the most reliable model right now at agentic work, terminal tasks, live web tasks, long horizon end-to-end tasks. That lines up with the

end-to-end tasks. That lines up with the four times bug catching number. The

whole release is one thesis. Basically,

you make the model that is the best partner for long, messy, multi-setup work that actually agents can actually do, and I love that. Now, to talk a little bit about our Shipping School

community and why updates just like Opus 4.8 are super relevant for builders today. This is what we focus on. Right

today. This is what we focus on. Right

now, we're in the middle of a 24-week sprint of really learning how machines work, how AI works, how AI engineering works, and AI inference works. And this

is a syllabus you're looking at right now. If you join the community, this is

now. If you join the community, this is the foundation of which to build upon.

We go over things like what does AI actually mean? And is AI actually

actually mean? And is AI actually intelligent? We go into teaching points,

intelligent? We go into teaching points, beginner analogies, we deep dive into the technical stuff. You know, we go into lesson number two is all about machine learning. Lesson number three is

machine learning. Lesson number three is what is an actual large language model?

How does it operate? We go into AI inference. We also go into module two,

inference. We also go into module two, which is prompting and context engineering. We go into deep diving

engineering. We go into deep diving teaching points of context engineering and how it applies for us builders. We

go into building reusable prompt templates to actually build that foundation layer to help you build. So,

I don't want to go over all of this, but you are getting a ton of useful information to become the builder in this AI age. Like I said it, we're in a 24-week sprint where we get four live

calls every week that are constant courses being updated in the curriculum.

I'll put the link to the community down below. Join over 150 builders building

below. Join over 150 builders building their app using Cloud Code, Hermes, Open Claw. We'll get you set up. You could

Claw. We'll get you set up. You could

always just reach out to me in the community as well. We have four other Here is the syllabus you'll have access to. If you want to learn how to actually

to. If you want to learn how to actually use AI and actually build a business using AI, you want to click the link down below, join the community, and we'll be happy to have you. Let's get

back into the video. Now, I want to get into the features that actually shipped alongside this model because there's a lot of sneaky ones that we're kind of glossing over because this release is

not just a model swap. It's actually a real product with real changes. The

first one is that dynamic workflow we touched on earlier. This is in Claude code. This is basically research preview

code. This is basically research preview still, and it is a pretty big deal though because dynamic workflows can run hundreds of parallel sub-agents in a

single Claude code session, and it can handle code base scale migrations across hundreds of thousands of lines of code.

I know we talked about that, but that's going to have a bigger impact than most people even think. So, you can see here dynamic workflows is one feature that they shipped along with this model. I

mean, one session to to spawn a hundred sub-agents, that scales your code base with fixes and, you know, updates to the code base, I mean, that's just insane.

If you've ever tried to like migrate a large code base to a new framework to like, let's say, Next.js from React, it's intense, and it's a lot. I mean,

you could hit your usage limit if you're on that $100 month plan in 1 hour. So,

basically, this new framework, a new API, in a new lint config, a new typing system, you know, these are the pain points of how much of a pain this used to be. So, you would either have to do

to be. So, you would either have to do it manually to refactor all this, or you write like a custom script that handles maybe 70% of the cases, and then you would spend another 2 weeks chasing the

edge cases just making sure nothing breaks. But, dynamic workflows is

breaks. But, dynamic workflows is basically Anthropic saying, let the agent run migration in parallel across the whole code base, and let it handle

those long tails with judgment instead of regex. So, that's amazing. You know,

of regex. So, that's amazing. You know,

if that works the way they're describing it, I'm still going to test this out today because talking and doing are two different things, but that would be like one of the most powerful single feature in any coding agent right now. For the

Shipping School community members who are working on real code bases, this is the feature to spend time on like today, like right now. You need to check it out. The next one is the effort control,

out. The next one is the effort control, which is pretty cool. You can see here effort control in co-work and Claude AI.

This one is live right now on co-work and Claude and you can get to choose how much effort the model puts into a response. So, higher settings spend more

response. So, higher settings spend more tokens, obviously, but you get a better result. By default, Opus 4.8 is set to

result. By default, Opus 4.8 is set to high effort. That is the best balance

high effort. That is the best balance and that's usually how I operate on Opus 4.7, either high or extra high. For

difficult or long-running async tasks, you can crank it up to an extra effort and let it think harder. It will take longer, but still, I think you'll get better quality output.

[clears throat] The honest read on all this is you get a knob to tell the model how much it should care, essentially. This is great for builders who want one model that

could be fast and cheap for simple stuff and deeply considered for the hard stuff. The third one here, and people

stuff. The third one here, and people just miss this, is the messages API now accepts system entries inside of the messages array, which is pretty amazing.

So, they are accepted inside the messages array. What does that even

messages array. What does that even mean? It just means you can update

mean? It just means you can update mid-task instructions without breaking your prompt cache. That sounds super boring and I know it sounds boring, but it's not boring. If you run agents with

prompt caching to keep costs down, you know that pretty much any change to the system prompt blows the cache. You eat

full context costs. That will hit your token usage limit in hours instead of that 5-hour limit repeatedly and that's annoying. This change actually lets you

annoying. This change actually lets you steer mid-run without paying that tax.

So, if you are running production agents, your cost just dropped basically without you doing anything. So, combine

these three, better model at the same price, you got dynamic workflows for the code-based scale work, you have effort control for cost and depth control, you have messages API update for the mid-run

steering at cost-friendly costs, and that's the real release. Not just the benchmark glamour, which is amazing. I

got excited, too. I mean, who isn't?

But, it's the real deep dive within Opus 4.8 that allows us to be better builders. I want to talk about what the

builders. I want to talk about what the partners are actually saying because Anthropic did load this announcement up with quotes from CTOs and engineers running real production stacks. And

those quotes do matter more than the benchmark slides, I think. You know, you got to read in the details, really. The

proof's in the pudding. So, Cursor said tool calling is meaningful, more efficient, using fewer steps. And that

is Michael Truel, co-founder of Cursor and CEO. So, if Cursor is reporting

and CEO. So, if Cursor is reporting cleaner tool calling, that means real reduction in latency and cost for everyone using Cursor downstream. Devin

said Opus 4.8 uses tool cleanly and follows instructions, that it actually fixes common verbosity and tool calling issues. That is Scott Wu, CEO of

issues. That is Scott Wu, CEO of Cognition. Devin is one of the most

Cognition. Devin is one of the most aggressive autonomous engineering products out there. If you haven't heard about it, you should check it out. They

notice every regression. So, them saying tool calling cleaned up is a strong signal from them. Databricks said it is a step change in agentic reasoning with

61% cheaper token costs than Opus 4.7 on their Genie product. That is Han Lantang, CTO of Neural Networks at Databricks. So, step change is a strong

Databricks. So, step change is a strong word for a CTO to use. And 61% token cost reduction is a huge amount of money. Harvey said better citation

money. Harvey said better citation precision and more token efficiency.

Thomson Reuters co-counsel said meaningful improvements in consistency and reasoning quality for legal work, which is amazing. A staff engineer at Anthropic said the model asks the right

questions, catches its own mistakes, and pushes back when a plan is not sound. I

love that. Stack all these together, tool calling cleaner, step change in agentic reasoning, and 61% cost reduction, better citations, catches its

own mistakes, pushes back on bad plans.

That is the practical version of the 4x bug catching number, and I love it.

Great changes here. There is also a real story on the legal side. Nico Grupen,

head of applied research, said Opus 4.8 is the first model to break 10% overall on the All Standard of the legal agent benchmark. That sounds tiny until you

benchmark. That sounds tiny until you understand what All Pass means. All Pass

means the model has to get every single subtask right, not most of them, and not the ones that matter, but actually all of them. One miss is a failure. In legal

of them. One miss is a failure. In legal

work, that is the actual bar. A contract

analysis misses one clause, it's contract analysis that just lost you a case. That's could be hundreds of

case. That's could be hundreds of thousands dollars. So, being the first

thousands dollars. So, being the first model to break all 10% All Pass on legal is a model saying I can actually do attorney work without you double-checking every single step. That

is the big door opening for AI in legal, which is pretty cool. I want to talk about alignment as well, because this is the part most releases skip past in a single sentence. Entropic says Opus 4.8

single sentence. Entropic says Opus 4.8 reaches new highs on measures of prosocial traits. Misaligned behavior

prosocial traits. Misaligned behavior rates are substantially lower than Opus 4.7, and the rates are similar to the best aligned model, which they called

Claude Mythos preview. Mythos is the next class of model they are working on.

They mentioned in the same announcement that Mythos class models are expected to be coming in the next few weeks, pending cybersecurity safeguards. There are also

cybersecurity safeguards. There are also a Project Glass Week reference. Claude

Mythos preview is being used by select organizations for cybersecurity work.

So, the timeline reads basically like this. Like, we get Opus 4.8 today, which

this. Like, we get Opus 4.8 today, which no one expected. This is the production grade model that already behaves close to Mythos on alignment. So, in the coming weeks, you can expect Mythos

itself being shipped with even higher intelligence above Opus. Though, the

price is yet to be seen, which is kind of scary. And separately, Anthropic is

of scary. And separately, Anthropic is working on cheaper models that hit Opus quality at a lower price, which is great for builders if you're building on a budget, you know? And that is a lot of

forward motion in one release. This was

a big one. If you are betting on Anthropic as your default agent model, this is the signal that the lane is open for the next 12 months. And Anthropic is crushing it right now. The builder

takeaway here is if you're running production agent on Opus 4.7, you need to change the model ID to Opus 4.8 right now. That is the only required change

now. That is the only required change you need to make it to get access to these features. You will get a cleaner

these features. You will get a cleaner tool calling, you will ship less broken code, you will pay the same price. So,

if you are running a fast mode agent, you just got basically three times cost reduction at a two and a half speed gain. Update your fast mode flag now. If

gain. Update your fast mode flag now. If

you are doing any code-based migration work, get on Claude code research preview of the dynamic workflows and then run the parallel sub-agent migration on a real branch and just see if it actually handles your edge cases.

If you're running cost-sensitive workloads, look at your effort control.

Map out your low effort to background batch jobs, map extra effort to your long-running async tasks and default mid for everything else. Why do you need high for simple fixes, right? Only use

it when you need to. So, if you're using prompt caching as well, look at the messages API update. Move your mid-run instructions out of the system prompt and into the messages array. It's going

to be cheaper steering, same control.

And if you're still on Sonnet or any smaller models for serious agent work, I mean, I don't know what you're doing.

This is the release that makes me confident pushing more people to using Opus tier for the hard stuff. The 4x bug caching number alone justifies the price. This is the mono upgrade where

price. This is the mono upgrade where you do not have to wait for the dust to settle. The dust already settled. Get

settle. The dust already settled. Get

going. What are you doing? The partners

are already in production. The pricing

is the same and the model is better.

Make the swap right now. If you want to build with agents the way we do it inside the community like open claw Hermes, Claude code, cursor, whatever you're using codex and running the stack

that we run inside the community, come join us. Like I said, I showed the

join us. Like I said, I showed the syllabus earlier. There's everything

syllabus earlier. There's everything you're going to get to learn. Come build

with with us. We're testing Opus 4.8 this week inside the community. Real

code bases, real people, real agents and if you want to be in the room when those tests land, join us. We'll see in the community and if you haven't liked the video already, please do so. Please

subscribe to the channel. We're going to be dropping a detailed more in-depth video on vibe coding with Opus 4.8 testing all these features on my apps just to see how it performs and we'll

see in the next one guys. Have a blessed day. Happy building.

day. Happy building.

Loading...

Loading video analysis...