Opus 4.8 Changes Everything I Thought About AI
By Build In Public
Summary
Topics Covered
- It's not dumb agents, it's confident broken ones
- Free upgrade breaks the standard release playbook
- Parallel sub-agents turn migrations into judgment
- Steer mid-run without blowing your prompt cache
- The alignment bar jumped before Mythos shipped
Full Transcript
Opus 4.8 just dropped today and no one saw it coming and neither did I. And
here's everything you need to know on how it performs versus GPT 5.5 and Opus 4.7. It is available right now and you
4.7. It is available right now and you need to go ahead and stop what you're doing and update because you're going to notice the difference right away. Stop
reading tweets about it. Stop watching
videos about it and just make the upgrade right now because Opus 4.8 is the same price as Opus 4.7. So, same
input price and the same output price and the same context window on the same plan, which is amazing. But it ships fewer broken pieces of code and it
actually uses tools more cleanly. It
catches more of its own mistakes and it pushes back when a plan does not actually make sense. So, like I said, this dropped just a few minutes ago. It
says, "We're updating or upgrading Opus 4.7 to 4.8. It builds on Opus 4.7 with improvements across benchmarks and is more effective as a collaborator. And
it's available today for the same price." So, Opus 4.8 launches alongside
price." So, Opus 4.8 launches alongside several new features. Users on Claude AI now control over the amount of effort
Claude puts into a task. So, Claude code has a new dynamic workflow feature that allows it to tackle very large-scale problems. So, you can see here on the
benchmark scores, 69% on Agentic Coding versus Opus 64.3% that's on the 4.7 model. And you could see Agentic Terminal Coding on Terminal
Bench, 74% over 66.1.
These are massive improvements. So, if
you build with AI agents, this is the upgrade you've been waiting for. Just
fewer mistakes, better code refactor, better build-outs, better Agentic workflows. I mean, come on, and they
workflows. I mean, come on, and they didn't increase the price. I love that.
I want to talk about it more in-depth on how this actually impacts our workflow and how it makes us better builders.
Because there is one stat in this benchmark and in the announcement that everybody building with AI agents really needs to focus on. So, Anthropic
basically says Opus 48 is around four times less likely than its predecessor to allow flaws in code and it writes to pass unremarked, which is
amazing. So, if you read that one more
amazing. So, if you read that one more time slowly, four times less likely to let bad code slip through without flagging it. And that's not a small
flagging it. And that's not a small improvement. I'm going to be making
improvement. I'm going to be making another video today because I'm going to be going through my SaaS products, Easy Flip in the Magic Hand using Opus 48 in these workflows. I'll be dropping that
these workflows. I'll be dropping that video later today. I'll put the link in the description and just to see how it performs. How well does it do with tool calling? How well does it catch errors
calling? How well does it catch errors and refactor the code base? So, I just want to see it in a practical application. I will be dropping that
application. I will be dropping that video in a few hours. So, the difference here basically from what I'm gathering from this release is between an agent you can actually leave running and an
agent you can actually have to babysit.
Those are the two differences that we've seen, right? We've seen agents like even
seen, right? We've seen agents like even going back to the Opus 41 days when we're like, "Wow, this is amazing." We
would typically have to sit there and babysit it to make sure it was doing what it actually said it did. Here, I
think that we're getting one step closer to an actual AGI agentic workflow. You
know, here's the thing about Opus 47. I
mean, it just dropped a few weeks ago.
It was already really good. Um, and
people shipped real businesses on it and people shipped real revenue using that model and they shipped real production code, but the failure mode of every agent built on a strong model is
basically the same. The model does something that looks right. It says it's done and then you find out three days later the function it wrote silently fails or does the wrong thing. That is
so annoying. So, this release note goes over a lot of the problems that Opus 4.7 had and fixes it with Opus 4.8, which is amazing. Like you can see here the
amazing. Like you can see here the misaligned behavior score. You can see 4.7 versus Opus 4.8, a great reduction.
I mean, it's amazing. The other thing that no one's talking about, well, I mean it just released, but the dynamic roof work closes is a pretty cool feature. So, this new feature available
feature. So, this new feature available in research preview allows Claude to take on even bigger tasks in Claude code. So, Claude can plan the work and
code. So, Claude can plan the work and then run hundreds of parallel sub-agents in a single session. You do need to be aware of the token consumption on that.
Like if you're not on the $200 a month plan, you're probably going to use a ton of tokens using that. But, for example, it says Claude with Opus 4.8 can now carry out code-based scale migrations
across hundreds of thousands of lines of code from kickoff to merge with the existing test suite as its bar. That's
amazing. I'm going to be testing that out in the video that I drop later today. So, I love that. You know, a
today. So, I love that. You know, a condition um that basically swallows errors, like that's what we saw in Opus 4.7. Like you
saw that maybe it said something that, "Oh, yes, we did that. We shipped that feature. It all looks good." But, you
feature. It all looks good." But, you know, as a non-technical person, you don't deep dive into the code. You just
figure out days later that, "Wow, Claude code told me it did this, but it actually didn't and it's not working."
You know, these are tests that pass for the wrong reason and it's really annoying. Even using GPT 5.5, you come
annoying. Even using GPT 5.5, you come across these errors all the time. And
that is literally what kills agent workflows in production. It's not the model being dumb. I've said this before.
It's the model being confident about something that quietly broke. So, when
Anthropic says Opus is 4.8 less likely to ship those kinds of silent flaws, and that is the most important sentence in this whole announcement. People are just going to gloss over that. Like I said, I
have the link to this release down below. Have your agent review it and see
below. Have your agent review it and see if it's going to be a really good update. Maybe you're using GPT 5.5, say,
update. Maybe you're using GPT 5.5, say, "Hey, run a comparison analysis between GPT 5.5 and Opus 4.8. Which one should I be running? Which is the best bang for
be running? Which is the best bang for my buck?" So, availability it talks
my buck?" So, availability it talks about Claude Opus 4.8 is available right now today as this video is dropping and the pricing is the same. So, there's no reason why you shouldn't be updating.
For me, this is what makes it worth switching really instead of waiting.
Like I don't know why you'd be waiting.
It's the price. Like it's the same price. Usually they increase the price.
price. Usually they increase the price.
That's the best part and where most model releases get awkward is when they hike the price up. The new model drops, it's more expensive usually. You have to decide if the bump in quality is worth
the bump in price. Here you're getting better quality for the same price you were already paying with 4.7. So, open
for Opus 4.8 is not that release. It's
you're paying the same amount. It's the
same $5 per million input tokens and the same $25 per million output tokens. So,
it's the same as 4.7. You just get a better model for basically free. I mean,
not free, but at least no cost advantage of, you know, paying more, right? And
fast mode got even more interesting here because Opus 4.7 compared to GB 5.5, I don't know if you noticed, it was a lot slower for me. Fast mode on Opus 4.8
runs at two and a half times the speed of standard mode and the fast mode price is now three times cheaper than it was previously on other fast most modes. So,
if you have agents that were getting slowed down by latency or if you were paying a premium for fast mode and basically cannot justify it, that math
just changed for you. You can now run more faster for less with a smarter model. This I love stuff like this. This
model. This I love stuff like this. This
is so fun, like just seeing the improvements and, you know, when GPT 5.5 came out, I was a die-hard Opus 4.7 user, and I just was like, "You know
what? I'm going to give GPT 5.5 a try."
what? I'm going to give GPT 5.5 a try."
And I was blown away. But now this gets me more excited. I've always been a Claude fan, and I think, you know, they're cooking. They're cooking. 4.8, I
they're cooking. They're cooking. 4.8, I
think is going to be the model that I'm going to be using and playing around with today. I've been meaning to
with today. I've been meaning to refactor my codebase for a lot of the SaaS apps that I've built, and I'm excited to share with you to see if it's actually really worth it if you're using GPT 5.5, if it's worth making the switch
back over. I don't think that there's a
back over. I don't think that there's a trade-off here, you know, this is actually a free upgrade if you're already using 4.7. You know, if you're still routing default traffic to Opus 4.7, just switch your model ID to Claude
Opus 4.8. Sometimes you got to close
Opus 4.8. Sometimes you got to close your IDE, restart it, make the update, and you should be good to go. Or just go right into the terminal and run the update. So, there is no really reason to
update. So, there is no really reason to wait. I want to talk about benchmarks
wait. I want to talk about benchmarks here, you know, this is where everyone loves to just drool over and go, "Wow!"
I know it gets the clicks, and that's why I wanted to just bring them up briefly, but you really need to actually use it in production to see if the benchmarks hold true, because I've seen tons of benchmarks. And then when you
use that model in production, it is terrible. It sucks. The tool calling is
terrible. It sucks. The tool calling is terrible, or it ends up costing more than you think. So, I do want to go over the benchmarks quickly. You know, it's always like a marketing tactic until you
look at which benchmarks the lab actually picks. Anthropic led with
actually picks. Anthropic led with Terminal Bench 2.1. You can see here at the top, 74.6% versus Opus being Opus 4.7 being 66%, and Opus 4.8
[clears throat] scored 89%, and GPT 5.5 with Codex CLI scored 83.4%. So, what is Terminal Bench
scored 83.4%. So, what is Terminal Bench actually measuring? I mean, for people
actually measuring? I mean, for people that are new, it is measuring whether the model can complete real terminal tasks, running commands, reading files, debugging code, stitching tools
together, the stuff your agent actually does when it is sitting in front of a real shell. So, 5 and 1/2 points over
real shell. So, 5 and 1/2 points over the GPT-5.5 model on this benchmark is not just vibe coding tests, but that is actually the model being more reliable
at the kind of work that pays you back in production, which is amazing. The
next one is online mind to web. This is
a benchmark that Opus 4.8 scored 84% on.
That is a meaningful jump over both 4.7 Opus and GPT-5.5, and online mind to web measures whether the model can complete real tasks on
real live websites. So, things like [clears throat] clicking real buttons, reading real pages, filling out real forms, and reasoning about a live browser environment. So, if you are
browser environment. So, if you are building any kind of browser agent or computer use agent, this is the score that maps to your reliability in the wild, something that you could actually
sell as an agent as a service. And that
becomes more reliable for your clients.
The third one, and people are sleeping on this, is the super agent benchmark score. Opus 4.8 is the only model to
score. Opus 4.8 is the only model to complete every case end-to-end on this one, betting prior Opus models and
parity on costs with GPT-5.5.
Reading those three together, you start what Anthropic is actually claiming.
Opus 4.8 is the most reliable model right now at agentic work, terminal tasks, live web tasks, long horizon end-to-end tasks. That lines up with the
end-to-end tasks. That lines up with the four times bug catching number. The
whole release is one thesis. Basically,
you make the model that is the best partner for long, messy, multi-setup work that actually agents can actually do, and I love that. Now, to talk a little bit about our Shipping School
community and why updates just like Opus 4.8 are super relevant for builders today. This is what we focus on. Right
today. This is what we focus on. Right
now, we're in the middle of a 24-week sprint of really learning how machines work, how AI works, how AI engineering works, and AI inference works. And this
is a syllabus you're looking at right now. If you join the community, this is
now. If you join the community, this is the foundation of which to build upon.
We go over things like what does AI actually mean? And is AI actually
actually mean? And is AI actually intelligent? We go into teaching points,
intelligent? We go into teaching points, beginner analogies, we deep dive into the technical stuff. You know, we go into lesson number two is all about machine learning. Lesson number three is
machine learning. Lesson number three is what is an actual large language model?
How does it operate? We go into AI inference. We also go into module two,
inference. We also go into module two, which is prompting and context engineering. We go into deep diving
engineering. We go into deep diving teaching points of context engineering and how it applies for us builders. We
go into building reusable prompt templates to actually build that foundation layer to help you build. So,
I don't want to go over all of this, but you are getting a ton of useful information to become the builder in this AI age. Like I said it, we're in a 24-week sprint where we get four live
calls every week that are constant courses being updated in the curriculum.
I'll put the link to the community down below. Join over 150 builders building
below. Join over 150 builders building their app using Cloud Code, Hermes, Open Claw. We'll get you set up. You could
Claw. We'll get you set up. You could
always just reach out to me in the community as well. We have four other Here is the syllabus you'll have access to. If you want to learn how to actually
to. If you want to learn how to actually use AI and actually build a business using AI, you want to click the link down below, join the community, and we'll be happy to have you. Let's get
back into the video. Now, I want to get into the features that actually shipped alongside this model because there's a lot of sneaky ones that we're kind of glossing over because this release is
not just a model swap. It's actually a real product with real changes. The
first one is that dynamic workflow we touched on earlier. This is in Claude code. This is basically research preview
code. This is basically research preview still, and it is a pretty big deal though because dynamic workflows can run hundreds of parallel sub-agents in a
single Claude code session, and it can handle code base scale migrations across hundreds of thousands of lines of code.
I know we talked about that, but that's going to have a bigger impact than most people even think. So, you can see here dynamic workflows is one feature that they shipped along with this model. I
mean, one session to to spawn a hundred sub-agents, that scales your code base with fixes and, you know, updates to the code base, I mean, that's just insane.
If you've ever tried to like migrate a large code base to a new framework to like, let's say, Next.js from React, it's intense, and it's a lot. I mean,
you could hit your usage limit if you're on that $100 month plan in 1 hour. So,
basically, this new framework, a new API, in a new lint config, a new typing system, you know, these are the pain points of how much of a pain this used to be. So, you would either have to do
to be. So, you would either have to do it manually to refactor all this, or you write like a custom script that handles maybe 70% of the cases, and then you would spend another 2 weeks chasing the
edge cases just making sure nothing breaks. But, dynamic workflows is
breaks. But, dynamic workflows is basically Anthropic saying, let the agent run migration in parallel across the whole code base, and let it handle
those long tails with judgment instead of regex. So, that's amazing. You know,
of regex. So, that's amazing. You know,
if that works the way they're describing it, I'm still going to test this out today because talking and doing are two different things, but that would be like one of the most powerful single feature in any coding agent right now. For the
Shipping School community members who are working on real code bases, this is the feature to spend time on like today, like right now. You need to check it out. The next one is the effort control,
out. The next one is the effort control, which is pretty cool. You can see here effort control in co-work and Claude AI.
This one is live right now on co-work and Claude and you can get to choose how much effort the model puts into a response. So, higher settings spend more
response. So, higher settings spend more tokens, obviously, but you get a better result. By default, Opus 4.8 is set to
result. By default, Opus 4.8 is set to high effort. That is the best balance
high effort. That is the best balance and that's usually how I operate on Opus 4.7, either high or extra high. For
difficult or long-running async tasks, you can crank it up to an extra effort and let it think harder. It will take longer, but still, I think you'll get better quality output.
[clears throat] The honest read on all this is you get a knob to tell the model how much it should care, essentially. This is great for builders who want one model that
could be fast and cheap for simple stuff and deeply considered for the hard stuff. The third one here, and people
stuff. The third one here, and people just miss this, is the messages API now accepts system entries inside of the messages array, which is pretty amazing.
So, they are accepted inside the messages array. What does that even
messages array. What does that even mean? It just means you can update
mean? It just means you can update mid-task instructions without breaking your prompt cache. That sounds super boring and I know it sounds boring, but it's not boring. If you run agents with
prompt caching to keep costs down, you know that pretty much any change to the system prompt blows the cache. You eat
full context costs. That will hit your token usage limit in hours instead of that 5-hour limit repeatedly and that's annoying. This change actually lets you
annoying. This change actually lets you steer mid-run without paying that tax.
So, if you are running production agents, your cost just dropped basically without you doing anything. So, combine
these three, better model at the same price, you got dynamic workflows for the code-based scale work, you have effort control for cost and depth control, you have messages API update for the mid-run
steering at cost-friendly costs, and that's the real release. Not just the benchmark glamour, which is amazing. I
got excited, too. I mean, who isn't?
But, it's the real deep dive within Opus 4.8 that allows us to be better builders. I want to talk about what the
builders. I want to talk about what the partners are actually saying because Anthropic did load this announcement up with quotes from CTOs and engineers running real production stacks. And
those quotes do matter more than the benchmark slides, I think. You know, you got to read in the details, really. The
proof's in the pudding. So, Cursor said tool calling is meaningful, more efficient, using fewer steps. And that
is Michael Truel, co-founder of Cursor and CEO. So, if Cursor is reporting
and CEO. So, if Cursor is reporting cleaner tool calling, that means real reduction in latency and cost for everyone using Cursor downstream. Devin
said Opus 4.8 uses tool cleanly and follows instructions, that it actually fixes common verbosity and tool calling issues. That is Scott Wu, CEO of
issues. That is Scott Wu, CEO of Cognition. Devin is one of the most
Cognition. Devin is one of the most aggressive autonomous engineering products out there. If you haven't heard about it, you should check it out. They
notice every regression. So, them saying tool calling cleaned up is a strong signal from them. Databricks said it is a step change in agentic reasoning with
61% cheaper token costs than Opus 4.7 on their Genie product. That is Han Lantang, CTO of Neural Networks at Databricks. So, step change is a strong
Databricks. So, step change is a strong word for a CTO to use. And 61% token cost reduction is a huge amount of money. Harvey said better citation
money. Harvey said better citation precision and more token efficiency.
Thomson Reuters co-counsel said meaningful improvements in consistency and reasoning quality for legal work, which is amazing. A staff engineer at Anthropic said the model asks the right
questions, catches its own mistakes, and pushes back when a plan is not sound. I
love that. Stack all these together, tool calling cleaner, step change in agentic reasoning, and 61% cost reduction, better citations, catches its
own mistakes, pushes back on bad plans.
That is the practical version of the 4x bug catching number, and I love it.
Great changes here. There is also a real story on the legal side. Nico Grupen,
head of applied research, said Opus 4.8 is the first model to break 10% overall on the All Standard of the legal agent benchmark. That sounds tiny until you
benchmark. That sounds tiny until you understand what All Pass means. All Pass
means the model has to get every single subtask right, not most of them, and not the ones that matter, but actually all of them. One miss is a failure. In legal
of them. One miss is a failure. In legal
work, that is the actual bar. A contract
analysis misses one clause, it's contract analysis that just lost you a case. That's could be hundreds of
case. That's could be hundreds of thousands dollars. So, being the first
thousands dollars. So, being the first model to break all 10% All Pass on legal is a model saying I can actually do attorney work without you double-checking every single step. That
is the big door opening for AI in legal, which is pretty cool. I want to talk about alignment as well, because this is the part most releases skip past in a single sentence. Entropic says Opus 4.8
single sentence. Entropic says Opus 4.8 reaches new highs on measures of prosocial traits. Misaligned behavior
prosocial traits. Misaligned behavior rates are substantially lower than Opus 4.7, and the rates are similar to the best aligned model, which they called
Claude Mythos preview. Mythos is the next class of model they are working on.
They mentioned in the same announcement that Mythos class models are expected to be coming in the next few weeks, pending cybersecurity safeguards. There are also
cybersecurity safeguards. There are also a Project Glass Week reference. Claude
Mythos preview is being used by select organizations for cybersecurity work.
So, the timeline reads basically like this. Like, we get Opus 4.8 today, which
this. Like, we get Opus 4.8 today, which no one expected. This is the production grade model that already behaves close to Mythos on alignment. So, in the coming weeks, you can expect Mythos
itself being shipped with even higher intelligence above Opus. Though, the
price is yet to be seen, which is kind of scary. And separately, Anthropic is
of scary. And separately, Anthropic is working on cheaper models that hit Opus quality at a lower price, which is great for builders if you're building on a budget, you know? And that is a lot of
forward motion in one release. This was
a big one. If you are betting on Anthropic as your default agent model, this is the signal that the lane is open for the next 12 months. And Anthropic is crushing it right now. The builder
takeaway here is if you're running production agent on Opus 4.7, you need to change the model ID to Opus 4.8 right now. That is the only required change
now. That is the only required change you need to make it to get access to these features. You will get a cleaner
these features. You will get a cleaner tool calling, you will ship less broken code, you will pay the same price. So,
if you are running a fast mode agent, you just got basically three times cost reduction at a two and a half speed gain. Update your fast mode flag now. If
gain. Update your fast mode flag now. If
you are doing any code-based migration work, get on Claude code research preview of the dynamic workflows and then run the parallel sub-agent migration on a real branch and just see if it actually handles your edge cases.
If you're running cost-sensitive workloads, look at your effort control.
Map out your low effort to background batch jobs, map extra effort to your long-running async tasks and default mid for everything else. Why do you need high for simple fixes, right? Only use
it when you need to. So, if you're using prompt caching as well, look at the messages API update. Move your mid-run instructions out of the system prompt and into the messages array. It's going
to be cheaper steering, same control.
And if you're still on Sonnet or any smaller models for serious agent work, I mean, I don't know what you're doing.
This is the release that makes me confident pushing more people to using Opus tier for the hard stuff. The 4x bug caching number alone justifies the price. This is the mono upgrade where
price. This is the mono upgrade where you do not have to wait for the dust to settle. The dust already settled. Get
settle. The dust already settled. Get
going. What are you doing? The partners
are already in production. The pricing
is the same and the model is better.
Make the swap right now. If you want to build with agents the way we do it inside the community like open claw Hermes, Claude code, cursor, whatever you're using codex and running the stack
that we run inside the community, come join us. Like I said, I showed the
join us. Like I said, I showed the syllabus earlier. There's everything
syllabus earlier. There's everything you're going to get to learn. Come build
with with us. We're testing Opus 4.8 this week inside the community. Real
code bases, real people, real agents and if you want to be in the room when those tests land, join us. We'll see in the community and if you haven't liked the video already, please do so. Please
subscribe to the channel. We're going to be dropping a detailed more in-depth video on vibe coding with Opus 4.8 testing all these features on my apps just to see how it performs and we'll
see in the next one guys. Have a blessed day. Happy building.
day. Happy building.
Loading video analysis...