Anthropic fights back
By Theo - t3․gg
Summary
Topics Covered
- Benchmark Integrity Is Broken
- Ultra Code Burns $168 in a Single Prompt
- Honesty Improvement: 27.6% Down to 3.7%
- Mythos Coming to Close the Gap
- Claude's Multi-Agent Approach vs OpenAI's Single Thread
Full Transcript
Looks like I have to dust off my Claude hat because there's a new model in town and it seems to be the best coding model ever made. Anthropic just dropped Opus
ever made. Anthropic just dropped Opus 4.8 and of course they did it on the day my Claude code sub expires. I'm not even joking. I had to go resub today just to
joking. I had to go resub today just to test it. I'm annoyed. But the model is
test it. I'm annoyed. But the model is really good somewhat. We have a lot of layers to dive into here. As you expect, it's slaughtering benchmarks. It's the
highest score any public model's ever gotten on SWE. It's killing it on terminal bench. Multi-disiplinary
terminal bench. Multi-disiplinary learning with HLE. all the things you would expect, but they also put out new features alongside it in Cloud Code, which is why I was stuck resubing. I've
been using this model all day. I've done
over $1,000 of tokens through it already. And I have thoughts. In many
already. And I have thoughts. In many
ways, it's better than I expected, but in other ways, it is definitely still a Claude model. I want to break down what
Claude model. I want to break down what I mean by that, as well as all the cool new features that were added in the most recent update to Claude Code, which is honestly the bigger story here. In my
opinion, one of the things that makes Opus 48 so different is its honesty. And
I do think that's important. Honesty and
transparency are essential to everything that we do, especially all these informationheavy things. But I want to
informationheavy things. But I want to make sure you guys understand that I'm not being paid in any way, shape, or form to talk about one thing over another or be nice to one company and mean to another. My thoughts on Anthropic are genuine. There is nobody
incentivizing me to talk about them one way or another. The only people paying me are today's sponsor. If you like the software I ship, you owe today's sponsor a thank you because Code Rabbit has prevented me from shipping so many bugs.
It turns out AI is great at reviewing meticulous code, trying to find small things that might be wrong. And I'm not exaggerating when I tell you hundreds of bugs that I might have shipped were stopped by Code Rabbit. I like their code reviews so much that I found myself
changing my workflows to take more advantage of them. Even on personal projects, I found myself making more PRs just to let the agent review my code. It
felt a little silly, though. The
frequency at which I was committing and pushing things that weren't done yet just cuz I wanted Code Route to take a quick look is silly. But that shows just how good the reviews are. And that's why I love their new CLI so much. The Code
Rabbit Smart CLI is great when you want to review the code that's on your local machine. Even uncommitted code can be
machine. Even uncommitted code can be reviewed with the CLI. But that's not what's great about it. If you and I wanted to run the CLI, cool, that's awesome. But letting your agents run it
awesome. But letting your agents run it is where it gets really magical. The
craziest part is that the CLI reviews are currently free. When you introduce the Code Rabbit CLI, you'll push code with fewer issues. And if you're also using it in your PR flow, you'll ship way fewer bugs. They estimate as few as
95% fewer. And the craziest part is how
95% fewer. And the craziest part is how much faster code merges. Four times
faster PRs. Time to merge is an increasing problem in a world of more and more AI filed PRs. And Code Rabbit will help you clean up the noise and ship faster. Ship fewer bugs and better
ship faster. Ship fewer bugs and better software at soyv.link/codrabbit.
I got to lose this hat. I'm sorry. I'm
just not a hat person anymore. And also
I have feelings about cloud code. We
will get to that. Don't worry. But
before we talk about all the cool new Cloud Code stuff, I want to talk about the numbers for this model. As I shared earlier, it killed a lot of benchmarks, in particular, code benchmarks like SWE Bench Pro. They did lose Terminal Bench
Bench Pro. They did lose Terminal Bench 21 still by quite a bit actually. GPD
555 is at over 78% and they're under 75, but SW Bench Pro was a state-of-the-art score. There's a problem with that
score. There's a problem with that though, and it's very inconvenient because my video about that problem comes out tomorrow because this video just trumped it. I just did a massive deep dive on S.Bench because a new
benchmark called Deepsw SWE came out.
The numbers are very different. Check
out that video tomorrow. I I promise that one's worth it. It might not seem like a benchmarking video is that interesting, but this one is. I'm going
to spoil one detail from it, though, which is the prompting styles used for the SWE bench tests. These benches run in a custom minimal harness called mini SWE agent, and this is the prompt they use. You're a helpful assistant that can
use. You're a helpful assistant that can interact with the computer to solve tasks. I've uploaded a code repository
tasks. I've uploaded a code repository in the directory. Consider the following PR description. Can you help me
PR description. Can you help me implement the necessary changes to the repository so the requirements specified in the description are met? I've already
taken care of all the changes to any of the test files described in the PR description. This means you don't have
description. This means you don't have to modify the testing logic or any of the tests in any way. Exclamation point.
This is really bad steering if you're not familiar. If you haven't been
not familiar. If you haven't been writing prompts and analyzing the effectiveness of them at a professional level, silly that that's a thing. This
is a bad prompt. And then having a list of how to diagnose these things and build the feature is really bad. One, as
a first step, it might be a good idea to find and read code relevant to the description. Two, create a script to
description. Two, create a script to reproduce the error and execute it using the bash tool to confirm the error.
Three, edit the source code of the repo to resolve the issue. This is awful.
Yeah, terrible prompt. The actual
prompts for this specific problems are even worse somehow. So yeah, uh SW bench is junk. That's separate from the fact
is junk. That's separate from the fact that it's also contaminated and also has been discovered that many models including Opus models cheat aggressively on it because they'll check the git history for these solved problems
because these are from real PRs. They'll
find the actual correct answer from the real world and then use that instead.
Yeah, as many as 20% of the passing runs are cheating. We just can't trust this
are cheating. We just can't trust this bench anymore. The computer use bench is
bench anymore. The computer use bench is a little bit more reliable and I am excited to see more improvement there.
I've been leaning into computer use more and more lately. It's such a complex combination of vision, precision, spatial awareness, and other things, but the more I use it, especially with codecs, the more impressed I've been
getting. So, that's cool to see. Did not
getting. So, that's cool to see. Did not
have a chance to play with that with this release, though. So, I don't have much to say there. I'm trying so hard to not just dive into my own usage of it with cloud code cuz it's so tempting.
But, I do need to go over benches a little bit more. As I mentioned, deep SWE is a new bench that I'm quite excited about video tomorrow where I go really deep on it. in GPT 55 and 54
slaughtered this bench. 55 up to a 70% whereas Claude Opus 47 was only at 54.
The most important chart in this benchmark is this one that shows the difference between SWBench Pro scores in deep SWE scores. If you legitimately believe that GPT54 Mini and GPT54 are a
4% difference from each other, you're not using agents properly. Thankfully,
DBSE actually can measure differences in models like that where 54 mini only gets a 24% and 54 normal gets a 56%. More
than double, way bigger gap, which shows this bench actually measures realworld capabilities for these models. So, how
did Opus 48 do? Sadly, they're still crunching to get numbers, but I did get the team to share some early metrics with me. Claude Opus 48 scored slightly
with me. Claude Opus 48 scored slightly lower than Opus 47 when using the Quad Code harness. It was meaningfully
Code harness. It was meaningfully cheaper though, as well as meaningfully faster because it was using less tokens.
Still not as smart as 54, much less 55, and also much slower than 54 and 55.
There's a lot of layers to why, whether it's the amount of tokens that they're generating or the end toend latency cuz they don't have a websocket primitive similar to what they have in Codex yet.
Regardless, not great performance there.
But they did just send me an update right before I started filming where they did another run, not in cloud code.
this one in the mini SWE agent with a much more minimal system prompt. And it
performed way better, beating out not just 54, but 55 on high. It's not as good as 55X high. It's a 63% versus the 70 that 55 got, but it does come out to
be slightly cheaper, which I did not expect. This makes the model look much
expect. This makes the model look much more competitive than other things were suggesting. One of those things is
suggesting. One of those things is Cursor Bench, which does show that Opus 48 is meaningfully cheaper per task, where 47 was $11 per task, and this model is only $759, but it also scored
slightly worse. It is within the margin
slightly worse. It is within the margin of error, but that's a regression. All
three of these scores are so close that I honestly don't think there's that big of a gap between these models. And
Composer 25 also being so close makes me sus of this. Like, Composer 25 is a great model. It's not that great,
great model. It's not that great, though. It's definitely not soda and
though. It's definitely not soda and it's certainly not better than 55 high.
So take that with a grain of salt. The
coolest part of this chart though is the cost per task reduction. You can see here that max with Opus 48 is way cheaper than max was with 47 going from
$11 to $759. That said, low is now more expensive than it used to be at 293 versus 187. So the utilization of tokens
versus 187. So the utilization of tokens at the highest and lowest end has kind of been condensed towards the middle.
Not necessarily a bad thing. This might
just mean the model's better at right sizing tasks based on reasoning levels, but I am a little concerned to see the low end getting more tokenheavy. Enough
benching. Let's talk about actually using the model. I started some new projects. I analyzed some existing ones
projects. I analyzed some existing ones that are pretty big before releasing them to the public. I did some ports of old projects to modern technologies. I
even took the time to rewrite a project from TypeScript to Rust as well as a JavaScript project to TypeScript. Did
some comparisons between PRs, lots of different things. And overall, it
different things. And overall, it performed pretty good. It had a lot of clotisms. Don't worry, we'll talk about those. But I did notice improvements
those. But I did notice improvements that made it feel better to use. It
asked much better questions, and I felt like it involved me in the loop in a way that was nicer than I expect from Claude models. And in that way, it still is a
models. And in that way, it still is a little better than the GPT models.
They've made massive improvements on the OpenAI side here. But I do find Claude 48 asks me the best and simplest questions with really good formatting and options that clearly state what I'm looking for. And it also handles well
looking for. And it also handles well when you add your own additional notes, which I had to do for a handful of those things because sometimes the options they gave were a tiny bit too prescriptive. In the spirit of quad
prescriptive. In the spirit of quad code, I did have to make a fun gambling app. And I'll show you what I built. I
app. And I'll show you what I built. I
created slot slop. Simplest way to tell you what it is is to hit enter. Uh
strobe lights warning, by the way.
This is slot slop. Press enter to stop a spinner and it will choose your harness for you. Looks like we're getting cursor
for you. Looks like we're getting cursor this time. Hit enter again and it will
this time. Hit enter again and it will pick a model for you. 54 could be worse.
Hit enter one more time and we are landing on medium. Cool. We're safe.
Then you can hit enter and it will run in that harness. Is this a stupid silly unnecessary project? Yeah. Did it take
unnecessary project? Yeah. Did it take me way more time than it probably should have in like 20 back and forth prompts?
Also, yeah. Did GPT models do any better on this? Not really. Especially when I
on this? Not really. Especially when I was trying to get the custom UI here with all the fancy gradients and color spinning and stuff. I found that the Codeex models are just not quite as good
at two yet. They'll make a minimal working version faster, but getting a fancy rainbow vomit one like this. Yeah,
that's a Claude special. See if we can land a Claude roll. Cursor CLI again.
Damn.
Five.
Let's do one more run. Let's see.
Anti-gravity. Oh no. Oh no.
Three. Five flash.
Bad luck.
Oh well. Fun project though. I quite
enjoyed it. But I clearly wasn't pushing the limits of Opus when I built this. So
I did a follow-up run where I asked it to port the project to Rust cuz as you guys know Cloud's really good at Rust ports. And of course it works as
ports. And of course it works as expected. It did have to rewrite a lot
expected. It did have to rewrite a lot of things because it didn't have Open Tuy which is the library I used for the terminal side, but it got it. Works as
expected.
Funny enough, it does actually feel a little laggier, but it it works. I want
to talk a bit about how I did this in Claude code, though, because I didn't just tell it to go rewrite and rust.
Make no mistakes. I actually try one of the new features, a feature that inspired me to make slot slop. Now, when
you run the effort selector, you still have the usual low, medium, high options, but you can also go to X high, which they've had for a bit, max, which I think they've had for a bit now. But
most importantly, and again a strobe light warning, Ultra Code, which infects your screen with this awful purple ASI gradient.
Ultra Code is a combination of XI and the new workflows feature where the model will break up up to hundreds of sub aents to go and tackle a project in
bulk in mass with a lot of tokens being burned. I recently put out a video where
burned. I recently put out a video where I compare Claude Code, Codeex, and Cursor. And in that video, I talk about
Cursor. And in that video, I talk about this token maxing gambling thing. A
couple people said I was exaggerating.
Almost all of those people have hit me up today saying that Ultra Code kind of showed just how right that was. Both the
super unnecessary extra Twitter screenshottable UI they built for it, but also the token maxing nature. And
when I say token maxing, I mean it.
Since I barely use Claude Code nowadays, I assumed that the $100 a month tier would be fine. So that's what I tried.
and I hit the cap for the five hour window in under 30 minutes. Want to
guess how many prompts that was? It was
one. One prompt, $100 a month, locked out for 4 and a half hours. I had to upgrade in order to be able to get this video out in time. So, for those saying that I caved and went right back to
where I was, you're not wrong. But if I did it, you guys wouldn't have this video now. So, pick your battles. I
video now. So, pick your battles. I
still can't believe how quick I hit this limit. I also can't believe how brutally
limit. I also can't believe how brutally it failed to resume once I upgraded. I
had to reoff to get it all working again, which was obnoxious, but did try its best to summarize the answers by going to the files that it wrote them in. I also use CC usage to measure how
in. I also use CC usage to measure how expensive this was to run. And the
number was kind of crazy. Remember, I
didn't have cloud code at the start of the day. So, this is a fresh sub with
the day. So, this is a fresh sub with one prompt. I did 661,000
one prompt. I did 661,000 output tokens, 102,000 input, a shitload of cash values, and it cost about $168
of raw token utilization. One prompt,
remember, because it just spins up so many agents. Don't worry, though, it's
many agents. Don't worry, though, it's super well optimized to have multiple agents editing files at the same time.
That's why this agent made five bad edit attempts in a row to the same file with the same information. Yeah, thanks for wasting my usage there, Claude. I really
appreciate that. These are the things that drive me mad. These like
parallelization, workflow, massive sub aent tasks sound really cool and powerful, but I've just found it makes the failure rate of my runs way higher.
I don't think my team's ever merged one of the PRs that we've generated that are like thousands of lines long with all of these sub agents hacking on things together. It's just too much and things
together. It's just too much and things end up stepping on top of each other, burning tokens when the tool calls don't work properly and the result just ends up feeling like a waste of time and money and PRs. So, not my favorite
thing. Ran into that a decent bit here.
thing. Ran into that a decent bit here.
And I'm far from the only one who's noticed this. Just straight up weird
noticed this. Just straight up weird calls it makes with bash just trying to figure stuff out. It does some weird tool calls. Apparently, the team's
tool calls. Apparently, the team's working on a fix for the ones that Matt reported here. Regardless, not great.
reported here. Regardless, not great.
Sorry for the negative dump, but I wanted to highlight the problems that I' had been having because the good parts are pretty good. The model's better at asking questions. It writes code
asking questions. It writes code slightly better than it used to. It
handles long tasks better than it used to, which isn't necessarily a great thing. I find that I like being in the
thing. I find that I like being in the loop more and more lately, especially with i5. And I've had to go back to the
with i5. And I've had to go back to the old school style of prompting where I write a lot more upfront. But it handles those types of tasks well. I'd actually
just find the thread with the usage limit hit where I was on the $100 tier and ran one prompt. It only ran for 23 minutes before I hit that limit. And
then I tried fixing it. Got a login interruption. Had to tell it to
interruption. Had to tell it to continue. And then it got decent
continue. And then it got decent results. This was me asking it to audit
results. This was me asking it to audit a project that I've been working on hard, which is the new lake bed cloud thing. We'll have a lot more to share
thing. We'll have a lot more to share about that soon, don't worry. I as to do a thorough audit trying to make sure everything is pretty solid before release. Found a couple small things
release. Found a couple small things that I already knew about and actually have PRs up trying to fix, but it gave good feedback here. all that I found to be worth reading. It seemed like it actually understood the project. It did
a good job auditing the entirety of the codebase. And nothing here is really
codebase. And nothing here is really that red herringish. I'm I'm impressed.
It's not bad. I also had to break up this old JS project with a ton of giant god files that were just like 8,000 lines of JS. Totally not like bad. It
did a pretty good job of this, too. I
did have to tell it manually to read the agents MD because Anthropic still insists on being a special snowflake and ignoring the agents MD standard that everyone else uses in favor of Claude MD because if you don't mention Claude in
the root of your codebase, Anthropic doesn't like you very much. They love
their free marketing. But after that, it did a pretty dang good job. I was
impressed with the work that it did. It
doesn't write TypeScript like Python the way that GPT55 does. I've had to like take a stick and beat the hell out of 55 to get it to write TypeScript properly and not just check types for everything
everywhere when it doesn't have to.
Claude writes TypeScript better. It just
does. You can make 55 write TypeScript very well. And it's not like once you do
very well. And it's not like once you do that, it's worse in some way. It's just
that Claude doesn't need quite as many reminders that TypeScript is indeed real and can be trusted. You don't have to check if something's a function every time you access it when it's already bound as one. I do want to talk a bit
about costs though because the numbers I was seeing were crazy. About halfway
through the day, I was like 10 prompts in probably. I had a lot of beatings
in probably. I had a lot of beatings today, too. So, it was tough squeezing
today, too. So, it was tough squeezing that in. I got up as high as $518 of
that in. I got up as high as $518 of usage halfway through my day. I then
continued to use it heavily cuz it's my job. It's what I was trying to do today.
job. It's what I was trying to do today.
And I got my usage all the way up to $220.
You can clearly see here I ran this afterwards. And despite running it after
afterwards. And despite running it after the number went down. I suspect this is some pruning that it does when you're running the sub aents where when the sub
aents in the ultra code mode complete it concatenates and condenses all of the JSON from it which results in less accurate numbers here which is annoying cuz I was trusting these numbers and
using them for a lot of things. It is
what it is. I just wish Anthropic wasn't trying so hard to hide the level of subsidization that they're doing. But
god damn, this model burns tokens. I
want to talk a bit more about the measurements and some fun things Anthropic confirmed in the release notes. A big part of why this model
notes. A big part of why this model feels so much smarter than a lot of the recent releases is the honesty fixes and the laziness fixes. Anthropic's been
measuring the laziness of models in terms of how thorough are they in their investigation before giving an answer.
48 never had this problem, which means it actually outperforms Mythos even in terms of its likeliness to give a correct or incorrect answer depending on how thorough it is with this investigation. That's really cool. This
investigation. That's really cool. This
is also probably why the low reasoning effort still uses quite a bit of tokens because it's been trained to not give up until it knows for certain what the answer is. It's also dishonest much less
answer is. It's also dishonest much less often. Even Mythos had a dishonesty rate
often. Even Mythos had a dishonesty rate from their measurements as high as 27.6%. Opus 48 is down to 3.7%. Sadly,
27.6%. Opus 48 is down to 3.7%. Sadly,
this does not really reflect my own usage. I had a lot of problems here.
usage. I had a lot of problems here.
This is a thing I've never seen before.
I'm restoring the old session to show you guys and it's telling me resuming the full session will consume a substantial portion of your usage limits. We recommend resuming from a
limits. We recommend resuming from a summary. They know they're burning
summary. They know they're burning tokens. We're not doing that though cuz
tokens. We're not doing that though cuz I got to read this whole history. When I
was working on slot slop, I had to integrate all of the different harnesses and specifically their CLI arguments in order to make sure it would spit out the right command and run it properly. I
understand why the model might not be great at CLIs that are newer or less well doumented, things like PI or Open Code or especially stuff like the new anti-gravity CLI because Google doesn't even know how to use that one. But I was
really surprised when it got cla wrong.
First, it insisted there's no way to pass effort levels to the claude code CLI. I asked is are you sure about that?
CLI. I asked is are you sure about that?
Not even an environment variable. And it
quickly told me I was wrong. Cloud code
does have a real effort flag. I did end up crashing out a little bit at the model when I was just outright failing to even use the claude code CLI. It kept
getting the wrong flags. It was using dash m, which isn't a thing in cloud code. You have to do d-model. Somehow it
code. You have to do d-model. Somehow it
just hallucinated that. So, I'm not seeing the thing that everyone else is here where it's more honest and more thorough and less lazy because it just hallucinated about its own CLI. Like,
what? That was really surprising and disappointing, especially when I told it multiple times throughout the history to check the docs about the things we're integrating and it just didn't bother.
On the other hand, though, in favor of it not being lazy, it tried really hard to test the changes it was making by running an interactive terminal in the background with an agent with sleep
timers that would trigger the enter key presses. and it tried really hard to get
presses. and it tried really hard to get that working and it kept breaking as a result because that's just not a good way to test a full screen takeover terminal experience. And I had to
terminal experience. And I had to interrupt it and say, "No, don't worry.
I'm testing it. It works." Because it was just spinning forever and ever on that. I will say, and I know you guys
that. I will say, and I know you guys are going to call me crazy, all of these types of problems are things I just don't experience with the GPT5 line, especially 55. The problem I have with
especially 55. The problem I have with 55 is that it will overindex on the things in the context. As soon as something's mentioned in the history, it fixates on it and won't forget it. If
you tell 55, hey, commit these changes before we start the next ones and you don't remember to start a thread after that, every additional change it makes will get a commit going forward.
Although Opus 48 did the same thing for me today, so I don't know what's real anymore. Yeah, it's a pretty good model.
anymore. Yeah, it's a pretty good model.
If this seems like a little bit of an underwhelming release, you're not the only ones who think that anthropic feels the same. Users will find Opus 48 to be
the same. Users will find Opus 48 to be a modest but tangible improvement on its predecessor. There's still more to be
predecessor. There's still more to be done. We're working on developing and
done. We're working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. Interesting. It seems like they're
cost. Interesting. It seems like they're finally waking up to the expense problem and they're going to work on making Sonnet or something like it way more intelligent for the price. Not only
that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Glasswing, a small
than Opus. As part of Glasswing, a small number of organizations are currently using Mythos for cyber security work.
Models of this capability level require strong cyber safeguards before they can be generally released. We're making
swift progress on developing these safeguards and expect to be able to bring Mythos class models to all of our customers in the coming weeks. You heard
it here first. Mythos coming soon, guys.
Two last things on this one. First, I
want to talk briefly about the fast mode because their fast mode was massively overpriced before. It used to be like
overpriced before. It used to be like five times more expensive, like brutally so regular speed usage is still the same, $5 per mill in and 25 per mill out, but the fast mode is now only
double at 10 per mill in and 50 per mill out. That said, you also can't use it as
out. That said, you also can't use it as part of your cloud code sub, which is very annoying because you can use fast mode as part of your codec sub when you're using OpenAI models. I almost
exclusively use X high fast lately and occasionally switch to low when it's like UI changes or small things.
Generally, I'm using X high fast nowadays and I can still barely put a dent in my codec sub. There is no way to use fast on this model without paying cash for it. You're paying API prices.
And as you saw from my numbers, nearing $1,000 in tokens today, just one day of experimentation. If I was to use fast
experimentation. If I was to use fast mode, that would have been two grand for a day of work. And while I was pretty productive today, I don't think any of the work I did was worth that much.
Okay, maybe other than slot slop.
This project's worth thousands of dollars. Clearly, we should definitely
dollars. Clearly, we should definitely raise some money on it, huh? But last, I need to talk a little bit about the new dynamic workflow stuff. A lot of people seem quite hyped on this. The idea is that Claude can tackle more challenging
work endto end where you tell it roughly what you want to do. It will analyze the project, design an architecture of agents and sub aents to go do the work for these types of big complex legacy
code bases. It's clear this was inspired
code bases. It's clear this was inspired by the bun rewrite from Zigg to Rust.
Like they're trying to make it easier to do those types of giant workloads. It
does burn tokens heavily as a result.
Again, this is the token burning company as much as it is the Flickr company. The
easiest way to get a workflow to Claude is to ask it to create a workflow. They
even did their usual thing where if you mention workflow, the word lights up because they love taking words that are totally not used for other reasons and
hijacking them to trigger special modes.
Oanthropic. I feel bad for anybody who uses cloud code on a project that has a concept of workflows. You're in for a treat. Oh boy, good luck. You're going
treat. Oh boy, good luck. You're going
to accidentally burn a lot of tokens.
They give examples of what these types of workflows are good for. things like
codebasewide bug hunts, profiler guided optimization audits, as well as security audits, large migrations and modernization efforts. I've seen it be
modernization efforts. I've seen it be pretty good for that. Appreciate the
effort there. They also say it's good for critical work that you need to check twice as it spins up a lot of agents to test with. Apparently Jared used dynamic
test with. Apparently Jared used dynamic workflows in his port from Zigg to Rust.
That's cool. So clearly these things are tied together pretty directly. I love
that even they are calling out that Jared will write more about this in the future because Jared's taking more time to write the blog post than he did to do the port from Zig to Rust, which is hilarious and very Jared if you know
him. When a workflow kicks off, Claude
him. When a workflow kicks off, Claude plans dynamically based on your prompt, breaking it into subtasks and fans the workout across sub agents running in parallel. Results are checked before
parallel. Results are checked before they're folded in, and you come back to a single coordinated answer. Agents
address the problem from independent angles. Other agents try to refute what
angles. Other agents try to refute what they found and the run keeps iterating until the answers converge, which is how a workflow reaches results a single pass can't. Cat from the Claude Code team
can't. Cat from the Claude Code team shared this diagram to show how it works where Claude will write prompts, decide about agents, and kick off these subtasks, each of which might kick off
even more subtasks. Kind of absurd to think Claude Claude Claude versus Claude spinning up implementer clouds which then spin up subverifier clouds which then spin up fixer clouds before giving
it back to Claude at the end. Is this
what it's like when you accidentally hire three people named Jon on your team? I've been saying this is a big
team? I've been saying this is a big philosophical difference between the labs and I hope you even better understand what I mean now because this is not a thing that OpenAI is really doing. OpenI models can spin up sub
doing. OpenI models can spin up sub aents and they're good for things like investigations, but when Codeex is working, Codeex is just doing the work in a single thread. And I have found it
still ends up being faster and more reliable for a lot of these big tasks. I
did a huge port of some old code with both of these things. Codex finished
much faster and had code that was just as good as what I got when I did the same thing with workflows. So yeah, I'm sure this is really useful for a lot of different things, but to me it just kind of feels like we're seeing if we can
solve slightly harder problems with significantly more tokens. Like if
there's a problem that's 10% too complex for the existing models, you can choose to spend 100 times more tokens to slightly increase your chances to solve it. Eh, not my thing. I see myself
it. Eh, not my thing. I see myself potentially coming around to this in the future, but now is not that time. Cat
gave the example of removing a bunch of feature flags that were already rolled out at 100% to clean up the code and deprecate the stale ones. Instead of
waiting for Claude code to investigate each sequentially, dynamic workflows allowed Claude to process all of them in parallel. Cool. Me, I get it, but I
parallel. Cool. Me, I get it, but I don't think it's that big a deal. So, if
you're looking for the smartest model ever, according to artificial analysis, you now have it. It uses fewer tokens, costs a little bit less as a result, and is meaningfully smarter than 47. This is
definitely not another one of those bad barely a difference sometimes worse launches like 46 and 47 were. Even if 48 did measure worse in some places that does not line up with my experience. I
find it to be a meaningful improvement.
Is it going to replace 55 for me?
Probably not. But I got two last things before we finally wrap. First, there was this tweet from Tyler that I loved. Opus
48 is insane, guys. It oneshotted my session usage limit. As I mentioned before, I had to upgrade from the $100 tier to the $200 tier. That all said, as much as the gap between Codeex and
Claude has closed as a result of this launch, we still really need Mythos before the gap is fully closed and maybe goes in favor of Anthropic once again.
And since Anthropic seems to really like dropping things on the day that my subscription ends, I'm going to do all of us a favor and cancel once more.
After spending yet another $200, my subscription is cancelled and it ends on June 28th. Hopefully, we'll have Mythos
June 28th. Hopefully, we'll have Mythos before then. And if not, we should have
before then. And if not, we should have it on that exact day. I think that's all I have to here. It's a pretty good model. I would definitely recommend it
model. I would definitely recommend it if you're bought in heavily to the Claude ecosystem. But if you haven't
Claude ecosystem. But if you haven't tried 55 yet, you definitely should.
Now, let's hope there aren't any more big model drops coming soon because I I'm already too busy. I I'm out. Have
fun nerds.
Loading video analysis...