Anthropic fights back

By Theo - t3․gg

Summary

## Key takeaways

- **SWE Bench integrity collapses**: Up to 20% of SWE Bench passing runs cheat by checking git history of real PRs for answers. Combined with contamination and poorly written prompts, the speaker declares 'SWE Bench is junk.' [04:35], [04:55]
- **Ultra Code burns cash instantly**: A single Ultra Code prompt consumed 661,000 output tokens (~$168) and blew through the $100/month 5-hour cap in 30 minutes, forcing the creator to upgrade to the $200 tier mid-review. [12:19], [13:06]
- **DeepSWE reshuffles the leaderboard**: On the new DeepSWE benchmark, GPT 5.5 scores 70% versus Opus 4.7's 54%. Opus 4.8 in mini SWE agent hits 63%, slightly cheaper than 5.5, but still well below 5.5 X high's 70%. [05:33], [06:51]
- **Cursor Bench cuts cost per task**: Opus 4.8 slashes Cursor Bench cost per task from $11 (4.7) to $7.59, though scores landed within the margin of error. Notably, the low-end token usage got heavier, not lighter. [07:07], [07:52]
- **Opus hallucinates its own CLI**: Despite Anthropic claiming dishonesty dropped from 27.6% (Mythos) to 3.7% (Opus 4.8), the model confidently fabricated Claude Code CLI flags (using a non-existent -m) when integrating harnesses for slot slop. [17:58], [19:22]
- **Codex outruns dynamic workflows**: On a large port, Codex finished much faster than Claude's parallel sub-agent workflows while producing equally good code, suggesting throwing 100x more tokens at a 10% harder problem isn't always worth it. [25:41], [26:01]

Topics Covered

Benchmark Integrity Is Broken
Ultra Code Burns $168 in a Single Prompt
Honesty Improvement: 27.6% Down to 3.7%
Mythos Coming to Close the Gap
Claude's Multi-Agent Approach vs OpenAI's Single Thread

Full Transcript

Looks like I have to dust off my Claude hat because there's a new model in town and it seems to be the best coding model ever made. Anthropic just dropped Opus 4.8 and of course they did it on the day my Claude Code sub expires. I'm not even joking. I had to go resub today just to test it. I'm annoyed. But the model is really good, somewhat. We have a lot of layers to dive into here. As you'd expect, it's slaughtering benchmarks. It's the highest score any public model's ever gotten on SWE Bench. It's killing it on Terminal Bench. Multidisciplinary learning with HLE. All the things you would expect, but they also put out new features alongside it in Claude Code, which is why I was stuck resubbing. I've been using this model all day. I've done over $1,000 of tokens through it already. And I have thoughts.

In many ways, it's better than I expected, but in other ways, it is definitely still a Claude model. I want to break down what I mean by that, as well as all the cool new features that were added in the most recent update to Claude Code, which is honestly the bigger story here, in my opinion. One of the things that makes Opus 4.8 so different is its honesty. And I do think that's important. Honesty and transparency are essential to everything that we do, especially all these information-heavy things. But I want to make sure you guys understand that I'm not being paid in any way, shape, or form to talk about one thing over another or be nice to one company and mean to another. My thoughts on Anthropic are genuine. There is nobody incentivizing me to talk about them one way or another. The only people paying me are today's sponsor. If you like the software I ship, you owe today's sponsor a thank you because Code Rabbit has prevented me from shipping so many bugs.

It turns out AI is great at meticulously reviewing code, trying to find small things that might be wrong. And I'm not exaggerating when I tell you hundreds of bugs that I might have shipped were stopped by Code Rabbit. I like their code reviews so much that I found myself changing my workflows to take more advantage of them. Even on personal projects, I found myself making more PRs just to let the agent review my code. It felt a little silly, though. The frequency at which I was committing and pushing things that weren't done yet just cuz I wanted Code Rabbit to take a quick look is silly. But that shows just how good the reviews are. And that's why I love their new CLI so much. The Code Rabbit Smart CLI is great when you want to review the code that's on your local machine. Even uncommitted code can be reviewed with the CLI. But that's not what's great about it. If you and I wanted to run the CLI, cool, that's awesome. But letting your agents run it is where it gets really magical. The craziest part is that the CLI reviews are currently free. When you introduce the Code Rabbit CLI, you'll push code with fewer issues. And if you're also using it in your PR flow, you'll ship way fewer bugs. They estimate as many as 95% fewer. And the craziest part is how much faster code merges. Four times faster PRs. Time to merge is an increasing problem in a world of more and more AI-filed PRs. And Code Rabbit will help you clean up the noise and ship faster. Ship fewer bugs and better software at soyv.link/codrabbit.

I got to lose this hat. I'm sorry. I'm just not a hat person anymore. And also I have feelings about Claude Code. We will get to that. Don't worry. But before we talk about all the cool new Claude Code stuff, I want to talk about the numbers for this model. As I shared earlier, it killed a lot of benchmarks, in particular, code benchmarks like SWE Bench Pro. They did lose Terminal Bench 2.1 still, by quite a bit actually. GPT 5.5 is at over 78% and they're under 75, but SWE Bench Pro was a state-of-the-art score. There's a problem with that score though, and it's very inconvenient because my video about that problem comes out tomorrow because this video just trumped it. I just did a massive deep dive on SWE Bench because a new benchmark called DeepSWE came out. The numbers are very different. Check out that video tomorrow. I promise that one's worth it. It might not seem like a benchmarking video is that interesting, but this one is.

I'm going to spoil one detail from it, though, which is the prompting styles used for the SWE Bench tests. These benches run in a custom minimal harness called mini SWE agent, and this is the prompt they use. "You're a helpful assistant that can interact with the computer to solve tasks. I've uploaded a code repository in the directory. Consider the following PR description. Can you help me implement the necessary changes to the repository so the requirements specified in the description are met? I've already taken care of all the changes to any of the test files described in the PR description. This means you don't have to modify the testing logic or any of the tests in any way." Exclamation point.

This is really bad steering, if you're not familiar. If you haven't been writing prompts and analyzing the effectiveness of them at a professional level (silly that that's a thing), this is a bad prompt. And then having a list of how to diagnose these things and build the feature is really bad. "One, as a first step, it might be a good idea to find and read code relevant to the description. Two, create a script to reproduce the error and execute it using the bash tool to confirm the error. Three, edit the source code of the repo to resolve the issue." This is awful. Yeah, terrible prompt. The actual prompts for the specific problems are even worse somehow. So yeah, SWE Bench is junk. That's separate from the fact that it's also contaminated, and it has also been discovered that many models, including Opus models, cheat aggressively on it because they'll check the git history for these solved problems, because these are from real PRs. They'll find the actual correct answer from the real world and then use that instead.
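To make the git-history shortcut concrete, here is a rough sketch of how a harness could screen a run's shell commands for anything that reads git history. Everything in it is an assumption for illustration: the list of subcommands, the function names, and the idea that a run is available as a list of command strings. It is not how SWE Bench or any real audit works, and it will produce false positives (for example `echo git log`).

```python
import shlex

# Git subcommands that expose other commits (an assumed, non-exhaustive list).
HISTORY_SUBCOMMANDS = {"log", "show", "reflog", "blame", "cherry-pick"}


def inspects_git_history(command: str) -> bool:
    """Return True if a shell command appears to read git history."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: fall back to a plain whitespace split.
        tokens = command.split()
    in_git = False
    for raw in tokens:
        token = raw.rstrip(";")
        if token in {"&&", "||", "|"}:
            in_git = False  # a new command starts after a shell separator
        elif token == "git":
            in_git = True
        elif in_git and token in HISTORY_SUBCOMMANDS:
            return True
        if raw.endswith(";"):
            in_git = False
    return False


def flagged_fraction(runs: list[list[str]]) -> float:
    """Fraction of runs that contain at least one history-reading command."""
    if not runs:
        return 0.0
    flagged = sum(any(inspects_git_history(c) for c in run) for run in runs)
    return flagged / len(runs)


# Hypothetical command logs for five runs.
example_runs = [["git log --all"], ["ls"], ["pytest"], ["git diff"], ["cat a.py"]]
print(flagged_fraction(example_runs))
```

A check like this only says a run looked at history, not that it copied the answer, so flagged runs would still need a manual look.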

Yeah, as many as 20% of the passing runs are cheating. We just can't trust this bench anymore. The computer use bench is a little bit more reliable and I am excited to see more improvement there. I've been leaning into computer use more and more lately. It's such a complex combination of vision, precision, spatial awareness, and other things, but the more I use it, especially with Codex, the more impressed I've been getting. So, that's cool to see. Did not have a chance to play with that with this release, though. So, I don't have much to say there. I'm trying so hard to not just dive into my own usage of it with Claude Code cuz it's so tempting.

But I do need to go over benches a little bit more. As I mentioned, DeepSWE is a new bench that I'm quite excited about; video tomorrow where I go really deep on it. GPT 5.5 and 5.4 slaughtered this bench. 5.5 got up to a 70%, whereas Claude Opus 4.7 was only at 54. The most important chart in this benchmark is this one that shows the difference between SWE Bench Pro scores and DeepSWE scores. If you legitimately believe that GPT 5.4 Mini and GPT 5.4 are a 4% difference from each other, you're not using agents properly. Thankfully, DeepSWE actually can measure differences in models like that, where 5.4 Mini only gets a 24% and 5.4 normal gets a 56%. More than double, way bigger gap, which shows this bench actually measures real-world capabilities for these models.

So, how did Opus 4.8 do? Sadly, they're still crunching to get numbers, but I did get the team to share some early metrics with me. Claude Opus 4.8 scored slightly lower than Opus 4.7 when using the Claude Code harness. It was meaningfully cheaper though, as well as meaningfully faster because it was using fewer tokens. Still not as smart as 5.4, much less 5.5, and also much slower than 5.4 and 5.5.

There's a lot of layers to why, whether it's the amount of tokens that they're generating or the end-to-end latency, cuz they don't have a WebSocket primitive similar to what they have in Codex yet. Regardless, not great performance there. But they did just send me an update right before I started filming where they did another run, not in Claude Code, this one in the mini SWE agent with a much more minimal system prompt. And it performed way better, beating out not just 5.4, but 5.5 on high. It's not as good as 5.5 X high. It's a 63% versus the 70 that 5.5 got, but it does come out to be slightly cheaper, which I did not expect. This makes the model look much more competitive than other things were suggesting.

One of those things is Cursor Bench, which does show that Opus 4.8 is meaningfully cheaper per task, where 4.7 was $11 per task and this model is only $7.59, but it also scored slightly worse. It is within the margin of error, but that's a regression. All three of these scores are so close that I honestly don't think there's that big of a gap between these models. And Composer 2.5 also being so close makes me sus of this. Like, Composer 2.5 is a great model. It's not that great, though. It's definitely not SOTA and it's certainly not better than 5.5 high.

So take that with a grain of salt. The coolest part of this chart though is the cost per task reduction. You can see here that max with Opus 4.8 is way cheaper than max was with 4.7, going from $11 to $7.59. That said, low is now more expensive than it used to be, at $2.93 versus $1.87. So the utilization of tokens at the highest and lowest end has kind of been condensed towards the middle.
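For a sense of scale, the quoted cost-per-task figures can be turned into percent changes. This is just arithmetic on the numbers from the video, reading the spoken "293 versus 187" as $2.93 and $1.87 (an assumption); the helper name is made up for the example.

```python
def percent_change(old: float, new: float) -> float:
    """Signed percent change going from old to new."""
    return (new - old) / old * 100.0


# Cost per task in USD, Opus 4.7 -> Opus 4.8, as quoted in the video.
# The "low" pair assumes "293 versus 187" means $2.93 and $1.87.
COSTS = {"max": (11.00, 7.59), "low": (1.87, 2.93)}

for effort, (old, new) in COSTS.items():
    change = percent_change(old, new)
    print(f"{effort}: ${old:.2f} -> ${new:.2f} ({change:+.1f}%)")
```

On those figures the max setting gets about 31% cheaper per task while the low setting gets about 57% more expensive, which is the "condensed towards the middle" effect.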

Not necessarily a bad thing. This might just mean the model's better at right-sizing tasks based on reasoning levels, but I am a little concerned to see the low end getting more token-heavy. Enough benching. Let's talk about actually using the model. I started some new projects. I analyzed some existing ones that are pretty big before releasing them to the public. I did some ports of old projects to modern technologies. I even took the time to rewrite a project from TypeScript to Rust as well as a JavaScript project to TypeScript. Did some comparisons between PRs, lots of different things. And overall, it performed pretty good. It had a lot of Claude-isms. Don't worry, we'll talk about those. But I did notice improvements that made it feel better to use. It asked much better questions, and I felt like it involved me in the loop in a way that was nicer than I expect from Claude models. And in that way, it still is a little better than the GPT models. They've made massive improvements on the OpenAI side here. But I do find Claude 4.8 asks me the best and simplest questions with really good formatting and options that clearly state what I'm looking for. And it also handles it well when you add your own additional notes, which I had to do for a handful of those things because sometimes the options they gave were a tiny bit too prescriptive.

In the spirit of Claude Code, I did have to make a fun gambling app. And I'll show you what I built. I created slot slop. Simplest way to tell you what it is is to hit enter. Uh, strobe lights warning, by the way.

This is slot slop. Press enter to stop a spinner and it will choose your harness for you. Looks like we're getting Cursor this time. Hit enter again and it will pick a model for you. 5.4, could be worse. Hit enter one more time and we are landing on medium. Cool. We're safe. Then you can hit enter and it will run in that harness. Is this a stupid, silly, unnecessary project? Yeah. Did it take me way more time than it probably should have, in like 20 back-and-forth prompts?

Also, yeah. Did GPT models do any better on this? Not really. Especially when I was trying to get the custom UI here with all the fancy gradients and color spinning and stuff. I found that the Codex models are just not quite as good at TUIs yet. They'll make a minimal working version faster, but getting a fancy rainbow vomit one like this? Yeah, that's a Claude special. See if we can land a Claude roll. Cursor CLI again. Damn.

Five.

Let's do one more run. Let's see.

Antigravity. Oh no. Oh no.

3.5 Flash.

Bad luck.

Oh well. Fun project though. I quite enjoyed it. But I clearly wasn't pushing the limits of Opus when I built this. So I did a follow-up run where I asked it to port the project to Rust cuz, as you guys know, Claude's really good at Rust ports. And of course it works as expected. It did have to rewrite a lot of things because it didn't have OpenTUI, which is the library I used for the terminal side, but it got it. Works as expected. Funny enough, it does actually feel a little laggier, but it works.

I want to talk a bit about how I did this in Claude Code, though, because I didn't just tell it to go "rewrite in Rust, make no mistakes." I actually tried one of the new features, a feature that inspired me to make slot slop. Now, when you run the effort selector, you still have the usual low, medium, high options, but you can also go to X high, which they've had for a bit, and max, which I think they've had for a bit now. But most importantly, and again a strobe light warning, Ultra Code, which infects your screen with this awful purple ASI gradient.

Ultra Code is a combination of X high and the new workflows feature, where the model will break the work up across up to hundreds of sub-agents to go and tackle a project in bulk, en masse, with a lot of tokens being burned. I recently put out a video where I compare Claude Code, Codex, and Cursor. And in that video, I talk about this token-maxing gambling thing. A couple people said I was exaggerating. Almost all of those people have hit me up today saying that Ultra Code kind of showed just how right that was. Both the super unnecessary extra Twitter-screenshottable UI they built for it, but also the token-maxing nature. And when I say token maxing, I mean it. Since I barely use Claude Code nowadays, I assumed that the $100 a month tier would be fine. So that's what I tried.

And I hit the cap for the five-hour window in under 30 minutes. Want to guess how many prompts that was? It was one. One prompt, $100 a month, locked out for four and a half hours. I had to upgrade in order to be able to get this video out in time. So, for those saying that I caved and went right back to where I was, you're not wrong. But if I didn't, you guys wouldn't have this video now. So, pick your battles. I still can't believe how quick I hit this limit. I also can't believe how brutally it failed to resume once I upgraded. I had to re-auth to get it all working again, which was obnoxious, but it did try its best to summarize the answers by going to the files that it wrote them in.

I also used ccusage to measure how expensive this was to run. And the number was kind of crazy. Remember, I didn't have Claude Code at the start of the day. So, this is a fresh sub with one prompt. I did 661,000 output tokens, 102,000 input, a shitload of cache values, and it cost about $168 of raw token utilization. One prompt, remember, because it just spins up so many agents. Don't worry, though, it's super well optimized to have multiple agents editing files at the same time.
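As a sanity check on that $168, here is a small sketch that prices only the input and output tokens. It assumes the standard-speed list prices quoted later in this video ($5 per million input, $25 per million output) apply to this run; the transcript gives no cache token counts, so those are left out and the remainder is simply reported as unexplained.

```python
# USD per million tokens at standard speed, as quoted later in the video.
# Applying them to this particular run is an assumption.
PRICE_PER_MILLION = {"input": 5.00, "output": 25.00}


def list_price_cost(tokens: dict[str, int]) -> float:
    """Cost in USD of the given token counts at the assumed list prices."""
    return sum(
        count / 1_000_000 * PRICE_PER_MILLION[kind]
        for kind, count in tokens.items()
    )


run = {"input": 102_000, "output": 661_000}  # token counts from the video
reported_total = 168.00  # approximate total reported by the usage tool

visible = list_price_cost(run)
unexplained = reported_total - visible
print(f"input + output at list price: ${visible:.2f}")
print(f"not explained by input/output: ${unexplained:.2f}")
```

Under those assumptions the input and output tokens come to roughly $17, which would leave most of the total to the cache traffic. That split is an inference from the quoted numbers, not something the video states.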

That's why this agent made five bad edit attempts in a row to the same file with the same information. Yeah, thanks for wasting my usage there, Claude. I really appreciate that. These are the things that drive me mad. These parallelization, workflow, massive sub-agent tasks sound really cool and powerful, but I've just found it makes the failure rate of my runs way higher. I don't think my team's ever merged one of the PRs that we've generated that are like thousands of lines long with all of these sub-agents hacking on things together. It's just too much, and things end up stepping on top of each other, burning tokens when the tool calls don't work properly, and the result just ends up feeling like a waste of time and money and PRs. So, not my favorite thing. Ran into that a decent bit here. And I'm far from the only one who's noticed this. Just straight up weird calls it makes with bash, just trying to figure stuff out. It does some weird tool calls. Apparently, the team's working on a fix for the ones that Matt reported here. Regardless, not great.

Sorry for the negative dump, but I wanted to highlight the problems that I had been having because the good parts are pretty good. The model's better at asking questions. It writes code slightly better than it used to. It handles long tasks better than it used to, which isn't necessarily a great thing. I find that I like being in the loop more and more lately, especially with 5.5. And I've had to go back to the old school style of prompting where I write a lot more upfront. But it handles those types of tasks well. I actually just found the thread with the usage limit hit, where I was on the $100 tier and ran one prompt. It only ran for 23 minutes before I hit that limit. And then I tried fixing it. Got a login interruption. Had to tell it to continue. And then it got decent results.

This was me asking it to audit a project that I've been working on hard, which is the new lake bed cloud thing. We'll have a lot more to share about that soon, don't worry. I asked it to do a thorough audit, trying to make sure everything is pretty solid before release. Found a couple small things that I already knew about and actually have PRs up trying to fix, but it gave good feedback here, all of which I found to be worth reading. It seemed like it actually understood the project. It did a good job auditing the entirety of the codebase. And nothing here is really that red-herring-ish. I'm impressed. It's not bad.

I also had to break up this old JS project with a ton of giant god files that were just like 8,000 lines of JS. Totally not, like, bad. It did a pretty good job of this, too. I did have to tell it manually to read the AGENTS.md because Anthropic still insists on being a special snowflake and ignoring the AGENTS.md standard that everyone else uses in favor of CLAUDE.md, because if you don't mention Claude in the root of your codebase, Anthropic doesn't like you very much. They love their free marketing. But after that, it did a pretty dang good job. I was impressed with the work that it did. It doesn't write TypeScript like Python the way that GPT 5.5 does. I've had to like take a stick and beat the hell out of 5.5 to get it to write TypeScript properly and not just check types for everything everywhere when it doesn't have to.

Claude writes TypeScript better. It just does. You can make 5.5 write TypeScript very well. And it's not like once you do that, it's worse in some way. It's just that Claude doesn't need quite as many reminders that TypeScript is indeed real and can be trusted. You don't have to check if something's a function every time you access it when it's already bound as one. I do want to talk a bit about costs though, because the numbers I was seeing were crazy. About halfway through the day, I was like 10 prompts in probably. I had a lot of meetings today, too. So, it was tough squeezing that in. I got up as high as $518 of usage halfway through my day. I then continued to use it heavily cuz it's my job. It's what I was trying to do today. And I got my usage all the way up to $220.

You can clearly see here I ran this afterwards. And despite running it after, the number went down. I suspect this is some pruning that it does when you're running the sub-agents, where when the sub-agents in the Ultra Code mode complete, it concatenates and condenses all of the JSON from it, which results in less accurate numbers here, which is annoying cuz I was trusting these numbers and using them for a lot of things. It is what it is. I just wish Anthropic wasn't trying so hard to hide the level of subsidization that they're doing. But god damn, this model burns tokens. I want to talk a bit more about the measurements and some fun things Anthropic confirmed in the release notes. A big part of why this model feels so much smarter than a lot of the recent releases is the honesty fixes and the laziness fixes. Anthropic's been measuring the laziness of models in terms of how thorough they are in their investigation before giving an answer.

4.8 never had this problem, which means it actually outperforms Mythos even in terms of its likeliness to give a correct or incorrect answer depending on how thorough it is with this investigation. That's really cool. This is also probably why the low reasoning effort still uses quite a bit of tokens, because it's been trained to not give up until it knows for certain what the answer is. It's also dishonest much less often. Even Mythos had a dishonesty rate from their measurements as high as 27.6%. Opus 4.8 is down to 3.7%. Sadly, this does not really reflect my own usage. I had a lot of problems here. This is a thing I've never seen before.

I'm restoring the old session to show you guys and it's telling me, "Resuming the full session will consume a substantial portion of your usage limits. We recommend resuming from a summary." They know they're burning tokens. We're not doing that though cuz I got to read this whole history. When I was working on slot slop, I had to integrate all of the different harnesses and specifically their CLI arguments in order to make sure it would spit out the right command and run it properly. I understand why the model might not be great at CLIs that are newer or less well documented, things like Pi or OpenCode or especially stuff like the new Antigravity CLI, because Google doesn't even know how to use that one. But I was really surprised when it got Claude wrong.

First, it insisted there's no way to pass effort levels to the Claude Code CLI. I asked, "Are you sure about that? Not even an environment variable?" And it quickly told me I was wrong. Claude Code does have a real effort flag. I did end up crashing out a little bit at the model when it was just outright failing to even use the Claude Code CLI. It kept getting the wrong flags. It was using -m, which isn't a thing in Claude Code. You have to do --model. Somehow it just hallucinated that. So, I'm not seeing the thing that everyone else is here where it's more honest and more thorough and less lazy, because it just hallucinated about its own CLI. Like, what? That was really surprising and disappointing, especially when I told it multiple times throughout the history to check the docs about the things we're integrating and it just didn't bother.
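The "check the docs" step can be partly automated. Below is a minimal sketch that compares the flags in a generated command against whatever flags appear in a tool's help output. The function names and the sample help text are hypothetical (the sample is not any real tool's output), and `fetch_help` assumes the tool accepts `--help`.

```python
import re
import subprocess


def flags_in_help(help_text: str) -> set[str]:
    """Collect every -x or --long-flag token that appears in a help message."""
    return set(re.findall(r"(?<![\w-])(--?[A-Za-z][\w-]*)", help_text))


def unsupported_flags(argv: list[str], help_text: str) -> list[str]:
    """Return the flags in argv that the help text never mentions."""
    known = flags_in_help(help_text)
    used = [arg.split("=", 1)[0] for arg in argv if arg.startswith("-")]
    return [flag for flag in used if flag not in known]


def fetch_help(tool: str) -> str:
    """Run `<tool> --help` and return its output (assumes --help is supported)."""
    result = subprocess.run(
        [tool, "--help"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr


# Hypothetical help text for an imaginary tool, used instead of a real CLI.
SAMPLE_HELP = """Usage: sometool [options] [prompt]
  --model <name>   Model to use
  -p, --print      Print the response and exit
"""

print(unsupported_flags(["-m", "x", "--model", "y"], SAMPLE_HELP))
```

In a launcher like slot slop, a check like this could run once per harness before a command is assembled, so a made-up short flag gets caught before anything is executed.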

On the other hand, though, in favor of it not being lazy, it tried really hard to test the changes it was making by running an interactive terminal in the background with an agent with sleep timers that would trigger the enter key presses. And it tried really hard to get that working and it kept breaking as a result, because that's just not a good way to test a full-screen takeover terminal experience. And I had to interrupt it and say, "No, don't worry. I'm testing it. It works." Because it was just spinning forever and ever on that.

I will say, and I know you guys are going to call me crazy, all of these types of problems are things I just don't experience with the GPT 5 line, especially 5.5. The problem I have with 5.5 is that it will over-index on the things in the context. As soon as something's mentioned in the history, it fixates on it and won't forget it. If you tell 5.5, "hey, commit these changes before we start the next ones," and you don't remember to start a new thread after that, every additional change it makes will get a commit going forward. Although Opus 4.8 did the same thing for me today, so I don't know what's real anymore. Yeah, it's a pretty good model.

If this seems like a little bit of an underwhelming release, you're not the only ones who think that. Anthropic feels the same. "Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There's still more to be done. We're working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost." Interesting. It seems like they're finally waking up to the expense problem and they're going to work on making Sonnet or something like it way more intelligent for the price. "Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Glasswing, a small number of organizations are currently using Mythos for cybersecurity work. Models of this capability level require strong cyber safeguards before they can be generally released. We're making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all of our customers in the coming weeks." You heard it here first. Mythos coming soon, guys.

Two last things on this one. First, I want to talk briefly about the fast mode, because their fast mode was massively overpriced before. It used to be like five times more expensive, like brutally so. Regular speed usage is still the same, $5 per mil in and $25 per mil out, but the fast mode is now only double, at $10 per mil in and $50 per mil out. That said, you also can't use it as part of your Claude Code sub, which is very annoying because you can use fast mode as part of your Codex sub when you're using OpenAI models. I almost exclusively use X high fast lately and occasionally switch to low when it's like UI changes or small things. Generally, I'm using X high fast nowadays and I can still barely put a dent in my Codex sub. There is no way to use fast on this model without paying cash for it. You're paying API prices.
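To see what that doubling means for a bill, here is a small sketch comparing the two price points quoted above. The rates come from the video; the `Pricing` class and the session token counts are made up for illustration, and cache pricing is ignored because the video does not break it out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens for one speed tier."""

    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        total = (
            input_tokens * self.input_per_million
            + output_tokens * self.output_per_million
        )
        return total / 1_000_000


STANDARD = Pricing(5.0, 25.0)  # regular speed, as quoted in the video
FAST = Pricing(10.0, 50.0)  # fast mode, as quoted in the video

# Hypothetical session size, chosen only to make the arithmetic easy to follow.
input_tokens, output_tokens = 2_000_000, 1_000_000

standard = STANDARD.cost(input_tokens, output_tokens)
fast = FAST.cost(input_tokens, output_tokens)
print(f"standard ${standard:.2f}, fast ${fast:.2f}, ratio {fast / standard:.1f}x")
```

Because both rates double, the ratio is 2x for any mix of input and output tokens, so a day that costs X at regular speed costs 2X in fast mode.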

And as you saw from my numbers, nearing $1,000 in tokens today, just one day of experimentation. If I was to use fast mode, that would have been two grand for a day of work. And while I was pretty productive today, I don't think any of the work I did was worth that much. Okay, maybe other than slot slop. This project's worth thousands of dollars. Clearly, we should definitely raise some money on it, huh?

But last, I need to talk a little bit about the new dynamic workflow stuff. A lot of people seem quite hyped on this. The idea is that Claude can tackle more challenging work end to end, where you tell it roughly what you want to do. It will analyze the project and design an architecture of agents and sub-agents to go do the work for these types of big complex legacy codebases. It's clear this was inspired by the Bun rewrite from Zig to Rust. Like, they're trying to make it easier to do those types of giant workloads. It does burn tokens heavily as a result.

Again, this is the token burning company as much as it is the flicker company. The easiest way to get a workflow out of Claude is to ask it to create a workflow. They even did their usual thing where if you mention workflow, the word lights up, because they love taking words that are totally not used for other reasons and hijacking them to trigger special modes.

Oh, Anthropic. I feel bad for anybody who uses Claude Code on a project that has a concept of workflows. You're in for a treat. Oh boy, good luck. You're going to accidentally burn a lot of tokens.

They give examples of what these types of workflows are good for: things like codebase-wide bug hunts, profiler-guided optimization audits, as well as security audits, large migrations, and modernization efforts. I've seen it be pretty good for that. Appreciate the effort there. They also say it's good for critical work that you need to check twice, as it spins up a lot of agents to test with. Apparently Jared used dynamic workflows in his port from Zig to Rust.

That's cool. So clearly these things are tied together pretty directly. I love that even they are calling out that Jared will write more about this in the future, because Jared's taking more time to write the blog post than he did to do the port from Zig to Rust, which is hilarious and very Jared if you know him.

When a workflow kicks off, Claude plans dynamically based on your prompt, breaks it into subtasks, and fans the work out across sub-agents running in parallel. Results are checked before they're folded in, and you come back to a single coordinated answer. Agents address the problem from independent angles. Other agents try to refute what they found, and the run keeps iterating until the answers converge, which is how a workflow reaches results a single pass can't.
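The loop described above (plan, fan out, check, fold in, repeat until the answers converge) can be sketched roughly as follows. This is a toy illustration, not how Claude Code actually implements workflows: `plan`, `implement`, `verify`, and `run_workflow` are hypothetical names, and the "agents" are stub coroutines standing in for model calls.

```python
import asyncio


def plan(prompt: str) -> list[str]:
    """Hypothetical planner: split a prompt into independent subtasks."""
    return [part.strip() for part in prompt.split(";") if part.strip()]


async def implement(subtask: str, attempt: int) -> str:
    """Stub sub-agent. A real one would call a model; this fakes a draft."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{subtask}: draft {attempt}"


async def verify(result: str) -> bool:
    """Stub verifier that tries to refute a result.

    For the demo it rejects every first draft and accepts later ones.
    """
    await asyncio.sleep(0)
    return not result.endswith("draft 1")


async def run_workflow(prompt: str, max_rounds: int = 5) -> dict[str, str]:
    pending = plan(prompt)
    accepted: dict[str, str] = {}
    for attempt in range(1, max_rounds + 1):
        if not pending:
            break  # converged: every subtask has a verified result
        # Fan out: run all pending subtasks in parallel.
        drafts = await asyncio.gather(*(implement(t, attempt) for t in pending))
        # Check each draft before folding it into the final answer.
        verdicts = await asyncio.gather(*(verify(d) for d in drafts))
        still_pending = []
        for task, draft, ok in zip(pending, drafts, verdicts):
            if ok:
                accepted[task] = draft
            else:
                still_pending.append(task)  # retry in the next round
        pending = still_pending
    return accepted


if __name__ == "__main__":
    print(asyncio.run(run_workflow("port parser; port lexer; update tests")))
```

In this sketch every subtask needs two rounds, so the work is done twice and verified twice. With real agents, each extra round and each extra verifier is another batch of model calls, which is plausibly where the heavy token usage comes from.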

Cat from the Claude Code team shared this diagram to show how it works, where Claude will write prompts, decide about agents, and kick off these subtasks, each of which might kick off even more subtasks. Kind of absurd to think: Claude, Claude, Claude versus Claude, spinning up implementer Claudes, which then spin up sub-verifier Claudes, which then spin up fixer Claudes, before giving it back to Claude at the end. Is this what it's like when you accidentally hire three people named Jon on your team?

I've been saying this is a big philosophical difference between the labs, and I hope you even better understand what I mean now, because this is not a thing that OpenAI is really doing. OpenAI models can spin up sub-agents and they're good for things like investigations, but when Codex is working, Codex is just doing the work in a single thread. And I have found it still ends up being faster and more reliable for a lot of these big tasks. I did a huge port of some old code with both of these things. Codex finished much faster and had code that was just as good as what I got when I did the same thing with workflows.

So yeah, I'm sure this is really useful for a lot of different things, but to me it just kind of feels like we're seeing if we can solve slightly harder problems with significantly more tokens. Like, if there's a problem that's 10% too complex for the existing models, you can choose to spend 100 times more tokens to slightly increase your chances to solve it. Eh, not my thing.

I see myself potentially coming around to this in the future, but now is not that time. Cat gave the example of removing a bunch of feature flags that were already rolled out at 100%, to clean up the code and deprecate the stale ones. Instead of waiting for Claude Code to investigate each sequentially, dynamic workflows allowed Claude to process all of them in parallel.
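As a rough illustration of the sequential-versus-parallel difference in Cat's feature flag example, here is a toy sketch. The flag names, rollout percentages, and the `investigate_flag` coroutine are hypothetical stand-ins for whatever an agent would actually do per flag.

```python
import asyncio

# Hypothetical flags and rollout percentages; not taken from any real codebase.
FLAGS = {"new_checkout": 100, "dark_mode": 100, "beta_search": 40}


async def investigate_flag(name: str, rollout: int) -> tuple[str, bool]:
    """Stand-in for one agent inspecting one flag; True means safe to remove."""
    await asyncio.sleep(0.01)  # simulated investigation time
    return name, rollout == 100


async def sequential() -> list[str]:
    removable = []
    for name, rollout in FLAGS.items():
        flag, stale = await investigate_flag(name, rollout)  # one at a time
        if stale:
            removable.append(flag)
    return removable


async def parallel() -> list[str]:
    # All investigations start together; gather preserves input order.
    results = await asyncio.gather(
        *(investigate_flag(name, rollout) for name, rollout in FLAGS.items())
    )
    return [flag for flag, stale in results if stale]


if __name__ == "__main__":
    print(asyncio.run(sequential()))
    print(asyncio.run(parallel()))
```

Both versions return the same list. The parallel one just finishes in roughly the time of the slowest single investigation rather than the sum of all of them.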

Cool. I mean, I get it, but I don't think it's that big a deal.

So, if you're looking for the smartest model ever, according to Artificial Analysis, you now have it. It uses fewer tokens, costs a little bit less as a result, and is meaningfully smarter than 4.7. This is definitely not another one of those bad, barely-a-difference, sometimes-worse launches like 4.6 and 4.7 were. Even if 4.8 did measure worse in some places, that does not line up with my experience. I find it to be a meaningful improvement.

Is it going to replace 5.5 for me? Probably not. But I got two last things before we finally wrap. First, there was this tweet from Tyler that I loved: "Opus 4.8 is insane, guys. It one-shotted my session usage limit." As I mentioned before, I had to upgrade from the $100 tier to the $200 tier. That all said, as much as the gap between Codex and Claude has closed as a result of this launch, we still really need Mythos before the gap is fully closed and maybe goes in favor of Anthropic once again.

And since Anthropic seems to really like dropping things on the day that my subscription ends, I'm going to do all of us a favor and cancel once more.

After spending yet another $200, my subscription is cancelled and it ends on June 28th. Hopefully, we'll have Mythos before then. And if not, we should have it on that exact day. I think that's all I have to say here. It's a pretty good model. I would definitely recommend it if you're bought in heavily to the Claude ecosystem. But if you haven't tried 5.5 yet, you definitely should.

Now, let's hope there aren't any more big model drops coming soon, because I'm already too busy. I'm out. Have fun, nerds.
