
Your Prompts Didn't Change. Opus 4.7 Did.

By AI News & Strategy Daily | Nate B Jones


Topics Covered

  • Same price, higher cost: the 35% tokenizer tax
  • Agents lie about work: the trust failure breaking workflows
  • Adaptive thinking: useful feature or cost-saving mechanism?
  • Competing on harnesses: the new AI battlefield
  • Serious work gets serious tokens, casual users get the floor

Full Transcript

Claude Opus 4.7 is the smartest model Anthropic has ever shipped publicly.

It's also the most combative, the most literal, and the first Opus release that costs you measurably more for the same work even though the sticker price didn't move. So you have to hold all three of these in view or you're going to misread what just happened this week.

I've been testing Opus 4.7 hard for the last few days. The main test was a head-to-head versus ChatGPT 5.4 on a very realistic, adversarial data migration with hundreds of messy files in every single format a real business would hand you, along with planted traps and human-insanity checks baked in. Then, separately, I spent an afternoon inside Claude Design, Anthropic's brand-new design product that launched the day after 4.7 shipped, and I honestly burned 42 bucks before I closed the tab. Alongside all of that, I've been shipping real work on the model since Thursday morning. So when the first question everyone is asking is, does 4.7 feel different? The answer is: it does.

Specifically, you're going to feel it in some ways that are better, some ways that are worse, and in ways that you will feel financially, in the pocketbook, if you work with it seriously. One more thing before we get into it. Anthropic shipped Opus 4.7 on the 16th. Claude Design launched the 17th. OpenAI pushed the biggest Codex update since launch on the same day as 4.7. And then Spud, OpenAI's next frontier model, is expected to launch later this week. I share all of this because this is inherently a model update made in competition. Anthropic is fielding investor offers at $800 billion and is apparently planning early IPO talks targeting October-ish. So the thing you're watching is not a point release. It's a bridge release. You should think of it as something that was shipped under public pressure into a week where everybody else was moving as well. Let's jump into it.

First, the biggest complaint about the predecessor of Opus 4.7, which was Opus 4.6, is that it quit. You would hand it a complex debugging issue or a multi-step refactor, and sometimes it would just prematurely declare victory and stop. It would lose the thread. It would declare itself done when it wasn't. If you used Claude Code seriously, you would run into this, and it was one of the biggest reasons why people preferred Codex. I hit it consistently. I saw others hit it consistently as I built agentic systems. It was the failure mode that made me route hard multi-step work to other models even when Claude was better at the individual steps. Anthropic clearly prioritized fixing that. Based on four days of heavy use, the fix is real. The model does stay on task better than 4.6. It follows through. It self-verifies. It runs tests. It checks its own output. It catches inconsistencies during the planning phase instead of after execution.

Ocean's AI team reported a 14% improvement on complicated multi-step workflows while using fewer tokens and a third of the tool errors of 4.6. Factory Droids reported a 10 to 15% lift in task success with more reliable follow-through and validation steps. Genspark quantified the loop problem directly: before 4.7, their agent looped indefinitely on roughly 1 in 18 queries, and 4.7 is the first model they've tested where that number meaningfully drops. So these are not benchmark numbers. These are actual workflow reports from teams that are building on the model, and they match what I'm seeing.

The coding numbers are relatively strong, and I'm not surprised, right? SWE-bench Verified climbed from 80% to 87%. Cursor Bench, meanwhile, went from 58 to 70. Rakuten saw three times more production tasks resolved on their internal SWE benchmark, and MCP Atlas moved from 75 to 77. That's the multi-tool orchestration benchmark, the closest thing we have to realistic evaluation, and it's the biggest single jump anywhere in the agentic suite. And that is what makes Claude Design possible at all. Now, whether all this visual improvement actually works reliably is a different question, and I'll get into that.

Something is getting buried in the launch coverage, though. The model went backward on web research.

BrowseComp, the benchmark for multi-page synthesis and retrieval, dropped from 83 to just 79. If you're wondering who leads there, GPT 5.4 Pro leads that benchmark by 10 points, at 89 right now. Gemini 3.1 Pro leads by six and change, at 85. Meanwhile, on Terminal Bench 2.0, which measures command-line task execution, the kind of work coding agents do really consistently, Opus 4.7 trails ChatGPT 5.4 by nearly six points, 69 versus 75. So if your agents rely on web research or live in the terminal, you should benchmark your specific workflows before you migrate.

Ultimately, the model got stronger where Anthropic invested, in coding, in agentic persistence, in vision, in enterprise knowledge work, and it got weaker where it didn't. This is not a uniform upgrade. It's a directed optimization, and that's worth knowing before you make big decisions about migration. And here's a detail that connects to the cost question I raised at the top. The reason those benchmark gains hit different on your invoice is that 4.7 actually ships with a new tokenizer. It's the same text, your same prompts, your same markdown file, but it can map to up to 35% more tokens. I'll walk through the math later, but I want to keep that number in your head as we go, because it reframes all of the benchmark gains you heard. Ultimately, you're paying more for those gains.

On the other hand, there is an underreported win right now, and that's knowledge work. On GDPval, the Elo-based benchmark Anthropic uses for economically valuable work, 4.7 scores 1753, while GPT 5.4 scores 1674 and Gemini 3.1 Pro scores only 1314. So the gap to Gemini isn't a close race, and 4.7 is the best right now at doing economically valuable work. Hex called 4.7 the strongest model they've ever evaluated, with general finance performance climbing from 76 to 81, and the model correctly reported missing data instead of fabricating plausible but wrong fallbacks, the exact failure mode that costs real money in finance applications. Harvey, meanwhile, put it at 90.19% on BigLaw Bench at high effort. Databricks reported 21% fewer errors on Office QA Pro. So if you're doing legal, financial, or enterprise document work, this is the strongest model you can access today.

And that's the launch pitch, right? If benchmarks were the whole story, the upgrade would be obvious. But I found something in my testing that changed how I think about building with 4.7. It's actually not a benchmark. It's a trust failure, and it comes from 4.7's performance on the test that I built. I ran ChatGPT 5.4 and Opus 4.7 through the exact same task, because the question that matters right now is not just, is 4.7 better than 4.6? It's, is 4.7 better than any other frontier option at the specific tasks that you need to accomplish?

The setup was simple. I had 400-some files, 465 of them, in every file format you'd find in a real business: CSV files, Excel files, PDFs, JSON, images, even VCF contact cards. And yes, I put in some fake customer info as well that should have been caught. I put Mickey Mouse in, Test Customer in, asdf asdf in the data. You get the idea. The kind of thing a human bookkeeper would catch in two seconds. And the test was hard. Both models had to take this entire mess and, in a single shot with no guidance between stages, inventory every file, design a database schema, extract the data, resolve entities, detect conflicts, write a migration report, and build a usable review user interface. All in one prompt, with extra-high reasoning, no iteration.

Both models said they got the job done. Opus 4.7 finished in 33 minutes and ChatGPT 5.4 took 53. That speed difference matters for cost and for iteration, but the more revealing findings are structural. There are four findings from this test that I think are worth calling out, and each one is a thing you won't see in a benchmark chart. Number one: Opus built a front end I would actually ship as a V1. It had muted grays, proper typography, per-customer conflict resolution with selectable buttons, and source-file chips showing where each piece of data came from. ChatGPT's own self-review admitted that its UI, quote, faithfully exposes bad canonical data, and did not protect the reviewer.

Finding two: GPT 5.4 was more thorough underneath. It accounted for all 465 input files. Opus 4.7 missed only two, plus it had one duplicate entry in its inventory, which is much better performance than 4.6. GPT 5.4 correctly merged most of the planted duplicate customer pairs, which is an improvement. Opus 4.7 kept all seven segregated apart. And ChatGPT 5.4 produced something I haven't seen from a frontier model before: a 1,200-line merge log with per-row source citations and merge confidence scores. If I'm a human reviewer trying to understand what happened to my data, that merge log is the single most useful artifact across both packages that were produced by 4.7 and 5.4.

Finding number three: there's a failure on the Opus side that I need to name very clearly, because it's one of the most dangerous patterns in agentic work. When asked to process all of those 400-some files, Opus 4.7 did not process a file it claimed to process. It was a TSV file, and the report claims it processed it, but it hallucinated the audit trail. So if you're trusting an agent's report about what it processed, and the agent is willing to say, I handled that file, when it did not, that's not just a missed detail. It's actually breaking trust in the whole agentic flow. It's the specific behavior that makes peer review non-optional when you hand agentic work to either of these models.

Finding number four: Opus 4.7 and 5.4 don't catch obvious mistakes. Mickey Mouse was still in there. Test Customer was still in there. Asdf asdf was still in there. A 25,000,000-unit order that made no sense got silently normalized to $25 in cash and counted toward revenue totals, without any explanation, by 4.7. And GPT 5.4 had the same issues, right? It put the same three fake customers in as canonical customers. It actually promoted an orphan record that was a trap into a canonical record. So both models have the same pattern. They do not cleanly produce data when they're asked to process that data from raw, dirty inputs. So for all the talk about frontier reasoning, the question "is this a real person?" remains a human job, unless you have a harness that is specific to data processing and focused on sorting out duplicate customer records.

And now comes the finding that changed how I think about building on these models. I had each model review the other's output on a seven-dimensional rubric. Opus self-reviewed at 3.5 out of 5 on this database test. GPT 5.4 reviewed Opus much more harshly, at 2.7. GPT 5.4 self-reviewed at 3.1 out of 5, while Opus 4.7 reviewed GPT 5.4 at 3.6. It was more gentle. In other words, Opus oversells itself and GPT 5.4 undersells itself, and GPT 5.4's own self-review surfaced more real problems than Opus 4.7's review of GPT 5.4 did. So, in other words, the harshest, most honest grader was the model with direct SQL access to its own tables. Averaged across both reviews, Opus 4.7 lands at 3.1 and ChatGPT 5.4 at 3.35 (that's (3.5 + 2.7) / 2 versus (3.1 + 3.6) / 2). That is inside the noise of a single run. I'm not worried about that minor difference.

The thing that I want to call out is that both models are being given a test so hard that even frontier models are not successfully passing it. In other words, the test is appropriately scaled to where frontier models are today. Processing hundreds and hundreds of files with conflicting information, putting them through an agentic pipeline into a database, building a UI against them: that seems to be the kind of test that reliably measures, in a granular way, what models are actually capable of. And the takeaway is that 4.7 is absolutely competitive with 5.4 in a way that 4.6 wasn't. But neither model is blowing the other out of the water on this. And the trust issues that surfaced matter here. If you trust self-review in your agentic workflow, Opus 4.7 will sometimes tell you it's done when it's not. And GPT 5.4 will tell you something's wrong when it's fine. Now, I want to caution here.

Despite me saying that, Opus 4.7 still does a much better job than 4.6 at actually finishing work, so I'm not saying it's not an improvement. In fact, I ran this same shoebox-of-data test against 4.6 and got a much, much lower score than ChatGPT 5.4, and 4.7 really pulled that score up toward close to par. I think that represents real improvement in the model actually getting work done. That doesn't mean Opus 4.7 isn't still somewhat overoptimistic about the work it needs to accomplish. So hold that nuance in your head as we move forward here. I'm going to continue testing these models against this benchmark, because I'm really satisfied with how it's going. I'll continue to write things up on the Substack in more detail so you can see these tests. And if this is working for you, if you like this kind of real-world testing, hit that subscribe button so that I can continue to deliver the actual test results on real-world systems that you would otherwise have to dig for, because OpenAI is likely to release a new model this week, and we should be ready to run this test and see what happens when the new model, codename Spud, comes out.

Now, back to it. That peer-review finding, where the harshest grader was the model with SQL access to its own tables, that same pattern, the gap between what a model says it did and what it actually did, showed up again the very next day. Except it wasn't a data migration problem. It was the design tool that Claude launched just the day after Opus 4.7, and it cost me 42 bucks to find this out.

So, the day after 4.7, Anthropic launched Claude Design under a new subbrand called Anthropic Lab. And the first impression of Claude Design is very strong. You hand it a codebase. You hand it brand assets. It's going to read them. It will build a design system. It will generate logos. It will generate typography at scale. It will generate a color palette, a spacing system, components. Really, a full UI kit. And then it does something that's worth flagging. It generates a skill file, a machine-readable instruction set any future AI agent can consume to produce on-brand output. SKILL.md is not a new format. It's the Claude skills standard. You'll find it across Claude Code, the skills repo, plenty of community projects like my Open Brain project. What's new here is that a design tool now produces a skill file natively from your codebase and brand assets to ensure that future projects are brand native. That's not just making human-facing brand docs. It's actually turning the design system into agent infrastructure, which is where this category is going.
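To make the skill-file idea concrete, here's a minimal sketch of what a generated brand skill could look like. I'm basing the shape on the public Claude skills format, a SKILL.md with YAML frontmatter; the brand name, colors, and rules are invented placeholders, not Claude Design's actual output:

```markdown
---
name: acme-brand-system
description: Apply Acme's brand system (logo, colors, type, spacing) when generating any UI, slide, or marketing asset.
---

# Acme Brand System

## Logo
- Embed /assets/logo.svg exactly as-is. Never redraw or recolor the mark.

## Colors
- Ink: #1A1A1A (primary text)
- Terracotta: #E2725B (accents only)

## Typography
- Headings: serif; body: sans-serif, 16px base, on an 8px spacing grid.
```

The point of a file like this is that any future agent, not just Claude Design, can load it and stay on brand without a human re-explaining the rules.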

The design setup flow is very well organized. It will take GitHub repos, local codebases, Figma files, brand assets, free-form notes, all in one place. The review UI is super clean. You click directly on the element you want to change, you leave a comment, you send it. The export options are unusually practical, actually. You can export to ZIP, to PDF, to PowerPoint, to HTML, to Canva, or you can hand off to Claude Code. The Canva integration is not an afterthought. It's the stated rendering layer, and it extends an existing two-year partnership between Claude and Canva.

The conspicuous omission in all of this is, ta-da, Figma. There is no export to Figma. Mike Krieger, Anthropic's CPO and co-founder of Instagram, resigned from Figma's board on April 14th, just three days before the launch. Figma's stock dropped over 7% on announcement day. The market is seeing Claude Design as a Figma killer.

Claude Design also does animated sequences. I want to be very precise here, because every piece of coverage I've seen gets this incorrect. These are React-based motion graphics, but they are not really generated video. You cannot export them as a video file. You screen-record if you want the video. This kind of animation is genuinely useful for product demos and B-roll, and I've been doing a lot of this kind of thing manually, and it's tedious. So it's great to see it here. But it's code-generated animation. It's not Sora, and that distinction matters.

And I took it for a real drive. I went through a real product with a real codebase and real brand assets. And within the first hour of setting Claude Design loose on those assets, I knew the story was going to be more complicated than the first impression suggested.

The design system came back impressively complete, with logos, with type, with colors, with spacing, with components, with a usable UI kit, all organized in a file tree with JSX components and a proper readme. And then I noticed the logo. Claude Design had reinterpreted the logo. It had turned the color mark into a black square plus a wordmark instead of faithfully preserving the existing, correct source logo. That is a hard failure for a design system generator.

The moment it starts redesigning your logo without your permission or request, every downstream artifact becomes suspect. And that's what happened. The broken logo propagated into the UI kit. Every deliverable was now carrying it. All of those nice files that it built were now corrupted. So I flagged it. Claude explained it was going to fix this. It sounded competent. It sounded like it understood. The first correction pass came back incorrect. The black-and-white on-dark variant was still wrong. This is not a taste-judgment issue. This is a straightforward brand preservation miss. And even though I flagged it yet again and it took a second pass at it, it still got it wrong. Third pass, still wrong. You get the idea here.

By now, I was writing the instructions as literally as I could: the "AI" should be black with the white padding on the black background; the other part of the brand logo should remain white; please check your work before calling it done this time, thank you. Same error. By this point, the issue had stopped being my prompt. It never was my prompt. It was the system overestimating that it had satisfied the spec. It eventually got fixed, but by the time it was fixed, I'd been through multiple review rounds. It was something like the fifth or sixth attempt.

And here's where the cost thread comes back. The part that hits different when you're using Claude Design is that every one of those correction passes costs you money. I'd started the afternoon at $5 for the initial design system. Sounds like a great deal, but by the time the logos were right, the review iterations alone had cost me another 10 bucks. And then I tried the animation features. The 60-second overview piece came out reasonable, about $2.50. The longer piece, two minutes, where I needed five review passes to get the quality right, ran me $23.29, and the verifier appeared to time out, and the agent didn't consistently check its own work before declaring the job done. By the time I closed the tab on Claude Design and got all of this reviewed and done, the total bill was $42 (those line items alone, $5 + $10 + $2.50 + $23.29, come to $40.79, with smaller passes making up the difference), and I had burned through Claude Design's entire usage allocation in one afternoon.

Look, a first-pass miss is forgivable. A third-pass failure on the same visible brand correction that I am paying for turns the review loop from helpful into expensive. And cost isn't separate from product experience here. It is the product experience. When every iteration is billable, reliability isn't just a quality concern. It's also a financial one.

And I want to be clear: it is still amazing that we have something like Claude Design that can actually create these files at all. So as much as $42 feels like a lot to spend on this tool and these revisions, and as much as I am rightly calling out that we need better adherence to corrections and more transparent billing practices, especially when we are paying for mistakes the models make, I also don't want to lose sight of the fact that we live in a miraculous world where I can get this kind of design quality at all, done through a model, automatically, in just a few minutes. So yes, if you're looking at it and saying, "Is $42 worth it for a full working design system and an animated video?", I think the answer is yes, it is. Did I see some scratches on the record, so to speak, some issues with how this works today, how the billing works, how the revisions work, prompt adherence? Also yes. Keep both of those in mind as we go forward.

One thing I want to call out on Claude Design is that it's trying to be much more high-end than most people will realize. The onboarding speaks professional design language. The output includes JSX components and agent-readable instruction files, which may not be intuitive to everyone. Having someone with actual design expertise set up the design system is going to make a massive difference in what you get out of it. This system rewards expertise, and that fact alone undercuts a lot of the popular narrative I've seen that this kills designers as a role. I don't think it does. I think it's a tool for designers to do their best work. And I think that's exactly why Anthropic is leading with a Canva partnership, which signals prosumer to exactly the audience that would get the most out of it.

Now that we've seen how Claude Design works a bit, let's switch gears back to overall impressions of 4.7, and then we'll get into specific usage tips for 4.7 after that. First, why does the model feel the way it feels? I think three separate things are happening at once here, and most of us are collapsing them into one complaint. I'll walk through all three.

First, adaptive thinking underinvests on tasks it judges as simple. The model just decides how much reasoning your query deserves. So for hard coding at extra-high effort, it's going to think deeply. But for writing, for research, for conversational reasoning, it sometimes decides you don't need that, and it gives you less. That's what people mean when they say non-coding replies on 4.7 feel thinner. Boris Cherny, who heads up Claude Code, recommends setting extra-high for most tasks and max for the hardest. But the problem is that's only available in Claude Code. Hex's CTO offered a useful calibration rule if you want to get a sense of how 4.7 and 4.6 compare here. He said low-effort 4.7 is like medium-effort 4.6, which means if you were running 4.6 on high for something, you run extra-high on 4.7. That's where you're seeing the real value. And that's something that only works if you're a developer who lives in the terminal. For the person paying 20 bucks a month who has no idea what an effort level is, that lever is invisible.
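If it helps to see that calibration rule written down, here's a tiny sketch. Only the low-to-medium equivalence and the high-to-extra-high advice come from the source; filling in the middle rung uniformly is my assumption, and these labels exist as Claude Code settings, not API parameters:

```python
# Rough 4.7 -> 4.6 equivalence per the Hex CTO's rule: each 4.7 effort label
# behaves roughly like the next 4.6 label up.
EFFORT_47_FEELS_LIKE_46 = {
    "low": "medium",       # stated: low-effort 4.7 ~ medium-effort 4.6
    "medium": "high",      # assumed interpolation
    "high": "extra-high",  # implied: run 4.7 at extra-high to go past 4.6 high
}
```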

Which opens the question I keep coming back to: is adaptive thinking actually useful, or does it just save Anthropic tokens? Anthropic removed the old control surfaces completely with this release. You cannot set a thinking budget. You cannot set a temperature. You can only get effort levels in the Claude Code interface, or you get whatever the model decides. Fewer knobs for developers, and ultimately less control for everyone. Critically, fewer knobs doesn't mean lower cost. Remember that 35% tokenizer increase I mentioned earlier? Adaptive thinking just decides how many of those expensive tokens it's going to spend on your behalf. So you have a model that charges more and a system that decides how many tokens you get. Both levers moved in the same release. That's not an accident. That's definitely a monetization strategy.

Second big callout here: the model follows instructions more literally. Anthropic's migration guide says it explicitly. The model will not silently generalize an instruction from one item to another, and it will not infer requests that you did not make. So if you said "format this nicely" to 4.6 and it made really generous assumptions, 4.7 does exactly what you wrote. Nothing more, nothing less. That does make it more predictable for production pipelines. It makes it less forgiving for all of us who relied on the model reading between the lines. It also reverses a trend we've seen of models getting better at inferring between the lines. And that's a deliberate choice on Anthropic's part to improve long-running agentic production work.

work. The model also uses tools less often by default. It spawns fewer sub aents. And it has an opinionated design

aents. And it has an opinionated design aesthetic. If you've seen it, it's warm

aesthetic. If you've seen it, it's warm cream. It's serif type. It's terracotta

cream. It's serif type. It's terracotta

accents. And that bleeds into everything visual unless you override it. And since

temperature is gone, if you want more creativity, your prompting has to do that work. You can't just handle it by

that work. You can't just handle it by turning up the temperature. Let me make a concrete example here because I think that that will help this all to sort of land. If you paste an article into

Let me give a concrete example here, because I think that will help this all land. If you paste an article into Claude.ai today and you ask, "Hey, can you summarize this in three sentences and make it punchy," on Opus 4.6 you probably got three sentences plus a header, maybe a closing kicker, maybe some bolded key terms. The point is not the exact details you get on your run. The point is that it inferred between the lines. On 4.7, you're going to get three punchy sentences. You will get exactly, literally, what you asked for. No more, no less. That's all. If you wanted formatting, sorry Charlie, you had to say so. Multiply that across every prompt you've written and you see the migration problem with 4.7. Half the value you were getting from 4.6 was the model guessing at what you meant and filling it in. That value didn't go away, just to be clear. The model can still do all that work. But the value moved, because now you have to ask for it explicitly.

Third big callout: the model is more combative. And yes, we can actually measure that. CodeRabbit ran Opus 4.7 through their tone-analysis harness and found a 77% assertiveness rate with just 16% hedging. The language comes through in imperatives like "guard against this," "prevent that," "validate this input." It almost gives you orders. Gergely Orosz of The Pragmatic Engineer said publicly that he went back to 4.6 precisely because of this combativeness. And Anthropic's own migration doc describes the new tone as more direct and opinionated. So this isn't vibes. It's actually a measurable shift.

And you'll find it comes out in specific places. In code reviews, 4.7 leads with a verdict and a patch. It does not soften. If you prefer the older style that phrased everything as a suggestion, well, that style is gone. In ambiguous creative writing tasks, especially anything around characters in distress or edgy humor, the model is going to trigger some safety weights and push back, or execute a modified version of what you asked for instead of doing what you asked. In security-adjacent coding, another safety area, the model will add caveats you didn't ask for, sometimes refuse outright, sometimes produce a scoped-down version with a warning. Why does it do this when it's supposed to follow instructions literally? Because safety is a big focus of this release, especially in light of Mythos. In general conversation, if your query touches a sensitive topic or reads as risky, the model will extend that safety weighting and steer you there too. Ultimately, this comes down to a register that is more directive than you got with 4.6. For some users, that will land as a confident, direct peer. For other users, it lands as dismissive, or it lands as distracting. Both of those reads are correct and honest. It just depends on what you're looking for in a model. And that's why I'm taking time to explain them.

Ultimately, the three big callouts that I named all compound together to generate the model impression, right? Adaptive thinking gives you less reasoning than you might want. Literal interpretation takes away the inference you were getting for free. And that combative, direct register changes how the answer lands, even when the answer is correct. Where this takes you in the end is, frankly, a clear, direct coworker. And that is what Anthropic is building. I want to be very explicit about the strategy here. Everything you see, from the long-running agent work, to the ability to do complex tasks, to the work on visual acuity, to the work on understanding exactly what you meant and doing it, all of it maps very cleanly to the idea that Anthropic is building a coworker who does hard, difficult tasks with you at enterprise level. That is where the money is, and I think that is what Anthropic is building toward, and 4.7 is clearly a beat on the way to that release.

Now, whether that's a good trade depends on how you work. I build and design long-running agentic systems for companies. The hardest trade-off in that work is always the difference between a model that improves on an aggregate metric and a model that's actually better for the person sitting in front of it. You can optimize all of the evals and still make the experience worse. And I think that is part of the story here, because it's true that the model is measurably better on the hardest tasks, and those are not small improvements. When I ran my shoebox test, it was better than 4.6 by a lot. But it's also more restrictive in everyday use, and it's harder to prompt.

Anthropic likely shipped knowing those bugs. Alex Albert from Anthropic acknowledged post-launch bugs on Friday, and an Anthropic PM told users the team was sprinting on tuning up the model after specific complaints about adaptive thinking. I am confident that future releases will start to clean up some of these issues with 4.7. But I want to be clear that part of the story of this model launch is that you can make a model better in some ways while trading off and making it worse in others. If you want longer-running agentic performance, you need a model that follows instructions directly, clearly, and precisely. And that may mean it's harder to prompt. We are living in a world where we are essentially trading off the qualities of these models, and the labs are doing this in ways that further their economic interests, which is super rational, but we all have to adjust to it, because these models now touch all of our lives. That's sort of the larger story here, and that's why 4.7 is a challenging model to understand, and why I'm taking time to lay out all of this detail.

this detail. So what can you actually do about all of this? I'm going to give you four playbooks or tip sets across 4.7.

And the reason I'm giving you four different ones is simple. The way you can configure this model is different depending on if you're in the API or in cloud code or the chat. And I want you

to have the tools to make it what you want it to be. The first playbook is universal. It applies on every surface.

universal. It applies on every surface.

You must frontload intent with this model. Anthropic's own guidance on

model. Anthropic's own guidance on smarter models is that they need less prescriptive engineering, not more. And

so, ironically, what this means is you have to tell 4.7 what you're building, who it's for, what the constraints are, and what good looks like, and then you need to get out of its way. And often

our instinct when a model behaves more literally is to write longer and more detailed prompts. That makes sense to

detailed prompts. That makes sense to us. But that is backward with this

us. But that is backward with this model. The fix here is not more words.

model. The fix here is not more words.

It is clarity. It is clearer intent upfront. I think Andre Carpathy's

upfront. I think Andre Carpathy's framing is the right one. Increasingly,

we should not tell models what to do, but we should give them success criteria and watch them go. But there's other elements to this playbook as well. You

want to, for example, batch your questions instead of drip feeding them.

You want to show the voice you want with positive examples rather than describing it. And if you want 4.7 to spread across

it. And if you want 4.7 to spread across parallel subtasks, you will need to ask explicitly upfront in that initial prompt. It's going to spawn fewer sub

prompt. It's going to spawn fewer sub aents by default than 4.6 did, but that behavior is steerable in the prompt itself. Now, the playbook two or section
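To make that concrete, here's the shape of a frontloaded prompt. The product, constraints, and success criteria are all invented for illustration; the structure is the point:

```
Context: I'm building an internal invoicing dashboard for a 40-person agency.
Audience: non-technical account managers.
Constraints: React plus our existing component library, no new dependencies,
must stay responsive with 10k invoices loaded.
Success criteria: an account manager can find any overdue invoice in under
10 seconds, and the overdue total matches the finance team's spreadsheet.
If subtasks can run in parallel (data layer, UI, tests), split them and run
them concurrently.

Build the overdue-invoices view.
```

Notice that almost all of it is intent and success criteria, not step-by-step instructions, and the parallelism ask is stated upfront rather than assumed.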

Now, playbook two. That one's specifically for Claude Code, where you have more levers in hand. I mentioned extra-high and the Hex calibration rule a few moments ago. I would set extra-high as your default and max for the hardest stuff. Beyond effort, I would use plan mode and review that plan as a default. Don't just look at the diff. That's where misread intent surfaces before any code exists, and it's the single highest-leverage workflow change you can make for 4.7. I would also use the new /ultra review command to handle the other end of the workflow, when you're done with your code. And I would remove any scaffolding you wrote to force interim progress messages, because 4.7 does that natively now, and the old instructions can fight that behavior.

behavior. Playbook 3 is for if you use the API. The migration break most people

the API. The migration break most people hit first is in parameter removals.

You're going to have to delete temperature and top P and top K from your code. They will return 400 errors.

your code. They will return 400 errors.

You also have to realize that the thinking budget tokens parameter is gone and you're on adaptive thinking whether you want to be or not. In that spirit, I would flip thinking display to

summarized in the API unless you have a reason to hide it or your users are going to end up seeing a long pause now followed by the output with no visible reasoning. And that single default

reasoning. And that single default change in the API is actually doing real reputational damage to 4.7 because overnight the experience using the API changed for people. And in the API, I

would also regression test your most important prompts because you need to not just assume the cost math works a particular way. Make sure that you are

particular way. Make sure that you are testing to see how many more tokens your prompt actually uses because it does vary and it's going to be more. Playbook

4 is the one most people actually live in. If you're in claw.ai chat or claude

in. If you're in claw.ai chat or claude co-work, you will have none of the levers I just listed. You will not have an effort selector. You will not have the ability to set to extra high. You

will not have a task budget. Adaptive

thinking is just the default. It's the

only mode on offer and the model decides how hard to think for you based on how it reads your query. That is the specific mechanism that makes chat

replies in 4.7 feel thinner than 4.6.

And it's also of course anthropic managing compute under pressure. The

practical consequence for you is the same either way. You have to prompt your way into deep reasoning because the switch is not available in the UI. So

So what do you do? How do you do that in chat and Cowork? Ask for reasoning explicitly. Say things like, "Hey, think carefully about this before answering. Walk me through your reasoning step by step." You get the idea, right? "Where is the strongest counterargument?" Those are the kinds of things you need to ask, because those phrases pull the model into the kind of thinking it will not allocate on its own. You also should upload the context instead of just describing it. Upload files, upload code, upload the actual doc. 4.7 is more literal and will take and use those files in a way that 4.6 didn't all the time. I would also start fresh chats really aggressively when the context gets polluted, because 4.7 carries interpretations forward very literally. And I would use Projects to carry intent across sessions so you're not reestablishing it at every turn. Ultimately, if a prompt that worked perfectly on 4.6 is producing results on 4.7 that you don't like, the model isn't necessarily getting dumber. What you may have found is either a place where adaptive thinking isn't surfacing and you need to trigger it, or a spot where you need to get more literal in your prompting to get the same results.

Now, let me show you the math behind all of this, because the cost thread I've been pulling on since the opening deserves a full picture. This new tokenizer maps the same input to up to 35% more tokens, according to Anthropic's own docs. And that number, honestly, is conservative. Simon Willison ran the Opus 4.7 system prompt through the token counter and actually measured 1.46x, above Anthropic's stated range, on real Claude markdown and the kind of technical content that agents consume. Independent measures have come in between 1.29 and 1.47. The thing that I want to call out here is that all of this precision comes at a cost. Ultimately, Anthropic is getting more out of the model at higher effort levels, but part of the way it is doing that is by changing the token strategy, and that has real costs. And the costs don't just live at input. You also burn more output tokens, because the model thinks harder on difficult tasks. This is why I hit my Claude Design cap. Some Pro subscribers hit their cap after three questions on agentic tasks. Boris Cherny had to announce he was raising Max plan limits after the 4.7 launch because of this issue.
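To put rough numbers on the input side, here's a back-of-the-envelope sketch. The per-million-token price is a placeholder (the video only says the sticker price didn't move, not what it is); the multipliers are the measured range above:

```python
# Same prompt, same sticker price per token; only the token count changes.
PRICE_PER_MTOK_INPUT = 15.00  # placeholder $/million input tokens

old_tokens = 1_000_000  # a month of prompts as counted by the 4.6 tokenizer
for multiplier in (1.29, 1.35, 1.46, 1.47):
    old_bill = old_tokens / 1e6 * PRICE_PER_MTOK_INPUT
    new_bill = old_tokens * multiplier / 1e6 * PRICE_PER_MTOK_INPUT
    print(f"x{multiplier:.2f}: ${old_bill:.2f} -> ${new_bill:.2f} "
          f"(+{(multiplier - 1) * 100:.0f}%), before extra output burn")
```

At the documented 1.35x, a $15.00 input bill becomes $20.25 for the identical prompts, and that's before the heavier output and thinking spend on top.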

And I don't want to lose sight of the fact that this is all launching at the same time as developers losing control. Temperature is gone, top_p is gone, top_k is gone. All of that means you have fewer dials to manage those model settings, and you are trusting Anthropic to keep the defaults hidden and manage them the way they want to, which again points to that underlying theme: Anthropic has compute constraints and needs more of the levers in their hands versus ours. And that adds up to the mixed response that we've seen to this model. When thinking display defaults to hidden, when adaptive thinking is on and you can't control it, it may look like the model got dumber even when it didn't. And that alone probably explains a chunk of the negative first impressions I've seen of 4.7.

Nathan Lambert actually flagged that the new tokenizer change suggests this is a new base model, not a fine-tune of 4.6. And he may be right. If he's right, 4.7 is architecturally much bigger than the version number suggests. But it also means we may be looking at an early checkpoint that shipped before it was fully smooth. The uneven quality between coding and non-coding tasks, the combativeness being patched post-launch, at least potentially, the generalization gaps: all of that is consistent with that reading.

So, if we want to sum it up: the sticker price didn't move, but between the tokenizer tax, the higher output burn at extra-high, and a correction loop in Claude Design that charges per pass, you really are paying more for the same work. And if the model is new, that also explains Anthropic's need to monetize differently. So, to close the loop, the sticker price actually isn't the thing you need to worry about here. You need to look at the tokenizer tax I mentioned, where it tokenizes higher on inputs. The higher output burn at extra-high. And you need to look at the way correction loops are handled; I called out how Claude Design charges per pass. Ultimately, if you add all of these pieces up, more input tokens, more effort on output, the Claude Design piece, however you slice it, you're paying meaningfully more for the same work. And Anthropic needs you to pay more, because they are compute constrained and they are throttling demand through pricing. And if they have a new model that they're trying to monetize on top of the astonishing growth in enterprise, that makes a ton of sense. They need the revenue.

And it goes into a larger story that we're seeing in the competitive landscape. But the competitive picture isn't just about models from other players. It actually explains why Anthropic is building so many new product categories at once: they have to prop up their valuation. So when they launch 4.7, they launch Claude Design right away. It has animated content. It has prototypes. It has slides. It has a lot more than just a design-by-itself product. And every one of those new product categories helps Anthropic justify that sky-high valuation. And revenue is showing that this works. Revenue hit roughly $30 billion annualized in April, up from $20 billion a month prior. Investor offers are sitting at $800 billion now, up from $380 billion in February when they did their Series G. Anthropic is now in talks with major banks to IPO in October, with bankers estimating a $60 billion raise.

Eight of the Fortune 10 are now Claude customers. Anthropic's enterprise share has climbed in a single month from 24% to 30%, while OpenAI has dropped from 46% to 35%. Every new product category is consuming massive inference compute, and every correction pass is burning tokens, and the strategy of vibe coding into new categories is kind of in conflict with this. Because on the one hand, every new vibe-coded product increases their odds of selling at the enterprise level, makes a lot of sense, and actually fits philosophically with where the Anthropic team is going. If you have a great LLM that does lots of different work well with different harnesses, you should be applying that model in as many places as you can, and I would expect them to continue to do so. On the other hand, you're going to have to get people to pay for that, and that's where the pricing piece comes in, and that's why 4.7 costs more.

Another major part of the picture is Mythos. It officially shipped on April 7th, but we don't have it yet, because it was deemed too dangerous to release. Only, frankly, the government and fancy companies have it. AWS has it, Apple has it, Microsoft has it, Google has it, Cisco has it, the Linux Foundation has it. And the reason why is that the model is considered too much of a security threat to be openly released. Mythos has found thousands of zero-day vulnerabilities in every major operating system, in every major web browser. And really, what Anthropic wants to do is take this time to make sure the web is made safe for Mythos. But at the same time, the competitive pressure isn't easing. And that's why Opus 4.7 shipped. They can't stop launching just because Mythos is too good to release.

And part of why is OpenAI's activity. OpenAI dropped the biggest Codex update since launch right after 4.7 dropped. The new version runs background computer use on Mac. Agents can see, can click, can type across apps. It's really cool. I'm doing a whole separate video on it that's coming soon. I want it to have its own time to shine. But that being said, if you're Anthropic and you're looking at when to release 4.7, you're in a bit of a bind, because Codex drops right when you launch. And at the same time, OpenAI's next frontier model, codename Spud, is scheduled to release very soon, perhaps as soon as this week. And if that's the case, then you have got to release when Anthropic released, on Thursday. Because if you don't, you are not going to be the best new model out there, and you lose the perception in the market that you're a frontier model maker, which they cannot afford from a fundraising perspective. And so they had to release what they had, or else GPT 5.5, or whatever it's going to be called, would overtake them as the new frontier competitive best model in the world when it launched.

And if you're wondering, am I predicting that? Yeah, it's actually pretty safe to predict. These labs do not drop models unless they feel they have a new best-in-world to drop. And if you're wondering what that means for you, the user, what it means is you kind of have to dig in and learn. And that's exactly why this video is here: to understand where a model works and where it doesn't, so you can deploy it where it makes sense for you. I think a nice shorthand right now is to look at Anthropic as building vertically and OpenAI as building horizontally. Anthropic is building vertically across design, code, review, and deploy, while OpenAI builds horizontally with Codex as a platform that plugs into everything on your desktop, including all of that computer-use stuff that you didn't get any other way until Codex launched. And yes, I will say it, I'll just get ahead of the review: Codex is incredible at computer use, and we'll get into that.

So, is 4.7 worth it? The honest answer depends on who you are, and I'm going to go through each of these use cases in detail. If you're a daily Claude Code user, if you're running agentic pipelines, you should upgrade today. The persistence fix and the vision jump are worth it, but you should set your effort to extra-high and you should read the migration guide. Your prompts will need reshaping, especially anywhere you relied on 4.6 filling in intent. If you're doing financial analysis, if you're doing legal work, if you're doing enterprise document reasoning, you want to upgrade right away. There's a reason why this model scores so highly. I have used it extensively for complicated knowledge work just in the last four days. It is really, really good at that, and you should absolutely go for it. If you've got production API code that's tuned to 4.6, you've got work ahead of you. You have to test the cost before you switch. You have to change the parameters I talked about that are producing 400 errors. There's a migration task ahead. Meanwhile, if your agents live on web research or terminal tasks, think before you update. As I called out, web research is not stronger in this model; it actually regressed. And you want to think about whether you need to do a lot of web retrieval before you decide to upgrade, if you have a search-heavy workflow. And if you're a Claude chat user paying 20 bucks a month, ultimately whether you decide to upgrade or not depends entirely on whether you're willing to change how you prompt, because the model that used to fill in the gaps will not do so anymore. Now, if you're a designer and you're saying, I have to upgrade anyway because of Claude Design, I would agree with you. Claude Design is going to cost you something, but it is a real upgrade, and it shows where Anthropic is going as far as building out these vertical workflows. It's worth a try.

And on the tone side of things, you either need to decide to go with 4.7's directness, or you need to decide to go with 4.6 being softer, or even potentially 5.4 being softer. I know we didn't expect ChatGPT to be more personable than Claude Opus, but that is the world we're living in with the 4.7 launch. It is really, really, very direct. And that gets at one of my interesting questions for a model these days: is it your 2 AM work buddy when you need it to be direct and clear and helpful and really bail you out of the mud? Is it going to get you there? The answer, honestly, is it does, depending on the work. If it's complicated work and you know how to drive it and you're okay with the directness, yes. But if you need the vagueness, if you need the ability to brainstorm and come back with something that isn't completely clear and have the model infer, it's just not the right model for that job. And in fact, I would go back to 4.6.

Having spent the last few days with this model, I think the bottom line that I want to call out is this: Anthropic had to ship this model. They had no choice. They needed to ship something competitive, and this is the best model they could put out in that time frame. This model is a strategic beat that shows where they're going as far as building an agentic coworker that can do long-running, complicated tasks. The reason I'm spending so much time in this video talking through in detail what's good and not good about it is that this is actually something we need to practice more of. We cannot get to more mature products in the AI revolution without this level of detail. If we are not testing in this level of detail when a model comes out, we're not really doing our job. And that's why I took so much time to understand where this model shines and where it doesn't, and to run it through my own practical tests before making this video. I don't want you to have to make a decision about a model upgrade based on one or two points of data. I want you to have real clarity about where this model shines and where it doesn't, so that you can understand whether a model works for you or not.

The future we are building toward, whether we like it or not, is a multifaceted LLM future. I know that we started our journey in 2022 when ChatGPT came out and it was all one thing. That world is never coming back. In fact, even within Claude, I had to name a bunch of surfaces today in this video to talk through what 4.7 does and doesn't do. Expect that to get even more complex, because the work we're asking these models to do is getting more important and more detailed. And that's one of the larger themes of the 4.7 release: the model builders are now competing on harnesses. If you look at Claude Design, it's really a custom harness over the LLM, focused specifically on design. That is part of how Anthropic is shipping all of this stuff so fast: they're sticking a custom harness for a particular task on an LLM that has been reinforcement-learned against a particular subject, and so they can compete really effectively there, and they can build something that people describe as a Figma killer.

Now, when I think about that strategically, it makes sense for Anthropic, as I said, to go into all of these verticals. But the thing that you should be thinking about is this: if the model makers are now competing not just on the model itself, which they did for a while, but also on the harness, how does that shape our assumption of where these model makers are planning to go in the future? I think we should assume more vertical builds, not just from Anthropic, but also from OpenAI and other model makers, even from Google. Google has said they are going after Claude Code because they don't like how much of coding workflows are now moving to Anthropic. It seems late, but they're finally paying attention. And I think we should expect more of those vertically themed releases, which Claude Code was arguably the first of. And all of that harness optimization leads to some downstream implications for all of us. If you are building in the AI space, you need to think seriously about whether your advantage in a space is something you can sustain. If the model makers can ship a great harness around that space, if they can ship a harness around design, if they can ship a harness that helps a model do a great job of financial modeling, do you still have a product? Think about what gives you an edge that goes beyond just building an agentic harness.

I would also call out, if you are thinking through this upgrade story more strategically: where are you adjusting your workflows so that this is easier next time? Because one of the things I'm taking away from this is that we need to start thinking about all of our work as it informs the larger project these labs have, which is to build long-running agent workflows for serious work. We almost need to start thinking of our work as either complicated, long-running knowledge work, which these model makers are going to work really hard to make incredibly easy and automated, and that includes coding, that includes financial modeling work, and so on, or much more casual, conversational-type work, which the model makers do not appear to be investing in at this time. And we should plan accordingly. I think 4.7 is not the last beat in that story. We are going to be talking about similar themes, I think, when OpenAI launches their next model. We certainly can talk about similar things with the Codex launch and what the Codex launch enables.

Where we are going with all of this, in line with a compute-constrained world, is that work is going to get sorted by these models. If you are doing hard, complicated work, you have a world where LLMs are just going to keep getting better for you. But for casual users, there's just not enough compute to go around, and the model makers are not prioritizing that workflow. That's why OpenAI killed Sora. That is why Claude is very comfortable in the prosumer category, not the widespread consumer category. We need to assume a world where serious work gets serious tokens, and casual interactions do not get the same level of involvement.

Now, that may be okay, because we already have lots of LLMs for casual work, and maybe it's good enough and most of us don't notice. But it's worth naming, because I think 4.7 is one of those beats where I really saw that we are moving very, very fast on a path that is technical, that is demanding, that is focused on serious knowledge work for enterprises. And if that's you, if that's the world you operate in, this is a big deal. If it's not you, if you're more of a casual user, you are going to see fewer and fewer releases that change your world, because so many of them are focused on the category that pays the most. And I think that's one of the economic questions of our time, right? We have built these AIs, and in a real sense they have saturated the chat use case, and no matter how hard we try, there's not really a way to make the chat experience all that much better. You can slap a better UI on it and make it more engaging in certain ways. You may have a moment coming soon around responsive video, where AI avatars chat back. There's stuff like that we can do. But the underlying LLMs that do hard work are evolving in another direction. They're evolving in a way that average consumers won't see.

And I think about this because I have lots of family I have to explain this to. I'm the AI guy for my family, right? I have to explain what AI is and what it does. It's getting harder to do. It's getting harder, and 4.7 doesn't make that easier. It's another beat in a story where, ultimately, economic models are coming that do real, hard work and help real professionals solve hard problems. I think that's different from saying they replace jobs. I know there's a lot of talk about that. I do not see this as a job killer, and I know a lot of people do, and that's a reason I didn't talk about it here: I just don't see the tool sets, even the designer tool sets, as being good enough to independently replace good, professional, serious workers.

So that's the 4.7 story. It's a story that's complicated. It's a story that's not easy to share over the dinner table if you're talking with someone who's not an AI person. But it's the real story. It's the story that shapes whether you want to upgrade or not. And I hope you get that detail from this video. And if you want more, jump in and subscribe, because there's going to be another super-detailed breakdown of Codex coming. There's going to be another one coming on 5.4. I have a write-up on all of this on the Substack, including all of my test results. So if you want to dive in, that's where you should go.

I'll see you next time.
