Anthropic Just Dropped the New Blueprint for Long-Running AI Agents.
By The AI Automators
Summary
Topics Covered
- Harness Design Matters as Much as Model Choice
- Context Anxiety Makes Models Rush and Quit Early
- Agents Cannot Reliably Self-Evaluate Their Work
- Adversarial Evaluation Beats Self-Evaluation
- Harnesses Must Evolve as Models Improve
Full Transcript
Yesterday, Anthropic published a fascinating article on harness design for long-running agents, and there are some interesting insights and honest admissions from the team that can really help us all when we're building specialized agent systems. In the post, they demonstrated how they built a 2D retro game engine over a six-hour autonomous coding session, and they also built a digital audio workstation in the browser over a four-hour period. And while these examples of long-running autonomous agents are specific to coding, the principles apply to all types of specialized agent systems, like compliance audits, risk analysis, content pipelines, and impact assessments. Last week, I published a video on this channel where I built a specialized agent harness into a custom Python and React app, and a custom harness like this could definitely be improved with some of the insights from Anthropic's blog post. And if you're not really sure what a harness is, it's essentially just the software and structure that wraps an AI model to keep it on track: an orchestration layer that includes the prompts, the tools, the feedback loops, the constraints, and the validation. So it's everything around the model that turns it into a reliable system. One analogy
for an agent harness is that of a car, where the model is the engine and the car itself is the harness. Without the car, the engine just sits there revving and you're not really getting anywhere. Actually, a better analogy is that of a horse and an actual harness. A wild horse has raw power, but it'll go wherever it wants, while the harness allows you to control that power, set it in a direction, and get where you want to go. And one of the key insights from this post is that for long-running complex tasks, the harness design is as important as the model itself.
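To make that concrete, here's a minimal sketch of what "everything around the model" can look like in code. This is illustrative, not Anthropic's implementation: the model is a stub function, and all the names here are invented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    """The 'car' around the model: prompts, tools, validation, and the loop."""
    system_prompt: str
    tools: dict            # tools the model could call (unused in this sketch)
    validators: list       # hard checks that can't lie, e.g. a linter wrapper
    max_iterations: int = 5

    def run(self, model: Callable[[str], str], task: str) -> str:
        output = ""
        for _ in range(self.max_iterations):
            output = model(f"{self.system_prompt}\nTask: {task}\nPrevious: {output}")
            if all(check(output) for check in self.validators):
                break      # every constraint satisfied, so stop looping
        return output

# Stub standing in for a real LLM call
harness = Harness(
    system_prompt="You are a coding agent.",
    tools={},
    validators=[lambda out: out.endswith("done")],
)
result = harness.run(lambda prompt: "feature built: done", "add a login page")
```

The point of the sketch is simply that the loop, the prompt assembly, and the validation all live outside the model, and that's the part you design.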
Yesterday's article builds on their previous post from last November, which talks about effective harnesses for long-running agents. And this article tackles the fundamental problem that the entire industry is trying to solve, which is: how do you give an AI agent a complex goal and let it work away over hours or even days to achieve it? And this is where real value can be created. From a dev perspective, it could be one-shotting a large feature or even an app itself. But in other industries, it could be a case of carrying out compliance audits that would normally take a full month's worth of effort. And the problem to be solved without a harness is that an AI agent might try to one-shot an entire app build in a single go. It might run out of context halfway through. It might leave work half finished and undocumented. Or it might declare the job done early just to finish up. So Anthropic's original solution was a two-part system: an initializer agent that set up the environment, broke the project into features, and created a progress-tracking file; and then a coding agent that worked one feature at a time, committing to git after each chunk and leaving clear artifacts for the next coding agent to take over. That way, you're decomposing the work, you're making incremental progress, and you're handing off context cleanly. And it's
fair to say that Anthropic were not the first to come up with this idea of long-running agent harnesses. Geoffrey Huntley came up with the Ralph Wiggum loop a few months before, which essentially allows you to run an agent within a loop and check the output against something that can't actually lie, such as a linter or a type checker, and it essentially keeps the loop progressing until it's done. So having these explicit stop conditions means that you can loop over and over with an agent to really make sure it's actually finished the job. And that gets more powerful if you bundle it with the likes of spec-driven development. Frameworks like BMAD or SpecKit or OpenSpec allow you to create structured requirements before the actual dev begins, and that way the agent isn't looping in isolation; it's working against a predefined plan. These frameworks solve the problem of an agent under-scoping the work to be done. But outside of external hard validation, it is the agent that's actually self-evaluating its work. So against that backdrop, even with these approaches of an agent harness and a Ralph Wiggum
loop, Anthropic observed two common failure modes when agents executed these types of tasks. And interestingly, these failure modes apply whether you're building an app or carrying out more general-purpose work like a research pipeline or a content pipeline. The first is what's known as context anxiety. As the context window fills up, the models don't just lose coherence, they actually change their behavior. They start wrapping up the conversation prematurely. They rush through steps and declare that things are done when they're not actually done. And you've probably noticed this yourself if you're chatting to an LLM in a single context window over a long period: it eventually gets shorter and shorter with you. There is a technique called context compaction, where the actual conversational thread is compacted and summarized to leave more room for usable context. But Anthropic found that even with context compaction, models like Sonnet 4.5 still showed context anxiety and tried to finish early. And the reason for that is you're not starting with a clean slate. That is why their original solution last November was a context reset: you start with a fresh context window, read the latest feature from the progress file, test features that were previously built, and then build out your specific task. Once you're finished, trigger the structured handoff, and then start the next agent with a clean slate. And as I mentioned, they found that Sonnet 4.5 exhibited these context anxiety symptoms, and that's why they built this context reset system, whereas Opus 4.5 does not have this problem to the same extent.
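That reset cycle can be sketched as a loop of short-lived agents sharing a progress file. The file format and the agent function here are hypothetical stand-ins, not Anthropic's actual code:

```python
import json
import tempfile
from pathlib import Path

def run_coding_agent(feature: str) -> str:
    """Stand-in for one fresh-context agent session building a single feature."""
    return f"built: {feature}"

def context_reset_loop(workdir: Path, features: list) -> list:
    """Each iteration simulates a brand-new context window: read the progress
    file, do one feature, write the structured handoff, and 'reset'."""
    progress_file = workdir / "progress.json"
    progress_file.write_text(json.dumps({"todo": features, "done": []}))
    artifacts = []
    while True:
        state = json.loads(progress_file.read_text())  # fresh agent reads the handoff
        if not state["todo"]:
            break                                      # clean slate, nothing left
        feature = state["todo"].pop(0)
        artifacts.append(run_coding_agent(feature))
        state["done"].append(feature)
        progress_file.write_text(json.dumps(state))    # handoff for the next agent
    return artifacts

built = context_reset_loop(Path(tempfile.mkdtemp()),
                           ["level editor", "sprite editor", "test mode"])
```

The key property is that no single "agent" ever sees the whole history; the progress file on disk is the only memory that survives each reset.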
And interestingly, in this blog post, when they moved to Opus 4.6, they found they didn't need to context reset at all. They were able to rely on context compaction and not have these anxiety symptoms where the LLM was trying to quit early. And what's interesting is that Anthropic recently brought out Opus 4.6 with a 1 million token context window, and they claim that retrieval quality generally holds up over the longer context. However, to be honest, I would take a bit of a cynical view on this. It's in Anthropic's best interest to process tokens, so I'm sure they're delighted if you're sending in a 1 million token request every time, even if some of that's going to be cached. So I definitely don't think it's the end of context resets as they've been designed to date. The second
failure mode is quite interesting, which is poor self-evaluation. Anthropic haven't talked about this before, but essentially, if you ask an agent to evaluate its own work, it's likely going to praise it, even, as the engineers said, if the quality is obviously mediocre to a human observer. So I think these are some interesting admissions. They talk about how Claude often produced outputs that, from a front-end design perspective, were bland at best. And when they were building this system, they penalized highly generic AI slop patterns. This idea of self-evaluation is a tricky subject because it's different for subjective versus objective tasks. In this article, they were focused on front-end design, trying to make subjective quality gradable, and that's the real challenge even for non-AI-coding use cases. How can you evaluate the quality of the writing style of an AI, or the visual design of a graphic, or the professionalism of a legal analysis, for example? And I think that's why AI coding use cases have taken off so much in the last 12 months: because you have verifiable outputs where you can run lint checkers, type checkers, regression tests, and browser tests, and essentially the AI can iterate on its own output. So this idea of making subjective qualities gradable means that an AI can actually evaluate them in a more objective way. Which then
leads us to the main solution that Anthropic came to in this post, which is the idea of adversarial evaluation. Inspired by GANs, where you have a generator and a discriminator, here we have a generator agent, which creates the code or the content, and an evaluator agent, whose job it is to judge the work and then, ideally, grade it so that it's somewhat objective, and then send that feedback back to the main agent. And like a GAN, the idea here is that the tension between these agents should improve the quality. And they found that it was far harder, in isolation, to get the likes of a generator agent to be more skeptical about the work it just created, versus having a dedicated QA or evaluator agent whose system prompt is dedicated to being skeptical about the work it was just handed. And then, once the generator agent has received the feedback from the evaluator, it has something concrete to iterate on. So I know what you're thinking. This is absolutely not new.
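In code, the shape of that generator/evaluator loop is roughly the following. The agents here are stubs with invented behavior, purely to show the control flow:

```python
def adversarial_loop(generator, evaluator, prompt: str,
                     threshold: float = 0.8, max_rounds: int = 15) -> str:
    """Generator produces work; a separately-prompted evaluator grades it and
    returns concrete feedback until the grade clears the bar."""
    work, feedback = "", ""
    for _ in range(max_rounds):
        work = generator(prompt, feedback)
        score, feedback = evaluator(work)   # skeptical judge, not self-praise
        if score >= threshold:
            break                           # good enough to ship
    return work

# Stub agents: the evaluator rejects the first draft, the revision passes
def stub_generator(prompt, feedback):
    return "revised draft" if feedback else "first draft"

def stub_evaluator(work):
    if work == "first draft":
        return 0.4, "layout is generic AI slop; rework the hero section"
    return 0.9, ""

final = adversarial_loop(stub_generator, stub_evaluator, "a Dutch art museum site")
```

The design choice that matters is that the evaluator has its own prompt and its own incentive (be skeptical), rather than the generator being asked to critique itself.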
Multi-agent systems have existed for years at this point, and the idea of an evaluator agent is no different to LLM-as-a-judge, for example. But I think the difference here is that you essentially have an evaluator wired into a production loop rather than used as an ad hoc evaluation tool. And from looking at the blog post, it really wasn't plain sailing. Another interesting admission: out of the box, Claude is a poor QA agent. They talk about how, in early runs, Claude identified legitimate issues and then talked itself into deciding they weren't a big deal and approved the work anyway. It also tended to test superficially rather than probing edge cases, so more subtle bugs often slipped through. The evaluator agent took multiple rounds of iteration and refinement; it really wasn't a plug-and-play solution. And having gone
plug-and-play solution. And having gone through that experience, they found that there was three things that were required to really make an evaluator agent work. And the first one I've
agent work. And the first one I've already touched on which is the idea of making subjective quality gradable from a front-end design perspective as opposed to just asking is this design
beautiful instead it's does this follow our principles for good design and then they define the principles. So the
grading criteria they created one was on design quality another on originality to avoid the AI slop another on craft which is technical execution and another on
functionality. The second learning was
functionality. The second learning was the need to weight the criteria towards the model's capabilities as they found that Opus scored well on two out of those four criterias but struggled on
the other two and through iteration they ended up weighting that criteria heavier to avoid AI slot patterns. And the third learning was that you have to let the evaluator interact with the output. So
they use the playride MCP so that the evaluator has tools it can use to actually navigate the app, screenshot it and test it like a real user. And that's
not to say there's not still room for more deterministic testing. In my last video, I talked about how Stripe created their concept of minions where each code change results in testing against a
subset of their 3 million tests in their test suite. That's not being prompted by
test suite. That's not being prompted by an evaluator agent that is hard baked, hardcoded into their deployment system.
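The weighted-rubric idea is simple to express in code. These criterion names come from the post, but the weights and scores below are made up for illustration; Anthropic hasn't published the actual numbers:

```python
# Overweight the dimension the model struggles with (originality, per the
# post's anti-slop learning); these weights are illustrative assumptions.
WEIGHTS = {
    "design_quality": 1.0,
    "originality": 2.0,    # counted double to penalize generic output
    "craft": 1.0,
    "functionality": 1.0,
}

def weighted_grade(scores: dict) -> float:
    """Collapse per-criterion scores (0-1) into a single 0-1 grade."""
    total_weight = sum(WEIGHTS.values())
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) / total_weight

grade = weighted_grade({
    "design_quality": 0.9,
    "originality": 0.5,    # the weak spot drags the grade down hardest
    "craft": 0.8,
    "functionality": 0.9,
})
```

With these numbers, the grade works out to 0.72 rather than the unweighted 0.775, because the originality weakness counts double.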
So within the case study, they carried out three experiments. The first one was on front-end design, where they prompted the model to create a website for a Dutch art museum. And as you can see here, these are some of the earlier iterations, and after 10 rounds of feedback, they ended up with something that was completely unique, which is essentially a 3D room with a checkered floor. And they talked about how that was kind of a creative leap they hadn't seen before from a single-pass generation, which is interesting when it comes to iterating on something as subjective as front-end design. So this is what that design harness looked like. It was essentially a single one-sentence prompt, which went to a generator agent that created HTML, CSS, and JavaScript, then to the evaluator agent, which has the ability to interact with it via the Playwright MCP, and then, after between 5 and 15 iterations of feedback, you end up with your finished product. And obviously, all of these harnesses were built on the Claude Agent SDK. But that's not to say that you can't take the learnings of how to create these effective harnesses and build them on competitor models or local open-source models. So the second experiment was to scale up to full-stack coding. Here they introduced the planner agent, and the prompt was to create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode. And they ran this twice: once with the solo harness and once with the full harness of the planner, generator, and evaluator.
It was obviously a lot more expensive and took a lot longer to run the full harness, but they found that with the solo run, which was a lot cheaper and faster, the game basically didn't work. It looked okay, but it was not actually playable in any sense, whereas the game built by the full harness actually was functional. So within this version of the harness, the prompt went to the planner agent, which dramatically expanded out the spec and split it into sprints. Now, we're using Opus 4.5 here, which is important. But within each sprint, there was a contract negotiation between the generator agent and the evaluator agent: defining the definition of done up front, before anything was actually built. That way, the generator agent couldn't move the goalposts halfway through the build and say, "Ah, it's done." The evaluator agent then evaluated once per sprint, and then there was a context reset and a handoff to the next sprint.
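One way to picture that definition-of-done contract is an immutable checklist that the evaluator holds the generator to. This is a hypothetical sketch of the idea, not Anthropic's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SprintContract:
    """Agreed before any code is written; frozen so the goalposts can't move."""
    sprint: str
    acceptance_criteria: tuple

def sprint_done(contract: SprintContract, verified: set) -> bool:
    """The evaluator checks the original criteria, not the generator's claims."""
    return all(criterion in verified for criterion in contract.acceptance_criteria)

contract = SprintContract(
    sprint="level editor",
    acceptance_criteria=("tiles can be placed", "levels save and reload"),
)
premature = sprint_done(contract, {"tiles can be placed"})   # generator says "done"
complete = sprint_done(contract, {"tiles can be placed", "levels save and reload"})
```

Because the contract is fixed before the build, "done" is defined by what the evaluator can verify, not by whatever the generator claims at the end of the sprint.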
So not having a planner agent would mean that you would dramatically under-scope the project, because you've only sent in a single sentence describing the game you want built. And without the evaluator agent, the generator would just over-approve all its work and quit early. Then, during these experiments, Opus 4.6 was released. So they created a second version of the harness, and they decided to simplify things, because to build effective agents, you should always look for the simplest solution possible and not over-complicate or over-engineer it. They had some trial and error in trying to simplify the architecture, and they ended up with this: they removed the sprints, they removed the contract negotiation, and they removed the context resets, relying instead on context compaction. So again, we had a single-sentence prompt to build an app. It went to the planner agent, which expanded it into a much larger spec, and then that full spec was passed to the generator agent to essentially one-shot the entire app in one continuous session, relying on the Claude Agent SDK, which has context compaction built in. The evaluator agent in this case only ran at the end of the full build, and it would then provide feedback so that the generator could iterate on it. So I
think there is a level of Anthropic just showing off how good Opus 4.6 is: that it can build an entire app, and that you don't need these harnesses with a 1 million token context window. It's not exactly going to be cheap, but they do seem convinced that the context rot problem and the context anxiety problem don't exist in this version of the model. So for this version two of the harness, the prompt was to build a fully featured DAW in the browser using the Web Audio API. So this is a digital audio workstation.
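The simplified v2 flow (plan once, build in one continuous session, evaluate only at the end) can be sketched like this. The three agents are stubs, and the control flow is my reading of the post, not their code:

```python
def v2_harness(planner, generator, evaluator, prompt: str, max_revisions: int = 3) -> str:
    """Plan -> one continuous build (context compaction assumed inside the
    generator) -> evaluate at the end -> revise on the feedback."""
    spec = planner(prompt)                  # one sentence expanded into a full spec
    build = generator(spec, feedback="")
    for _ in range(max_revisions):
        feedback = evaluator(build)
        if not feedback:
            break                           # evaluator has no remaining complaints
        build = generator(spec, feedback=feedback)
    return build

# Stubs: the evaluator complains once, then approves the revision
complaints = ["transport controls are missing"]
stub_planner = lambda prompt: f"full spec for: {prompt}"
stub_generator = lambda spec, feedback: "build v2" if feedback else "build v1"
stub_evaluator = lambda build: complaints.pop() if complaints else ""

app = v2_harness(stub_planner, stub_generator, stub_evaluator, "a DAW in the browser")
```

Compared to the v1 harness, there are no sprints, no contracts, and no resets; the only loop left is the end-of-build revision cycle.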
And here is the phase-by-phase breakdown. The planner agent took 5 minutes and 50 cents, and then the first full build took 2 hours and $71. The evaluator, or QA agent, took essentially 10 minutes, then another hour for the second build, and 10 minutes for the third build. So all in all, it was around four hours and $125 in total, all relying on the Claude Agent SDK, which has context compaction so that you can keep it running over the long term. Now of
course, you could build this yourself. You don't need to use the Claude Agent SDK; you could build the functionality you need for these agents to run over a long horizon. And so this is the finished product. They did want a fully functional digital audio workstation, and I don't think this is fully functional in the sense that, as they say, it's far from a professional music production program. That being said, it cost $100 to create, so maybe that's not to be expected either, but it does seem to have some of the key features you would actually need. And this brings us to one of the big take-home points from this article, which is the idea of harness evolution: every component in a harness essentially encodes an assumption that the model can't carry out that task itself, and the context reset is a perfect example of that. Sonnet 4.5 had context anxiety, so they had to build an entire context handoff into the harness. What Anthropic are saying is that those assumptions go stale as the models improve, as they saw with the leap to Opus 4.6. And they also found that the value of the evaluator's work depended on how good the model was. If you're really stretching the model to its limits, it really is important to have an evaluator to make sure it's done everything it needs to do, whereas if you're asking the model to do something that's very much in its wheelhouse, you might not need an evaluator agent at all. And I suppose this is the real work of harness and context engineering, because it's never a one-shot setup.
You do need to refine and iterate as you go. As I mentioned earlier, we're in the middle of an AI builder series on this channel where we build out a full-stack AI agent platform. It's not using the Claude Agent SDK; it's a completely custom Python and React app with a Supabase back end. And last week, we built a specialized contract review harness into this system. I'll leave a link for it in the card above. And if you're interested in going beyond the theory and actually building these harnesses for yourself, then check out the link in the description to our community, The AI Automators, where you can access our full AI builder course and codebase. We have a private community of serious builders all creating specialized harnesses and advanced RAG systems. We'd love to see you there, so check out the link below. Thanks so much for watching, and I'll see you in the next one.