Anthropic Just Dropped the New Blueprint for Long-Running AI Agents.
By The AI Automators
Summary
Topics Covered
- Harness Design Matters as Much as Model Choice
- Context Anxiety Makes Models Rush and Quit Early
- Agents Cannot Reliably Self-Evaluate Their Work
- Adversarial Evaluation Beats Self-Evaluation
- Harnesses Must Evolve as Models Improve
Full Transcript
Yesterday, Anthropic published a fascinating article on harness design for long-running agents, and there are some interesting insights and honest admissions from the team that can really help us all when we're building specialized agent systems. In the post, they demonstrated how they built a 2D retro game engine over a six-hour autonomous coding session, and they also built a digital audio workstation in the browser over a four-hour period. And while these examples of long-running autonomous agents are specific to coding, the principles apply to all types of specialized agent systems, like compliance audits, risk analysis, content pipelines, and impact assessments. Last week, I published a video on this channel where I built a specialized agent harness into a custom Python and React app, and a custom harness like this could definitely be improved with some of the insights from Anthropic's blog post. And if you're not really sure what a harness is, it's essentially just the software and structure that wraps an AI model to keep it on track: an orchestration layer that includes the prompts, the tools, the feedback loops, the constraints, and the validation. So it's everything around the model that turns it into a reliable system. One analogy
for an agent harness is that of a car, where the model is the engine and the car itself is the harness. Without the car, the engine just sits there revving and you're not really getting anywhere. Actually, a better analogy is that of a horse and an actual harness. A wild horse has raw power, but it'll go wherever it wants, while the harness allows you to control that power, set it in a direction, and get where you want to go. And one of the key insights from this post is that for long-running complex tasks, the harness design is as important as the model itself.
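To make that concrete, here's a minimal sketch of what "everything around the model" can look like in code. This is illustrative, not Anthropic's implementation: the model is a stub function, and all the names here are invented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    """The 'car' around the model: prompts, tools, validation, and the loop."""
    system_prompt: str
    tools: dict            # tools the model could call (unused in this sketch)
    validators: list       # hard checks that can't lie, e.g. a linter wrapper
    max_iterations: int = 5

    def run(self, model: Callable[[str], str], task: str) -> str:
        output = ""
        for _ in range(self.max_iterations):
            output = model(f"{self.system_prompt}\nTask: {task}\nPrevious: {output}")
            if all(check(output) for check in self.validators):
                break      # every constraint satisfied, so stop looping
        return output

# Stub standing in for a real LLM call
harness = Harness(
    system_prompt="You are a coding agent.",
    tools={},
    validators=[lambda out: out.endswith("done")],
)
result = harness.run(lambda prompt: "feature built: done", "add a login page")
```

The point of the sketch is simply that the loop, the prompt assembly, and the validation all live outside the model, and that's the part you design.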
Yesterday's article builds on their previous post from last November, which talks about effective harnesses for long-running agents. And this article tackles the fundamental problem that the entire industry is trying to solve, which is: how do you give an AI agent a complex goal and let it work away over hours or even days to achieve it? And this is where real value can be created. From a dev perspective, it could be one-shotting a large feature or even an app itself. But in other industries, it could be a case of carrying out compliance audits that would normally take a full month's worth of effort. And the problem to be solved without a harness is that an AI agent might try to one-shot an entire app build in a single go. It might run out of context halfway through. It might leave work half finished and undocumented. Or it might declare the job done early just to finish up. So Anthropic's original solution was a two-part system: an initializer agent that set up the environment, broke the project into features, and created a progress-tracking file; and then a coding agent that worked one feature at a time, committing to git after each chunk and leaving clear artifacts for the next coding agent to take over. That way, you're decomposing the work, you're making incremental progress, and you're handing off context cleanly. And it's
fair to say that Anthropic were not the first to come up with this idea of long-running agent harnesses. Geoffrey Huntley came up with the Ralph Wiggum loop a few months before, which essentially allows you to run an agent within a loop and check the output against something that can't actually lie, such as a linter or a type checker, and it essentially keeps the loop progressing until it's done. So having these explicit stop conditions means that you can loop over and over with an agent to really make sure it's actually finished the job. And that gets more powerful if you bundle it with the likes of spec-driven development. Frameworks like BMAD or SpecKit or OpenSpec allow you to create structured requirements before the actual dev begins, and that way the agent isn't looping in isolation; it's working against a predefined plan. These frameworks solve the problem of an agent under-scoping the work to be done. But outside of external hard validation, it is the agent that's actually self-evaluating its work. So against that backdrop, even with these approaches of an agent harness and a Ralph Wiggum
loop, Anthropic observed two common failure modes when agents executed these types of tasks. And interestingly, these failure modes apply whether you're building an app or carrying out more general-purpose work like a research pipeline or a content pipeline. The first is what's known as context anxiety. As the context window fills up, the models don't just lose coherence, they actually change their behavior. They start wrapping up the conversation prematurely. They rush through steps and declare that things are done when they're not actually done. And you've probably noticed this yourself if you're chatting to an LLM in a single context window over a long period: it eventually gets shorter and shorter with you. There is a technique called context compaction, where the actual conversational thread is compacted and summarized to leave more room for usable context. But Anthropic found that even with context compaction, models like Sonnet 4.5 still showed context anxiety and tried to finish early. And the reason for that is you're not starting with a clean slate. That is why their original solution last November was a context reset: you start with a fresh context window, read the latest feature from the progress file, test features that were previously built, and then build out your specific task. Once you're finished, trigger the structured handoff, and then start the next agent with a clean slate. And as I mentioned, they found that Sonnet 4.5 exhibited these context anxiety symptoms, and that's why they built this context reset system, whereas Opus 4.5 does not have this problem to the same extent.
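That reset cycle can be sketched as a loop of short-lived agents sharing a progress file. The file format and the agent function here are hypothetical stand-ins, not Anthropic's actual code:

```python
import json
import tempfile
from pathlib import Path

def run_coding_agent(feature: str) -> str:
    """Stand-in for one fresh-context agent session building a single feature."""
    return f"built: {feature}"

def context_reset_loop(workdir: Path, features: list) -> list:
    """Each iteration simulates a brand-new context window: read the progress
    file, do one feature, write the structured handoff, and 'reset'."""
    progress_file = workdir / "progress.json"
    progress_file.write_text(json.dumps({"todo": features, "done": []}))
    artifacts = []
    while True:
        state = json.loads(progress_file.read_text())  # fresh agent reads the handoff
        if not state["todo"]:
            break                                      # clean slate, nothing left
        feature = state["todo"].pop(0)
        artifacts.append(run_coding_agent(feature))
        state["done"].append(feature)
        progress_file.write_text(json.dumps(state))    # handoff for the next agent
    return artifacts

built = context_reset_loop(Path(tempfile.mkdtemp()),
                           ["level editor", "sprite editor", "test mode"])
```

The key property is that no single "agent" ever sees the whole history; the progress file on disk is the only memory that survives each reset.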
And interestingly, in this blog post, when they moved to Opus 4.6, they found they didn't need to context reset at all. They were able to rely on context compaction and not have these anxiety symptoms where the LLM was trying to quit early. And what's interesting is that Anthropic recently brought out Opus 4.6 with a 1 million token context window, and they claim that retrieval quality generally holds up over the longer context. However, to be honest, I would take a bit of a cynical view on this. It's in Anthropic's best interest to process tokens, so I'm sure they're delighted if you're sending in a 1 million token request every time, even if some of that's going to be cached. So I definitely don't think it's the end of context resets as they've been designed to date. The second
failure mode is quite interesting, which is poor self-evaluation. Anthropic haven't talked about this before, but essentially, if you ask an agent to evaluate its own work, it's likely going to praise it, even, as the engineers said, if the quality is obviously mediocre to a human observer. So I think these are some interesting admissions. They talk about how Claude often produced outputs that, from a front-end design perspective, were bland at best. And when they were building this system, they penalized highly generic AI slop patterns. This idea of self-evaluation is a tricky subject because it's different for subjective versus objective tasks. In this article, they were focused on front-end design, trying to make subjective quality gradable, and that's the real challenge even for non-AI-coding use cases. How can you evaluate the quality of the writing style of an AI, or the visual design of a graphic, or the professionalism of a legal analysis, for example? And I think that's why AI coding use cases have taken off so much in the last 12 months: because you have verifiable outputs where you can run lint checkers, type checkers, regression tests, and browser tests, and essentially the AI can iterate on its own output. So this idea of making subjective qualities gradable means that an AI can actually evaluate them in a more objective way. Which then
leads us to the main solution that Anthropic came to in this post, which is the idea of adversarial evaluation. Inspired by GANs, where you have a generator and a discriminator, here we have a generator agent, which creates the code or the content, and an evaluator agent, whose job it is to judge the work and then, ideally, grade it so that it's somewhat objective, and then send that feedback back to the main agent. And like a GAN, the idea here is that the tension between these agents should improve the quality. And they found that it was far harder, in isolation, to get the likes of a generator agent to be more skeptical about the work it just created, versus having a dedicated QA or evaluator agent whose system prompt is dedicated to being skeptical about the work it was just handed. And then, once the generator agent has received the feedback from the evaluator, it has something concrete to iterate on. So I know what you're thinking. This is absolutely not new.
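In code, the shape of that generator/evaluator loop is roughly the following. The agents here are stubs with invented behavior, purely to show the control flow:

```python
def adversarial_loop(generator, evaluator, prompt: str,
                     threshold: float = 0.8, max_rounds: int = 15) -> str:
    """Generator produces work; a separately-prompted evaluator grades it and
    returns concrete feedback until the grade clears the bar."""
    work, feedback = "", ""
    for _ in range(max_rounds):
        work = generator(prompt, feedback)
        score, feedback = evaluator(work)   # skeptical judge, not self-praise
        if score >= threshold:
            break                           # good enough to ship
    return work

# Stub agents: the evaluator rejects the first draft, the revision passes
def stub_generator(prompt, feedback):
    return "revised draft" if feedback else "first draft"

def stub_evaluator(work):
    if work == "first draft":
        return 0.4, "layout is generic AI slop; rework the hero section"
    return 0.9, ""

final = adversarial_loop(stub_generator, stub_evaluator, "a Dutch art museum site")
```

The design choice that matters is that the evaluator has its own prompt and its own incentive (be skeptical), rather than the generator being asked to critique itself.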
Multi-agent systems have existed for years at this point, and the idea of an evaluator agent is no different to LLM-as-a-judge, for example. But I think the difference here is that you essentially have an evaluator wired into a production loop rather than used as an ad hoc evaluation tool. And from looking at the blog post, it really wasn't plain sailing. Another interesting admission: out of the box, Claude is a poor QA agent. They talk about how, in early runs, Claude identified legitimate issues and then talked itself into deciding they weren't a big deal and approved the work anyway. It also tended to test superficially rather than probing edge cases, so more subtle bugs often slipped through. The evaluator agent took multiple rounds of iteration and refinement; it really wasn't a plug-and-play solution. And having gone
plug-and-play solution. And having gone through that experience, they found that there was three things that were required to really make an evaluator agent work. And the first one I've
agent work. And the first one I've already touched on which is the idea of making subjective quality gradable from a front-end design perspective as opposed to just asking is this design
beautiful instead it's does this follow our principles for good design and then they define the principles. So the
grading criteria they created one was on design quality another on originality to avoid the AI slop another on craft which is technical execution and another on
functionality. The second learning was
functionality. The second learning was the need to weight the criteria towards the model's capabilities as they found that Opus scored well on two out of those four criterias but struggled on
the other two and through iteration they ended up weighting that criteria heavier to avoid AI slot patterns. And the third learning was that you have to let the evaluator interact with the output. So
they use the playride MCP so that the evaluator has tools it can use to actually navigate the app, screenshot it and test it like a real user. And that's
not to say there's not still room for more deterministic testing. In my last video, I talked about how Stripe created their concept of minions where each code change results in testing against a
subset of their 3 million tests in their test suite. That's not being prompted by
test suite. That's not being prompted by an evaluator agent that is hard baked, hardcoded into their deployment system.
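The weighted-rubric idea is simple to express in code. These criterion names come from the post, but the weights and scores below are made up for illustration; Anthropic hasn't published the actual numbers:

```python
# Overweight the dimension the model struggles with (originality, per the
# post's anti-slop learning); these weights are illustrative assumptions.
WEIGHTS = {
    "design_quality": 1.0,
    "originality": 2.0,    # counted double to penalize generic output
    "craft": 1.0,
    "functionality": 1.0,
}

def weighted_grade(scores: dict) -> float:
    """Collapse per-criterion scores (0-1) into a single 0-1 grade."""
    total_weight = sum(WEIGHTS.values())
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) / total_weight

grade = weighted_grade({
    "design_quality": 0.9,
    "originality": 0.5,    # the weak spot drags the grade down hardest
    "craft": 0.8,
    "functionality": 0.9,
})
```

With these numbers, the grade works out to 0.72 rather than the unweighted 0.775, because the originality weakness counts double.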
So within the case study, they carried out three experiments. The first one was on front-end design, where they prompted the model to create a website for a Dutch art museum. And as you can see here, these are some of the earlier iterations, and after 10 rounds of feedback, they ended up with something that was completely unique, which is essentially a 3D room with a checkered floor. And they talked about how that was kind of a creative leap they hadn't seen before from a single-pass generation, which is interesting when it comes to iterating on something as subjective as front-end design. So this is what that design harness looked like. It was essentially a single one-sentence prompt, which went to a generator agent that created HTML, CSS, and JavaScript, then to the evaluator agent, which has the ability to interact with it via the Playwright MCP, and then, after between 5 and 15 iterations of feedback, you end up with your finished product. And obviously, all of these harnesses were built on the Claude Agent SDK. But that's not to say that you can't take the learnings of how to create these effective harnesses and build them on competitor models or local open-source models. So the second experiment was to scale up to full-stack coding. Here they introduced the planner agent, and the prompt was to create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode. And they ran this twice: once with the solo harness and once with the full harness of the planner, generator, and evaluator.
It was obviously a lot more expensive and took a lot longer to run the full harness, but they found that with the solo run, which was a lot cheaper and faster, the game basically didn't work. It looked okay, but it was not actually playable in any sense, whereas the game built by the full harness actually was functional. So within this version of the harness, the prompt went to the planner agent, which dramatically expanded out the spec and split it into sprints. Now, we're using Opus 4.5 here, which is important. But within each sprint, there was a contract negotiation between the generator agent and the evaluator agent: defining the definition of done up front, before anything was actually built. That way, the generator agent couldn't move the goalposts halfway through the build and say, "Ah, it's done." The evaluator agent then evaluated once per sprint, and then there was a context reset and a handoff to the next sprint.
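One way to picture that definition-of-done contract is an immutable checklist that the evaluator holds the generator to. This is a hypothetical sketch of the idea, not Anthropic's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SprintContract:
    """Agreed before any code is written; frozen so the goalposts can't move."""
    sprint: str
    acceptance_criteria: tuple

def sprint_done(contract: SprintContract, verified: set) -> bool:
    """The evaluator checks the original criteria, not the generator's claims."""
    return all(criterion in verified for criterion in contract.acceptance_criteria)

contract = SprintContract(
    sprint="level editor",
    acceptance_criteria=("tiles can be placed", "levels save and reload"),
)
premature = sprint_done(contract, {"tiles can be placed"})   # generator says "done"
complete = sprint_done(contract, {"tiles can be placed", "levels save and reload"})
```

Because the contract is fixed before the build, "done" is defined by what the evaluator can verify, not by whatever the generator claims at the end of the sprint.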
So not having a planner agent would mean that you would dramatically under-scope the project, because you've only sent in a single sentence describing the game you want built. And without the evaluator agent, the generator would just over-approve all its work and quit early. Then, during these experiments, Opus 4.6 was released. So they created a second version of the harness, and they decided to simplify things, because to build effective agents, you should always look for the simplest solution possible and not over-complicate or over-engineer it. They had some trial and error in trying to simplify the architecture, and they ended up with this: they removed the sprints, they removed the contract negotiation, and they removed the context resets, relying instead on context compaction. So again, we had a single-sentence prompt to build an app. It went to the planner agent, which expanded it into a much larger spec, and then that full spec was passed to the generator agent to essentially one-shot the entire app in one continuous session, relying on the Claude Agent SDK, which has context compaction built in. The evaluator agent in this case only ran at the end of the full build, and it would then provide feedback so that the generator could iterate on it. So I
think there is a level of Anthropic just showing off how good Opus 4.6 is: that it can build an entire app, and that you don't need these harnesses with a 1 million token context window. It's not exactly going to be cheap, but they do seem convinced that the context rot problem and the context anxiety problem don't exist in this version of the model. So for this version two of the harness, the prompt was to build a fully featured DAW in the browser using the Web Audio API. So this is a digital audio workstation.
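The simplified v2 flow (plan once, build in one continuous session, evaluate only at the end) can be sketched like this. The three agents are stubs, and the control flow is my reading of the post, not their code:

```python
def v2_harness(planner, generator, evaluator, prompt: str, max_revisions: int = 3) -> str:
    """Plan -> one continuous build (context compaction assumed inside the
    generator) -> evaluate at the end -> revise on the feedback."""
    spec = planner(prompt)                  # one sentence expanded into a full spec
    build = generator(spec, feedback="")
    for _ in range(max_revisions):
        feedback = evaluator(build)
        if not feedback:
            break                           # evaluator has no remaining complaints
        build = generator(spec, feedback=feedback)
    return build

# Stubs: the evaluator complains once, then approves the revision
complaints = ["transport controls are missing"]
stub_planner = lambda prompt: f"full spec for: {prompt}"
stub_generator = lambda spec, feedback: "build v2" if feedback else "build v1"
stub_evaluator = lambda build: complaints.pop() if complaints else ""

app = v2_harness(stub_planner, stub_generator, stub_evaluator, "a DAW in the browser")
```

Compared to the v1 harness, there are no sprints, no contracts, and no resets; the only loop left is the end-of-build revision cycle.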
And here is the phase-by-phase breakdown. The planner agent took 5 minutes and 50 cents, and then the first full build took 2 hours and $71. The evaluator, or QA agent, took essentially 10 minutes, then another hour for the second build, and 10 minutes for the third build. So all in all, it was around four hours and $125 in total, all relying on the Claude Agent SDK, which has context compaction so that you can keep it running over the long term. Now of
course, you could build this yourself. You don't need to use the Claude Agent SDK; you could build the functionality you need for these agents to run over a long horizon. And so this is the finished product. They did want a fully functional digital audio workstation, and I don't think this is fully functional in the sense that, as they say, it's far from a professional music production program. That being said, it cost $100 to create, so maybe that's not to be expected either, but it does seem to have some of the key features you would actually need. And this brings us to one of the big take-home points from this article, which is the idea of harness evolution: every component in a harness essentially encodes an assumption that the model can't carry out that task itself, and the context reset is a perfect example of that. Sonnet 4.5 had context anxiety, so they had to build an entire context handoff into the harness. What Anthropic are saying is that those assumptions go stale as the models improve, as they saw with the leap to Opus 4.6. And they also found that the value of the evaluator's work depended on how good the model was. If you're really stretching the model to its limits, it really is important to have an evaluator to make sure it's done everything it needs to do, whereas if you're asking the model to do something that's very much in its wheelhouse, you might not need an evaluator agent at all. And I suppose this is the real work of harness and context engineering, because it's never a one-shot setup.
You do need to refine and iterate as you go. As I mentioned earlier, we're in the middle of an AI builder series on this channel where we build out a full-stack AI agent platform. It's not using the Claude Agent SDK; it's a completely custom Python and React app with a Supabase back end. And last week, we built a specialized contract review harness into this system. I'll leave a link for it in the card above. And if you're interested in going beyond the theory and actually building these harnesses for yourself, then check out the link in the description to our community, The AI Automators, where you can access our full AI builder course and codebase. We have a private community of serious builders all creating specialized harnesses and advanced RAG systems. We'd love to see you there, so check out the link below. Thanks so much for watching, and I'll see you in the next one.