Rethinking AI Agents: The Rise of Harness Engineering
By PY
Summary
## Key takeaways
- **Harness Outperforms Model Upgrades**: Stanford research shows orchestration code now drives more performance variation than the underlying model itself, with the same model and benchmark producing six times the performance difference depending on harness design. [00:06], [00:27]
- **Harness Is an Operating System**: The harness acts like an operating system for a language model: the context window is RAM, external databases serve as disk, and tool integrations function as device drivers coordinating what the CPU sees and when. [00:52], [01:21]
- **Self-Evolution Beats Verifiers**: Ablation testing revealed that self-evolution was the only consistently helpful module (+4.8 on SWE-bench, +2.7 on OSWorld), while verifiers actively hurt performance (−0.8 and −8.4) and multi-candidate search also degraded results. [05:24], [05:54]
- **Representation Drives Gains**: Migrating OSWorld's logic into a natural-language harness representation boosted performance from 30.4% to 47.2%, cut runtime from 361 to 141 minutes, and reduced LLM calls from 1,200 to just 34; the representation itself drove the gain. [06:08], [06:33]
- **Smaller Model Outranks Larger via Harness**: Meta-harness optimization allowed Haiku (a smaller model) to rank higher than Opus through harness optimization alone, scoring 76.4% on Terminal-Bench 2 as the only automatically optimized system in a field of hand-engineered entries. [08:27], [08:42]
- **Harness Is the Reusable Asset**: A harness optimized on one model transferred successfully to five others, improving all of them; the reusable asset in AI systems is the harness, not the model, fundamentally shifting where to invest engineering effort. [08:56], [09:10]
Topics Covered
- Six Times Performance: Harness Over Model
- Natural Language Control Enables Clean Ablation
- Smaller Model Beats Larger Ones Through Harness Optimization Alone
- Mature Harness Work Means Pruning Structure, Not Building It
Full Transcript
Same model, same benchmark, six times the performance difference. Stanford researchers found that the orchestration code wrapping a language model now drives more performance variation than the model itself. LangChain confirmed it by modifying only harness infrastructure: their coding agent jumped from outside the top 30 to rank five on Terminal-Bench 2. Two March 2026 papers now formalize this from complementary directions, and what they found redefines what we should actually be optimizing when we build agents. Agent equals model plus harness. "If you're not the model, you're the harness." That's how LangChain frames it.
The sharpest definition of what agents actually are. But what does the harness half look like? The operating system analogy captures it. A raw LLM is a CPU: powerful but inert. No RAM, no disk, no I/O. The context window acts as RAM, fast but limited. External databases serve as disk. Tool integrations are device drivers. The harness is the operating system coordinating what the CPU sees and when. Concretely, it is everything that isn't model weights: system prompts, tool definitions, orchestration logic, memory management, verification loops, safety guardrails.
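The operating-system analogy can be made concrete with a toy sketch. Everything here (`ToyHarness`, the `tool:<name>:<arg>` convention) is invented for illustration and corresponds to no system discussed in the talk: the context window is a bounded message list, a dict stands in for disk, and plain functions act as device drivers.

```python
# A toy "operating system" harness: context window as RAM, a key-value
# store as disk, tool functions as drivers. All names are illustrative.
class ToyHarness:
    def __init__(self, call_model, tools, context_limit=8):
        self.call_model = call_model        # the "CPU": any fn(messages) -> str
        self.tools = tools                  # "device drivers": name -> fn
        self.context_limit = context_limit  # "RAM" capacity, in messages
        self.messages = []                  # the context window
        self.disk = {}                      # "disk": survives context eviction

    def remember(self, key, value):
        self.disk[key] = value              # page out to durable storage

    def add(self, message):
        self.messages.append(message)
        # Evict the oldest non-initial message when "RAM" is full,
        # keeping the first (system-like) message pinned.
        while len(self.messages) > self.context_limit:
            self.messages.pop(1)

    def step(self, user_input):
        self.add({"role": "user", "content": user_input})
        reply = self.call_model(self.messages)
        # Dispatch tool calls expressed as "tool:<name>:<arg>".
        if reply.startswith("tool:"):
            _, name, arg = reply.split(":", 2)
            result = self.tools[name](arg)
            self.add({"role": "tool", "content": result})
            return result
        self.add({"role": "assistant", "content": reply})
        return reply
```

A real harness would parse structured tool calls rather than a string convention, but the shape is the same: the loop around the model, not the model, decides what enters the context and when a driver runs.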
Anthropic identified five canonical patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer loops. Each is a different strategy for when and how the model gets called. Every production agent combines these patterns. And those architectural choices, not the model underneath, drive the 6x performance gaps.
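Two of the five patterns are simple enough to sketch in a few lines. `call_model` is a placeholder for any LLM client, and the prompts are invented here, not Anthropic's:

```python
# Sketches of two canonical patterns. `call_model` stands in for any
# LLM client function taking a prompt and returning a string.

def route(query, call_model):
    """Routing: a cheap classification call picks the downstream handler."""
    label = call_model(f"Classify as 'code' or 'prose': {query}")
    handlers = {
        "code": lambda q: call_model(f"You are a coding agent. {q}"),
        "prose": lambda q: call_model(f"You are an editor. {q}"),
    }
    return handlers.get(label, handlers["prose"])(query)

def chain(query, call_model):
    """Prompt chaining: each step consumes the previous step's output."""
    steps = ["Outline: {}", "Draft from outline: {}", "Tighten the draft: {}"]
    out = query
    for template in steps:
        out = call_model(template.format(out))
    return out
```

Parallelization, orchestrator-workers, and evaluator-optimizer loops compose the same primitive (a model call) with fan-out, delegation, and feedback respectively.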
If harnesses matter this much, how were people building them? Messily: logic scattered across controller code, framework defaults, verifier scripts. Two systems that nominally differed by one design choice actually differed in prompts, tools, verification gates, and state semantics simultaneously. Anthropic's own evolution exposes the pattern.
Naive harnesses suffer two failure modes: one-shotting, where the agent tries everything at once and exhausts its context, and premature completion, where a later session sees partial progress and declares victory. Their fix evolved into a three-agent, GAN-inspired architecture: planner, generator, and evaluator, with the evaluator clicking through the running app like a real user. Twenty times more expensive, $200 versus $9, but now the core thing worked instead of being broken.
OpenAI converged independently. Five months, a million lines of application logic, tests, CI, and tooling. Zero manually written. And their discovery: the engineering team's primary job became enabling agents to do useful work. Productive, but ad hoc, non-portable, impossible to ablate. Standards did emerge. AGENTS.md reached 60,000 repositories. Anthropic's Agent Skills added reusable procedures. But both packaged components, conventions, and snippets, not the full harness itself. The field needed harness logic made explicit and executable.
What if you could write an agent's entire control logic, not in Python, not in YAML, but in structured natural language? The Tingua team builds exactly this. Their natural-language agent harness separates into three layers: the backend (infrastructure and tools), the runtime charter (universal physics: how contracts bind, how state persists, how child agents are managed), and the NLH itself (task-specific control logic: contracts, roles, stage structure, failure taxonomies). Why this separation? It gives harness engineering something it never had: controlled experiments. Swap the NLH while fixing the charter and you're testing harness design. Fix the NLH while swapping the charter and you're testing runtime policy. Clean ablation at last.
Two mechanisms underpin it. Execution contracts turn fuzzy LLM completions into bounded agent calls with five elements: required outputs, budgets, permissions, completion conditions, and output paths. Think function signatures for agents. And file-backed state externalizes memory to path-addressable files, surviving truncation, restarts, and delegation.
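Both mechanisms can be sketched as plain Python. The class and field names below are assumptions derived from the five elements listed above, not the actual NLH runtime's API:

```python
# Sketch of an execution contract (the five elements named in the talk)
# and file-backed state. Names and shapes are illustrative assumptions.
import json
import pathlib
import tempfile
from dataclasses import dataclass

@dataclass
class ExecutionContract:
    required_outputs: list          # artifacts the call must produce
    budget_calls: int               # hard cap on model/tool calls
    permissions: set                # tools the agent may touch
    completion_condition: object    # predicate over produced outputs
    output_path: str                # where artifacts land on disk

    def check(self, produced, calls_used):
        """Accept a completion only if it honors the contract."""
        if calls_used > self.budget_calls:
            return False
        if not all(name in produced for name in self.required_outputs):
            return False
        return self.completion_condition(produced)

class FileState:
    """File-backed state: memory survives truncation, restarts, delegation."""
    def __init__(self, root):
        self.root = pathlib.Path(root)
    def write(self, key, value):
        (self.root / key).write_text(json.dumps(value))
    def read(self, key):
        return json.loads((self.root / key).read_text())

# Demo: a second FileState instance sees the first one's write,
# exactly because the state lives on disk, not in any context window.
root = tempfile.mkdtemp()
FileState(root).write("plan", {"step": 1})
assert FileState(root).read("plan") == {"step": 1}
```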
Same pass rate, 14 times the compute. Does all this structure actually help? On SWE-bench Verified with GPT-5 at maximum reasoning, resolved rates clustered between 74% and 76% regardless of configuration. But the full harness burned 16.3 million prompt tokens per sample, 642 tool calls, and 32 minutes; stripped down, 1.2 million tokens, 51 calls, under 7 minutes. Same destination, radically different paths.
Then the module-by-module ablation found something stranger. Self-evolution was the only consistently helpful module: +4.8 on SWE-bench and +2.7 on OSWorld, via an acceptance-gated attempt loop that stays narrow until failure signals justify broadening. Verifiers actively hurt: −0.8 and −8.4. Multi-candidate search: −2.4 and −5.6. More structure is not always better.
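A minimal sketch of what an acceptance-gated attempt loop might look like, assuming only the description above ("stay narrow until failure signals justify broadening"); the tactic ordering and patience threshold are invented, not taken from the paper:

```python
# Acceptance-gated attempt loop: retry the current (narrow) tactic and
# broaden to the next one only after repeated failures justify it.
def attempt_loop(task, tactics, accept, max_attempts=10, patience=2):
    failures_on_tactic = 0
    tactic_idx = 0
    for _ in range(max_attempts):
        result = tactics[tactic_idx](task)
        if accept(result):                 # the acceptance gate
            return result
        failures_on_tactic += 1
        # Broaden only after `patience` failures on the current tactic.
        if failures_on_tactic >= patience and tactic_idx + 1 < len(tactics):
            tactic_idx += 1
            failures_on_tactic = 0
    return None                            # budget exhausted
```

The contrast with multi-candidate search is the point: this loop spends nothing on breadth until the narrow path has demonstrably failed.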
The same paper's headline finding came from a different experiment. The researchers took OS symfony, a native-code harness for desktop automation, and migrated its logic into the NLH representation. Same strategy, different representation. Performance jumped from 30.4% to 47.2%. Runtime dropped from 361 minutes to 141. LLM calls collapsed from 1,200 to just 34. The representation itself drove the gain, replacing brittle GUI repair loops with durable runtime state and artifact-backed completion. Two patterns crystallize from the full results.
Roughly 90% of all compute flows through delegated child agents, not the parent. The harness is an orchestration pattern, not a reasoning pattern: it decomposes, delegates, and verifies. And the only module that consistently helps is the one that narrows the agent's own attempt loop. Disciplined narrowing beats expensive broadening every time. Which raises a question: if representation matters this much, can we find the right harness automatically? Representation alone moved one benchmark 16.8 points, same logic, same model, just rewritten as natural language. If how you express the harness matters that much, what about optimizing it automatically?
Meta-harness, from Stanford's Omar Khattab, creator of DSPy, treats the harness itself as an optimization target. DSPy tunes prompts within a fixed pipeline; Meta-harness rewrites the pipeline itself: structure, retrieval, memory, orchestration topology. Here's the loop. An agentic proposer, Claude Code with Opus 4.6, reads failed execution traces, diagnoses what broke, and writes a complete new harness. Scores and raw traces accumulate in a growing file system. An evaluator tests each proposal. Repeat at scale: 10 million tokens per iteration, 400 times more feedback than any prior method, 82 files read per round. Those traces are irreplaceable. Remove them and accuracy drops from 50% to 34.6%. Replace them with summaries: 34.9%. The signal lives in the raw details.
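The outer loop described above can be sketched generically. `propose` and `evaluate` stand in for the agentic proposer and the benchmark evaluator; the keep-the-best acceptance policy and the trace archive shape are assumptions, not Meta-harness's actual implementation:

```python
# Generic propose-evaluate loop over harnesses. The proposer reads the
# archive of raw execution traces (never summaries) before rewriting.
def optimize_harness(initial, propose, evaluate, iterations=5):
    best = initial
    best_score, traces = evaluate(best)    # score plus raw execution traces
    archive = [traces]                     # raw traces: the irreplaceable signal
    for _ in range(iterations):
        candidate = propose(best, archive)  # diagnose failures, rewrite harness
        score, traces = evaluate(candidate)
        archive.append(traces)
        if score > best_score:              # keep the best harness so far
            best, best_score = candidate, score
    return best, best_score
```

The ablation numbers quoted above (50% with raw traces, ~34.6% without, ~34.9% with summaries) are exactly an argument about what belongs in `archive`.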
Rank two with Opus; rank one with Haiku. A smaller model outranking larger ones through harness optimization alone. Meta-harness scores 76.4% on Terminal-Bench 2, the only automatically optimized system in a field of hand-engineered entries. On a 215-class text classification task: 48.6% accuracy, 7.7 points above state-of-the-art, using four times fewer tokens. But the finding that changes the calculus: a harness optimized on one model transferred to five others, improving all of them. The reusable asset isn't the model; it's the harness. Two more systems complete the picture. DeepMind's auto-harness compiles game rules into code harnesses, eliminating 10% of illegal moves across 145 games; one variant replaces the LLM entirely, running the decision policy as pure code. And AgentSpec provides safety constraints as a domain-specific language, preventing over 90% of unsafe executions.
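AgentSpec itself is a domain-specific language; the Python sketch below only mimics its core idea, declarative rules checked before an action executes. The trigger/check/enforce rule format and the example rules are invented for illustration:

```python
# A toy guard in the spirit of a safety-constraint DSL: rules are data,
# checked against every proposed action before it runs.
RULES = [
    {"trigger": "shell",                              # applies to shell actions
     "check": lambda action: "rm -rf" not in action["cmd"],
     "enforce": "block"},
    {"trigger": "file_write",                         # applies to file writes
     "check": lambda action: action["path"].startswith("/workspace/"),
     "enforce": "block"},
]

def guard(action, rules=RULES):
    """Return True if the action may run, False if any rule blocks it."""
    for rule in rules:
        if rule["trigger"] == action["type"] and not rule["check"](action):
            return False
    return True
```

The value of making constraints declarative is the same as for the rest of the harness: they can be audited, ablated, and transferred independently of any model.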
Four systems, four facets: representation, optimization, constraints, safety. Prompt engineering, context engineering, harness engineering: three eras in four years, each one swallowing the last. Harness engineering absorbs the prior two and adds what the model can't do on its own: orchestration, memory, verification, safety. The discipline takes on an odd shape in practice. Anthropic named the dynamic: every harness component encodes an assumption about what the model can't do alone, and those assumptions expire. When Opus 4.6 stopped needing context resets, Anthropic dropped them entirely. Manus rewrote their harness five times in six months. Vercel removed 80% of an agent's tools and got better results. The harness space doesn't shrink as models improve; it moves. Which is why mature harness work looks less like building structure up and more like pruning it down: a craft of subtraction as much as addition.
The practical takeaway is unambiguous: investing in your harness yields larger, faster, and more reliable gains than waiting for the next model upgrade. If you build agents, you are a harness engineer whether you call yourself one or not. And it's no longer a question of which model to pick; it's a question of which structure to remove. Open problems remain. Portable harness logic lowers the barrier to spreading risky workflows: prompt injection buried in harness text, malicious tools grafted into shared artifacts. Research already found that one in four community-contributed agent skills contains a vulnerability. And the most consequential open question: can harness and model weights be co-evolved, letting strategy shape what the model learns and the model reshape the strategy that wraps it? The field is moving from artisanal construction to systematic science. What sits between a language model and useful work has always mattered. We're finally learning how to engineer it.