Rethinking AI Agents: The Rise of Harness Engineering
By PY
Summary
## Key takeaways
- **Harness Outperforms Model Upgrades**: Stanford research shows orchestration code now drives more performance variation than the underlying model itself, with the same model and benchmark producing six times the performance difference depending on harness design. [00:06], [00:27]
- **Harness Is an Operating System**: The harness acts like an operating system for a language model: the context window is RAM, external databases serve as disk, and tool integrations function as device drivers coordinating what the CPU sees and when. [00:52], [01:21]
- **Self-Evolution Beats Verifiers**: Ablation testing revealed that self-evolution was the only consistently helpful module (+4.8 on SWE-bench, +2.7 on OSWorld), while verifiers actively hurt performance (−0.8 and −8.4) and multi-candidate search also degraded results. [05:24], [05:54]
- **Representation Drives Gains**: Migrating OSWorld's logic into a natural-language harness representation boosted performance from 30.4% to 47.2%, cut runtime from 361 to 141 minutes, and reduced LLM calls from 1,200 to just 34; the representation itself drove the gain. [06:08], [06:33]
- **Smaller Model Outranks Larger via Harness**: Meta-harness optimization allowed Haiku (a smaller model) to rank higher than Opus through harness optimization alone, scoring 76.4% on Terminal-Bench 2 as the only automatically optimized system in a field of hand-engineered entries. [08:27], [08:42]
- **Harness Is the Reusable Asset**: A harness optimized on one model transferred successfully to five others, improving all of them; the reusable asset in AI systems is the harness, not the model, fundamentally shifting where to invest engineering effort. [08:56], [09:10]
Topics Covered
- Six Times Performance: Harness Over Model
- Natural Language Control Enables Clean Ablation
- Smaller Model Beats Larger Ones Through Harness Optimization Alone
- Mature Harness Work Means Pruning Structure, Not Building It
Full Transcript
Same model, same benchmark, six times the performance difference. Stanford researchers found that the orchestration code wrapping a language model now drives more performance variation than the model itself. LangChain confirmed it by modifying only harness infrastructure: their coding agent jumped from outside the top 30 to rank five on Terminal-Bench 2. Two March 2026 papers now formalize this from complementary directions, and what they found redefines what we should actually be optimizing when we build agents. Agent equals model plus harness. "If you're not the model, you're the harness." That's how LangChain frames it.
The sharpest definition of what agents actually are. But what does the harness half look like? The operating system analogy captures it. A raw LLM is a CPU: powerful but inert. No RAM, no disk, no I/O. The context window acts as RAM, fast but limited. External databases serve as disk. Tool integrations are device drivers. The harness is the operating system coordinating what the CPU sees and when. Concretely, it is everything that isn't model weights: system prompts, tool definitions, orchestration logic, memory management, verification loops, safety guardrails.
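The operating-system analogy can be made concrete with a toy sketch. Everything here (`ToyHarness`, the `tool:<name>:<arg>` convention) is invented for illustration and corresponds to no system discussed in the talk: the context window is a bounded message list, a dict stands in for disk, and plain functions act as device drivers.

```python
# A toy "operating system" harness: context window as RAM, a key-value
# store as disk, tool functions as drivers. All names are illustrative.
class ToyHarness:
    def __init__(self, call_model, tools, context_limit=8):
        self.call_model = call_model        # the "CPU": any fn(messages) -> str
        self.tools = tools                  # "device drivers": name -> fn
        self.context_limit = context_limit  # "RAM" capacity, in messages
        self.messages = []                  # the context window
        self.disk = {}                      # "disk": survives context eviction

    def remember(self, key, value):
        self.disk[key] = value              # page out to durable storage

    def add(self, message):
        self.messages.append(message)
        # Evict the oldest non-initial message when "RAM" is full,
        # keeping the first (system-like) message pinned.
        while len(self.messages) > self.context_limit:
            self.messages.pop(1)

    def step(self, user_input):
        self.add({"role": "user", "content": user_input})
        reply = self.call_model(self.messages)
        # Dispatch tool calls expressed as "tool:<name>:<arg>".
        if reply.startswith("tool:"):
            _, name, arg = reply.split(":", 2)
            result = self.tools[name](arg)
            self.add({"role": "tool", "content": result})
            return result
        self.add({"role": "assistant", "content": reply})
        return reply
```

A real harness would parse structured tool calls rather than a string convention, but the shape is the same: the loop around the model, not the model, decides what enters the context and when a driver runs.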
Anthropic identified five canonical patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer loops. Each is a different strategy for when and how the model gets called. Every production agent combines these patterns. And those architectural choices, not the model underneath, drive the 6x performance gaps.
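Two of the five patterns are simple enough to sketch in a few lines. `call_model` is a placeholder for any LLM client, and the prompts are invented here, not Anthropic's:

```python
# Sketches of two canonical patterns. `call_model` stands in for any
# LLM client function taking a prompt and returning a string.

def route(query, call_model):
    """Routing: a cheap classification call picks the downstream handler."""
    label = call_model(f"Classify as 'code' or 'prose': {query}")
    handlers = {
        "code": lambda q: call_model(f"You are a coding agent. {q}"),
        "prose": lambda q: call_model(f"You are an editor. {q}"),
    }
    return handlers.get(label, handlers["prose"])(query)

def chain(query, call_model):
    """Prompt chaining: each step consumes the previous step's output."""
    steps = ["Outline: {}", "Draft from outline: {}", "Tighten the draft: {}"]
    out = query
    for template in steps:
        out = call_model(template.format(out))
    return out
```

Parallelization, orchestrator-workers, and evaluator-optimizer loops compose the same primitive (a model call) with fan-out, delegation, and feedback respectively.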
If harnesses matter this much, how were people building them? Messily: logic scattered across controller code, framework defaults, verifier scripts. Two systems that nominally differed by one design choice actually differed in prompts, tools, verification gates, and state semantics simultaneously. Anthropic's own evolution exposes the pattern.
Naive harnesses suffer two failure modes: one-shotting, where the agent tries everything at once and exhausts its context, and premature completion, where a later session sees partial progress and declares victory. Their fix evolved into a three-agent, GAN-inspired architecture: planner, generator, and evaluator, with the evaluator clicking through the running app like a real user. Twenty times more expensive, $200 versus $9, but now the core thing worked instead of being broken.
OpenAI converged independently. Five months, a million lines of application logic, tests, CI, and tooling. Zero manually written. And their discovery: the engineering team's primary job became enabling agents to do useful work. Productive, but ad hoc, non-portable, impossible to ablate. Standards did emerge. AGENTS.md reached 60,000 repositories. Anthropic's Agent Skills added reusable procedures. But both packaged components, conventions, and snippets, not the full harness itself. The field needed harness logic made explicit and executable.
What if you could write an agent's entire control logic, not in Python, not in YAML, but in structured natural language? The Tingua team builds exactly this. Their natural-language agent harness separates into three layers: the backend (infrastructure and tools), the runtime charter (universal physics: how contracts bind, how state persists, how child agents are managed), and the NLH itself (task-specific control logic: contracts, roles, stage structure, failure taxonomies). Why this separation? It gives harness engineering something it never had: controlled experiments. Swap the NLH while fixing the charter and you're testing harness design. Fix the NLH while swapping the charter and you're testing runtime policy. Clean ablation at last.
Two mechanisms underpin it. Execution contracts turn fuzzy LLM completions into bounded agent calls with five elements: required outputs, budgets, permissions, completion conditions, and output paths. Think function signatures for agents. And file-backed state externalizes memory to path-addressable files, surviving truncation, restarts, and delegation.
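Both mechanisms can be sketched as plain Python. The class and field names below are assumptions derived from the five elements listed above, not the actual NLH runtime's API:

```python
# Sketch of an execution contract (the five elements named in the talk)
# and file-backed state. Names and shapes are illustrative assumptions.
import json
import pathlib
import tempfile
from dataclasses import dataclass

@dataclass
class ExecutionContract:
    required_outputs: list          # artifacts the call must produce
    budget_calls: int               # hard cap on model/tool calls
    permissions: set                # tools the agent may touch
    completion_condition: object    # predicate over produced outputs
    output_path: str                # where artifacts land on disk

    def check(self, produced, calls_used):
        """Accept a completion only if it honors the contract."""
        if calls_used > self.budget_calls:
            return False
        if not all(name in produced for name in self.required_outputs):
            return False
        return self.completion_condition(produced)

class FileState:
    """File-backed state: memory survives truncation, restarts, delegation."""
    def __init__(self, root):
        self.root = pathlib.Path(root)
    def write(self, key, value):
        (self.root / key).write_text(json.dumps(value))
    def read(self, key):
        return json.loads((self.root / key).read_text())

# Demo: a second FileState instance sees the first one's write,
# exactly because the state lives on disk, not in any context window.
root = tempfile.mkdtemp()
FileState(root).write("plan", {"step": 1})
assert FileState(root).read("plan") == {"step": 1}
```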
Same pass rate, 14 times the compute. Does all this structure actually help? On SWE-bench Verified with GPT-5 at maximum reasoning, resolved rates clustered between 74% and 76% regardless of configuration. But the full harness burned 16.3 million prompt tokens per sample, 642 tool calls, and 32 minutes; stripped down, 1.2 million tokens, 51 calls, under 7 minutes. Same destination, radically different paths.
Then the module-by-module ablation found something stranger. Self-evolution was the only consistently helpful module: +4.8 on SWE-bench and +2.7 on OSWorld, via an acceptance-gated attempt loop that stays narrow until failure signals justify broadening. Verifiers actively hurt: −0.8 and −8.4. Multi-candidate search: −2.4 and −5.6. More structure is not always better.
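A minimal sketch of what an acceptance-gated attempt loop might look like, assuming only the description above ("stay narrow until failure signals justify broadening"); the tactic ordering and patience threshold are invented, not taken from the paper:

```python
# Acceptance-gated attempt loop: retry the current (narrow) tactic and
# broaden to the next one only after repeated failures justify it.
def attempt_loop(task, tactics, accept, max_attempts=10, patience=2):
    failures_on_tactic = 0
    tactic_idx = 0
    for _ in range(max_attempts):
        result = tactics[tactic_idx](task)
        if accept(result):                 # the acceptance gate
            return result
        failures_on_tactic += 1
        # Broaden only after `patience` failures on the current tactic.
        if failures_on_tactic >= patience and tactic_idx + 1 < len(tactics):
            tactic_idx += 1
            failures_on_tactic = 0
    return None                            # budget exhausted
```

The contrast with multi-candidate search is the point: this loop spends nothing on breadth until the narrow path has demonstrably failed.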
The same paper's headline finding came from a different experiment. The researchers took OS symfony, a native-code harness for desktop automation, and migrated its logic into the NLH representation. Same strategy, different representation. Performance jumped from 30.4% to 47.2%. Runtime dropped from 361 minutes to 141. LLM calls collapsed from 1,200 to just 34. The representation itself drove the gain, replacing brittle GUI repair loops with durable runtime state and artifact-backed completion. Two patterns crystallize from the full results.
Roughly 90% of all compute flows through delegated child agents, not the parent. The harness is an orchestration pattern, not a reasoning pattern: it decomposes, delegates, and verifies. And the only module that consistently helps is the one that narrows the agent's own attempt loop. Disciplined narrowing beats expensive broadening every time. Which raises a question: if representation matters this much, can we find the right harness automatically? Representation alone moved one benchmark 16.8 points, same logic, same model, just rewritten as natural language. If how you express the harness matters that much, what about optimizing it automatically?
Meta-harness, from Stanford's Omar Khattab, creator of DSPy, treats the harness itself as an optimization target. DSPy tunes prompts within a fixed pipeline; Meta-harness rewrites the pipeline itself: structure, retrieval, memory, orchestration topology. Here's the loop. An agentic proposer, Claude Code with Opus 4.6, reads failed execution traces, diagnoses what broke, and writes a complete new harness. Scores and raw traces accumulate in a growing file system. An evaluator tests each proposal. Repeat at scale: 10 million tokens per iteration, 400 times more feedback than any prior method, 82 files read per round. Those traces are irreplaceable. Remove them and accuracy drops from 50% to 34.6%. Replace them with summaries: 34.9%. The signal lives in the raw details.
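The outer loop described above can be sketched generically. `propose` and `evaluate` stand in for the agentic proposer and the benchmark evaluator; the keep-the-best acceptance policy and the trace archive shape are assumptions, not Meta-harness's actual implementation:

```python
# Generic propose-evaluate loop over harnesses. The proposer reads the
# archive of raw execution traces (never summaries) before rewriting.
def optimize_harness(initial, propose, evaluate, iterations=5):
    best = initial
    best_score, traces = evaluate(best)    # score plus raw execution traces
    archive = [traces]                     # raw traces: the irreplaceable signal
    for _ in range(iterations):
        candidate = propose(best, archive)  # diagnose failures, rewrite harness
        score, traces = evaluate(candidate)
        archive.append(traces)
        if score > best_score:              # keep the best harness so far
            best, best_score = candidate, score
    return best, best_score
```

The ablation numbers quoted above (50% with raw traces, ~34.6% without, ~34.9% with summaries) are exactly an argument about what belongs in `archive`.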
Rank two with Opus; rank one with Haiku. A smaller model outranking larger ones through harness optimization alone. Meta-harness scores 76.4% on Terminal-Bench 2, the only automatically optimized system in a field of hand-engineered entries. On a 215-class text classification task: 48.6% accuracy, 7.7 points above state-of-the-art, using four times fewer tokens. But the finding that changes the calculus: a harness optimized on one model transferred to five others, improving all of them. The reusable asset isn't the model; it's the harness. Two more systems complete the picture. DeepMind's auto-harness compiles game rules into code harnesses, eliminating 10% of illegal moves across 145 games; one variant replaces the LLM entirely, running the decision policy as pure code. And AgentSpec provides safety constraints as a domain-specific language, preventing over 90% of unsafe executions.
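AgentSpec itself is a domain-specific language; the Python sketch below only mimics its core idea, declarative rules checked before an action executes. The trigger/check/enforce rule format and the example rules are invented for illustration:

```python
# A toy guard in the spirit of a safety-constraint DSL: rules are data,
# checked against every proposed action before it runs.
RULES = [
    {"trigger": "shell",                              # applies to shell actions
     "check": lambda action: "rm -rf" not in action["cmd"],
     "enforce": "block"},
    {"trigger": "file_write",                         # applies to file writes
     "check": lambda action: action["path"].startswith("/workspace/"),
     "enforce": "block"},
]

def guard(action, rules=RULES):
    """Return True if the action may run, False if any rule blocks it."""
    for rule in rules:
        if rule["trigger"] == action["type"] and not rule["check"](action):
            return False
    return True
```

The value of making constraints declarative is the same as for the rest of the harness: they can be audited, ablated, and transferred independently of any model.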
Four systems, four facets: representation, optimization, constraints, safety. Prompt engineering, context engineering, harness engineering: three eras in four years, each one swallowing the last. Harness engineering absorbs the prior two and adds what the model can't do on its own: orchestration, memory, verification, safety. The discipline takes on an odd shape in practice. Anthropic named the dynamic: every harness component encodes an assumption about what the model can't do alone, and those assumptions expire. When Opus 4.6 stopped needing context resets, Anthropic dropped them entirely. Manus rewrote their harness five times in six months. Vercel removed 80% of an agent's tools and got better results. The harness space doesn't shrink as models improve; it moves. Which is why mature harness work looks less like building structure up and more like pruning it down: a craft of subtraction as much as addition.
The practical takeaway is unambiguous: investing in your harness yields larger, faster, and more reliable gains than waiting for the next model upgrade. If you build agents, you are a harness engineer whether you call yourself one or not. And it's no longer a question of which model to pick; it's a question of which structure to remove. Open problems remain. Portable harness logic lowers the barrier to spreading risky workflows: prompt injection buried in harness text, malicious tools grafted into shared artifacts. Research already found that one in four community-contributed agent skills contains a vulnerability. And the most consequential open question: can harness and model weights be co-evolved, letting strategy shape what the model learns and the model reshape the strategy that wraps it? The field is moving from artisanal construction to systematic science. What sits between a language model and useful work has always mattered. We're finally learning how to engineer it.