Harness Engineering: The Skill That Will Define 2026 for Solo Devs
By Solo Swift Crafter
Summary
Topics Covered
- AI Benchmarks Miss Real Execution Failures
- Model Failures Stem from Orchestration Breakdowns
- Fewer Tools Boost Agent Accuracy to 100%
- File System Serves as Model's External Memory
- Simplify Harnesses as Models Advance
Full Transcript
So, okay. What's the best AI model right now? Claude, GPT, Gemini? And honestly, I think that's the wrong question. Like, completely the wrong question. Just real quick: I'm Daniel. I've been deep in iOS dev for over eight years now.
Started out freelancing, designing UIs, bouncing from client to client, shipping other people's ideas while trying to figure out my own. And then after WWDC 25, I just went all in solo. No more clients, no safety net. Since then, I've crafted over 15 of my own apps. All SwiftUI, all built in public. And right now, honestly, every ounce of energy I've got goes into making this solo studio into something that actually lasts. Not another round of quick MVPs or AI-generated slop, but real apps that hold up and scale. And yeah, all of that process, the whole messy journey, lives on Crafters Lab. It's at crafterslab.dev.

It's not some tutorial graveyard or AI clone factory. It's genuinely my home base, built for solo devs who use AI like a real teammate, not like a vending machine you poke when you're stuck and hope for the best. If you care about the craft, if you're serious about leveling up and building things that actually last, yeah, you'd feel right at home. And hey, if you're still on Patreon, huge thanks for that. But heads up, everything's moved over to crafterslab.dev. That's where the whole crew is now. Come build with us.
So, here's what got me thinking about all this. There was a study that came out recently. Researchers published this benchmark called APEX Agents. And what makes it different from, like, every other benchmark you see people arguing about online is that it tests agents on real professional work. Not coding puzzles, not multiple-choice. We're talking actual tasks that consultants, lawyers, and analysts do on a daily basis. Each one takes a human about one to two hours to complete.

So they ran every major frontier model through it. The best one completed those tasks about 24% of the time. One in four. And after eight attempts with the same model, it only climbed to around 40%. Now, keep in mind, these are the same models scoring above 90% on the benchmarks everyone loses their minds over. So either those benchmarks are off, or we're measuring the wrong thing. And I think it's the second one, right? But okay, so here's where it gets real for us.
The researchers actually dug into why the agents failed. And the answer wasn't that the models are dumb. They had all the knowledge they needed. They could reason through the problems just fine. The failures were almost entirely about execution and orchestration. The agents would get lost after too many steps. They'd loop back to approaches that had already failed. They'd just lose track of what they were even supposed to be doing in the first place.

And if you're a solo dev using Claude Code or Cursor every day, yeah, you've been there. You've watched the agent spiral, retry the same broken thing three times, completely forget the context from 20 steps ago, and you're sitting there like, "Maybe I should switch to Opus. Maybe I need a different provider." But the data is saying that's not it. The model isn't the bottleneck. It's everything wrapped around it. And there's a word for that, and I think it's gonna define 2026 the way agents defined 2025. The word is harness.
An agent harness is all the infrastructure around the model: what it can see, what tools it has access to, how it recovers when things go sideways, how it keeps track of what it's doing over a long session. OpenAI literally published a blog post called Harness Engineering.
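To make that concrete: in code terms, a harness is roughly the loop wrapped around the model call. Here's a minimal sketch in Python; `call_model`, the `echo` tool, and the message shape are all made up for illustration, not any real API:

```python
# Minimal sketch of an agent harness: everything around the model call.
# `call_model` stands in for any LLM API; here it's a stub so the loop runs.

def call_model(history):
    # Pretend the model asks for one tool call, then finishes.
    if any(step["role"] == "tool" for step in history):
        return {"type": "finish", "answer": "done"}
    return {"type": "tool", "name": "echo", "args": {"text": "hello"}}

TOOLS = {"echo": lambda text: text.upper()}  # what the agent can touch

def run_agent(task, max_steps=10):
    history = [{"role": "user", "content": task}]  # what the model can see
    for _ in range(max_steps):                     # step budget: don't spiral forever
        action = call_model(history)
        if action["type"] == "finish":
            return action["answer"]
        try:
            result = TOOLS[action["name"]](**action["args"])
        except Exception as exc:                   # recovery: feed errors back
            result = f"error: {exc}"
        history.append({"role": "tool", "content": result})
    return "gave up"                               # graceful stop when lost
```

Everything in the definition above lives somewhere in this loop: `history` is what the model sees, `TOOLS` is what it can use, the `try/except` is the recovery path, and `max_steps` is how it keeps a long session from running away.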
Anthropic put out a whole guide on building effective harnesses for long-running agents. Manus, the AI company Meta just acquired, published their context engineering lessons after rebuilding their entire agent framework five times in six months. Five times. And they're all saying the exact same thing: the harness is where the real engineering work lives, not the model.

Okay, and this is the part that honestly surprised me, because it runs completely counter to how most of us think about building with these tools.
So, there's this story from Vercel. They had a text-to-SQL agent. You ask a question, it writes a SQL query. And they built it the way most people build agents, right? Gave it a bunch of specialized tools: one for understanding the database schema, one for writing queries, one for validating results, all this error handling wrapped around it. And it worked about 80% of the time.

Then they tried something kind of radical. They removed 80% of the tools, just ripped them out, and gave the agent basic stuff: run bash commands, read files, standard command-line tools like grep and cat, the kind of stuff you or I would actually use. And accuracy went from 80% to 100%. It used 40% fewer tokens and it was three and a half times faster. Not going to lie, that's kind of wild, right? And the engineer who built it said something that really stuck with me: models are getting smarter, context windows are getting larger, so maybe the best agent architecture is almost no architecture at all. And that just flips everything. You know what I mean?
Because the instinct, especially when you're solo and you're trying to make this thing reliable, is to keep adding more tools, more guardrails, more routing logic. You think more structure is going to help, but those tools weren't helping the model. They were getting in the way.

And this isn't an isolated thing either. Manus went through the exact same realization. They rebuilt their entire agent framework five times in six months. And their biggest performance gains didn't come from adding features. They came from removing them. They ripped out complex document retrieval, killed the fancy routing logic, replaced management agents with simple structured handoffs. Every iteration, the thing got simpler, and it got better.

And here's the part I think every solo dev running long Claude Code sessions needs to hear. Manus found that their agent averaged around 50 tool calls per task. That's a lot of steps. And even with models that technically support huge context windows, performance just degrades past a certain point. The model doesn't suddenly forget everything. It's more like the signal gets buried under noise. Your important instructions from the start of the session get lost under hundreds of intermediate results.
So their fix was dead simple. They started treating the file system as the model's external memory. Instead of cramming everything into the context window, the agent writes key info to a file and reads it back when needed. And yeah, if you use Claude Code, you've literally seen this: the CLAUDE.md files, the to-do lists, the progress tracking. That's this exact pattern playing out in your terminal every day.
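That external-memory pattern is small enough to sketch yourself. This is an illustration, not Manus's or Claude Code's actual implementation; the file name and helper functions are made up:

```python
from pathlib import Path

NOTES = Path("agent_notes.md")  # hypothetical scratchpad file on disk

def remember(key, value):
    """Append a fact to the scratchpad instead of keeping it in context."""
    with NOTES.open("a") as f:
        f.write(f"- {key}: {value}\n")

def recall():
    """Read the scratchpad back before the next model call."""
    return NOTES.read_text() if NOTES.exists() else ""

# During a session, key facts go to disk as they're discovered...
remember("schema", "users(id, email, created_at)")
remember("decision", "use LEFT JOIN, not subquery")

# ...and the next prompt includes recall() instead of hundreds of
# intermediate tool outputs.
print(recall())
```

The win is that the context window carries a short, current digest while the full detail sits safely on disk, retrievable any time.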
All right. So, remember what I said about everyone converging on the same idea? When you look at the three most successful agent systems right now, they all arrived at the same place from completely different directions.

Codex from OpenAI has this layered approach: an orchestrator that plans, an executor that handles individual tasks, and a recovery layer that catches failures. It's robust. You can hand it stuff and walk away. That's one philosophy.

Claude Code, and I use this every single day: the core is literally just four tools. Read a file, write a file, edit a file, run a bash command. That's it. Most of the intelligence lives in the model itself. The harness stays minimal. And when you need more, extensibility comes through MCP and skills that the agent picks up as needed.

And then Manus landed on what I'd call reduce, offload, isolate: actively shrink the context, use the file system for memory, spin up sub-agents for heavy tasks, and just bring back the summary.
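The reduce-offload-isolate shape can be sketched in a few lines. The sub-agent here is a stub (the step transcript and the file name in the summary are invented), but the structure is the point: the heavy transcript stays inside the sub-agent, and only a short summary crosses back into the main context:

```python
def sub_agent(task):
    # Imagine dozens of tool calls happening here; the full transcript
    # stays local to this sub-agent and is thrown away afterwards.
    transcript = [f"step {i}: ...long tool output..." for i in range(50)]
    summary = f"{task}: completed in {len(transcript)} steps, result saved to out.csv"
    return summary                      # only the summary crosses the boundary

def main_agent():
    context = ["plan: analyze the dataset"]
    context.append(sub_agent("crunch numbers"))   # offload the heavy work
    return context                                # stays tiny: 2 entries, not 52

print(main_agent())
```

Fifty steps of noise become one line in the orchestrator's context, which is exactly the "isolate" part: the main agent never has to wade through the sub-agent's intermediate results.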
Three totally different approaches, all converging on the same insight: the harness matters more than the model.

And for solo devs, this changes what you should actually be spending your time on, because, you know, we don't have infinite hours. Every hour you spend on Reddit debating Claude versus GPT is an hour you're not shipping. And there's this idea from Richard Sutton, one of the creators of reinforcement learning, called the bitter lesson. The core argument is that approaches which scale with compute always end up beating approaches that rely on hand-engineered knowledge. Applied to what we're doing, that means something very specific: as models get smarter, your harness should get simpler, not more complex. If you're adding more hand-coded logic, more custom pipelines with every model upgrade, you're swimming against the current. And honestly, that overengineering is probably why your agent keeps breaking.
So, here's what I'd actually try. First, do the Vercel experiment yourself. If you've got any kind of agent setup, strip it down. Remove the specialized tools. Give it a bash terminal and basic file access, and just see what happens. The model is probably smarter than the tool pipeline you built around it.
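If you want to try the stripped-down version, the entire toolset can literally be two functions. A sketch using Python's standard `subprocess` module; the function names are just placeholders:

```python
import subprocess

def bash(cmd, timeout=30):
    """One general-purpose tool instead of a dozen specialized ones."""
    out = subprocess.run(cmd, shell=True, capture_output=True,
                         text=True, timeout=timeout)
    return (out.stdout + out.stderr).strip()

def read_file(path):
    with open(path) as f:
        return f.read()

# That's the entire harness surface the model needs to learn.
TOOLS = {"bash": bash, "read_file": read_file}

# The agent composes grep, cat, etc. on its own, the way you would.
print(bash("echo hello | tr a-z A-Z"))
```

Stderr is folded into the return value on purpose: when a command fails, the error message goes straight back to the model, which is usually enough for it to correct course.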
Second, add a progress file. Have your agent maintain a running to-do list that it updates after each step. It reads the file at the start of each action and writes to it at the end. This is exactly what Claude Code does with those markdown files, and it's the same pattern Manus landed on after five complete rewrites. I actually have a whole system for this wired up in the lab, with all my agent instructions and MD templates ready to go, if you're curious.
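A minimal version of that progress file, as a sketch; the file name and checklist format are invented, but the read-at-start, write-at-end rhythm is the pattern:

```python
from pathlib import Path

TODO = Path("progress.md")  # hypothetical progress file, like Claude Code's markdown plans

def init_plan(steps):
    TODO.write_text("\n".join(f"[ ] {s}" for s in steps) + "\n")

def read_plan():
    """Called at the start of every action, so the goal survives long sessions."""
    return TODO.read_text()

def check_off(step):
    """Called at the end of every action."""
    TODO.write_text(read_plan().replace(f"[ ] {step}", f"[x] {step}"))

init_plan(["parse input", "write query", "validate results"])
check_off("parse input")
print(read_plan())
```

Because the plan is re-read before every step, the original instructions can't get buried under intermediate results, which is precisely the failure mode the benchmark study found.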
And third, start learning about MCP and skills. These give the model clean, standardized ways to work with external tools without you having to hardcode every integration. That's where the extensibility lives.

Now, 2025 was the year of agents, and for the most part, yeah, that happened. But 2026? I think 2026 is the year of harnesses. The same model, the exact same model, behaves completely differently in Claude Code compared to Cursor compared to Codex. So choose your harness carefully, whether you're using a coding agent or building one.
So yeah, if you're still here, honestly, you're a legend. And look, I know the model discourse is loud right now. Every week there's a new drop, a new benchmark, a new thread about which one is king. But the actual data, the actual engineering coming out of the companies building this stuff, it's all pointing somewhere else. The harness is where the wins are. And as solo devs, that's actually great news, because building a better harness is something you can do right now, today, without waiting for the next model release.

And you know, if you want to go deeper into how I actually set all this up, the MD files, the agent workflows, how I wire everything together for my own apps, come check out crafterslab.dev.
It's not some tutorial dump or another AI content farm. It's genuinely my home base, built for solo devs who treat AI like a real teammate and actually care about what they ship. Inside you get full walkthroughs, real short video tutorials, a bunch of Claude Code skills you can grab and use right away, and downloadable resources you can drop straight into your projects. Members riff in the comments, ask follow-ups, go back and forth. It's a real conversation, not some one-way content feed.

But the real core, the Notion team spaces, my live playbook: you get a front-row seat to how I run every single app I'm building. The actual MD files I use on real projects, the prompt library, the docs I'm writing as I go, all the automations running behind the scenes. Nothing polished for the camera, just the real process, messy parts and all.

And there's Swift Brain, a curated Swift and SwiftUI library I've been building out for years: deep-dive keynotes, private talks I spent real money curating, the kind of material that's not floating around in public training data. This is what I actually use to build custom MCPs and set up skills for Claude Code, for Cursor, all of it. Always experimenting, always sharing what sticks.

And then Ops Lab, that's where all the AI agent instructions live: the Notion templates, the Claude Code skills, the workflows, automations, all wired up and ready for you to copy, tear apart, totally break, and rebuild your own way. The whole point is keeping the indie stack connected, so you're never really building alone, even if you're solo at the keyboard.

So, yeah, if you want to get in while the crew is still small and prices are locked, now is kind of the sweet spot. It feels way more like a behind-the-scenes dev lounge than some giant faceless forum. Would genuinely love to see you in there. Trade some takes on this harness stuff. Maybe learn something from what you're building next. Keep crafting, keep experimenting, and don't let the benchmark noise distract you from what actually matters. Peace.