Harness Engineering: How to Build Software When Humans Steer, Agents Execute — Ryan Lopopolo, OpenAI

By AI Engineer

Summary

Topics Covered

Code Is Free, Abundance Replaces Scarcity
Every Human Is a Staff Engineer with Infinite Agents
Every Human Interaction with an Agent Is a Harness Failure
LLMs Are Fuzzy Compilers, Code Is Disposable

Full Transcript

Our next speaker is here to speak about harness engineering. How to build

harness engineering. How to build software when humans steer and agents execute. Please join me in welcoming to

execute. Please join me in welcoming to the stage member of technical staff at OpenAI, Ryan Leopo.

Good morning, London.

I'm super excited to be here today. I'm

Ryan Laapo and for the last nine months I have had the privilege of building software exclusively with agents. Uh I

am a token billionaire and I believe that in order for us to get into our AGI future, we want everybody to be token

billionaires to use the models to do the full job. And what that means is to lean

full job. And what that means is to lean into the idea that the models are capable of being a full software engineer. And I've lived that experience

engineer. And I've lived that experience by banning my team from even touching their editors, to have to work through the models in order to get the job done.

And today I'm going to talk to you a little bit about what it means to lean into that and operationalize the way you work, the code spaces you live in, and the processes on your teams in order to

get the agents to do the full job.

I believe I'm preaching to the choir here when I say that the way we build software has changed. In the last six months, we have seen coding agents take over the world and capability has

continually advanced at a super fast pace to have these models and the harnesses within which they live take more complex actions, do more

complicated work with higher reliability over longer time horizons.

And the place we've gotten to here is that implementation is no longer the scarce resource of what it means to do the job of software engineering. Code is

free. We have an abundance of code to solve the problems that we come across in our day-to-day as we run our teams,

build software, and solve user problems. Hiring the hands on the keyboards as part of our teams is only constrained by GPU capacity and token budgets. And each

engineer today in this room has access to five, 50, or 5,000 engineers worth of capacity 247 every day of the year. The

only thing that needs to happen, our roles is to figure out how to productively deploy these resources into our code and into our teams to make use

of this new capacity.

And in this world, skill sets are shifting more towards systems thinking, system design, and delegation in order to make use of this abundant capacity to

produce code to solve problems. And there are three reasons that this happened. All of which happened in late

happened. All of which happened in late 2025.

For me, the magic moment was GPT 5.2, which when it came out was able to do the full job of a software engineer. The

models at this point are good enough where they're isomeorphic to you and I in terms of the ability to produce code at high quality that solve real user

problems in real code bases.

Code is free. And I know this is maybe a scary thing to hear because code carries maintenance burden, but it's free to produce, free to refactor, and it is not

a thing to get hung up on anymore.

We think of code as burden because it it's a synchronous attention drain on the human engineers on our team. But the

models are incredibly patient. They are

infinitely parallel. So the ability to produce, maintain, refactor, and delete code is no longer a forcing function on figuring out how to allocate resources

on your engineering teams. So sort of be agi pill here is to believe that the models are capable of producing every line of code we could ever possibly need, figuring out when to delete them,

figuring out when to refactor them or make them more reliable. And it's your role as software engineers to figure out how to unblock your team of agents and

humans driving those agents from being able to drive them over long horizon work to do the full job.

The idea here is that every one of you is a staff engineer. You have as many team members as you can possibly drive concurrently and have tokens to support

and you need to look one day, one week, six months into the future to figure out what structures you need to put in place to productively harness this infinite

capacity to produce code.

The scarce resources in this world that we see today are three things. human

time, human and model attention and model context window. And in the world where human time and attention is

scarce, the role is to think about where that time is going, figure out ways to productively automate it and move that synchronous human time into higher

leverage activities.

In a world where human time is scarce and human time is required to produce code, we have a stack rank. Things are

either P zeros or P2s. Those P3s will never get done. However, in a world where code is free and infinitely abundant, all those P3s get kicked off

immediately, maybe 4x in parallel. We

pick one that solves the problem and in it goes.

I've had the privilege of building a ton of agents internally at OpenAI to improve the productivity of my co-workers. And when code is free, all

co-workers. And when code is free, all these internal tools can have good localization and internationalization

from day one. I can make tools that my colleagues in London, Dublin, Paris, Brussels, Zurich, and Munich are able to experience in their native languages

without really having to trade against any of my other teams capacity in order to make highquality tools.

We should be working with the assumption that the best parts of software engineering that we all know, live, and breathe are available in any product that we could ever build all the time.

Humans no need no longer need to concern themselves with implementation. The

important thing is not the code but the prompt and the guardrails that got you there. This is why leaving breadcrumbs,

there. This is why leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like. All the historical logs of

looks like. All the historical logs of tickets and code reviews. This is the process that got you and your teams to the code and products that you have today. And this is what is need needs to

today. And this is what is need needs to happen in order to get your agents there as well. Your job is to build systems,

as well. Your job is to build systems, software and structures that enable your team to be successful. And to do that, we need to make them legible to those

agents that are driving the implementation. That means structuring

implementation. That means structuring them in a way that's native to the agents. Writing them in a way that is

agents. Writing them in a way that is respecting of scarce context, which is this other scarce resource here, and figuring out ways to make the tokens that are required to do the job easy to

predict. That means making things the

predict. That means making things the same as much as possible so we can limit the amount of attention the model needs to activate in order to do the job.

Large scale refactoring in this world is free. So making things the same is

free. So making things the same is something that you are all able to do.

There's never going to be a migration that hangs open for six months now that you can't get the last parts of the codebase to do because you can just fire off 15 agents to drive that work to completion. This is what it means to

completion. This is what it means to have a migration, right? We can finish them now. Come on. That's good. That's

them now. Come on. That's good. That's

good. Clap.

There's sort of this like meta epistemological question here about like what it means to do a good job and doing a good job as a software engineer is

hard. It requires us years of being in

hard. It requires us years of being in the industry to fully internalize what it means to write highquality maintainable reliable code that our teammates are able to build on top of

that is going to acrue leverage to the codebase to do a single patch. well, probably

requires 500 little decisions along the way around the underspecified non-functional requirements that go into producing good code. The agents, the

models during their training have seen trillions of lines of code that make every possible choice of those non-functional requirements that you could ever imagine. So, it's our job to

specify those non-functional requirements to write them down in a way that the agents can see this is what it is to do a good acceptable job that's going to produce a merged patch. And if

the agents aren't doing that, it's our job to figure out ways to refine and restrict their output such that the code they write is acceptable. You can just

simply say do not produce slop. Don't

accept slop. You won't get slop in your codebase. But to do that requires taking

codebase. But to do that requires taking short-term velocity hits in order to back up or doubleclick into a task to figure out what it is the agents are

struggling with in your environment.

Put the guardrails in place so they stop making those mistakes and then figure out ways to step back and spend your time on higher leverage activities once you solve some of the

blockers in the short term.

When I think about empowering my team in this way, everyone is an expert in what it is they bring. I have a diverse full stack team that is experts in front-end

architecture, backend scalability, being productminded. And each one of those

productminded. And each one of those different personas fleshes out the skill set of my team by bringing a different understanding, a different set of solves for those non-functional requirements.

Getting teammates to write those down actually means that every engineer driving agents gets the best of every single person on my team. I don't need

to block on low signal code review in order to learn what it means to write a good QA plan. To have one engineer on my team document that in a durable way

means every agent trajectory is going to get a good QA plan. And we can do this once in a high lever way that we're able to stack on top of.

So how can we get the agents to do a good job? What are some of the tools and

good job? What are some of the tools and techniques we have in order to essentially prompt inject our agents and continually remind them of what it means to make those specific choices that we

expect around those non-functional requirements. And there's a bunch of

requirements. And there's a bunch of ways we can do this. We can write good agents.mmd files. However, with autoco

agents.mmd files. However, with autoco compaction, which is a thing that has continued to improve, GPT 5.4 and CEX is fantastic at autoco compaction. I essentially never have to

compaction. I essentially never have to write slashnew anymore. I've got some pictures on my Twitter of me strapping my laptop into the back of my car so I can continue do running inference while

I'm commuting to and from work. And in

this world, you have to kind of build for that expectation that context will get paged out over time. We need to be continually refreshing context as the

agent goes about doing a task. And the

ways we can do that are by having reviewer agents look at the code along the way through the lens of what it means to be successful. Right? We have

security and reliability review agents in our codebase that are continually running as part of every push and CI that look at those documentations and the proposed patch and do simple things

like say, are there timeouts and retries on this bit of network code? Has the

code that has been introduced have a secure interface that is impossible to misuse?

I'm sure everyone here has been paged at some point for network code that failed in production causing an outage that could have been remediated by a retry

and a timeout. And I know I'm guilty of putting that retry and timeout in merging the bug fix and otherwise ignoring that. I am not a reliable

ignoring that. I am not a reliable reviewer or author of code with respect to this non-functional requirement.

However, taking the time to write some docs, write a lint that is bespoke to my codebase that is going to look at every time I call fetch to make sure that there's a retry and a timeout wrapped

around it means I've durably solved this problem and I'm able to do it because I lean on this axiom that code is free that the agents are able to do a good job that I can completely migrate the

codebase to solve this problem durably once and for all. And in order to kind of operate in this way, we need to step

back and look at the durable classes of failures that the agents and the humans in the codebase are making time after time. Figure out why we're spending time

time. Figure out why we're spending time on it. Devise a solution to

on it. Devise a solution to systematically eliminate this class of misbehavior and then continue to observe, refine, and make additional choices on those non-functional

requirements.

One really neat trick I use here is that you can write tests about the source code as well that are separate from lints. Right? If we know that context is

lints. Right? If we know that context is limited, we can write a test that limits the fact that files are no longer than 350 lines. We're adapting our codebase

350 lines. We're adapting our codebase to the harness to the models to do a little bit of engineering to be context efficient and squeeze more juice out of

the model capability that we have today.

The other things we can think about are providing good error messages that give actual remediation steps to the model and to humans for how to proceed next.

It's not enough to say we've got a lint failure because we're awaiting in a loop or that we have an unknown at this deep part of the codebase and why is the model writing a function called is

record. What we need to do is provide a

record. What we need to do is provide a prompt via a lint or a test failure that says no no no you shouldn't have an unknown here at all because we parse

don't validate at the edge and you certainly have a type here which was derived from zot loadbearing infrastructure for our AI future you can just prompt things

I've talked about here today is a prompt you can do this without touching the model weights at all.

Kind of a funny digression here is it seems like each advancement we've had in the complexity of the way we write code to interact with these models comes from

both increasing capability in the models and increasingly niche ways for injecting prompts into those models. prompts I'm sure you're

those models. prompts I'm sure you're aware are prompts powers prompts rules files prompts skills prompts these lint error messages that I am talking about

prompts review agents that inject comments onto the PR that we require the agent to address before it is able to propose it for merge prompts you're going to find lots of ways to

insert prompts into your code and one way you can do that is by embedding agent SDKs into your tests that are going review the codebase for acceptability

using prompts that get embedded into the code. And if I find myself spending a

code. And if I find myself spending a ton of time writing prompts, we can actually shell out to the agent for that as well. Uh I've pointed codecs at all

as well. Uh I've pointed codecs at all of the prompting cookbooks we have on the OpenAI developer guide and told to synthesize a skill out of them for how to write prompts. Which means when I

find a need to write prompts in order to improve my agent performance locally in the code, I use the skill to write prompts that I wrote with the agent looking at the prompts to write the prompts.

All the leverage that you're encoding in in to your repository, your team, and the agents in this way stacks incredibly well. to kind of pull back to this idea

well. to kind of pull back to this idea that a single product-minded engineer on my team was able to give us a big lift.

They know what it means to write a good QA plan. To write a good QA plan though,

QA plan. To write a good QA plan though, you have to document all the features that you have, the critical user journeys, and how users engage with your applications, web apps, APIs, and

services.

Once you write those down on how to write a good QA plan with the expectation that all userfacing work has a QA plan, now a review agent is able to assert expectations around what it means

to prove that you have effectively written the feature. A QA plan indicates what media should be attached to the PR for the humans and agents to know that you've done a good job, which has the

consequence of me trusting the output more, needing to shoulders surf the agent less.

and removing myself from the loop even more to delegate more and more of the work to agents. And all of this is just making sure the agents have the tools

and tokens and context to do the full job to remove myself from the need as a synchronous driver. The

models crave tokens. We can

operationalize our codebase to give them tokens to drive them forward using sub agents and all these other techniques to refine the agent output.

I'm excited to let you all know today in the way you all do that you can just go build things. Do not hesitate to remove

build things. Do not hesitate to remove yourselves from the loop by getting the agents to do the full job because they can. Thank you.

can. Thank you.

Very excited to bring on our guest.

We've got Ryan Leapo today. He just gave the keynote. Um, very exciting speaker.

the keynote. Um, very exciting speaker.

The man is full send hyperengineering at OpenAI. So, uh, a little bit of

OpenAI. So, uh, a little bit of background. We did a Laten Space episode

background. We did a Laten Space episode with him. We shipped it the other day.

with him. We shipped it the other day.

The the story he wrote this great article called Harness Engineering and we're like, "Wow, this is pure gold." We

have him on the podcast. He's a token billionaire spending over a billion output tokens a day. That's like over $1,000. So, you know, man is really

$1,000. So, you know, man is really living it. Uh, we want to keep this

living it. Uh, we want to keep this exciting. Ask good questions, ask

exciting. Ask good questions, ask interesting stuff, ask things that people can learn from. But, you know, let's welcome Ryan onto the stage.

Hi folks, how's it going? Excited to be here. Uh, London has been fantastic and

here. Uh, London has been fantastic and uh, excited to kind of walk through what it is, uh, that we do and how we work here.

I think you got to come on. This camera

is just here. So,

I got blinded by the QR code. So, we're

Okay. So, background. We have about an hour. Um, scan this QR code. You should

hour. Um, scan this QR code. You should

get Slido. Slido will let you ask questions. If you see interesting stuff,

questions. If you see interesting stuff, you can thumbs them up and we'll try to get through them. Unfortunately, the

first one I can't superdo, but let's just kick it off. Ryan, can you show us your actual working setup with no laptop? Um,

laptop? Um, uh, yeah. Uh, here. Beach

uh, yeah. Uh, here. Beach

Margarita linear right?

Oh, wow.

Um, I'll say watch the podcast we put out. We go through some of the work, but

out. We go through some of the work, but if you want to talk about it, I guess without actually showing us what's your what's your workflow like? What's your

setup? How do you how do you approach a task?

Sure. So, uh, the way me and my team work is to start with tickets, right? We

have chunks of work that we want to do, features we want to add to our apps, reliability work that we want to do. uh

we give that ticket to an agent along with a couple of skills that enable it to manipulate our app. Uh we want the entry point to the development process

to be codecs not an environment which we build around it. So we kind of do things um outside in right like codeex is the entry point the same way you would be and we give it tools we give it

instructions on how to cook. So rather

than like creating a shell that our app and CEX get spawned into, we have a skill that teaches Codex how to launch the app that teaches Codeex how to spin up that local observability stack to

give it logging and telemetry. We give

it a skill that enables it to uh boot up Chrome DevTools and attach to the application with a you know local CLI that will connect via some Damon that we

have. So the whole way we have set up

have. So the whole way we have set up the repository and all of the local dev tools is for codeex to invoke them first. Um that means we have kind of

first. Um that means we have kind of like a bunch of little mini harnesses within the codebase that make it really easy for us to slot in additional guard rails. Uh you know a big package of

rails. Uh you know a big package of custom ESLint rules which get wired into every PNPM package in the workspace. We

have another sort of local dev harness that allows us to add sort of like higher level wholesome tests that assert the structure of the code itself rather than like either the syntax or the

behavior of the code. Things like you know package privacy dependency edges between different layers of our stack.

these sorts of things. Uh making sure that you know across multiple files zod schemas are dduplicated that there's a single canonical implementation of like our async helpers. Uh these sorts of

things because you know the way we have seen the agents work is to sometimes optimize for local coherence of a package rather than using like our shared utilities and things like that.

So having observed that behavior, we kind of have built a bunch of little pseudo llinter source code verification things that shake out some of that bad behavior so the humans don't get

distracted paying attention to that in reviews, stuff like that. But uh the setup optimizes for the agent to do the job and for the humans to not have to

keep track of the high churn in the codebase. Um we kind of centralize our

codebase. Um we kind of centralize our leverage around five to 10 skills. uh we

don't go super wide on skills preferring to make the existing skills better because at least I find that the infrastructure within the repository all

the local developer tools change super frequently uh and I don't really have the bandwidth to keep track of this. So

we hide all that complexity beneath the skills that the human has to invoke and let the agent just kind of figure it out. One one kind of neat thing here is

out. One one kind of neat thing here is um when we moved from using uh Chrome DevTools protocol directly to having this like Damon thing like I didn't know that had happened for like three weeks.

Uh it was like totally fine because Codex was able to do the thing uh you know with the documentation and things that we had in place and part of this you can get more detail in your article. So some background you

wrote a great piece called harness engineering. There's a whole section in

engineering. There's a whole section in there on how you thought about skills, thousands of skills versus simplifying it to just quite a few. But okay, uh,

continuing on, how do you stop yourself from overgineering harnesses? And a

little bit of a similar followup is, do you often build small tools for yourself, if ever? Uh, do you do you build custom tools?

Yeah. So, I think this is kind of gesturing in the direction of the bitter lesson here, right? which is how do I make sure the work that I do isn't like completely obsoleted by an increase in

model capability and the way I have thought about that is doing sort of the bare minimum amount of context management to kind of pull in requirements uh for the agent to do an

acceptable job over the course of its work and context is a thing that I don't think will ever be obsoleted right like the the models must be told like the requirements of the task which

guardrails to pay attention patention to these sorts of things. So a good harness is really operationalized around giving the model text at the right time so it can look at the work it has done and the

information around what a good job looks like and you know fundamentally the models are trained to follow instructions. All the harness should do

instructions. All the harness should do is surface instructions to the model at the right time. So we do want to minimize that too, right? You don't want to frontload all those instructions

because then you kind of like overwhelm the agent, but all of these sort of requirements around what a good job do need to be paid attention to over the entire course of a PR, right? So

figuring out ways to either defer or just in time surface those instructions is kind of what uh a good harness should do right?

If you know that uh you want your React components, right, to be decomposed so that they make good snapshot tests for individual more stateless pieces, right?

You don't need to load that up front.

Instead, you should kind of let the agent cook and prototype and experiment with the UI you want to build and then at lint or test time say, "Okay, you've done the work. In order to finish it, you have to break this apart so that

your components are small and as stateless as possible and have local dependencies on hooks instead of prop drilling or whatever it is that you want uh the code to look like. And then the agent will say, "Oh, this is a new

instruction for me. Let me take the patch as written, modify it to make sure that it aderes to the instructions." And

then up it goes to GitHub. And this sort of thing is not going to be obsoleted by increases in model capability. It's

really just about getting that right text, that right context to the agent at the right time.

Can we talk about an example of a good harness? So, a lot of people are asking

harness? So, a lot of people are asking about the codeex model, the codeex harness. How does that compare to other

harness. How does that compare to other harnesses? So, cloud code, open code.

harnesses? So, cloud code, open code.

Uh, how do you guys take these decisions into play? You don't work directly on

into play? You don't work directly on codec, but if there's you can if there's stuff you can speak about about the codeex harness, what you guys see as you architect it out.

Yeah. So one thing that I think is super powerful is this notion that the labs are not just post-training the models but post-training the models in the context of the harness in which they are

primarily deployed in right like the apply patch tool or like the specific quoting semantics of how to invoke the bash tool are like in the loop for the post- training process for the harnesses

from the labs which means like there is leverage to be had by depending on these sort of like firstparty harnesses directly At least this is what I

believe. Uh, and as such, kind of being

believe. Uh, and as such, kind of being able to direct through them via things like the SDK or manipulating the Codex app server directly means you kind of

get to ride the wave of all that leverage in post- training. Instead,

focus on the parts that you care about, which is like what correct code looks like. Um, I kind of have high confidence

like. Um, I kind of have high confidence that things like clog code and codecs will continue to get better. uh that is the responsibility of like the teams

working on these coding agents. So in my role where I don't really want to focus on the coding harness at all is figuring out ways to plug into them in ways that

um kind of like steer the agent. That

means my job can sort of like move up to thinking about differences in model behavior between releases rather than deeply understanding the nuts and bolts of the harness. Instead, I can think

about what it means to, you know, drive the behavior that I want vers based on the observed behavior rather than like the inner mechanics of the thing. It's a

perfect follow-up to the next question, which is, uh, do you have any recommendations for collaboration platform? So, when you're in the

platform? So, when you're in the software development life cycle, is there any platform that you use for agents, engineers, developers all to collaborate on working on anything? Any

tips, any tools?

Yeah. So

in this world it has largely been just markdown files in the repository and GitHub that have been the primary sort of hub and spoke sort of thing. If you

think about collaborating on a document like you open Google Docs, you write something, you ask for feedback, people comment, you apply suggestions, these sorts of things. This is kind of like a little clean room environment just for

this work artifact that you're producing. like a PR kind of has a

producing. like a PR kind of has a similar purpose. So we kind of treat

similar purpose. So we kind of treat that as a big hub and spoke broadcast domain where all of the agents and humans collaborate together. Uh and

because we optimize for throughput, we don't block on any sort of like contribution to that like folks can either review or not. Agents can either review or not. The implementation agent

can acknowledge, defer or reject any feedback that it gets. uh really

allowing each participant in the production of diffs to kind of make their own judgments around what it means to deliver, receive, respond to feedback. Uh and this has a nice

feedback. Uh and this has a nice property of like not putting the model in a box in a bunch of places. We want

them to use their good reasoning sort of thing. So being super prescriptive

thing. So being super prescriptive around like every bit of feedback must be addressed can kind of have this like catastrophic failure mode of your coding agent being bullied by all of the

reviewers when really we want to bias toward code being accepted, not perfect, not drowning in minutia and these sorts of things.

How should people get started with using coding agents? People that have been

coding agents? People that have been using a lot of doing a lot of manually written code, how how do they start to transition? What should they offload?

transition? What should they offload?

How do they kind of come over that barrier of okay, I'm still checking every PR I'm copy pasting from codecs.

How should like the average engineer start to use these tools?

I think there's two ways to approach this problem. One is to

this problem. One is to start using the coding agents to improve your confidence in the code itself as it is written today. Right? I think we would all agree that like more tests is

probably a good thing, right? to assert

that our programs are well specified and behave correctly as our users interact with them is a good thing. Uh and the agents are super good at looking at the existing code with some context around

how it is meant to be used and writing tests that assert that behavior. So kind

of using this to improve your confidence in the quality of the code will also increase the agents ability to successfully navigate it which means you don't have to worry as much around doing

super detailed review of the agent output. The other way to think about

output. The other way to think about this is to look at how you are spending your time. Is it you know staring at

your time. Is it you know staring at your editor writing code? Is it waiting for tests to run? Is it waiting for human review feedback? is CI slow and you're like waiting on that maybe you

have a ton of flaky tests and using the agents to incrementally automate the parts where you are spending your time because ultimately the high lever parts

of our jobs is to define the work that must be done prioritize and schedule that work and then effectively empower folks on our team to do that work. uh

and the more and more we can delegate and move into sort of this like sequencing and orchestration role even if if you just think about like managing your teams right the more parallel and

the more like deeper individual executions of those delegations we're able to do right if I put primitives in place that make it super easy to like spin up ways to respond to events on my

Kafka queue right like I don't really need to be in the weeds with every engineer making sure they like implement a consumer correctly Right. And these

same sort of like building block style techniques apply really well to the agents and stack really well too.

A fun one. How do you work with agents in your car?

Um so I have not used the new uh voice mode that launched in CarPlay uh recently. Uh not ready for that. But uh

recently. Uh not ready for that. But uh

usually what I'll do is kick off uh a task uh right before I leave the office.

uh tether my laptop to my phone, buckle it into the back seat, and kind of let it cook in the 30 minutes it takes me to get home. Uh most of the time with the

get home. Uh most of the time with the skills we invoke that tell the agent, you know, you're operating on a task, you go until the tests are green. Uh you

know, I don't have to reach back there and poke yes, continue onto the thing.

Uh and I'm basically able to more fully saturate, you know, my day with token consumption. Um, the dream here is that

consumption. Um, the dream here is that I actually have 50 agents running 247 and I don't have to interact with them at all. Uh, and the way to do that is to

at all. Uh, and the way to do that is to define the work well, figure out ways for it to automatically be scheduled and remove myself from having to click the button. Right? Every time I have to type

button. Right? Every time I have to type continue to the agent is like a failure of the harness to provide enough context around what it means to continue to completion.

Wow, good statement at the end. every

time you have to interact with the agent is a failure. Okay, so the following question kind of scales this out, right?

As your org knowledge map scales, what practical steps do you have to like enable progressive disclosure? So as you have a larger and larger codebase, as you have more people, how do you scale

your agents to work better with this?

Yeah. So

when I sort of initially started this project that I was working on, blank repository, create Electron app, right?

you know, V single package, all this sort of stuff. And eventually ended up with a mess, right? Because there's no package privacy that allows me to enforce invariance around what APIs are

public versus which ones are not. The

agent didn't have like concrete hooks in the file system to determine which domains were separate from the other ones. So we ended up going like full

ones. So we ended up going like full 10,000 engineer organization heavy on the architecture 750 packages in the PNPM workspace

isolated by business logic domain or layer of the stack individual small util packages that encapsulate reusable functionality that we lint on being used

that we can encode leverage in and I do think that like in this world even if you don't actually have microservices structuring your repositories in ways

that you can actually scope like the directory subree you are looking in to be able to do most of the change helps uh and you know code in the file system

is also text which means it's effectively prompts that you're giving to your coding agent. Uh, so making the code as much the same as possible kind

of makes it so that regardless of where in the repository your agent is looking, it develops a ton of transferable context, right? Like you should have one

context, right? Like you should have one way to like do a bounded concurrency helper. You should have one way to

helper. You should have one way to construct a observable and instrumented side effectful command. You should have one OM, right? Like you should have one programming language. You should have

programming language. You should have one way of writing CI scripts. you

should have one way of adding additional lint rules, these sorts of things because it means that like the tokens that you want the model to produce are easier to predict and more consistently predicted regardless of where it looks.

Um, so I would say figure out ways to structure the code so it is local to a subree in the repository for most of the ways you would interact with that system and then figure out a way to use these

agents to completely migrate the codebase to be the same. you know,

empower someone on your team to be a dictator to say this is the way it must be done, right? Or, you know, figure that out together and, you know, write it down, evolve the code so that it reflects that reality,

these sorts of things.

We've got a few questions on code review. How do you approach code review

review. How do you approach code review now that you have such high velocity?

Uh, do you just not read the code? Do

you just trust trust the test coverage?

Uh, how do you write good tests? How do

you offload that stigma of like, you know, you have a mental blocker. I need

to manually check everything before I merge pure.

So that same sort of idea where you have to look at where you're spending your time and figure out ways to spend less of it. Uh, you know, when we started,

of it. Uh, you know, when we started, right, the first thing to do was figure out how to get the agent reliably producing code that we would accept. And

a big challenge we ran into is with each engineer producing three to five PRs per day, even on a team of three, merge conflicts were super miserable, right?

Because these PRs tended to be pretty big. We were working on the same parts

big. We were working on the same parts of the codebase. So that's where we moved in two directions. One was to like tree out the code a bit more to minimize these merge conflicts, but also minimize

the amount of time PRs were open so that we were uh reducing the likelihood of a merge conflict actually occurring. And

the reason PRs were staying open so long was because we needed code review uh because humans were being the blocker in

this scenario. So in order to do

this scenario. So in order to do that piece automatically, I essentially asked every engineer on the team to take one day a week, Fridays, we called it

garbage collection day, where our entire job was to take every bit of slop we had observed over the course of the week that was making a PR difficult to merge

and figure out ways to categorically eliminate it from ever happening in the first place, which is where we kind of started closing this loop between the feedback that humans were giving on the

PR indicates some context failure on behalf of the agent, getting that into the repository and then figuring out ways to automatically prompt inject the agent so that it would selfheal when it

produced this bad behavior. And this is kind of how you go from synchronous human time spent giving feedback as code review comments to documentation in the repository to automatically surfing this

documentation either via a failing test or a reviewer agent who is primed to review the code as written in the context of these docs. But all of that happens by putting those docs in a

single place that all these processes are able to attach to. Um, you know, we kind of asked folks to basically bucket the types of review feedback they were

giving into like um like the persona they were operating as like front-end architect, you know, reliability engineer, scalability sort of thing. And

then basically for each of those personas, we spun up a review agent that gets triggered on every push that says, is this code good? Surface any P2s or above that would block this PR from

merging based on these documentation that says what good looks like. Uh and

with that and just continuously appending to these files, we started to see slop reduce reduce reduce.

People have questions about your billion tokens. Where do you think those are

tokens. Where do you think those are split up? So how much of it is on code

split up? So how much of it is on code review? Where where is the majority of

review? Where where is the majority of that usage coming from? And a followup for people that are just getting started. Say they have they've jumped

started. Say they have they've jumped and done a $200 pro plan, right? If you

had to cut your usage by a fifth, how should people maximize their usage?

Right? You run into usage limits. Um,

you know, you don't want to just copy paste million lines of code every six hours. No prompt hit prompt cache hit.

hours. No prompt hit prompt cache hit.

But how should we how should we think about that?

Yeah. So I would say probably a it's probably a third a third a third between like planning ticket curation

documentation implementation and stuff that runs in CI.

Do you use plan mode?

Uh we uh I've used exec plans which was kind of like an early version of this that we published which is sort of like a proto skill that says this is how you should structure a plan with milestones

and acceptance criteria. um haven't

really used plan mode as part of the harness at all. My my sort of expectation here is that I should be able to drop a ticket in and have it do the job anyway without diverting through a plan. Uh because most of the time I'm

a plan. Uh because most of the time I'm never going to read it anyway. Uh so I find that if you do use a plan and you approve it without reading it at all, you're actually encoding a bunch of

instructions that you don't necessarily want followed. Uh so if you are going to

want followed. Uh so if you are going to use plans, my recommendation is to push those up as single PRs with just the plan where you actually have human

review every line of it and like block on human approval before they get merged and then kicked off. Uh because it's you're effectively potentially wasting your time on a rollout with instructions

that like are bad. Uh so you want to kind of like minimize the time that happens. But I do think that uh kind of

happens. But I do think that uh kind of getting tokens to be spent in CI is a necessary part here because writing code no longer is the hard part. Like getting

code accepted and advancing the code and product forward is like what it takes to make that written code be valuable. And

you know you kind of have all heard the apherism that like you know senior engineers give good code reviews like we expect our senior engineers as agents to do the same.

Uh someone asked is code a disposable build artifact?

Yes.

Yes.

Uh I think we we touch on this with uh symfony which is sort of this agent orchestrator that we release. This idea

that you know we can publish a library that's actually a super well-defined spec that the code is a compiled

artifact of. And I think like using LLM

artifact of. And I think like using LLM as fuzzy compiler is like an interesting mental model to have, right? Like all of the context that we're putting in the

codebase for harness engineering is effectively like constraints and optimization passes on which code is acceptable to build in the first place.

Uh and this is pretty similar to like the static analysis and optimization passes that something like LLVM would do in the process of compiling Rust code.

uh and sort of swapping out one model for another is sort of like changing your code generation backend from you know LLVM to

crane lift in the Rust compiler and you would expect that all of the sort of rules around what acceptable Rust code looks like produce valid sound machine

code out the back even if the generation process is different and you end up with different x86 instructions. So same sort of mindset for LLMs swapping out different models sort of thing. We want

the structure around the code to basically limit how it is written to things that would be acceptable to us.

At a high level, can you give us a picture of what future you're building for? Does context still matter? How do

for? Does context still matter? How do

people do engineering, harness engineering, context engineering? What

does the future look like?

sort of the the feature that I want to build toward here is where I'm able to take a token budget and a

quarter, a half or a year's worth of work, take the human input to rank what is most important success metrics, reliability metrics, give it to the

machines and have them continually work and advance my product forward. uh

without sort of you know my hands explicitly on the wheels at all. We

as we have gone through like very early prototyping to internal alpha internal beta external alpha I kind of have felt that like new parts of the software

engineering process have kind of like started from zero and we've had to build up capability kind of like these like you know pentagonal like personality charts right where like I spike in this

direction maybe I'm weak over here and you know when we get to deployed software for the first time, right? The

agents ability to do like QA smoke testing on our built artifacts before they're promoted to distribution was weak. We hadn't invested any time in

weak. We hadn't invested any time in this. There were no docs. There were no

this. There were no docs. There were no tools that the agents could use to like download the built artifact, launch it, poke around to make sure that our like most critical user journeys were well

validated and tested. So, because I don't want to be touching the computer, we needed to figure out like ways for the agents to build themselves tools to do that part. Uh, and

there's a whole universe of software engineering outside of writing code, right? Like I am triaging user feedback.

right? Like I am triaging user feedback.

I'm triaging pages. I am making sure that we don't have any PII leaking in the logs in production. I'm making sure that like the Twitter vibes are good and

people are enjoying my software that uh our user operations staff are supported with well written runbooks that allow them to triage and mitigate high volume user issues and then moving that into

the code itself so they don't happen in the first place and as I no longer have to produce code like my mind can shift to these other higher level or more

squishy activities but the agents are good enough to do these things too and figuring out how like write down the processes and the acceptance criteria becomes like the sort of like meta programming part of the job using these agents.

That's a great way to end it. What an

exciting future. Give it up for Ryan guys.

Thank you folks.

Loading...

Loading video analysis...