Future-Proof Coding Agents – Bill Chen & Brian Fioca, OpenAI
By AI Engineer
Summary
Topics Covered
- Coding Signals AGI Proximity
- Harness is Model's Interface Layer
- Align Prompts to Model Habits
- Harness Emerges as Abstraction Layer
- Future: Unsupervised Long-Horizon Tasks
Full Transcript
[music] Hello everyone. Um, today we'll be
Hello everyone. Um, today we'll be talking about how to build coding agents.
And uh, I'm Bill. I work on the applied AI startups team at OpenAI.
>> And I'm Brian. I work with Bill on the OpenAI startups team.
>> And we specifically uh focus on uh building coding agents here at OpenAI.
Um yeah. So why are we talk giving this talk? Why why are we you know u talking
talk? Why why are we you know u talking about coding agents? Well, it's really quite interesting because it's been booming for the the the past year.
Actually, it's just if you think about it, it's not that much time ago. like
only have been a year or so. The ground
keeps shifting really under the uh harness on on the coding agents. But if
you think about it, it's really like why it's interesting is because it's really a signal on how close we are to AGI.
Software engineering can be set as a universal medium for problem solving.
But because the ground is shifting so fast, uh we h kept having to rebuild the agent on top of the model whenever a model is released. And today we're going to talk a little bit about how we might
be able to get around that.
So here's what we're going to go over today. We'll start with the anatomy of a
today. We'll start with the anatomy of a coding agent, especially going into the details of models and harnesses and how they work together. We'll share some lessons that we learned from putting
them together ourselves. And we're
specifically going to talk about codeex here, which is our own coding agent.
We'll talk a little bit about emerging patterns that we're seeing from all of you for using agents like Codeex in your own products. And lastly, we'll talk a
own products. And lastly, we'll talk a little bit about what to expect from Codeex in the future so that you can build along with us if you want to.
To start, let's talk a little bit about what makes a coding agent an agent as a whole. Um, it really is quite simple. I
whole. Um, it really is quite simple. I
think you know people kind of over complicate things a little bit these days. It's made out of three parts. It's
days. It's made out of three parts. It's
a user interface. It has a model. It's a
harness, right? Uh the interface quite self-explanatory. Could be a computer uh
self-explanatory. Could be a computer uh like a CLI tool or it could be a uh integrated developer environment, could be also cloud or background agent. Um
models also very quite self-explanatory are you know the things like the latest and greatest the GPD 5.1 codeex uh max
that we just released yesterday uh or the GPD 5.1 series of models or other uh models from other providers as well. And
the harness uh is a little bit more of an interesting part. This is the part that directly interacts with the model uh in the most reductive way. You can
sort of think of it as a collection of prompts and tools combined in a core agent loop which provides input and outputs uh from a model. Uh the last
part will be our focus for today.
As touched on a bit earlier, coding is one of the most active frontiers in applied AI and uh how models are constantly getting released and we're not making the problem uh easier for
everybody is that people have to constantly adapt uh the agents to the new models.
So, um, Bill's done a great job of giving us an overview of coding agents, what they're made up of. So, let's zoom in a little bit on the harness. Um, it
turns out that's a little bit tricky.
So, what is a harness? A harness is really the interface layer to the model.
It's the surface area the model uses to talk to users and the code and perform actions with tools. It's made up of all
of the pieces that the model needs to work over many turns, call tools, and and really write code for you and interpret what the user is actually
asking. [snorts] Um, for some, the
asking. [snorts] Um, for some, the harness might actually be the special sauce of the product. But as we're going to go into a little bit more, it's
really challenging work to build a good harness. And we'll talk about how we did
harness. And we'll talk about how we did that.
So let's see what are some of these challenges. Um just to name a few, AV is
challenges. Um just to name a few, AV is one. Um your [laughter]
one. Um your [laughter] um your brand new innovative custom tool that you're giving to your agent might not actually be something the model is using is used to using. It may not have
ever seen that tool before in trading.
And even if it is, you need to spend time tuning your prompt to that particular model and the habits that it comes with.
And new models are coming out all the time. What about latency? Like does the
time. What about latency? Like does the model take a while to think about certain things? Which things do you
certain things? Which things do you prompt it not to? How do you expose the UX of what a thinking model is doing while it's thinking? Is it communicating with you while it's thinking or do you
have to summarize it? Managing the
context window and compaction can be really challenging. We just launched
really challenging. We just launched Codeex Max that does that out of the box for you. you don't have to worry about
for you. you don't have to worry about compaction and context window management. It's really hard to do. Um,
management. It's really hard to do. Um,
and so if you were to do it yourself, have fun. Um, and then also like the
have fun. Um, and then also like the APIs keep changing, right? So we have completions, we have responses, we have whatever else is coming in the future.
What does the model know how to use and get to get the most intelligence out of the box?
And so this is the interesting part. Fitting a
model into a harness takes a lot of prompting.
It turns out that how the model is trained has side effects.
I like to think about it this way.
Intelligence plus habit. Intelligence.
What is the model good at? What
languages does it know really well? What
is what is its capabilities in terms of like how well it can write code in certain frameworks? And then what habits
certain frameworks? And then what habits did it learn to to use to solve those problems? We've trained our models to
problems? We've trained our models to have habits of like planning a solution, looking around, gathering context, and and thinking about a problem before
diving in and writing code, and then testing its work at the end.
Developing a feel for these habits is how you become a good prompt engineer.
If you don't instruct the model in ways that it's familiar with, you can have problems. We saw this when we launched GPD5. A lot of people who weren't used
GPD5. A lot of people who weren't used to using our models encoding tried to take prompts that existed for other models and put them into their harness
and have GPD5 follow those instructions.
And it turned out that we taught our model to do some of the things that the other models didn't really do out of the box. And so when they were prompting
box. And so when they were prompting them to look really hard at the context and like examine every single file before making a a code edit, our model
was being very kind of thorough about that and it was taking a really long time and they weren't seeing the best performance. And so we figured out that
performance. And so we figured out that if you let the model just do the behaviors that it's used to and don't overprompt it, it'll actually perform really better. We found out by asking. I
really better. We found out by asking. I
was literally like, "Hey, like I like the solution, but it took you a long time to get there. What can I do differently in your instructions to help you get there faster next time?" And
literally it said, "Uh, you're telling me to go look at everything and I don't really need to. So that's what's taking forever."
forever." And so you can actually see the advantages of building both the model and the harness together because you just like know all of that while you're
building it. And that's why Codex is
building it. And that's why Codex is both a model and a harness combined.
So let's dig deeper into Codeex and what it can actually do.
So we built Codex to be an agent for everywhere that you code. It's a VS Code plugin. It's a CLI. You can call it in
plugin. It's a CLI. You can call it in the cloud from the VS Code plugin or from chatgbt from your phone. Um, and
it's very basic. You can use it to turn your specs into runnable code starting from a prompt. Um, having a plan. It
navigates your repo to edit files. It
runs commands, executes tasks, and you can call it from Slack or you can have it review PRs and GitHub. So, all of the things that you would expect.
And that means that the that codec um the harness of codec needs to be able to do a lot of really complex things. Uh
when I talked to a member of the codeex team about this slide and what should be on it, he was like it's way harder than you think. You have to manage parallel
you think. You have to manage parallel tool calls like thread merging and all of the things involved in that. Think
about all the security considerations you have with sandboxing, prompt forwarding permissions uh port management. Um, compaction is a whole
management. Um, compaction is a whole thing. Um, and doing that well is really
thing. Um, and doing that well is really complex. When do you trigger compaction?
complex. When do you trigger compaction?
When do you reinject? How do you worry about uh cache optimization during that MCP, right? Like all of the uh plumbing
MCP, right? Like all of the uh plumbing you have to build for MCP support into the harness. Uh, and then not even
the harness. Uh, and then not even mentioning images and what's the resolution that you need to compress them to to send them to the model. All
this all of this is like work that you have to do if you're going to build this from scratch and keep it updated as new features come online.
So since we've bundled all of these features together for you in an agent that can safely write its own tools to
solve new problems that it encounters.
Oops.
Uh we actually have here uh a computer use agent for the terminal.
Wow, that sounds quite a bit powerful than just plain old coding agent, doesn't it? Um but just think about it
doesn't it? Um but just think about it again. Well, before browser and graphic
again. Well, before browser and graphic user interface was a thing, wasn't that how we always operate a computer?
they're writing code and chain them together in a command line interface. So
that means if you can express your tasks in command line as well as files tasks codeex will be able to know what to do.
Um the example is I like to use codeex to organize a lot of the photos from my desktop into a folder and that's a very simple use case but what it can also do
is it can analyze huge amounts of CSV files inside of a folder uh doing data analysis it does not have to be a coding task and if it can be accomplished by running tools from command line you can
use codeex so now that we see codeex is such a cool harness um I want to also share a little a bit about how you can use it to build your own agents. And what you can do is
you can use codeex [clears throat] the agent inside of your own agent.
Um, how does that work? Well, if you want to build uh a coding uh the next coding startup, we don't really have all the answers, but we do have a few
patterns uh that we thought uh might help you having worked with some of the top coding customers uh like cursor and VS code. Uh one of those patterns is uh
VS code. Uh one of those patterns is uh harness becoming the new abstraction layer. The benefits of this is quite
layer. The benefits of this is quite obvious. Um, you no longer have to care
obvious. Um, you no longer have to care about prioritize optimizing the prompt and tools with every model upgrade.
[snorts] >> But, um, does that mean you're just building a wrapper?
>> Well, I disagree with that take.
I disagree. I was disagreeing with my colleague here. Um, just like how
colleague here. Um, just like how building rappers on top of models I think is really reductive on uh on the whole value prop of the infrastructure layer. Sorry, I used to be a VC.
layer. Sorry, I used to be a VC.
[laughter] >> Focusing most of your efforts on differentiating your product is what this pattern allows you to do. And
that's where most of the value lies.
Exactly. Okay. So, let's look at some of these patterns that we've seen and actually have helped our customers build um along with them. Codeex is an SDK. It
can be called through a TypeScript library. You can call it
library. You can call it programmatically and a Python exec.
There's a GitHub action that you can plug into to have it merge merge conflicts on PRs that everybody hates doing. Then uh you can also add it to
doing. Then uh you can also add it to the agents SDK and give it MCP connectors back to your product. So now
you have an agent. I like to say we started with chat bots that you can talk to. Then we gave the chatbots tools to
to. Then we gave the chatbots tools to use. And then now you can give uh a tool
use. And then now you can give uh a tool to your chatbot that can make other tools that it doesn't have. And so now you can actually build out enterprise
software that does it that writes its own plug-in connectors to the API level for each customer on the spot. That's
something that a professional services team used to have to do. Um, so you have fully customizable software that can now talk back to itself. Um, I made a conbon board for dev day that can actually fix
its own bugs. Um, it's pretty fun. And
then lastly, um, you can actually do something like what Zed has done. They
have just decided to wrap codeex inside of a layer and give it an interface to the IDE for talking back and forth for the user and making code edits. And now
they don't actually have to do all the work of staying on top of all of the things that we're good at doing and they can focus on building like the best code editor.
Uh so our top coding partners like GitHub has used this uh to great effect and well uh we've created an SDK uh for it that they used to directly integrate
uh with codeex. You can also use the SDK to uh control codecs as part of your CI/CD pipeline as well as use it as an agent that directly interacts with your
own agent as well. Uh [clears throat] if you really want to customize the agent layer, you can do it too. As an example of this, we worked with closely with the cursor team to get the best performance
out of the codecs. The model, not the agent, we're bad at naming things. The
model is different from the agent. They
did so by aligning their tools to be in distribution with how the model is trained and they did so by aligning uh their harness with our open- source uh implementation of codeex CLI. All of
this is publicly available. Uh you can fork the repo, you can use our source code, you can use it. Uh go nuts.
So what does the future hold for Codeex?
It hasn't even been out for a year. Um
and especially with the lo la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la launch of CEX match yesterday like things are really changing fast. Uh it's the fastest
changing fast. Uh it's the fastest growing model in usage now serving dozens of trillions of tokens per week which has actually doubled since dev
day.
It's always good to build where the models are going. It's safe to assume that the models will get better. They'll
be able to get to work on much longer horizon tasks unsupervised.
New models will raise the trust ceiling.
I trust these models now to do some way harder work than I would have 6 months ago. And that's going to keep
ago. And that's going to keep increasing. The future is about
increasing. The future is about sprawling code bases and non-standard libraries and knowing how to work in closed source environments, matching existing templates and practices
and the models uh and and and so you can imagine that the SDK will evolve to better support these model capabilities, letting the model learn as it goes and not repeat mistakes and generally
provide more surface area for an agent that writes code and uses a terminal to solve whatever problems it encounters.
counters and you can use that in your products via the SDK.
So, what have we learned? Harnesses are
really complicated and take a lot of work to maintain, especially with all the new models coming out. So, we've
built one for you inside of Codeex that you can use off the shelf or look at the source if you want to and you can use it to build new things outside of coding
and let us do all of the work making sure that you have the most capable computer agent.
And we're really excited to see what you craft.
[applause] Heat. Heat.
[music]
Loading video analysis...