I got an inside look at how OpenAI PMs ship code
By Aakash Gupta
Summary
Topics Covered
- If you are not using a billion tokens a day, you are negligent
- Code is a liability in the age of AI code generation
- Every team member can own the codebase
- Every person brings a unique lens; agents compile them
- A PM shipped from PRD to production in one week
Full Transcript
One of the craziest statements I heard from you was that code is a liability. What does that mean exactly?
Yeah, there's a bunch of different ways that this expresses itself, but primarily all of our human software engineering organizations have been tooled around this idea that code is the most expensive thing to produce. You have synchronous, limited throughput of your human engineer operators on the keyboard.
Ryan Lopopolo led the team building it. He works on the OpenAI Frontier team, and today he is giving you insight into how you can transform your codebase and how you work to operate at the frontier.
You've said that if you're not using a billion tokens a day, you're basically negligent.
The more tokens that you're running through these models, the smarter they get, and the more intelligence you are able to extract from them. This is sort of the realization behind why we have reasoning models in the first place, right? If we give the model more effort to do inference, we're actually able to get greater results out of it.
The engineer owns the code. The designer owns how it looks. The PM owns the problems we're trying to solve. What do the new lines look like in the new EPD org?
Before we get into today's episode, I wanted to share that you can get a free year of my favorite AI tools, including Bolt.new, Mobbin, Arize, Relay.app, Dovetail, Linear, Magic Patterns, Reforge Build, Descript, and Speechify, if you join my bundle at bundle.ac.com.
On top of that, I wanted to quickly ask you to please double check that you are subscribed on YouTube, Apple, and Spotify podcasts. It's a free thing you can do that really helps support the show. And now into today's episode.
Every time I talk to people from OpenAI, I get the feeling that they are operating how the rest of us are going to be operating in two years. Whether that's PMs and designers actually having access to the codebase (how many of us still don't have access to the codebase, right?) or actually building the right scaffolding so that PMs and designers can feel confident operating with the codebase.
That is the next level. And today we have on one of the people who gave the best talks on harness engineering that I have ever seen, Ryan Lopopolo. He leads one of the teams at OpenAI on the frontier, OpenAI Frontier, and today he's going to break down how you can go from the old way of working to the new way of working. Ryan, thanks so much for being on the podcast.
Thanks, Aakash, for having me. I'm super excited to talk a little bit today about what it means to enter this new way of working, where we can sort of empower the machine to do much of the work for us, which frees up the humans in our organizations and teams to focus on the true value that they bring to the table, right? Scheduling the work, bringing their taste and expertise, and making sure that we're shipping actual delightful, high-quality products to the world that solve real user needs, right? That's the job. And it's been fun to be able to experiment with how we can do that more effectively and at higher velocity.
I'm so glad you said that, because I think there's sort of a definitional phase we need to help people with at the front. One of the craziest statements I heard from you was that code is a liability. What does that mean exactly? And what has been recently enabled with the latest versions of GPT-5 and 5.5 that allows you to say that?
Yeah, there's a bunch of different ways that this expresses itself, but primarily all of our human software engineering organizations have been tooled around this idea that code is the most expensive thing to produce. You have synchronous, limited throughput of your human engineer operators on the keyboard, right? So there has been a high sense of protectiveness around that code. It is expensive to produce. It is expensive to validate. It is expensive to deploy. So these things have been treated as requiring the bulk of the organizational apparatus to manage. In the world we are in today, we have the token prediction machine, these lovely, highly advanced models. Codex is fantastic at taking all the best parts of GPT-5 and putting that code and those words into the world.
The code is trivial to generate. It can be generated at arbitrary parallelism, depending on how confident you are that the code is able to solve those problems, which we will get into in a little bit more detail. But you kind of have to change the way you think about scheduling work and exploring when you no longer have what was the overriding constraint in your teams. And because of this, you can sort of remove humans from the loop almost entirely on that production aspect of the code. And then you have a whole host of other problems on what it means to ensure quality, right? You have a less tight feedback loop for those human engineers, right? Because they're not physically typing on the keyboard, they can't attest to every line of code as it is produced. But this is sort of the same spot that the rest of the organization has traditionally been in, right? There is an indirection from my designer putting a mock in Figma to me actually realizing that in HTML, SVG, and CSS. There's also this lossy translation between doing customer interviews, going to a PRD, and turning that into code and deployed artifacts. So it's kind of nice that we have this equalizing force here, where we can potentially end up in a world where every member of these software delivery teams is on equal footing in terms of what it means to actually produce the executable artifact that we're delivering.
When everybody's on equal footing, though, what even is the purpose of roles? Traditionally we've had this triangle: product manager, designer, engineer. We've been able to cling to what we own. The engineer owns the code. The designer owns how it looks. The PM owns the problems we're trying to solve. What do the new lines look like in a new EPD org that's built to take advantage of the latest that this technology is offering?
Yeah, this is really interesting. It's a question I've gotten a lot, actually, from executives and VP- and director-shaped folks that I'm talking to. And everyone comes to this question with a slightly different set of priors, basically asking which roles they need to hire more of or less of. But I think that's the wrong way to approach this question.
Every one of these individuals on your team brings their unique lens for viewing the world, and these coding agents have all the best parts and all the worst parts of how to build software in them. So having a diverse team with a bunch of different perspectives on what successful delivery looks like means that you're actually able to take the best parts of everybody's worldview and compile them into your single code-producing agent to do the full job. This is sort of what it means, I think, for the GPT-5 series of models to be fully agentic. They're able to do the full job. It's not just fancy autocomplete. It's not pair programming. It's being able to go from an understanding of the user challenges, the actual codebase we have, the design system that we're building on top of, what our typography primitives are, how we are actually able to deploy these vehicles to get them into the hands of users, and what it means to observe whether or not they are effective, to then feeding bug reports and user feedback back into the work. This is a fully closed loop that I would traditionally expect a team of four, five, six, or seven to have to do. But if instead all of these different people with diverse perspectives are empowering the same agent, that one agent can, with full context, do the full job.
So if you were to frame what the old and the new roles are, how would you do that?
So in the old way of working, you've got your engineering team, who is responsible for compiling the user-visible artifacts that we want to see into code that is then deployed into the world. And because code has been very expensive to produce and feedback loops have been long, this has been bounded by throughput: the human engineers need to spend a lot of synchronous time and attention maintaining the quality bar and maintaining the long-term viability of the codebase as a long-lived artifact.
And because their time has been largely focused on the production of code within this system, there has been, at least in my career, comparatively little effort around empowering others to do that same sort of work. This is why very often in organizations you see elbows out around not permitting folks outside of the team or outside of the engineering organization to contribute directly, because the code that is produced is potentially very expensive to unwind or very expensive to lift to the quality bar, versus having had your expert code authors produce it from scratch. And this has led to each role on the team operating within its own lane, with these very expensive coordination and handoff points between folks. And because you have these bottlenecks in handing off work, they're highly contended and low signal. There's lossiness there. It's very difficult for the designer to observe, in the weeds, how difficult it is to compile some of their mocks into code given the constraints of the design system that the engineering team is working with. And because the designer doesn't feel that pain directly, there can be either resentment or missing feedback channels from the engineers to design, which means that on the whole the system is not working together as effectively as it could be.
When I think about collaborating with my infrastructure teams and developer productivity teams, we often want to shift left as early as possible, right? We want to feel the breakage in a system before it gets to customers, before it gets to a PR, early in the developer loop, because it's easier to correct there. Similarly, if we are adopting product explorations that are difficult to prototype or experiment with because we don't have the right experimentation infrastructure, as a hypothetical example, it may be difficult to push back on product priorities in favor of this sort of product platform work. But this collection of three roles as a whole would very much benefit from being able to pull that work above the line in order to inflect their throughput going forward. And in this new world, getting all three of these roles directly collaborating in the same codebase artifact, feeling each other's challenges, means there's much more empathy between them. Folks are able to help each other more often. And in the happy case where everyone has contributed their expertise, we get higher throughput, higher velocity, more experimentation, more high-quality UI, and more product delivered into end users' hands.
I hope you're enjoying today's episode. Are you interested in becoming an AI product manager, making hundreds of thousands of dollars more, joining OpenAI or Anthropic? Then you might want to do a course that I've taken myself, the AI PM Certificate run by OpenAI product leader Miqdad Jaffer. If you use my code and my link, you get a special discount on this course. It is a course that I highly recommend. We have done a lot of collaborations together on things like AI product strategy, so check out our newsletter articles if you want to see the quality of the type of thinking you'll get. One of my frequent collaborators, Pawel Huryn, is the Build Labs leader, so you're going to live-build an AI product with Pawel's feedback if you take this AI PM Certificate. So be sure to check that out, and be sure to use my code and my link in order to get a special discount.
Here's the dirty secret about prototyping. You spend two weeks building a prototype. You validate your assumptions. Engineering loves the direction. Then what happens? You throw the whole thing away. Bolt changes this completely. When you prototype in Bolt, you're not building a throwaway mockup. You're building real front-end code that integrates with your existing design system. So when you hand it to engineering, they don't throw it away. They ship on top of what you've built. I use Bolt every single day. I host my Land PM Job cohort on it. And honestly, I'm up till 2:00 a.m. some days just vibing in the tool, having fun, and building. That's when you know a product is good: when you're using it past midnight, not because you need to, but because you want to. Check out Bolt at bolt.new. Link in the show notes.
That's what I want to see. What are the receipts? What actually happens? How can we quantify whether we're getting more product velocity, more throughput at quality?
So, I can talk a little bit about the project that I worked on here, this experiment that I was running. Back in June of last year, these coding models had just started to become viable, right? We were working with Codex Mini and, internally, the early versions of GPT-5. We had just launched Codex CLI, and by historical happenstance and accident I set about trying to build an agent utilizing these capabilities that was for non-coding knowledge work. The thesis was that all agents were coding agents, and we were going to prototype an internal agent that we were going to deploy at OpenAI. In order to do that, we started from zero, an empty codebase, and had Codex do the full software engineering job to produce this artifact. One interesting thing here from a product development perspective is that the constraint that only Codex could do the job, applied to producing something that was not an engineering-native agent, was super, super useful, because it forced us into delivering value to actual end customers who were not the builders. That helped with prioritization and with externalizing what we were learning in this new product category to folks who could give feedback about the product rather than feedback about the process of developing the product.
But in doing so, we produced a repository with about a million lines of code. Funny story about producing agents: there are a lot of prompts embedded in agents. There's about 250,000 lines of Markdown in this repository, because when you're shipping an agent, what you're really shipping is a multi-agent system. You have agents running around all over the place in what was an Electron app for us: things that garden the individual git repositories that are used to bootstrap all the tasks that the agents are working on, things for summarization, things for distilling skills, things for actually executing the work that we're doing on users' behalf. It's a pretty cool way of building products and an interesting way for non-engineering users to collaborate, because if many of the product surfaces are driven by agents, improving or changing their behavior is prompt engineering. It is tweaking text, and everybody knows how to tweak text. Everybody can tweak text and run the agent to get a vibe check, to do an eval on whether or not the change was good, and that in and of itself is super, super empowering. But the important thing is not just that we produced an app with a million lines of code. It's that in doing so, literally zero of those lines of code were written by humans. This was a Ryan-as-dictator constraint that I imposed on my team, which I thought was very interesting because it forces the engineers right into the back seat of the Uber, right? You can't drive. The only thing you can do is very politely give backseat-driver directions. And that mental model is, I think, very useful, because it's not great to yell at your Uber driver, right? You need to have them use the tools available to them in the car in order to get you to where you're going. So a big part of building in this way is making sure that the car, the GPS, the test harness, the lints in the codebase, and the knowledge corpus of all the features we've built and the critical user journeys are there in the moment, ready for the coding agent to drive.
So when Codex was given a task and for whatever reason was not able to do the thing, the job of the engineers working on the codebase was basically to stop, diagnose what went wrong, figure out creatively how to prevent the agent from making those mistakes going forward, and try again. Every time you do that, you've taken feedback that normally you would have to give via synchronous steering and figured out a way for the repository itself to give that context to the agent. This is why it was so easy for us to have non-engineers contribute to the codebase: every time the agent made a mistake, the engineers were forcing the agent to learn how to self-correct, which means we basically did not have slop in the codebase. It was very slow to get to that point, but you can avoid slop by simply not permitting the agent to write slop in the first place. You do that with tests, review agents, and other mechanisms that are LLM-as-judge driven, which prevent that slop from being committed in the first place.
You mentioned it took a little while to get there. And I think elsewhere you also talked about how the first month was about 10x slower than it would have been doing it solo. Can you explain what the first month and the evolution looked like?
Yeah, so the earliest version of the product exploration, starting from zero, was to have the agent triage pages on my behalf, right? Going on call is a thing that I do. We're exploring what it means to have these agents do knowledge work, and dealing with outages is a thing that I do. It's text heavy. I thought it was pretty amenable to having agents do this work. And starting from zero at the time, without all these lovely apps we have in Codex today, if you said, "Read my Slack and figure out what inbound outages are happening," it was just not a thing the harness was able to do. So it's obviously much quicker for me to tab over to Slack and copy and paste into Codex, but that's not the goal, right? The goal is to get Codex to do the full job. So I kind of had to stop and say, "Okay, I can copy and paste, but I don't want to. What must I build first to give Codex access to Slack?" And actually, the thing I have to build first is credential management in the harness, so Codex is able to request credentials on my behalf. In order to do that, I need access to the keychain. And in order to do that, I need secure coding practices that prevent me from making cryptography mistakes when we're dealing with this sensitive material. You can recurse into a task like that sometimes six, seven, eight times before you bottom out at the core primitive capability that is missing. And then you pop back up the stack to accomplish the actual higher-level primitive that you were trying to add to the product. One neat thing about that is that as you're doing this breadth-first search through the tools and quality gates that you need to give the agent, you knock those out one by one, and you don't have to do them again going forward, right? You are importing leverage into the codebase for every subsequent change. It means that every bit of feedback I give, I have to give at most two or three times before we figure out a way for that misbehavior to just go away entirely.
So, one part that blew my mind, and I always like this PM angle, is that your PM just wrote a PRD in Markdown and went from that to a shipped pull request. Can you give us a couple of concrete examples of what they shipped and how that flow works?
Sure. Back in September of last year, we were experimenting with what was the precursor to skills in this agent: the idea that the human driver should be able to teach the model about the things they care about. We were doing some data analysis tasks, so understanding the data ontology of what it means to do business metrics in this particular part of the company was important. And we want the model, we want the harness, to be able to pull this information out from the user, right? We have this lovely machine that is able to write a bunch of text. It should be easy for it to have a conversation with the user to figure out the things they care about versus the things they don't, so that all subsequent runs that are answering these business metrics questions accrue that leverage and context. The thing that we wanted to explore was what product shape this takes, right? How can we give the user the ability to inject durable knowledge into the agent over and over again? So our PM set about exploring this, and the hypothesis was that a library with the ability to both interview the user and learn from all the tasks that the agent was already doing was the way we wanted to package up this capability.
We reviewed the PRD as a team in the beginning-of-the-week team meeting. And at the end of the week, at our demos, without any of the engineers having talked at all to the PM, we had a demo of this feature, and it rolled out to customers the next week. This is very, very cool. It was able to be done safely because we have architected our code in a very, very modular way, where individual pieces of business logic are able to connect together in safe ways without spaghetti dependencies. Everything takes an interface. The concrete implementations are hidden. It means every bit of business logic that we have, whether it was the skill wiki or the ability to manage connectors, always exports a high-fidelity fake for tests, which means the tests that our PM was vibing into existence actually exercise its dependent services and the real business logic of the skill library, without having to worry about monkey patching or having tests that don't reflect reality. The eval here is whether or not the agent was able to learn from these skills, whether or not they were mounted into context, and whether or not the UI presented in the way that it should. This is the same way I would think about evaling this slide deck we're looking at now, which was also created by Codex, right? Codex is a coding agent. It's going to write code in order to make this slide deck, but I don't care about that code. I care that it produces a final artifact that I care about. Putting all this work into the harness, into the repo, means that we can actually move to a PRD as code input with the app as compiled output. But it's slow in the beginning. You can't do that in any brownfield codebase today and expect to get good results. I think we'll talk a little bit about how we can give you a cookbook for what to do next week in your codebases. But I think the idea here is for the role of the engineering team to be encoding leverage around what it means to do a good job.
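[Editor's sketch] One way to picture the "everything takes an interface, and every module exports a high-fidelity fake" structure described in this answer is shown below. The names (`SkillLibrary`, `FakeSkillLibrary`, `build_context`) and the keyword-matching rule are assumptions made up for illustration; the transcript does not say how the real skill library works or what language it is written in.

```python
from typing import Dict, List, Protocol


class SkillLibrary(Protocol):
    """The interface callers depend on; concrete implementations stay hidden."""

    def save(self, name: str, body: str) -> None: ...

    def mounted_for(self, task: str) -> List[str]: ...


class FakeSkillLibrary:
    """In-memory fake, exported next to the real implementation for tests.

    It is meant to follow the same rules as the real one (validation,
    matching), so tests written against it reflect real behavior.
    """

    def __init__(self) -> None:
        self._skills: Dict[str, str] = {}

    def save(self, name: str, body: str) -> None:
        if not name.strip():
            raise ValueError("skill name must not be empty")
        self._skills[name] = body

    def mounted_for(self, task: str) -> List[str]:
        # Assumed matching rule: a skill is mounted when its name
        # appears in the task text (case-insensitive).
        lowered = task.lower()
        return sorted(
            body for name, body in self._skills.items() if name.lower() in lowered
        )


def build_context(task: str, skills: SkillLibrary) -> str:
    # Business logic sees only the interface, so no monkey patching is
    # needed: a test simply passes the fake in.
    return "\n".join([f"Task: {task}", *skills.mounted_for(task)])


library = FakeSkillLibrary()
library.save("revenue", "Revenue means net of refunds.")
library.save("churn", "Churn is measured monthly.")
context = build_context("Summarize revenue for Q3", library)
```

Here the test-style usage checks the same thing the eval in the transcript checks: whether the relevant skill was mounted into context. Only the revenue skill is mounted, because only its name appears in the task.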
So the actual specifics around what you build and how it is realized in code are just artifacts of the repository itself. Encoding leverage is the name of the game.
We've heard a success story on the PM side. In our pre-call, you mentioned that it wasn't always successful. There were some things that, for instance, you guys built with your designer early on. It was an OpenClaw-style feature that you had to trash. So what failed? What did you learn? And how can people learn from that failure so it doesn't happen to them?
Yeah. So one thing we were trying to build was what you now see in the Codex app as automations: this scheduled cron mechanism to kick off agent tasks on your behalf on a schedule. Very OpenClaw-like, as you said. This was a capability that our designer put into the product, but in doing so he had to bite off a fairly chunky bit of app infrastructure work in order to put in a cron system which interfaced cleanly with our agentic core. We wanted the cron itself to create tasks, not necessarily be responsible for driving them to completion. We wanted a nice loose coupling between these core infrastructure primitives. The ultimate thing that our designer wanted to prove out was the UI interaction pattern: what it would mean to even surface scheduling to users in ways that they felt were intuitive. But in order to go from zero to putting that feature in front of users, he felt that the back end needed to exist as well. The reason we ended up unwinding that is because we ended up with spaghetti, right? We were still halfway through realizing the true value of the harness here, right? Engineering had not yet fully specified all these non-functional requirements of what it means to write high-quality code that is scalable for future maintenance.
This is really all the harness is: figuring out ways to get those non-functional requirements written down and injected into the agent just in time. This is what enables everyone to contribute to it safely. This is what empowers growth of the team to continue to accrue leverage back into the harness. But in the absence of that, we ended up with a spaghetti mess. I think the scheduler lived in the front-end JavaScript at some point. I look at this and I'm like, "Oh, I'm going to regret this later." So we made the decision to revert the PR. But we did want to enable the core objective here, which was to experiment with what this product surface could look like and to figure out which interaction patterns were most intuitive to users. And we do want our designer contributing to the codebase, because we need his expertise on how to use the design system effectively and how we should think about component architecture. We wanted to continue to explore good information architecture so we're not overwhelming users but are also giving them the power tools that they need. So the norm we aligned on was that we would put a door between the UI as exposed to users and the actual implemented functionality on the back end, and we ended up with what we're calling a painted-door style, where we can get the full product feature implemented with just a no-op on the back end. With the addition of some creative product instrumentation, we're actually able to get most of the way there in terms of doing that experimentation, and we actually get higher signal for the engineering team on whether or not we should even prioritize this work. With code generation increasingly being automated, the bulk of my time is spent on feature triage and scheduling, so figuring out ways to get more information into that part of my work is an amazing outcome of this norm we aligned on.
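[Editor's sketch] The painted-door norm described above, a complete UI path wired to a no-op back end plus instrumentation, could be sketched as follows. The `Scheduler` interface, `NoopScheduler`, the `Analytics` recorder, and the event name are all hypothetical; they only illustrate the shape of the pattern, not the team's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class Scheduler(Protocol):
    def schedule(self, task: str, cron: str) -> bool: ...


class NoopScheduler:
    """Painted door: accepts the request but schedules nothing."""

    def schedule(self, task: str, cron: str) -> bool:
        return False  # nothing was actually scheduled


@dataclass
class Analytics:
    """Minimal in-memory event recorder standing in for product instrumentation."""

    events: List[Dict[str, str]] = field(default_factory=list)

    def track(self, name: str, **props: str) -> None:
        self.events.append({"event": name, **props})


def on_schedule_clicked(
    task: str, cron: str, scheduler: Scheduler, analytics: Analytics
) -> str:
    # Record intent first: this is the signal used to decide whether the
    # real back end is worth prioritizing.
    analytics.track("schedule_requested", task=task, cron=cron)
    if scheduler.schedule(task, cron):
        return "Scheduled."
    return "Scheduling is not available yet. Thanks for your interest."


analytics = Analytics()
message = on_schedule_clicked("triage pages", "0 9 * * 1", NoopScheduler(), analytics)
```

Because the UI handler depends only on the `Scheduler` interface, a real implementation can later replace `NoopScheduler` without touching the front end, which is the loose coupling the reverted PR lacked.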
Okay. So, the PM writes the PRD in markdown. The designer paints a door. Then the engineering team takes over the actual back end. That's the high-level, boiled-down summary that I'm hearing. The thing that makes this all possible is the harness. Can you show us what's actually in it?
The harness lives in the repository, but really the harness is a mechanism for getting the coding agent, getting Codex, the right context it needs to do the job at all phases of the implementation life cycle. I think there are three phases you would think about here. One is: I'm getting ready to go to work. I have a ticket, I have a PRD, and I need to figure out what must be done in order to implement the thing and get it deployed. The next phase is the messy middle of implementation. I'm writing code. I'm exercising the UI. I'm seeing the bugs. I'm seeing logs. I'm making sure that, you know, everything is pixel perfect.
And then there's the back half of things, where I have produced a diff that I claim implements the feature, and I need to convince the rest of my team to merge it: that it is acceptable, that it meets the actual user need we're trying to solve with this change. In each of those phases, we have different mechanisms to surface context to Codex so it can do a good job. In the beginning here, we really have AGENTS.md, which is this file that lives in the root of the repository and is forcibly injected into Codex. What Codex CLI does is take the contents of that AGENTS.md and put it into context for the agent. This is both good and bad, right? Things in AGENTS.md we know are 100% reliably going to inform the agent's actions, but we're paying for that with context window usage, and with these models the context window is a hard constraint. You cannot fight against it. There are only so many hundreds of thousands of tokens you get. I will say that auto-compaction in GPT-5.2, 5.3, 5.4, and 5.5 is so, so good. I never run out of context these days. It's sort of a crown jewel of research to have good auto-compaction. It is truly amazing to be able to go on any size of ticket completely end to end in a single session. It's magical.
But still, you do have to contend with things getting paged out of context even with auto-compaction. So AGENTS.md is a tool, but one that we want to be constantly trading off against this scarce resource of context.
Yeah.
So what we do there is put an operating model for Codex in there: basically a three-, four-, or five-step plan for how you should think about doing a task. What that means for us is that you ground the decisions you make in the documentation we have in the repository. There's a big docs tree. It includes everything from what it means to write performant React, to writing reliable networking code, to all the design docs and critical user journey flows that we have in the app.
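As an editorial sketch of how a small AGENTS.md can act as a map into a docs tree like this, the check below fails when the file outgrows a size budget or points at a document that no longer exists. The budget value, the function name, and the link layout are assumptions for illustration; the talk does not describe the team's actual check.

```python
import re
from pathlib import Path

# Assumed budget; the talk gives no number for how large AGENTS.md may grow.
MAX_AGENTS_MD_CHARS = 8_000

# Captures the target of a markdown link such as [Reliability](docs/reliability.md).
LINK_TARGET = re.compile(r"\]\(([^)\s]+)\)")


def check_agents_map(repo_root: Path) -> list[str]:
    """Return problems found in AGENTS.md; an empty list means the map is healthy."""
    text = (repo_root / "AGENTS.md").read_text(encoding="utf-8")
    problems = []

    if len(text) > MAX_AGENTS_MD_CHARS:
        problems.append(
            f"AGENTS.md is {len(text)} characters; move detail into docs/ "
            f"and keep only pointers (budget: {MAX_AGENTS_MD_CHARS})."
        )

    for target in LINK_TARGET.findall(text):
        if "://" in target:
            continue  # external URLs are not part of the in-repo map
        path = target.split("#", 1)[0]  # drop any heading anchor
        if path and not (repo_root / path).exists():
            problems.append(
                f"AGENTS.md points at {path}, which does not exist. "
                "Fix the link or restore the document."
            )
    return problems
```

Run in CI, a non-empty result would fail the build, and each message is worded as an instruction so the agent reading the failure knows what to do next.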
And that's important to have as part of the operating loop, because we don't inject all of that into AGENTS.md up front. Instead, we just tell Codex in AGENTS.md: you have all these resources available to you, they are of X, Y, and Z shape, and they live here. You decide whether or not to look at them. That's nice because we can deterministically steer the agent toward the context we want it to discover, but we let the magic of these models' reasoning decide what is important, what it actually thinks it should pull into context. This is the outset of the task: ground yourself in the documentation we have and in the app as it is written, and then move into implementation. During implementation, the thing the agent is naturally going to do, the thing it is trained to do, is write code and make tool calls. So we have to figure out ways to use those two mechanisms to inject our context into the agent. This is kind of a silly thing, but the code in the repository is actually a prompt. Codex is going to pull code into its context window in order to figure out: what business logic do we have? How do we do things here? What database adapter am I using? Am I using Jotai on the front end for state management? So having the repository enforce that the code is all the same, uses the same patterns, or is structured similarly allows the agent to efficiently, and sort of in parallel across the codebase, apply context it has read in one file to other files. It means that the agent is able to more effectively cargo cult. I love to cargo cult stuff.
This is how I have worked in big brownfield codebases when I'm new, right? I'm like, the team has probably done this before. We poke around and copy and paste some stuff. And, you know, we can let Codex be lazy too by making it do the hard work of making the codebase homogeneous.
But another way we can inject context into the agent over the course of these runs is to take all that human judgment around what good looks like and basically make tests fail if the code doesn't match. A stupid one that I do is to fail the build if there is any text in markdown files or HTML that does not use curly quotes, because our design team loves proper typography in our user-facing strings: the smart curly quotes and apostrophes. This is probably a thing that is unevenly adhered to in most codebases, if at all. But because this irked our designer a couple of times, we had him vibe code some tests into existence. We don't really care about the quality of this code, right, so long as it actually does the job and prevents the code from becoming misaligned with the user intent. We have a lot of tests like that that do probably weird things, maybe things you would consider low leverage, but that collectively improve the quality of the software we're delivering, which I think is pretty cool. And then on the back end, right, we've got the agent executing, grounded in the documentation we have in the repository, which is this accumulation of all of the team's historical work and knowledge. We've got taste refining the output as we go via these tests and lints, and the code itself showing the agent what good looks like. And on the back end, we take all of that same context and give it to a review agent. In that docs repository, right, we have a front-end architect .md, a reliability engineer .md, an apps engineer .md, and for each one of those we have this big matrixed CI job that says: you review for front-end architecture, here are your guardrails, here's the diff, identify all the P2 issues with this thing, and leave some feedback on the PR. This feedback is more context, just like those tool-call error messages that we're giving the implementation agent, to say: you must address this, right? We found some issues where you are misaligned with the way we expect things to be produced. And with all of this effort in prompt injecting the agent with the collective human judgment on the team, my synchronous involvement can basically move all the way to the back of this process, where the code is presumably already reviewed, approved, and aligned by all the machinery we've put in place over the course of producing the change. And then my job is to say: okay, we've got this change. It looks good, but I think it's misaligned in ways X, Y, and Z. And then once those things are addressed in the moment, my job is to close the loop by taking all this feedback that I had to give and making it so I don't have to give it next time. And then back into the
harness it goes.
So we've got the high level now of what it involves. Can you go really granular? Show your actual Codex setup. Show what's on every engineer's, PM's, and designer's machine, and how to put this all together practically.
I would say that the way I have steered my team has been intentionally frictionful, because we don't want them staring at the computer looking at their agent. That means they have largely been booting up Codex in the CLI, going to multiple tabs, where it's actually painful to pay attention to these things, because we don't want them to. If I were doing it again today, this is probably not what I would do, right? The app is quite fantastic and is a thing we are seeing meteoric growth in. I think it was just past four million weekly active users. And all the good parts of the workflow that I'm describing are in the app, right? I would package all that collective designer taste into a plugin that my team has in a little shared marketplace, so I can just suck all of that expertise into the app. That's going to pre-wire Figma, which means the place my designer naturally works is going to be directly addressable by Codex, in the same way that I have access to Figma. And maybe I'm unreliable at refreshing my knowledge of the current mock, so Codex is just going to do that, because these agents will follow instructions well. That is a core part of how they are built. So these plugins, which bundle the skills, include those instructions: these are the tools we use, these are how they are built. So I think the app today, as the central coordination mechanism of teams, gives a much more user-friendly affordance for distributing these sorts of things. Plugins are definitely the future and what I would lean into today.
All right. Can you show us what this actually looks like inside of Codex? How would someone set all this up?
Sure. Let's get into the nitty-gritty of it. This is my favorite part. So, right, we started with an empty repository, which means we had a high degree of freedom to start structuring the code in a way that Codex most natively understood. And this is how we came to the pattern that all the things we want Codex to know need to live in the repository itself. This was its workspace. This was its brain. In order to mise en place, to get the agent set up, we wanted it to have the easiest time, and the thing that Codex loves to do is search code with ripgrep. So we put everything in one place, and what that looks like in practice is this big docs directory that has a ton of different content in it. We make heavy use of a thing called an exec plan, which you can find in the OpenAI Cookbook. It's essentially a skill you can give Codex to write a good implementation plan, phased with milestones and deliverables, so it can track its work and make sure it is continuing to make progress against high-level and tricky implementation plans. And we keep every single one of those that has ever been produced in the repository. They form a durable log of the implementation choices we have made historically. Similarly, all our design docs and Google Docs live in the codebase, which means these things are actually continuously kept up to date to reflect the actual state of our app. They're sunset and archived when they become irrelevant or features get removed. And we have also built code and tests around this documentation tree. So all these markdown files cross-link to each other. They embed the git repository revision they are current as of. If Codex ever touches one of them to update it, the build will fail unless all the cross-linked documents are also updated to reflect the current state of the world.
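A minimal sketch of that kind of build check, written for illustration only: given the set of docs changed in a diff, it reports every cross-linked doc that was left untouched, with a message phrased as an instruction. The function names, the flat link syntax, and the idea of passing the changed set in as a parameter are assumptions; the talk does not show the team's real test.

```python
import re
from pathlib import Path

# Matches relative markdown links to other .md files, with an optional anchor.
DOC_LINK = re.compile(r"\]\(([^)\s#]+\.md)(?:#[^)\s]*)?\)")


def linked_docs(docs_root: Path, doc: str) -> set[str]:
    """Return the docs (as paths relative to docs_root) that `doc` links to."""
    source = docs_root / doc
    if not source.exists():
        return set()  # a deleted doc has no outgoing links to check
    found = set()
    for target in DOC_LINK.findall(source.read_text(encoding="utf-8")):
        if "://" in target:
            continue  # ignore external links
        resolved = (source.parent / target).resolve()
        try:
            found.add(resolved.relative_to(docs_root.resolve()).as_posix())
        except ValueError:
            continue  # link points outside the docs tree
    return found


def check_cross_links(docs_root: Path, changed: set[str]) -> list[str]:
    """Return one instruction per linked doc that was not updated with its neighbor."""
    messages = []
    for doc in sorted(changed):
        for neighbor in sorted(linked_docs(docs_root, doc) - changed):
            messages.append(
                f"{doc} changed but linked doc {neighbor} did not. "
                f"Re-read {neighbor}, update it or confirm it is still accurate, "
                "and bump its revision stamp."
            )
    return messages
```

The design choice worth noting is that the failure text carries the remediation steps, so the agent does not need to have the maintenance procedure in context beforehand.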
And this is what it means to have the harness self-maintain, to have the tests prompt injecting Codex with what it must do. We don't rely on Codex knowing that this is how it maintains the docs repository. A test will tell it what it needs to do, how it must proceed in order to get the build to green. The agents are super trained to get the tests to pass. So if we can figure out ways to build context-injection primitives into failing tests, we're not fighting the harness here. This is not a thing that will be obsoleted by model capability improving. This is just a very neat way of aligning what must be done with this scarce resource of context with the actual thing the model will always do, which is call tools and run tests. Similarly, you can see in this references section we have a bunch of these llms.txt files. I don't know if everybody knows this, but a lot of programming projects will publish a plain-text version of their documentation at llms.txt. This strips away all the HTML tokens, and it's just markdown.
Stripe has these. The OpenAI docs have these. uv, which is a Python runtime, has these. Our internal design system has these. And we check these directly into the codebase, which basically gives Codex a full reference manual for all of the big chunky libraries and components that we depend on. And we point to those. You know, our documentation on the design system highlights the key components that are probably going to be used most of the time and how to look for them in this llms.txt.
Here you can see the persona-oriented documentation that we talked about earlier, right? What it means to write secure code. This product sense .md is how we think about encoding what it means to document a feature, how to turn a PRD into code and a QA plan. And this is the sort of thing that I doubt any teams are writing down today. It's just a thing they talk about together, and maybe they whack each other on the nose when someone doesn't do it when producing a PR. But figuring out a way to actually durably document what it means for the team to operate is necessary for these agents, which every time they spin up have a blank slate in terms of context. Another neat thing here is this quality score.md, which is sort of a meta way for Codex to diagnose the state of the codebase. We have it keep notes here every time it writes code, because over the process of making a change, it's going to pull a ton of code into context. Maybe code that hasn't been touched in a month, but our guardrails have changed. So this is a way for Codex to be continually assessing the quality of the codebase, doing some gardening, and spinning up some tasks for later on how we can keep aligning the codebase to baseline.
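To make the note-keeping idea concrete, here is a small hypothetical helper an agent could call to append a dated entry to such a file. The entry layout, field names, and function name are all assumptions; the talk does not describe the real file's format.

```python
from datetime import date
from pathlib import Path


def append_quality_note(notes_file: Path, area: str, observation: str,
                        follow_up: str, today: date) -> str:
    """Append one dated entry to the notes file and return the text written."""
    entry = (
        f"\n## {today.isoformat()} {area}\n"
        f"- Observation: {observation}\n"
        f"- Follow-up task: {follow_up}\n"
    )
    # Append-only, so earlier notes stay as a running log of codebase health.
    with notes_file.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

Passing the date in explicitly keeps the helper deterministic, which makes it easy to test and to replay.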
Another neat thing we can talk about here is what it means to close the loop. When we were working on this app, we split it into a backend and front-end pattern, which means they could be tested independently. And we tried to house as much of the business logic as possible in our backend. That means Codex should be able to validate the behavior of our business logic headlessly. Right? You don't need to drive a UI. You can hit APIs and look at the database to observe the side effects that you were hoping to have. And for almost all of my career, when I'm working with services in local dev, I run them in my terminal and I get this never-ending stream of logs in the console that I have no idea what to do with, and it's kind of impossible to use because they scroll away so quickly.
What we've done instead is permit Codex to spin up something approximating a production observability stack. Think of giving Datadog to Codex for every change that it makes. This allows it to use normal observability tooling for metrics and logs, the same things your engineers would be using to diagnose issues in production. And we let Codex use it to diagnose issues in local dev. Right?
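As a rough stand-in for what queryable logs buy you over a scrolling console, the sketch below filters a JSON-lines log file by field values. It is not the team's observability stack and not a Datadog API; the file format, field names, and function name are assumptions chosen to keep the example self-contained.

```python
import json
from pathlib import Path


def query_logs(log_file: Path, **filters: object) -> list[dict]:
    """Return JSON-lines log records whose fields equal every given filter."""
    matches = []
    for line in log_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise rather than failing the query
        if not isinstance(record, dict):
            continue
        if all(record.get(key) == value for key, value in filters.items()):
            matches.append(record)
    return matches
```

An agent validating a change could then ask a narrow question, for example `query_logs(path, level="error", service="billing")`, instead of reading an entire console stream into context.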
Shift left, validate the change, and exercise it end to end. And we do the same thing with the UI itself. We mount the UI into a local browser shell that pre-wires up Chrome DevTools. Today, we would do this with computer use, to allow Codex to actually inspect the app, drive it, click the buttons, observe that the loop is closed, that the feature is implemented, and that it can attest that it works end to end. So I mentioned a little bit how being super overly architected in the application was a point of leverage for the engineering team, and the reason is that we've actually statically limited the plausible places that code can go. You see this represented here as a package layering stack. One common failure mode in hypergrowth codebases is that business logic just gets spewed everywhere and it becomes a codebase that approximates a ball of mud. These are very difficult to deal with and evolve because there are a ton of non-local side effects from touching any bit of code. But I mentioned that we have these nice isolated business logic domains that are able to take all their dependencies as fakes and make it safe for us to prototype new features. And the only reason we're able to do that is because we actually enforce, with code, that these things are hard-separated with package boundaries between them. You cannot possibly create a ball of mud in the first place. That basically means that we permit the engineers to write packages at any layer of the stack. They can write database code, config parsing, React code in the front end, business logic. Our PM, by norm, we kind of keep to business logic and UI code, but try not to touch these lower-level primitives in the app.
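The hard package boundaries described above can be illustrated with a small import check. This is a sketch under assumptions: the layer names and their order are invented for the example, and it inspects Python imports, whereas the app discussed in the talk has a React front end and its real enforcement mechanism is not shown.

```python
import ast

# Hypothetical layer order, lowest first. A package may import from its own
# layer or any layer below it, never from a layer above.
LAYERS = ["config", "database", "business", "ui"]


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def layering_violations(package: str, source: str) -> list[str]:
    """Return one message per import that reaches upward in the layer stack."""
    rank = {name: index for index, name in enumerate(LAYERS)}
    problems = []
    for dependency in sorted(imported_packages(source)):
        if dependency in rank and rank[dependency] > rank[package]:
            problems.append(
                f"{package} may not import {dependency}: it sits higher in the "
                "layer stack. Move the shared logic down or invert the dependency."
            )
    return problems
```

Imports of packages outside the layer list, such as the standard library, are ignored, so the check only constrains the relationships it knows about.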
And to get that painted-door invariant for our designer, we had him focus largely on the UI layer of the app, with some affordances for stubbing out APIs on the back end so that we could get those usage signals we cared about. It's kind of neat to be able to think about reflecting ACLs on who is permitted to touch what code in the actual structure of the code itself. It's actually by folder paths that we're able to determine whether changes are allowed or not.
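A path-based permission check of this kind might look like the following sketch. The role names echo the conversation, but the folder prefixes and the function name are hypothetical; the talk does not list the real directory layout.

```python
# Hypothetical mapping from role to the folder prefixes that role may change.
ALLOWED_PREFIXES = {
    "engineer": ["packages/"],
    "pm": ["packages/business/", "packages/ui/", "docs/"],
    "designer": ["packages/ui/", "packages/api_stubs/"],
}


def disallowed_paths(role: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall outside the folders allowed for a role."""
    prefixes = ALLOWED_PREFIXES[role]
    return [
        path for path in changed_paths
        if not any(path.startswith(prefix) for prefix in prefixes)
    ]
```

A CI job could feed it the file list from a diff and fail the build when the result is non-empty, which keeps the rule in the repository rather than in anyone's head.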
This adds up, and I think you have a really hot take on how it should add up. You've said that if you're not using a billion tokens a day, you're basically negligent. Now that's $2,000 to $3,000 per engineer. How does an engineering leader at a normal company build the case for that internally?
The idea here is that the more tokens you're running through these models, the smarter they get, the more intelligence you are able to extract from them. This is the realization behind why we have reasoning models in the first place, right? If we give the model more effort to, you know, do inference, we're actually able to get greater results out of it. And we see this time and time again with the longer time horizons that agents are able to execute over. And I think it is the case that in order to have the lofty goal of being a token billionaire, right, you have to find ways to get the agent to do more work on your behalf, to execute in parallel, more autonomously. And because I believe that GPT-5 is able to do the full job, having that as a north-star constraint means that I am constantly looking for ways to get the agent to do the hard parts of my job that take a lot of my time, and constantly being surprised by the ways this frees me up. One sort of insane thing is that it's not just a billion tokens a day: sometimes it can be 350 million tokens on a single PR. There was a time a couple of months ago when I was buckling my laptop into the back of my car, going back and forth to the office tethered to my corporate phone, so the agent could continue to cook while I was commuting to work. This was a 60-hour Codex CLI session where I was like, please don't go to sleep, laptop. And this is a PR that probably would have taken me three weeks to do. It was one of the hardest refactors I have ever done in my career, and it took place over the course of three or four days. And it did as good a job as I would have. Because of the investment we've put into letting the harness prompt inject the agent and manage its context efficiently over that 60 hours, I only had to provide two prompts in addition to the first prompt we gave the agent, which is truly magical. It means that it was actually approximating a human engineer on my team, right? I did not have to synchronously bash it on the head in order to make progress. It just went heads down and did the work the way I would.
Okay. So, a 60-hour long-running task. I think that points to something: if people have their conception of these models from five or six months ago, it's really outdated. It's what has recently become possible that's enabling this new way of working. How would you describe it? When did that happen?
Yeah, so there was a huge uplift in capability when GPT-5.2 came out, and the rate of progress with each additional point release of the model has been much bigger than what came before. GPT-5.2 came out while I was on holiday for winter break. And when I came back, without any additional investment, we were getting one or two more PRs per day per engineer on the team, which had this very strange effect where folks were looking for ways to do more work with the agent. It was kind of like adding another lane to the highway. It was like, how can I put more cars on this thing right now? And because we had done so much work to get Codex to operate over those long horizons, the next frontier was to go in parallel. We just blogged yesterday about an agent orchestrator that came out of my team called Symphony. Once we saw the uplift in capability with 5.2 and invested in this additional orchestrator, it allowed us to further 10x the PRs per engineer per week. And that investment in increased parallelization is only a thing we were able to do because we were already confident in the code that was being produced by the agent. Right? If I'm 10xing the number of PRs I am producing a week, I cannot possibly look at all of them. I couldn't possibly look at them all before. So the only way we could do this safely is with these techniques that eliminate classes of misbehavior and give us confidence by default in the code that's being produced.
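The general shape of fanning work out to several agents at once can be sketched with a bounded-concurrency runner. This is not Symphony and makes no claim about how it works; `run_agent` is a hypothetical coroutine standing in for "start one agent session on one ticket", and the concurrency cap is an assumed parameter.

```python
import asyncio
from typing import Awaitable, Callable


async def run_in_parallel(
    tickets: list[str],
    run_agent: Callable[[str], Awaitable[str]],
    max_parallel: int = 5,
) -> dict[str, str]:
    """Run one agent per ticket, at most `max_parallel` at a time."""
    limiter = asyncio.Semaphore(max_parallel)

    async def guarded(ticket: str) -> tuple[str, str]:
        async with limiter:
            try:
                return ticket, await run_agent(ticket)
            except Exception as error:
                # One failed session should not cancel the rest of the batch.
                return ticket, f"failed: {error}"

    pairs = await asyncio.gather(*(guarded(ticket) for ticket in tickets))
    return dict(pairs)
```

The cap matters because the scarce resource shifts from typing to review and CI capacity; the number of concurrent sessions becomes a tuning knob rather than a fixed property of the team.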
But yeah, we continue to see these huge improvements in autonomy and confidence with each model revision. One of the most interesting things about 5.5 is that with computer use there's much less wrangling we have to do to get the agent to actually see and drive the app. This was a significant point of friction for us early on. We were having to figure out ways to wire up ffmpeg to Docker containers to give Codex, you know, a virtual X server so it could drive the app. This is all a ton of low-level goo plumbing nonsense that was slow to develop and ultimately undifferentiated.
But with Codex now shipping computer use, a tool that's natively post-trained into the model, we get to throw away all that bespoke code, keep the same process of having Codex drive the app, and do it much more efficiently, with better image understanding and the ability to faithfully replicate wireframes and mocks, these sorts of things. Truly amazing.
So you mentioned one of the hardest refactors in your career. You've worked at the regular companies, the Brex, the Stripe, these types of companies. If you are an engineer at one of these and you're looking at this new way of working, what is the new engineering job?
The new engineering job, I think, is to have everybody be staff engineers. And the role of a staff engineer, the primary role, at least for me, is not to be producing all of the code, but to be empowering my team to produce the code. And today it is the case that every single engineer on this planet, modulo token budget, has access to 5, 50, or 5,000 concurrent hands on keyboards. Not being able to leverage that amount of parallelism today is largely a reflection of not having put in the work to harness those resources.
One experiment we're running internally to drive toward this amount of parallelism is to tell teams that they have been forced to hire five interns named Codex. Part of their success criteria is judged on how effectively they incorporate these resources into their team. What does it mean to make use of this capacity that the organization has given them? Part of success is being able to use those resources. And if we think about it, you know, with access to these lovely models and compute being available to use them, letting those GPUs idle is in a way not having hands on keyboards. So my job is to figure out ways to saturate these things: to have them doing productive work, to have them not waste cycles, to have fewer PRs that we throw away because they are high quality by default, to have fewer cycles back and forth in CI, fewer reviewer agents having to repeatedly burn cycles on the same feedback, that sort of thing. It is very much a systems-thinking mindset where, in a way, you're playing Factorio with your codebase and figuring out ways to build the token factory.
You're describing essentially being totally divorced from the code and really focused on using these interns. But I guess a lot of engineers prided themselves on being the best at writing code. That has been their output for so long. I imagine people on your team have struggled with this mental shift. What is the right way to think about it?
This is a thing I have struggled with too. For a long time, because the primary function of the job was to produce code in order to achieve our product and business objectives, it was very easy to fixate on the production of code itself as the job. But ultimately the reason we work in tech is to deliver product into the world. Initially, having this hard break in the mental model of what it means to do work was a little bit weird on the ego. But where I have gotten to is latching on to the same creativity, the production aspect of things. Ultimately the thing that motivates me is to get product into the hands of users and see their delight and how they're able to improve their lives with the tools we build. And I'm still able to do that today. I'm able to do a lot more of that today. And, you know, I kind of like puzzles. You can have puzzles in how you structure the code, or you can have puzzles in how you put teams together to produce it. To me, these skills and challenges are pretty transferable. So I think there is a new art to be found, a new craft in how you effectively assemble a virtual team to go solve these things.
Okay. So let's move into the road map.
Your counterpart at B or Citadel or Stripe, they want their PM shipping 100,000 lines of code. What's the Monday morning road map that they need to take so that they can get started right now
to move to this way of working over the next six months?
I'm so looking forward to the Monday morning road map. You have no idea. I was super excited to vibe this slide into existence with Codex because, you know, this is the hook right here. This is why we're talking. So I think those phases of the code production process that we talked about map pretty clearly to what the Monday morning road map should be.
First, you need the repository to be legible to the agent. It needs to know all the things that your team does, right? The process of bringing product into the world, of producing code, of having good design, has a ton of implicit context around what good looks like: hundreds or thousands of little decisions that we all make by default because we cut our teeth the hard way gaining experience. But the agents don't have this, right? Well, they do, but they have every possible permutation of all those thousands of decisions, because they have seen so much code and so much product over the course of their training. So it's our job to take the specific choices that we as a team would make, write them down, and give them to the agent, starting with this simple technique of putting a map in AGENTS.md of where the agent should look. I really like having a documentation tree as the entry point for making this happen for your codebase, because it's super easy for humans to think about just writing stuff down, and it's also really easy to use some of the tools that we have to do that.
It's very often the case that an agent will produce code we didn't like, or there's a bug in production. We'll have a Slack thread as a team where we discuss what went wrong and how we think it should be done differently, and it's really easy at the end of that to just @-mention Codex: update docs X, Y, and Z to reflect these new guardrails, this new taste we have discussed. That removes us entirely from the loop of maintaining the legibility. We just get to talk, and we let Codex maintain it for us. And I think this is a way, in larger teams, to build a flywheel around improving agent behavior and to build cultural norms and processes around having this happen as the default way of doing the work.
The next step is to make validation cheap. Validation is happening at all parts of the code production pipeline that I talked about: on the review side, looking at those guardrails; over the course of writing the code, with all these lints and tests; and in being able to actually close the loop and exercise the app to see that, yes, the search box that was in the PRD actually appears in the UI, and when I click on it, it shows a pending set of results that are our best guess, and it has very nice type-ahead without any jank in the box when we type. Right? These are the squishy, ill-specified parts of the task, but providing proof of work to the reviewer agents and the humans looking at these PRs is necessary in order to build confidence. This is what it takes for me to not have to shoulder-surf Codex, the same way I'm not going to shoulder-surf my teammates in VS Code or Cursor.
And finally, it's important to have the users of this harness be more than just the engineers themselves, right? The exercise here is to empower the whole team to write code safely. So having a tight collaboration loop with the other members of the team (product, design, user ops, your DevOps folks) allows for really high-bandwidth feedback between these different roles, and it allows you to do more work and different types of work, extract different domain expertise from the folks on your team, and reflect it all back into the repository for the benefit of everyone.
One emergent property we had here that I thought was really, really cool: we hired a new engineer onto the team who had formerly been a PM, and he noticed that we had no documentation in the codebase for what product features we had. This is kind of an important thing, because it grounds all the engineers, all the designers, all the product people, the reviewer agents, everything, in what we are actually trying to achieve. What are we shipping here? And Codex is very good at crawling through a codebase, figuring out what the product features are, writing them down, and surfacing them for human review. But what fell out of that was a set of QA agents that are constantly booting the app, running it through these critical user journeys, and asserting that things work. Which means that when it comes time to deploy, we have to do less manual smoke testing, because we actually have these just-in-time generated runbooks around what it means to do acceptance testing. And all of that came out of a single person with domain expertise and an idea of what good looks like. The rest of the work we had done made these higher-level correctness verifiers fall out of that one thing.
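As an illustration of the first roadmap step, a map in AGENTS.md that points into a documentation tree can be checked mechanically so it never sends the agent to docs that do not exist. This is only a sketch under stated assumptions, not the team's actual tooling: the function names, the demo file names, and the convention that the map uses markdown links are all hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Matches markdown links like [Testing](docs/testing.md) or (docs/a.md#section).
# Assumption: the map file lists its docs as markdown links.
LINK_PATTERN = re.compile(r"\[[^\]]+\]\(([^)#\s]+)(?:#[^)\s]*)?\)")


def missing_doc_links(repo_root: Path, map_file: str = "AGENTS.md") -> list[str]:
    """Return relative paths linked from the map file that do not exist on disk."""
    text = (repo_root / map_file).read_text(encoding="utf-8")
    missing = []
    for target in LINK_PATTERN.findall(text):
        if "://" in target:  # external URL, not part of the local docs tree
            continue
        if not (repo_root / target).exists():
            missing.append(target)
    return missing


def build_demo_repo(root: Path) -> None:
    """Create a tiny hypothetical repo: a map with one valid and one stale link."""
    (root / "docs").mkdir()
    (root / "docs" / "testing.md").write_text("# How we test\n", encoding="utf-8")
    (root / "AGENTS.md").write_text(
        "# Where to look\n"
        "- [Testing](docs/testing.md)\n"
        "- [Product features](docs/product-features.md)\n",
        encoding="utf-8",
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_root = Path(tmp)
        build_demo_repo(demo_root)
        # The product features doc is linked but was never written.
        print(missing_doc_links(demo_root))
```

A check like this could run as a lint, so a stale map shows up as a failure rather than as an agent looking in the wrong place.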
So we've actually come full circle, all the way back to the beginning of the discussion. Now that people understand the harness and they understand the Monday morning roadmap, what does the product team of 2027 look like? How are these individual functions working together if everybody has access to the code?
I think we're going to see a lot more prototyping in the world, and a lot more diversity of products as a result. I think, too, that in this new wave of computing with AI and agents, we are currently at the early part of the technology curve, where a lot of the products you see today are low-fidelity text interfaces. And when I think about where I sit in OpenAI, working on Frontier and deploying to businesses in order to get the machine to do the work, there is a fractal nature to work: different surfaces, different ways of working, to wedge these models into the nooks and crannies of the hard knowledge work that happens. So I'm really excited to be able to rapidly experiment with different modalities and different ways of delivering this technology. This is a core part of the OpenAI mission, right? To have wide access to do economically valuable work with these models, with these tools. So I'm excited to increase the rate of experimentation, to increase the diversity of product we see in the world, and to see new and exciting ways we're able to use technology to do great things for users.
All right. So, we're into the little home stretch here. I want to ask you a couple quick questions. Is that okay?
Yeah, sure. Let's do it.
What's one skill a tech lead who's trying to move to this way should build this weekend?
I really think it's important that you figure out a way to get Codex to be able to see your app and drive the business logic. In the absence of any other modularity or investment, the ability to actually close the loop and observe the behavior that you're changing is what it takes to know that you have completed the job. I see this all the time, where folks will vibe something up and not be able to attest that it solves the problem they set out to solve. And in order to convince the rest of the team to accept the code in the first place, you've got to be able to prove somehow that the code meets the brief, that it does what it says on the tin. The way to do that is to invest in closed verification loops, the highest leverage of which is the full end-to-end integration test. You click the buttons, observe the side effect, that sort of thing.
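A closed verification loop of the kind described here can be sketched in a few lines: drive the feature the way a user would, record what was observed, and attach that record as proof of work. Everything in this sketch is an assumption for illustration. `SearchApp` is a made-up in-memory stand-in for a real app, and a real loop would drive the actual UI rather than a Python object.

```python
from dataclasses import dataclass, field

CATALOG = ["Codex", "Compaction", "Linear"]  # made-up data for the demo


@dataclass
class SearchApp:
    """Hypothetical stand-in for the app; a real loop would drive the real UI."""

    catalog: list[str]
    query: str = ""
    visible_results: list[str] = field(default_factory=list)

    def type_into_search_box(self, text: str) -> None:
        # Update the type-ahead results after every keystroke, as a user would see.
        for char in text:
            self.query += char
            needle = self.query.lower()
            self.visible_results = [
                item for item in self.catalog if item.lower().startswith(needle)
            ]


@dataclass
class Evidence:
    """One line of proof of work: the action taken, what was seen, pass or fail."""

    step: str
    observed: list[str]
    passed: bool


def verify_search_journey(app: SearchApp) -> list[Evidence]:
    """Drive the search box end to end and record what was observed."""
    evidence = []
    for keys, expected in [("co", ["Codex", "Compaction"]), ("d", ["Codex"])]:
        app.type_into_search_box(keys)
        observed = list(app.visible_results)
        evidence.append(Evidence(f"type {keys!r}", observed, observed == expected))
    return evidence


if __name__ == "__main__":
    for item in verify_search_journey(SearchApp(catalog=CATALOG)):
        print("PASS" if item.passed else "FAIL", item.step, item.observed)
```

The design choice worth noting is that the check returns evidence rather than just a boolean, so a reviewer, human or agent, can see what was actually observed.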
It's funny how it comes back to some of the fundamentals, but now applied to AI. What's the biggest mistake engineering leaders are making with AI coding tools right now?
I think putting these tools in a box in your head, as things that must be driven synchronously by a human, is limiting the amount of creativity in deploying them widely. Ever since GPT-5.2 came out, my belief is that it is as capable as me at doing the full job, and it's my job to figure out ways to have it do as much of my job as possible. A lot of that means I can't be sitting in front of it, because I don't sit in front of my teammates. So there's this whole metagame of building confidence in the code that these agents are producing, so that we can have them do more and more of the job, execute on longer and longer time horizons, and be more and more independent. And this is necessary in order to unlock that true parallelism, right? That 5,000 engineers' worth of compute that I talked about. I can't possibly do that if I'm locked behind two tabs in the app or three panes in tmux in my terminal.
Final meaty question here for you. We've been talking about GPT-5.2. At roughly the same time, Opus 4.6 came out, and the whole discussion really changed in that December time period. If a team can only pick one, the Anthropic stack or the OpenAI stack, what's the case for picking the OpenAI one?
Our models are optimized around that full agentic behavior, around being able to do the full job. I am super easily able to delegate entire swaths of work to the GPT-5 series without having to constantly poke and prod it. And this is the fundamental unlock that has gotten me to that parallelism. Symphony, this agent orchestrator that we have released, is fundamentally very uncomplicated. It is a thing that advances Linear tickets through states, gives the ticket text to Codex, and expects it to produce a PR. There's very little magic to it. Which is why I think these GPT-series models are the ones that are going to unlock the true power of agentic software engineering in your organizations.
Additionally, the pace of product delivery on the model capability side, in order to get more and more of that loop closing in place, has just been fantastic. Things like parallel tool calling and background shells in GPT-5.3, to ever-increasing autocompaction capability, to computer use and high-fidelity multimodal understanding with built-in image generation capabilities in the latest 5.5, mean that what would normally have been two or three people on the team handing off between them can instead be those two or three people encoding all their leverage into one tool, which is then able to do the full job.
There is this kind of neat story that I have been telling: in multi-agent systems, the correct number of agents to optimize toward is not multi, it is one. One agent has full addressability over the entire task and its entire context, which means that doing design, backend, and front end in a single agent is going to get you higher-quality results than having these lossy, frictionful handoffs between agents. That experience comes from me having built agents for knowledge work with handoffs in them, and from seeing the massively increasing capability of the GPT-5 series over time. I want a single agent to cook and do the full job, and I think we have done a fantastic job at increasing Codex's ability to address all parts of the SDLC in a single agent.
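The orchestrator shape described above (advance tickets through states, hand the ticket text to an agent, expect a PR back) is small enough to sketch. This is not Symphony's actual code or API. The state names, the `Ticket` type, and `stub_agent` are hypothetical, chosen only to show how little machinery the loop needs.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical state names; the source only says tickets advance "through states".
TODO, IN_PROGRESS, IN_REVIEW, FAILED = "todo", "in_progress", "in_review", "failed"


@dataclass
class Ticket:
    id: str
    text: str
    state: str = TODO
    pr_url: Optional[str] = None


# An agent takes ticket text and returns a PR URL, or None if it produced nothing.
Agent = Callable[[str], Optional[str]]


def advance(ticket: Ticket, agent: Agent) -> Ticket:
    """Move one ticket forward: hand its text to the agent and expect a PR back."""
    if ticket.state != TODO:
        return ticket  # only pick up work that nobody has started
    ticket.state = IN_PROGRESS
    ticket.pr_url = agent(ticket.text)
    ticket.state = IN_REVIEW if ticket.pr_url else FAILED
    return ticket


def run_once(tickets: list[Ticket], agent: Agent) -> list[Ticket]:
    """One orchestration pass over every ticket."""
    return [advance(ticket, agent) for ticket in tickets]


def stub_agent(ticket_text: str) -> Optional[str]:
    """Stand-in for a coding agent; a real one would open an actual PR."""
    if not ticket_text.strip():
        return None
    return f"https://example.invalid/pr/{len(ticket_text)}"


if __name__ == "__main__":
    backlog = [Ticket("T-1", "Add a search box"), Ticket("T-2", "   ")]
    for ticket in run_once(backlog, stub_agent):
        print(ticket.id, ticket.state, ticket.pr_url)
```

All of the intelligence sits behind the `Agent` callable, which matches the point being made: the orchestration layer can stay thin when a single agent is able to do the full job.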
Wow. There was so much information in there. If I want to get more of Ryan, where can I find you online?
I tweet a lot on X. My handle is Laoplo. The team and I should have more content coming on the OpenAI blog, and I am doing a conference tour over the summer. I am next going to be at AI DevCon in London. We'll see you there.
All right, guys. Follow him on Twitter.
Check out his amazing blog post on the OpenAI blog, and we'll see you guys in the next episode.
Been a pleasure, Aakash. Thank you.
Thank you. I hope you enjoyed that episode. A couple of things you can do to support the show. One, comment. Two, review. Those ratings and reviews really help other people understand the value and the production that we are putting into this. Right? This wasn't an easy episode to produce. We put in a ton of pre-work. We edited it for you. We brought in the best guests. If you don't mind sharing a rating and review, sharing the episode with others, and making sure you are subscribed, that really helps the show do bigger and better productions. I'll see you in the next episode. Here is one of those that YouTube thinks would be a great fit for