
I got an inside look at how OpenAI PMs ship code

By Aakash Gupta

Summary

Topics Covered

  • If you are not using a billion tokens a day, you are negligent
  • Code is a liability in the age of AI code generation
  • Every team member can own the codebase
  • Every person brings a unique lens; agents compile them
  • A PM shipped from PRD to production in one week

Full Transcript

One of the craziest statements I heard from you was code is a liability. What

does that mean exactly? And

Yeah, there's a bunch of different ways that this expresses, but primarily all of our human software engineering organizations have been tooled around this idea that code is the most

expensive thing to produce. You have

synchronous limited throughput of your human engineer operators on the keyboard.

Ryan Lopopolo led the team building it. He

works on the OpenAI Frontier team, and today he is giving you an inside look at how you can transform your codebase and transform how you work to operate at the frontier.

You've said that if you're not using a billion tokens a day, you're basically negligent. The more tokens that you're

running through these models, the smarter they get, the more intelligence you are able to extract from them. This

is sort of like the realization behind why we have reasoning models in the first place, right? If we give the model more effort to do inference, we're actually able to get greater results out of this.

The engineer owns the code. The designer

owns how it looks. The PM owns the problems we're trying to solve. What do

the new lines look like in the new EPD?

Before we get into today's episode, I wanted to share that you can get a free year of my favorite AI tools, including Bolt.new, Mobbin, Arize, Relay.app,

Dovetail, Linear, Magic Patterns, Reforge Build, Descript, and Speechify, if you join my bundle at bundle.ac.com.

On top of that, I wanted to quickly ask you to please double check that you are subscribed on YouTube, Apple, and Spotify podcast. It's a free thing you

can do that really helps support the show. And now into today's episode.

Every time I talk to people from OpenAI, I get the feeling that they are operating how the rest of us are going to be operating in two years. Whether

that's PMs and designers actually having access to the codebase (how many of us still don't have access to the codebase, right?) or

actually building the right scaffolding so that PMs and designers can feel confident operating with the codebase.

That is the next level. And today we have on one of the people who gave the best talks on harness engineering that I have ever seen, Ryan Lopopolo. He leads one

of the teams at OpenAI on the frontier, OpenAI Frontier, and today he's going to

break down how you can go from the old way of working to the new way of working. Ryan, thanks so much for being

on the podcast.

Thanks, Aakash, for having me. I'm super excited to talk a little bit today about what it means to enter this new wave of working

where we can sort of empower the machine to do much of the work for us, which frees up the humans in our organizations and teams to focus on the true value that they bring to the

table, right? Scheduling the work, bringing their taste and expertise, and making sure that we're shipping actual delightful, high-quality products to the world that solve real user needs, right?

That's the job. Uh, and it's been fun to be able to experiment with how we can do that more effectively and at higher velocity.

I'm so glad you said that because I think there's a sort of a definitional phase we need to help people with at the front. One of the craziest statements I

heard from you was code is a liability.

What does that mean exactly? And what

has been recently enabled with the latest versions of GPT-5 and 5.5 that allows you to say that?

Yeah, there's a bunch of different ways that this

expresses, but primarily all of our human software engineering organizations have been tooled around this idea that code is the most expensive thing to produce. You have

synchronous, limited throughput of your human engineer operators on the keyboard, right? So there has been a

high sense of protectedness around that code. It is expensive to produce.

It is expensive to validate. It is

expensive to deploy. So these things have been treated as requiring the bulk of the organizational apparatus to manage. In the world we

are in today, we have the token prediction machine, these lovely, highly advanced models. Codex is fantastic at taking all the best parts of GPT-5 and

putting that code and those words into the world.

The code is trivial to generate. It can

be generated at arbitrary parallelism depending on how confident you are that the code is able to solve those problems, which we will get into in a little bit more detail. But you kind of have

to change the way you think about scheduling work and exploring when you no longer have what was the overriding

constraint in your teams. And because of this, you can sort of remove humans from the loop almost entirely on that production aspect of the code. And then

you have a whole host of other problems on what it means to ensure quality, right? You have a less tight feedback

loop for those human engineers, right?

Because they're not physically typing on the keyboard. They can't attest to every

line of code as it is produced. But this

is sort of the same spot that the rest of the organization has traditionally been in, right? Like there is an indirection from my designer putting a mock in Figma to me actually realizing

that in HTML, SVG and CSS. There's also

this lossy translation between doing customer interviews, going to a PRD, and turning that into code and deployed artifacts. So it's kind of nice that we

have this equalizing force here where we can potentially end up in this world where every member of these software delivery teams is on equal footing in terms of what it means to

actually produce that executable artifact which we're delivering.

When everybody's on equal footing, though, what even is the purpose of roles?

Traditionally we've had this triangle: product manager, designer, engineering.

We've been able to cling to what we own.

The engineer owns the code. The designer

owns how it looks. The PM owns the problems we're trying to solve. What do

the new lines look like in a new EPD org that's built to take advantage of the latest this technology is offering?

Yeah, this is really interesting. It's a

question I've gotten a lot, actually, from executives and VP- and director-shaped folks that I'm talking to here. And

everyone comes to this question with a slightly different set of priors.

Basically asking which roles do I need to hire more of or less of sort of thing. But I think that's the wrong way to answer this question.

Every one of these individuals on your team brings their unique lens of viewing the world and these coding agents have all the best parts and all the worst parts of how to build software in them.

So having a diverse team with a bunch of different perspectives on what successful delivery looks like means that you're actually able to take the best parts of everybody's worldview and

compile them into your single code-producing agent to do the full job.

This is sort of what it means, I think, for the GPT-5 series of models to be fully agentic. They're able to do the

full job. It's not just fancy

autocomplete. It's not pair programming.

It's being able to go from understanding of the user challenges, understanding the actual codebase we have, the design system that we're building on top of, what our typography primitives are, how

we actually are able to deploy these vehicles in order to get them into the hands of users, what it means to observe whether or not they are effective, and then feeding bug reports and user feedback back into the work. This is

like a fully closed loop that I would traditionally expect a team of four, five, six, seven to have to do. But if

instead all of these different people with diverse perspectives are empowering the same agent, that one agent can, with full context, do the full job.

So if you were to frame what the old and the new roles are, how would you do that?

So sort of in the old way of working, you've got your engineering team who is responsible for compiling the

user-visible artifacts that we want to see into code that is then deployed into the world. And because code

has been very expensive to produce and feedback loops have been long, this

has been bounded by throughput, because the human engineers need to spend a lot of synchronous time and attention maintaining the quality bar and maintaining the long-term viability of the codebase

as a long-lived artifact sort of thing.

And because their time has been largely focused on the production of code within this system, there has been at least in my career comparatively little effort

around empowering others to do that same sort of work. This is why you know very often in organizations you see elbows out around not permitting folks outside of the team or outside of the

engineering organization to contribute directly because uh the code that is produced is potentially very expensive to unwind or very expensive to lift to

the quality bar versus having had your you know expert code authors producing it from scratch. And this has sort of led to each role on the team operating within their own lane with these very

expensive coordination points and handoff points between folks. And

because you have these sort of bottlenecks between handing off work, they're highly contended and low signal.

There's lossiness there. It's very

difficult for the designer to in the weeds observe how difficult it is to compile some of their mocks into code given the constraints of the design

system that the engineering team is working with. And because the designer

doesn't feel that pain directly, there can be either resentment or missing feedback channels from the engineers to design. Which means, on

the whole the system is not working together as effectively as it could be.

When I think about collaborating with my infrastructure teams and developer productivity teams, we often want to shift left as

early as possible, right? We want to feel the breakage in a system before it gets to customers, before it gets to a PR, early in the developer loop, because it's

easier to correct them. Similarly, if it is the case that we are adopting product explorations that are difficult

to prototype or experiment with because we don't have the right experimentation infrastructure, as a

hypothetical example, it may be difficult to push back on product priorities in favor of this sort of product platform work. But this

collection of three roles as a whole would very much benefit from being able to pull that work above the line in order to inflect their throughput going forward. And basically, being able to, in

this new world, get all three of these roles directly collaborating in the same codebase artifact, feeling each other's challenges, means there's much more

empathy between them. Folks are able to help each other more often. And in the happy case where everyone has contributed their expertise here. We get

higher throughput, higher velocity, more experimentation, more high quality UI and more product delivered into end user hands.

I hope you're enjoying today's episode.

Are you interested in becoming an AI product manager, making hundreds of thousands of dollars more, joining OpenAI or Anthropic? Then you might want to

do a course that I've taken myself, the AIPM certificate run by OpenAI product leader Miqdad Jaffer. If you use my code

and my link, you get a special discount on this course. It is a course that I highly recommend. We have done a lot of

collaborations together on things like AI product strategy. So check out our newsletter articles if you want to see the quality of the type of thinking you'll get. One of my frequent

collaborators, Pawel Huryn, is the Build Labs leader. So you're going to live

build an AI product with Pawel's feedback if you take this AIPM certificate. So be sure to check that

out. Be sure to use my code and my link

in order to get a special discount.

Here's the dirty secret about prototyping. You spend two weeks

building a prototype. You validate your assumptions. Engineering loves the

direction. Then what happens? You throw

the whole thing away. Bolt changes this completely. When you prototype in Bolt,

you're not building throwaway mockups.

You're building real front-end code that integrates with your existing design system. So, when you hand it to

engineering, they don't throw it away.

They ship on top of what you've built. I

use Bolt every single day. I host my Land PM Job cohort on it. And honestly,

I'm up till 2:00 a.m. some days just vibing in the tool, having fun, and building. That's when you know a product

is good. When you're using it past

midnight, not because you need to, but because you want to. Check out Bolt at bolt.new. Link in the show notes.

That's what I want to see. What are the receipts? What actually

happens? How can we quantify whether we're getting more product velocity, more throughput at quality?

So, I can talk a little bit about the project that I worked on here, this experiment that I was running.

Back in June of last year, these coding models had just started to become viable, right? We were working with

Codex Mini and, internally, the early versions of GPT-5. We had just launched Codex CLI, and by historical happenstance and

accident I set about to try and build an agent utilizing these capabilities that was for non-coding knowledge work.

The thesis was that all agents were coding agents, and we were going to prototype an internal agent that we were going to deploy at OpenAI. In order to do that, we started from zero, an

empty codebase, and had Codex do the full software engineering job to produce this artifact. One interesting thing

here from a product development perspective is that having

this process where only Codex can do the job, to produce something that was not an engineering-native agent, was super, super

useful because it forced us into delivering value to actual end customers that were not the builders, which

helped with prioritization and with externalizing the learnings that we were doing in this new product category to folks that could give feedback about the product rather than feedback about the

process of developing the product.

But in doing so we produced a repository with about a million lines of code.

Funny story about producing agents.

There's a lot of prompts embedded in agents. There's about 250,000 lines

of Markdown in this repository,

because when you're shipping an agent, what you're really shipping is a multi-agent system. You have agents running

around all over the place in what was an Electron app for us. Things that

garden the individual git repositories that are used to bootstrap all the tasks that the agents are working on. Things for summarization, things for distilling

skills, things for actually executing the work that we're doing on users' behalf. It's a pretty cool way of

building products and an interesting way for non-engineering users to collaborate, because if many of the product surfaces

are driven by agents, improving or changing their behavior is prompt engineering, which is tweaking text, and everybody knows how to tweak text.

Everybody can tweak text and run the agent to get a vibe check, to do an eval on whether or not the change was good, and that in and of itself is super, super empowering. But the important thing

is not just that we produced an app with a million lines of code. It's that when doing so, literally zero of those lines

of code were written by humans. And this

was a "Ryan as dictator" constraint that I imposed on my team here, which I thought was very interesting because it forces the engineers right into the back

seat of the Uber, right? You can't

drive. The only thing you can do is very politely give backseat driver directions. And that sort of

mental model is, I think, very useful, because it's not great to yell at your Uber driver, right? You

need to have them use the tools available to them in the car in order to get you to where you're going.

So a big part of building in this way is making sure that the car, the GPS, the test harness, the lints in the codebase, and the knowledge corpus of all the features we've built and critical user

journeys are there in the moment, ready for the coding agent to drive. So

when Codex was given a task and for whatever reason was not able to do the thing, the job of the engineers working on the codebase was to basically stop,

diagnose what went wrong, figure out creatively how to prevent the agent from making those mistakes going forward, and try again. And every time you do that,

you've taken feedback that normally you would have to give via synchronous steering and figured out a way for the repository itself to give that context

to the agent. And this is why it was so easy for us to have non-engineers contribute to the codebase: every time the agent made a mistake, the engineers were forcing the agent to

learn how to self-correct, which means we basically did not have slop in the codebase. It was very slow to get to

that point, but you can avoid slop by simply not permitting the agent to write slop in the first place. And you do that with tests, review agents, and other

mechanisms that are LLM-as-judge driven to prevent that slop from being committed in the first place.

You mentioned it took a little while to get there. And I think elsewhere you

also talked about how the first month was about 10x slower than it would have been doing it solo. Can you explain what the first month and the

evolution looked like?

Yeah, so the earliest version of the product exploration, starting from zero, was to have the agent triage pages on my behalf, right? Going

on call is a thing that I do. We're

exploring what it means to have these agents do knowledge work. And dealing with outages is a thing that I do. It's text heavy. I thought it was

pretty amenable to having agents do this work. And from zero at the time, without

all these lovely apps we have in Codex today, if you said "read my Slack and figure out what inbound outages are

happening," it's just not a thing the harness was able to do. So it's obviously much quicker for me to tab over to Slack and copy and paste into Codex, but that's not the

goal, right? The goal is to get Codex

to do the full job. So I kind of had to stop and say, okay, I can copy and paste, but I don't want to. What must I build

first to give Codex access to Slack?

And actually, the thing I have to build first is credential management in the harness, so Codex is able to

request credentials on my behalf. And in

order to do that, I need access to the keychain. And in order to do that, I

need secure coding practices that prevent me from making cryptography mistakes when we're dealing with this sensitive material. And you can kind of

recurse into a task like that sometimes six, seven, eight times before you bottom out at the core primitive capability that is missing. And then you

pop back up the stack to accomplish the actual higher-level primitive that you were trying to add to the product. One neat thing about that

is, as you're doing this breadth-first search through the tools and quality gates that you need to give the agent, you knock those out one by one, and you

don't have to do them going forward, right? You are importing leverage into the codebase for every subsequent change. It means that

every bit of feedback I give, I at most have to give two or three times before we figure out a way for that misbehavior to just go away entirely.

So, one part that blew my mind, and I

always like this PM angle, is that your PM just wrote a PRD in Markdown and they went to a shipped pull request. Can you

give us a couple concrete examples of what they shipped and how that flow works?

Sure. Back in September of last year, we were experimenting with what was the precursor to skills in this agent: this idea that the human

driver should be able to teach the model about the things they care about. We

were doing some data analysis tasks, so

understanding the data ontology of what it means to do business metrics in this particular part of the company was

important. And we want the model, we

want the harness to be able to pull this information out from the user, right?

And we have this lovely machine that is able to write a bunch of text. It should be easy for it to have a

conversation with the user to figure out the things they care about versus the things they don't, so that all subsequent runs that are doing these business metrics questions accrue

that leverage and context. And the thing that we wanted to explore was what product shape this looks like, right? How can we give the user the

ability to inject durable knowledge into the agent over and over again? So our PM set

about to explore this and the hypothesis was that a library with the ability to both interview the user and learn from all the tasks that the agent was already

doing was the way we wanted to package up this capability.

And we reviewed the PRD as a team in the beginning-of-the-week team meeting. And at the end of the week, in our demos, without any of the engineers having talked at all

to the PM, we had a demo of this feature, and it rolled out to customers in the

next week. This is very, very

cool. It was able to be done safely

because we have architected our code in a very, very modular way, where

individual pieces of business logic are able to connect together in safe ways without spaghetti dependencies. Everything takes an

interface. The concrete implementations

are hidden. It means every bit of

business logic that we have, whether it was the skill wiki or the ability to manage connectors, always exports a

high-fidelity fake for tests, which means the tests that our PM was vibing into existence actually exercise its dependent services and the real business

logic of the skill library without having to worry about monkey patching or having tests that don't reflect reality. The eval here is

whether or not the agent was able to learn from these skills, whether or not they were mounted into context.
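
The "everything takes an interface, every module exports a high-fidelity fake" structure described just above can be sketched as follows. This is an illustration under stated assumptions, not the team's code: `SkillStore`, `FakeSkillStore`, and `mount_skills` are hypothetical names, and the validation rule in the fake stands in for whatever rules a real store would enforce. The point is that the business logic is tested against a working in-memory implementation of the interface, so no monkey patching is needed.

```python
from typing import Protocol


class SkillStore(Protocol):
    """Interface that business logic depends on; concrete stores stay hidden."""

    def save(self, name: str, body: str) -> None: ...
    def load(self, name: str) -> str: ...
    def names(self) -> list[str]: ...


class FakeSkillStore:
    """In-memory fake exported next to the real store (hypothetical).

    It enforces the same rule a real store is assumed to enforce, so tests
    written against it fail for the same reasons production would.
    """

    def __init__(self) -> None:
        self._skills: dict[str, str] = {}

    def save(self, name: str, body: str) -> None:
        if not name.strip():
            raise ValueError("skill name must not be empty")
        self._skills[name] = body

    def load(self, name: str) -> str:
        return self._skills[name]  # raises KeyError for an unknown skill

    def names(self) -> list[str]:
        return sorted(self._skills)


def mount_skills(store: SkillStore, task: str) -> str:
    """Business logic under test: put every stored skill ahead of the task."""
    sections = [f"## {name}\n{store.load(name)}" for name in store.names()]
    return "\n\n".join(sections + [f"## Task\n{task}"])
```

A test written by a non-engineer then only needs to construct `FakeSkillStore()`, save a skill, and assert on the text that `mount_skills` returns, which mirrors the "were the skills mounted into context" eval mentioned in the answer.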

Whether or not the UI presented in the way that it should. This is the same way I would think about evaling this slide deck we're looking at now, which was also created by Codex, right? Codex

is a coding agent. It's going to write code in order to make this slide deck, but I don't care about that code. I care

that it produces a final artifact that I care about. And putting all this work

into the harness, into the repo, means that we can actually move to PRD as code input with the app as compiled

output. But it was slow in the

beginning. You can't do that

in any brownfield codebase today and expect to get good results. I

think we'll talk a little bit about how we can give you a cookbook for what to do next week in

your codebases. But I think the idea here is to have the role of the engineering team be to encode leverage around what it means to do a good job.

So the actual specifics around what you build and how it is realized in code are just artifacts of the repository itself. Encoding leverage is

the name of the game.

We've heard a success story on the PM side. In our pre-call, you mentioned that it wasn't always

successful. There were some things that,

for instance, you guys built with your designer early on. It was like an OpenClaw-style feature that you had to trash. So

what failed? What did you learn? And how

can people learn from that failure so it doesn't happen to them?

Yeah. So one thing we were trying to build was what you now see in the Codex app as automations: a scheduled cron mechanism to

kick off agent tasks on your behalf on a schedule. Very OpenClaw-like, as

you said. And this was a capability that our designer put into the product,

but in doing so he had to bite off a fairly chunky bit of app infrastructure work in order to put in a cron system which interfaced cleanly with our agentic

core. We wanted the cron

itself to create tasks, not necessarily be responsible for driving them to completion. We wanted a nice loose

coupling between these core infrastructure primitives. The ultimate

thing that our designer wanted to prove out was the UI interaction pattern: what

it would mean to even surface scheduling to users in ways that they felt were intuitive. But in order to go from

zero to putting that feature in front of users, he felt that the back end needed to exist as well. And the reason we

ended up unwinding that is because we ended up with spaghetti, right?

We were still halfway through realizing the true value of the harness here, right? Engineering had not yet fully specified all these non-functional requirements of

what it means to write high-quality code that is scalable for future maintenance.

This is really all the harness is: figuring out ways to get those non-functional requirements written down and injected into the agent just in time. This is what enables everyone

to contribute to it safely. This is what empowers the team to continue to accrue leverage back into the harness. But in

the absence of that, we ended up with a spaghetti mess. I think the scheduler

lived in the front-end JavaScript at some point. I

look at this and I'm like, "Oh, I'm going to regret this later." So we made the decision to revert the PR.

But we did want to enable the core objective here, which was to experiment with what this product surface could look like, to figure out which interaction patterns were most intuitive

to users. And we do want our designer

contributing to the codebase because we need his expertise on how to use the design system effectively and how we should think about component architecture. We

wanted to continue to explore good information architecture so we're not overwhelming users but are also giving them the power tools that they need. So the

norm we aligned on was that we would put a door between the UI as exposed to users and the actually implemented functionality on the back end, and we ended

up with what we're calling a painted door style, where we can get the full product feature implemented with just a

no-op on the back end. With the addition of some creative product instrumentation, we're actually able to get most of the way there in terms of doing that experimentation and actually

get higher signal for the engineering team on whether or not we should even prioritize this work. With code generation increasingly being

automated, the bulk of my time is spent on feature triage and scheduling, so figuring out ways to get more information into that part of my work is an amazing outcome of this

norm we aligned on.
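
The painted door norm described in this answer can be sketched in a few lines: the UI talks to a scheduler interface, and the only implementation behind it is a no-op that records the attempt as product instrumentation. Everything here is an assumption for illustration, including the `Scheduler` interface, the `PaintedDoorScheduler` class, the event shape, and the user-facing message; none of it comes from the actual product.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Scheduler(Protocol):
    """What the UI depends on; returns True only if a task was really scheduled."""

    def schedule(self, user_id: str, cron: str, prompt: str) -> bool: ...


@dataclass
class PaintedDoorScheduler:
    """No-op back end: records that the user tried to schedule, schedules nothing."""

    events: list[dict] = field(default_factory=list)

    def schedule(self, user_id: str, cron: str, prompt: str) -> bool:
        # Instrumentation only. The prompt text is deliberately not logged.
        self.events.append(
            {"event": "schedule_attempted", "user_id": user_id, "cron": cron}
        )
        return False


def on_schedule_clicked(
    scheduler: Scheduler, user_id: str, cron: str, prompt: str
) -> str:
    """UI handler: the full interaction exists even though the back end is a no-op."""
    if scheduler.schedule(user_id, cron, prompt):
        return "Scheduled."
    return "Scheduling is not available yet. Thanks for your interest!"
```

Because the UI only knows the interface, a real cron-backed implementation can later replace the painted door without touching the interaction code, and the recorded events give the triage signal on whether that back-end work is worth prioritizing.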

I used to think I had a retention problem. Turns out I had a messaging

problem. I was sending the same

onboarding emails to every new user whether they activated on day one or never logged in again. I had no idea who was slipping or why. Customer.io changed

that. Every message I send is now based on what users actually do in the product. Someone hits a key activation

moment, they get nudged to the next one.

Someone goes quiet, they get a different path entirely. Their AI agent makes it

fast. I describe the campaign I want and

it builds the full journey:

triggers, timing, copy, even branching logic. And when I want to know how

something is performing, I just ask the agent directly and it tells me what to do next. They also have an MCP server,

which means AI tools like Claude can see directly what's happening in your Customer.io workspace: your segments,

your customer data, your attribution, all of it. So instead of explaining your business context every time you need help, Claude already knows it. Notion

used Customer.io to personalize their onboarding and hit a nearly 50% open rate, improved conversion by 6 to 7% with localized campaigns, and pushed open

rates up another 20% through A/B testing.

The idea is simple. Customer.io helps

you deliver more impact from every message you send. If you're a PM or founder and your onboarding is still one-size-fits-all, try Customer.io at customer.io.

I'm notoriously bad at my

I'm notoriously bad at my inboxes. I guess there's a version of that where I seem cool and unavailable, but the reality is I miss sponsor emails, guest pitches, and stuff that my team actually needs me for. So I got an AI assistant, the sponsor of today's episode, Ariso. Ariso connects to my email, calendar, and Slack. Then I just chat with it over Slack, and it helps me with everything. It builds workflows to respond to emails, resolve customer issues, prep me for meetings. It actually comes to my meetings, updates its own knowledge, and remembers context from past conversations. So every time I talk to it, it already knows what I'm working on. I used to pay for Granola and Lindy separately. Ariso replaced both. One tool does more, and it lives right in Slack where I already work. Check it out at ariso.ai/aakash. That's a-r-i-s-o.ai/a-a-k-a-s-h.

Today's podcast is brought to you by Pendo, the leading software experience management platform. McKinsey found that 78% of companies are using GenAI, but just as many have reported no bottom-line improvements. So how do you know if your AI agents are actually working? Are they giving users the wrong answers, creating more work instead of less, improving retention, or hurting it? When your software data and AI data are disconnected, you can't answer these questions. But when you bring all your usage data together in one place, you can see what users do before, during, and after they use AI, showing you when agents work, how they help you grow, and what to prioritize on your roadmap. Pendo Agent Analytics is the only solution built to do this for product teams. Start measuring your AI's performance with Agent Analytics at pendo.io/acos. That's pendo.io aka.

Okay. So, PM writes the PRD in markdown. Designer paints a door. Then the engineering team takes over the actual back end. That's the high-level, boiled-down summary that I'm hearing. The thing that makes this all possible is the harness. Can you show us what's actually in it?

The harness lives in the repository, but really the harness is a mechanism for getting the coding agent, getting Codex, the right context it needs in order to do the job at all phases of the implementation life cycle. I think there's kind of three phases you would think about here. One is I'm getting ready to go to work. I have a ticket, I have a PRD, and I need to figure out what must be done in order to implement the thing and get it deployed. The next phase is kind of in the messy middle of implementation. I'm writing code. I'm exercising the UI. I'm seeing the bugs. I'm seeing logs. I'm making sure that, you know, everything is pixel perfect. And then there's the back half of things, where I have produced a diff that I claim implements the feature, and I need to convince the rest of my team to merge it, that it is acceptable, that it meets the actual user need that we're trying to solve with this change. And in each of those phases, we have different mechanisms in order to surface context to Codex in order for it to do a good job.

In the beginning here, we really have AGENTS.md, which is this file that lives in the root of the repository that is forcibly injected into Codex. What Codex CLI does is take the contents of that AGENTS.md and give it into context for the agent. And this is both good and bad, right? Things in AGENTS.md we know are 100% reliably going to inform the agent's actions, but we're paying for that with context window usage. And with these models, context window is a hard constraint. You cannot fight against it. There are only so many hundreds of thousands of tokens you get. I will say that autocompaction in GPT-5.2, 5.3, 5.4, 5.5 is so, so good. I never run out of context these days. It's sort of like a crown jewel of research here to have good autocompaction. It is truly amazing to be able to go on any size of ticket completely end to end in a single session. It's magical.

But still, you do have to contend with things getting paged out of context even with autocompaction. So AGENTS.md is a tool, but one that we want to be constantly trading off against this scarce resource of context.

Yeah.

So what we do there is we put an operating model for Codex in there. Basically a three-, four-, five-step plan of how you should think about doing a task. And what that means for us is you ground the decisions that you make in the documentation we have in the repository. There's a big docs tree. It includes everything from what it means to write performant React, to writing reliable networking code, to all the design docs and critical user journey flows that we have in the app.

And that's important to have as part of the operating loop because we don't inject all that stuff into the AGENTS.md up front. Instead, we just tell Codex in the AGENTS.md: you have all these resources available to you, they are of X, Y, and Z shape, and they live here. You decide whether or not to look at them. Which is nice, because we can deterministically steer the agent toward the context we want it to discover, but we let the magic of these models' reasoning decide what is important, what it actually thinks it should pull into context. This is sort of the outset of the task. So: ground yourself in the documentation that we have and the app as it is written, and then move into implementation.
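One way to keep an AGENTS.md that acts as a map honest is to test it like any other artifact. The sketch below is illustrative only: the size budget, the backticked `docs/...` path convention, and the function name `check_agents_map` are assumptions, not details given in the conversation.

```python
import re
from pathlib import Path

# Hypothetical budget: a rough cap on always-injected context, in characters.
MAX_AGENTS_MD_CHARS = 8000

# Assumption: AGENTS.md points at docs with backticked paths such as `docs/frontend.md`.
DOC_PATH = re.compile(r"`(docs/[^`\s]+)`")


def check_agents_map(repo_root: Path) -> list[str]:
    """Return human-readable problems; an empty list means the map is healthy."""
    text = (repo_root / "AGENTS.md").read_text(encoding="utf-8")
    problems = []
    if len(text) > MAX_AGENTS_MD_CHARS:
        problems.append(
            f"AGENTS.md is {len(text)} characters; keep it under "
            f"{MAX_AGENTS_MD_CHARS} and move detail into the docs tree."
        )
    for rel in DOC_PATH.findall(text):
        if not (repo_root / rel).exists():
            problems.append(
                f"AGENTS.md points at {rel}, which does not exist. "
                "Fix the path or remove the entry."
            )
    return problems
```

Wired into CI, a non-empty result would fail the build, and the messages themselves tell the agent what to change.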

During implementation, the thing the agent is naturally going to do, the thing it is trained to do, is write code and make tool calls. So we have to figure out ways to use those two mechanisms to inject our context into the agent. This is kind of a silly thing, but actually the code in the repository is prompts. Codex is going to pull code into its context window in order to figure out: what business logic do we have, how do we do things here, what database adapter am I using, am I using Jotai on the front end for state management? So actually having the repository enforce that the code is all the same, or uses the same patterns, or is structured similarly, allows you to efficiently, and sort of in parallel across the codebase, apply context you've read in one file to other files. It means that the agent is able to more effectively cargo cult. I love to cargo cult stuff. This is how I have worked in big brownfield codebases when I'm new, right? I'm like, the team has probably done this before. We kind of poke around and copy and paste some stuff. And, you know, we can let Codex be lazy too by making it do the hard work of making the codebase homogeneous.

But another way we can inject context into the agent over the course of doing these runs is to take all that human judgment around what good looks like and basically make it fail tests if the code doesn't match. A stupid one that I do is to fail the build if there is any text in markdown files or HTML that does not use the curly quotes, because our design team loves that proper typography in our user-facing strings, to have the smart curly quotes and apostrophes. This is probably a thing that is unevenly adhered to in most codebases, if at all. But because this irked our designer a couple of times, we had him vibe code some tests into existence. We don't really care about the quality of this code, right, so long as it actually does the job and prevents the code from becoming misaligned with the user intent. And we have a lot of tests like that, that do probably weird things, maybe things you would consider low leverage, but collectively improve the quality of the software that we're delivering, which I think is pretty cool.
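The curly-quote rule described above could look roughly like the following. This is a sketch under assumptions, not the team's actual test: it only handles Markdown, it skips fenced and inline code (where straight quotes are legitimate), and the names `find_straight_quotes` and `failure_message` are made up for illustration.

```python
import re

FENCE = re.compile(r"^\s*(```|~~~)")   # start or end of a fenced code block
INLINE_CODE = re.compile(r"`[^`]*`")   # inline code spans, ignored by the check
STRAIGHT = re.compile(r"[\"']")        # straight double or single quote


def find_straight_quotes(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for prose lines containing straight quotes."""
    hits = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        prose = INLINE_CODE.sub("", line)
        if STRAIGHT.search(prose):
            hits.append((number, line))
    return hits


def failure_message(path: str, hits: list[tuple[int, str]]) -> str:
    """Build a message that tells the agent exactly what to fix."""
    lines = "\n".join(f"  {path}:{number}: {text}" for number, text in hits)
    return (
        "Straight quotes found in user-facing text. Replace them with curly "
        "quotes (\u201c \u201d \u2018 \u2019) and re-run the build.\n" + lines
    )
```

The failure message matters as much as the check: it is the context the agent reads when the build goes red.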

And then on the back end, right, we've got the agent executing, grounded in the documentation we have in the repository, which is this accumulation of all of the team's historical work and knowledge. We've got taste refining the output as we go via these tests and lints, and the code itself showing the agent what good looks like. And on the back end, we take all of that same context and give it to a review agent. In that docs repository we have, you know, a front-end architect MD, a reliability engineer MD, an apps engineer MD, and for each one of those we have this big matrixed CI job that says: you review for front-end architecture, here's your guardrails, here's the diff, identify all the P2 issues with this thing and leave some feedback on the PR. This feedback is more context, just like those tool call error messages that we're giving the implementation agent, to say you must address this, right? We found some issues where you are misaligned with the way we expect things to be produced.
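The matrixed review job can be pictured as one prompt per persona file. The sketch below only assembles those prompts; it does not call any model or CI system, and the directory layout, file names, and `ReviewJob` type are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ReviewJob:
    persona: str  # taken from the file name, e.g. "frontend-architect"
    prompt: str   # what a reviewer agent would receive


def build_review_jobs(persona_dir: Path, diff: str) -> list[ReviewJob]:
    """Create one review job per persona Markdown file in persona_dir."""
    jobs = []
    for path in sorted(persona_dir.glob("*.md")):
        guardrails = path.read_text(encoding="utf-8")
        prompt = (
            f"You review for {path.stem}.\n\n"
            f"Guardrails:\n{guardrails}\n\n"
            f"Diff:\n{diff}\n\n"
            "Identify the issues in this change and leave feedback on the PR."
        )
        jobs.append(ReviewJob(persona=path.stem, prompt=prompt))
    return jobs
```

Because each persona is just a file, adding a new reviewer is a documentation change rather than a pipeline change.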

And with all of this effort in prompt injecting the agent with the collective human judgment on the team, my synchronous involvement here can basically move all the way to the back of this process, where the code is presumably already reviewed and approved and aligned by all this machinery that we've put in place over the course of executing, producing the change. And then my job is to say, okay, we've got this change. It looks good, but I think it's misaligned in ways X, Y, and Z. And then once those things are addressed in the moment, my job is to close the loop by taking all this feedback that I had to give and making it so I don't have to give it next time. And then back into the harness it goes.

So we've got the high level now of what it involves. Can you go really granular? Show your actual Codex setup. Show what's on every engineer, PM, and designer's machine and how to put this all together practically.

I would say that the way I have steered my team has been intentionally frictionful, in that we don't want them staring at the computer looking at their agent. Which means they have largely been booting up Codex in the CLI, going to multiple tabs, where it's actually painful to pay attention to these things, because we don't want them to. If I were doing it again today, this is probably not what I would do, right? The app is quite fantastic and is a thing that we are seeing meteoric growth in. I think it was just past four million weekly active users. And all the good parts of the workflow that I'm describing are in the app, right? I would package all that collective designer taste into a plugin that, you know, my team has a little shared marketplace of, so that I can just suck all of that expertise into the app. That's going to pre-wire Figma, which means the place my designer naturally works is going to be a thing that is directly addressable by Codex, in the same way that I have access to Figma. And maybe I'm unreliable at refreshing my knowledge of the current mock, so Codex is just going to do that, because these agents will follow instructions well. That is a core part of how they are built. So these plugins, which bundle the skills, include those instructions: these are the tools we use, these are how they are built. So I think the app today, as sort of the central coordination mechanism of teams, gives a much more user-friendly affordance for distributing these sorts of things. Plugins are definitely the future and what I would lean into today.

All right. Can you show us what this actually looks like inside of Codex? How someone would set all this up?

Sure. Let's get into the nitty-gritty of it. This is my favorite part here. So, right, we started with an empty repository, which means we had a high degree of freedom to start structuring the code in a way that Codex most natively understood. And this is how we came to the pattern that all the things we want Codex to know need to live in the repository itself. This was its workspace. This was its brain. In order to mise en place, to get the agent set up, we wanted it to have the easiest time, and the thing that Codex loves to do is search code with ripgrep. So we put everything in one place, and what that looks like in practice is this big docs directory that has a ton of different content in there.

We make heavy use of a thing called an exec plan, which you can find on the OpenAI Cookbook, which is essentially a skill you can give Codex in order to write a good implementation plan, phased with milestones and deliverables, so it can track its work and make sure that it is continuing to make progress against high-level and tricky implementation plans. And we keep every single one of those that has ever been produced in the repository. They form a durable log of the implementation choices we have historically made. Similarly, every design doc and Google Doc lives in the codebase, which means these things are actually continuously kept up to date to reflect the actual state of our app. They're sunset and archived when they become irrelevant or features get removed.

And we have also built code and tests around this documentation tree. So all these markdown files cross-link to each other. They embed the git repository revisions that they are current as of. If Codex ever touches one of them to update it, the build will fail unless all the cross-linked documents are also updated to reflect the current state of the world.
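A simplified version of that cross-link rule might look like this. It is a sketch under assumptions: it only follows relative Markdown links, it takes the set of changed files as an input rather than reading it from git, and it ignores the embedded revision stamps mentioned above. The name `stale_linked_docs` is hypothetical.

```python
import posixpath
import re
from pathlib import Path

# Relative Markdown links such as [perf](perf.md); links with anchors are ignored here.
LINK = re.compile(r"\]\(([^)#\s]+\.md)\)")


def stale_linked_docs(docs_root: Path, changed: set[str]) -> dict[str, list[str]]:
    """Map each changed doc to the docs it links to that were not changed with it."""
    stale = {}
    for rel in sorted(changed):
        source = docs_root / rel
        if not rel.endswith(".md") or not source.exists():
            continue
        linked = set()
        for target in LINK.findall(source.read_text(encoding="utf-8")):
            if "://" in target:
                continue  # external URL, not part of the docs tree
            linked.add(posixpath.normpath(posixpath.join(posixpath.dirname(rel), target)))
        missing = sorted(linked - changed)
        if missing:
            stale[rel] = missing
    return stale
```

A CI step would fail when the result is non-empty and print which linked documents still need to be reviewed and updated.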

And this is what it means to have the harness self-maintain, to have the tests prompt injecting Codex with what it must do. We don't rely on Codex knowing that this is how it maintains the doc repository. A test will tell it what it needs to do, how it must proceed in order to get the build to green. The agents are super trained to get the tests to pass. So if we can figure out ways to build context injection primitives into failing tests, we're not fighting the harness here. This is not a thing that will be obsoleted by the model capability improving. This is just a very neat way of aligning what must be done with this scarce resource of context with the actual thing the model will always do, which is call tools and run tests.

Similarly, here you can see in this references section we have a bunch of these llms.txt files. I don't know if everybody knows this, but a lot of programming projects will publish a plain-text version of their documentation at llms.txt. This strips away all the tokens from the HTML, and it's just markdown. Stripe has these, the OpenAI docs have these, uv, which is a Python runtime, has these. Our internal design system has these, and we check these directly into the codebase, which basically gives Codex a full reference manual for all of the big chunky libraries and components that we depend on. And we point to those. You know, our documentation on the design system highlights the key components that are probably going to be used most of the time and how to look for them in this llms.txt.
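Checking a dependency's llms.txt into the repository can be as small as the helper below. This is a sketch: the `references` directory, the `<name>-llms.txt` naming scheme, and both function names are assumptions, and a real site may publish the file somewhere other than the root path.

```python
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen


def llms_txt_url(site: str) -> str:
    """Build the conventional root-level /llms.txt URL for a documentation site."""
    return urljoin(site, "/llms.txt")


def vendor_llms_txt(name: str, site: str, references_dir: Path) -> Path:
    """Download a project's llms.txt and store it as <references_dir>/<name>-llms.txt."""
    references_dir.mkdir(parents=True, exist_ok=True)
    target = references_dir / f"{name}-llms.txt"
    with urlopen(llms_txt_url(site), timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

Once the file is committed, the agent can search it with the same tools it uses for code instead of fetching and parsing HTML.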

Here you can see the persona-oriented documentation that we talked about earlier, right? What it means to write secure code. This product sense MD is how we think about encoding what it means to document a feature, how to turn a PRD into code and a QA plan. And this is the sort of thing that I doubt any teams are writing down today. It's just a thing they kind of, you know, talk about together, and maybe whack each other on the nose when they don't do this when producing a PR. But figuring out a way to actually durably document what it means for the team to operate is necessary for these agents, which every time they spin up have a blank slate in terms of context.

Another neat thing here is this quality score.md, which is sort of a meta way for Codex to diagnose the state of the codebase. We have it keep notes here every time it writes code, because over the process of making a change, it's going to pull a ton of code into context. Maybe code that hasn't been touched in a month, but our guardrails have changed. So this is a way for Codex to be continually assessing the quality of the codebase, doing some gardening, and spinning up some tasks for later on how we can keep aligning the codebase to baseline.

Another neat thing we can talk about here is what it means to close the loop. When we were working on this app, we split it into a backend and a front end, which means they could be tested independently. And we tried to house as much of the business logic as possible in our backend. That means Codex should be able to validate the behavior of our business logic headlessly, right? You don't need to drive a UI. You can hit APIs and look at the database to observe the side effects that you were hoping to have. And normally, for almost all of my career, when I'm working with services in local dev, I run them in my terminal and I get this never-ending stream of logs in the console that I have no idea what to do with, and it's kind of impossible to use because they scroll away so quickly.

What we've done instead is permit Codex to spin up something approximating a production observability stack. Think giving Datadog to Codex for every change that it makes. This allows it to use normal observability tooling for metrics and logs, the same things your engineers would be using to diagnose issues in production. And we let Codex use it to diagnose issues in local dev. Right? Shift left, validate the change, and exercise it end to end. And we do the same thing with the UI itself. We mount the UI into a local browser shell that pre-wires up Chrome DevTools. Today, we would do this with computer use, to allow Codex to actually inspect the app, drive it, click the buttons, observe that the loop is closed, that the feature is implemented, and attest that it works end to end.

So I mentioned a little bit how being super overly architected in the application was a point of leverage for the engineering team, and the reason is that we've actually statically limited the plausible places that code can go. You see this represented here as sort of a package layering stack. One common failure mode in hypergrowth codebases is that business logic just gets spewed everywhere, and it becomes a codebase that approximates a ball of mud. These are very difficult to deal with and evolve, because there's a ton of non-local side effects from touching any bit of code. But I mentioned that we have these nice isolated business logic domains that are able to take all their dependencies as fakes and make it safe for us to prototype new features. And the only reason we're able to do that is because we actually enforce, with code, that these things are hard separated, with package boundaries between them. You cannot possibly create a ball of mud in the first place.
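Enforcing a layering stack with code can be sketched as a check over a package import graph. The layer names below are hypothetical, and the import map is passed in directly; a real check would derive it from the build system or from parsing the source.

```python
# Hypothetical layers, lowest first. A package may import its own layer or lower ones.
LAYERS = ["config", "database", "domain", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def layer_violations(imports: dict[str, set[str]]) -> list[str]:
    """imports maps a package to the packages it imports; return rule violations."""
    errors = []
    for package, deps in sorted(imports.items()):
        for dep in sorted(deps):
            if RANK[dep] > RANK[package]:
                errors.append(
                    f"{package} may not import {dep}: {dep} is a higher layer. "
                    "Move the shared code down or invert the dependency."
                )
    return errors
```

For example, `layer_violations({"database": {"ui"}})` reports one violation, while `{"ui": {"domain"}}` passes because the import points down the stack.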

Which basically means that for the engineers, we permit them to write packages at any layer of the stack. They can write database code, config, parsing, React code in the front end, business logic. Our PM, by norm, works in business logic and UI code, but tries not to touch these lower-level primitives in the app. And to get that painted-door invariant for our designer, we had him focus largely on the UI layer of the app, with some affordances for stubbing out APIs on the back end so that we could get those usage signals that we cared about. Which is kind of neat, to be able to reflect ACLs on who is permitted to touch what code in the actual structure of the code itself. It is actually by folder paths that we're able to determine whether changes are allowed or not.
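Path-based permissions of this kind reduce to matching changed files against per-role patterns. The roles, folder names, and the `disallowed_paths` helper below are illustrative assumptions, not the team's real configuration.

```python
from fnmatch import fnmatchcase

# Hypothetical role-to-path rules; "*" in a pattern also matches "/" here.
ALLOWED = {
    "engineer": ["*"],
    "pm": ["packages/domain/*", "packages/ui/*"],
    "designer": ["packages/ui/*", "packages/api-stubs/*"],
}


def disallowed_paths(role: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall outside what the role may touch."""
    patterns = ALLOWED[role]
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

A pull request check could call this with the author's role and the list of touched files, and fail with the offending paths when the list is non-empty.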

This adds up, and I think you have a really hot take on how it should add up. You've said that if you're not using a billion tokens a day, you're basically negligent. Now that's $2,000 to $3,000 per engineer. How does an engineering leader at a normal company build the case for that internally?

The idea here is that the more tokens that you're running through these models, the smarter they get, the more intelligence you are able to extract from them. This is sort of the realization behind why we have reasoning models in the first place, right? If we give the model more effort to do inference, we're actually able to get greater results out of this. And we see this time and time again with the longer time horizons that agents are able to execute over. And I think it is the case that in order to have the lofty goal of being a token billionaire, you kind of have to find ways to get the agent to do more work on your behalf, to execute in parallel, more autonomously. And because I believe that GPT-5 is able to do the full job, having that as a north-star constraint means that I am constantly looking for ways to get the agent to do the hard parts of my job that take a lot of my time, and constantly being surprised in the ways this frees me up.

One sort of insane thing is not just a billion tokens a day, but sometimes it can be 350 million tokens on a single PR. There was a time a couple of months ago where I was buckling my laptop into the back of my car going back and forth to the office, tethered to my corporate phone, in order to have the agent continue to cook while I was commuting to work. This is like a 60-hour Codex CLI session where I was like, please don't go to sleep, laptop. And this is a PR that probably would have taken me three weeks to do. One of the hardest refactors I have ever done in my career, and it took place over the course of three, four days, and did as good a job as I would have. Because of the investment we've put into letting the harness prompt inject the agent and manage its context efficiently over that 60 hours, I only had to provide two prompts in addition to the first prompt that we gave the agent, which is truly magical. It means that it was actually approximating a human engineer on my team, right? I did not have to synchronously bash it on the head in order to make progress. It just went heads down and did the work the way I would.
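The figures quoted in this exchange imply a blended token price, which in turn gives a rough cost for the 350-million-token PR. The arithmetic below uses only numbers from the conversation and assumes the $2,000 to $3,000 figure is a per-day cost for one billion tokens; that reading, and the idea of a single blended rate, are assumptions.

```python
TOKENS_PER_DAY = 1_000_000_000          # "a billion tokens a day"
DAILY_COST_RANGE_USD = (2_000, 3_000)   # quoted per-engineer cost, read as per day
PR_TOKENS = 350_000_000                 # tokens spent on the single long-running PR


def implied_price_per_million(cost_usd: float, tokens: int) -> float:
    """Blended dollars per million tokens implied by a total cost and token count."""
    return cost_usd / (tokens / 1_000_000)


def cost_at_rate(tokens: int, usd_per_million: float) -> float:
    """Dollar cost of a token count at a given blended rate."""
    return tokens / 1_000_000 * usd_per_million


for daily_cost in DAILY_COST_RANGE_USD:
    rate = implied_price_per_million(daily_cost, TOKENS_PER_DAY)
    pr_cost = cost_at_rate(PR_TOKENS, rate)
    print(f"${daily_cost}/day -> ${rate:.2f} per million tokens, PR cost about ${pr_cost:,.0f}")
```

Under those assumptions the implied rate is $2 to $3 per million tokens, which puts the 60-hour refactor at roughly $700 to $1,050 in tokens.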

Okay. So, a 60-hour long-running task. I think that points to something: if people have their conception of these models from five, six months ago, it's really outdated. It's what's recently become possible that's enabling this new way of working. How would you describe it? When did that happen?

Yeah, so there was a huge uplift in capability when GPT-5.2 came out, and the rate of progress with each additional point release of the model has been much bigger than what came before. GPT-5.2 came out while I was on holiday for winter break. And when I came back, without any additional investment, we were getting one or two PRs more per day per engineer on the team. Which had this very strange effect where folks were looking for ways to do more work with the agent. It was kind of like adding another lane to the highway. It was like, how can I put more cars on this thing right now?

And because we had done so much work to get Codex to operate over those long horizons, the next frontier was to go in parallel. We just blogged yesterday about an agent orchestrator that came out of my team called Symphony, which, once we saw the uplift in capability with 5.2 and invested in this additional orchestrator, allowed us to further 10x the PRs per engineer per week. And that investment in increased parallelization is only a thing we were able to do because we were already confident in the code that was being produced by the agent. Right? If I'm 10xing the number of PRs I am producing a week, I cannot possibly look at all of them. I couldn't possibly look at them all before. So the only way we could do this safely is with these techniques to eliminate classes of misbehavior and have confidence by default in the code that's being produced.

But yeah, we continue to see these huge improvements in autonomy and confidence with each model revision. One of the most interesting things about 5.5 is that with computer use, there's much less wrangling we have to do in order to get the agent to actually see and drive the app. This was a significant point of friction for us early on. We were having to figure out ways to wire up ffmpeg to Docker containers to give Codex, you know, a virtual X server so it could drive the app. This is all a ton of low-level goo, plumbing nonsense that was slow to develop and ultimately undifferentiated. But with Codex now shipping computer use, a tool that's natively post-trained into the model, we get to throw away all that bespoke code, keep the same process of having Codex drive the app, and do it much more efficiently, with better image understanding, the ability to faithfully replicate wireframes and mocks, these sorts of things. Truly amazing.

So you mentioned one of the hardest refactors in your career. You've worked at the regular companies, the Brex, the Stripe, these types of companies. If you are an engineer at one of these and you're looking at this new way of working, what is the new engineering job?

The new engineering job, I think, is to have everybody be staff engineers. And the role of a staff engineer, the primary role, at least for me, is not to be producing all of the code, but to be empowering my team to produce the code. And today it is the case that every single engineer on this planet, modulo token budget, has access to 5, 50, or 5,000 concurrent hands on keyboards. Not being able to leverage that amount of parallelism today is largely a reflection of not having put in the work to harness those resources.

One experiment we're running internally in order to drive toward this amount of parallelism is to tell teams that they have been forced to hire five interns named Codex. Part of their success criteria is judged on how effectively they incorporate these resources into their team. What does it mean to make use of this capacity that the organization has given them? And part of success is being able to use those resources. With access to these lovely models and compute being available to use them, letting those GPUs idle is, in a way, not having hands on keyboards. So my job is to figure out ways to saturate these things, to have them doing productive work, to have them not waste cycles, to have fewer PRs that we throw away because they are high quality by default, to have fewer cycles back and forth in CI, fewer reviewer agents having to repeatedly burn cycles for the same feedback, that sort of thing. It is very much a systems-thinking sort of mindset where, in a way, you're playing Factorio with your codebase and figuring out ways to build the token factory.

You're describing essentially being totally divorced from the code and really focused on using these interns. But I guess a lot of engineers prided themselves on being the best at writing code. That has been their output for so long. I imagine people on your team have struggled with this mental shift. What is the right way to think about it?

This is a thing I have struggled with too. For a long time, because the primary function of the job was to produce code in order to achieve our product and business objectives, it was very easy to fixate on the production of code itself as the job. But ultimately, the reason we work in tech is to deliver product into the world. And initially, having this hard break in the mental model of what it means to do work was a little bit weird on the ego. But where I have gotten to is latching on to the same creativity, the production aspect of things. Ultimately, the thing that motivates me is to get product into the hands of users and see their delight and how they're able to improve their lives with the tools we build. And I'm still able to do that today. I'm able to do a lot more of that today. And, you know, I kind of like puzzles. You can have puzzles in how you structure the code, or you can have puzzles in terms of how you put teams together to produce it. To me, these skills and challenges are pretty transferable. So I think there is a new art to be found, a new craft in terms of how you effectively assemble a virtual team to go solve these things.

Okay. So let's move into the road map.

Your counterpart at B or Citadel or Stripe, they want their PM shipping 100,000 lines of code. What's the Monday morning road map that they need to take so that they can get started right now

to move to this way of working over the next six months?

I'm so looking forward to the Monday morning road map. You have no idea. I was super excited to vibe this slide into existence with Codex because, you know, this is the hook right here. This is why we're talking.

So I think those phases of the code production process that we talked about map pretty clearly to what the Monday morning road map should be.

You need the repository to be legible to the agent. It needs to know all the things that your team does, right? The process of bringing product into the world, from producing code to having good design, has a ton of implicit context around what good looks like: hundreds or thousands of little decisions that we all make by default because we cut our teeth the hard way gaining experience.

But, you know, the agents don't have this, right? Well, they do, but they have every possible permutation of all those thousands of decisions, because they have seen so much code and so much product over the course of their training.

So it's our job to take the specific choices that we as a team would make, write them down, and give them to the agent, starting with this simple technique of putting a map in AGENTS.md on where the agent should look. I really like having a documentation tree as the entry point for making this happen for your codebase, because it's super easy for humans to think about just writing stuff down. And it's also really easy to use some of the tools that we have to do that.

It's very often the case that an agent will produce code we didn't like, or there's a bug in production. We'll have a Slack thread as a team where we discuss what went wrong and how we think it should be done differently, and it's really easy at the end of that to just @mention Codex: you know, update docs X, Y, and Z to reflect these new guardrails, this new taste we have discussed. That kind of removes us entirely from the loop of maintaining the legibility. We just get to talk, and we let Codex maintain it for us.

And this is a way, too, that I think in larger teams you can build a flywheel around improving agent behavior, and build cultural norms and processes around having this happen as a default way of doing the work.

The next step is to make validation cheap. Validation is happening at all parts of the code production pipeline that I talked about: on the review side, looking at those guardrails; over the course of writing the code, with all these lints and tests; and in being able to actually close the loop and exercise the app to see that, yes, the search box that was in the PRD actually appears in the UI, and when I click on it, it shows a pending set of results that are our best guess, and it has very nice type-ahead without any jank in the box when we type. Right?

These are, you know, kind of the squishy, ill-specified parts of the task, but providing proof of work to the reviewer agents and the humans looking at these PRs is necessary in order to build confidence. This is what it takes in order for me to not have to shoulder surf Codex, the same way I'm not going to shoulder surf my teammates in VS Code or Cursor.
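
The first two steps described here, a map in AGENTS.md that points into a documentation tree and validation that is cheap to run, can be combined into one small lint. The sketch below is only an illustration under stated assumptions: the file layout, the link style, and the names `find_broken_map_links` and `build_demo_repo` are hypothetical and do not come from the episode or from any OpenAI tooling. It checks that every local link in the map still points at a file that exists, so the map cannot silently rot.

```python
import re
import tempfile
from pathlib import Path

# Matches markdown links such as [Testing guide](docs/testing.md).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_map_links(repo_root: Path, map_name: str = "AGENTS.md") -> list[str]:
    """Return local link targets in the map file that do not exist on disk."""
    map_text = (repo_root / map_name).read_text(encoding="utf-8")
    broken = []
    for target in LINK_PATTERN.findall(map_text):
        # Skip external URLs and in-page anchors; only local paths are checked.
        if "://" in target or target.startswith("#"):
            continue
        path_part = target.split("#", 1)[0]
        if not (repo_root / path_part).exists():
            broken.append(target)
    return broken


def build_demo_repo(root: Path) -> None:
    """Create a tiny, made-up repository layout (an assumption for the demo)."""
    (root / "docs").mkdir()
    (root / "docs" / "product-features.md").write_text("# Product features\n")
    (root / "AGENTS.md").write_text(
        "# Where to look\n"
        "- [Product features](docs/product-features.md)\n"
        "- [Review guardrails](docs/guardrails.md)\n"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_root = Path(tmp)
        build_demo_repo(demo_root)
        # The guardrails doc was never written, so it is reported as broken.
        print(find_broken_map_links(demo_root))
```

A check like this is cheap enough to run on every PR, which is the point of the step: both the agent and the human reviewer get fast evidence that the written-down guidance is still reachable.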

Finally, it's important to have the users of this harness be more than just the engineers themselves, right? The exercise here is to empower all of the team to write code safely. So having a tight collaboration loop with the other members of the team (product, design, user ops, your DevOps folks) allows for really high-bandwidth feedback between these different roles. It allows you to do more work and different types of work, to extract different domain expertise from the folks on your team, and to reflect it all back into the repository for the benefit of everyone.

One emergent property we had here that I thought was really, really cool: we hired a new engineer onto the team who had formerly been a PM, and he noticed that we had no documentation in the codebase for what product features we had. This is kind of an important thing, because it grounds all the engineers, all the designers, all the product folks, the reviewer agents, everything, in what we are actually trying to achieve. What are we shipping here? And Codex is very good at crawling through a codebase, figuring out what the product features are, writing them down, and surfacing them for human review.

But what fell out of that was a set of QA agents that are constantly booting the app, running it through these critical user journeys, and asserting that things work. Which means that when it comes time to deploy, we have to do less manual smoke testing, because we actually have these just-in-time generated runbooks around what it means to have acceptance testing. And all of that came out of a single person with domain expertise and an idea of what good looks like. The rest of the work we had done made these higher-level correctness verifiers kind of fall out of that one thing.
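
The idea of QA agents that boot the app, walk critical user journeys, and assert that things work can be sketched as a tiny journey runner. This is a minimal illustration, not the team's harness: `FakeSearchApp` is a made-up in-memory stand-in for a booted app, and the journey names, catalog entries, and `run_journeys` helper are all assumptions. A real setup would drive the actual UI instead.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeSearchApp:
    """Stand-in for a booted app; a real harness would drive the actual UI."""
    catalog: list[str]
    visible_results: list[str] = field(default_factory=list)

    def type_in_search_box(self, text: str) -> None:
        # Type-ahead: show every catalog entry that starts with the typed text.
        prefix = text.lower()
        self.visible_results = [c for c in self.catalog if c.lower().startswith(prefix)]


@dataclass
class Journey:
    """One critical user journey: a name plus a check that raises on failure."""
    name: str
    check: Callable[[FakeSearchApp], None]


def search_shows_type_ahead(app: FakeSearchApp) -> None:
    app.type_in_search_box("in")
    assert app.visible_results == ["Invoices", "Integrations"], app.visible_results


def search_with_no_match_is_empty(app: FakeSearchApp) -> None:
    app.type_in_search_box("zzz")
    assert app.visible_results == [], app.visible_results


def run_journeys(make_app: Callable[[], FakeSearchApp], journeys: list[Journey]) -> dict[str, str]:
    """Boot a fresh app per journey and record 'pass' or the failure message."""
    report = {}
    for journey in journeys:
        try:
            journey.check(make_app())
            report[journey.name] = "pass"
        except AssertionError as error:
            report[journey.name] = f"fail: {error}"
    return report


JOURNEYS = [
    Journey("search shows type-ahead results", search_shows_type_ahead),
    Journey("search with no match shows nothing", search_with_no_match_is_empty),
]


def make_demo_app() -> FakeSearchApp:
    return FakeSearchApp(catalog=["Invoices", "Integrations", "Reports"])


if __name__ == "__main__":
    for name, outcome in run_journeys(make_demo_app, JOURNEYS).items():
        print(f"{name}: {outcome}")
```

The report is the proof of work: a reviewer, human or agent, can read one line per journey instead of clicking through the app by hand before a deploy.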

So we're actually full circle, all the way back to the beginning of the discussion. Now that people understand the harness and the Monday morning roadmap, what does the product team of 2027 look like? How are these individual functions working together if everybody has access to the code?

I think we're going to see a lot more prototyping in the world, and a lot more diversity of products as a result.

I think, too, in this new wave of computing with AI and agents, we are currently at the early part of the technology curve, where a lot of the products you see today are low-fidelity text interfaces. And when I think about where I sit in OpenAI, working on Frontier and deploying to businesses in order to get the machine to do the work, there is a fractal nature of work: different surfaces, different ways of working to wedge these models into the nooks and crannies of the hard knowledge work that happens.

So I'm really excited to be able to rapidly experiment with different modalities and different ways of delivering this technology. This is a core part of the OpenAI mission here, right? It is to have wide access to do economically valuable work with these models, with these tools. So I'm excited to increase the rate of experimentation, to increase the diversity of product we see in the world, and to see new and exciting ways we're able to use technology to do great things for users.

All right. So, we're into the little home stretch here. I want to ask you a couple quick questions. Is that okay?

Yeah, sure. Let's do it.

What's one skill a tech lead who's trying to move to this way should build this weekend?

I really think it's important that you figure out a way to get Codex to be able to see your app and drive the business logic. In the absence of any other modularity or investment, the ability to actually close the loop and observe the behavior that you're changing is what it takes in order to know that you have completed the job. I see this all the time, where folks will vibe something up and not be able to attest that it solves the problem they set out to solve. And in order to convince the rest of the team to accept the code in the first place, you've got to be able to prove somehow that the code meets the brief, that it does what it says on the tin. The way to do that is to invest in closed verification loops, the highest leverage of which is the full end-to-end integration test. You click the buttons, observe the side effect, that sort of thing.

It's funny how it comes back to some of the fundamentals, but now applied to AI. What's the biggest mistake engineering leaders are making with AI coding tools right now?

I think putting the tools in a box in your head, as things that must be driven synchronously by a human, is limiting the amount of creativity in terms of deploying these things widely. Ever since GPT-5.2 came out, my belief is that it is as capable as me at doing the full job. And it's sort of my job to figure out ways to have it do as much of my job as possible. A lot of that means I can't be sitting in front of it, because I don't sit in front of my teammates.

So there's this whole metagame of building confidence in the code that these agents are producing, so that we can have it do more and more of the job, execute on longer and longer time horizons, and be more and more independent. And this is necessary in order to unlock that true parallelism, right? That 5,000 engineers' worth of compute that I talked about. I can't possibly do that if I'm locked behind, you know, two tabs in the app or three panes in tmux in my terminal.

Final sort of meaty question here for you. We've been talking about GPT-5.2. At roughly the same time, Opus 4.6 came out, and the whole discussion really changed in that December time period. If a team can only pick one, the Anthropic stack or the OpenAI stack, what's the case for picking the OpenAI one?

Our models are optimized around that full agentic behavior, around being able to do the full job. I am super easily able to delegate entire swaths of work to the GPT-5 series without having to constantly poke and prod it. And this is the fundamental unlock for me that has gotten us to that parallelism. Symphony, this agent orchestrator that we have released, is fundamentally very uncomplicated. It is a thing that advances Linear tickets through states, gives the ticket text to Codex, and expects it to produce a PR. There's very little magic to it. Which is why I think that these GPT series of models that we have are the ones that are going to unlock the true power of agentic software engineering in your organizations.

Additionally, the pace of product delivery on the model capability side, in order to get more and more of that loop closing in place, has just been fantastic. Things like parallel tool calling and background shells in GPT-5.3, ever-increasing autocompaction capability, computer use, and the ability to have multimodal understanding at high fidelity with built-in image generation capabilities in the latest 5.5 mean that what would normally have been two or three people on the team handing off between them can instead be those two or three people encoding all their leverage into one tool, which is then able to do the full job.

There is this kind of neat story that I have been telling: in multi-agent systems, the correct number of agents to optimize toward is not multi, it is one. One agent is able to have full addressability over the entire task and its entire context, which means that doing design, backend, and frontend in a single agent gets you higher-quality results than having these lossy, frictionful handoffs between them. That experience comes from me having built agents for knowledge work with handoffs in them, and from seeing the massively increasing capability of the GPT-5 series over time.

I want a single agent to cook and do the full job, and we have done a fantastic job, I think, at increasing Codex's ability to address all parts of the SDLC in a single agent.
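
The orchestrator described in this answer, something that advances tickets through states, hands the ticket text to a single agent, and expects a PR back, can be sketched in a few lines. To be clear about what is assumed: this is not Symphony's code or API, and it does not call Linear or Codex. The state names, the `Ticket` fields, and `stub_agent` are hypothetical placeholders chosen only to show how little machinery the loop needs.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical ticket states; a real tracker defines its own workflow.
TODO, IN_PROGRESS, IN_REVIEW = "todo", "in_progress", "in_review"


@dataclass
class Ticket:
    ticket_id: str
    text: str
    state: str = TODO
    pull_request: Optional[str] = None


def advance(ticket: Ticket, run_agent: Callable[[str], Optional[str]]) -> Ticket:
    """Move one ticket forward by a single step.

    run_agent receives the ticket text and returns a pull request reference,
    or None if it could not produce one.
    """
    if ticket.state == TODO:
        ticket.state = IN_PROGRESS
    elif ticket.state == IN_PROGRESS:
        pull_request = run_agent(ticket.text)
        if pull_request is not None:
            ticket.pull_request = pull_request
            ticket.state = IN_REVIEW
        # With no pull request, the ticket stays in progress for a retry.
    return ticket


def stub_agent(ticket_text: str) -> Optional[str]:
    """Placeholder for a coding agent; returns a fake PR reference."""
    if not ticket_text.strip():
        return None
    return f"PR for: {ticket_text}"


if __name__ == "__main__":
    ticket = Ticket("T-1", "Add a search box")
    advance(ticket, stub_agent)  # todo -> in_progress
    advance(ticket, stub_agent)  # in_progress -> in_review, PR attached
    print(ticket.state, "|", ticket.pull_request)
```

The design choice mirrors the single-agent argument: the orchestrator carries no task knowledge of its own, so the whole ticket goes to one agent with full context instead of being split across handoffs.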

Wow. There was so much information in there. If I want to get more of Ryan, where can I find you online?

I tweet a lot on X. My handle is Laoplo. Me and the team should have more content coming on the OpenAI blog. And I am doing a bunch of conference touring over the summer. I am next going to be at AI DevCon in London. We'll see you there.

All right, guys. Follow him on Twitter. Check out his amazing blog post on the OpenAI blog, and we'll see you guys in the next episode.

Been a pleasure, Aakash. Thank you.

Thank you. I hope you enjoyed that episode. Couple things you can do to support the show. One, comment. Two, review. Those ratings and reviews really help other people understand the value and the production that we are putting into this. Right? This wasn't an easy episode to produce. We put in a ton of pre-work. We edited it for you. We brought in the best guests. If you don't mind sharing a rating and review, sharing the episode with others, making sure you are subscribed, that really helps the show do bigger and better productions. I'll see you in the next episode. Here is one of those that YouTube thinks would be a great fit for
