How Claude Code Works - Jared Zoneraich, PromptLayer
By AI Engineer
Summary
Topics Covered
- Simple Loop Beats Complex Scaffolding
- Bash Is Universal Coding Tool
- To-Dos Enable Model Steerability
- Ditch DAGs for Model Reliance
- No Single Best Coding Agent
Full Transcript
So, welcome to the last workshop. You made it, congrats. Out of like 800 people, you're the last ones standing, the very dedicated engineers. So, this one's a weird one. I got in trouble with Anthropic on this one, obviously because of the title. I actually gave him the title and asked if he wanted to change it, and he said no, he'd just roll with it. It's kind of funny. So yeah, this is not officially endorsed by Anthropic, but we're hackers, right? And Jared is super dedicated. The other thing I really enjoy is featuring notable New York AI people. So don't take this as the only thing Jared does; he has a whole startup that you should definitely ask him about. I'm just really excited to feature more content from local people. So yeah, Jared, take it away.
>> Thank you very much. And what an amazing conference. I'm very sad we're ending it, but hopefully this will be a good ending. My name is Jared, and this will be a talk on how Claude Code works. Again, not affiliated with Anthropic; they don't pay me. I would take money, but they don't. We're going to talk about a few other coding agents as well. The high-level goal I'll go into: I'm personally a big user of all the coding agents, as is everyone here. They kind of exploded recently, and as a developer I was curious what changed, what finally made coding agents good. So let's get started. I'll start with me.
You can find me as Jared Z on X, on Twitter, whatever. I'm building the workbench for AI engineering. My company is called PromptLayer, and we're based in New York. You can kind of see our office here; it's a little building, so it's blocked by a few of the other buildings. We're a small team. We launched the product three years ago, which is a long time in AI but young for everything else. Our core thesis is that we believe in rigorous prompt engineering and rigorous agent development, and we believe the product team should be involved along with the engineering team. If you're building AI lawyers, you should have lawyers involved as well as engineers. That's kind of what we do. We're processing millions of LLM requests a day, and a lot of the insights in this talk come from conversations we have with our customers on how to build coding agents and things like that. Also, feel free to keep this casual throughout; if anything I say raises a question, just throw it in. I spend a lot of my time dogfooding the product. The job of a founder these days is weird: it's half kicking off agents and half using my own product to build agents. It feels weird, but it's kind of fun. The last thing I'll add here is that I'm a big enthusiast. We literally rebuilt our engineering org around Claude Code. The hard part about building a platform is that you have to deal with all these edge cases ("uploading data sets here doesn't work") and you can die a death by a thousand cuts. So we made a rule for our engineering organization: if you can complete something in less than an hour using Claude Code, just do it; don't prioritize it. We're a small team on purpose, but it's helped us a lot, and I think it's really taken us to the next level. So I'm a big fan. Let's dive into how these things work.
So this is, as I was saying, the goal of this talk. First, why have these things exploded? What was the innovation, the invention, that made coding agents finally work? If you've been around this field for a little bit, you know that a lot of these autonomous coding agents sucked at the beginning, and we all tried to use them. But it's night and day now. We'll dive into the internals, and lastly, everything in this talk is oriented around how you build your own agents and how you use this to do AI engineering for yourself.
So let's talk about history for a second. How did we get here? Everybody knows it started with the workflow of copying and pasting your code back and forth with ChatGPT, and that was great; that was kind of revolutionary when it happened. Step two, when Cursor came out, if we all remember, it was not great software at the beginning. It was just a VS Code fork with Command-K, and we all loved it. But now we're not doing Command-K anymore. Then we got the Cursor assistant, that little agent going back and forth, and then Claude Code. And honestly, in the last few days since I made this slide, maybe there's a new version we could talk about here. At the end I'll talk about what's next. But this is how we got here, and Claude Code is kind of this headless, new workflow of not even touching code. And for that, it has to be really good.
So why is it so good? What was the big breakthrough here? Let's try to figure that out. And I'll throw this in one more time: these are all my opinions on what the breakthrough was. Maybe there are other things, but: simple architecture. I think a lot of things were simplified in how the agent was designed. And then better models, better models, and better models. A lot of the breakthrough is kind of boring, in that it's just Anthropic releasing a better model that works better for these kinds of tool calls. But the simple architecture relates to that, so we can dive into it. You'll see the prompt wrangler, our company's little mascot, throughout; we made a lot of graphics for these slides. But basically, "give it tools and then get out of the way" is the one-liner for the architecture today. If you've been building on top of LLMs for a little while, this has not always been true. Obviously tool calls haven't always existed; tool calling is kind of a new abstraction over JSON formatting, if you remember GitHub libraries like Jsonformer from the olden days. But give it tools and get out of the way. The models are built for these things and are being trained to get better and better at tool calling. Every engineer, especially myself, loves to over-optimize, and when you first have an idea for how to build the agent, you're going to sit down and say, "I'm going to prevent this hallucination with this prompt, and then this prompt, and then this prompt." Don't do that. Just run a simple loop and get out of the way. Delete scaffolding: less scaffolding, more model is the tagline here. And this is the leaderboard from this week.
Obviously, these models are getting better and better. We could have a whole conversation, and I'm sure there have been many, about whether it's slowing down or plateauing. It doesn't really matter for this talk. We know they're getting better, getting better at tool calling, and getting better optimized for running autonomously. And, I think Anthropic calls this the AGI-pilled way to think about it: don't try to over-engineer around model flaws today, because a lot of them will just get better and you'll be wasting your time. So here's the philosophy of Claude Code, the way I see it: ignore embeddings, ignore classifiers, ignore pattern matching. We had this whole RAG thing (actually, Cursor is bringing back a little bit of RAG, mixing and matching in how they do it). But the genius of Claude Code is that they scrapped all of this and said: we don't need all these fancy paradigms to get around the ways the model is bad. Let's just make a better model and let it cook. They lean on tool calls, and they simplify the tool calls, which is a very important part. Instead of a workflow where the master prompt can break into three different branches and then go into four different branches, there are really just a few simple tool calls, including grep instead of RAG. And that's what the model is trained on, so these are very optimized tool-calling models.
This is the Zen of Python, if you're familiar with it, if you run "import this" in Python. I love this philosophy when it comes to building systems, and I think it's really apt for how Claude Code was built. Simple is better than complex; complex is better than complicated; flat is better than nested. This is the whole talk. This is all you need to know about how Claude Code works and why it works: we're going back to engineering principles, where simple design is better design. I think this is true whether you're building a database schema or building these autonomous coding agents. So now I'm going to break down the specific parts of this coding agent and why I think they're interesting.
The first is the constitution. A lot of this stuff we take for granted now, even though it only started a month or two ago, maybe three or four months. This is the CLAUDE.md; Codex and others use AGENTS.md. I assume most of you know what it is: it's where you put the instructions for your repository. But the interesting thing about it is that it's basically the team saying we don't need to over-engineer a system where the model first researches the repo. Cursor 1.0, as you know, builds a vector DB locally to understand the repo and does all this research. They're just saying: put a markdown file there, let the user change things when they need to, let the agent change things when it needs to. Very simple, and it goes back to prompt engineering, which I'm a little biased toward because PromptLayer is a prompt engineering platform, but everything is prompt engineering at the end of the day, or context engineering. Everything is: how do you adapt these general-purpose models for your usage? And the simplest answer is the best one here, I think.
So this is the core of the system. It's just a simple master loop, and that's actually kind of revolutionary considering how we used to build agents. Everything in Claude Code, and in all the coding agents today, Codex and the new Cursor and Amp and all of them, is just one while loop with tool calls: run the master while loop, call the tools, and go back to the master while loop. It's basically four lines. I think they call it nO internally, at least based on my research. While there are tool calls, run the tool, give the tool results to the model, and do it again; when there are no tool calls left, ask the user what to do. The first time I used tool calls, it was very shocking to me that the models are so good at knowing when to keep calling the tool and when to fix their mistake. I think that's one of the most interesting things about LLMs: they're really good at fixing mistakes and being flexible. And the more you lean on the model to explore and figure things out, the better and more robust your system is going to be as models get better.
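As a sketch, the loop looks roughly like this. This is not Anthropic's actual code: call_model stands in for any LLM API that returns text or tool calls, and a single bash tool is just the smallest useful tool set.

```python
import subprocess

def bash_tool(command: str) -> str:
    """The 'bash' tool: run a shell command and return its output."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"bash": bash_tool}

def master_loop(call_model, messages: list[dict]) -> str:
    # While the model keeps requesting tools, run them and feed results back.
    while True:
        reply = call_model(messages)                # one model turn
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:
            return reply["text"]                    # done: hand back to the user
        for call in tool_calls:
            result = TOOLS[call["name"]](**call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})    # tool result goes back in
```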
So these are the core tools we have in Claude Code today. To be honest, these change every day; they're doing new releases every few days, but these are the ones I found most interesting to talk about. There could be fifteen tomorrow, or it could be down to five, but this is what I find interesting. First of all, Read. Yeah, they could just do a cat, but what's interesting about Read is that we have token limits. If you've used Claude Code a lot, you've seen it sometimes say a file is too big; that's why it's worth building this Read tool. Grep and Glob. These are very interesting because they go against a lot of the wisdom at the time of using RAG and vectors. I'm not saying RAG has no place, by the way, but in these general-purpose agents, grep is good, and grep is how users would do it. And I think that's actually a high-level point here: as I talk about these tools, remember that these are all human tasks. We're not making up a brand-new tool for the model to use; we're mimicking the human actions, what you and I would do at a terminal trying to fix a problem. Edit. Edit makes sense. The interesting thing to note about Edit is that it uses diffs; it's not rewriting files most of the time. Way faster, way less context used, but also way fewer issues. If I gave you these slides and asked you to review them, and you had to write out all the slides with your new revisions, versus just crossing things out on the paper, the crossing out is way easier. Diffs are kind of a natural way to prevent mistakes. Bash. Bash is the core thing here. I think you could probably get rid of all these tools and have only Bash. The first time I saw Claude Code create a Python file, run the Python file, then delete the Python file, that was the beauty of why this thing works. So Bash is the most important. WebSearch and WebFetch. The interesting thing about these is that they move the work to a cheaper and faster model. So for example, if you're building some sort of agent on your platform and it needs to connect to some list of endpoints, it might be worth bringing that into a kind of sub-tier, as opposed to the master while loop. That's why it's its own tool. To-dos: we've all seen to-dos; I'll talk about them a little more later, but they keep the model on track, steerability. And then Task. Task is very interesting. It's context management: how do we run a long process, read a whole file, without cluttering the context? Because the biggest enemy here is that when your context is full, the model gets stupid, for lack of a better word.
So basically, Bash is all you need. This is the one thing I want to drill down on. There are two amazing things about bash for coding agents. The first is that it's simple and it does everything; it's very robust. The second thing, which is equally important, is that there's so much training data on it, because it's what we use. It's the same reason models are not as good at Rust or less common programming languages: there are fewer people doing it. So it's really the universal adapter. There are thousands of tools; you can do anything. This is that Python example I gave: I always find it so cool when it does the Python script thing, or creates tests (and I always have to tell it not to). All the shell tools are in it. I find myself using Claude Code to spin up local environments where normally I'd have five commands written down in a file somewhere that gets out of date. It's really good at figuring this stuff out and running the things you'd want to run. And it specifically lets the model try things.
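For instance, the create-run-delete pattern he mentions would show up in a session as something like this (a made-up example of what the agent runs through its bash tool):

```bash
# Write a throwaway script, run it, clean it up: all through one bash tool.
cat > /tmp/inspect_config.py <<'EOF'
import json, sys
cfg = json.load(open(sys.argv[1]))
print(sorted(cfg.keys()))
EOF
python /tmp/inspect_config.py config.json
rm /tmp/inspect_config.py
```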
The other suggestions here are around tool usage. I think there's a bit of system prompt that tells it which tool to use and when, and this changes a lot, but these are the edge cases and corners you find the model getting stuck in. Reading before editing: they actually make it do that. Using the Grep tool instead of bash grep: if you look at the tool list, there's a special Grep tool. There could be a lot of reasons for that; I think security is a big one, and sandboxing, but also the token limit thing. Running independent operations in parallel: they push the model to do that more. And then trivial things like quoting paths with spaces. It's just the common stuff. I'm sure they dogfood a lot at Anthropic, and when they find something, they go, "all right, we'll throw it in the system prompt."
system prompt." Okay, so let's talk about to-do lists.
Uh, now again, a very common thing, but was not a common thing before. The the
So this is actually I think a to-do list for from some of my my research for this slide deck. Um, but the really
slide deck. Um, but the really interesting thing about to-do lists is that they're structured but not
structurally enforced. So, here are the
structurally enforced. So, here are the rules. One task at a time. Uh, mark them
rules. One task at a time. Uh, mark them completed. This is kind of stuff you
completed. This is kind of stuff you would expect. Uh, keep working on the in
would expect. Uh, keep working on the in progress if there's block blocks or errors and kind of break up the tasks into different instructions. But the
most interesting thing to me is it's not enforced deterministically. It's purely
enforced deterministically. It's purely prompt based. It's purely in the system
prompt based. It's purely in the system prompt. It's purely because our models
prompt. It's purely because our models are just good at instruction following now. And this would not have worked a
now. And this would not have worked a year ago. This would not have worked two
year ago. This would not have worked two years ago. Um there's tool descriptions
years ago. Um there's tool descriptions at the top of the system prompt. We're
kind of uh injecting the todos into the system prompt. uh there's they're not
system prompt. uh there's they're not but it but it's not enforced in actual code and again uh maybe there's other agents that take an opposite path. Uh I
just found this pretty interesting that this at least as a user makes a big difference and it doesn't even see it seems it was it seems like it was very simple to implement almost a a weekend
project someone did and seemed to work.
I could be wrong about that as well, but so yeah, it's literally a function call. The first time you ask for something, the reasoning exports this to-do block, and I'll show you the structure on the next slide. There are IDs there, some kind of structured schema and determinism, but it's just injected. So here's an example of what it could look like: you get a version, you get your ID, a title for the to-do, and then it can actually inject evidence, which is seemingly arbitrary blobs of data it can use. The IDs are hashes that it can then refer to; the title is something human-readable. It's just another way to structure the data. In the same way that you organize your desk when you work, this is how we're trying to organize the model.
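Roughly, a to-do block in that shape might look like this; the field names are reconstructed from the description above, so treat the exact schema as an assumption:

```json
{
  "version": 2,
  "todos": [
    {
      "id": "f3a91c",
      "title": "Reproduce the failing test",
      "status": "completed",
      "evidence": "pytest: 1 failed, 42 passed (test_loader.py::test_empty)"
    },
    { "id": "8b07de", "title": "Patch the empty-file edge case", "status": "in_progress" },
    { "id": "c152aa", "title": "Re-run the suite and update the docs", "status": "pending" }
  ]
}
```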
So I think these are the four benefits we're getting. We're forcing it to plan. We get to resume after crashes, because Claude Code does fail. And I think UX is a big part of this: as a user, you know how it's going; it's not just running off in a loop for 40 minutes without any signal to you. So UX is non-negligible. Even if UX doesn't make it a better coding agent, it makes it better for all of us to use. And then steerability. So here are two other parts that are under the hood. The async buffer, which they call h2A: it's the I/O process, how to decouple it from reasoning, and how to manage context so you're not just stuffing everything you see in the terminal back into the model. Again, context is our biggest enemy here; it makes the model stupider. So we need to be a little bit smart about how we do compaction and summarization. Here you see that when it reaches capacity, it kind of drops the middle and summarizes the head and the tail; that's the context compressor. What's the limit, 92%? It seems like something like that. And how does it save long-term storage? That's actually another advantage of bash, in my opinion, and of having a sandbox. I'd even make a prediction here: all your ChatGPT windows, all your Claude windows, are going to come with a sandbox in the near future. It's just so much better, because you can store long-term memory. I do this all the time: I have Claude Code skills for deep research and things like that, and I'm always instructing it to save markdown files, because the shorter the context, the quicker it is and the smarter it is.
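As a toy version of the compaction idea: the ~92% trigger is from the talk, but the split sizes and the choice to keep the opening and recent turns verbatim while summarizing the dropped middle are my assumptions.

```python
def compact(messages: list[dict], summarize, used_tokens: int, max_tokens: int) -> list[dict]:
    """Shrink the transcript once it nears the context limit."""
    if used_tokens < 0.92 * max_tokens:     # ~92% trigger mentioned in the talk
        return messages
    head, middle, tail = messages[:2], messages[2:-6], messages[-6:]
    # Replace the bulky middle with a one-message recap; keep the opening
    # instructions and the most recent turns verbatim.
    recap = {"role": "system",
             "content": "Summary of earlier conversation: " + summarize(middle)}
    return head + [recap] + tail
```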
So this is what I'm most excited about: we don't need DAGs like this. I'll give you a real example. Some users at PromptLayer build different agents, like customer support agents, and basically everybody was building DAGs like this for the last two, two and a half years. And it was crazy: hundreds of nodes of "okay, if this user wants a refund, route them to this prompt; if they want this, route them there," and a lot of classifying prompts. The advantage is that you can kind of guarantee there won't be hallucinations, or guarantee there won't be refunds to people who shouldn't get refunds. It also solves the prompt injection problem, because if you're in a prompt that purely classifies as X or Y, injecting doesn't really matter, especially if you throw out the context. Now we kind of bring back that attack vector, but the major benefit is we don't have to deal with this web of engineering madness. It's 10x easier to develop these things, 10x more maintainable, and it actually works way better, because our models are just good now.
So this is kind of the takeaway: rely on the model. When in doubt, don't try to think through every edge case and every if statement; just rely on the model to explore and figure it out. Actually, two days ago, or yesterday, sometime this week, I was doing an experiment on our dashboard, trying these browser agents. I wanted to see if adding little titles to all our buttons would help the agent navigate our website automatically. And it actually made it worse, surprisingly. Maybe I could run it again, maybe I did something wrong with the test, but it made the agent navigate PromptLayer worse, because it was getting distracted: I was telling it "you have to click this button, then you have to click this button," and it didn't know what to do. So it's better to rely on exploration. You have a question?
>> Yeah, I'll push back a little bit.
>> Please.
>> I'll admit that any scaffolding we create today to resolve the idiosyncrasies of current limitations will be obsolete in three to six months. But even if that's the case, it helps a little bit today. How do you balance that wasted engineering against solving a problem we only have for three months?
>> It's a great question. Just to repeat it: what is the trade-off between solving the actual problems we have today versus relying on the model, which can't do it yet but will be able to in three months? It's case by case; it depends what you're building. If you're building a chatbot for a bank, you probably do want to be a little more careful. To me, the happy middle ground is to use this agent paradigm of a master while loop and tool calls, but make your tool calls very rigorous. I think it's okay to have a tool call that looks like this, or half of this, in the same way that Claude Code uses Read as a tool call or Grep as a tool call. For the edge cases, throw them into a structured tool that you can then eval and version. I'll talk a little more about that later, but throw it in the structured tool. For everything else, the exploration phase, leave it to the model, or throw in some system prompt. So it's a trade-off, and it's very use-case dependent, but I think it's a good question. Thank you.
So yeah, back to Claude Code. We're getting rid of all this stuff. We're saying we don't want ML-based intent detection. We don't want ReAct; I mean, it uses ReAct a little bit, but we don't want ReAct baked into it. We don't want classifiers. For a long time we actually built a product at PromptLayer, never released, only a prototype, for using an ML-based, non-LLM classifier in your prompt pipeline instead of LLMs. A lot of people have had success with that, but it feels more and more like it's not going to be helpful unless cost is a huge concern for you. And even then, the cost of the smaller models keeps going down as the financial engineering between all these companies pays for our tokens. Claude also does this smart thing, I think, with the trigger phrases: you have think, think hard, think harder, and ultrathink, which is my favorite. This lets us use the reasoning token budget as another parameter. The model can adjust this, but this is how we force it to adjust. Alternatively, you could make a tool call for hard planning (there are some coding agents that do this), or you can let the user specify it and change it on the fly.
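Conceptually, the trigger phrases map to a budget something like this; the phrases are the ones he lists, but the token numbers here are invented for illustration:

```python
# Map trigger phrases in the user's message to a reasoning-token budget.
# Strongest phrases come first so "ultrathink" wins over plain "think".
BUDGETS = {
    "ultrathink": 32_000,
    "think harder": 16_000,
    "think hard": 8_000,
    "think": 4_000,
}

def reasoning_budget(user_message: str) -> int:
    text = user_message.lower()
    for phrase, tokens in BUDGETS.items():
        if phrase in text:
            return tokens
    return 0   # default: no extended thinking
```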
So this is one of the biggest topics here: sandboxing and permissions. I'm going to be completely honest, it's the most boring part of this to me, because I just run it in YOLO mode half the time. Some people on our team have actually dropped their local databases, so you do have to be careful. We don't run YOLO mode with our enterprise customers, obviously, but this stuff feels like it's going to be solved; still, we need to know how it works a little. There's a big issue of prompt injection from the internet: if you're connecting an agent that has shell access and you're doing web fetches, that's a pretty big attack vector. So there's some containerization of that, and there's URL blocking; you can see Claude Code is pretty annoying about "can I fetch from this URL, can I do this?" And it puts fetches into a sub-agent. Most of the complex code here is in this sandboxing and permission set. There's a whole pipeline to gate bash commands: depending on the prefix, a command goes through the sandboxing environment differently. A lot of the other agents work differently here, but this is how Claude Code does it. I'll explain the other ones later at the end.
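In spirit, the prefix gate works something like this; the prefix lists and outcomes here are illustrative, not Claude Code's actual policy:

```python
# Decide how to treat a bash command based on its prefix.
AUTO_ALLOW = ("git status", "git diff", "ls", "cat", "grep")
ALWAYS_ASK = ("rm ", "git push", "curl ", "pip install")

def gate_command(command: str) -> str:
    cmd = command.strip()
    if cmd.startswith(AUTO_ALLOW):
        return "run"             # safe, read-only prefixes run directly
    if cmd.startswith(ALWAYS_ASK):
        return "ask_user"        # destructive or network-touching: confirm first
    return "sandboxed_run"       # everything else runs in the restricted sandbox
```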
The next topic of relevance here is sub-agents. This goes back to context management, and this problem we keep returning to: the longer the context, the stupider our agent gets. This is an answer to it: use sub-agents for specific tasks. The key with a sub-agent is that it has its own context and feeds back only the results; that's how you keep the main context uncluttered. So we've got, and these are just four examples, a researcher, a docs reader, a test runner, and a code reviewer. In that example I was talking about earlier, when I added all the tags to our website to let the agent navigate it better, I obviously used a coding agent to do it, and I said "read our docs first and then do it." It does that in a sub-agent and feeds the information back. The key thing here is the forking of the agent and how we aggregate results back into the main context. So here's an example. I think this is actually very interesting, and I want to call out a thing or two. Task is what a sub-agent is. We give Task two things: a description and a prompt. The description is what the user is going to see, so you'd say something like "task: find default chat context instantiation." And the prompt is a long string, which is really interesting, because now we have the coding agent prompting its own agents. I've actually used this paradigm in agents I've built for our product. The agent can stuff as much information as it wants into this string, and, going back to relying on the model, if the task returns an error, it can stuff in even more information and let it solve the problem. It's better to be flexible rather than rigid.
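A hypothetical Task call in that shape; both fields are invented for illustration:

```json
{
  "tool": "Task",
  "input": {
    "description": "Find default chat context instantiation",
    "prompt": "Search this repo for where the default chat context is created. Try grep -rn 'ChatContext(' src/ first. Report the file, line number, and the default argument values. Return only your findings, not full file contents."
  }
}
```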
If I were building this, I would consider switching that string to an object, depending on what you're building, and let it return more structured data. Yes?
>> I can see this prompt has quite a couple of sentences. Is that in the main agent? Is that taking the context of the main agent, or is there some intermediate step where the sub-agent reads over what the main agent is doing and then generates it?
>> Right. So the question is: does the task just get the prompt here, or does it also get your chat history? Is that the question?
>> The question is: I have my main agent. Is all of this in the system prompt of the main agent, to inform how it prompts the sub-agent?
>> No, it's not in the system prompt.
>> It's in the whole context? Is all of this in the context of the main agent, the task it calls? Or are you saying the structure for the task, this whole JSON?
>> Yes. So this is a tool call. The tool-call structure of what a Task is lives in the main agent, and then these are generated on the fly: as it wants to run a task, it generates the description and the prompt. Task is a tool call, they can be run in parallel, and then they return their results. Hopefully that helps.
So we can go back to the system prompt. There are some leaks of the Claude Code system prompt; that's what I'm basing this on, and you can find it online. Here are some things I noted from it. Concise outputs: obviously, don't give anything too long; no "here is" or "I will", just do the task the user wants. Pushing it to use tools more, instead of text explanations: we've all built coding agents, and when we do, the model usually says, "hey, I want to run this SQL." No, push it to use the tool. Matching the existing code. Not adding comments (this one does not work for me). Running commands in parallel extensively, and then the to-dos and so on. There's a lot you can nudge it to do with the system prompt. And I think there's a really interesting point here for the earlier question about the trade-off between DAGs and loops. A lot of these feel like they came from someone using Claude Code and saying, "oh, if only it did this a little less," or "if only it did this a little more." That's where prompting comes in, because it's so easy to iterate; it's not a hard requirement, but it's okay to say "if only" sometimes. All right, skills.
Skills are great. They're slightly newer; I honestly got convinced of them only recently. So good. I built these slides with skills. Basically, in the context of this talk about architecture, think of a skill as an extendable system prompt. In the same way that we don't want to clutter the context, there are a lot of different kinds of tasks where you want a lot more context, and this is how we give Claude Code a few options for tapping into more information. Here are some examples. I have a skill for docs updates that tells it my writing style and my product; if I want to do a docs update, I say use that skill, load it in. Editing Microsoft Office, Word and Excel: I don't use this, but I've seen a lot of people using it; it kind of decompiles the file, it's really cool, and it lets Claude Code do it. A design style guide is a common one. Deep research: the other day I threw in an article, or a GitHub repo, on how deep research works, and I said rebuild this as a Claude Code skill. It works so well, it's amazing.
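For a sense of the shape, here's a hypothetical docs-updates skill file; the path and contents are invented, so check Anthropic's docs for the exact format:

```markdown
---
name: docs-updates
description: Use when updating product documentation. Applies our docs writing style.
---

# Docs updates

- Write in second person, present tense, no marketing language.
- Keep code samples under 20 lines and runnable.
- Every new page must be linked from docs/index.md.
```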
So, unified diffing. I think this is worth its own slide. It's probably obvious and there's not too much to say, but it makes this so much better: it keeps token usage down, it makes it faster, and it makes it less prone to mistakes, like the example I gave of rewriting an essay versus marking it up with a red line. It's just better. I highly recommend using diffing in any agents you're building. Unified diff is a standard. When I looked into a lot of these coding agents, some actually built their own slight variations on unified diff, because you don't always need the line numbers, but unified diff works.
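For reference, a unified-diff edit looks like this; the file and the change are invented:

```diff
--- a/src/parser.py
+++ b/src/parser.py
@@ -41,4 +41,4 @@ def parse_header(line):
     if not line:
         return None
-    key, value = line.split(":")
+    key, value = line.split(":", 1)
     return key.strip(), value.strip()
```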
You had a question?
>> To go back to skills: I don't know if anyone's seen this, but Claude Code warns you in yellow text if your CLAUDE.md is greater than 40k characters. So I figured, okay, let me break this down into skills. I spent some time on it, and then Claude ignored all of my skills, so I put them back. So what am I missing? Skills feel globally misunderstood, or I'm missing something. Help me understand.
>> Yeah. So the question was: Claude Code tells you when your CLAUDE.md is too long, so you move it into skills, and then it's not recognizing the skills and not picking them up when needed.
>> Yeah.
>> Take that up with the Anthropic team, I'd say. But that's also a good example of maybe the system prompt...
>> I thought that was the intention: with skills, you need to invoke them, and the agent itself shouldn't just call them all the time.
>> Right. It does give a description of each skill to the model; it should get a one-liner about each skill. So theoretically, in a perfect world, it would pick up the right skills all the time. But you're right, I generally have to call the skill myself manually. I think this is a good tie-back to when prompting is the right solution, when the DAG is the right solution, or whether this is a model training problem. Maybe they need to do a little more in post-training to get the model to call skills the way it calls a tool; you have to know when to call it. So maybe this is just functionality that's not that good yet. I think the paradigm is very interesting, but it's not perfect, as we're learning.
So, diffing we just talked about. What's next? This is more opinion-based: where I see these things going, and where the next innovations are likely to be. I think there are two schools of thought here. A lot of people think we're going to have one master loop with hundreds of tool calls, and tool calling is just going to get much better. That's highly likely. I take the alternate view, which is that we need to reduce the tool calls as much as possible, go back to just bash, and maybe even put scripts in the local directory. I'm a proponent of one mega tool call instead of a lot of tool calls. Maybe not actually one; I think that slide I showed you before is probably a good list. But a lot of people think we need hundreds of tool calls, and I just don't think it's going there. Adaptive budgets: adjusting reasoning. We do this a little bit with think and ultrathink and so on, but I think reasoning models as a tool makes a lot of sense as a paradigm. A lot of us would make the trade-off of a 20-times-quicker model with slightly stupider results, plus the ability to call a very good model as a tool. That's a trade-off we'd make in a lot of cases. Maybe not for our planner; maybe we go to the planner first with GPT-5.1 Codex or Opus, or whatever, when the new Opus comes out. But I think there's a lot of mixing and matching we can do, and that's the next frontier. And the last frontier: I think there's a lot we can learn from to-do lists and from new first-class paradigms we can build. Skills are another example of a first-class paradigm we can try to build in; maybe it doesn't work perfectly, but I think there are a lot of new discoveries to be made there, in my opinion. Do I have them? I don't know. So now, for the latter part of this talk, I want to talk about the other frontier agents and the design philosophies they've chosen. We all have the benefit of being able to mix and match: when we're building our own agents, we can do whatever we want and learn from the best, and the frontier labs are very good at this. So, here's something I like to go back to a lot.
I call it the AI therapist problem; maybe there's a better name for it. I believe that for a lot of problems, the most interesting AI problems around, there isn't a global maximum. Meaning: all right, we're in New York City. If I need to see a therapist, there are six on every block here. There's no global answer for what the best therapist is. There are different strategies: a therapist that does meditation, or CBT, or maybe one that gives you ayahuasca. These are just different strategies for the same goal, in the same way that if you're building an AI therapist, there isn't a global maximum. This is kind of my anti-AGI take, but it's also the take that says when you're building these applications, taste comes into it a lot, and design architecture matters a lot. You can have five different coding agents that are all amazing. Nobody knows which one is best today, to be honest. I don't think Anthropic knows. I don't think OpenAI knows. I don't think Sourcegraph knows. Nobody knows whose is best, but some are better at some things. I personally like Claude Code for, as I said, running my local environment, or using git, these kinds of human actions that require back and forth; but I go to Codex for the hard problems, or I go to Composer from Cursor because it's faster. All this to say: there's value in having different philosophies here. And I don't think there's going to be one winner; there will be different winners for different use cases. And this is not just coding agents, by the way, this is all AI products. This is why our whole company focuses on domain experts, on bringing the PM and the subject matter expert into it, because that's how you build defensibility.
So here are the perspectives, the way I see them. This is not a complete list of coding agents, but these are the ones I think are the most interesting. Claude Code, to me, wins on user-friendliness and simplicity. Like I said, if I'm doing something that touches a lot of applications (git is the best example; if I want to make a PR, I'm going to Claude Code), it's really good at context management, and it feels powerful. Do I have evidence to show you that it's more powerful? Probably not, but it feels that way to me, and to the market. There's a whole other conversation about whether the market knows best and whether what people talk about knows best, but I don't know if they know either. Cursor's IDE is kind of the model-agnostic perspective, and it's faster. Factory makes Droid; great team, they were here too. They have multiple specialized droid sub-agents, so that's their edge, and maybe that's a DAG conversation too, or maybe model training. Cognition, so Devin: kind of end-to-end autonomy and self-reflection. Amp, which I'll talk about more in a second: they have a lot of interesting perspectives, and I actually find them very exciting these days. It has a free tier, it's model-agnostic, and there's a lot of UX sugar for users. I love their design and their talks at this conference; they have very unique perspectives.
So let's start with Codex, because it's a popular one. It's pretty similar to Claude Code: the same master while loop, which most of these have, because that's just the winning architecture. Interestingly, it has a Rust core. The cool thing is it's open source, so you can actually use Codex to understand how Codex works, which is kind of what I did. It's a little more event-driven; more work went into concurrent threading here, with submission queues and event outputs, the kind of thing I was talking about with the I/O buffer in Claude Code. I think they do it a little bit differently. Sandboxing is very different: theirs is more kernel-based, you can see macOS Seatbelt and Linux Landlock here. And state: it's all under threading. Permissions is where it's mostly different, and then the real difference is the model, to be honest. This is actually me using Claude Code to understand how Codex works. You can see we have a few Explore calls; I didn't talk about Explore, but it's another sub-agent type, and as I mentioned, these go in and out. But yeah, this is researching Codex with Claude Code; it's always a fun thing to do.
So let's talk about Amp. This is Sourcegraph's coding agent. It has a free tier; that's just a cool perspective, in my opinion. They leverage excess tokens from providers, and they give ads; we actually ran an ad on them. I'm pro-ad (a lot of people are anti-ad, it's one of my hot takes), but I like it. They don't have a model selector. This is very interesting too; it's its own perspective. It actually helps them move faster, because you have less of an exact expectation of what the output is, since they might be switching models here and there. So that changes how they develop. And then I think their vision is pretty interesting: how do we build not just the best agent, but the agent that works in the most agent-friendly environments? Factory gave a talk similar to this as well: how do you build a hermetically sealed coding repo that the agent can run tests on? How do you build the feedback loop? That's kind of the holy grail; that's how we build an autonomous agent. I'd love to see the front-end version of this: let it look at its own design, make it better, and go back and forth. This is their guiding philosophy, and you could boil it down to the agent perspective, as I've been calling it.
I think they do interesting stuff with context. We're all familiar with compact. It's the worst. You have to wait, and I don't know why it takes so long. If you're not familiar, it summarizes your chat window when the context gets too high and gives you the summary. They have something called handoff instead, which makes me think of Call of Duty, if anyone was a player back in the day: switching weapons is faster than reloading. That's what handoff is: you're just starting a new thread and giving it the information it needs for that new thread. That feels like the winning strategy to me. I could be wrong, and maybe you need both, but that's where they're pushing it, and I kind of like it. They give a very fresh perspective. The second thing is model choice. This is the reasoning knobs and their view on it: they have fast, smart, and Oracle. So they lean even more heavily into "we have different models." We're not telling you what Oracle is (they do tell you), but they're willing to switch what Oracle is, and they're going to use Oracle when there's a very hard problem.
So, that's Amp. Let's go to Cursor's agent, which I think has a very interesting perspective. First, obviously, it's UI-first, not CLI (I think they might have a CLI, I'm not entirely sure, but the UI is the interesting part). It's just so fast. Their new model, Composer, is distilled; they have the data. In my opinion, they made people interested in fine-tuning again. We'd almost never recommend fine-tuning to our customers, but Composer shows that you can actually build defensibility on your data again, which is surprising. I've almost switched completely to Composer since, because it's just so fast. It's almost too fast: I accidentally pushed to master on one of my personal projects, so you don't always want that. But Cursor was the crowd favorite, and I want to give a lot of props to their team. They built iteratively: the first version of Cursor was so bad, and we all used it (I used it because it's a VS Code fork, nothing to lose), and it's gotten so good. It's such a good piece of software and a great team. The same can be said about OpenAI's Codex models: they're not quite as fast, but they're optimized for these coding agents and they're distilled. And I could see OpenAI coming out with a really fast model here, because they also have the data. So here's a picture they put on their blog, and you can see their perspective on coding agents just from the fact that they show you the three models they're running. They're offering Composer, but they're letting you use the state of the art, because they know that maybe GPT-5.1 is better at planning. Here it shows GPT-5, but now we have 5.1.
So here's the big question: which one should we all use? Which architecture is best? What should we do? My opinion here is that benchmarks are pretty useless. Benchmarks have become marketing for a lot of these model providers; every model beats the benchmarks, and I don't know how that happens. But I think there's a world where evals matter here, and the question is what you can eval. This whole simple-while-loop architecture I've been pushing, based on my understanding of it, actually makes things harder to eval: if we're relying more on model flexibility, how do you test it? You could run an integration test, a kind of end-to-end test, and just ask, does it fix the problem? That's one way to do it. You could break it up: do point-in-time snapshots and say, "I'm going to give my chatbot the context from a half-finished conversation where I know it should run a specific tool call," and run those. Or I could just run a back test and ask how often it changes the tools. I think there's also another concept starting to be developed that I'm calling "agent smell": run an agent and see how many times it calls a tool, how many times it retries, how long it takes. These are all surface-level metrics, but they're really good for sanity checking. These things are hard to eval; there's a lot that goes into it, and I'll show you an example of what I did. But on that subject, maybe I'll say one more thing. My mental model is that you can do an end-to-end test, you can do a point-in-time test, or, what I most often recommend, just do a back test. Start with back tests: start capturing historical data and then just rerun it.
So let me give you this example. What I have here is a screenshot of PromptLayer. Our eval product is also just a batch runner, so you can run a bunch of columns through a prompt. But in this case, I'm running them not through a prompt but through Claude Code. I have a headless Claude Code, I'm taking all these model providers, and my headless Claude Code says (I think I have it on the next slide): search the web for the model provider given to you in a variable, find the most recent and largest model released, and return the name. I don't know what it's doing internally; it's doing web search, and I'm not even caring about that. This is an end-to-end test. This is how we kind of try out Claude Code, and I actually think there's a lot to putting Claude Code into your workflows with these headless SDKs; I'll talk about that on the next slide. The main takeaway here is that you can start to do end-to-end tests, look at it from a high level, do a model smell, and then dig into the statistics on each row and see how many times it called a tool.
And going back and we we've talked about this a lot in this talk. rigorous tools.
The tools can be rigorously tested. You
can This is how you offload the deter This is how you offload the determinism to different parts of your model. It's
you test the tools. You you test the out of your tools. Look at them like functions. It's an input and an
like functions. It's an input and an output. If your tools a sub agent that
output. If your tools a sub agent that runs, then we're in a kind of recursion here because then you have to go back and test the end to end thing. But for
your tools, I'll give you this example.
If I so there in my coding agents or my agents in general, my autonomous agents, if there's something very specific that I want to output. So in this case, if I
have a very specific type of email format or type of blog post that I want to write and I really want it to get my voice right, I don't want to rely on the model exploration. I want to actually
model exploration. I want to actually build a tool that I can rigorously test.
So in this case, this is also just a PromptLayer screenshot, but it's a workflow I've built. It has an LLM assertion that checks whether the email meets my standards. If it does, we're done; if it doesn't, it adds the parts it missed, like the header, and revises with the same step. This is obviously a very simple example, but we have another version for some of our SEO blog posts that has like 20 different nodes: it writes an outline from a deep research, then fixes the conclusion and adds links.
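That assert-then-revise pattern is easy to sketch as a plain loop; here `llm` is a stand-in for whatever client you use, and the prompts are placeholders:

```python
def write_email(llm, brief: str, max_revisions: int = 3) -> str:
    """`llm` is assumed to be a callable: prompt string in, text out."""
    draft = llm(f"Write an email in my voice for: {brief}")
    for _ in range(max_revisions):
        # LLM assertion: does the draft meet the standard?
        verdict = llm(
            "Does this email meet my standards (greeting, body, signature)? "
            f"Answer PASS or list what is missing.\n\n{draft}"
        )
        if verdict.strip().startswith("PASS"):
            return draft
        # Revise with the same step, feeding back what was missing.
        draft = llm(f"Revise this email. Fix: {verdict}\n\n{draft}")
    return draft
```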
For the stuff where you have a very specific vision, testing gets so much easier, because this sort of workflow has fewer steps and less flexibility. So this is an eval I made. I start with a bunch of sample emails, I run the agentic workflow on them, and I add a bunch of heuristics. Here it's a very simple LLM-as-judge check: does the email include the three parts I was testing for, the "Hi Jared" greeting, the body, and the signature? You can get a lot more complicated, you could do code execution, but an LLM judge is usually the easiest. Now I can keep running this until it's correct on all of them and watch my eval over time. Just from this example, I got it to 100. So that was fun.
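Before reaching for an LLM judge, a three-part check like this can even start as a deterministic heuristic. A hedged sketch; the patterns are illustrative, not what PromptLayer actually runs:

```python
import re

def grade_email(email: str) -> dict[str, bool]:
    """Cheap heuristic grader for the three expected email parts."""
    lines = [ln.strip() for ln in email.strip().splitlines() if ln.strip()]
    if not lines:
        return {"greeting": False, "body": False, "signature": False}
    return {
        "greeting": bool(re.match(r"(hi|hello|hey)\b", lines[0], re.I)),
        "body": len(lines) >= 3,  # something between greeting and sign-off
        "signature": bool(
            re.search(r"\b(best|regards|thanks)\b", "\n".join(lines[-2:]), re.I)
        ),
    }

sample = "Hi Jared,\n\nQuick update on the launch.\n\nBest,\nSam"
assert all(grade_email(sample).values())
```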
And then I want to add another future-looking thing: keep an eye on the headless Claude Code SDK. I know there was a talk about it this morning, so I won't spend too much time on it, but it's amazing. You just give it a simple prompt and it becomes another part of your pipeline. For example, I have a GitHub Action that updates my docs every day by reading all the commits we've pushed to our other repos. We have a lot of commits going, and it just runs Claude Code: Claude Code pulls down all the repos, checks what's updated, reads our CLAUDE.md to see if it should even update the docs, then creates a PR. I think this unlocks a lot of things, and there's a possibility that we start building agents at a higher order of abstraction, relying on Claude Code and these other agents to do a lot of the harness and orchestration work.
>> Are you reviewing those?
Yeah, it creates a PR. It doesn't merge the PR.
So, here are my takeaways. Number one: trust the model. When in doubt, rely on the model when you're building agents. Number two: simple design wins. One and two kind of go together. Number three: bash is all you need.
Go simple with your tools: don't have 40 tools, have five or ten. Number four: context management matters. This is the boogeyman we're running from all the time in agents at this point. Maybe there'll be new models in the future that are just so much better at context, but there's always going to be a limit. It's like talking to a human: I forget people's names if I meet too many in one day. That's context management, or my stupidity, I don't know. And number five: different perspectives matter in agents. The engineering brain doesn't always comprehend this as much as it should, and I'm an engineer, so I'm also talking about myself. Different perspectives matter in the sense that there are different ways to solve a problem where no one way is better than the others, so you probably want a mixture-of-experts agent. I would love to have mine run Claude Code and Codex and so on, give me the outputs, consider them a team, and maybe have them talk to each other in a Slack-style message channel.
I'm waiting for someone to build that.
That would be great. But these are my takeaways. My bonus thing is to show you how I built this slide deck using Claude Code. I built a Slidev skill: I basically told Claude Code to research how Slidev works, which is the library I made this in. I built a deep-research skill to research all these agents and how they work. And I built a design skill, because I can tell whether something looks terrible or looks good, but I'm not a good enough designer to figure it out myself. Even for these boxes, I was just like, "Oh, make the box a little nicer, give it an accent color." So yeah, this is how I built it. But again, thank you for listening. Happy to answer any questions. I'm Jared, founder of PromptLayer. Find me there.
>> Yes.
>> Thank you. Great talk. Um so you mentioned u regarding DAGs basically like let's get rid of them right but DAGs kind of enforc this like sequential
uh execution right pass I don't know customer service like agent asks the name email right like in some sort of uh
sequence um so are you saying just write this out um like this is now this should be uh just written out as a
plan for an agent to execute and just trust that the model is going to be calling those tools in that sequence like how do we enforce uh the order?
>> Right? So the question was why do I keep talking about getting rid of DAGs? How else am are you supposed to
of DAGs? How else am are you supposed to enforce a specific order for solving a problem? So I think there are different
problem? So I think there are different types of problems. So the problem of building a general purpose coding agent that we can all use to do our work and
even non-technical people can use there's no specific step to solving that problem which is why it's better to rely on the model. If your problem was
to build let's say a travel itinerary it's more of a specific step because you have a deliverable that's always the
same. So there's a little bit more of a
same. So there's a little bit more of a DAG that could matter, but in the research step of traveling, you probably don't want a DAG because every city is going to be different. So it really depends on the problem you're solving. I
would if I wanted to make an agent for a travel itinerary, I'd probably have my tool call would one of my tool calls be a DAG of creating the output file because I want the output to look the
same or creating the plan. And then in the system problem, I could say always end with the output for example. But
you need to mix and match. There's a
every use case is different, but if you want to make something general purpose, my take is to rely more on the model on simple loops and less on a DAG.
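One way to picture that mix: keep the loop model-driven, and make the one thing that must always look the same a deterministic tool. A sketch with hypothetical names:

```python
def render_itinerary(city: str, days: list[dict]) -> str:
    """Deterministic tool: the deliverable always has the same shape."""
    lines = [f"# {city} itinerary"]
    for i, day in enumerate(days, start=1):
        lines.append(f"## Day {i}: {day['theme']}")
        lines.extend(f"- {stop}" for stop in day["stops"])
    return "\n".join(lines)

# The research loop stays model-driven; only the final artifact is pinned down.
TOOLS = {"render_itinerary": render_itinerary}
SYSTEM_PROMPT = (
    "Research the trip however you like, but always end by calling "
    "render_itinerary exactly once with the final plan."
)
```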
>> Cool. Any other questions?
Yes.
>> Yeah. Building on that point, like do you think we're heading towards a world where most of you're not actually going to call the API through code and that most LM calls are by triggering cloud
code and just write just writing the files instead?
So the question is: are we going to move away from calling models directly and just call something like a headless Claude Code, right?
>> Yeah. Like if I had a like I have a pipeline that does one lm call per document, summarizes it at the end. You
could make a while loop cloud code that saves a file every time. You never call the API besides using cloud code in in a while loop
>> potentially. Uh, I'll give you the pro
>> potentially. Uh, I'll give you the pro and the con there.
>> Yeah, >> the pro is it's easier to develop and we can kind of rely on the frontier. I
mean, if you think about it, a reasoning model is just that. The reasoning models didn't always exist. We just had normal LM model and then oh, now we have 01 and reasoning models. All that is is a I
reasoning models. All that is is a I mean, it's a little more complicated than this, but it's basically just a while loop on OpenAI servers that keeps running the context and then eventually gives you the output. in the same way
that cloud code SDK is a while loop with a bunch of more things. So I could totally see a lot of builders only touching these agentic endpoints. Maybe
even seeing a model provider release a model as a agentic endpoint. But for a lot of tasks, you're going to want a little bit more control. And they're pro
and probably you'd still want to go as close to metal as possible. Having said
that, there's there was a lot of people who still wanted completions models and that never happened and nobody really talks about that anymore. So, it's very likely that everything just becomes this
SDK, but I don't have a crystal ball, but those are those are how I I would think about it.
>> Yes, >> thanks for the talk. Um, I know you said the simpler the better, but um, what's your thoughts about test during development, spec during development in
AI? Have you tried it? What is it about
AI? Have you tried it? What is it about >> for building agents or for getting work done?
>> For coding.
>> Okay. So the question on spec driven development, test-driven development for coding with agents.
When in doubt, go back to good engineering practices is what I would say. So it if you
would say. So it if you and there's there's whole engineering debates on if test-driven development is the right way and some people swear by it and some people don't. So I don't
think there's an answer. I think coding agents clearly test-driven development makes it easier. I think as I was showing you that's AMP's source graphs whole philosophy that if you can build
good tests and factory I think thinks this as well. If you could build good tests your coding agent can work much better. So it makes sense to me when I'm
better. So it makes sense to me when I'm working personally I rely pretty heavily on the planning phase and the spectr in development phase and I think the simpler tasks are pretty easy for the
model but if I'm doing a very simple edit I'll skip that step. So no
oneizefits-all but return to the engineering principles that you believe when in doubt I'd say yes.
>> So earlier you talked about about system rock leaks is possible to just look at the u downloads bundle or they have a special
end point that has prompts behind endpoint.
>> Yeah. Uh I think I think they hide it. I
think they hide it. There was a there was actually an interesting article someone because codeex is open source they before openai released the codeex model
that it was using they were able to hack together the open source codeex to give a custom prompt to the model and be able to use the model without it. So yeah you can dive into it but generally it's
tried to be hidden and also laziness of someone posted it. So there you go that's the work but someone had to have found it right. like is this problem
somewhere on your machine?
>> I actually don't know that answer.
>> Do you know that answer?
>> Yeah.
>> Yes.
>> It's on your machine. Nico says it's on your machine. So there we go. So maybe
your machine. So there we go. So maybe
the prompt I was looking at is a little bit old and I have to update it. But the
s but uh the question was does uh is the prompt hidden on their servers or can you find it if you are so determined?
And the answer seems to be yes. Any
other questions?
>> Yes.
>> Is this the last one?
>> Is this the last question?
>> It can be.
>> Can you talk about prompt layer and how can people help you?
>> Yes, that's a good one. I forgot about that. Thank you.
that. Thank you.
So yeah, number one, we're hiring. If you're looking for coding jobs at a very fun and fast-moving team in New York, you can reach out to me on X or email jared@promptlayer.com.
We're based in New York. We're a platform for building and testing AI products: prompt management, auditability, governance, all that fun stuff, but also logging and evals. Those screenshots I showed you came from PromptLayer. If you're building an AI application with a team, you should probably try PromptLayer; it'll make your life easier. Especially the bigger your team is, the more you want to collaborate with PMs and non-technical users, or even if you're all technical users, it's a great tool. It'll make your life better. Highly recommend it: promptlayer.com, and it's easy to get started.
And that was my show.
Thank you for listening.