How Claude Code Works - Jared Zoneraich, PromptLayer
By AI Engineer
Summary
Topics Covered
- Simple Loop Beats Complex Scaffolding
- Bash Is Universal Coding Tool
- To-Dos Enable Model Steerability
- Ditch DAGs for Model Reliance
- No Single Best Coding Agent
Full Transcript
So, welcome to the last workshop. You made it, congrats. Out of like 800 people, you're the last ones standing, the very dedicated engineers. So, this one's a weird one. I got in trouble with Anthropic on this one, obviously because of the title. I actually gave him the title and asked if he wanted to change it, and he said no, he'd just roll with it. It's kind of funny. So yeah, this is not officially endorsed by Anthropic, but we're hackers, right? And Jared is super dedicated. The other thing I really enjoy is featuring notable New York AI people. So don't take this as the only thing Jared does; he has a whole startup that you should definitely ask him about. I'm just really excited to feature more content from local people. So yeah, Jared, take it away.
>> Thank you very much. And what an amazing conference. I'm very sad we're ending it, but hopefully this will be a good ending. My name is Jared, and this will be a talk on how Claude Code works. Again, not affiliated with Anthropic; they don't pay me. I would take money, but they don't. We're going to talk about a few other coding agents as well. The high-level goal I'll go into: I'm personally a big user of all the coding agents, as is everyone here. They kind of exploded recently, and as a developer I was curious what changed, what finally made coding agents good. So let's get started. I'll start with me.
You can find me as Jared Z on X, on Twitter, whatever. I'm building the workbench for AI engineering. My company is called PromptLayer, and we're based in New York. You can kind of see our office here; it's a little building, so it's blocked by a few of the other buildings. We're a small team. We launched the product three years ago, which is a long time in AI but young for everything else. Our core thesis is that we believe in rigorous prompt engineering and rigorous agent development, and we believe the product team should be involved along with the engineering team. If you're building AI lawyers, you should have lawyers involved as well as engineers. That's kind of what we do. We're processing millions of LLM requests a day, and a lot of the insights in this talk come from conversations we have with our customers on how to build coding agents and things like that. Also, feel free to keep this casual throughout; if anything I say raises a question, just throw it in. I spend a lot of my time dogfooding the product. The job of a founder these days is weird: it's half kicking off agents and half using my own product to build agents. It feels weird, but it's kind of fun. The last thing I'll add here is that I'm a big enthusiast. We literally rebuilt our engineering org around Claude Code. The hard part about building a platform is that you have to deal with all these edge cases ("uploading data sets here doesn't work") and you can die a death by a thousand cuts. So we made a rule for our engineering organization: if you can complete something in less than an hour using Claude Code, just do it; don't prioritize it. We're a small team on purpose, but it's helped us a lot, and I think it's really taken us to the next level. So I'm a big fan. Let's dive into how these things work.
So this is, as I was saying, the goal of this talk. First, why have these things exploded? What was the innovation, the invention, that made coding agents finally work? If you've been around this field for a little bit, you know that a lot of these autonomous coding agents sucked at the beginning, and we all tried to use them. But it's night and day now. We'll dive into the internals, and lastly, everything in this talk is oriented around how you build your own agents and how you use this to do AI engineering for yourself.
So let's talk about history for a second. How did we get here? Everybody knows it started with the workflow of copying and pasting your code back and forth with ChatGPT, and that was great; that was kind of revolutionary when it happened. Step two, when Cursor came out, if we all remember, it was not great software at the beginning. It was just a VS Code fork with Command-K, and we all loved it. But now we're not doing Command-K anymore. Then we got the Cursor assistant, that little agent going back and forth, and then Claude Code. And honestly, in the last few days since I made this slide, maybe there's a new version we could talk about here. At the end I'll talk about what's next. But this is how we got here, and Claude Code is kind of this headless, new workflow of not even touching code. And for that, it has to be really good.
So why is it so good? What was the big breakthrough here? Let's try to figure that out. And I'll throw this in one more time: these are all my opinions on what the breakthrough was. Maybe there are other things, but: simple architecture. I think a lot of things were simplified in how the agent was designed. And then better models, better models, and better models. A lot of the breakthrough is kind of boring, in that it's just Anthropic releasing a better model that works better for these kinds of tool calls. But the simple architecture relates to that, so we can dive into it. You'll see the prompt wrangler, our company's little mascot, throughout; we made a lot of graphics for these slides. But basically, "give it tools and then get out of the way" is the one-liner for the architecture today. If you've been building on top of LLMs for a little while, this has not always been true. Obviously tool calls haven't always existed; tool calling is kind of a new abstraction over JSON formatting, if you remember GitHub libraries like Jsonformer from the olden days. But give it tools and get out of the way. The models are built for these things and are being trained to get better and better at tool calling. Every engineer, especially myself, loves to over-optimize, and when you first have an idea for how to build the agent, you're going to sit down and say, "I'm going to prevent this hallucination with this prompt, and then this prompt, and then this prompt." Don't do that. Just run a simple loop and get out of the way. Delete scaffolding: less scaffolding, more model is the tagline here. And this is the leaderboard from this week.
Obviously, these models are getting better and better. We could have a whole conversation, and I'm sure there have been many, about whether it's slowing down or plateauing. It doesn't really matter for this talk. We know they're getting better, getting better at tool calling, and getting better optimized for running autonomously. And, I think Anthropic calls this the AGI-pilled way to think about it: don't try to over-engineer around model flaws today, because a lot of them will just get better and you'll be wasting your time. So here's the philosophy of Claude Code, the way I see it: ignore embeddings, ignore classifiers, ignore pattern matching. We had this whole RAG thing (actually, Cursor is bringing back a little bit of RAG, mixing and matching in how they do it). But the genius of Claude Code is that they scrapped all of this and said: we don't need all these fancy paradigms to get around the ways the model is bad. Let's just make a better model and let it cook. They lean on tool calls, and they simplify the tool calls, which is a very important part. Instead of a workflow where the master prompt can break into three different branches and then go into four different branches, there are really just a few simple tool calls, including grep instead of RAG. And that's what the model is trained on, so these are very optimized tool-calling models.
This is the Zen of Python, if you're familiar with it, if you run "import this" in Python. I love this philosophy when it comes to building systems, and I think it's really apt for how Claude Code was built. Simple is better than complex; complex is better than complicated; flat is better than nested. This is the whole talk. This is all you need to know about how Claude Code works and why it works: we're going back to engineering principles, where simple design is better design. I think this is true whether you're building a database schema or building these autonomous coding agents. So now I'm going to break down the specific parts of this coding agent and why I think they're interesting.
The first is the constitution. A lot of this stuff we take for granted now, even though it only started a month or two ago, maybe three or four months. This is the CLAUDE.md; Codex and others use AGENTS.md. I assume most of you know what it is: it's where you put the instructions for your repository. But the interesting thing about it is that it's basically the team saying we don't need to over-engineer a system where the model first researches the repo. Cursor 1.0, as you know, builds a vector DB locally to understand the repo and does all this research. They're just saying: put a markdown file there, let the user change things when they need to, let the agent change things when it needs to. Very simple, and it goes back to prompt engineering, which I'm a little biased toward because PromptLayer is a prompt engineering platform, but everything is prompt engineering at the end of the day, or context engineering. Everything is: how do you adapt these general-purpose models for your usage? And the simplest answer is the best one here, I think.
So this is the core of the system. It's just a simple master loop, and that's actually kind of revolutionary considering how we used to build agents. Everything in Claude Code, and in all the coding agents today, Codex and the new Cursor and Amp and all of them, is just one while loop with tool calls: run the master while loop, call the tools, and go back to the master while loop. It's basically four lines. I think they call it nO internally, at least based on my research. While there are tool calls, run the tool, give the tool results to the model, and do it again; when there are no tool calls left, ask the user what to do. The first time I used tool calls, it was very shocking to me that the models are so good at knowing when to keep calling the tool and when to fix their mistake. I think that's one of the most interesting things about LLMs: they're really good at fixing mistakes and being flexible. And the more you lean on the model to explore and figure things out, the better and more robust your system is going to be as models get better.
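As a sketch, the loop looks roughly like this. This is not Anthropic's actual code: call_model stands in for any LLM API that returns text or tool calls, and a single bash tool is just the smallest useful tool set.

```python
import subprocess

def bash_tool(command: str) -> str:
    """The 'bash' tool: run a shell command and return its output."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"bash": bash_tool}

def master_loop(call_model, messages: list[dict]) -> str:
    # While the model keeps requesting tools, run them and feed results back.
    while True:
        reply = call_model(messages)                # one model turn
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:
            return reply["text"]                    # done: hand back to the user
        for call in tool_calls:
            result = TOOLS[call["name"]](**call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})    # tool result goes back in
```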
So these are the core tools we have in Claude Code today. To be honest, these change every day; they're doing new releases every few days, but these are the ones I found most interesting to talk about. There could be fifteen tomorrow, or it could be down to five, but this is what I find interesting. First of all, Read. Yeah, they could just do a cat, but what's interesting about Read is that we have token limits. If you've used Claude Code a lot, you've seen it sometimes say a file is too big; that's why it's worth building this Read tool. Grep and Glob. These are very interesting because they go against a lot of the wisdom at the time of using RAG and vectors. I'm not saying RAG has no place, by the way, but in these general-purpose agents, grep is good, and grep is how users would do it. And I think that's actually a high-level point here: as I talk about these tools, remember that these are all human tasks. We're not making up a brand-new tool for the model to use; we're mimicking the human actions, what you and I would do at a terminal trying to fix a problem. Edit. Edit makes sense. The interesting thing to note about Edit is that it uses diffs; it's not rewriting files most of the time. Way faster, way less context used, but also way fewer issues. If I gave you these slides and asked you to review them, and you had to write out all the slides with your new revisions, versus just crossing things out on the paper, the crossing out is way easier. Diffs are kind of a natural way to prevent mistakes. Bash. Bash is the core thing here. I think you could probably get rid of all these tools and have only Bash. The first time I saw Claude Code create a Python file, run the Python file, then delete the Python file, that was the beauty of why this thing works. So Bash is the most important. WebSearch and WebFetch. The interesting thing about these is that they move the work to a cheaper and faster model. So for example, if you're building some sort of agent on your platform and it needs to connect to some list of endpoints, it might be worth bringing that into a kind of sub-tier, as opposed to the master while loop. That's why it's its own tool. To-dos: we've all seen to-dos; I'll talk about them a little more later, but they keep the model on track, steerability. And then Task. Task is very interesting. It's context management: how do we run a long process, read a whole file, without cluttering the context? Because the biggest enemy here is that when your context is full, the model gets stupid, for lack of a better word.
So basically, Bash is all you need. This is the one thing I want to drill down on. There are two amazing things about bash for coding agents. The first is that it's simple and it does everything; it's very robust. The second thing, which is equally important, is that there's so much training data on it, because it's what we use. It's the same reason models are not as good at Rust or less common programming languages: there are fewer people doing it. So it's really the universal adapter. There are thousands of tools; you can do anything. This is that Python example I gave: I always find it so cool when it does the Python script thing, or creates tests (and I always have to tell it not to). All the shell tools are in it. I find myself using Claude Code to spin up local environments where normally I'd have five commands written down in a file somewhere that gets out of date. It's really good at figuring this stuff out and running the things you'd want to run. And it specifically lets the model try things.
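For instance, the create-run-delete pattern he mentions would show up in a session as something like this (a made-up example of what the agent runs through its bash tool):

```bash
# Write a throwaway script, run it, clean it up: all through one bash tool.
cat > /tmp/inspect_config.py <<'EOF'
import json, sys
cfg = json.load(open(sys.argv[1]))
print(sorted(cfg.keys()))
EOF
python /tmp/inspect_config.py config.json
rm /tmp/inspect_config.py
```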
The other suggestions here are around tool usage. I think there's a bit of system prompt that tells it which tool to use and when, and this changes a lot, but these are the edge cases and corners you find the model getting stuck in. Reading before editing: they actually make it do that. Using the Grep tool instead of bash grep: if you look at the tool list, there's a special Grep tool. There could be a lot of reasons for that; I think security is a big one, and sandboxing, but also the token limit thing. Running independent operations in parallel: they push the model to do that more. And then trivial things like quoting paths with spaces. It's just the common stuff. I'm sure they dogfood a lot at Anthropic, and when they find something, they go, "all right, we'll throw it in the system prompt."
system prompt." Okay, so let's talk about to-do lists.
Uh, now again, a very common thing, but was not a common thing before. The the
So this is actually I think a to-do list for from some of my my research for this slide deck. Um, but the really
slide deck. Um, but the really interesting thing about to-do lists is that they're structured but not
structurally enforced. So, here are the
structurally enforced. So, here are the rules. One task at a time. Uh, mark them
rules. One task at a time. Uh, mark them completed. This is kind of stuff you
completed. This is kind of stuff you would expect. Uh, keep working on the in
would expect. Uh, keep working on the in progress if there's block blocks or errors and kind of break up the tasks into different instructions. But the
most interesting thing to me is it's not enforced deterministically. It's purely
enforced deterministically. It's purely prompt based. It's purely in the system
prompt based. It's purely in the system prompt. It's purely because our models
prompt. It's purely because our models are just good at instruction following now. And this would not have worked a
now. And this would not have worked a year ago. This would not have worked two
year ago. This would not have worked two years ago. Um there's tool descriptions
years ago. Um there's tool descriptions at the top of the system prompt. We're
kind of uh injecting the todos into the system prompt. uh there's they're not
system prompt. uh there's they're not but it but it's not enforced in actual code and again uh maybe there's other agents that take an opposite path. Uh I
just found this pretty interesting that this at least as a user makes a big difference and it doesn't even see it seems it was it seems like it was very simple to implement almost a a weekend
project someone did and seemed to work.
I could be wrong about that as well, but so yeah, it's literally a function call. The first time you ask for something, the reasoning exports this to-do block, and I'll show you the structure on the next slide. There are IDs there, some kind of structured schema and determinism, but it's just injected. So here's an example of what it could look like: you get a version, you get your ID, a title for the to-do, and then it can actually inject evidence, which is seemingly arbitrary blobs of data it can use. The IDs are hashes that it can then refer to; the title is something human-readable. It's just another way to structure the data. In the same way that you organize your desk when you work, this is how we're trying to organize the model.
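Roughly, a to-do block in that shape might look like this; the field names are reconstructed from the description above, so treat the exact schema as an assumption:

```json
{
  "version": 2,
  "todos": [
    {
      "id": "f3a91c",
      "title": "Reproduce the failing test",
      "status": "completed",
      "evidence": "pytest: 1 failed, 42 passed (test_loader.py::test_empty)"
    },
    { "id": "8b07de", "title": "Patch the empty-file edge case", "status": "in_progress" },
    { "id": "c152aa", "title": "Re-run the suite and update the docs", "status": "pending" }
  ]
}
```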
So I think these are the four benefits we're getting. We're forcing it to plan. We get to resume after crashes, because Claude Code does fail. And I think UX is a big part of this: as a user, you know how it's going; it's not just running off in a loop for 40 minutes without any signal to you. So UX is non-negligible. Even if UX doesn't make it a better coding agent, it makes it better for all of us to use. And then steerability. So here are two other parts that are under the hood. The async buffer, which they call h2A: it's the I/O process, how to decouple it from reasoning, and how to manage context so you're not just stuffing everything you see in the terminal back into the model. Again, context is our biggest enemy here; it makes the model stupider. So we need to be a little bit smart about how we do compaction and summarization. Here you see that when it reaches capacity, it kind of drops the middle and summarizes the head and the tail; that's the context compressor. What's the limit, 92%? It seems like something like that. And how does it save long-term storage? That's actually another advantage of bash, in my opinion, and of having a sandbox. I'd even make a prediction here: all your ChatGPT windows, all your Claude windows, are going to come with a sandbox in the near future. It's just so much better, because you can store long-term memory. I do this all the time: I have Claude Code skills for deep research and things like that, and I'm always instructing it to save markdown files, because the shorter the context, the quicker it is and the smarter it is.
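As a toy version of the compaction idea: the ~92% trigger is from the talk, but the split sizes and the choice to keep the opening and recent turns verbatim while summarizing the dropped middle are my assumptions.

```python
def compact(messages: list[dict], summarize, used_tokens: int, max_tokens: int) -> list[dict]:
    """Shrink the transcript once it nears the context limit."""
    if used_tokens < 0.92 * max_tokens:     # ~92% trigger mentioned in the talk
        return messages
    head, middle, tail = messages[:2], messages[2:-6], messages[-6:]
    # Replace the bulky middle with a one-message recap; keep the opening
    # instructions and the most recent turns verbatim.
    recap = {"role": "system",
             "content": "Summary of earlier conversation: " + summarize(middle)}
    return head + [recap] + tail
```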
So this is what I'm most excited about: we don't need DAGs like this. I'll give you a real example. Some users at PromptLayer build different agents, like customer support agents, and basically everybody was building DAGs like this for the last two, two and a half years. And it was crazy: hundreds of nodes of "okay, if this user wants a refund, route them to this prompt; if they want this, route them there," and a lot of classifying prompts. The advantage is that you can kind of guarantee there won't be hallucinations, or guarantee there won't be refunds to people who shouldn't get refunds. It also solves the prompt injection problem, because if you're in a prompt that purely classifies as X or Y, injecting doesn't really matter, especially if you throw out the context. Now we kind of bring back that attack vector, but the major benefit is we don't have to deal with this web of engineering madness. It's 10x easier to develop these things, 10x more maintainable, and it actually works way better, because our models are just good now.
So this is kind of the takeaway: rely on the model. When in doubt, don't try to think through every edge case and every if statement; just rely on the model to explore and figure it out. Actually, two days ago, or yesterday, sometime this week, I was doing an experiment on our dashboard, trying these browser agents. I wanted to see if adding little titles to all our buttons would help the agent navigate our website automatically. And it actually made it worse, surprisingly. Maybe I could run it again, maybe I did something wrong with the test, but it made the agent navigate PromptLayer worse, because it was getting distracted: I was telling it "you have to click this button, then you have to click this button," and it didn't know what to do. So it's better to rely on exploration. You have a question?
>> Yeah, I'll push back a little bit.
>> Please.
>> I'll admit that any scaffolding we create today to resolve the idiosyncrasies of current limitations will be obsolete in three to six months. But even if that's the case, it helps a little bit today. How do you balance that wasted engineering against solving a problem we only have for three months?
>> It's a great question. Just to repeat it: what is the trade-off between solving the actual problems we have today versus relying on the model, which can't do it yet but will be able to in three months? It's case by case; it depends what you're building. If you're building a chatbot for a bank, you probably do want to be a little more careful. To me, the happy middle ground is to use this agent paradigm of a master while loop and tool calls, but make your tool calls very rigorous. I think it's okay to have a tool call that looks like this, or half of this, in the same way that Claude Code uses Read as a tool call or Grep as a tool call. For the edge cases, throw them into a structured tool that you can then eval and version. I'll talk a little more about that later, but throw it in the structured tool. For everything else, the exploration phase, leave it to the model, or throw in some system prompt. So it's a trade-off, and it's very use-case dependent, but I think it's a good question. Thank you.
So yeah, back to Claude Code. We're getting rid of all this stuff. We're saying we don't want ML-based intent detection. We don't want ReAct; I mean, it uses ReAct a little bit, but we don't want ReAct baked into it. We don't want classifiers. For a long time we actually built a product at PromptLayer, never released, only a prototype, for using an ML-based, non-LLM classifier in your prompt pipeline instead of LLMs. A lot of people have had success with that, but it feels more and more like it's not going to be helpful unless cost is a huge concern for you. And even then, the cost of the smaller models keeps going down as the financial engineering between all these companies pays for our tokens. Claude also does this smart thing, I think, with the trigger phrases: you have think, think hard, think harder, and ultrathink, which is my favorite. This lets us use the reasoning token budget as another parameter. The model can adjust this, but this is how we force it to adjust. Alternatively, you could make a tool call for hard planning (there are some coding agents that do this), or you can let the user specify it and change it on the fly.
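Conceptually, the trigger phrases map to a budget something like this; the phrases are the ones he lists, but the token numbers here are invented for illustration:

```python
# Map trigger phrases in the user's message to a reasoning-token budget.
# Strongest phrases come first so "ultrathink" wins over plain "think".
BUDGETS = {
    "ultrathink": 32_000,
    "think harder": 16_000,
    "think hard": 8_000,
    "think": 4_000,
}

def reasoning_budget(user_message: str) -> int:
    text = user_message.lower()
    for phrase, tokens in BUDGETS.items():
        if phrase in text:
            return tokens
    return 0   # default: no extended thinking
```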
So this is one of the biggest topics here: sandboxing and permissions. I'm going to be completely honest, it's the most boring part of this to me, because I just run it in YOLO mode half the time. Some people on our team have actually dropped their local databases, so you do have to be careful. We don't run YOLO mode with our enterprise customers, obviously, but this stuff feels like it's going to be solved; still, we need to know how it works a little. There's a big issue of prompt injection from the internet: if you're connecting an agent that has shell access and you're doing web fetches, that's a pretty big attack vector. So there's some containerization of that, and there's URL blocking; you can see Claude Code is pretty annoying about "can I fetch from this URL, can I do this?" And it puts fetches into a sub-agent. Most of the complex code here is in this sandboxing and permission set. There's a whole pipeline to gate bash commands: depending on the prefix, a command goes through the sandboxing environment differently. A lot of the other agents work differently here, but this is how Claude Code does it. I'll explain the other ones later at the end.
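In spirit, the prefix gate works something like this; the prefix lists and outcomes here are illustrative, not Claude Code's actual policy:

```python
# Decide how to treat a bash command based on its prefix.
AUTO_ALLOW = ("git status", "git diff", "ls", "cat", "grep")
ALWAYS_ASK = ("rm ", "git push", "curl ", "pip install")

def gate_command(command: str) -> str:
    cmd = command.strip()
    if cmd.startswith(AUTO_ALLOW):
        return "run"             # safe, read-only prefixes run directly
    if cmd.startswith(ALWAYS_ASK):
        return "ask_user"        # destructive or network-touching: confirm first
    return "sandboxed_run"       # everything else runs in the restricted sandbox
```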
The next topic of relevance here is sub-agents. This goes back to context management, and this problem we keep returning to: the longer the context, the stupider our agent gets. This is an answer to it: use sub-agents for specific tasks. The key with a sub-agent is that it has its own context and feeds back only the results; that's how you keep the main context uncluttered. So we've got, and these are just four examples, a researcher, a docs reader, a test runner, and a code reviewer. In that example I was talking about earlier, when I added all the tags to our website to let the agent navigate it better, I obviously used a coding agent to do it, and I said "read our docs first and then do it." It does that in a sub-agent and feeds the information back. The key thing here is the forking of the agent and how we aggregate results back into the main context. So here's an example. I think this is actually very interesting, and I want to call out a thing or two. Task is what a sub-agent is. We give Task two things: a description and a prompt. The description is what the user is going to see, so you'd say something like "task: find default chat context instantiation." And the prompt is a long string, which is really interesting, because now we have the coding agent prompting its own agents. I've actually used this paradigm in agents I've built for our product. The agent can stuff as much information as it wants into this string, and, going back to relying on the model, if the task returns an error, it can stuff in even more information and let it solve the problem. It's better to be flexible rather than rigid.
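A hypothetical Task call in that shape; both fields are invented for illustration:

```json
{
  "tool": "Task",
  "input": {
    "description": "Find default chat context instantiation",
    "prompt": "Search this repo for where the default chat context is created. Try grep -rn 'ChatContext(' src/ first. Report the file, line number, and the default argument values. Return only your findings, not full file contents."
  }
}
```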
If I were building this, I would consider switching that string to an object, depending on what you're building, and let it return more structured data. Yes?
>> I can see this prompt has quite a couple of sentences. Is that in the main agent? Is that taking the context of the main agent, or is there some intermediate step where the sub-agent reads over what the main agent is doing and then generates it?
>> Right. So the question is: does the task just get the prompt here, or does it also get your chat history? Is that the question?
>> The question is: I have my main agent. Is all of this in the system prompt of the main agent, to inform how it prompts the sub-agent?
>> No, it's not in the system prompt.
>> It's in the whole context? Is all of this in the context of the main agent, the task it calls? Or are you saying the structure for the task, this whole JSON?
>> Yes. So this is a tool call. The tool-call structure of what a Task is lives in the main agent, and then these are generated on the fly: as it wants to run a task, it generates the description and the prompt. Task is a tool call, they can be run in parallel, and then they return their results. Hopefully that helps.
So we can go back to the system prompt. There are some leaks of the Claude Code system prompt; that's what I'm basing this on, and you can find it online. Here are some things I noted from it. Concise outputs: obviously, don't give anything too long; no "here is" or "I will", just do the task the user wants. Pushing it to use tools more, instead of text explanations: we've all built coding agents, and when we do, the model usually says, "hey, I want to run this SQL." No, push it to use the tool. Matching the existing code. Not adding comments (this one does not work for me). Running commands in parallel extensively, and then the to-dos and so on. There's a lot you can nudge it to do with the system prompt. And I think there's a really interesting point here for the earlier question about the trade-off between DAGs and loops. A lot of these feel like they came from someone using Claude Code and saying, "oh, if only it did this a little less," or "if only it did this a little more." That's where prompting comes in, because it's so easy to iterate; it's not a hard requirement, but it's okay to say "if only" sometimes. All right, skills.
Skills are great. They're slightly newer; I honestly got convinced of them only recently. So good. I built these slides with skills. Basically, in the context of this talk about architecture, think of a skill as an extendable system prompt. In the same way that we don't want to clutter the context, there are a lot of different kinds of tasks where you want a lot more context, and this is how we give Claude Code a few options for tapping into more information. Here are some examples. I have a skill for docs updates that tells it my writing style and my product; if I want to do a docs update, I say use that skill, load it in. Editing Microsoft Office, Word and Excel: I don't use this, but I've seen a lot of people using it; it kind of decompiles the file, it's really cool, and it lets Claude Code do it. A design style guide is a common one. Deep research: the other day I threw in an article, or a GitHub repo, on how deep research works, and I said rebuild this as a Claude Code skill. It works so well, it's amazing.
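For a sense of the shape, here's a hypothetical docs-updates skill file; the path and contents are invented, so check Anthropic's docs for the exact format:

```markdown
---
name: docs-updates
description: Use when updating product documentation. Applies our docs writing style.
---

# Docs updates

- Write in second person, present tense, no marketing language.
- Keep code samples under 20 lines and runnable.
- Every new page must be linked from docs/index.md.
```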
So, unified diffing. I think this is worth its own slide. It's probably obvious and there's not too much to say, but it makes this so much better: it keeps token usage down, it makes it faster, and it makes it less prone to mistakes, like the example I gave of rewriting an essay versus marking it up with a red line. It's just better. I highly recommend using diffing in any agents you're building. Unified diff is a standard. When I looked into a lot of these coding agents, some actually built their own slight variations on unified diff, because you don't always need the line numbers, but unified diff works.
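For reference, a unified-diff edit looks like this; the file and the change are invented:

```diff
--- a/src/parser.py
+++ b/src/parser.py
@@ -41,4 +41,4 @@ def parse_header(line):
     if not line:
         return None
-    key, value = line.split(":")
+    key, value = line.split(":", 1)
     return key.strip(), value.strip()
```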
You had a question?
>> To go back to skills: I don't know if anyone's seen this, but Claude Code warns you in yellow text if your CLAUDE.md is greater than 40k characters. So I figured, okay, let me break this down into skills. I spent some time on it, and then Claude ignored all of my skills, so I put them back. So what am I missing? Skills feel globally misunderstood, or I'm missing something. Help me understand.
>> Yeah. So the question was: Claude Code tells you when your CLAUDE.md is too long, so you move it into skills, and then it's not recognizing the skills and not picking them up when needed.
>> Yeah.
>> Take that up with the Anthropic team, I'd say. But that's also a good example of maybe the system prompt...
>> I thought that was the intention: with skills, you need to invoke them, and the agent itself shouldn't just call them all the time.
>> Right. It does give a description of each skill to the model; it should get a one-liner about each skill. So theoretically, in a perfect world, it would pick up the right skills all the time. But you're right, I generally have to call the skill myself manually. I think this is a good tie-back to when prompting is the right solution, when the DAG is the right solution, or whether this is a model training problem. Maybe they need to do a little more in post-training to get the model to call skills the way it calls a tool; you have to know when to call it. So maybe this is just functionality that's not that good yet. I think the paradigm is very interesting, but it's not perfect, as we're learning.
So, diffing we just talked about. What's next? This is more opinion-based: where I see these things going, and where the next innovations are likely to be. I think there are two schools of thought here. A lot of people think we're going to have one master loop with hundreds of tool calls, and tool calling is just going to get much better. That's highly likely. I take the alternate view, which is that we need to reduce the tool calls as much as possible, go back to just bash, and maybe even put scripts in the local directory. I'm a proponent of one mega tool call instead of a lot of tool calls. Maybe not actually one; I think that slide I showed you before is probably a good list. But a lot of people think we need hundreds of tool calls, and I just don't think it's going there. Adaptive budgets: adjusting reasoning. We do this a little bit with think and ultrathink and so on, but I think reasoning models as a tool makes a lot of sense as a paradigm. A lot of us would make the trade-off of a 20-times-quicker model with slightly stupider results, plus the ability to call a very good model as a tool. That's a trade-off we'd make in a lot of cases. Maybe not for our planner; maybe we go to the planner first with GPT-5.1 Codex or Opus, or whatever, when the new Opus comes out. But I think there's a lot of mixing and matching we can do, and that's the next frontier. And the last frontier: I think there's a lot we can learn from to-do lists and from new first-class paradigms we can build. Skills are another example of a first-class paradigm we can try to build in; maybe it doesn't work perfectly, but I think there are a lot of new discoveries to be made there, in my opinion. Do I have them? I don't know. So now, for the latter part of this talk, I want to talk about the other frontier agents and the design philosophies they've chosen. We all have the benefit of being able to mix and match: when we're building our own agents, we can do whatever we want and learn from the best, and the frontier labs are very good at this. So, here's something I like to go back to a lot.
I call it the AI therapist problem; maybe there's a better name for it. I believe that for a lot of problems, the most interesting AI problems around, there isn't a global maximum. Meaning: all right, we're in New York City. If I need to see a therapist, there are six on every block here. There's no global answer for what the best therapist is. There are different strategies: a therapist that does meditation, or CBT, or maybe one that gives you ayahuasca. These are just different strategies for the same goal, in the same way that if you're building an AI therapist, there isn't a global maximum. This is kind of my anti-AGI take, but it's also the take that says when you're building these applications, taste comes into it a lot, and design architecture matters a lot. You can have five different coding agents that are all amazing. Nobody knows which one is best today, to be honest. I don't think Anthropic knows. I don't think OpenAI knows. I don't think Sourcegraph knows. Nobody knows whose is best, but some are better at some things. I personally like Claude Code for, as I said, running my local environment, or using git, these kinds of human actions that require back and forth; but I go to Codex for the hard problems, or I go to Composer from Cursor because it's faster. All this to say: there's value in having different philosophies here. And I don't think there's going to be one winner; there will be different winners for different use cases. And this is not just coding agents, by the way, this is all AI products. This is why our whole company focuses on domain experts, on bringing the PM and the subject matter expert into it, because that's how you build defensibility.
So here are the perspectives, the way I see them. This is not a complete list of coding agents, but these are the ones I think are the most interesting. Claude Code, to me, wins on user-friendliness and simplicity. Like I said, if I'm doing something that touches a lot of applications (git is the best example; if I want to make a PR, I'm going to Claude Code), it's really good at context management, and it feels powerful. Do I have evidence to show you that it's more powerful? Probably not, but it feels that way to me, and to the market. There's a whole other conversation about whether the market knows best and whether what people talk about knows best, but I don't know if they know either. Cursor's IDE is kind of the model-agnostic perspective, and it's faster. Factory makes Droid; great team, they were here too. They have multiple specialized droid sub-agents, so that's their edge, and maybe that's a DAG conversation too, or maybe model training. Cognition, so Devin: kind of end-to-end autonomy and self-reflection. Amp, which I'll talk about more in a second: they have a lot of interesting perspectives, and I actually find them very exciting these days. It has a free tier, it's model-agnostic, and there's a lot of UX sugar for users. I love their design and their talks at this conference; they have very unique perspectives.
So let's start with Codex, because it's a popular one. It's pretty similar to Claude Code: the same master while loop, which most of these have, because that's just the winning architecture. Interestingly, it has a Rust core. The cool thing is it's open source, so you can actually use Codex to understand how Codex works, which is kind of what I did. It's a little more event-driven; more work went into concurrent threading here, with submission queues and event outputs, the kind of thing I was talking about with the I/O buffer in Claude Code. I think they do it a little bit differently. Sandboxing is very different: theirs is more kernel-based, you can see macOS Seatbelt and Linux Landlock here. And state: it's all under threading. Permissions is where it's mostly different, and then the real difference is the model, to be honest. This is actually me using Claude Code to understand how Codex works. You can see we have a few Explore calls; I didn't talk about Explore, but it's another sub-agent type, and as I mentioned, these go in and out. But yeah, this is researching Codex with Claude Code; it's always a fun thing to do.
So let's talk about Amp. This is Sourcegraph's coding agent. It has a free tier; that's just a cool perspective, in my opinion. They leverage excess tokens from providers, and they give ads; we actually ran an ad on them. I'm pro-ad (a lot of people are anti-ad, it's one of my hot takes), but I like it. They don't have a model selector. This is very interesting too; it's its own perspective. It actually helps them move faster, because you have less of an exact expectation of what the output is, since they might be switching models here and there. So that changes how they develop. And then I think their vision is pretty interesting: how do we build not just the best agent, but the agent that works in the most agent-friendly environments? Factory gave a talk similar to this as well: how do you build a hermetically sealed coding repo that the agent can run tests on? How do you build the feedback loop? That's kind of the holy grail; that's how we build an autonomous agent. I'd love to see the front-end version of this: let it look at its own design, make it better, and go back and forth. This is their guiding philosophy, and you could boil it down to the agent perspective, as I've been calling it.
I think they do interesting stuff with context. We're all familiar with compact. It's the worst. You have to wait, and I don't know why it takes so long. If you're not familiar, it summarizes your chat window when the context gets too high and gives you the summary. They have something called handoff instead, which makes me think of Call of Duty, if anyone was a player back in the day: switching weapons is faster than reloading. That's what handoff is: you're just starting a new thread and giving it the information it needs for that new thread. That feels like the winning strategy to me. I could be wrong, and maybe you need both, but that's where they're pushing it, and I kind of like it. They give a very fresh perspective. The second thing is model choice. This is the reasoning knobs and their view on it: they have fast, smart, and Oracle. So they lean even more heavily into "we have different models." We're not telling you what Oracle is (they do tell you), but they're willing to switch what Oracle is, and they're going to use Oracle when there's a very hard problem.
So, that's Amp. Let's go to Cursor's agent, which I think has a very interesting perspective. First, obviously, it's UI-first, not CLI (I think they might have a CLI, I'm not entirely sure, but the UI is the interesting part). It's just so fast. Their new model, Composer, is distilled; they have the data. In my opinion, they made people interested in fine-tuning again. We'd almost never recommend fine-tuning to our customers, but Composer shows that you can actually build defensibility on your data again, which is surprising. I've almost switched completely to Composer since, because it's just so fast. It's almost too fast: I accidentally pushed to master on one of my personal projects, so you don't always want that. But Cursor was the crowd favorite, and I want to give a lot of props to their team. They built iteratively: the first version of Cursor was so bad, and we all used it (I used it because it's a VS Code fork, nothing to lose), and it's gotten so good. It's such a good piece of software and a great team. The same can be said about OpenAI's Codex models: they're not quite as fast, but they're optimized for these coding agents and they're distilled. And I could see OpenAI coming out with a really fast model here, because they also have the data. So here's a picture they put on their blog, and you can see their perspective on coding agents just from the fact that they show you the three models they're running. They're offering Composer, but they're letting you use the state of the art, because they know that maybe GPT-5.1 is better at planning. Here it shows GPT-5, but now we have 5.1.
So here's the big question: which one should we all use? Which architecture is best? What should we do? My opinion here is that benchmarks are pretty useless. Benchmarks have become marketing for a lot of these model providers; every model beats the benchmarks, and I don't know how that happens. But I think there's a world where evals matter here, and the question is what you can eval. This whole simple-while-loop architecture I've been pushing, based on my understanding of it, actually makes things harder to eval: if we're relying more on model flexibility, how do you test it? You could run an integration test, a kind of end-to-end test, and just ask, does it fix the problem? That's one way to do it. You could break it up: do point-in-time snapshots and say, "I'm going to give my chatbot the context from a half-finished conversation where I know it should run a specific tool call," and run those. Or I could just run a back test and ask how often it changes the tools. I think there's also another concept starting to be developed that I'm calling "agent smell": run an agent and see how many times it calls a tool, how many times it retries, how long it takes. These are all surface-level metrics, but they're really good for sanity checking. These things are hard to eval; there's a lot that goes into it, and I'll show you an example of what I did. But on that subject, maybe I'll say one more thing. My mental model is that you can do an end-to-end test, you can do a point-in-time test, or, what I most often recommend, just do a back test. Start with back tests: start capturing historical data and then just rerun it.
So let me give you this example. What I have here is a screenshot of PromptLayer. Our eval product is also just a batch runner, so you can run a bunch of columns through a prompt. But in this case, I'm running them not through a prompt but through Claude Code. I have a headless Claude Code, I'm taking all these model providers, and my headless Claude Code says (I think I have it on the next slide): search the web for the model provider given to you in a variable, find the most recent and largest model released, and return the name. I don't know what it's doing internally; it's doing web search, and I'm not even caring about that. This is an end-to-end test. This is how we kind of try out Claude Code, and I actually think there's a lot to putting Claude Code into your workflows with these headless SDKs; I'll talk about that on the next slide. The main takeaway here is that you can start to do end-to-end tests, look at it from a high level, do a model smell, and then dig into the statistics on each row and see how many times it called a tool.
And going back and we we've talked about this a lot in this talk. rigorous tools.
The tools can be rigorously tested. You
can This is how you offload the deter This is how you offload the determinism to different parts of your model. It's
you test the tools. You you test the out of your tools. Look at them like functions. It's an input and an
like functions. It's an input and an output. If your tools a sub agent that
output. If your tools a sub agent that runs, then we're in a kind of recursion here because then you have to go back and test the end to end thing. But for
your tools, I'll give you this example.
If I so there in my coding agents or my agents in general, my autonomous agents, if there's something very specific that I want to output. So in this case, if I
have a very specific type of email format or type of blog post that I want to write and I really want it to get my voice right, I don't want to rely on the model exploration. I want to actually
model exploration. I want to actually build a tool that I can rigorously test.
So in this case, this is also just a PromptLayer screenshot, but it's a workflow I've built. It has an LLM assertion that checks whether the email meets my standards. If it does, we're done; if it doesn't, it adds the parts it missed, like the header, and revises with the same step. This is obviously a very simple example, but we have another version for some of our SEO blog posts that has like 20 different nodes: it writes an outline from a deep research, then fixes the conclusion and adds links.
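That assert-then-revise pattern is easy to sketch as a plain loop; here `llm` is a stand-in for whatever client you use, and the prompts are placeholders:

```python
def write_email(llm, brief: str, max_revisions: int = 3) -> str:
    """`llm` is assumed to be a callable: prompt string in, text out."""
    draft = llm(f"Write an email in my voice for: {brief}")
    for _ in range(max_revisions):
        # LLM assertion: does the draft meet the standard?
        verdict = llm(
            "Does this email meet my standards (greeting, body, signature)? "
            f"Answer PASS or list what is missing.\n\n{draft}"
        )
        if verdict.strip().startswith("PASS"):
            return draft
        # Revise with the same step, feeding back what was missing.
        draft = llm(f"Revise this email. Fix: {verdict}\n\n{draft}")
    return draft
```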
For the stuff where you have a very specific vision, testing gets so much easier, because this sort of workflow has fewer steps and less flexibility. So this is an eval I made. I start with a bunch of sample emails, I run the agentic workflow on them, and I add a bunch of heuristics. Here it's a very simple LLM-as-judge check: does the email include the three parts I was testing for, the "Hi Jared" greeting, the body, and the signature? You can get a lot more complicated, you could do code execution, but an LLM judge is usually the easiest. Now I can keep running this until it's correct on all of them and watch my eval over time. Just from this example, I got it to 100. So that was fun.
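Before reaching for an LLM judge, a three-part check like this can even start as a deterministic heuristic. A hedged sketch; the patterns are illustrative, not what PromptLayer actually runs:

```python
import re

def grade_email(email: str) -> dict[str, bool]:
    """Cheap heuristic grader for the three expected email parts."""
    lines = [ln.strip() for ln in email.strip().splitlines() if ln.strip()]
    if not lines:
        return {"greeting": False, "body": False, "signature": False}
    return {
        "greeting": bool(re.match(r"(hi|hello|hey)\b", lines[0], re.I)),
        "body": len(lines) >= 3,  # something between greeting and sign-off
        "signature": bool(
            re.search(r"\b(best|regards|thanks)\b", "\n".join(lines[-2:]), re.I)
        ),
    }

sample = "Hi Jared,\n\nQuick update on the launch.\n\nBest,\nSam"
assert all(grade_email(sample).values())
```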
And then I want to add another future-looking thing: keep an eye on the headless Claude Code SDK. I know there was a talk about it this morning, so I won't spend too much time on it, but it's amazing. You just give it a simple prompt and it becomes another part of your pipeline. For example, I have a GitHub Action that updates my docs every day by reading all the commits we've pushed to our other repos. We have a lot of commits going, and it just runs Claude Code: Claude Code pulls down all the repos, checks what's updated, reads our CLAUDE.md to see if it should even update the docs, then creates a PR. I think this unlocks a lot of things, and there's a possibility that we start building agents at a higher order of abstraction, relying on Claude Code and these other agents to do a lot of the harness and orchestration work.
>> Are you reviewing those?
Yeah, it creates a PR. It doesn't merge the PR.
So, here are my takeaways. Number one: trust the model. When in doubt, rely on the model when you're building agents. Number two: simple design wins. One and two kind of go together. Number three: bash is all you need.
Go simple with your tools: don't have 40 tools, have five or ten. Number four: context management matters. This is the boogeyman we're running from all the time in agents at this point. Maybe there'll be new models in the future that are just so much better at context, but there's always going to be a limit. It's like talking to a human: I forget people's names if I meet too many in one day. That's context management, or my stupidity, I don't know. And number five: different perspectives matter in agents. The engineering brain doesn't always comprehend this as much as it should, and I'm an engineer, so I'm also talking about myself. Different perspectives matter in the sense that there are different ways to solve a problem where no one way is better than the others, so you probably want a mixture-of-experts agent. I would love to have mine run Claude Code and Codex and so on, give me the outputs, consider them a team, and maybe have them talk to each other in a Slack-style message channel.
I'm waiting for someone to build that.
That would be great. But these are my takeaways. My bonus thing is to show you how I built this slide deck using Claude Code. I built a Slidev skill: I basically told Claude Code to research how Slidev works, which is the library I made this in. I built a deep-research skill to research all these agents and how they work. And I built a design skill, because I can tell whether something looks terrible or looks good, but I'm not a good enough designer to figure it out myself. Even for these boxes, I was just like, "Oh, make the box a little nicer, give it an accent color." So yeah, this is how I built it. But again, thank you for listening. Happy to answer any questions. I'm Jared, founder of PromptLayer. Find me there.
>> Yes.
>> Thank you. Great talk. Um so you mentioned u regarding DAGs basically like let's get rid of them right but DAGs kind of enforc this like sequential
uh execution right pass I don't know customer service like agent asks the name email right like in some sort of uh
sequence um so are you saying just write this out um like this is now this should be uh just written out as a
plan for an agent to execute and just trust that the model is going to be calling those tools in that sequence like how do we enforce uh the order?
>> Right? So the question was why do I keep talking about getting rid of DAGs? How else am are you supposed to
of DAGs? How else am are you supposed to enforce a specific order for solving a problem? So I think there are different
problem? So I think there are different types of problems. So the problem of building a general purpose coding agent that we can all use to do our work and
even non-technical people can use there's no specific step to solving that problem which is why it's better to rely on the model. If your problem was
to build let's say a travel itinerary it's more of a specific step because you have a deliverable that's always the
same. So there's a little bit more of a
same. So there's a little bit more of a DAG that could matter, but in the research step of traveling, you probably don't want a DAG because every city is going to be different. So it really depends on the problem you're solving. I
would if I wanted to make an agent for a travel itinerary, I'd probably have my tool call would one of my tool calls be a DAG of creating the output file because I want the output to look the
same or creating the plan. And then in the system problem, I could say always end with the output for example. But
you need to mix and match. There's a
every use case is different, but if you want to make something general purpose, my take is to rely more on the model on simple loops and less on a DAG.
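One way to picture that mix: keep the loop model-driven, and make the one thing that must always look the same a deterministic tool. A sketch with hypothetical names:

```python
def render_itinerary(city: str, days: list[dict]) -> str:
    """Deterministic tool: the deliverable always has the same shape."""
    lines = [f"# {city} itinerary"]
    for i, day in enumerate(days, start=1):
        lines.append(f"## Day {i}: {day['theme']}")
        lines.extend(f"- {stop}" for stop in day["stops"])
    return "\n".join(lines)

# The research loop stays model-driven; only the final artifact is pinned down.
TOOLS = {"render_itinerary": render_itinerary}
SYSTEM_PROMPT = (
    "Research the trip however you like, but always end by calling "
    "render_itinerary exactly once with the final plan."
)
```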
>> Cool. Any other questions?
Yes.
>> Yeah. Building on that point, like do you think we're heading towards a world where most of you're not actually going to call the API through code and that most LM calls are by triggering cloud
code and just write just writing the files instead?
So the question is: are we going to move away from calling models directly and just call something like a headless Claude Code, right?
>> Yeah. Like if I had a like I have a pipeline that does one lm call per document, summarizes it at the end. You
could make a while loop cloud code that saves a file every time. You never call the API besides using cloud code in in a while loop
>> potentially. Uh, I'll give you the pro
>> potentially. Uh, I'll give you the pro and the con there.
>> Yeah, >> the pro is it's easier to develop and we can kind of rely on the frontier. I
mean, if you think about it, a reasoning model is just that. The reasoning models didn't always exist. We just had normal LM model and then oh, now we have 01 and reasoning models. All that is is a I
reasoning models. All that is is a I mean, it's a little more complicated than this, but it's basically just a while loop on OpenAI servers that keeps running the context and then eventually gives you the output. in the same way
that cloud code SDK is a while loop with a bunch of more things. So I could totally see a lot of builders only touching these agentic endpoints. Maybe
even seeing a model provider release a model as a agentic endpoint. But for a lot of tasks, you're going to want a little bit more control. And they're pro
and probably you'd still want to go as close to metal as possible. Having said
that, there's there was a lot of people who still wanted completions models and that never happened and nobody really talks about that anymore. So, it's very likely that everything just becomes this
SDK, but I don't have a crystal ball, but those are those are how I I would think about it.
>> Yes, >> thanks for the talk. Um, I know you said the simpler the better, but um, what's your thoughts about test during development, spec during development in
AI? Have you tried it? What is it about
AI? Have you tried it? What is it about >> for building agents or for getting work done?
>> For coding.
>> Okay. So the question on spec driven development, test-driven development for coding with agents.
When in doubt, go back to good engineering practices is what I would say. So it if you
would say. So it if you and there's there's whole engineering debates on if test-driven development is the right way and some people swear by it and some people don't. So I don't
think there's an answer. I think coding agents clearly test-driven development makes it easier. I think as I was showing you that's AMP's source graphs whole philosophy that if you can build
good tests and factory I think thinks this as well. If you could build good tests your coding agent can work much better. So it makes sense to me when I'm
better. So it makes sense to me when I'm working personally I rely pretty heavily on the planning phase and the spectr in development phase and I think the simpler tasks are pretty easy for the
model but if I'm doing a very simple edit I'll skip that step. So no
oneizefits-all but return to the engineering principles that you believe when in doubt I'd say yes.
>> So earlier you talked about about system rock leaks is possible to just look at the u downloads bundle or they have a special
end point that has prompts behind endpoint.
>> Yeah. Uh I think I think they hide it. I
think they hide it. There was a there was actually an interesting article someone because codeex is open source they before openai released the codeex model
that it was using they were able to hack together the open source codeex to give a custom prompt to the model and be able to use the model without it. So yeah you can dive into it but generally it's
tried to be hidden and also laziness of someone posted it. So there you go that's the work but someone had to have found it right. like is this problem
somewhere on your machine?
>> I actually don't know that answer.
>> Do you know that answer?
>> Yeah.
>> Yes.
>> It's on your machine. Nico says it's on your machine. So there we go. So maybe
your machine. So there we go. So maybe
the prompt I was looking at is a little bit old and I have to update it. But the
s but uh the question was does uh is the prompt hidden on their servers or can you find it if you are so determined?
And the answer seems to be yes. Any
other questions?
>> Yes.
>> Is this the last one?
>> Is this the last question?
>> It can be.
>> Can you talk about prompt layer and how can people help you?
>> Yes, that's a good one. I forgot about that. Thank you.
that. Thank you.
So yeah, number one, we're hiring. If you're looking for coding jobs at a very fun and fast-moving team in New York, you can reach out to me on X or email jared@promptlayer.com.
We're based in New York. We're a platform for building and testing AI products: prompt management, auditability, governance, all that fun stuff, but also logging and evals. Those screenshots I showed you came from PromptLayer. If you're building an AI application with a team, you should probably try PromptLayer; it'll make your life easier. Especially the bigger your team is, the more you want to collaborate with PMs and non-technical users, or even if you're all technical users, it's a great tool. It'll make your life better. Highly recommend it: promptlayer.com, and it's easy to get started.
And that was my show.
Thank you for listening.