
Prompting for Agents | Code w/ Claude

By Anthropic

Summary

## Key takeaways

- **Agents: models using tools in a loop**: Agents are defined as models that use tools in a continuous loop, taking a task and independently working toward its completion, updating decisions based on tool-call feedback. [01:29]
- **Agent use cases: complex, valuable tasks**: Agents are best suited for complex and valuable tasks, not for every scenario, as using them inappropriately can lead to wasted resources and suboptimal results. [02:22]
- **Think like your agent**: To effectively prompt agents, it's crucial to develop a mental model of their environment and actions, simulating their process to identify potential confusion or errors. [07:28]
- **Reasonable heuristics for agents**: Provide agents with clear, reasonable heuristics, such as "stop searching when the answer is found" or budgets for tool calls, to guide their behavior and prevent unnecessary actions. [08:11]
- **Iterative prompt development**: Prompt development for agents should start simple and iterate based on observed failures or edge cases, gradually adding instructions and examples as needed for production consistency. [26:37]
- **LLMs as judges for evaluation**: Leveraging LLMs with a clear rubric as judges can effectively evaluate agent outputs, offering robustness to variations in structure and content, which is crucial for complex agentic tasks. [22:40]

Topics Covered

  • Don't Deploy Agents Everywhere: When Are They Truly Valuable?
  • Master Agent Prompting: Think Like Your Agent, Not Just Words.
  • Agents Are Unpredictable: Guide Thinking, Anticipate Side Effects.
  • Extend Agent Context: Use Compaction, Files, and Sub-Agents.
  • Evaluating Agents: Start Small, Use LLMs as Judges.

Full Transcript

All right, thank you. Thank you everyone for  joining us. Uh, so we're picking up with prompting  

for agents. Um, hopefully you were here for prompting 101, or maybe you're just joining us. Uh, but I'll give a little intro. My name is Hannah. I'm part of the applied AI team at Anthropic. Hi,

I'm Jeremy. I'm on our applied AI team as well  and I'm a product engineer. Uh, so we're going  

to talk about prompting for agents. So, we're  going to switch gears a little bit, move on from  

the basics of prompting, um, and talk about how  we do this for agents like playing Pokemon. Uh,  

so hopefully you were here, uh, for prompting  101 or maybe you have some familiarity with  

basic prompting. So, we're not going to go over  um the really kind of basic console prompting or  

interacting with Claude in the desktop app today. But just a refresher, uh, we think about prompt engineering as kind of programming in natural language. You're thinking about what your agent

or your model is going to be doing, what kind  of tasks it's accomplishing. You're trying to  

clearly communicate to the agent, give examples  where necessary, um, and give guidelines. Uh,  

we do, you know, follow kind of a very specific  structure for console prompting. I want you to  

remove this from your mind because it could look  very different for an agent. So, for an agent,  

you may not be laying out this type of very  structured prompt. Uh, it's actually going to  

look a lot different. We're going to allow  a lot of different things to come in. So,  

I'm going to talk about what agents are, and then I'll turn it over to

Jeremy to talk about how we do this for agents.  So, hopefully you have a sense in your mind of  

what an agent is. At Anthropic, we like to say that agents are models using tools in a loop. So, we give the agent a task and we allow it to work continuously and use tools as it thinks fit, update its decisions based on the information that it's getting back from its tool calls, and continue working independently until it completes the task. We kind of keep it as simple as that. Um, the environment is where the agent is working, the tools are what the agent has, and the system prompt is just where we tell the agent what it should be doing or what it should be accomplishing. And we typically find the simpler you can keep this, the better. Allow the agent to do its work. Allow the model to be the model and kind of work through this task.
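To make "models using tools in a loop" concrete, here is a minimal sketch of that loop with the Anthropic Python SDK. The model id, the single web_search tool, and its stubbed implementation are illustrative assumptions, not something shown in the talk.

```python
# Minimal sketch of an agent: a model using tools in a loop.
import anthropic

client = anthropic.Anthropic()

# One illustrative tool; a real agent would usually have several.
tools = [{
    "name": "web_search",
    "description": "Search the web and return a short list of result snippets.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "The search query."}},
        "required": ["query"],
    },
}]

def run_tool(name: str, tool_input: dict) -> str:
    """Stand-in for real tool implementations."""
    if name == "web_search":
        return f"(pretend search results for: {tool_input['query']})"
    return f"Unknown tool: {name}"

messages = [{"role": "user", "content": "How many bananas can fit in a Rivian R1S?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=2048,
        system="You are a research agent. Use the tools provided and stop once you have the answer.",
        tools=tools,
        messages=messages,
    )
    # Keep the assistant turn (text and/or tool_use blocks) in the transcript.
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # the model decided it is done: no more tool calls
    # Execute every requested tool call and feed the results back as a user turn.
    messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": block.id, "content": run_tool(block.name, block.input)}
        for block in response.content if block.type == "tool_use"
    ]})

print("".join(block.text for block in response.content if block.type == "text"))
```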

So when do you use agents? You do not always need to use an agent. In fact, there are many scenarios in which you won't

actually want to use an agent. There are other  approaches that would be more appropriate. Um,  

agents are really best for complex and  valuable tasks. It's not something you  

should deploy in every possible scenario. You  will not get the results that you want. Um,  

and you'll spend a lot more resources than  you maybe need to. So, we'll talk a little  

bit about checklists, or kind of ways of thinking about when you should be using an agent and when maybe you don't want to be using an agent. So, is the task complex? Is this a task that you, a human,

can think through a step-by-step process to  complete? If so, you probably don't need an  

agent. You want to use an agent where it's not  clear to you how you'll go about accomplishing the  

task. You might know where you want to go, but you  don't know exactly how you're going to get there,  

what tools, and what information you might need to arrive at the end state. Is the task valuable?

Are you going to get a lot of value out of the  agent accomplishing this task? Or is this a kind  

of a low value uh task or workflow? In that case,  a workflow might also be better. You don't really  

want to be using the resources of an agent unless this is something that's highly leveraged.

It's maybe revenue generating. It's something  that's really valuable to your user. Again,  

it's something that's complex. Uh, the next piece is: are the parts of the task doable? So, when you think about the task that has to occur, would you be able to give the agent the tools

that it needs in order to accomplish this task?  If you can't define the tools or if you can't  

give the agent access to the information or  the tool that it would need, you may want to  

scope the task down. Um, if you can define and  give to the agent the tools that it would want,  

that's a better use case for an agent. The last  thing you might want to think about is the cost of  

errors or how easy it is to discover errors. So,  if it's really uh difficult to correct an error or  

detect an error, that is maybe not a place where  you want the agent to be working independently.  

you might want to have a human in the loop in that case. If the error is something that

you can recover from or if it's not too costly to  have an error occurring, then you might continue  

to allow the agent to work independently. So to  make this a little bit more real, uh we'll talk  

about a few examples. I'm not going to go through every single one of these, but let's pick out a few

that will be pretty clear or intuitive for most of  us. So coding, obviously, um all of you are very  

familiar with using agents and coding. Uh coding  is a great use case. We can think about something  

uh like a design document. And although you know  where you want to get to, which is raising a PR,  

you don't know exactly how you're going to get  there. It's not clear to you what you'll build  

first, how you'll iterate on that, what changes  you might make along the way depending on what  

you find. Um, this is high value. You're all very skilled. If an agent, okay, if an agent is able... this is more like what the midway is like at night. I feel more at home now. Um, uh, Claude is great at coding. Um, and this is a high value use case, right? If your agent is actually able to go from a design document to a PR, that saves you, a highly skilled engineer, a lot of time, and you're able to then spend your time on something else that's

higher leverage. So, great use case for agents.  A couple other examples I'll mention here. Um,  

maybe we'll talk about the cost of errors. So, search: if we make an error in the search,

there's ways that we can correct that, right? So  we can use citations, we can use other methods of  

double-checking the results. So if the agent makes  a mistake in the search process, this is something  

we can recover from and it's probably not too  costly. Computer use, um, this is also a place  

where we can recover from errors. We might just go  back, we might try clicking again. It's not, uh,  

too difficult to allow Claude just to click a few  times until it's able to use the tool properly.  

Um, data analysis, I think, is another interesting  example, kind of analogous to coding. We might  

know uh the end result that we want to get to.  We know a set of insights that we want to gather  

out of data or a visualization that we want to  produce from data. We don't know exactly what the  

data might look like. Uh so the data could have  different formats. It could have errors in it.  

It could have granularity issues that we're not sure how to disaggregate. We

don't know the exact process that we're going to  take in analyzing that data, but we know where we  

want to get in the end. Um so this is another  example of a great use case for agents. Uh,  

so hopefully these make sense to you and I'm going  to turn it over to Jeremy now. He has some really  

rich experience building agents and he's going to  share some best practices for actually prompting  

them well and how to structure a great prompt  for an agent. Thanks Hannah. Hi all. Um, yeah,  

so prompting for agents. Um, I think some things that we think about here, I'll go over a few of them. We've learned these lessons mostly from building agents ourselves. So some agents that you can try from Anthropic are Claude Code, which works in your terminal and sort of agentically browses your

files and uses the bash tool to really accomplish  tasks um in coding. Similarly we have our new  

advanced research feature in claude.ai, and this allows you to do hours of research. For example,

you can find hundreds of startups building agents  or you can find hundreds of potential prospects  

for your company. And this allows the model to  do research across your tools, your Google Drive,  

web search, and stuff like that. And so in the process of building these products, one thing that we learned is that you need to think like your agents. This is maybe the most important

principle. Um the idea is that essentially you  need to understand and develop a mental model  

of what your agent is doing and what it's like to  be in that environment. So the environment for the  

agent is a set of tools and the responses it gets back from those tools. In the context of Claude Code, the way you might do this is by actually simulating the process and just imagining: if you were in Claude Code's shoes, given the exact tool descriptions it has and the tool schemas it has, would you be confused, or would you be able to do the task that it's doing? If a human can't

understand what your agent should be doing, then  an AI will not be able to either. And so this is  

really important for thinking about tool design and thinking about prompting: simulate and go through the agent's environment. Another principle is that you need to give your agents reasonable heuristics.

And so, you know, Hannah mentioned that prompt  engineering is conceptual engineering. What does  

that really mean? It's one of the reasons why  prompt engineering is not going away and why I  

personally expect prompting to get more important,  not less important as models get smarter. This is  

because prompting is not just about text. It's not  just about the words that you give the model. It's  

about deciding what concepts the model should have  and what behaviors it should follow to perform  

well in a specific environment. So for example, Claude Code has the concept of irreversibility.

It should not take irreversible actions that  might harm the user or harm their environment.  

So it will avoid these kinds of harmful actions  or anything that might cause irreversible damage  

to your environment or to your code or anything  like that. So that concept of irreversibility is  

something that you need to instill in the model  and be very clear about and think about the edge  

cases. How might the model misinterpret this concept? How might it not know what it

means? For example, if you want the model to be  very eager and you want it to be very agentic,  

well, it might go over the top a little bit. It  might misinterpret what you're saying and do more  

than what you expect. And so, you have to be very  crisp and clear about the concepts you're giving  

the models. Um, some examples of these reasonable heuristics that we've learned: one is that while we were building research, we noticed that the model would often do a ton of web searches when it was unnecessary. For example, it would find the actual answer it needed, like maybe it would find a list of scaleups in the United States, and then it would keep going even though it already had the answer. And that's because we hadn't told the model explicitly: when you find the answer, you can stop; you no longer need to keep searching. Uh, similarly, we had to give the model sort of budgets to think about. For example, we told it that for simple queries it should use under five tool calls, but for more complex queries it might use up to 10 or 15. So these kinds of heuristics that you might assume the model already understands, you really have to articulate clearly. A good way to think about this is: if you're managing maybe a new intern who's fresh out of college and has not had a job before, how would you articulate to them how to get around all the problems they might run into in their first job? And how would you be very crisp and clear with them about how to accomplish that? That's often how you should think about giving heuristics to your agents, which are just general principles that the agent should follow. They may not be strict rules, but they're, you know, sort of practices.
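Written down, heuristics like these end up as a few plain sentences in the system prompt. Here is a sketch of what that fragment might look like; the exact wording and budgets are illustrative assumptions, not Anthropic's production research prompt.

```python
# Illustrative system-prompt fragment encoding search heuristics.
# The wording and budgets below are assumptions for illustration only.
SEARCH_HEURISTICS = """
Search heuristics:
- Before searching, decide how complex the query is.
- For simple queries, use fewer than 5 tool calls; for complex queries, up to 10-15.
- As soon as you have found the answer, stop searching. Do not keep issuing
  searches to confirm an answer you already have.
- If an ideal source does not seem to exist, stop after a few attempts and answer
  with the best available information, noting any remaining uncertainty.
"""
```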

Another point is that tool selection is key. So as models get more powerful, they're able to handle more and more tools. Sonnet 4 and Opus 4 can handle, you know, up to a hundred tools, even more than that if you have great prompting. But in order

to use these tools you have to be clear about  which tools it should use for different tasks.  

So for example for research we can give the model  access to Google Drive. We can give it access to  

MCP tools like Sentry or Datadog or GitHub. It can search across all these tools, but the model

doesn't know already which tools are important  for which tasks. Especially in your specific  

company context. For example, if your company uses  Slack a lot, maybe it should default to searching  

Slack for company related information. All these  questions about how the model should use tools,  

you have to give it explicit principles about  when to use which tools and in which contexts. Um,  

and this is really important and it's often  something I see where people don't prompt the  

agent at all about which tool to use and they  just give the model some tools with some very  

short descriptions and then they wonder like  why isn't the model using the right tool? Well,  

it's likely because the model doesn't know what  it should be doing in that context. Another point  

here is that you can guide the thinking process. So people often sort of turn extended thinking on and then let their agents run and assume it will get better performance out of the box. Actually, that assumption is true: most of the time you will get better performance out of the box, but you can squeeze even more performance out of it if you just prompt the agent to use its thinking well. So for example, for search, what we do is tell the model to plan out its search process. So in advance, it should decide: how complicated is this query? How many tool calls should I use here? What sources should I look for? How will I know when I'm successful? We tell it to plan out all these exact things in its first thinking block. And then a new capability that the Claude 4 models have is the ability to use interleaved thinking between tool calls. So after getting results from the web, we often find that models assume that all web search results are true, right? They don't have any... you know, we haven't told them explicitly that this isn't the case. And so they might take these web results and run with them immediately. So, one thing we prompted our models to do is to use this interleaved thinking to really reflect on the quality of the search results and decide if they need to verify them, if they need to get more information, or if they should add a disclaimer about how the results might not be accurate.
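As a sketch of what that looks like at the API level: enable extended thinking, and use the system prompt to say how the thinking should be used. The model id and the interleaved-thinking beta flag below are assumptions based on Anthropic's public docs, not something stated in the talk, so check the current documentation for the exact names.

```python
# Sketch: enabling extended thinking and nudging the model to plan in its
# first thinking block. Model id and beta flag are assumptions.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",           # assumed model id
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4000},
    betas=["interleaved-thinking-2025-05-14"],  # assumed beta flag for thinking between tool calls
    system=(
        "You are a research agent. In your first thinking block, plan the search: "
        "decide how complex the query is, how many tool calls to budget, which "
        "sources to prefer, and how you will know you are done. After each tool "
        "result, use thinking to judge the quality of the result before trusting it."
    ),
    tools=[{
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}},
                         "required": ["query"]},
    }],
    messages=[{"role": "user", "content": "How many bananas can fit in a Rivian R1S?"}],
)

for block in response.content:
    if block.type == "thinking":
        print(block.thinking)  # the plan the model made before acting
```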

Um, another point when prompting agents is that agents are more unpredictable than workflows, or just, you know, classification-type prompts.

Most changes will have unintended side effects.  This is because agents will operate in a loop  

autonomously. And so for example, if you tell the  agent, you know, keep searching until you find the  

correct answer, you know, find the highest quality  possible source and always keep searching until  

you find that source. What you might run into is  the unintended side effect of the agent just not  

finding any sources. Maybe this perfect source doesn't exist for the query. And so it

will just keep searching until it hits its context  window. And that's actually what we ran into as  

well. And so you have to tell the agent if you  don't find the perfect source, that's okay. You  

can stop after a few tool calls. Um, so just be  aware that your prompts may have unintended side  

effects and you may have to roll those back.  Another point is to help the agent manage its  

context window. The Cloud 4 models have a 200k  token context window. Um, this is long enough for  

a lot of longrunning tasks, but when you're using  an agent to do work autonomously, you may hit this  

context window and there are several strategies  you can use to sort of extend the effective  

context window. One of them that we use for cloud  code is called compaction. And this is just a tool  

that the model has um that will automatically be  called once it hits around 190,000 tokens. So near  

the context window. And this will summarize  or compress everything in the context window  

to a really dense but accurate summary that is  then passed to a new instance of claude with the  

summary. And it continues the process. And we find  that this essentially allows you to run infinitely  

with cloud code. You almost never run out of  context. um occasionally it will miss details  

from the previous session but the vast majority of  the time this will keep all the important details  

and the model will sort of remember what happened  in the last session. Similarly you can sort of  

write to an external file. So the model can have  access to an extra file and these cloud for models  

are especially good at writing memory to a file  and they can use this file to essentially extend  

their context window. Another point is that you  can use sub aents. Um, we won't talk about this  
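A rough sketch of what a compaction step can look like, assuming the agent keeps its transcript as a `messages` list. The token threshold, summarizer prompt, and model id are assumptions for illustration, not Claude Code's actual implementation.

```python
# Sketch of compaction: when the transcript nears the context window, compress
# it into a dense summary and start a fresh transcript seeded with that summary.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"   # assumed model id
COMPACT_AT_TOKENS = 190_000          # illustrative threshold near the 200k window

def maybe_compact(messages: list, system: str) -> list:
    """Return `messages` unchanged, or a fresh transcript seeded with a summary."""
    used = client.messages.count_tokens(
        model=MODEL, system=system, messages=messages
    ).input_tokens
    if used < COMPACT_AT_TOKENS:
        return messages

    # Flatten the transcript to plain text for the summarizer (tool blocks elided).
    transcript = "\n".join(
        f"{m['role']}: {m['content'] if isinstance(m['content'], str) else '[tool-use blocks]'}"
        for m in messages
    )
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        system="You compress agent sessions into dense but accurate summaries: "
               "goals, decisions made, key results, and what remains to be done.",
        messages=[{"role": "user", "content": "Summarize this session:\n\n" + transcript}],
    )
    summary_text = "".join(b.text for b in summary.content if b.type == "text")
    return [{"role": "user", "content": "Summary of the session so far:\n" + summary_text}]
```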

Another point is that you can use sub-agents. Um, we won't talk about this a lot here, but essentially if you have agents that are always hitting their context windows, you

may delegate some of what the agent is doing to  another agent. Um, which can sort of, for example,  

you can have one agent be the lead agent and then sub-agents do the actual searching process. Then the sub-agents can compress the results for the lead agent into a really dense form that doesn't

use as many tokens and the lead agent can give the  final report to the user. So we actually use this  

process in our research system and this allows you  to sort of compress what's going on in the search  

and then only use the lead agent's context window for actually writing the report. So this kind of multi-agent system can be effective for limiting the context window.
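Here is a simplified sketch of that delegation pattern: sub-agents spend their own context on a subtask and return dense digests, and the lead agent writes the report from the digests. The model id, prompts, and the fact that the sub-agents here don't call tools are all simplifying assumptions.

```python
# Sketch of lead-agent / sub-agent delegation to keep the lead context small.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"   # assumed model id

def run_subagent(subtask: str) -> str:
    """Each sub-agent burns its own context on a subtask and returns a dense digest."""
    result = client.messages.create(
        model=MODEL,
        max_tokens=1500,
        system="You are a research sub-agent. Investigate the subtask and reply with "
               "a dense digest of findings and sources only.",
        messages=[{"role": "user", "content": subtask}],
    )
    return "".join(b.text for b in result.content if b.type == "text")

def lead_agent(task: str, subtasks: list[str]) -> str:
    digests = [run_subagent(s) for s in subtasks]   # could also run in parallel
    report = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        system="You are the lead research agent. Write the final report from the digests.",
        messages=[{"role": "user",
                   "content": f"Task: {task}\n\nSub-agent digests:\n" + "\n\n".join(digests)}],
    )
    return "".join(b.text for b in report.content if b.type == "text")
```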

Finally, you can let Claude be Claude. And essentially what this means is that Claude is great at being

an agent already. You don't have to do a ton of  work at the very beginning. So, I would recommend  

just trying out your system with sort of a bare-bones prompt and bare-bones tools and seeing where it goes wrong, and then working from there. Don't sort of assume that Claude can't do it ahead of time, because Claude often will surprise you with how good it is. Um, I talked already about tool

design, but essentially the key point here is you  want to make sure that your tools are good. Um,  

what is a good tool? It will have a simple  accurate tool name that reflects what it does.  

You'll have tested it and made sure that it works well. Um, it'll have a well-formed description, so that a human reading this tool... like, imagine you give a function to another engineer on your team: would they understand this function and be able to use it? You should ask the same question about the agent-computer interfaces, or the tools, that you are giving your agent. Make sure that

they're usable and clear. Um we also often find  that people will give an agent a bunch of tools  

that have very similar names or descriptions.  So for example, you give it six search tools  

and each of the search tools searches a slightly  different database. This will confuse the model.  

So try to keep your tools fairly distinct  um and combine similar tools into just one.  
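To make "a good tool" a bit more concrete, here is a sketch of a single, consolidated search tool in the Anthropic tools format: one clearly named tool with a `source` parameter, instead of six near-identical search tools. The tool name, parameters, and descriptions are made-up examples of the style, not a real tool from Claude Code or research.

```python
# Sketch of a tool definition written the way you'd document a function for a
# teammate. Names and fields are illustrative assumptions.
search_tool = {
    "name": "search_knowledge_base",
    "description": (
        "Search the company's internal knowledge bases and return the most relevant "
        "passages. Use this for questions about internal docs, policies, or past "
        "decisions. Do not use it for public web information."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "What to search for, in plain language."},
            "source": {
                "type": "string",
                "enum": ["wiki", "tickets", "slack"],
                "description": "Which knowledge base to search. Defaults to 'wiki'.",
            },
            "max_results": {"type": "integer", "description": "How many passages to return (1-20)."},
        },
        "required": ["query"],
    },
}
```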

So, one quick example here is just that you can  have an agent, for example, use these different  

tools to first search the inventory in a database,  run a query. Based on the information it finds, it  

can reflect on the inventory, think about it for  a little bit, then decide to generate an invoice,  

generate this invoice, think about what it should  do next, and then decide to send an email. And so,  

this loop involves the agent getting information  from the database, which is its external  

environment, using its tools, and then updating based on that information, until it accomplishes

the task. And that's sort of how agents work  in general. So, let's walk through a demo real  

quick. I'll switch to my computer. Um, so you can  see here that this is our console. The console is  

a great tool for sort of simulating your prompts  and seeing what they would look like in a UI. Um,  

and I used this while we were iterating on research to sort of understand what's really going on and what the agent's doing. This is a great way to think like your agents and sort of put yourself

in their shoes. So, you can see we have a big  prompt here. Um, it's not sort of super long.  

It's around a thousand tokens. It involves the researcher going through a research process. We tell it exactly what it should plan ahead of time. We tell it how many tool

calls it should typically use. We give it some  guidelines about what facts it should think about,  

what makes a high quality source, stuff like  that. And then we tell it to use parallel tool  

calls. So, you know, run multiple web searches in  parallel at the same time rather than running them  

all sequentially. Then we give it this question.  How many bananas can fit in a Rivian R1S? This  

is not a question that the model will be able  to answer because the Rivian R1S came out very  

recently. It's a car. It doesn't know in advance  all the specifications and everything. So, it'll  

have to search the web. Let's run it and see what  happens. You'll see that at the very beginning,  

it will think and break down this request. And  so, it realizes, okay, web search is going to  

be helpful here. I should get cargo capacity.  I should search. Um, woo. Um, and you see here  

it ran two web searches in parallel at the same  time. That allowed it to get these results back  

very quickly. And then it's reflecting on the  results. So it's realizing, okay, I found the  

banana dimensions. I know that the USDA identifies bananas as 7 to 8 inches long. I need to run another

web search. Let me convert these to more standard  measurements. You can see it's using tool calls  

interleaved with thinking, which is something new that the Claude 4 models can do. Finally, it's running some calculations about how many bananas could be packed into the cargo space of the truck. And it's running a few more web searches. You can see here that this is a fairly


approximately 48,000 bananas. I've seen the model estimate anything between 30,000 and 50,000. I think the right answer is around 30,000. So this is roughly correct. Um, going back to the slides, I think that, you know, this sort of approach of testing out your prompt, seeing what tools the model calls, reading its thinking blocks, and actually seeing how the model's thinking, will often make it really obvious what the issues are and what's going wrong. So you'll

test it out and you'll just see like okay  maybe the model's using too many tools here,  

maybe it's using the wrong sources or maybe  it's just following the wrong guidelines. Um  

so this is a really helpful way to sort of think  like your agents and make them more concrete.

Um switching back to the slides.

Okay, so evaluations are really important for any system. Um, they're really important

for systematically measuring whether you're  making progress in your prompt. Very quickly,  

you'll notice that it's difficult to really make  progress on a prompt if you don't have an eval  

that tells you meaningfully whether your prompt is  getting better and whether your system is getting  

better. But evals are much more difficult for agents. Um, agents are long-running. They do a bunch of things. They may not always have a predictable process. Classification is easier to eval because you can just check: did it classify this output correctly? But agents are harder. So a few tips to make this a bit easier. One is that the larger the effect size, the smaller the sample size you need. Um, and so this is sort of just a principle from science in general, where if an effect size is very large, for example if a medication will cure people immediately, you don't really need a large sample size of a ton of people to know that this treatment is having an effect. Similarly, when you change a prompt, if it's really obvious that the system is

getting better, you don't need a large eval. I  often see teams think that they need to set up  

a huge eval of like hundreds of test cases and  make it completely automated when they're just  

starting out building an agent. This is a failure mode and it's an anti-pattern. You should start out

with a very small eval and just run it and see  what happens. You can even start out manually. Um,  

but the important thing is to just get started.  I often see teams delaying evals because they  

think that they're so intimidating or that they  need such a sort of intense eval to really get  

some signal, but you can get great signal from  a small number of test cases. You just want to  

keep those test cases consistent and then keep testing them so you know whether the model and

the prompt is getting better. You also want to  use realistic tasks. So don't just sort of come  

up with arbitrary prompts or descriptions  or tasks that don't really have any real  

correlation to what your system will be doing.  For example, if you're working on coding tasks,  

you won't want to give the model just competitive programming problems, because this is

not what real world coding is like. You'll want to  give it realistic tasks that really reflect what  

your agent will be doing. Similarly, in finance,  you'll want to sort of take tasks that real people  

are trying to solve and just use them to evaluate  whether the model can do those. This allows you  

to really measure whether the model is getting  better at the tasks that you care about. Another  

point is that LLM-as-judge is really powerful, especially when you give it a rubric. So agents

will have lots of different kinds of outputs.  For example, if you're using them for search,  

they might have tons of different kinds of search reports with different kinds of structure. But LLMs are great at handling lots of different kinds of structure and text with different characteristics.

And so one thing that we've done, for example,  is given the model just a clear rubric and then  

ask it to evaluate the output of the agent.  For example, for search tasks, we might give  

it a rubric that says, check that the model,  you know, um, looked at the right sources,  

check that it got the correct answer. In this  case, we might say, um, check that the model  

guessed that the number of bananas that can fit in a Rivian R1S is between, like, 10,000 and 50,000.

Anything outside that range is not realistic. So,  you know, you can use things like that to sort of  

benchmark whether the model is getting the right  answers, whether it's following the right process.  
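Here is a small sketch of what an LLM-as-judge check with a rubric can look like for the banana question. The rubric items, the PASS/FAIL output format, and the model id are illustrative assumptions, not Anthropic's actual eval harness.

```python
# Sketch of grading an agent's final report with an LLM judge and a rubric.
import anthropic

client = anthropic.Anthropic()

RUBRIC = """
Grade the agent's final report against this rubric. Answer PASS or FAIL for each item:
1. Cites at least two distinct, credible sources.
2. States a final estimate for the number of bananas that fit in a Rivian R1S.
3. The estimate is between 10,000 and 50,000.
4. Flags any uncertainty or assumptions it made.
Finish with one line: OVERALL: PASS or OVERALL: FAIL.
"""

def judge(agent_output: str) -> bool:
    verdict = client.messages.create(
        model="claude-sonnet-4-20250514",   # assumed model id
        max_tokens=500,
        system="You are a strict grader. Follow the rubric exactly.",
        messages=[{"role": "user", "content": RUBRIC + "\n\nAgent report:\n" + agent_output}],
    )
    text = "".join(b.text for b in verdict.content if b.type == "text")
    return "OVERALL: PASS" in text
```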

At the end of the day though, nothing is a perfect  replacement for human evals. You need to test the  

system manually. You need to see what it's doing.  You need to sort of look at the transcripts, look  

at what the model is doing, and sort of understand  your system if you want to make progress on it.  

Here are some examples of evals. So one example that I sort of showed, uh, talked about, is answer

accuracy. And this is where you just use an LLM  as judge to judge whether the answer is accurate.  

So for example in this case you might say the  agent needs to use a tool to query the number  

of employees and then report the answer and then  you know the number of employees at your company.  

So you can just check that with an LLM as judge. The reason you use an LLM as judge here is because it's more robust to variations. For example, if you're just checking for the integer 47 in the output, that is not very robust, and if the model writes forty-seven as text you'll grade it incorrectly. So you want to use an LLM as judge there to be robust to those minor variations.

Another way you can eval agents is tool use  accuracy. Agents involve using tools in a  

loop. And so if you know in advance what tools  the model should use or how it should use them,  

you can just evaluate if it used the correct tools  in the process. For example, in this case, I might  

evaluate the agent should use web search at least  five times to answer this question. And so I could  

just check in the transcript programmatically did  the tool call for web search appear five times or  

not. Similarly, you might check that, in response to the question "book a flight," the agent uses the search flights tool, and you can just check that programmatically, and this allows you to make sure that the right tools are being used at the right times.
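A check like that can be a few lines over a saved transcript. This sketch assumes the transcript is stored as a list of Messages-API turns as plain dicts; the five-search threshold is just the example from the talk.

```python
# Sketch of a programmatic tool-use check over a saved agent transcript.
def count_tool_calls(transcript: list[dict], tool_name: str) -> int:
    """Count how many times the agent called a given tool."""
    count = 0
    for message in transcript:
        if message["role"] != "assistant" or isinstance(message["content"], str):
            continue
        for block in message["content"]:
            if block.get("type") == "tool_use" and block.get("name") == tool_name:
                count += 1
    return count

# Example check for the "use web search at least five times" eval:
# assert count_tool_calls(transcript, "web_search") >= 5
```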

Finally, a really good eval for agents is τ-bench (tau-bench). You can sort of look this up. τ-bench is a sort of open

source benchmark that shows that you can evaluate  whether agents reach the correct final state.  

So a lot of agents are sort of modifying a  database or interacting with a user in a way  

where you can say the model should always get to  this state at the end of the process. For example,  

if your agent is a customer service agent for  airlines and the user asks to change their flight  

at the end of the agentic process in response to  that prompt, it should have changed the flight in  

the database. And so you can just check at the end  of the agentic process, was the flight changed?  

was this row in the database changed to a different date? And that can verify that the

agent is working correctly. This is really robust  and you can use it a lot in a lot of different use  

cases. For example, you can check that your  database is updated correctly. You can check  

that certain files were modified, things like that, as a way to evaluate the final state that the agent reaches.
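One way to implement that kind of final-state check, sketched here against a hypothetical SQLite bookings table; the schema, column names, and IDs are made up for illustration.

```python
# Sketch of a final-state check for an airline customer-service agent.
import sqlite3

def flight_was_changed(db_path: str, booking_id: str, expected_date: str) -> bool:
    """After the agentic episode, verify the booking row ended up on the new date."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT flight_date FROM bookings WHERE booking_id = ?", (booking_id,)
        ).fetchone()
    return row is not None and row[0] == expected_date

# Example: run the agent against a sandbox copy of the database, then assert:
# assert flight_was_changed("sandbox.db", "BK-1234", "2025-07-01")
```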

And that's it from us. Um, we're happy to take your questions. [Applause]

Can you talk about building prompts for agents? Are you giving it kind of longer prompts first and then iterating, or are you starting kind of chunk by chunk? Uh, what's that look like?

And can you show sort of a little bit more on that  thought process? That's a great question. Um can  

I switch back to my screen actually? I just want  to sort of show the demo. Thank you. Um, yeah. So,  

you can see this is sort of a final prompt  that we've arrived at, but this is not where  

we started. I think the answer to your question  is that you start with a short simple prompt.

Um, and I might just say: search the web agentically. I'll change this to a different question. Um, how good are the Claude 4 models? And then we'll just run that. And so you'll

want to start with something very simple and just  see how it works. You'll often find that Claude  

can do the task well out of the box. But if you  have more needs and you need it to operate really  

consistently in production, you'll notice edge  cases or small flaws as you test with more use  

cases. And so you'll sort of add those into the  prompt. So I would say building an agent prompt  

what it looks like concretely is start simple,  test it out, see what happens, iterate from there,  

start collecting test cases where the model fails  or succeeds and then over time try to increase the  

number of test cases that pass. Um, and the way  to do this is by sort of adding instructions,  

adding examples to the prompt. But you really  only do that when you find out what the edge  

cases are. And you can see that it thinks that  the models are indeed good. So that's great.  

When I do, like, normal prompting and it's not agentic, uh, I'll often give like a few-shot

example of like, hey, here's like input,  here's output. This works really well for  

like classification tasks, stuff like that, right?  Uh is there a parallel here in this like agentic  

world? Are you finding that that's ever helpful  or should I not think about it that way? That is  

a great question. Yeah. So, should you include few-shot examples in your prompt? Sort of traditional prompting techniques involve, like, saying the model should use a chain of thought and then giving few-shot examples, like a bunch of examples to imitate. We find

that these techniques are not as effective for  state-of-the-art frontier models and for agents.  

Um the main reason for this is that if you give  the model a bunch of examples of exactly what  

process it should follow, that just limits the  model too much. These models are smarter than  

you can predict and so you don't want to tell  them exactly what they need to do. Similarly,  

chain of thought has just been trained into the  models at this point. The models know to think  

in advance. They don't need to be told like use  chain of thought. But what we can do here is one  

you can tell the model how to use its thinking. So, you know, as I talked about earlier, rather than telling the model that it needs to use a chain of thought (it already knows that), you can just

say use your thinking process to plan out your  search or to plan out what you're going to do  

in terms of coding. Or you can tell it to remember specific things in its thinking process

and that sort of helps the agent stay on track.  As far as examples go, um you'll want to give  

the model examples but not too prescriptive.  I think we are out of time, but you can come  

up to me personally and I'll talk to you all  after. Thanks. Thank you. Thanks for coming.
