The prompting playbook

By Claude

Summary

## Key takeaways - **Evals come before prompt tweaks**: Build an eval suite with control cases, edge cases, and capability-boundary tests as the starting point for any prompt work. Without rigorous evals you cannot tell if a change is improving performance or if the new model is simply less capable, in which case no prompting will help. [02:56], [04:16] - **Structure your prompt with XML tags**: Simply reorganizing a messy prompt into XML-tagged sections for role, guidelines, policy, and tone produced an immediate performance uplift. A useful rule: if you cannot visually separate guidelines, policy, and data, the model almost certainly cannot either. [09:49], [11:15] - **Retire patches from older models**: Defensive instructions added to fix behavior in a previous model often become redundant and harmful with newer, better instruction-following models. Track these patches via version control so you can revisit them on migration, rather than letting them cause the model to withhold information it actually has. [15:37], [17:06] - **Give tools, not just instructions**: Telling a model it is critical to calculate correctly does not make it better at mental math. The fix in the proration example was to give it a real calculate_proration tool with a defined schema and let the model use the tool to do the math reliably. [18:20], [20:00] - **Show trade-offs, not just costs**: The escalation prompt only listed the $8 cost of transferring to a human, causing the model to overfit to never escalating. The fix was to add the other side: failing to escalate costs a refund and customer trust. As models get smarter, they make better trade-off calls when you give them both sides. [21:15], [22:38] - **Split complex tasks into agentic loops**: For hard problems like staff scheduling, splitting the work into a generator, an evaluator, and a repair prompt, each running independently, solved all test cases with fewer tokens and lower latency than a single mega-prompt. It also enables soft constraints like 'separate Harry and Sally' to be added at runtime without changing the backend evaluator. [30:03], [31:40]

Topics Covered

Prompt Hygiene Drives Immediate Uplift
Outdated Patches Create New Problems
Instructions Don't Add Capability
State Both Sides of Trade-offs
Agentic Loops Outperform Monolithic Prompts

Full Transcript

Hello everyone. Um, thank you so much for joining me this afternoon in the breakout room. The last session today of

breakout room. The last session today of code with Claude. I hope you've all had a fantastic day so far. My name is Margot Vanlar. I am an applied AI

Margot Vanlar. I am an applied AI engineer at Anthropic here in London.

And this afternoon, we're going to be talking about the prompting playbook.

And prompting is arguably one of the first skills, if not the first skill that we had to learn as engineers when we first started to work with LLMs. and

even now it continues to be one of the most critical um skills to building effective AI systems. So today we're

going to discuss some best practices um in the context of two practical scenarios that you're probably encountering at work. The first is where

you have an existing prompt in production that you've been maintaining for some time um and possibly you're migrating it to a new model or making a

change to the architecture and for some reason it's no longer working as well.

The second scenario is where we're building an entirely new agentic use case from the ground up and we need to build the prompt from zero to one.

Now, in order to illustrate these best practices, I don't just want to give you a list of dos and don'ts. I want to walk

through a practical example that's been inspired by real prompts that um I've seen some of our customers work with who

are building on Claude. So, the prompt that we'll look at today is a miniaturized example. the prompts that

miniaturized example. the prompts that you're working with are probably a lot longer and more complex than the one we'll see today. Um, but it's representative of some common problems

that you might encounter when maintaining a prompt. So, imagine that we have a prompt that multiple people have been collaborating on, contributing

to. There's no clear owner. It covers a

to. There's no clear owner. It covers a lot of different areas like policy, like tone, processes. um we have some patches

tone, processes. um we have some patches for kind of previous models that we've migrated to all mixed together. Um it's

built up and it's complex. And when

we're migrating to a new model, we're finding that suddenly a lot of our test cases are no longer working as well as we expected. So what's actually going on

we expected. So what's actually going on here? Well, in order to start unpacking

here? Well, in order to start unpacking that question, um we need a starting point. And that starting point is

point. And that starting point is evaluations. We need evaluations to

evaluations. We need evaluations to provide that rigor um to understand whether a change to our prompt is actually correlating to an improvement

in its performance.

And we have different models which have different capabilities and different behaviors. And when you migrate to a

behaviors. And when you migrate to a different model, it could be that your system is no longer working as well for two reasons. First of all, if um the new

two reasons. First of all, if um the new model might be capable, but it's behaving differently and therefore we can tune our prompting to fix that

behavior. The second case is where

behavior. The second case is where actually the model that we're changing to isn't as capable and no amount of prompting is going to fix that. So we

need to have an eval suite to act as a way of testing that regression so that we can apply our prompting best practices to that.

So in the example that we're going to be looking at today, as I said, it's going to be a miniaturized example. We'll have

five test cases in our eval.

In reality, you'll have a lot more test cases in your EVA suite, but the key thing here is that it's representative of three key cases that we need to

cover. Those three key cases include

cover. Those three key cases include having a control case, which is a case which should always pass. It's something

that the we know the model handles well.

It's unambiguous.

The second is edge cases. And these are cases where we've seen the model fail before. And by including instructions

before. And by including instructions into the prompt, we're making sure that same behavior doesn't slip through again in the future. And finally, and

critically, we need to make sure that the model has a good understanding of the extent of its capabilities, where it should be handing off to a human or

where actually maybe it should be point blank refusing to answer a request.

So, in the example that we're going to be looking at today, um we'll be using a um prompt for a customer support bot for

a telco company called Meridian Mobile.

And these are the five test cases that we are going to be looking at today. We

have a simple control case looking at um you know what's the data limit in the basic plan. Uh we're also looking at

basic plan. Uh we're also looking at edge cases such as its ability to do calculations such as calculating proration bills. If I switch my bill

proration bills. If I switch my bill halfway through the month uh or if I switch my plan halfway through the month, what will my bill look like? We

want to check that it's accurately addressing key questions which are covered by our policy.

um we need to make sure that it's escalating to a human whenever there is um a billing error. Um and finally, we want to make sure that our model isn't

withholding any information that it has access to, which it should be handing over to the customer.

So what we're going to do in this process is we'll take our prompt and we'll run it on our v0ero um of of the eval and we'll see what our failure

modes are and systematically target those failure modes one at a time to see if we can resolve those failure modes by prompting and along the way we'll learn a little bit more about the kind of

antiatterns uh um and traps to avoid and this is representative of how we would apply apply these best prompting techniques in practice. Right? We are

rarely writing a prompt from scratch.

We're often debugging an existing prompt.

And best practice before we start targeting those failure modes specifically is to kind of apply our general prompting 101 best practices,

applying general hygiene to clean up before we do the eval run. So let's have a little look at the example that we're going to be using. So what we're looking

at here first of all before we look at the prompt is just this vibecoded web app that I've made for the presentation today so that we can look at how we're iterating on the prompt together. Uh um

in this page here I can easily run my evals on all five test cases and inspect the results in a little bit more detail.

So before we have a look at the prompt I'm just going to run the evals in the background.

This is a pretty good first pass at a prompt. When we look at this, we've

prompt. When we look at this, we've defined the bot's role at the top.

When we scroll down, um we've given it some data. We've given it some

some data. We've given it some information on how to reason over uh um the answers that it should be giving to the customer. It's giving some critical

the customer. It's giving some critical instructions around the tone it should use, um how to do calculations, etc. And then finally we're passing in our

customer account context and our user message. So let's have a look at how our

message. So let's have a look at how our first pass the evals did. So we can see as we expect our control case all of our test cases have passed. This is what we

expect for this unambiguous test case.

But it's performing pretty poorly in these other areas. Now, before we zoom in on those specific failure modes here,

let's do some general cleanup of our prompt.

So, as we mentioned, when we look through this prompt, there's a couple oddities here already. So, for example, first one is we're telling the bot that

it's um a human, which just isn't true, right? We can see as we scroll down

right? We can see as we scroll down there's clearly some information here that has been copied directly from a website. So the key giveaway here is a

website. So the key giveaway here is a reference to a hero image. Um there's

even some references to cookies um at the bottom.

So we need to remove a bit of redundant information.

When we look at the instructions here, they're all grouped into one big paragraph. So we've got some reasoning

paragraph. So we've got some reasoning here. We've got instructions about the

here. We've got instructions about the role. uh um some critical instructions

role. uh um some critical instructions as well without a real way of unpacking um policy from guidelines from tone etc.

So let me just I've preempted some changes we want to make to this prompt and this is just a diff view of some of those changes. So

what we've done is first of all added some structure. So you can see that

some structure. So you can see that we've added XML tags here to define the role to separate general guidelines

to separate policy to separate tone of voice um etc. So if we run that eval then on this new

updated prompt, we should hopefully see an improvement in the output as is.

So we can see just by clearing up the prompt, we've already improved the model's performance on this prepaid scenario.

There's an interesting regression there in that fifth hotspot case and I don't want to worry too much about that now.

There's going to be some natural level of variance in the different runs of the eval and we'll come back to that case specifically to see if we can make the

prompt consistently better in that area.

So what did we learn from this then? Um,

simply clearing up the prompt with a better structure, with a better role description has improved the performance. And this is a best practice

performance. And this is a best practice that you can return to at any stage of writing and maintaining your prompt, especially as your prompts get more detailed and more complex. A general

rule of thumb that I like to follow is if you're reading a prompt and you can't tell guidelines from policy, from data, most likely the model isn't able to either.

So before looking at some of those cases in more detail, there's a little bit more general cleanup we can do. Um

specifically here looking at creating an output contract. This is a key best

output contract. This is a key best practice to follow if you're struggling with your output format consistency. Now

in this case, we have a customer support bot. We want it to reply in a

bot. We want it to reply in a conversational tone. So it's unlikely to

conversational tone. So it's unlikely to be a big issue in this case, but it's something to bear in mind if you you're dealing with more complex output structures like nested JSONs for

example.

So again, if we go back to the prompt and see what fixes we can apply here.

First of all, we've added a section uh um at the end where we've defined an output format for the model telling it

to use uh um XML tags to output the response. But the prompt is not always

response. But the prompt is not always the most effective way of handling issues. We can also change things in the

issues. We can also change things in the harness to ensure consistency to a higher degree. So what we've added here

higher degree. So what we've added here to the API call is a stop sequence which is going to detect that closing XML tag

and tell the model to stop generating a response at that point. Now when I run the eval here I don't necessarily expect to see any clear improvement in

performance. Um, but it's a general best

performance. Um, but it's a general best practice that we should be following and as I said is something that we should remember in particular when we have more complex output schemas.

One thing to point out here as well if you do have a more complex output schema something like structured outputs can be incredibly helpful to ensure that

consistency in a more programmatic way.

Okay. So after the cleanup then we can see that we now have two test cases which are consistently passing but we have three key failure modes the

proration the billing error and the hotspot. So let's isolate these one by

hotspot. So let's isolate these one by one um to iterate on the prompt and and see the effect of that.

First of all then the hotspot question.

So the question is how much hotspot data is on my unlimited plan. What we expect the model to do is state directly the amount of hotspot data that the customer has.

And the reason this is a slightly complex case is because the customer test case that we're dealing with is on a legacy plan. So actually the current

policy doesn't apply to them. So if we see what's going on in the actual test case here, the customer data which we're

feeding uh um to the prompt includes the amount of hotspot data that customer has. They have 5 gigabytes, right? But

has. They have 5 gigabytes, right? But

they also have a grandfathered plan. So

what we're seeing uh the model is actually telling the customer is the general um the unlimited plan includes 4 GB um but since you're on a

legacy plan you should go check this out yourself.

So let's have a look at the prompt then to see um why the model is deflecting this question to the customer account URL rather than

actually giving the information itself.

Now if we read this prompt originally it said we changed our plans recently and the policy doc shows the current plan data and customers on grandfather's plan have different rates. Never give a

customer the wrong plan details. instead

point them to the URL. So it's clear that this instruction, this latter one, never give customer the wrong information, is the instruction that the

bot has been optimizing for. And you

might recognize this as being very similar to a patch that you might have introduced in a previous model that you were using to avoid where the model was

giving the customer the wrong information about that plan. Now, as our models have evolved, they have gotten much better at instruction following.

So, it's likely that instructions like these have now become redundant and are actually being overfitted to.

So, what we're going to tell the model instead is give this balanced view uh um where it says, you know, customers on grandfather's plan have different allowances, but it's captured in the

customer information that's given and that is the accurate source of truth.

So running the eval here we should hopefully be addressing uh um all of the test cases for the hotspot case. Now I

am running this live so there could be some variability here but we see here that now clearly all of our test cases are are passing.

So what did we learn from this? Well, we

worry a lot about hallucinations or the invention of facts and numbers, but actually the opposite can also happen.

The model can withhold information that it actually has access to. Now, we saw here that this is likely a result of a patch that we introduced for a previous

model. And a best practice that we could

model. And a best practice that we could follow here is actually using version control where wherever we are making defensive changes in the prompt, we are

tracking the reason why we've introduced these. Sometimes they're necessary, but

these. Sometimes they're necessary, but in the future these kind of changes can produce unwanted effects so that we can backtrack on them.

So the next failing test case then is this proration calculation where a customer asks what if I upgrade to the

30 gigabyte plan? What will my next bill be? And what we want the model to do is

be? And what we want the model to do is to perform some calculation and return exactly uh um what their next bill would be rather than giving some sort of vague

output which is what we can see it's doing right now. Uh um if we look at what the model is returning, it's clearly reasoning through it. It's

doing a little bit of mental maths here and there, but it's not really giving the customer a concrete answer. And I

wouldn't rely on this as being able to accurately give the customer a response.

So if we look at the prompt then to see how we can fix this in the original prompt we can see that all the instructions

that were given to it is telling it don't ever give a customer a vague answer. Uh um critical always calculates

answer. Uh um critical always calculates any pr-rated amounts correctly. Now,

telling the model to do a good job isn't particularly helpful when we don't give the model the capability to actually do a good job. We want to avoid the model doing mental math. So, what we're going

to introduce is give the model a tool.

So, we're saying in the prompt whenever you're doing any calculations, please use the calculate proration tool to do so.

In order to introduce that tool, we need to introduce it into the API to tell the model you have access to this tool. We

need to define um the tool schema which tells the model what this tool does and when to use it. And then finally we need to actually implement the tool which is

the maths behind how it should be doing that calculation.

So running that eval then for another pass we can see that all the test cases are now passing. It's clearly done uh um the

now passing. It's clearly done uh um the maths using the tool in the background and returning the correct response.

So the key lesson to take away here is instructions don't add capability.

Telling the model it's critical to do a calculation right doesn't make it better at mental math. So the correct approach was to give it a tool. Overall giving it

the ability to reason over harder problems and using tools to actually execute them reliably.

So now we have one final failing test case which we need to address which is this billing error here.

In this scenario there is a billing conflict and what we really want is the agent to escalate this to a human. And

what we're seeing it doing instead is it's trying to explain to the customer what the reason behind it might be.

and it's trying to kind of diagnose the problem itself.

So in order to fix this behavior, let's again have a look what it was told in the prompt.

We see in the initial instructions it was giving, it says, "Avoid escalating or transferring to a care specialist unless absolutely necessary as it cost approximately $8 and it counts against

our team's fast contract resolution."

Now, this is only giving one side of the story, right? We're telling it what the

story, right? We're telling it what the cost is to escalating, but not the benefit, which means it's going to overfit again to not escalating this

scenario. And second of all, we've got

scenario. And second of all, we've got this clear conflict between what we've defined in the eval in terms of what we want the model to do to do this escalation versus

what we're actually telling it to do.

And the fix that's relevant here is to give it both sides of the story by saying it costs $8 uh um to escalate a

case, but actually if you get this wrong, then it's going to cost you a refund as well as customer trust.

Again, here we observed how the model optimizes for a goal. And this kind of instruction is a common instruction to give. It's quite similar to the one we

give. It's quite similar to the one we saw earlier where we didn't want it to overfitit to a certain type of behavior.

But it's the kind of instruction that can be followed quite differently by different generations of models. And

specifically, as models become more intelligent, we need to remember to state both sides of the trade-offs because our models are becoming better

themselves at making those tradeoffs themselves.

So if we just go back to our eval then and um run our final test case,

we should see that all of our evals are now passing correctly. So overall, we looked at applying general hygiene principles, how that can provide an

initial uplift to the prompt, making sure we're removing any redundant instructions which were initially intended as patches for previous models

behavior, making sure we're giving it tools to do certain tasks reliably.

Now there's one other scenario that we uh introduced at the start which is one that you might also encounter in your work which is where we're building a new

agent from scratch. And the example that we'll look at here is um an agent whose purpose it is to create a week-long

retail staff schedule based on employee availability and other constraints.

And when we're building a new agent from scratch, we need to consider not just the prompt, but also the model that we're using and

the harness that we're using. So in this next example, we're going to compare a number of approaches to explore the impact of those three different areas.

So again, I've just vcoded up this web app so that we can walk through this problem. Uh um in this demo here I've

problem. Uh um in this demo here I've just laid out what the problem is that we're addressing. We have our eight

we're addressing. We have our eight employees. Um on the right we have this

employees. Um on the right we have this schedule that we need to staff with the headcount and we have our constraints that must be satisfied in every

scenario.

Now, because we have these hard rules, rather than using an LLM judge like we did in the previous case to do the grading, we can actually use a just a

Python function which programmatically checks for every schedule that's generated how many violations were made.

So to begin with in a we want to start simple. We're going to use a simple

simple. We're going to use a simple prompt. We're going to use the bare

prompt. We're going to use the bare bones that we think we'll need with a model set 46 to see how it performs and how we're going to hill climb against that. Um, so here is our baseline

that. Um, so here is our baseline prompt. We've already applied some of

prompt. We've already applied some of that general hygiene and those best practices that we saw earlier on using XML tags to structure the prompt. We've

given it an output format as well. Now

that we're giving a schedule, uh we're asking it to output a JSON which if we don't give that output structure might lead to

passing errors uh downstream.

When we run the simple model on a first iteration of the evals, all cases fail. Now, just what

we're looking at here is in our test set, we're essentially repeating uh um we're doing five trials here. Uh and

these numbers are showing how many violations were made in each trial.

In the outputs, we can see that it's made a decent attempt at reasoning through the problem,

but it's burning a lot of tokens. Uh and

it's clearly not checking its work um as it's not getting to the right impact. So

let's try a larger model uh a model which we know is better at reasoning. So

we're going to run it through uh Opus 4.7 instead keeping everything else the same. Now,

interestingly, whilst all test cases are still failing, you can see that the overall number of violations that Opus

has made has reduced significantly from Sonnet 4.6. So, we're possibly on to

Sonnet 4.6. So, we're possibly on to something here, right? This isn't good enough to ship because it's still failing, but clearly giving it more

reasoning capability is helping drive it towards a better result. So what we're going to try next is using opus with

adaptive thinking instead. So it can decide for itself how much thinking it needs how much reasoning it needs to use to solve this

issue. So no change to the prompt really

issue. So no change to the prompt really just uh a change to the API here.

So this now seems to reliably generate compliance schedules, but it requires a lot more tokens.

We're tripling essentially in the number of tokens that we're using here, and we're tripling the latency. So we want to try and see if we can optimize that cost latency trade-off a little bit

more. Um, this is latency 100 seconds.

more. Um, this is latency 100 seconds.

Obviously, I'm running this one async.

uh for the purposes of time, Opus 47 hasn't magically gotten much faster since the last time uh you used it. Um

so let's see if we can optimize a little bit more for the token latency tradeoff.

What we haven't tried yet is using Sonnet 46. So a smaller model but with a

Sonnet 46. So a smaller model but with a better prompt. We looked a lot at the

better prompt. We looked a lot at the prompt optimization uh um in that last section. So I've added a couple details

section. So I've added a couple details to the prompt here. We're talk we're in particular how to uh reason through

this problem and most critically telling it to check its work before outputting it.

So when I ran that EAL we see that it passes in two out of the five cases. Now

the failure modes that we're seeing is actually not violations of the scheduling requirements but the

model hasn't been able to finish the tasks within the output limit that we set. So whilst we could increase the max

set. So whilst we could increase the max tokens that this model is able to use to get all five test cases passing, we see here that we're using even more

tokens and this run has an even higher latency. So this is probably not the

latency. So this is probably not the route that we want to go down.

Now as a final pass, then we want to look at doing this a little bit more agentically.

So we're going to use this generate evaluate repair loop where essentially the generator now creates a first draft of the schedule and then we have a

separate prompt which reports any specific violations that it made. So not programmatically checking it but checking it with an LLM.

So we're checking for every rule and we're providing evidence of every violation.

And we then have a third repair prompt which receives uh any violations that were made and tries to make targeted

fixes to it. So we have three very simple prompts but they're now running independently rather than trying to do everything in one large prompt.

So we can see in this case our agentic approach has solved all of our test cases um with a much lower number

of tokens and with a lower latency than trying Sonet 46 with a better prompt. So going forward, it seems like there's two appropriate

approaches to take here. Using Opus 47 with adaptive thinking or using this agentic loop. Now moving forward, we'd

agentic loop. Now moving forward, we'd probably want to do a little bit more optimization on this loop to try and get it to be more efficient. But there's one

key benefit as well from using this generate evaluate repair loop. And that

is that you can put in soft requirements at runtime. So in the evaluation prompt,

at runtime. So in the evaluation prompt, we can say Harry doesn't like working with Sally. So as much as possible, try

with Sally. So as much as possible, try and separate them from working together or we need a third shift uh um on Wednesday, for example. So it means that

you're not having to make changes to the um Python function which is doing the evaluation in the back end every time to satisfy for any soft constraints which

might depend just on a case-byase basis.

So to wrap up then pulling all of those learnings together, what did we see? Well, we looked at two scenarios. Two scenarios which I, as an

scenarios. Two scenarios which I, as an engineer, see most in my day-to-day, which is where we're maintaining a prompt. We're migrating to a new model

prompt. We're migrating to a new model which has some different behaviors and we're and building a new use case from scratch. We saw that general hygiene

scratch. We saw that general hygiene principles following those can and immediately uplift the performance against um a set of evals and that we

need those eval to be able to rigorously see any impacts of changing our prompt on the output. Then we saw this process of targeting failure modes one by one,

adding structure, avoiding long band lists, etc. were all things that helped push our model to the correct behavior.

And then finally with our new agentic bot that we were building, we saw the impact of splitting into three separate prompt systems. So

rather than using one prompt to address everything, we're actually isolating different tasks where it's easy and repeatable to separate out the steps

that it needs to take every time.

Thank you so much for attending this afternoon. I hope you have a fantastic

afternoon. I hope you have a fantastic rest of your day.

Loading...

Loading video analysis...