The prompting playbook
By Claude
Summary
Topics Covered
- Prompt Hygiene Drives Immediate Uplift
- Outdated Patches Create New Problems
- Instructions Don't Add Capability
- State Both Sides of Trade-offs
- Agentic Loops Outperform Monolithic Prompts
Full Transcript
Hello everyone. Um, thank you so much for joining me this afternoon in the breakout room. The last session today of
breakout room. The last session today of code with Claude. I hope you've all had a fantastic day so far. My name is Margot Vanlar. I am an applied AI
Margot Vanlar. I am an applied AI engineer at Anthropic here in London.
And this afternoon, we're going to be talking about the prompting playbook.
And prompting is arguably one of the first skills, if not the first skill that we had to learn as engineers when we first started to work with LLMs. and
even now it continues to be one of the most critical um skills to building effective AI systems. So today we're
going to discuss some best practices um in the context of two practical scenarios that you're probably encountering at work. The first is where
you have an existing prompt in production that you've been maintaining for some time um and possibly you're migrating it to a new model or making a
change to the architecture and for some reason it's no longer working as well.
The second scenario is where we're building an entirely new agentic use case from the ground up and we need to build the prompt from zero to one.
Now, in order to illustrate these best practices, I don't just want to give you a list of dos and don'ts. I want to walk
through a practical example that's been inspired by real prompts that um I've seen some of our customers work with who
are building on Claude. So, the prompt that we'll look at today is a miniaturized example. the prompts that
miniaturized example. the prompts that you're working with are probably a lot longer and more complex than the one we'll see today. Um, but it's representative of some common problems
that you might encounter when maintaining a prompt. So, imagine that we have a prompt that multiple people have been collaborating on, contributing
to. There's no clear owner. It covers a
to. There's no clear owner. It covers a lot of different areas like policy, like tone, processes. um we have some patches
tone, processes. um we have some patches for kind of previous models that we've migrated to all mixed together. Um it's
built up and it's complex. And when
we're migrating to a new model, we're finding that suddenly a lot of our test cases are no longer working as well as we expected. So what's actually going on
we expected. So what's actually going on here? Well, in order to start unpacking
here? Well, in order to start unpacking that question, um we need a starting point. And that starting point is
point. And that starting point is evaluations. We need evaluations to
evaluations. We need evaluations to provide that rigor um to understand whether a change to our prompt is actually correlating to an improvement
in its performance.
And we have different models which have different capabilities and different behaviors. And when you migrate to a
behaviors. And when you migrate to a different model, it could be that your system is no longer working as well for two reasons. First of all, if um the new
two reasons. First of all, if um the new model might be capable, but it's behaving differently and therefore we can tune our prompting to fix that
behavior. The second case is where
behavior. The second case is where actually the model that we're changing to isn't as capable and no amount of prompting is going to fix that. So we
need to have an eval suite to act as a way of testing that regression so that we can apply our prompting best practices to that.
So in the example that we're going to be looking at today, as I said, it's going to be a miniaturized example. We'll have
five test cases in our eval.
In reality, you'll have a lot more test cases in your EVA suite, but the key thing here is that it's representative of three key cases that we need to
cover. Those three key cases include
cover. Those three key cases include having a control case, which is a case which should always pass. It's something
that the we know the model handles well.
It's unambiguous.
The second is edge cases. And these are cases where we've seen the model fail before. And by including instructions
before. And by including instructions into the prompt, we're making sure that same behavior doesn't slip through again in the future. And finally, and
critically, we need to make sure that the model has a good understanding of the extent of its capabilities, where it should be handing off to a human or
where actually maybe it should be point blank refusing to answer a request.
So, in the example that we're going to be looking at today, um we'll be using a um prompt for a customer support bot for
a telco company called Meridian Mobile.
And these are the five test cases that we are going to be looking at today. We
have a simple control case looking at um you know what's the data limit in the basic plan. Uh we're also looking at
basic plan. Uh we're also looking at edge cases such as its ability to do calculations such as calculating proration bills. If I switch my bill
proration bills. If I switch my bill halfway through the month uh or if I switch my plan halfway through the month, what will my bill look like? We
want to check that it's accurately addressing key questions which are covered by our policy.
um we need to make sure that it's escalating to a human whenever there is um a billing error. Um and finally, we want to make sure that our model isn't
withholding any information that it has access to, which it should be handing over to the customer.
So what we're going to do in this process is we'll take our prompt and we'll run it on our v0ero um of of the eval and we'll see what our failure
modes are and systematically target those failure modes one at a time to see if we can resolve those failure modes by prompting and along the way we'll learn a little bit more about the kind of
antiatterns uh um and traps to avoid and this is representative of how we would apply apply these best prompting techniques in practice. Right? We are
rarely writing a prompt from scratch.
We're often debugging an existing prompt.
And best practice before we start targeting those failure modes specifically is to kind of apply our general prompting 101 best practices,
applying general hygiene to clean up before we do the eval run. So let's have a little look at the example that we're going to be using. So what we're looking
at here first of all before we look at the prompt is just this vibecoded web app that I've made for the presentation today so that we can look at how we're iterating on the prompt together. Uh um
in this page here I can easily run my evals on all five test cases and inspect the results in a little bit more detail.
So before we have a look at the prompt I'm just going to run the evals in the background.
This is a pretty good first pass at a prompt. When we look at this, we've
prompt. When we look at this, we've defined the bot's role at the top.
When we scroll down, um we've given it some data. We've given it some
some data. We've given it some information on how to reason over uh um the answers that it should be giving to the customer. It's giving some critical
the customer. It's giving some critical instructions around the tone it should use, um how to do calculations, etc. And then finally we're passing in our
customer account context and our user message. So let's have a look at how our
message. So let's have a look at how our first pass the evals did. So we can see as we expect our control case all of our test cases have passed. This is what we
expect for this unambiguous test case.
But it's performing pretty poorly in these other areas. Now, before we zoom in on those specific failure modes here,
let's do some general cleanup of our prompt.
So, as we mentioned, when we look through this prompt, there's a couple oddities here already. So, for example, first one is we're telling the bot that
it's um a human, which just isn't true, right? We can see as we scroll down
right? We can see as we scroll down there's clearly some information here that has been copied directly from a website. So the key giveaway here is a
website. So the key giveaway here is a reference to a hero image. Um there's
even some references to cookies um at the bottom.
So we need to remove a bit of redundant information.
When we look at the instructions here, they're all grouped into one big paragraph. So we've got some reasoning
paragraph. So we've got some reasoning here. We've got instructions about the
here. We've got instructions about the role. uh um some critical instructions
role. uh um some critical instructions as well without a real way of unpacking um policy from guidelines from tone etc.
So let me just I've preempted some changes we want to make to this prompt and this is just a diff view of some of those changes. So
what we've done is first of all added some structure. So you can see that
some structure. So you can see that we've added XML tags here to define the role to separate general guidelines
to separate policy to separate tone of voice um etc. So if we run that eval then on this new
updated prompt, we should hopefully see an improvement in the output as is.
So we can see just by clearing up the prompt, we've already improved the model's performance on this prepaid scenario.
There's an interesting regression there in that fifth hotspot case and I don't want to worry too much about that now.
There's going to be some natural level of variance in the different runs of the eval and we'll come back to that case specifically to see if we can make the
prompt consistently better in that area.
So what did we learn from this then? Um,
simply clearing up the prompt with a better structure, with a better role description has improved the performance. And this is a best practice
performance. And this is a best practice that you can return to at any stage of writing and maintaining your prompt, especially as your prompts get more detailed and more complex. A general
rule of thumb that I like to follow is if you're reading a prompt and you can't tell guidelines from policy, from data, most likely the model isn't able to either.
So before looking at some of those cases in more detail, there's a little bit more general cleanup we can do. Um
specifically here looking at creating an output contract. This is a key best
output contract. This is a key best practice to follow if you're struggling with your output format consistency. Now
in this case, we have a customer support bot. We want it to reply in a
bot. We want it to reply in a conversational tone. So it's unlikely to
conversational tone. So it's unlikely to be a big issue in this case, but it's something to bear in mind if you you're dealing with more complex output structures like nested JSONs for
example.
So again, if we go back to the prompt and see what fixes we can apply here.
First of all, we've added a section uh um at the end where we've defined an output format for the model telling it
to use uh um XML tags to output the response. But the prompt is not always
response. But the prompt is not always the most effective way of handling issues. We can also change things in the
issues. We can also change things in the harness to ensure consistency to a higher degree. So what we've added here
higher degree. So what we've added here to the API call is a stop sequence which is going to detect that closing XML tag
and tell the model to stop generating a response at that point. Now when I run the eval here I don't necessarily expect to see any clear improvement in
performance. Um, but it's a general best
performance. Um, but it's a general best practice that we should be following and as I said is something that we should remember in particular when we have more complex output schemas.
One thing to point out here as well if you do have a more complex output schema something like structured outputs can be incredibly helpful to ensure that
consistency in a more programmatic way.
Okay. So after the cleanup then we can see that we now have two test cases which are consistently passing but we have three key failure modes the
proration the billing error and the hotspot. So let's isolate these one by
hotspot. So let's isolate these one by one um to iterate on the prompt and and see the effect of that.
First of all then the hotspot question.
So the question is how much hotspot data is on my unlimited plan. What we expect the model to do is state directly the amount of hotspot data that the customer has.
And the reason this is a slightly complex case is because the customer test case that we're dealing with is on a legacy plan. So actually the current
policy doesn't apply to them. So if we see what's going on in the actual test case here, the customer data which we're
feeding uh um to the prompt includes the amount of hotspot data that customer has. They have 5 gigabytes, right? But
has. They have 5 gigabytes, right? But
they also have a grandfathered plan. So
what we're seeing uh the model is actually telling the customer is the general um the unlimited plan includes 4 GB um but since you're on a
legacy plan you should go check this out yourself.
So let's have a look at the prompt then to see um why the model is deflecting this question to the customer account URL rather than
actually giving the information itself.
Now if we read this prompt originally it said we changed our plans recently and the policy doc shows the current plan data and customers on grandfather's plan have different rates. Never give a
customer the wrong plan details. instead
point them to the URL. So it's clear that this instruction, this latter one, never give customer the wrong information, is the instruction that the
bot has been optimizing for. And you
might recognize this as being very similar to a patch that you might have introduced in a previous model that you were using to avoid where the model was
giving the customer the wrong information about that plan. Now, as our models have evolved, they have gotten much better at instruction following.
So, it's likely that instructions like these have now become redundant and are actually being overfitted to.
So, what we're going to tell the model instead is give this balanced view uh um where it says, you know, customers on grandfather's plan have different allowances, but it's captured in the
customer information that's given and that is the accurate source of truth.
So running the eval here we should hopefully be addressing uh um all of the test cases for the hotspot case. Now I
am running this live so there could be some variability here but we see here that now clearly all of our test cases are are passing.
So what did we learn from this? Well, we
worry a lot about hallucinations or the invention of facts and numbers, but actually the opposite can also happen.
The model can withhold information that it actually has access to. Now, we saw here that this is likely a result of a patch that we introduced for a previous
model. And a best practice that we could
model. And a best practice that we could follow here is actually using version control where wherever we are making defensive changes in the prompt, we are
tracking the reason why we've introduced these. Sometimes they're necessary, but
these. Sometimes they're necessary, but in the future these kind of changes can produce unwanted effects so that we can backtrack on them.
So the next failing test case then is this proration calculation where a customer asks what if I upgrade to the
30 gigabyte plan? What will my next bill be? And what we want the model to do is
be? And what we want the model to do is to perform some calculation and return exactly uh um what their next bill would be rather than giving some sort of vague
output which is what we can see it's doing right now. Uh um if we look at what the model is returning, it's clearly reasoning through it. It's
doing a little bit of mental maths here and there, but it's not really giving the customer a concrete answer. And I
wouldn't rely on this as being able to accurately give the customer a response.
So if we look at the prompt then to see how we can fix this in the original prompt we can see that all the instructions
that were given to it is telling it don't ever give a customer a vague answer. Uh um critical always calculates
answer. Uh um critical always calculates any pr-rated amounts correctly. Now,
telling the model to do a good job isn't particularly helpful when we don't give the model the capability to actually do a good job. We want to avoid the model doing mental math. So, what we're going
to introduce is give the model a tool.
So, we're saying in the prompt whenever you're doing any calculations, please use the calculate proration tool to do so.
In order to introduce that tool, we need to introduce it into the API to tell the model you have access to this tool. We
need to define um the tool schema which tells the model what this tool does and when to use it. And then finally we need to actually implement the tool which is
the maths behind how it should be doing that calculation.
So running that eval then for another pass we can see that all the test cases are now passing. It's clearly done uh um the
now passing. It's clearly done uh um the maths using the tool in the background and returning the correct response.
So the key lesson to take away here is instructions don't add capability.
Telling the model it's critical to do a calculation right doesn't make it better at mental math. So the correct approach was to give it a tool. Overall giving it
the ability to reason over harder problems and using tools to actually execute them reliably.
So now we have one final failing test case which we need to address which is this billing error here.
In this scenario there is a billing conflict and what we really want is the agent to escalate this to a human. And
what we're seeing it doing instead is it's trying to explain to the customer what the reason behind it might be.
and it's trying to kind of diagnose the problem itself.
So in order to fix this behavior, let's again have a look what it was told in the prompt.
We see in the initial instructions it was giving, it says, "Avoid escalating or transferring to a care specialist unless absolutely necessary as it cost approximately $8 and it counts against
our team's fast contract resolution."
Now, this is only giving one side of the story, right? We're telling it what the
story, right? We're telling it what the cost is to escalating, but not the benefit, which means it's going to overfit again to not escalating this
scenario. And second of all, we've got
scenario. And second of all, we've got this clear conflict between what we've defined in the eval in terms of what we want the model to do to do this escalation versus
what we're actually telling it to do.
And the fix that's relevant here is to give it both sides of the story by saying it costs $8 uh um to escalate a
case, but actually if you get this wrong, then it's going to cost you a refund as well as customer trust.
Again, here we observed how the model optimizes for a goal. And this kind of instruction is a common instruction to give. It's quite similar to the one we
give. It's quite similar to the one we saw earlier where we didn't want it to overfitit to a certain type of behavior.
But it's the kind of instruction that can be followed quite differently by different generations of models. And
specifically, as models become more intelligent, we need to remember to state both sides of the trade-offs because our models are becoming better
themselves at making those tradeoffs themselves.
So if we just go back to our eval then and um run our final test case,
we should see that all of our evals are now passing correctly. So overall, we looked at applying general hygiene principles, how that can provide an
initial uplift to the prompt, making sure we're removing any redundant instructions which were initially intended as patches for previous models
behavior, making sure we're giving it tools to do certain tasks reliably.
Now there's one other scenario that we uh introduced at the start which is one that you might also encounter in your work which is where we're building a new
agent from scratch. And the example that we'll look at here is um an agent whose purpose it is to create a week-long
retail staff schedule based on employee availability and other constraints.
And when we're building a new agent from scratch, we need to consider not just the prompt, but also the model that we're using and
the harness that we're using. So in this next example, we're going to compare a number of approaches to explore the impact of those three different areas.
So again, I've just vcoded up this web app so that we can walk through this problem. Uh um in this demo here I've
problem. Uh um in this demo here I've just laid out what the problem is that we're addressing. We have our eight
we're addressing. We have our eight employees. Um on the right we have this
employees. Um on the right we have this schedule that we need to staff with the headcount and we have our constraints that must be satisfied in every
scenario.
Now, because we have these hard rules, rather than using an LLM judge like we did in the previous case to do the grading, we can actually use a just a
Python function which programmatically checks for every schedule that's generated how many violations were made.
So to begin with in a we want to start simple. We're going to use a simple
simple. We're going to use a simple prompt. We're going to use the bare
prompt. We're going to use the bare bones that we think we'll need with a model set 46 to see how it performs and how we're going to hill climb against that. Um, so here is our baseline
that. Um, so here is our baseline prompt. We've already applied some of
prompt. We've already applied some of that general hygiene and those best practices that we saw earlier on using XML tags to structure the prompt. We've
given it an output format as well. Now
that we're giving a schedule, uh we're asking it to output a JSON which if we don't give that output structure might lead to
passing errors uh downstream.
When we run the simple model on a first iteration of the evals, all cases fail. Now, just what
we're looking at here is in our test set, we're essentially repeating uh um we're doing five trials here. Uh and
these numbers are showing how many violations were made in each trial.
In the outputs, we can see that it's made a decent attempt at reasoning through the problem,
but it's burning a lot of tokens. Uh and
it's clearly not checking its work um as it's not getting to the right impact. So
let's try a larger model uh a model which we know is better at reasoning. So
we're going to run it through uh Opus 4.7 instead keeping everything else the same. Now,
interestingly, whilst all test cases are still failing, you can see that the overall number of violations that Opus
has made has reduced significantly from Sonnet 4.6. So, we're possibly on to
Sonnet 4.6. So, we're possibly on to something here, right? This isn't good enough to ship because it's still failing, but clearly giving it more
reasoning capability is helping drive it towards a better result. So what we're going to try next is using opus with
adaptive thinking instead. So it can decide for itself how much thinking it needs how much reasoning it needs to use to solve this
issue. So no change to the prompt really
issue. So no change to the prompt really just uh a change to the API here.
So this now seems to reliably generate compliance schedules, but it requires a lot more tokens.
We're tripling essentially in the number of tokens that we're using here, and we're tripling the latency. So we want to try and see if we can optimize that cost latency trade-off a little bit
more. Um, this is latency 100 seconds.
more. Um, this is latency 100 seconds.
Obviously, I'm running this one async.
uh for the purposes of time, Opus 47 hasn't magically gotten much faster since the last time uh you used it. Um
so let's see if we can optimize a little bit more for the token latency tradeoff.
What we haven't tried yet is using Sonnet 46. So a smaller model but with a
Sonnet 46. So a smaller model but with a better prompt. We looked a lot at the
better prompt. We looked a lot at the prompt optimization uh um in that last section. So I've added a couple details
section. So I've added a couple details to the prompt here. We're talk we're in particular how to uh reason through
this problem and most critically telling it to check its work before outputting it.
So when I ran that EAL we see that it passes in two out of the five cases. Now
the failure modes that we're seeing is actually not violations of the scheduling requirements but the
model hasn't been able to finish the tasks within the output limit that we set. So whilst we could increase the max
set. So whilst we could increase the max tokens that this model is able to use to get all five test cases passing, we see here that we're using even more
tokens and this run has an even higher latency. So this is probably not the
latency. So this is probably not the route that we want to go down.
Now as a final pass, then we want to look at doing this a little bit more agentically.
So we're going to use this generate evaluate repair loop where essentially the generator now creates a first draft of the schedule and then we have a
separate prompt which reports any specific violations that it made. So not programmatically checking it but checking it with an LLM.
So we're checking for every rule and we're providing evidence of every violation.
And we then have a third repair prompt which receives uh any violations that were made and tries to make targeted
fixes to it. So we have three very simple prompts but they're now running independently rather than trying to do everything in one large prompt.
So we can see in this case our agentic approach has solved all of our test cases um with a much lower number
of tokens and with a lower latency than trying Sonet 46 with a better prompt. So going forward, it seems like there's two appropriate
approaches to take here. Using Opus 47 with adaptive thinking or using this agentic loop. Now moving forward, we'd
agentic loop. Now moving forward, we'd probably want to do a little bit more optimization on this loop to try and get it to be more efficient. But there's one
key benefit as well from using this generate evaluate repair loop. And that
is that you can put in soft requirements at runtime. So in the evaluation prompt,
at runtime. So in the evaluation prompt, we can say Harry doesn't like working with Sally. So as much as possible, try
with Sally. So as much as possible, try and separate them from working together or we need a third shift uh um on Wednesday, for example. So it means that
you're not having to make changes to the um Python function which is doing the evaluation in the back end every time to satisfy for any soft constraints which
might depend just on a case-byase basis.
So to wrap up then pulling all of those learnings together, what did we see? Well, we looked at two scenarios. Two scenarios which I, as an
scenarios. Two scenarios which I, as an engineer, see most in my day-to-day, which is where we're maintaining a prompt. We're migrating to a new model
prompt. We're migrating to a new model which has some different behaviors and we're and building a new use case from scratch. We saw that general hygiene
scratch. We saw that general hygiene principles following those can and immediately uplift the performance against um a set of evals and that we
need those eval to be able to rigorously see any impacts of changing our prompt on the output. Then we saw this process of targeting failure modes one by one,
adding structure, avoiding long band lists, etc. were all things that helped push our model to the correct behavior.
And then finally with our new agentic bot that we were building, we saw the impact of splitting into three separate prompt systems. So
rather than using one prompt to address everything, we're actually isolating different tasks where it's easy and repeatable to separate out the steps
that it needs to take every time.
Thank you so much for attending this afternoon. I hope you have a fantastic
afternoon. I hope you have a fantastic rest of your day.
Loading video analysis...