Tool, skill, or subagent? Decomposing an agent that outgrew its prompt

By Claude

Summary

## Key takeaways - **Skills vs system prompt stuffing**: Skills provide progressive disclosure—packaging information for Claude to pull into context only when needed—rather than stuffing everything into a long system prompt that pollutes the context window with unnecessary information. [21:14], [22:47] - **Use code execution over uploading data**: Code execution dramatically reduces token usage; instead of uploading entire CSVs into context, give Claude the ability to write and run Python, reducing tokens from 200,000+ down to 54,000 for the same task. [28:18], [34:05] - **Start with Claude Code primitives**: Build agents using human-like primitives (file system, code execution, web search) that mirror what Claude Code has access to—foundationally strong tools that improve as models get smarter, without custom wrappers. [26:32], [27:31] - **Sub-agents need fresh context**: Sub-agents are best when you need a separate Claude instance to review work without context contamination—like having a colleague review your code rather than reviewing your own. [36:31], [37:55] - **Simplify system prompt to ~15 lines**: After refactoring, the orchestrator's system prompt was reduced from 400 lines to 15 lines, with all business logic moved to skills, dramatically reducing confusion and policy conflicts. [40:54], [41:52] - **Hill climbing with evals**: Use iterative eval-driven development—run evals to get baseline (62-83%), identify issues, fix architecture, then rerun evals to climb toward improvement (ending at 92% in this case). [10:43], [42:17]

Topics Covered

Use Skills for Progressive Disclosure Instead of Long System Prompts
Start with Human-like Primitives When Building Agents
Sub-agents Excel for Parallelization and Fresh Perspectives
Migrate to Claude Managed Agents to Focus on Architecture
Hill Climbing: Iterate on Evals to Improve Agent Performance

Full Transcript

All right. Fantastic. Can everyone hear me?

right. Fantastic. Can everyone hear me?

Thumbs up. All good. All right,

everyone. I hope that you have had a fantastic day at Code with Claude London. So far today, my name is Will.

London. So far today, my name is Will.

I'm on our engineering team at Anthropic. I sit on a team called

Anthropic. I sit on a team called Applied AI. What that means is I

Applied AI. What that means is I essentially split my time between internal engineering work and time spent building agents with customers. So

folks, imagine that you built and shipped an agent to solve a problem. I'm

sure that's something that a lot of the folks in this room have actually done.

And imagine this that this agent worked fantastic, right? But it worked so well

fantastic, right? But it worked so well that a few weeks after shipping, you were asked to add some additional capability to the agent. A few weeks

after that, you received more business requirements and you added additional capability. This pattern continued and

capability. This pattern continued and continued until before you know it, your system prompt had grown to become several hundred lines long. you have

dozens of tools and sub aents that exist for your agent and because of the complexity you've started to see regressions in the areas that your agent was previously

accelerating in. So if this is you,

accelerating in. So if this is you, you're not alone. We see this type of scenario happen pretty commonly with customers and actually with ourselves included in that. So within this

workshop, we are going to simulate an agent that has essentially grown to a complexity where we start to see degradation in its performance. We're

then going to walk through some of the decisions that we as engineers and architects make um in order to improve the design of our agent to restore the

performance that we expect with the additional capability. Specifically,

additional capability. Specifically, we're going to make some decisions around tools and skills and sub agents.

As we modernize the stack of our agent, we want to make sure that we're using the right agentic primitives at the right time. So, when do you use a tool?

right time. So, when do you use a tool?

When do you use a skill? And when do you use a sub agent? We're going to talk through all of that in this session. As

I mentioned folks, this session will be hands-on. So, let's go ahead and get

hands-on. So, let's go ahead and get started. I first want to walk you

started. I first want to walk you through our problem statement in our agent. So for the purposes of this

agent. So for the purposes of this session, we're going to be focusing on an agent called Stock Pilot. This is an inventory management agent that was

designed by and for a midsize re uh retailer. The agent that you see on the

retailer. The agent that you see on the screen can do several things. It can

flag low levels of stock. It can

forecast demand. It can pick suppliers.

It can file POS. and ultimately it can write weekly reports for the employees of this retailer. Now, none of these capabilities are particularly complex on

their own, but again, the issue is that we've essentially bolted capabilities onto our agent over time without modernizing our architecture. This

complexity has started to cause some problems. Let's take a look at the actual architecture today of the agent.

Folks, today the agent is facilitated by a single orchestrator. So you see the stockpilot orchestrator sitting at the top of the screen. The agent has a system prompt as I mentioned that's

grown to be about 400 lines long. It has

12 different tools. Three of those tools happen to be wrappers around sub aents with completely isolated context windows. So if you have the repo pulled

windows. So if you have the repo pulled up, which we'll go into more detail in just a bit, there's an agent that's under a folder called before, which essentially walks through this this

agent. Exactly. So again, orchestrator,

agent. Exactly. So again, orchestrator, long system prompt, a lot of tools, we have a lot of sub agents. The result of this is that our eval started to dip. So

let's imagine how we got here for a moment. Again, like we built that agent

moment. Again, like we built that agent up front to solve a really specific problem. We received business

problem. We received business requirements to say add maybe some forecasting capability to our inventory management agent. So what we decided to

management agent. So what we decided to do was essentially just spin up a forecaster as a sub agent. Again later

on we we received more requirements to add report writing capability to our agent. So we decided to add another sub

agent. So we decided to add another sub aent for that report writing capability.

Again, our eval started to dip over time because we added more and more complexity while just bolting this capability on. So, let's take just a

capability on. So, let's take just a little bit of time and talk about eval specifically for this agent. Folks, we have 12

different eval tasks across five different types of graders. So, my

colleague gave a talk on eval shortly before this. Evals will have a component

before this. Evals will have a component within this workshop, but it won't be the main focus. I'll give you a quick summary of the tactical eval that we're using for this agent.

On the left side of the screen, you see some IDs. You see several evals that

some IDs. You see several evals that start with the letter R. This stands for regression. These are more realistic

regression. These are more realistic single turn tasks that we grade the model's capability on. So imagine I give the model a task. The model comprehends

that task in the for within the agent uh calls some tools and then provides a response back to me. We're essentially

evaluating that response. We also have some more complex tasks that we're grading the model on. So you see those F uh IDs, the IDs that start with F on the left side of the screen, that stands for

failure mode. In this case, we're

failure mode. In this case, we're evaluating the model over a more complicated multi-turn task that we're grading. Now again I won't go into eval

grading. Now again I won't go into eval too specifically. We have a number of

too specifically. We have a number of different types of graders that are both deterministic and non-deterministic.

Right? When I talk about not when I talk about uh deterministic eval count and like latency and like the number of tokens that are used as our

agent is completing a particular task and we're tracking those deterministic metrics over time.

We're also using the idea of LLM as a judge to evaluate the non-deterministic characteristics of our agent. So,

personality and tone and style and output quality. We're using a

output quality. We're using a nondeterministic graater as a part of our eval to evaluate our agents uh non-deterministic characteristics.

Now, we're going to run the evals for our agent in just a bit, but when you do, you'll find that the agent is struggling a bit. I'll talk about some of these eval in just a little bit more

depth. So F1 on the screen, third from

depth. So F1 on the screen, third from the bottom. This is essentially

the bottom. This is essentially simulating a daily low stock sweep. So

again, this is an inventory agent. We're

simulating our ability to look through all of our inventory and pull the low levels of stock.

This eval will actually fail because the agent is going to do the right thing, but it's going to take a very winding path to do so. So instead of taking the straightest line from point A to point

B, the agent is going to take a very inefficient path. It's going to get to

inefficient path. It's going to get to the right end, but it's going to fail the eval because it's not at the efficiency that we'd like. F2 on the screen is another eval that you'll see

fail. This evaluates

fail. This evaluates the ordering process under a particular promotion package. This is going to fail

promotion package. This is going to fail because we are using a sub agent for this particular task. the sub agent is actually getting the task right, but

there's a communication breakdown between our sub agent and our orchestrator. This is a really common

orchestrator. This is a really common point of failure that we see when customers have have really complicated systems with a lot of sub aents. It's

important to get the communication between your sub agents and your orchestrator just right. In the case of F2, like you see on the screen, this is an eval that's going to fail because we

have a breakdown in that communication.

The last one that I'll highlight that you'll see fails is R8 on the screen. R8

will essentially check the forecasting during a particular promotion month.

This eval is also going to fail because we have two different policies that live in very different parts of our system prompt and actually end up contradicting each other. So I mentioned over time our

each other. So I mentioned over time our system prompt has grown. we start to have some conflicts and the model gets confused leading towards a failure for this particular eval. Now in the repo,

you'll see it in the readme when we run these evals, you'll see that they're going to pass up front at about 83% which is okay, but if you work in the world of manufacturing, that is not

okay. 17% failure is a really expensive

okay. 17% failure is a really expensive failure percentage.

Now let's doubleclick on R8 again just so that we can understand a little bit about what's happening behind the scenes. Again, R8 is where we're

scenes. Again, R8 is where we're essentially calculating the forecast during a particular month with a promotion. And so in my uh on my screen

promotion. And so in my uh on my screen here on the right side where you see kind of the simulated terminal window uh within the first block under the commented text we can see that the agent

pulled the right forecasting baseline and also pulled the right promotion multiplier. So forecasting baseline 12

multiplier. So forecasting baseline 12 units a day promotion multiplier 3.1x this is all correct but in the calculation part below that we can see

that there was actually some kind of hallucination that happened instead of using that 3.1x promo multiplier the agent actually ended up using 1.35. So

something happened along the way. A hint

here is that the reason for this is that we have context problems. So this isn't a model problem. It's an issue with our the information that we're surrounding the model with. Our system prompt has

grown to be really long and is very confusing for the model and has some conflicts in it which lead to the issue that shows up within this eval.

So folks, our objective in this workshop will first be to run our suite of eval.

We're going to triage the issues and we're going to update the design of our agent accordingly. And then we're going

agent accordingly. And then we're going to do something that we call internally hill climbing towards eval improvement.

Right? So we run our evals, we get a baseline. It's going to be about 83%.

baseline. It's going to be about 83%.

We're then going to optimize the architecture of our agent and we're going to continue then running our eval so that we climb on them hopefully seeing the success percentage uh improve

over time. In this lab, we're also going

over time. In this lab, we're also going to start with an agent that is selfcreated on our messages API. Um,

again, if you have the repo and you click on the before folder, I'll show you this in just a bit. This is an agent that is built from scratch on our messages API. We're going to actually

messages API. We're going to actually migrate that a that agent to claude managed agents.

Cloud managed agents essentially allows us to offload the messiness that comes with maintaining an agentic harness in

scaling agents safely and securely to thousands and tens of thousands of users, right? Like if I want to build my

users, right? Like if I want to build my agent locally and run it locally, I can do that pretty quickly and pretty easily. But the moment that I need to

easily. But the moment that I need to take that agent, I need to host it remotely and I need to allow hundreds and and and thousands of users to at the

same time engage with that agent.

There's an infrastructure problem, there's a scaling problem, there's memory, there's security, there's so much that I have to account for. So in

order to offload that so I can just worry about the architecture of my agent itself and make decisions around tools,

skills and sub agents. I'm going to offload everything else to claude managed agents. So again, to break that

managed agents. So again, to break that down just a bit, um there's been a few talks on CMA so far today, but this is really where we're able to separate the

agent from the session details from the sandboxed environment where tool calls are actually happening. Um again, this allows us to offload particular parts of

the stack to then worry about the to then only worry about the design of our agent itself.

All right.

I mentioned that we're going to get hands-on in this workshop. We are going to go ahead and do that right now.

Now, what you see on the screen here is the workshop URL as well. Um, if you haven't had a chance to grab it, feel free to go ahead and do so. Um, this is where we're keeping all of the different

workshops throughout Code with Claude within London, so you can go back and revisit them if helpful. Within this

workshop, we're going to be working on agent decomposition. So that's going to

agent decomposition. So that's going to be the name of the folder that we're actually going to be working within.

Great. Let me jump forward here.

Perfect. So the first thing that we're going to do as a part of this workshop is we're first going to get a baseline. So when

you open up that link, you'll first uh clone the repo. So, we're going to clone the repo locally. We have a UV project that's set up. So, we're going to run UV sync in order to make sure that we have

all of our packages and our dependencies to be able to invoke the anthropic SDK and then eventually deploy our agent to claude managed agents. So, we can run UVC to do that. I mentioned previously

that we're going to need an API key for this workshop as well. So using those credits that you got at the start of this session, uh you can go to your claude console account and create an API

key. If you copy the ENV example, you'll

key. If you copy the ENV example, you'll just have to manually copy your API key into the ENV file that's created for you.

Now, all the 12 evals that I previously walked you through, we have all of those set up already. So in order to get a baseline and run those evals, you have to run uvun evals-

agent before. This is all in the readme,

agent before. This is all in the readme, but if you just run that command, you will be able to um actually go about running your evals.

Now, in terms of our building here, we're going to take a number of steps to actually go about running our evals, using cloud code to triage the results

of them, and then climbing accordingly on our agent. Um, so we're first going to take some we're going to take a look at our the system prompt that we have for our agent itself. Um, so I mentioned

earlier that our system prompt is currently sitting at about 400 lines long. We've been stacking information on

long. We've been stacking information on our system prompt over and over again as we've continued to get more business requirements. So our system prompt is

requirements. So our system prompt is very long. We'll take a look at that. We

very long. We'll take a look at that. We

are then going to take some time to evaluate the tools that we're using.

Right now, as I mentioned, we have 12 different tools. three of them are

different tools. three of them are actually kind of wrapped sub aents. So,

we'll take a look to see what we can do to make that more efficient. And then

lastly, if there are any sub aents that we really need to make our agent effective, we're going to take a look at the best way um to actually construct

sub aents with claude managed agents.

I'm going to jump back just for a moment. There's one thing that I forgot

moment. There's one thing that I forgot to mention for you as you get started.

Um, within the repo folder, there's two different uh folders that you'll see.

There's a before folder and then there's a starter folder. Those contain two separate agents. So, if you want to view

separate agents. So, if you want to view the messages API version of the agent, again, this is just me building my own agent loop and my own agent harness

around the anthropic messages API to invoke Claude. You'll see that within

invoke Claude. You'll see that within the before folder.

If you want to view what that agent looks like when deployed on cloud managed agents um you can look in the starter folder which exists right below

that. If you want to deploy your agent

that. If you want to deploy your agent on cloud managed agents you can run uv run deploy starter. So again run your

evals using the messages API version uh- agent before you can then deploy your agent on cloud managed agents. we

already have it built for you and it's really easy to use cloud code to kind of compare the two um and understand exactly what's going on and what some of the differences are uh with claude

managed agents.

Okay, so I'm going to jump over here and we're just going to open up cloud code and we are going to build together. I'm

going to zoom in very far so that you can see everything and so that I can see everything and we'll just talk through exactly what happens when I run some of these evals and we'll talk through the

process that we usually go through to do what I just called hill climbing on the evals themselves. Okay, so if you're

evals themselves. Okay, so if you're looking at cloud code here again I just used cla code to actually run my evals because I want claude's help in triaging

what's going on. Um, so this is me. I'm

using claude code. I have Opus 4.7 running as you can see on the screen.

Um, my effort level is set to extra high. I usually set effort as extra high

high. I usually set effort as extra high with Opus 47 and I forget about it.

That's the effort level that I usually stay on. We find that it gets great

stay on. We find that it gets great performance um with extra high effort altogether.

Now you can see on the screen the first thing that I did was I ran my evals. So

I used the bash capability in cloud code and I ran uv rune eval- agent before claude actually went ahead and ran my eval. So I'm going to scroll down and

eval. So I'm going to scroll down and we're going to look at what claude found while actually running those. So you can see the response that we got the results that we got from this eval run was

actually lower than what I told you before. So we ran them and we got 62%

before. So we ran them and we got 62% which is worse than the 83% that we started with. So, we passed seven out of

started with. So, we passed seven out of 12 of them and it looks like Claude has provided us with a diagnosis for the

different evals that we actually failed.

Let's scroll down just a bit more and we are going to use Claude to understand a little bit more about why this actually happened. So you can see

here I am using Claude to provide me some of the themes around why we actually failed some of these evals.

Again, this is a great technique if you have evals for your agent. Um again, as Gary showed before this session, you can use Claude to actually go about triaging these. So it looks like there's a few

these. So it looks like there's a few different themes that Claude is figuring out based on this agent. So the first thing Claude is seeing that our model is taking on a lot of work that it should

have tools in order to do. So our model is doing a lot of reasoning across information that it just doesn't have the tools to be able to complete.

It looks like there is some issues that we have with the enforcement of output structure. So our model and our sub

structure. So our model and our sub aents are producing information in a particular output structure that doesn't align um exactly with uh what we're

looking for with um to to pull the best performance from uh from our agent.

If I continue to scroll down here, you can see there was some policy issues, etc. Um, as I mentioned before, we have a system prompt that's really long right now. Um, and so Claude is seeing some

now. Um, and so Claude is seeing some confusions based on the information that's found within the system prompt.

So again, you can see Claude has found some root causes.

Now, we're going to do a few different things here. Again, we're going to go

things here. Again, we're going to go one by one and address some of the areas um that we're seeing issues on within our agent. So, I'm going to scroll down

our agent. So, I'm going to scroll down here and we are going to use um Claude code to triage some things within our agent. Okay. So, the first thing that

agent. Okay. So, the first thing that I'm going to ask Claude to do, we're going to talk through this. Claude is

making some changes, which is great. Um,

system prompts tend to get very, very long when we accumulate agents over time. So, the first prompt that I ran,

time. So, the first prompt that I ran, if you're following along, feel free to go ahead and do this. I encouraged

Claude to look at my agent.py PI file, which is where our main CMA um agent loop is located. Again, that's agent.py.

And I essentially said, "Hey, Claude, do you have any thoughts on the system prompt? Maybe I can use skills instead

prompt? Maybe I can use skills instead of a longunning system prompt for progressive disclosure. So, the first

progressive disclosure. So, the first thing that we'll talk about is skills.

There's been a few other sessions on skills. The short definition that I like

skills. The short definition that I like to use is that skills are packaged in composable information that Claude has the ability to pull into context

whenever Claude realizes that it needs that information to complete a particular task. Right? Skills are

particular task. Right? Skills are

really useful with Claude code. Like if

you need to provide Claude information on your testing process or if you want to package up your brand and your UI components and bundle them into a skill

that Claude can pull into context whenever needed. Skills are fantastic.

whenever needed. Skills are fantastic.

Skills are also useful within the agents that you're building for your customers.

So if you're building a product and you are going to give that product to customers, you're building an agent, skills are great within that. In the

case of the agent that we have on the screen here, um again, we have a lot of different policies and a lot of procedures that go into our inventory

management system. As I accumulated

management system. As I accumulated requirements over time, instead of building skills, I decided to take all of that information and keep appending

it to my system prompt. So, my system prompt got longer and longer and longer over time.

This is not something that we recommend you do based on the introduction of skills, right? Leave the system prompt

skills, right? Leave the system prompt only for the information that Claude needs in its mind, regardless of the task that you give it. Skills are

fantastic for packaging information that Claude is going to need some of the time, not all of the time. Right? So, if

I ask Claude to go build a forecast, Claude is going to um go ahead and do that. Let's see. I lost my computer just

that. Let's see. I lost my computer just for a second. There we go. If I ask Claude to go ahead and build a forecast, right, Claude is not going to need

forecasting information unless I specifically ask it um to go ahead and and build that forecast, right? So, in

the case of that particular task, I want Claw to pull forecasting information into its context window. Skills are also fantastic for making sure that you are

being efficient with context because if you stuff all of this information into the system prompt, you're polluting that context window with information that

Claude does not need um in order to complete a particular task. So again,

the first thing that I did, I'll zoom in just a bit more so that you can see this and I'll scroll up just a bit.

I said, 'Hey Claude, can you help me take a look through my system prompt?

Um, can I use skills instead? Um, my

system prompt is too long and I need some help. And so Claude did an analysis

some help. And so Claude did an analysis of this and realized that I have some pre-built skills that I can use to supplement information in my system

prompt. So the first correction or fix

prompt. So the first correction or fix that we're going to make to modernize our architecture here is we are actually

going to um remove uh many of uh much of the system prompt and we're going to put that information into skills. And so you can see here the first thing that we're

doing with Claude is we are activating a number of different skills that previously were not there before. And

we're actually swapping our system prompt to be a short prompt instead of a long one. So if you're curious, if you

long one. So if you're curious, if you feel like you have a long system prompt within the agents that you're building, feel free to take a look at this to see the differences between what was like a 400 line system prompt compared to about

a 50line system prompt. We've

supplemented that and we've switched a lot of that information to skills.

Great.

I am now going to continue working with cloud. You can see we made those changes

cloud. You can see we made those changes here, which is fantastic. Um, there's

some evals that I can go rerun. I'm

going to ask Claude to do one more thing, and then we're going to we're going to rerun some of our evals to see where we've improved.

So, I mentioned before that we have 12 different tools. You saw those on the

different tools. You saw those on the screen in the second slide that I shared. As a part of this inventory

shared. As a part of this inventory management agent, we have we have tools that we've created for everything. So,

whenever Claude needs to retrieve data, we have a tool. Whenever Claude needs to analyze data, we have a tool for that.

We have tools for everything. So, I'm

going to ask Claude to take a look at the tools that my agent has and help me think through how I can optimize here.

So, right now, Claude is running an analysis across the different tools that I have for my agent, and we're going to get to see what some of the results were.

Now, while this is working, um I'll give you a a tip. Um when it comes to building agents that we carry with us at Enthropic for our agents internally and the agents built with customers,

whenever we build agents, we lean into the same primitives um that we as humans have access to. So, imagine yourself when you show up to work, right? You

have a computer that's sitting in front of you. You have the ability to navigate

of you. You have the ability to navigate files on a file system. You can type in the browser and you can search the web.

If you're an engineer, you have the ability to write and execute code.

When you think about Claude Code as an agent, we've effectively given Claude access to all of the same primitives that you and I have access to when we show up to work every single day. Like

Claude Code is a great coding agent because Claude is really good at code.

But essentially what we've done with Claude Code is we've just given Claude access to a computer, right? And this is really powerful because this allows us

to drop in better versions of Claude as we continue to release new models. And

Claude just uses those primitives better than it did before, right? Like imagine

yourself after this conference compared to yourself when you walked in. You're

going to have the same tools at your fingertips, but you're theoretically your brain's going to be a little bit bigger. you're going to be smarter based

bigger. you're going to be smarter based on what you learned here and you're going to be more effective while using the same tools. Claude works the same exact way, right? And so whenever we

build agents, we lean into humanlike primitives first. These primitives are

primitives first. These primitives are things like code execution and the navigation of a file system, the keeping of a to-do list, the ability to search

the web. These are foundational tools

the web. These are foundational tools that we always start with when we build agents and we remove them as needed. An

example that I like to give is with file uh like document analysis.

If you're building an agent that requires document analysis, maybe you have a lot of uh CSVs or Excel sheets that your agent is going to be looking

over code execution. And so the ability to write and run code is one of the best ways of uh uh doing data analysis and

working across lots of documents, right?

Like if you need Claude to look across a CSV, giving Claude a bash tool so that Claude can write a quick Python script and reason across the results after

running that Python script is much more effective than just uploading the entire CSV into Claude's context window. Right?

So again, we lean into these uh computer-like primitives first when building an agent. So if I scroll down here, that's exactly what we did here.

You can see we took a lot of steps and we actually removed most of the tools that exist within our agent and we replaced them with uh some of the primitives that I talked through

previously. This is an inventory

previously. This is an inventory management agent that leans really well to this. Um, I have the ability to

to this. Um, I have the ability to consolidate and remove a lot of the tools that I'm using to reason across Excels and reason across forecasting

data and just give Claude access to the same tools that Claude code has in order to do that. What's cool about this is that when you build using uh Claude

managed agents, these tools are actually included by default. So if you want to give Claude

default. So if you want to give Claude access to those same tools that Claude code has and use them to uh drive powerful capability within your agent,

you don't have to worry about writing a tool that gives Claude the ability to write and run code or you don't have to write a tool that gives Claude the ability to use the file system. you can

just rely on those builtin tools um that we have built ourselves for cloud code that we just make available through uh

cloud managed agents. I'm going to ask Claude to rerun an eval to see if we are getting better.

Now with your agent there's always going to be the need to add some custom tools as well. like you're not you're only

as well. like you're not you're only going to get so far by giving your agent the same tools that we give Claude code.

Um so we always start with those uh primitives like code execution and web search and to-do lists etc. um we always start there and then uh we

either remove those tools as we don't need them right there might be some agents where we just don't need web search so we'll go ahead and remove that tool um and then we'll add custom tools

whenever we need them right so again when you think about tools I encourage you to start with those claude code primitives those humanlike primitives

and then add custom tools only as you need them in the case of this specific inventory agent um we were able to remove most of the tools and replace them with cloud code.

So you can see right now Claude is redeploying my agent to claude managed agents. So again I have my agent

agents. So again I have my agent locally. I am redeploying it based on

locally. I am redeploying it based on some of the changes that we've made. And

now I can rerun some of my evals to see the result. So you can see in that last

the result. So you can see in that last command I'm rerunning uh the F1 eval.

And we're going to see what happens as a result.

Now, we always get a lot of questions when it comes to MCP. So, in the case of CMA here, you have a couple different options when it comes to tools. You can

first lean on those cloud code primitives, things like web search and code execution and file system. Again,

that's what we start with. You can then create uh just custom tools, so standalone tools that only your agent has the ability to use. Then you can

connect your agent to MCP. We see a lot of folks run towards MCP first and a lot of our customers end up in this ecosystem where there's a lot of kind of

chaotic MCP servers that exist. A lot of times they have overlap um which can create some problems. So when we build agents again we start with those cloud

code tools. We then create local tools

code tools. We then create local tools only for our agent. we don't run to MCP and then only in the case where we have a common collection of tools that

multiple clients will benefit from accessing do we go about the process of collecting those and publishing them as an MCP server. So only when we have

multiple agents maybe multiple cloud code clients that need to access the same set of standardized and governed tools we run towards MCP.

Something else that's becoming increasingly common throughout the industry is leaning on Claude's ability to effectively use code execution as a

means of executing tools. So we see a lot of capabilities coming out around just giving Claude access to uh use CLIs

and invoke APIs using code and actually run tools using code instead of MCP. One

of the drawbacks of MCP is that it does um cause some uh it can cause some context issues just in terms of polluting context and taking up a lot of space. So there may be some cases where

space. So there may be some cases where you can just rely on code execution either through CLIs or just by giving claw the ability to invoke APIs using

code as a means of um creating more flexibility for your agent where you do not have to use MCP. So, just something

to keep in mind as you're building.

Great. Okay, so Claude just got done.

Um, looks like we have the before and after from some of the changes that we've made. And I think that this is

we've made. And I think that this is pretty compelling, right? The first

thing that jumps off the screen to me is the token usage. So, before I was using over 200,000 tokens for a particular task. After leaning in on some of those

task. After leaning in on some of those file system primitives, you can see that that went down dramatically. This is a direct result of giving my agent code execution. So again, imagine instead of

execution. So again, imagine instead of giving my agent a full CSV that needs to be read into context, I just give my agent the ability to write and run Python as a means of kind of navigating

across all of that information. The

agent uses a lot less tokens when it can write code and then run code and then read the results instead of having to consume all of that data in Claude's mind and then use all of that kind of

collective brain power to then make decisions based on the results. Um, a

few other things. Um, we can see that our costs went down as well because we're just not using as many tokens, which makes sense. Um, our our our execution time went down as well. So

this was a pretty good case where I think we got better overall, but this is not something that will happen all the time, right? We like we might see some

time, right? We like we might see some cases where we regress, but this was the case where using some of those primitives as opposed to some of our more stagnant tools was clearly the uh

the right decision.

Great. Okay, we're going to jump back and we're going to talk about sub aents for just a bit. I'm going to copy another prompt to Claude and we are

going to um investigate sub agents.

Now I mentioned before that we had 12 different tools. Three of them were

different tools. Three of them were effectively wrapping sub agents. So if

I'm Claude, I have the ability to call on a tool. That tool is a wrapper for a sub aent. I can then go and invoke that

sub aent. I can then go and invoke that sub agent. See Claude's doing a lot

sub agent. See Claude's doing a lot here. I'll scroll up just a bit and then

here. I'll scroll up just a bit and then we'll talk through it.

The two main use cases where or the two main instances where we see sub aents initially as being really effective is first when you want to throw a lot of claude at a problem,

right? So let's say that you're trying

right? So let's say that you're trying to do deep research or like web search um let's say that you're trying to do in the case of claude codebased exploration. That's a great case where

exploration. That's a great case where like having many different minds running at the same problem makes sense. So sub

aents are a great way to parallelize and throw a lot of claw at a problem to get it done faster and more effectively.

The second case where it's really common to use sub aents is when you need a fresh mind to look at a problem. So I'll

use the cloud code example first. If I

I'll use my example as a developer. If I

am writing code, I do not want to be the same person that is writing and also reviewing my code. I'm going to have somebody else review my code. So in the case of Claude code, it makes a lot of

sense to have one instance of Claude doing the writing of the code and then another instance of Claude coming over the top and reviewing that that does not

have context about the initial uh instance of Claude. This is a great case for a sub agent. Using just a code review sub agent and layering it over

the top is a great way to do this. We

also have a a sub agent within um our our agent here, our inventory management agent that we've actually kept as the result of um of some of the changes that

we made and that's for forecasting specifically. So again, I have a

specifically. So again, I have a forecasting capability that's within my inventory management agent. I do want to keep my forecasting separate from my

main instance of claude. I don't want anything in my initial context window to distort the forecasting process. I do

have a skill that kind of walks through the step-by-step sequence and the guidelines that I prefer Claude use when writing and building forecasts, but again, I don't want the same cla that

I'm that uh my customer is talking with to also be the claude that writes the forecast, right? So, I want to divide

forecast, right? So, I want to divide that. So I'm leaning on that second

that. So I'm leaning on that second example of when to use a sub aent um as the place where we'd like to go about doing this.

So in this case we've removed our other sub aents and we've just replaced them with primitive tools. Um but we are going to leave the forecasting sub

agent. Now we're not going to expose our

agent. Now we're not going to expose our sub agent as a tool using cloud managed agents. There's a

native capability for sub aents um that allows the logging and the observability of your sub aents um to be really effective. One of the problems with sub

effective. One of the problems with sub aents is that when you have multiple instances of cloud running, first off, it's difficult to make sure that the communication between your orchestrators

and your sub aents is accurate and is seamless. Right? There's a lot that can

seamless. Right? There's a lot that can get lost in translation. Just like when I'm talking to one of my colleagues, I might be thinking something, they might be interpreting it completely differently. The same thing happens with

differently. The same thing happens with orchestrators and sub agents.

The um the other thing that can happen is logging is really difficult in some cases, right? Because then you have to

cases, right? Because then you have to worry about collecting the transcripts from multiple different agents. So

within cloud managed agents, we've added this native sub aent capability. I saw

it on here. Let me scroll up just a bit.

I think Claude found it. Yes. So there's

this callable agents capability that exists within cloud managed agents which is essentially just like manage sub agents. So that within your session

agents. So that within your session information you have observability and metrics about what exactly your sub aents are doing that is as accurate as

your initial orchestrator. Right? Um,

this is again meant to solve one of the common problems of just having uh a lot of information that is hard to track with sub aents. We just did some building. Again, I'm going to skip

building. Again, I'm going to skip through these because we spent some time talking about them. We just talked about sub agents. Again, there's a few

sub agents. Again, there's a few different cases where you can use them.

We just talked about callable agents.

You can also just define your sub agent as a tool, which is what we did previously, but we actually moved away from that and we decided to use the CMA native capability. Um there are a lot of

native capability. Um there are a lot of cases where you can just now scrap the sub agent entirely and just give more flexibility and capability to your main agent. So what we have a lot of

agent. So what we have a lot of customers doing is actually just consuming capability into their main in this case orchestrator because frontier

models have gotten intelligent enough to manage across more information where you just don't need as many sub agents. So

again, when you're thinking sub agent, I have a lot of CL or I have a big problem that I want to throw a lot of cloud at or I want a separate claude to kind of look at um the work of either me or of a

different instance of claude. Two great

times to use sub agents.

Okay, so let's look at the architecture that we ended with. Again, refreshing

us, we started with an orchestrator system prompt of about 400 lines long.

We had 12 tools. Three of them were sub agents. What did we end with after this

agents. What did we end with after this exercise? We still have an orchestrator,

exercise? We still have an orchestrator, but we deployed that on cloud managed agents because I didn't want to have to worry about infrastructure scaling, security, etc. I just wanted to worry

about my agent, right? Like in in Will's simple terms like that is when I reach for cloud managed agents because I just want to worry about building the best thing possible and not all the messiness

that comes with it. We simplified our tools. We now have uh we have right now

tools. We now have uh we have right now three different tools. So we actually simplified everything to just use bash read and write. Now when our agent

starts executing, we sync some data into the cloud managed agents environment so that it can reason across that data.

We actually simplified our system prompt to 15 lines long and we replaced all of our business logic with skills. So

again, I was just stuffing requirement after requirement into my system prompt.

I decided to take that, package it up as skill so that Claude could pull that information into its brain only when Claude realized that it needed it in order to solve a problem. As a result of

this, we showed how we can then start hill climbing on evals to see improvements over time. So at the end of this, my eval score is about 92%. I've

simplified my design. I'm leaning into some of the primitives um that make uh Claude great. Um and I'm seeing the

Claude great. Um and I'm seeing the positive results after doing so.

Again, some of the eval results you see that here after running this um we're getting faster. We're using fewer tokens

getting faster. We're using fewer tokens because we're leaning into code execution. Um, our turn count is

execution. Um, our turn count is remaining sort of the same, but again, because the token usage and the cost is going down, I'm actually okay with Claude taking more turns. There are some

cases where we'll see the latency not drop maybe as much as you would expect, but for some of these more sophisticated, high intelligence agents where like forecasting is at play, I'm willing to take a little bit higher

latency um at the expense of seeing my performance improve and my costs go down. All

down. All right, let's wrap with some some takeaways here in our last minute.

When we build agents, we start with a single agent loop that that is equipped with very simple primitives that give Claude some of these humanlike

capabilities like the ability to use the file system like the one that you have on your computer, web search, code execution, um sometimes a to-do list. Again, we

start there and then we build accordingly.

The next thing that we did is we used progressive disclosure through skills.

Instead of stuffing our system prompt with a lot of information, we made information accessible to Claude whenever Claude realized that it needed that information in order to solve a problem. This is great because we can

problem. This is great because we can run more efficiently and uh we're not polluting our context window. Um and

we're giving Claude more flexibility to make decisions. The last thing that I

make decisions. The last thing that I want you to walk away with, right, eval in general. This idea of hill climbing

in general. This idea of hill climbing is a concept that we lean on really uh heavily at Enthropic, right? You have

evaluated as your product capability expands.

Always make sure that your evals are encompassing the things that you care about and that you're measuring within your agent so that you can actually make sure that your agent is accomplishing the thing that you set out to

accomplish.

With that, folks, we're going to go ahead and wrap. I really appreciate your time today. Um, I'll be in the back

time today. Um, I'll be in the back after the session just outside of this room in case you have any questions at all. Um, thank you for spending your day

all. Um, thank you for spending your day at Code with Claude in London. And I

hope you have a great rest of your day.

Appreciate it.

Loading...

Loading video analysis...