
How We Build Effective Agents: Barry Zhang, Anthropic

By AI Engineer

Summary

Topics Covered

  • Don't build agents for everything
  • Keep agents as simple as possible while iterating
  • Think like your agent to understand its perspective

Full Transcript

Wow, it's incredible to be on the same stage as so many people I've learned so much from. Let's get into it.

My name is Barry, and today we're going to be talking about how we build effective agents. About two months ago, Eric and I wrote a blog post called Building Effective Agents. In it, we shared an opinionated take on what an agent is and isn't, and we gave some practical learnings we've gained along the way. Today, I'd like to go deeper on three core ideas from the blog post and share some personal musings at the end. Here are those ideas. First, don't build agents for everything. Second, keep it simple. And third, think like your agents.

Let's first start with a recap of how we got here. Most of us probably started by building very simple features: summarization, classification, extraction, really simple things that felt like magic two to three years ago and have now become table stakes. Then, as we got more sophisticated and as products matured, we got more creative. One model call often wasn't enough, so we started orchestrating multiple model calls in predefined control flows. This gave us a way to trade off cost and latency for better performance, and we call these workflows. We believe this is the beginning of agentic systems. Now models are even more capable, and we're seeing more and more domain-specific agents start to pop up in production. Unlike workflows, agents can decide their own trajectory and operate almost independently based on environment feedback. This is going to be our focus today. It's probably a little too early to name what the next phase of agentic systems will look like, especially in production. Single agents could become a lot more general purpose and more capable, or we could start to see collaboration and delegation in multi-agent settings. Regardless, I think the broad trend is that as we give these systems more agency, they become more useful and more capable, but as a result, the cost, the latency, and the consequences of errors also go up.

And that brings us to the first point: don't build agents for everything. Well, why not? We think of agents as a way to scale complex and valuable tasks. They shouldn't be a drop-in upgrade for every use case. If you've read the blog post, you'll know we talked a lot about workflows, and that's because we really like them and they're a great, concrete way to deliver value today.

today. Well, so when should you build an agent? Here's our

agent? Here's our checklist. The first thing to consider

checklist. The first thing to consider is the complexity of your task. Agents

really thrive in ambiguous problem spaces. And if you can map out the

spaces. And if you can map out the entire decision tree pretty easily, just build that explicitly and then optimize every node of that decision tree, it's a lot more cost- effective and it's going

to give you a lot more control. Next thing to consider is the

The next thing to consider is the value of your task. That exploration I just mentioned is going to cost you a lot of tokens, so the task really needs to justify the cost. If your budget per task is around 10 cents, for example, because you're building a high-volume customer support system, that only affords you 30 to 50,000 tokens. In that case, just use a workflow to solve the most common scenarios, and you'll capture the majority of the value from there.
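To make that budget math concrete, here's a rough back-of-the-envelope version. The blended per-token price below is an illustrative assumption, not a quote of any actual pricing; plug in the real rates for whatever model you use.

    # Translate a per-task budget into a rough token allowance.
    # The blended price is an assumed figure for illustration only.
    budget_per_task_usd = 0.10        # e.g. one high-volume support ticket
    assumed_price_per_mtok = 3.00     # assumed blended USD per 1M tokens

    token_allowance = budget_per_task_usd / assumed_price_per_mtok * 1_000_000
    print(f"~{token_allowance:,.0f} tokens per task")   # roughly 33,000 tokens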

On the other hand, if you look at this question and your first thought is "I don't care how many tokens I spend, I just want to get the task done," please see me after the talk; our go-to-market team would love to speak with you.

From there, we want to de-risk the critical capabilities. This is to make sure there aren't any significant bottlenecks in the agent's trajectory. If you're building a coding agent, you want to make sure it's able to write good code, able to debug, and able to recover from its errors. If you do have bottlenecks, that's probably not fatal, but they will multiply your cost and latency. In that case, we normally just reduce the scope, simplify the task, and try again.

Finally, the last important thing to consider is the cost of errors and of error discovery. If your errors are high-stakes and very hard to discover, it's going to be very difficult for you to trust the agent to take actions on your behalf and to have more autonomy. You can always mitigate this by limiting the scope: you can use read-only access, you can add more human in the loop, but this will also limit how well you're able to scale the agent for your use case.

Let's see this checklist in action. Why is coding a great agent use case? First, going from a design doc to a PR is obviously a very ambiguous and very complex task. Second, a lot of us here are developers, so we know that good code has a lot of value. Third, many of us already use Claude for coding, so we know it's great at many parts of the coding workflow. And last, coding has this really nice property where the output is easily verifiable through unit tests and CI. That's probably why we're seeing so many creative and successful coding agents right now.

Once you've found a good use case for agents, the second core idea is to keep it as simple as possible. Let me show you what I mean. This is what agents look like to us: they're models using tools in a loop. In this frame, three components define what an agent really looks like. First is the environment: the system the agent is operating in. Then we have a set of tools, which offer an interface for the agent to take action and get feedback. And then we have the system prompt, which defines the goals, the constraints, and the ideal behavior for the agent to actually work in that environment. Then the model gets called in a loop, and that's an agent.
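In code, that loop can be sketched roughly like this. It's a minimal illustration, not Anthropic's actual implementation: the `call_model` and `execute_tool` helpers, the tool names, and the stop condition are all hypothetical placeholders you'd swap for your own model API and environment.

    # Minimal sketch of "models using tools in a loop".
    # `call_model` and `execute_tool` are hypothetical placeholders.

    SYSTEM_PROMPT = (
        "You are a coding agent. Goal: turn the design doc into a passing PR. "
        "Constraints: only modify files under src/; run the tests before finishing."
    )

    TOOLS = [
        {"name": "read_file", "description": "Read a file from the repository"},
        {"name": "run_tests", "description": "Run the unit test suite"},
    ]

    def run_agent(task, call_model, execute_tool, max_turns=20):
        """The loop: the model picks an action, the environment returns feedback, repeat."""
        messages = [{"role": "user", "content": task}]
        for _ in range(max_turns):
            reply = call_model(system=SYSTEM_PROMPT, tools=TOOLS, messages=messages)
            if reply["type"] != "tool_use":          # the model decided it's done
                return reply["text"]
            feedback = execute_tool(reply["tool_name"], reply["tool_input"])
            messages.append({"role": "assistant", "content": str(reply)})
            messages.append({"role": "user", "content": feedback})   # environment feedback goes back in
        return "Stopped: turn limit reached"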

We've learned the hard way to keep this simple, because any complexity up front is really going to kill iteration speed. Iterating on just these three basic components gives you by far the highest ROI, and optimizations can come later.

Here are three agent use cases we've built for ourselves or for our customers, just to make it more concrete. They look very different on the product surface, they differ in scope, and they differ in capability, but they share almost exactly the same backbone; they actually share almost the exact same code. The environment largely depends on your use case, so really the only two design decisions are: what set of tools do you want to offer to the agent, and what prompt do you want to instruct your agent to follow?

On this note, if you want to learn more about tools, my friend Mahesh is giving a workshop on the Model Context Protocol (MCP) tomorrow morning. I've seen that workshop; it's going to be really fun, so I highly encourage you to check it out.

But back to our talk. Once you've figured out these three basic components, you have a lot of optimizations to do from there. For coding and computer use, you might want to cache the trajectory to reduce cost. For search, where you have a lot of tool calls, you can parallelize many of them to reduce latency. And for almost all of these, we want to present the agent's progress in a way that gains user trust.
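For the search example, "parallelize the tool calls" can be as simple as fanning out independent calls concurrently instead of awaiting them one by one. The `search` function below is a stand-in for whatever tool your agent actually uses, not a specific API.

    import asyncio

    # Stand-in for a real tool call (e.g. a search API request).
    async def search(query: str) -> str:
        await asyncio.sleep(0.1)                     # pretend this is network latency
        return f"results for {query!r}"

    async def run_parallel(queries):
        # Launch all independent tool calls at once; total latency is roughly
        # the slowest call rather than the sum of all of them.
        return await asyncio.gather(*(search(q) for q in queries))

    results = asyncio.run(run_parallel(["pricing page", "API docs", "changelog"]))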

But that's it: keep it as simple as possible as you're iterating. Build these three components first, and then optimize once you have the behaviors down.

All right, this is the last idea: think like your agents. I've seen a lot of builders, myself included, develop agents from our own perspective and get confused when the agent makes a mistake that seems counterintuitive to us. That's why we always recommend putting yourself in the agent's context window. Agents can exhibit some really sophisticated behavior, and it can look incredibly complex, but at each step what the model is doing is still just running inference on a very limited set of context. Everything the model knows about the current state of the world is contained in those 10 to 20k tokens. It's really helpful to limit ourselves to that context and ask whether it's actually sufficient and coherent. This will give you a much better understanding of how agents see the world and help bridge the gap between our understanding and theirs.

Let's imagine for a second that we're computer use agents and see what that feels like. All we're going to get is a static screenshot and a very poorly written description (this one is by yours truly). Let's read through it: you're a computer use agent, you have a set of tools, and you have a task. Terrible. We can think and talk and reason all we want, but the only thing that takes effect in the environment is our tools. So we attempt a click without really seeing what's happening. While the inference is happening, and while the tool execution is happening, it's basically equivalent to closing our eyes for three to five seconds and using the computer in the dark. Then you open your eyes and see another screenshot. Whatever you did could have worked, or you could have shut down the computer; you just don't know. It's a huge leap of faith, and the cycle starts again. I highly recommend trying a full task from the agent's perspective like this. I promise you it's a fascinating and only mildly uncomfortable experience.

However, once you go through that mildly uncomfortable experience, I think it becomes very clear what the agent would have actually needed. It's clearly crucial to know what the screen resolution is, so you know where to click. It's also good to have recommended actions and limitations, so we can put guardrails around what we should be exploring and avoid unnecessary exploration. These are just some examples; do this exercise for your own agent use case and figure out what kind of context you actually want to provide to the agent.
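As one hypothetical illustration of that exercise, the extra context for a computer use agent might look something like the following. The exact fields are made up for this example, not a prescribed schema.

    # Illustrative only: the kind of context a computer use agent might need.
    COMPUTER_USE_CONTEXT = {
        "screen_resolution": (1920, 1080),   # so click coordinates actually land where intended
        "recommended_actions": [
            "Take a screenshot after every action to confirm its effect",
            "Prefer keyboard shortcuts when a menu path is ambiguous",
        ],
        "limitations": [
            "Never shut down or restart the machine",
            "Stay within the browser window for this task",
        ],
    }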

Fortunately, though, we're building systems that speak our language, so we can just ask Claude to understand Claude. You can throw in your system prompt and ask: is any of this instruction ambiguous? Does it make sense to you? Are you able to follow this? You can throw in a tool description and see whether the agent knows how to use the tool, and whether it wants more parameters or fewer. And one thing we do quite frequently is throw the entire agent trajectory into Claude and just ask it: hey, why do you think we made this decision right here? And is there anything we can do to help you make better decisions?
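A minimal sketch of that kind of check, using the Anthropic Python SDK; the model name and the prompt wording are illustrative choices rather than a prescribed recipe.

    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def critique_agent_context(system_prompt: str, trajectory: str) -> str:
        """Ask the model whether the agent's own context was sufficient and coherent."""
        response = client.messages.create(
            model="claude-3-5-sonnet-latest",   # substitute whatever model you actually use
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": (
                    "Here is an agent's system prompt and one full trajectory.\n\n"
                    f"SYSTEM PROMPT:\n{system_prompt}\n\nTRAJECTORY:\n{trajectory}\n\n"
                    "Is any instruction ambiguous? Why do you think the agent made the "
                    "decision at the final step, and what extra context would have helped?"
                ),
            }],
        )
        return response.content[0].text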

This shouldn't replace your own understanding of the context, but it'll help you gain a much closer perspective on how the agent is seeing the world. So once again, think like your agent as you're iterating.

All right, I've spent most of the talk on very practical stuff, so I'm going to indulge myself and spend one slide on personal musings. This is my view on how this might evolve and some open questions I think we need to answer together as AI engineers.

These are the top three things that are always on my mind. First, I think we need to make agents a lot more budget-aware. Unlike workflows, we don't really have a great sense of control over the cost and latency of agents. I think figuring this out will enable a lot more use cases, because it gives us the necessary control to deploy them in production. The open question is what's the best way to define and enforce budgets: in terms of time, in terms of money, in terms of tokens, the things we care about.
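Purely as a sketch of the idea, budget awareness could start as a simple check on tokens, money, and wall-clock time before each model call, degrading gracefully (a partial result, or a hand-off to a human) when a limit is hit. The thresholds below are arbitrary examples, not recommendations.

    import time
    from dataclasses import dataclass

    @dataclass
    class Budget:
        max_tokens: int = 50_000      # per-task token ceiling
        max_usd: float = 0.10         # per-task spend ceiling
        max_seconds: float = 120.0    # wall-clock ceiling

    def within_budget(budget: Budget, tokens_used: int, usd_spent: float, started_at: float) -> bool:
        return (
            tokens_used < budget.max_tokens
            and usd_spent < budget.max_usd
            and time.monotonic() - started_at < budget.max_seconds
        )

    # Inside the agent loop, check this before every model call and stop or
    # hand off to a human once any limit is exhausted.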

Next up is the concept of self-evolving tools. I already hinted at this two slides ago: we're already using models to help iterate on tool descriptions, but this should generalize pretty well into a meta-tool where agents can design and improve their own tool ergonomics. This will make agents a lot more general purpose, as they can adapt the tools they need for each use case.

Finally, I don't even think this is a hot take anymore: I have a personal conviction that we'll see a lot more multi-agent collaborations in production by the end of this year. They parallelize well, they have very nice separation of concerns, and having sub-agents, for example, really protects the main agent's context window. But I think a big open question here is how these agents actually communicate with each other. We're currently in a very rigid frame of mostly synchronous user-assistant turns, and I think most of our systems are built around that. So how do we expand from there, build in asynchronous communication, and enable more roles that allow agents to communicate with and recognize each other? I think that's going to be a big open question as we explore this multi-agent future.

These are the areas that take up a lot of my mind space. If you're also thinking about them, please shoot me a text; I would love to chat. Okay, let's bring it all together. If you forget everything I said today, these are the three takeaways. First, don't build agents for everything. Second, if you do find a good use case and want to build an agent, keep it as simple as possible for as long as possible. And finally, as you iterate, try to think like your agent, gain their perspective, and help them do their job.

I would love to keep in touch with every one of you. If you want to chat about agents, especially those open questions I talked about, that would be incredibly lovely; we can just jam on some of these ideas. These are my socials if you want to get connected. And I'm going to end the presentation on a personal anecdote. Back in 2023, I was building AI products at Meta, and we had this funny thing where we could change our job description to anything we wanted. After reading that blog post from swyx, I decided I was going to be the first AI engineer. I really love the focus on practicality and on just making AI actually useful to the world, and I think that aspiration brought me here today. So I hope you enjoy the rest of the AI Engineer Summit, and in the meantime, let's keep building. Thank you.
