
Context Engineering for Agents

By LangChain

Summary

## Key takeaways

- **Context Engineering: The Art of LLM Memory Management**: Context engineering is the discipline of strategically filling an LLM's context window with the most relevant information for each step of an agent's task, akin to an operating system managing RAM. [00:15], [01:17]
- **Four Pillars of Context Engineering for Agents**: Key strategies include writing (saving data outside the context window), selecting (retrieving relevant data), compressing (retaining only essential tokens), and isolating (partitioning context). [03:35], [03:51]
- **Scratchpads vs. Memory: Persistence in Agents**: Scratchpads are temporary storage for information within a single agent session, while memories persist across multiple sessions, allowing agents to retain knowledge over time. [04:20], [05:24]
- **Tool Handling Limits and Retrieval Augmentation**: Agents struggle with large numbers of tools, with performance degrading beyond 30 tools. Retrieval-augmented generation (RAG) over tool descriptions, using semantic similarity, can improve tool selection. [08:13], [08:25]
- **Context Compression: Summarization and Trimming**: Compressing context involves summarization, as seen when Claude Code compacts sessions approaching the 200,000-token limit, and trimming, which selectively removes less relevant information. [10:03], [10:10]
- **LangGraph's State Object for Context Isolation**: LangGraph uses a state object, accessible within each node, to manage and isolate context, allowing information to be partitioned and only the necessary data exposed to the LLM. [13:36], [15:15]

Topics Covered

  • Context Engineering: The Number One Job for AI Agents.
  • Agents use scratchpads and memory like humans.
  • Selective retrieval improves tool and knowledge agent performance.
  • Compressing context manages token bloat in agent trajectories.
  • Isolating context scales agents beyond single LLM limits.

Full Transcript

Hey, this is Lance from LangChain. You might have heard the term context engineering recently. It's a nice way to capture many of the different things we do when we're building agents. Of course, agents need context: they need instructions, they need external knowledge, and they also use feedback from tool calls. Context engineering is just the art and science of filling the context window with just the right information at each step of the agent's trajectory.

Now I want to talk about a few different strategies for context engineering, which we can group into writing context, selecting context, compressing context, and isolating context. I'll walk through some interesting examples of each one with popular agents that we use frequently in our day-to-day work, and I'll also talk about how LangGraph is designed to support all of these. But first, what is context engineering?

Where did this come from? Well, Tobi from Shopify had an interesting post saying he likes the term context engineering. Karpathy followed up on this and offered a good definition: context engineering is the delicate art and science of filling the context window with just the right information for the next step. Karpathy also recently highlighted an interesting analogy between LLMs and operating systems: the LLM is like a CPU, and the context window is like RAM, its working memory. And importantly, it has limited capacity to handle context.

And so, just like an operating system curates what fits in RAM, you can think of context engineering as the discipline, or the art and science, of deciding what needs to fit in context at, for example, each step of an agent trajectory. Now, what are the types of context we're talking about? We can think of this as an umbrella over a few different themes. One is instructions: you've heard a lot about prompt engineering, and that's just a subset of this. There are, of course, things like memories, few-shot examples, and tool descriptions. There's also knowledge, which could be facts or memories. And there are tools, which provide feedback from the environment, for example via APIs, a calculator, or other tools. So you have all these sources of context flowing into the LLM when you're building applications. Now, why is this a bit trickier for agents in particular?

Well, agents have at least two properties: they often handle longer-running or higher-complexity tasks, and they utilize tool calling. Both of these result in larger context utilization. For example, the feedback from tool calls can accumulate in the context window, or very long-running tasks can accumulate lots of token usage over many turns. Here's an example: on turn one you call a tool, on turn two you call another tool, and if you have a large number of turns, that tool feedback just grows and grows. Now, what's the problem with that? This blog post from Drew Breunig nicely outlines a number of specific context failures: context poisoning, distraction, confusion, and clash. I encourage you to read that post. It's really interesting, but it's also intuitive: as the context grows longer, there's more information for an LLM to process, and there are more opportunities for it to get confused, for example due to conflicting information, or to be influenced in an adverse way by an injected hallucination. For these reasons, context engineering is particularly critical when building agents, because they typically have to handle longer context for the reasons mentioned above. Cognition highlighted this nicely in a recent blog post, saying that context engineering is effectively the number one job of engineers building AI agents.

So, what can we do about this? Well, I've looked at many popular agents that many of us use today, thought about this a lot, and reflected on my own experience. You can distill the approaches down into four bins: writing context, saving it outside the context window to help an agent perform a task; selecting context, selectively pulling context into the context window to help an agent perform a task; compressing context, retaining only the most relevant tokens; and isolating context, splitting context up, again to help an agent perform a task. Now I'll talk about some examples of each of these categories.

So, writing context. Writing context means saving information outside the context window to help an agent perform a task. When humans solve tasks, we take notes and we remember things for future related tasks. Agents can do those same two things: for note-taking, agents can use a scratchpad, and for remembering things, agents can use memory.

You can think of a scratchpad as a term that captures the idea of persisting information while an agent is performing a task. A good example of this is Anthropic's recent multi-agent researcher: the lead researcher begins by thinking through the approach and saving that plan to memory to persist it. And this is a great point: you want to keep the plan around. The context window might exceed the limit of 200,000 tokens, but the plan can always be retrieved and retained. It's a very intuitive example of taking a note and saving it in a scratchpad. Now, I do want to make a subtle point: the implementation of your scratchpad can differ. In their case, they just save it to a file, but you could also, for example, save it to a runtime state object, depending on what agent library you're using. The intuition is really that you want to be able to write information while the agent is solving a task so the agent can recall that information later if it needs to. Now, memories are a bit different.

Sometimes we want to save information across many different sessions with an agent. Typically, scratchpads are relevant only within a single agent session: the agent uses the scratchpad to solve a problem, and then the scratchpad isn't relevant anymore. Memories are things you want the agent to retain over time, across many sessions. There are some fun examples from the literature: generative agents, for example, synthesize memories from collections of past agent feedback. And you've seen this in products: ChatGPT has a great memory feature, and Cursor and Windsurf will also auto-generate memories based on user-agent interactions. So this pattern is certainly emerging in popular AI products. Again, the intuition is pretty clear: you have some new context, you have some existing memories, and you can update the memories with the new information dynamically as the agent interacts with the user. Now, we've covered writing context and talked about a few examples of it. Let's talk about selecting context. Selection means pulling context into the context window to help an agent perform a task. We touched on this previously with scratchpads: of course, an agent can reference what it wrote earlier, either via a tool call or by reading directly from a state object. Now, memories are a bit more interesting and a bit more subtle.

There are different types of memories you might want to pull into context, depending on the problem you're trying to solve. They could be few-shot examples that provide specific instructions for a desired behavior, a selectively chosen prompt for a given task, or facts. These map to different memory types. Semantic memories in humans, for example, are things like facts, things I learned in school. Episodic memories are more analogous to few-shot examples: past experiences. Procedural memories are like instructions, instincts, and motor skills. These are all things we may want to selectively pull into context to help an agent solve problems. Now, where do these come up practically in the agents we work with today? Instructions, or procedural memories, are typically captured in things like rules files, or CLAUDE.md when you're working with code agents. This is typically a file with style guidelines or general instructions for tools to use with a given project, and oftentimes these are all pulled into context. For example, when you start Claude Code, it will pull in all the CLAUDE.md files you have in your project and in your organization. Facts are a bit more subtle. Oftentimes we want to selectively pull in facts from a large collection, and this is where it's common to use things like embedding-based similarity search or graph databases to house collections of memories, in order to better control their retrieval and ensure that only relevant memories from a large collection are pulled in at the right time.
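That selection step can be sketched with a toy similarity search. This is a minimal illustration, not a production memory system: the bag-of-words `vectorize` function stands in for a real embedding model, and the memory strings are invented examples.

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity over sparse term counts.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_memories(query: str, memories: list[str], k: int = 2) -> list[str]:
    # Pull only the k most relevant memories into the context window.
    q = vectorize(query)
    return sorted(memories, key=lambda m: cosine(q, vectorize(m)), reverse=True)[:k]

memories = [
    "user prefers concise answers",
    "the project deploys to AWS us-east-1",
    "user's favorite editor is Vim",
]
print(select_memories("which cloud region does the project use?", memories, k=1))
# -> ['the project deploys to AWS us-east-1']
```

The same shape works with a vector database: only the top-k matches for the current query ever enter the prompt, no matter how large the memory collection grows.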

Now, tools are another very interesting thing we often want to pull into context. One of the problems is that agents have difficulty handling large collections of tools. The paper I link here has some interesting results on that, showing degradation after around 30 tools and complete failure at around 100 tools. So they propose using RAG over tool descriptions: embedding the tool descriptions and using retrieval based on semantic similarity to fetch only the relevant tools for a given task. This can improve performance significantly, so it's a nice trick for selectively pulling only relevant tools into context.
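As a sketch of RAG over tool descriptions, here's a toy ranker that scores each tool's description against the task and binds only the top-k tools. Lexical Jaccard overlap stands in for embedding similarity here, and the tool names and descriptions are made up for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    # Word-overlap score; a stand-in for embedding-based semantic similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def select_tools(task: str, tools: dict[str, str], k: int = 3) -> list[str]:
    # Rank tools by description relevance to the task; expose only the top k
    # to the model instead of the full tool collection.
    task_words = set(task.lower().split())
    ranked = sorted(
        tools,
        key=lambda name: jaccard(task_words, set(tools[name].lower().split())),
        reverse=True,
    )
    return ranked[:k]

tools = {
    "search_web": "search the web for pages matching a query",
    "read_file": "read the contents of a file from disk",
    "send_email": "send an email message to a recipient",
    "calculator": "evaluate an arithmetic expression",
}
print(select_tools("find web pages about context engineering", tools, k=2))
```

The model then only sees the two or three tool schemas that matter for this step, rather than a hundred descriptions competing for attention.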

Finally, I want to talk about knowledge. RAG is a huge topic; you can think of memories as a subset of RAG, but RAG is of course much broader. We often want to augment the knowledge base of an LLM with, for example, private data. Code agents are some of the largest-scale RAG apps currently, and I actually thought this post from Varun, the CEO of Windsurf, was quite interesting. He talks a bit about the approaches they use for retrieval, and the real point is that it's quite non-trivial. Of course, you're using indexing and embedding-based similarity search as a core RAG technique. But then you have to think about chunking: they, for example, parse code to chunk along semantically meaningful code boundaries, not just random blocks of code. That's point one. But he also mentions that pure embedding-based search can become unreliable, so they use a combination of techniques, like grep and file search, knowledge graphs, and LLM-based ranking on top of all of these. So when you look at knowledge selection in popular agents, for example code agents, it's highly non-trivial, and a huge amount of context engineering goes into knowledge selection.
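To make the chunking point concrete, here's a small sketch of structure-aware chunking for Python source using the standard-library `ast` module: it splits along top-level function and class boundaries instead of fixed-size blocks. Windsurf's actual pipeline is not public; this only illustrates the principle.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    # Split source into chunks along semantically meaningful boundaries
    # (top-level functions and classes), not arbitrary fixed-size windows.
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks

code = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
for chunk in chunk_python_source(code):
    print(chunk.splitlines()[0])  # one chunk per definition
```

Each chunk is then a self-contained unit to embed and index, which tends to retrieve better than slicing code mid-function.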

So, we've talked about writing context and selecting context. Now let's talk about compressing context. This involves retaining only the tokens required to perform a task. A common idea here is summarization. If you've used Claude Code, for example, you might have noticed that it will run auto-compact once the session reaches 95% of the context window, which is 200,000 tokens for the Claude series. That's an example of applying summarization across a full agent-user trajectory.
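The auto-compact pattern can be sketched as a simple threshold check: once estimated usage crosses 95% of the window, older messages are replaced by a summary. The whitespace token counter and the `fake_summarize` stub below are placeholders for a real tokenizer and an LLM summarization call.

```python
def count_tokens(messages: list[str]) -> int:
    # Crude proxy: whitespace word count; real agents use the model tokenizer.
    return sum(len(m.split()) for m in messages)

def maybe_compact(messages: list[str], limit: int, summarize) -> list[str]:
    # Below 95% of the window, leave the history untouched.
    if count_tokens(messages) < 0.95 * limit:
        return messages
    # Otherwise, collapse everything except the most recent message
    # into a single summary message.
    summary = summarize(messages[:-1])
    return [summary, messages[-1]]

fake_summarize = lambda msgs: f"[summary of {len(msgs)} messages]"
history = ["tool output " * 50, "tool output " * 50, "what next?"]
compacted = maybe_compact(history, limit=120, summarize=fake_summarize)
print(compacted)
```

The trigger-plus-summarize shape is the same whether you compact the whole trajectory, as here, or only completed sections of work.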

What's interesting is that you can also apply summarization more narrowly. Anthropic's recent multi-agent researcher post talked about applying summarization only to completed work sections. Cognition's post makes a similar point, but in their example they apply summarization at the interface between different agents and sub-agents. They use summarization to compress context such that, in this case, sub-agent one gets a compression of the context the initial agent was using. So it's a means of information handoff, in this case between linear sub-agents. But the principle is the same in all cases: summarization is a very useful technique for compressing context in order to manage overall token bloat when working with agents. It's also worth calling out that you can use trimming as well. You can think of this as more selective removal of tokens that you know are less relevant. You can use heuristics, for example keeping only recent messages, which is a simple approach, or you can use learned approaches, such as LLM-based trimming or context pruning. So, we've covered writing, selecting, and compressing context. Now let's talk about the final category: isolating context.

Isolating context involves splitting context up to help an agent perform a task. Multi-agent is the most intuitive example here. The Swarm library from OpenAI was designed around separation of concerns, where a team of agents can each have their own context window, tools, and instructions. Anthropic made this a bit more explicit in their recent multi-agent researcher post: they mention that the sub-agents operate in parallel with their own context windows, exploring different aspects of the question simultaneously.
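That fan-out pattern can be sketched as follows: each sub-agent keeps its own (here simulated) private context and returns only a compressed finding, which is all the lead agent ever sees. The function names and strings are illustrative, not Anthropic's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(subtopic: str) -> str:
    # Each sub-agent accumulates its own private context while researching;
    # that context never leaves this function.
    private_context = [f"research note on {subtopic}"] * 100  # stays local
    return f"finding about {subtopic}"  # only a compressed finding is returned

def lead_agent(question: str, subtopics: list[str]) -> list[str]:
    # Sub-agents run in parallel, each with an isolated context;
    # the lead agent only ever sees their findings.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(sub_agent, subtopics))

findings = lead_agent("context engineering", ["scratchpads", "memory", "RAG"])
print(findings)
```

Because each worker's trajectory stays local, the system as a whole can process far more tokens than any single context window would allow.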

One of the key points from their blog post is that multi-agent expands the number of tokens the overall system can process, because each agent has its own context window and can independently research a subtopic. That allows for richer report generation, because the system was able to process more tokens across the various sub-agents. Now, there are some other techniques for context isolation I want to call out. Hugging Face's Open Deep Research gives an interesting example. They use a code agent: the agent uses an LLM to generate executable code containing whatever tool calls the agent wants to run, and the code is executed in a sandbox, which runs the tools. Selective information can then be passed back to the LLM: return values, standard output, variable names, and so forth that the LLM can reason about. The key point they make is that this sandbox can persist state over multiple turns, so you don't have to dump all the context back to the LLM across multiple turns of the agent. The environment can house a lot of token-heavy information, like images or audio files, which is never exposed to the LLM's context window. This is another nice trick: isolating token-heavy objects from the LLM's context window and only selectively passing back the things you know the LLM will need to make the next decision.
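A minimal sketch of that sandbox pattern: state persists in the sandbox's namespace across turns, and only explicitly named values flow back to the model's context. This toy uses `exec` for illustration only; it is not Hugging Face's actual executor, and running model-generated code would require real sandboxing in practice.

```python
class Sandbox:
    """Persists execution state across turns; only named values are passed
    back, so token-heavy objects stay isolated inside the sandbox."""

    def __init__(self):
        self.env: dict = {}  # namespace that survives across turns

    def run(self, code: str, return_names: list[str]) -> dict:
        exec(code, self.env)  # NOTE: a real system needs true sandboxing
        return {name: self.env[name] for name in return_names}

box = Sandbox()
# Turn 1: create a "heavy" artifact; pass back only its size.
out1 = box.run("image = 'pixel' * 10_000\nsize = len(image)", ["size"])
# Turn 2: reuse state from turn 1 without ever resending the artifact.
out2 = box.run("summary = f'image has {size} chars'", ["summary"])
print(out1, out2)
```

The heavy `image` object never appears in either return payload, so it never consumes context-window tokens across turns.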

I also want to call out that a runtime state object is another fairly obvious way to perform context isolation. It's common to create a data model for your state, for example a Pydantic model with different fields, where those fields are just different buckets you can dump context into. You can selectively decide what to fish out and pass to the LLM at a certain stage of your agent. So it's another very nice and intuitive way to isolate context.
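A sketch of that partitioning using a `TypedDict` state (a Pydantic model would work the same way): each field is a bucket, and only the `messages` field is handed to the model each turn. The field names here are hypothetical.

```python
from typing import TypedDict

class AgentState(TypedDict):
    messages: list[str]   # always exposed to the LLM each turn
    raw_tool_output: str  # token-heavy bucket, consulted only when needed
    final_report: str     # written once, read at the end of the trajectory

def llm_node_input(state: AgentState) -> list[str]:
    # Only the messages bucket reaches the model; other fields stay isolated.
    return state["messages"]

state: AgentState = {
    "messages": ["user: summarize the data"],
    "raw_tool_output": "col1,col2\n" * 1000,  # never sent to the model here
    "final_report": "",
}
print(llm_node_input(state))
```

The schema itself documents which context is model-facing and which is internal plumbing, which keeps the isolation decision explicit.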

So, we've covered these four categories, writing, selecting, compressing, and isolating context, and talked about a bunch of examples from popular agents. Now I want to talk about how LangGraph enables all of these. First, as a preface: before you undertake context engineering, it's useful to have at least two things. One is the ability to actually track tokens, which can be achieved through tracing and observability; LangSmith is a great way to do that. The other is, ideally, evaluation: some way to measure the effect of your context engineering effort. Here's a good example: let's say you're using some kind of context compression. You want to make sure you didn't actually degrade the agent's behavior, and simple evaluation using LangSmith is a great way to ensure that. Those are table stakes before undertaking any context engineering effort.

So let's talk about the idea of writing context in LangGraph. LangGraph is a low-level orchestration framework for building agents. You can lay agents out as a set of nodes and the edges connecting those nodes. For example, with the typical tool-calling agent, you'll have one node that makes an LLM call, one node that executes the tools, and you'll just bounce between those two: a super simple layout for a classic tool-calling agent. Now, the notion of a scratchpad is very nicely supported in LangGraph, because LangGraph is designed around the idea of a state object. In every node, this state object is accessible, and you can fish anything you want out of the state object and write anything back to it. The state object is typically defined up front when you lay out your graph. It can be, for example, a dictionary, a TypedDict, or a Pydantic model; it's just a data model. You define it, and it's accessible to you. It's perfect for this notion of a scratchpad: in, for example, the LLM node of your agent, the agent can take notes, and those notes can be written to state.
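Here's a minimal sketch of that note-taking flow, using plain functions as stand-ins for LangGraph nodes and a dict as the state object; the node names and keys are invented for illustration.

```python
def plan_node(state: dict) -> dict:
    # The agent takes a note by writing to state instead of keeping
    # everything in the prompt.
    state["scratchpad"].append("plan: 1) gather sources 2) draft 3) review")
    return state

def draft_node(state: dict) -> dict:
    # A later node recalls the note from state, so it never had to stay
    # in the context window in between.
    plan = state["scratchpad"][-1]
    state["messages"].append(f"drafting according to {plan}")
    return state

state = {"messages": [], "scratchpad": []}
state = draft_node(plan_node(state))
print(state["messages"][0])
```

In actual LangGraph, the same write-then-read happens through the graph's declared state schema, with the framework merging each node's returned updates.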

And that state is checkpointed across the lifetime of your agent, so you can access it at any node at any point within the session, and it's available to you, for example, at future turns. That's exactly the intuition behind why the idea of a scratchpad is so useful: agents can write things down and fetch them later. Now, for memory, LangGraph is actually designed with long-term memory as a first-class component. Long-term memory is accessible in every node of your graph, and you can very easily write to it.

So the key point is that within a session, LangGraph has persistence via checkpointing, where agent state is accessible at every node. If you want to save things across different agent sessions, that's also achievable in LangGraph using its native built-in long-term memory. Now, I talked a bit about this previously: how about selecting context? Within LangGraph, you can select from state in any node, and that can serve as a scratchpad. You can also retrieve from long-term memory in any node. And what's interesting is that long-term memory can store different memory types: you can store simple files, or you can store collections and use embedding-based similarity search, as an example. You can see these two resources: check out this course from DeepLearning.AI if you want to learn a lot about the different memory types, and check out this recent course on ambient agents if you want a simple, crisp example of long-term memory in a long-running email assistant agent. The nice thing about that example is that the agent updates memory based on human feedback. It really shows how human-in-the-loop memory updating, and then persisting long-term memories over time, can govern and improve the behavior of an agent in a kind of virtuous cycle as you give it more feedback.

Now, the pre-built langgraph-bigtool library is actually a really neat example of tool selection in LangGraph. It uses exactly the principle mentioned previously: embedding-based semantic similarity search across tool descriptions. You can see it all right here in this repo, and it's quite an effective way to select across large collections of tools. As for RAG, LangGraph is a very low-level framework, so you can implement many different RAG techniques with it. I've linked to some tutorials we have, and we also have a lot of popular videos on building RAG workflows or agents with LangGraph.

Now, how about context compression? LangGraph has a few useful utilities for summarizing and trimming message history when you're building agents, which can be used out of the box. But it's also, of course, a low-level framework, so you have the flexibility to define logic within each node of your agent. One thing I do frequently is add logic to post-process certain token-heavy tool calls right inside my tool node: you can look at which tool was called and then kick off a little post-processing step depending on the tool that was selected by the LLM.
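That per-tool post-processing can be sketched as a lookup table of reducers applied inside the tool node, before results ever enter the context window. The tool names and reducers here are hypothetical.

```python
POSTPROCESSORS = {
    # Hypothetical per-tool reducers for known token-heavy outputs.
    "fetch_page": lambda raw: raw[:80] + "…[truncated]",
    "run_query": lambda raw: f"{raw.count(chr(10)) + 1} rows returned",
}

def tool_node(tool_name: str, raw_output: str) -> str:
    # Compress outputs of known-heavy tools; pass everything else through.
    postprocess = POSTPROCESSORS.get(tool_name, lambda raw: raw)
    return postprocess(raw_output)

page = "lorem ipsum " * 500
print(tool_node("fetch_page", page))   # truncated preview instead of ~6k chars
print(tool_node("calculator", "42"))   # small outputs pass through unchanged
```

Dispatching on the tool name keeps the compression targeted: only the calls you know bloat the context get reduced, and everything else stays verbatim.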

So it's very easy to augment the logic of your agent in LangGraph to incorporate things like post-processing, because it's just a low-level framework and you control all the logic within each node of your agent graph. I show a little example of this in our Open Deep Research.

Now, for context isolation, we've actually done a lot of work on multi-agent. We have implementations for both supervisor and swarm; they're popular open-source multi-agent implementations, and there are a bunch of different videos here that show how to get started. So multi-agent is something that has been well supported in LangGraph for a while. LangGraph also works nicely with different environments and sandboxes. This is a cool repo by Jacob you can check out: it uses E2B within a LangGraph node to actually do code execution. And this video talks about using a sandbox with LangGraph, but in that case with state persistence, so the state is persisted within that sandbox across different turns of the agent, which is exactly what we saw in the Hugging Face example. This can be a very nice way to isolate context within an environment and prevent it from flooding back into the context window of your agent. And finally, you have a state object. State objects are central in LangGraph; as mentioned, they're available to you within each node of your graph, and you can read from and write to them. You can design the state object to have a schema, for example a Pydantic model with multiple fields. You can have one field, like messages, that's always exposed to the LLM at each turn of your agent, and another field that just saves any other information you want to keep around but only use at specific points, maybe toward the end of your agent trajectory. So it's very easy to organize and isolate context in your state object just by defining a simple schema.

So, just to summarize: there are at least four overall categories for context engineering that we've seen across many popular agents: writing, selecting, compressing, and isolating context. Writing context typically means saving it outside the context window to help an agent perform a task, usually so the agent can retrieve that context at a later point in time. That could be a scratchpad, writing to a state object, or writing to long-term memories. Selecting context could mean retrieving tools, retrieving information from a scratchpad that an agent is using to accomplish a task within a given session, retrieving long-term memories that provide guidance for an agent based on past interactions, or just retrieving relevant knowledge, which of course gets into the entire topic of RAG, a very deep one. Then there's compressing context: summarizing it or trimming it, basically trying to retain only the most relevant tokens required to perform a task. And there's isolating it: a great and simple way to do this is just partitioning context in a state object; sandboxing is an interesting approach to isolate context from the LLM; and of course there's multi-agent, which involves splitting context up between different sub-agents that perform independent tasks but collectively increase the overall number of tokens the system can process to accomplish a task. These are just a few of the categories. This is very much a moving and emerging field, and this is not a complete list, but at least it's my attempt to organize the space, talk about some interesting examples, and hopefully give some references for how to do each of these things in LangGraph. Thanks a lot.
