Context Engineering for Agents
By LangChain
Summary
## Key takeaways - **Context Engineering: The Art of LLM Memory Management**: Context engineering is the discipline of strategically filling an LLM's context window with the most relevant information for each step of an agent's task, akin to an operating system managing RAM. [00:15], [01:17] - **Four Pillars of Context Engineering for Agents**: Key strategies for context engineering include writing (saving data outside the context window), selecting (retrieving relevant data), compressing (retaining only essential tokens), and isolating (partitioning context). [03:35], [03:51] - **Scratchpads vs. Memory: Persistence in Agents**: Scratchpads are temporary storage for information within a single agent session, while memories are designed to persist across multiple sessions, allowing agents to retain knowledge over time. [04:20], [05:24] - **Tool Handling Limits and Retrieval Augmentation**: Agents struggle with large numbers of tools, with performance degrading beyond 30 tools. Retrieval-Augmented Generation (RAG) over tool descriptions, using semantic similarity, can improve tool selection. [08:13], [08:25] - **Context Compression: Summarization and Trimming**: Compressing context involves techniques like summarization, as seen when Cloud Code compacts sessions exceeding 200,000 tokens, and trimming, which selectively removes less relevant information. [10:03], [10:10] - **LangGraph's State Object for Context Isolation**: LangGraph utilizes a state object, accessible within each node, to manage and isolate context. This allows for partitioning information and selectively exposing only necessary data to the LLM. [13:36], [15:15]
Topics Covered
- Context Engineering: The Number One Job for AI Agents.
- Agents use scratchpads and memory like humans.
- Selective retrieval improves tool and knowledge agent performance.
- Compressing context manages token bloat in agent trajectories.
- Isolating context scales agents beyond single LLM limits.
Full Transcript
Hey, this is Lance from Langchain. You
might have heard this term context
engineering recently. It's a nice way to
capture many of the different things
that we do when we're building agents.
Of course, agents need context. They
need instructions. They need external
knowledge. They also will use feedback
from tool calls.
Context engineering is just the art and
science of filling the context window
with just the right information at each
step of the agent's trajectory.
Now I want to talk about a few different
strategies for context engineering which
we can group into writing context,
selecting context, compressing context,
and isolating context. I'll walk through
some interesting examples of each one
with popular agents that we use
frequently in our day-to-day work. And
I'll also talk about how Langraph is
designed to support all of these. But
first, what is context engineering?
Where did this come from? Well, Toby
from Shopify had an interesting post
here saying he likes the term contact
engineering. Karpathy followed up on
this and offered a good definition.
Contact engineering is the delicate art
and science of filling the context
window with just the right information
for the next step. And Carpathies
recently highlighted an interesting
analogy between LLMs and operating
systems like the LLM is a CPU context
windows like RAM or its working memory.
And importantly, it has limited capacity
to handle context.
And so just like an operating system
curates what fits in RAM, context
engineering you can think of as the
discipline or the art and science of
deciding what needs to fit in context.
Oh, for example, each step of an agent
trajectory. Now, what are the types of
context we're talking about? Well, we
can think about this as an umbrella over
a few different themes. So, one is
instructions. You've heard a lot about
prompt engineering, and that's just a
subset of this. There's of course things
like memories. There's few shot
examples. There's tool descriptions.
There's also knowledge that could be
facts. That could be memories. And
there's tools which could be feedback
from the environment for example using
APIs or calculator or other tools. And
so you have all these sources of context
that are flowing in to LLM when you're
building applications. Now why is this a
bit trickier for agents in particular?
Well, agents have at least two
properties. They often handle longer
running tasks or higher complexity tasks
and they also utilize tool calling. Now
both of these things result in larger
context utilization. For example, the
feedback from tool calls can accumulate
in the context window or just very long
running tasks can accumulate lots of
token usage over many turns. Here's kind
of an example showing turn one you call
a tool. Turn two you call another tool.
If you have a large number of turns,
that tool feedback just grows and grows
and grows. Now, what's the problem with
that? This blog post from Drew Brun
nicely outlines a number of specific
context failures, context poisoning,
distraction, curation, and clash. I
encourage you to read that post. It's
really interesting, but it's kind of
intuitive. As the context grows longer,
there's more information for an LM to
process and there's more opportunities
for an LM to get, for example, confused
due to conflicting information or
injection of a hallucination which
influences the response in an adverse
way. And so for these reasons,
contexting is particularly critical when
building agents because they typically
have to handle longer context for the
reasons mentioned above. Now, cognition
highlighted this pretty nicely in a
recent blog post saying context
engineering is effectively the number
one job of engineers building AI agents.
So, what can we do about this? Well,
I've had a look at many different
popular agents that many of us use
today. Thought about this a lot,
reflected on my own experience. You can
kind of distill down approaches into
four bins. writing context. Saving
outside the context window to help an
agent perform a task. Selecting context,
selectively pulling context into the
context window to help an agent perform
a task. Compressing, retaining only the
most relevant tokens. And isolating,
splitting context up again to help an
agent perform a task.
And now I'll talk about some examples of
each of these categories.
So writing context. Writing context
means saving outside the context window
help an agent perform a task. When
humans solve tasks, we take notes and we
remember things for future related
tasks. Well, agents can do those same
two things.
For note-taking, agents can use a
scratch pad. And for remembering things,
agents can use memory.
So, you can think about scratch pads as
kind of a term that captures the idea of
persisting information while an agent is
performing a task. And I'll give a good
example of this. Anthropic's recent
multi-agent researcher. The lead
researcher begins by thinking through
the approach and saving that plan to
memory to persist it. And this is a
great point. You want to keep the plan
around. The context window might exceed
the limit of 200,000 tokens, but the
plan can always be retrieved and
retained. Very intuitive
example of taking a note and saving it
in a scratchpad. Now, I do want to make
a subtle point. The implementation for
your scratch pad can differ. So, in
their case, they just save it to a file.
But you can also for example save it to
a runtime state object depending on what
agent library you're using. But the
intuition is really that you want to be
able to write information while the
agent is solving a task so the agent can
then recall that information later if it
needs. Now memories are a bit different.
Sometimes we want to save information
across many different sessions with an
agent. So typically scratch pads are
relevant only within a single agent
session. An agent's trying to solve a
problem. It'll use a scratch pad to
solve the problem and then scratch pad's
not relevant anymore. Memories are
things that you want the agent to retain
over time over many sessions. So there's
some fun examples from the literature.
Gener agents for example synthesize
memories from collections of past agent
feedback. And you've seen this chatbt
has a great memory feature. Cursor
windsurf also will autogenerate memories
based on user agent interactions. So
this pattern is certainly emerging with
popular AI products. And again the
intuition is pretty clear here. You have
some new context. You have some existing
memories. And you can update memories
with new information
dynamically as the agent is interacting
with the user. Now, we covered writing
context and talked about a few examples
of that. Now, let's talk about selecting
context. So, selection means pulling
context into the context window to help
an agent perform a task. Now, we kind of
talked about this previously with
scratch pads. Of course, an agent can
reference what it wrote previously. This
could be via a tool call. This could be
by reading directly from a state object.
Now, memories are a bit more interesting
and a bit more subtle.
There's different types of memories you
might want to pull into context
depending on the problem you're trying
to solve. Could be fot examples that
provide specific instructions for a
desired behavior. It could be a
selectively chosen prompt for a given
task. could be facts. And these are
different memory types. Semantic
memories in humans, for example, are
things like facts, things that I learned
in school. Episodic memories are more
analogous to fot examples, past
experiences. Procedural memories are
like instructions, instincts, motor
skills. These are all things that we may
want to selectively pull into context to
help an agent solve problems. Now, where
do these come up practically different
agents we work with today? Well, you
think about instructions or procedural
memories are typically captured in
things like rules files or like cloud MD
when you're working with the code
agents. This is typically a file that
has like style guidelines or general
instructions for tools to use with a
given project. And often times these are
all pulled into context. For example,
when you start cloud code, it'll just
pull in all the cla files that you have
in your project and in your
organization.
Now, facts are a bit more subtle.
Oftentimes we want to selectively pull
in facts from a large collection. And
this is where it's common to think about
things like embedding based similarity
search or graph databases to actually
house collections of memories
in order to better control their
retrieval and ensure that only relevant
memories from a large collection are
pulled in at the right time.
Now tools are another very interesting
thing we often want to pull into
context. Now, one of the problems is
that agents have difficulty with
handling large collections of tools.
This paper I link here has some
interesting results on that showing
degradation after around 30 tools and
complete failure at around 100 tools.
And so they propose using rag over tool
descriptions. So that's just basically
embedding tool descriptions and using
retrieval based on semantic similarity
to fetch out relevant tools for a given
task. And this can improve performance
significantly. So it's a nice trick for
pulling selectively only relevant tools
into context.
Now finally I want to talk about
knowledge. Rag is a huge topic and you
can kind of think about memories as a
subset of rag but rag is of course much
broader. We often want to augment the
knowledge base of an LLM with for
example private tokens. The code agents
are some of the largest scale rag apps
currently and I actually thought this
post from Verun the CEO of Windsurf was
quite interesting. He talks a bit about
the approaches that they use for
retrieval. And the real point is it's
quite non-trivial. So of course you're
using indexing
and embedding based similarity search as
a core rack technique.
But then you have to think about
chunking. So they for example do a
parsing to chunk along semantically
meaningful code boundaries, not just
random blocks of code. So that's kind of
point one. But he also mentions that
pure embedding based search can become
unreliable. So they use a combination of
techniques like GP file search knowledge
graphs
and they use an LLM based ranking on top
of all of these. So when you look at
knowledge selection in popular agents
for example like the code agents this is
highly non-trivial and the huge amount
of context engineering goes into
knowledge selection.
So we talked about writing context we
talked about selecting context. Now
let's talk about compressing context.
This really involves retaining only the
tokens required to perform a task. So a
common idea here is summarization. If
you used cloud code for example, you
might have noticed
that it'll call autoco compact once the
session reaches 95% of the context
window, 200,000 tokens for the cloud
series. And so that's an example of
applying summarization across a full
kind of agent user trajectory.
But what's kind of interesting is you
can also apply summarization a bit more
narrowly. Like anthropic's recent multi-
aent research paper talked about
applying summarization only to completed
work sections. Cognition's post makes a
similar point, but they apply
summarization in this example at the
interface between different agents and
sub aents. So they kind of use
summarization to compress context such
that in this case sub aent one has a
compression of the context that the
initial agent was using. So it's kind of
a means of information handoff in this
case between linear sub aents. But the
principle is the same in all cases.
Summarization is a very useful technique
for compressing the context in order to
manage overall token bloat when working
with agents. Now it's also worth calling
out that you can use trimming as well.
And you can think kind of think about
this as more selective removal of tokens
that you know are relevant. So you can
use for example heruristics to keep only
recent messages. That's kind of a simple
approach. Or you can use learned
approaches
and this is an LLM based approach for
trimming or context pruning. So we
covered writing, selecting compressing
context. Now let's talk about the final
category of isolating context.
So isolating context involves splitting
up to help an agent perform a task. Now
multi- aent is the most intuitive
example here. So the swarm library from
open AI was designed based upon
separation of concerns where a team of
agents can all have their own context
window and tools and instructions.
Anthropic made this a bit more explicit
in their recent multi-agent researcher
post. They mentioned that the sub agents
operate in parallel with their own
context windows exploring different
aspects of the question simultaneously.
And one of the key points from their
blog post is really that the ability to
use multi- aent expands the number of
tokens that the overall system can
process because each agent has its own
context window and can independently
research a subtopic
and that allows for richer generation of
reports because the system was able to
process more tokens across these various
sub aents. Now there's some other
techniques for context isolation that I
want to call out. I thought hugging
faces open deep research gives an
interesting example. So they use a code
agent. A code agent uses an LM to
generate executable code that contains
whatever tool calls that the agent wants
to run. And the code is just executed in
a sandbox
which will then run all the tools and
selective information can then be passed
back to the LLM
return values standard out variable
names and so forth that the Elm can
reason about. The key point they make is
that this sandbox can actually persist
state over multiple turns. So you don't
have to dump back all the context to the
LLM when you're doing multiple turns of
an agent. this environment can kind of
house a lot of tokenheavy information
like images or audio files so they never
expose to the LM's context window. And
this is another nice trick for isolating
tokenheavy objects from the LM context
window and only selectively passing back
things that you know that the LM will
need to make the next decision.
And I do want to call out that a runtime
state object is another kind of obvious
way to per perform context isolation.
And it's kind of common to create like
for example a data model like a pedantic
model for your state which has different
fields and those fields are just like
different buckets that you can dump
context into and you can selectively
decide what you want to fish out at what
point and then passed to the LM at a
certain stage in your agent. So it's
another very nice and intuitive way to
isolate context.
So we've covered these four categories
writing selecting compressing and
isolating context and talked about a
bunch of examples from popular agents.
Now I want to talk about how langraph
enables all of these.
So first as a preface before you
undertake on contact engineering it's
useful to have at least two things. One
is the ability to actually track tokens
that can be achieved through tracing and
observability. Langmith is a great way
to do that. Also ideally evaluation so
some way to measure the effect of
context engineering effort. Like here's
a good example. Let's say you're using
some kind of context compression. You
want to make sure you didn't actually
degrade the agents behavior. And so
simple evaluation
using lang is a great way to ensure of
that. So those are kind of table stakes
before actually undertake any context
engineering effort.
So let's talk about this idea of writing
context in line graph. So lang graph is
a low-level orchestration framework for
building agents. You can lay agents out
as a set of nodes and edges connecting
those nodes. Like for example, with the
typical tool calling agent, you'll have
one node that makes an LM call, one node
that just executes the tools, and you'll
just bounce between those two. Super
simple layout for a classic tool calling
agent. Now, this notion of a scratch bed
is actually very nicely supported in
Langraph because Lang graph is designed
around the idea of a state object. So,
what happens is in every node, this
state object is accessible and you can
fish anything you want from the state
object and write anything back to it.
The state object is typically defined up
front when you lay out your graph. It
can be for example a dictionary, a type
dict, a padantic model. It's just a data
model. You define it and it's accessible
to you. It's perfect for this notion of
scratchpad
in for example the LM node of your
agent. The agent can take notes. Those
notes can be written to state.
And now that state is checkpointed
across the lifetime of your agent. And
so you can access that state at any node
at any point within the session. And so
it's available to you for example at
future turns. And that's exactly the
intuition around why this idea of a
scratchman is so useful. Agents can
write down things fetch them later. Now
for memory lang is actually designed
with long-term memory as a first class
component. Long-term memory is
accessible in every node of your graph
and you can very easily write to it.
So the key point is within a session,
Langraph has persistence via
checkpointing where agent state is
accessible at every node. If you want to
save things across different
agent sessions, that's also achievable
in Langraph using Langraph's native
built-in long-term memory. Now, I talked
a bit about this previously. How about
selecting context? Well, within
Langraph, you can select, for example,
from state in any node, and that can
serve as a scratch pad. You can also
retrieve from long-term memory in any
node. And what's interesting is
long-term memory can store different
memory types. You can s you can store
simple files. You can also store
collections and use embedding based
similarity search as an example. So you
can see these two resources. Check out
this course from deep learning AI if you
want to learn a lot about the different
memory types. And check out this recent
course on ambient agents if you want a
simple crisp example of long-term memory
in a
longunning email assistant agent. And
the nice thing about that example is the
agent updates memory based on human
feedback. So it really shows that
feedback between human and loop memory
updating and then persisting long-term
memories over time to govern and improve
the behavior of agent in a kind of
virtuous cycle as you give it more
feedback.
Now this pre-built langraph big tool is
actually a really neat example of tool
selection in Langraph. So it uses
exactly the principle that was mentioned
previously
embedding based semantic similarity
search across tool descriptions and you
can see it all right here in this repo
but it's quite an effective way to
select across large collections of tools
and for rag lang graph is very low-level
framework you can implement many
different rag techniques using langraph
I linked to some tutorials that we have
we also have a lot of popular videos on
building rag workflows or agents with
langraph
now How about context compression?
Lapraph has a few useful utilities for
summarizing and trimming message history
when you're building agents which can be
used out of the box. But it's also of
course a low-level framework. So you
flexibility to define logic within each
node of your agent. And one thing I do
frequently is I'll actually have some
logic to post-process certain tokenheavy
tool calls just inside my tool node. So
you can kind of look at what tools
called and then kick off a little
post-processing step depending upon the
tool that was selected by the LLM.
So it's very easy to augment the logic
of your agent in lang graph to
incorporate things like post-processing
because it's just a low-level framework
and you control all the logic within
each node of your agent graph. I can I
show a little example here in our open
deep research.
Now context isolation we actually have a
lot of we've actually done a lot of work
on multi- aent. We have implementations
for both supervisor and swarm. They're
popular open source implementations.
their popular multi-agent
implementations and a bunch of different
videos here that show how to get
started. So multi-agent is something
that has been well supported in Langraph
for a while. Langraph also works nicely
with different environments and
sandboxes. This is a cool repo by Jacob.
You can check out it uses E2B within a
Langraph node to actually do code
execution. And this video here talks
about using a sandbox with langraph but
in that case with state persistence. So
the states persisted within that sandbox
across different turns of the agent
which is exactly what we saw in the
hugging face example and this could be a
very nice way to isolate context within
an environment and prevent it from
flooding back into the context window of
your agent. And finally that you have a
state object. So state objects are are
central in line graph as mentioned
they're available to you within each
node of your graph. You can read from it
you can write to it. You can design the
state object to have a schema. For
example, it can be a pyantic model and
it can have multiple fields in that
schema. So for example, you can have one
field like messages and that's always
exposed to the LM at each turn of your
agent, but you can have other field that
just saves it any other information you
want to keep around but only use at
specific points like maybe towards the
end of your agent trajectory. So it's
very easy to organize and isolate
context in your state object just by
defining a simple schema. So just to
summarize, there's at least four overall
categories for context engineering,
though we've seen across many popular
agents. Writing, selecting, compressing,
and isolating context. Writing context
typically means saving it outside the
context window to help an agent perform
a task. And usually this is so that the
agent can retrieve that context at a
later point in time. Could be a scratch
pad, could be just writing it to a state
object, could be writing to long-term
memories. selecting context. It could be
retrieving tools. Could be retrieving
information from a scratch pad that an
agent is using to accomplish a task
within a given session. Could be
retrieving long-term memories that
provide guidance for an agent based upon
past interactions. It could be just
retrieving relevant knowledge. And that
of course gets into the entire topic of
rag which is a very deep one. Then
there's compressing context summarizing
it, trimming it, basically trying to
retain only the most relevant tokens
required to perform a task and isolating
it. A great way and simple way to do
this is just partitioning context in a
state object. Sandboxing is an
interesting approach to isolate context
from the LM. And of course, there's
multi- aent which involves splitting
contexts up between different sub aents
that perform independent tasks, but they
collectively can increase the overall
number of tokens that the system
processes in order to accomplish a task.
So these are just a few of the
categories. This is very much a moving
and emerging field. This is not a
complete list, but at least it's my
attempt to organize the space, talk
about some interesting examples, and
hopefully give some references about how
to do each of these things in line
graph. Thanks a lot.
Loading video analysis...