Context Engineering & Coding Agents with Cursor
By OpenAI
Summary
## Key takeaways
- **AI coding evolution: Speedrun vs. gradual shift**: The transition to AI-assisted coding is happening much faster than previous shifts, like the move from text-based terminals to graphical interfaces, compressing decades of progress into just a few years. [01:13]
- **Tab's real-time learning from user feedback**: Cursor's Tab feature, handling over 400 million daily requests, uses user acceptance and rejection of suggestions to train its next-action prediction model in near real-time via online reinforcement learning. [02:24], [02:55]
- **Context engineering: Intentional context over raw data**: Effective context engineering for AI models is less about prompt tricks and more about providing intentional context, focusing on minimal, high-quality tokens rather than overwhelming the model with excessive data. [05:21]
- **Semantic search beats basic grep for code retrieval**: Indexing codebases to create embeddings for semantic search significantly improves agent accuracy compared to basic string matching tools like grep, enabling agents to find relevant code even with slightly different file names. [06:32]
- **User-land innovation drives agent features**: Many breakthroughs in context engineering for coding agents originate from power users in 'user-land,' who develop effective workflows and patterns that are later integrated into the core product. [11:20]
- **Future: AI automates toil, enhances creativity**: Cursor aims to automate the tedious aspects of software development, freeing engineers to focus on creative problem-solving, system design, and building impactful features, making coding feel more like play. [16:17], [18:05]
Topics Covered
- AI is speedrunning software development's evolution.
- Intentional context, not just prompts, drives AI coding.
- Semantic search is fundamental for AI code retrieval.
- User workflows drive AI product feature development.
- AI frees engineers for creativity, not toil.
Full Transcript
[Applause]
I'm Lee and I'm on the cursor team and
I'm going to talk about how building
software has evolved. So, thanks for
being here.
We started with punch cards and
terminals back in the 60s where
programming was this new superpower, but
it was inaccessible to most people. And
then in the 70s, programmers grew up writing BASIC on their Apple IIs and their Commodore 64s. Then in the 80s, GUIs started to go mainstream, but still most programming was done on text-based terminals. It wasn't until the '90s and the 2000s that we started to see programming shift to graphical interfaces. So FrontPage and Dreamweaver, which you might remember, allowed beginners to drag and drop and build websites. And new editors and IDEs like Visual Studio made it easier
for professionals to work in very large
code bases. And I of course had to add
my favorite text editor, Sublime Text
here. I'm sure some of you have used it
before. It's a good one. Now with AI
building software is becoming more
accessible and powerful than ever.
Unlike that slower shift from terminals to graphical interfaces, the shift to writing code with AI is really being speedrun. The progress of
decades is happening in just a few
years. And with each iteration, the
interface and the UX is changing to
allow the models to achieve more
ambitious tasks. So I'd like to talk
about context engineering and how coding
agents have evolved over the past few
years from the perspective of cursor.
I'll show how we've gone from
autocompleting your next action to fully
autonomous coding agents. And finally
we'll have Michael, Cursor's CEO, talk about
the future of where we believe software
engineering is headed.
So, let's start with TAB. One of the
products that inspired Cursor was GitHub
Copilot. It showed that with
improvements to the UX of autocomplete
and with better models, we can make
writing code much easier. We released
the first version of tab back in 2023
and the experience has evolved from
predicting the next word to the next
line and then ultimately to where your
cursor is going to go next. Tab now
handles over 400 million requests per
day. And this means we have a lot of
data about which suggestions users
accept and reject. This led to us moving
from an off-the-shelf model to training
a model specialized for next action
prediction. So to improve this model, we
use data to positively reinforce
behaviors that lead to accepted
suggestions and then negatively
reinforce rejected suggestions. And
we're able to do this in near real time.
So you can accept a suggestion and then
30 minutes later the tab model has been
updated using online RL based on your
feedback.
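To make that loop a bit more concrete, here is a rough, illustrative sketch of how accept/reject events can drive a REINFORCE-style policy update. The policy interface (`policy.log_prob`) and the training loop are assumptions for illustration, not Cursor's actual pipeline.

```python
# Illustrative sketch only: turning accept/reject feedback into rewards for a
# REINFORCE-style policy update. The `policy.log_prob(...)` interface is an
# assumption; this is not Cursor's actual training code.
import torch

def online_rl_step(policy, optimizer, feedback_batch):
    """feedback_batch: iterable of (context, suggestion, accepted) events."""
    losses = []
    for context, suggestion, accepted in feedback_batch:
        log_prob = policy.log_prob(suggestion, context)  # log p(suggestion | context)
        reward = 1.0 if accepted else -1.0               # accepted -> reinforce, rejected -> penalize
        losses.append(-reward * log_prob)                # REINFORCE objective
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```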
Getting this experience right has taken
a lot of iterations. There's a delicate
balance between the speed of the
suggestion, the quality of the
suggestion, and also just the general UX
for how it's displayed. If it's slower
than 200 milliseconds, it kind of takes
you out of your flow. But you also don't
want to see fast unhelpful suggestions.
So with our latest release now, we show
fewer suggestions, but we have higher
confidence that they're going to be
accepted.
We find tab really helpful for domains
where AI models just aren't as helpful
yet. And the bottleneck here really is
your own typing speed. Now, most people
type at about 40 words per minute, even
though I'm sure all of you type at 90
plus, right? We've got some amazing
typists in here.
So what would it look like if we allowed
the AI models to write more code for us?
This is where coding agents come in and
this is that next evolution of coding
with AI. You can talk to models directly
in products like Cursor, or like we saw in Codex, and have them create or update
entire blocks of code.
Something we've tried really hard to
make a focus in cursor is giving you
control over the level of autonomy of
working with the models. So, one of the
first features we added back in 2023 was
prompting models to add inline
suggestions. This would take your
current line as well as the broader file
context and then pass it to the model to
suggest a diff.
Shortly after, we released our first
steps towards a coding agent, which was
a feature called composer, which some of
the longtime Cursor users may
remember. Uh, we even have a pixelated
Twitter demo that I've included here of
one of the first versions. This made it
much easier to do multifile edits with
more of a conversational UI.
And then in 2024, we added a fully
autonomous coding agent. This saw models
use more tokens as they were getting
better at tool calling, and it allowed Cursor to gather its own context.
So in the previous versions, you had to
provide all of that context up front
which was a bit more difficult. So let's
talk about some of the ways that we've
optimized the cursor agent harness.
There's been a lot of talk recently
about context engineering as an
evolution of prompt engineering, which I
personally find really helpful. As
models are getting better, getting high
quality output is less about specific
prompting tricks, although those can
still help, but it's more about giving
the models the right context. And not
just any context, but intentional
context. Models get worse at recalling
information as the size of the context
increases. And in reality, you don't
want to push the limits of the context
window. You want to use a minimal amount
of high-quality tokens. And this is why
the retrieval of code is actually really
important and fundamental to context
engineering. So let's look at an example
of searching code in a larger codebase.
We found that when you give models very
powerful tools, it can significantly
improve the rate at which code is
accepted.
Many coding agents now use commands like grep or ripgrep to look for direct string matches across files and directories.
And as new models are trained on tool
calling and agents get better at using
tools, the search quality does improve.
However, we found that you can make searching even better by automatically indexing your codebase and creating embeddings, which enables semantic search. So I can ask the agent to update the top navigation, and even if the file is actually called header.tsx, semantic search allows the agent to quickly and accurately find the correct code during the retrieval process.
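As a rough illustration of combining both retrieval paths, here is a minimal sketch. The `embed()` function is a placeholder for whatever embedding model produces the vectors, and the chunking, storage, and ranking details are simplified assumptions rather than Cursor's implementation.

```python
# Minimal sketch of hybrid code retrieval: exact string matching (grep) plus
# embedding-based semantic search. `embed()` is a placeholder; the details are
# simplified assumptions, not Cursor's implementation.
import subprocess
import numpy as np

def embed(text: str) -> np.ndarray:
    """Assumed: returns a unit-normalized embedding vector for `text`."""
    raise NotImplementedError

def build_index(chunks: dict[str, str]) -> dict[str, np.ndarray]:
    # Embed each code chunk once, offline, so searches at inference time are cheap.
    return {path: embed(code) for path, code in chunks.items()}

def semantic_search(query: str, index: dict[str, np.ndarray], k: int = 5) -> list[str]:
    q = embed(query)
    scores = {path: float(np.dot(q, vec)) for path, vec in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]  # top-k paths by similarity

def grep_search(pattern: str, repo_root: str = ".") -> list[str]:
    # Plain string matching: fast, but misses "top navigation" when the file is header.tsx.
    out = subprocess.run(["grep", "-rl", pattern, repo_root], capture_output=True, text=True)
    return out.stdout.splitlines()
```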
For generating embeddings, we also moved
from an off-the-shelf embedding model to
training a custom model that helped us
produce more accurate results and we
constantly AB test the performance of
using semantic search. We found that when using grep alone, users would send more follow-up questions and also spend more tokens, so semantic search is really helpful. One of the
biggest wins though is it shifts where
the compute happens. You spend the
compute and the latency upfront during
the indexing rather than at inference
time when the agent is actually being
invoked. So in other words, you're doing
the heavy lifting offline, which means
you can get faster and cheaper responses
at runtime without sacrificing
performance and putting that on the
user. So the takeaway here is you likely
want both grep and semantic search for the
best results. And we'll have a full blog
post soon that talks about some of these
results. So giving the models better
tools helps improve their quality. But
what about the UX of actually using
these coding agents? There's been a lot
of exploration with coding CLIs, from OpenAI's Codex to Claude to Cursor's
own CLI. And the idea here is to find
the most minimal abstraction over the
model, kind of iterate on the harness
and then make the agent extensible. But
we don't believe CLIs are the final
state or the end goal of working with
coding agents. What I like about the
terminal is that it opens up a new
surface for coding agents to run. So
this can be in the CLI. It can also be
on the web or from your phone. It can be
from a bug report in Slack, which I use
all the time. It can be from a backlog
item in linear just automatically
triaged for you.
Because CLI-based agents are scriptable, you can use them in any type of environment, which is really helpful. We
use this internally to automatically
write docs or update parts of our
codebase. And it can be as simple as
just doing cursor -p and then a prompt
and having text or even structured
formats like JSON come back.
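For instance, because the agent is just a command, you can drive it from a script. The binary name and flags below are assumptions based on the cursor -p example above, so check your CLI's documentation for the real interface.

```python
# Illustrative sketch of scripting a CLI coding agent (e.g. from CI or a cron job).
# The binary name and flags are assumptions based on the "cursor -p" example above.
import subprocess

def run_agent(prompt: str) -> str:
    result = subprocess.run(
        ["cursor", "-p", prompt],          # assumed invocation; adjust for your CLI
        capture_output=True, text=True, check=True,
    )
    return result.stdout                    # plain text (or JSON, if the CLI supports it)

if __name__ == "__main__":
    # Hypothetical example: automatically drafting documentation updates.
    print(run_agent("Update the README to document the new hooks settings"))
```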
We also believe that you'll need more
specialized agents, which makes sense
when you see the keynote today. Last
year, we started experimenting with
using AI models to read and review code
instead of just writing and editing
code. And we made an internal tool
called Bugbot. It tried to help you find
meaningful logic bugs in your code. And
after using it internally for about 6
months, we found that it actually caught
a lot of bugs that we missed on code
reviews. So we decided to make it public
and, funnily enough, it actually caught a bug that took down Bugbot itself, which of course we accidentally ignored. So we learned to really pay attention to those Bugbot comments.
Newer models are also getting very good
at longer horizon tasks. So one way
we've pushed agents to run longer inside
of cursor is having them plan and do
more research upfront. This not only
gives you a chance to verify the
requirements of what you're trying to
build and course correct along the way
but we've also seen it significantly
improves the quality of the code
generated, which makes sense, right?
You're giving the models much higher
quality input context. And to do this
well, it's more than a simple prompt change like "plan better": you actually need deeper product integration in how you store the plans, how you edit the files, and also in giving the model new tools.
It also makes sense to allow the agent
to create and manage a to-do list. This
gives the model critical context so it doesn't forget the task it's working on or waste tokens; it's like having notes it can constantly reference. One area we're still exploring is giving your to-dos the same source of truth as your codebase, which is something I would personally use for smaller projects where maybe I don't need a fully featured task management tool.
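One way to picture this is a minimal sketch of a to-do list the agent manages as a tool and re-reads each turn. The class shape and the rendering format are illustrative assumptions, not Cursor's internals.

```python
# Illustrative sketch: a to-do list the agent can manage as a tool, re-rendered
# into its context each turn so it doesn't lose track of the task. The shape of
# this class is an assumption, not Cursor's internal representation.
from dataclasses import dataclass, field

@dataclass
class TodoList:
    items: dict[str, bool] = field(default_factory=dict)  # task -> done?

    def add(self, task: str) -> None:
        self.items[task] = False

    def complete(self, task: str) -> None:
        self.items[task] = True

    def render(self) -> str:
        # The rendered checklist is injected into the agent's prompt each turn.
        return "\n".join(f"[{'x' if done else ' '}] {task}"
                         for task, done in self.items.items())
```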
Another important part of agent
extensibility is allowing you to package
up your workflows and then share them
with your team. So custom commands are a
way to share prompts and then rules
allow you to include important context
in every single agent conversation. One
way our engineers have found this really
helpful internally is packaging up our
commit standards and guidelines, putting
them in /commit, and then being able to pass in tickets, like the Linear ticket that you're working on.
Another thing that I've noticed is that
a lot of the context engineering
breakthroughs actually happen in user
space first. So all of you the power
users figure out the workflows and the
patterns that actually work really well
and then as they get adopted they make
their way back into the core product as
features. Plans, memories, and rules are really all examples of this.
Speaking of teams, you want to trust
these agents to write code for you. But
that requires keeping a human in the loop, which is why, when the agent tries to run shell commands, Cursor will ask
you if you would like to run it just
once or if you're comfortable, you can
add it to the allow list to auto run in
the future. And all these settings can
be stored in code and explicitly shared
with your team, including blocking
certain shell commands or actions. Our
latest release also has custom hooks, so
you can tap into every part of the agent's run. Maybe you want to have a shell script that runs when the agent finishes, for example.
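To make the allow-list idea above concrete, here is a minimal sketch of gating agent-issued shell commands. The command lists and decision labels are illustrative assumptions, not Cursor's settings format.

```python
# Illustrative sketch of allow/deny gating for agent-issued shell commands.
# The lists and return labels are assumptions, not Cursor's settings format.
ALLOWED = {"npm test", "pytest", "git status"}
BLOCKED_PREFIXES = ("rm -rf", "git push --force")

def review_command(cmd: str) -> str:
    if any(cmd.startswith(prefix) for prefix in BLOCKED_PREFIXES):
        return "block"          # never run, even if approved interactively
    if cmd in ALLOWED:
        return "auto-run"       # previously added to the allow list
    return "ask-user"           # keep the human in the loop for anything else
```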
So, we've covered a lot of ground here.
Coding agents have evolved quite a bit
in the past year, and they're getting
better and better when you give them
very powerful tools. And as the models have gotten more capable, we've actually been able to remove overly precise instructions from our system prompts that just weren't necessary anymore. So
what would it look like if we allowed
agents to run for significantly longer?
What is the right interface for managing
multiple coding agents?
If you're just getting started coding
with agents, I don't recommend
immediately trying to juggle multiple
agents. I mean, let's be honest, are we
really being productive running nine
CLIs in parallel?
Probably not.
Probably not yet, though. I mean, not
only do you need to set up your local
machine for running parallel agents, but
it's also kind of hard to review the
output of all of these agents. So, we
don't think that this form factor is the
end goal or the end state, but there is
promise here. One thing we've been dogfooding over the past few months is a
new type of interface for managing
multiple coding agents. And we found
this really helpful internally when
maybe you have an agent in the
foreground, but you need to ask
questions about the codebase or maybe do
some research about tools you want to
integrate or small refactors. When you
have this fast coding model in the
foreground, you can really stay in the
flow and then you have your parallel
agents kind of run other tasks in the
background which could run for much
longer. Those could be in the foreground
on your machine. They can be in the
background on the cloud. Each one of
these decisions has unique constraints
that right now you have to think about
and spend a lot of time on. If you're in
the cloud, you get these sandbox virtual
machines, which are really nice for very
long horizon tasks, but the trade-off is
that it usually takes longer to boot up
and you have to set up some initial
configuration with the environment that
you're working in. But running agents locally in parallel requires a different kind of isolation. If you have
multiple agents that are trying to
modify the same set of files on your
local machine, you need tools like git worktrees, which give you different copies of your codebase that can run independently. And then you also have to think about all the other parts of local dev, like managing access to your database and viewing the worktrees on different ports. And I talked to some developers earlier; a lot of this is happening in user-land, and people are writing scripts and hacks to make this work really well. And what we're working
on and exploring is actually building
this natively into the Cursor product.
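As a concrete sketch of the worktree setup mentioned above, the snippet below creates an isolated checkout per agent task; the paths and branch names are made up for the example, and this is exactly the kind of user-land script the product work aims to make unnecessary.

```python
# Illustrative sketch: give each local agent its own git worktree so parallel
# edits don't collide. Paths and branch names are invented for the example.
import subprocess

def create_worktree(task_name: str, base_branch: str = "main") -> str:
    path = f"../agent-{task_name}"
    subprocess.run(
        ["git", "worktree", "add", "-b", f"agent/{task_name}", path, base_branch],
        check=True,
    )
    return path  # point that agent's working directory here

for task in ["fix-login-bug", "update-docs"]:
    print(create_worktree(task))
```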
Another idea that we've started to
explore for multiple agents is being
able to have the models compete against
each other. So what if you had GPT-5 high reasoning versus medium or low reasoning, and then you could pick the best result, or compare results across different model providers with Cursor's agent? This will soon be an option to go from one to n for any given prompt and any model.
Part of context engineering for agents
is making it so they can check their own
work. So the agent needs to be able to
run the code, test it, and then verify
it's actually working correctly, which
is why we're exploring giving the agent
computer use. They can then control a
browser to view network requests or take
snapshots of the DOM and even give
feedback about the design of the page.
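Below is a rough sketch of that kind of browser-based verification using Playwright; the library choice and the specific checks are assumptions for illustration, not Cursor's implementation of computer use.

```python
# Illustrative sketch of browser-based verification for an agent, using
# Playwright (an assumed library choice): load the page, log network requests,
# and snapshot the DOM so the agent can check its own work.
from playwright.sync_api import sync_playwright

def verify_page(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        requests = []
        page.on("request", lambda req: requests.append(req.url))  # record network traffic
        page.goto(url)
        dom_snapshot = page.content()  # serialized DOM for the agent to inspect
        browser.close()
    return {"requests": requests, "dom": dom_snapshot}

# e.g. verify_page("http://localhost:3000") after an edit, then check for
# failed requests or missing elements before accepting the change.
```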
As you can tell, there's still a lot to
figure out on the right interface, the
right product experience for managing
multiple coding agents. Some of the
things I just showed are available in
cursor today in beta. So go try them out
if you're curious. And we'll have a
stable release later this month. But I
would love to hear your feedback on how
you want to work with coding agents in
the future. So come find me later and we
can talk about it. And speaking of the
future, I'd like to welcome Michael to
the stage to talk about where software
engineering is headed next.
Thanks, Lee. Our goal with Cursor is to automate coding. We think that half of that is a model problem and an autonomy problem, and we think that the other half is a human-computer interaction problem
of what the act of building software
looks like. We want engineers to be more
ambitious, more inventive, and more
fulfilled. And today I want to hint a
little bit at the picture of the future
that I think we can create together, one where AI frees up more time to work on
the parts of building software that you
love.
Imagine waking up in the morning, opening Cursor, and seeing that all of your tedious work has already been handled. On-call issues were fixed and triaged overnight. Boilerplate you never wanted to write was generated, tested, and ready to merge. A world
where code review is actually fun, too.
Instead of being buried in your busy
work, your energy goes toward the things
that drew you to engineering in the
first place: solving hard problems, designing beautiful systems, and
building things that matter.
Imagine agents that deeply understand
your codebase, your team style, and your
product sense. Agents that come back to
you after working for long, long, long
periods of time and show their work in
higher level programming languages.
Agents that propose ideas, help you
explore new directions, break down
complex projects into pieces you can
accept, reject, or refine. Ones that
extend your ambition, but never take
away your thinking and judgment. When
you have a problem too complex for
agents, they show you what they tried, pulling in runtime logs or debugging tools. You'll never start from scratch.
This is the future we're working
towards: a world where building software
feels less like toil and much more like
play and where creativity is the focus.
Uh, and I think it's possible sooner
than even some of the most ambitious
people in this room think.
Uh, if this vision excites you, we'd
love to chat. And if you haven't tried
cursor, we've been shipping lots of
improvements to our agent and to our
editor. We'd love to hear what you
think. Thank you.
[Applause]