Context Engineering & Coding Agents with Cursor
By OpenAI
Summary
## Key takeaways
- **AI coding evolution: Speedrun vs. gradual shift**: The transition to AI-assisted coding is happening much faster than previous shifts, like the move from text-based terminals to graphical interfaces, compressing decades of progress into just a few years. [01:13]
- **Tab's real-time learning from user feedback**: Cursor's Tab feature, handling over 400 million daily requests, uses user acceptance and rejection of suggestions to train its next-action prediction model in near real-time via online reinforcement learning. [02:24], [02:55]
- **Context engineering: Intentional context over raw data**: Effective context engineering for AI models is less about prompt tricks and more about providing intentional context, focusing on minimal, high-quality tokens rather than overwhelming the model with excessive data. [05:21]
- **Semantic search beats basic grep for code retrieval**: Indexing codebases to create embeddings for semantic search significantly improves agent accuracy compared to basic string matching tools like grep, enabling agents to find relevant code even with slightly different file names. [06:32]
- **User-land innovation drives agent features**: Many breakthroughs in context engineering for coding agents originate from power users in 'user-land,' who develop effective workflows and patterns that are later integrated into the core product. [11:20]
- **Future: AI automates toil, enhances creativity**: Cursor aims to automate the tedious aspects of software development, freeing engineers to focus on creative problem-solving, system design, and building impactful features, making coding feel more like play. [16:17], [18:05]
Topics Covered
- AI is speedrunning software development's evolution.
- Intentional context, not just prompts, drives AI coding.
- Semantic search is fundamental for AI code retrieval.
- User workflows drive AI product feature development.
- AI frees engineers for creativity, not toil.
Full Transcript
[Applause]
I'm Lee and I'm on the cursor team and
I'm going to talk about how building
software has evolved. So, thanks for
being here.
We started with punch cards and
terminals back in the 60s where
programming was this new superpower, but
it was inaccessible to most people. And
then in the 70s, programmers grew up writing BASIC on their Apple IIs and their Commodore 64s. Then in the 80s, GUIs started to go mainstream, but still most programming was done on text-based terminals. It wasn't until the '90s and the 2000s that we started to see programming shift to graphical interfaces. So FrontPage and Dreamweaver, which you might remember, allowed beginners to drag and drop and build websites. And new editors and IDEs like Visual Studio made it easier
for professionals to work in very large
code bases. And I of course had to add
my favorite text editor, Sublime Text
here. I'm sure some of you have used it
before. It's a good one. Now with AI
building software is becoming more
accessible and powerful than ever.
Unlike that slower shift from terminals to graphical interfaces, the shift to writing code with AI is really being speedrun. The progress of
decades is happening in just a few
years. And with each iteration, the
interface and the UX is changing to
allow the models to achieve more
ambitious tasks. So I'd like to talk
about context engineering and how coding
agents have evolved over the past few
years from the perspective of cursor.
I'll show how we've gone from
autocompleting your next action to fully
autonomous coding agents. And finally
we'll have Michael, Cursor's CEO, talk about
the future of where we believe software
engineering is headed.
So, let's start with TAB. One of the
products that inspired Cursor was GitHub
Copilot. It showed that with
improvements to the UX of autocomplete
and with better models, we can make
writing code much easier. We released
the first version of tab back in 2023
and the experience has evolved from
predicting the next word to the next
line and then ultimately to where your
cursor is going to go next. Tab now
handles over 400 million requests per
day. And this means we have a lot of
data about which suggestions users
accept and reject. This led to us moving
from an off-the-shelf model to training
a model specialized for next action
prediction. So to improve this model, we
use data to positively reinforce
behaviors that lead to accepted
suggestions and then negatively
reinforce rejected suggestions. And
we're able to do this in near real time.
So you can accept a suggestion and then
30 minutes later the tab model has been
updated using online RL based on your
feedback.
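To make that loop a bit more concrete, here is a rough, illustrative sketch of how accept/reject events can drive a REINFORCE-style policy update. The policy interface (`policy.log_prob`) and the training loop are assumptions for illustration, not Cursor's actual pipeline.

```python
# Illustrative sketch only: turning accept/reject feedback into rewards for a
# REINFORCE-style policy update. The `policy.log_prob(...)` interface is an
# assumption; this is not Cursor's actual training code.
import torch

def online_rl_step(policy, optimizer, feedback_batch):
    """feedback_batch: iterable of (context, suggestion, accepted) events."""
    losses = []
    for context, suggestion, accepted in feedback_batch:
        log_prob = policy.log_prob(suggestion, context)  # log p(suggestion | context)
        reward = 1.0 if accepted else -1.0               # accepted -> reinforce, rejected -> penalize
        losses.append(-reward * log_prob)                # REINFORCE objective
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```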
Getting this experience right has taken
a lot of iterations. There's a delicate
balance between the speed of the
suggestion, the quality of the
suggestion, and also just the general UX
for how it's displayed. If it's slower
than 200 milliseconds, it kind of takes
you out of your flow. But you also don't
want to see fast unhelpful suggestions.
So with our latest release now, we show
fewer suggestions, but we have higher
confidence that they're going to be
accepted.
We find tab really helpful for domains
where AI models just aren't as helpful
yet. And the bottleneck here really is
your own typing speed. Now, most people
type at about 40 words per minute, even
though I'm sure all of you type at 90
plus, right? We've got some amazing
typists in here.
So what would it look like if we allowed
the AI models to write more code for us?
This is where coding agents come in and
this is that next evolution of coding
with AI. You can talk to models directly
in products like Cursor, or like we saw in Codex, and have them create or update
entire blocks of code.
Something we've tried really hard to
make a focus in cursor is giving you
control over the level of autonomy of
working with the models. So, one of the
first features we added back in 2023 was
prompting models to add inline
suggestions. This would take your
current line as well as the broader file
context and then pass it to the model to
suggest a diff.
Shortly after, we released our first
steps towards a coding agent, which was
a feature called composer, which some of
the longtime Cursor users may
remember. Uh, we even have a pixelated
Twitter demo that I've included here of
one of the first versions. This made it
much easier to do multifile edits with
more of a conversational UI.
And then in 2024, we added a fully
autonomous coding agent. This saw models
use more tokens as they were getting
better at tool calling, and it allowed Cursor to gather its own context.
So in the previous versions, you had to
provide all of that context up front
which was a bit more difficult. So let's
talk about some of the ways that we've
optimized the cursor agent harness.
There's been a lot of talk recently
about context engineering as an
evolution of prompt engineering, which I
personally find really helpful. As
models are getting better, getting high
quality output is less about specific
prompting tricks, although those can
still help, but it's more about giving
the models the right context. And not
just any context, but intentional
context. Models get worse at recalling
information as the size of the context
increases. And in reality, you don't
want to push the limits of the context
window. You want to use a minimal amount
of high-quality tokens. And this is why
the retrieval of code is actually really
important and fundamental to context
engineering. So let's look at an example
of searching code in a larger codebase.
We found that when you give models very
powerful tools, it can significantly
improve the rate at which code is
accepted.
Many coding agents now use commands like grep or ripgrep to look for direct string matches across files and directories.
And as new models are trained on tool
calling and agents get better at using
tools, the search quality does improve.
However, we found that you can make searching even better by automatically indexing your codebase and creating embeddings, which enables semantic search. So I can ask the agent to update the top navigation, and even if the file is actually called header.tsx, semantic search allows the agent to quickly and accurately find the correct code during the retrieval process.
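As a rough illustration of combining both retrieval paths, here is a minimal sketch. The `embed()` function is a placeholder for whatever embedding model produces the vectors, and the chunking, storage, and ranking details are simplified assumptions rather than Cursor's implementation.

```python
# Minimal sketch of hybrid code retrieval: exact string matching (grep) plus
# embedding-based semantic search. `embed()` is a placeholder; the details are
# simplified assumptions, not Cursor's implementation.
import subprocess
import numpy as np

def embed(text: str) -> np.ndarray:
    """Assumed: returns a unit-normalized embedding vector for `text`."""
    raise NotImplementedError

def build_index(chunks: dict[str, str]) -> dict[str, np.ndarray]:
    # Embed each code chunk once, offline, so searches at inference time are cheap.
    return {path: embed(code) for path, code in chunks.items()}

def semantic_search(query: str, index: dict[str, np.ndarray], k: int = 5) -> list[str]:
    q = embed(query)
    scores = {path: float(np.dot(q, vec)) for path, vec in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]  # top-k paths by similarity

def grep_search(pattern: str, repo_root: str = ".") -> list[str]:
    # Plain string matching: fast, but misses "top navigation" when the file is header.tsx.
    out = subprocess.run(["grep", "-rl", pattern, repo_root], capture_output=True, text=True)
    return out.stdout.splitlines()
```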
For generating embeddings, we also moved
from an off-the-shelf embedding model to
training a custom model that helped us
produce more accurate results and we
constantly AB test the performance of
using semantic search. We found that when using grep alone, users would send more follow-up questions and also spend more tokens, so semantic search is really helpful. One of the
biggest wins though is it shifts where
the compute happens. You spend the
compute and the latency upfront during
the indexing rather than at inference
time when the agent is actually being
invoked. So in other words, you're doing
the heavy lifting offline, which means
you can get faster and cheaper responses
at runtime without sacrificing
performance and putting that on the
user. So the takeaway here is you likely
want both grep and semantic search for the
best results. And we'll have a full blog
post soon that talks about some of these
results. So giving the models better
tools helps improve their quality. But
what about the UX of actually using
these coding agents? There's been a lot
of exploration with coding CLIs, from OpenAI's Codex to Claude to Cursor's
own CLI. And the idea here is to find
the most minimal abstraction over the
model, kind of iterate on the harness
and then make the agent extensible. But
we don't believe CLIs are the final
state or the end goal of working with
coding agents. What I like about the
terminal is that it opens up a new
surface for coding agents to run. So
this can be in the CLI. It can also be
on the web or from your phone. It can be
from a bug report in Slack, which I use
all the time. It can be from a backlog
item in linear just automatically
triaged for you.
Because CLI-based agents are scriptable, you can use them in any type of environment, which is really helpful. We
use this internally to automatically
write docs or update parts of our
codebase. And it can be as simple as
just doing cursor -p and then a prompt
and having text or even structured
formats like JSON come back.
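For instance, because the agent is just a command, you can drive it from a script. The binary name and flags below are assumptions based on the cursor -p example above, so check your CLI's documentation for the real interface.

```python
# Illustrative sketch of scripting a CLI coding agent (e.g. from CI or a cron job).
# The binary name and flags are assumptions based on the "cursor -p" example above.
import subprocess

def run_agent(prompt: str) -> str:
    result = subprocess.run(
        ["cursor", "-p", prompt],          # assumed invocation; adjust for your CLI
        capture_output=True, text=True, check=True,
    )
    return result.stdout                    # plain text (or JSON, if the CLI supports it)

if __name__ == "__main__":
    # Hypothetical example: automatically drafting documentation updates.
    print(run_agent("Update the README to document the new hooks settings"))
```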
We also believe that you'll need more
specialized agents, which makes sense
when you see the keynote today. Last
year, we started experimenting with
using AI models to read and review code
instead of just writing and editing
code. And we made an internal tool
called Bugbot. It tried to help you find
meaningful logic bugs in your code. And
after using it internally for about 6
months, we found that it actually caught
a lot of bugs that we missed on code
reviews. So we decided to make it public
and, funnily enough, it actually caught a bug that took down Bugbot itself, which of course we accidentally ignored. So we learned to really pay attention to those Bugbot comments.
Newer models are also getting very good
at longer horizon tasks. So one way
we've pushed agents to run longer inside
of cursor is having them plan and do
more research upfront. This not only
gives you a chance to verify the
requirements of what you're trying to
build and course correct along the way
but we've also seen it significantly
improves the quality of the code
generated, which makes sense, right?
You're giving the models much higher
quality input context. And to do this
well, it's more than a simple prompt change like "plan better": you actually need deeper product integration in how you store the plans, how you edit the files, and also in giving the model new tools.
It also makes sense to allow the agent
to create and manage a to-do list. This
gives the model critical context so it doesn't forget the task it's working on or waste tokens; it's like having notes it can constantly reference. One area we're still exploring is giving your to-dos the same source of truth as your codebase, which is something I would personally use for smaller projects where maybe I don't need a fully featured task management tool.
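One way to picture this is a minimal sketch of a to-do list the agent manages as a tool and re-reads each turn. The class shape and the rendering format are illustrative assumptions, not Cursor's internals.

```python
# Illustrative sketch: a to-do list the agent can manage as a tool, re-rendered
# into its context each turn so it doesn't lose track of the task. The shape of
# this class is an assumption, not Cursor's internal representation.
from dataclasses import dataclass, field

@dataclass
class TodoList:
    items: dict[str, bool] = field(default_factory=dict)  # task -> done?

    def add(self, task: str) -> None:
        self.items[task] = False

    def complete(self, task: str) -> None:
        self.items[task] = True

    def render(self) -> str:
        # The rendered checklist is injected into the agent's prompt each turn.
        return "\n".join(f"[{'x' if done else ' '}] {task}"
                         for task, done in self.items.items())
```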
Another important part of agent
extensibility is allowing you to package
up your workflows and then share them
with your team. So custom commands are a
way to share prompts and then rules
allow you to include important context
in every single agent conversation. One
way our engineers have found this really
helpful internally is packaging up our
commit standards and guidelines, putting
them in /commit, and then being able to pass in tickets, like the Linear ticket that you're working on.
Another thing that I've noticed is that
a lot of the context engineering
breakthroughs actually happen in user
space first. So all of you the power
users figure out the workflows and the
patterns that actually work really well
and then as they get adopted they make
their way back into the core product as
features. Plans, memories, and rules are really all examples of this.
Speaking of teams, you want to trust
these agents to write code for you. But
that requires keeping a human in the loop, which is why, when the agent tries to run shell commands, Cursor will ask
you if you would like to run it just
once or if you're comfortable, you can
add it to the allow list to auto run in
the future. And all these settings can
be stored in code and explicitly shared
with your team, including blocking
certain shell commands or actions. Our
latest release also has custom hooks, so
you can tap into every part of the agent's run. Maybe you want to have a shell script that runs when the agent finishes, for example.
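To make the allow-list idea above concrete, here is a minimal sketch of gating agent-issued shell commands. The command lists and decision labels are illustrative assumptions, not Cursor's settings format.

```python
# Illustrative sketch of allow/deny gating for agent-issued shell commands.
# The lists and return labels are assumptions, not Cursor's settings format.
ALLOWED = {"npm test", "pytest", "git status"}
BLOCKED_PREFIXES = ("rm -rf", "git push --force")

def review_command(cmd: str) -> str:
    if any(cmd.startswith(prefix) for prefix in BLOCKED_PREFIXES):
        return "block"          # never run, even if approved interactively
    if cmd in ALLOWED:
        return "auto-run"       # previously added to the allow list
    return "ask-user"           # keep the human in the loop for anything else
```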
So, we've covered a lot of ground here.
Coding agents have evolved quite a bit
in the past year, and they're getting
better and better when you give them
very powerful tools. And as the models have gotten more capable, we've actually been able to remove overly precise instructions from our system prompts that just weren't necessary anymore. So
what would it look like if we allowed
agents to run for significantly longer?
What is the right interface for managing
multiple coding agents?
If you're just getting started coding
with agents, I don't recommend
immediately trying to juggle multiple
agents. I mean, let's be honest, are we
really being productive running nine
CLIs in parallel?
Probably not.
Probably not yet, though. I mean, not
only do you need to set up your local
machine for running parallel agents, but
it's also kind of hard to review the
output of all of these agents. So, we
don't think that this form factor is the
end goal or the end state, but there is
promise here. One thing we've been dogfooding over the past few months is a
new type of interface for managing
multiple coding agents. And we found
this really helpful internally when
maybe you have an agent in the
foreground, but you need to ask
questions about the codebase or maybe do
some research about tools you want to
integrate or small refactors. When you
have this fast coding model in the
foreground, you can really stay in the
flow and then you have your parallel
agents kind of run other tasks in the
background which could run for much
longer. Those could be in the foreground
on your machine. They can be in the
background on the cloud. Each one of
these decisions has unique constraints
that right now you have to think about
and spend a lot of time on. If you're in
the cloud, you get these sandbox virtual
machines, which are really nice for very
long horizon tasks, but the trade-off is
that it usually takes longer to boot up
and you have to set up some initial
configuration with the environment that
you're working in. But running agents locally in parallel requires a different kind of isolation. If you have
multiple agents that are trying to
modify the same set of files on your
local machine, you need tools like git worktrees, which give you different copies of your codebase that can run independently. And then you also have to think about all the other parts of local dev, like managing access to your database and viewing the worktrees on different ports. And I talked to some developers earlier; a lot of this is happening in user-land, and people are writing scripts and hacks to make this work really well. And what we're working
on and exploring is actually building
this natively into the Cursor product.
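As a concrete sketch of the worktree setup mentioned above, the snippet below creates an isolated checkout per agent task; the paths and branch names are made up for the example, and this is exactly the kind of user-land script the product work aims to make unnecessary.

```python
# Illustrative sketch: give each local agent its own git worktree so parallel
# edits don't collide. Paths and branch names are invented for the example.
import subprocess

def create_worktree(task_name: str, base_branch: str = "main") -> str:
    path = f"../agent-{task_name}"
    subprocess.run(
        ["git", "worktree", "add", "-b", f"agent/{task_name}", path, base_branch],
        check=True,
    )
    return path  # point that agent's working directory here

for task in ["fix-login-bug", "update-docs"]:
    print(create_worktree(task))
```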
Another idea that we've started to
explore for multiple agents is being
able to have the models compete against
each other. So what if you had GPT-5 high reasoning versus medium or low reasoning, and then you could pick the best result, or compare results across different model providers with Cursor's agent? This will soon be an option to go from one to n for any given prompt and any model.
Part of context engineering for agents
is making it so they can check their own
work. So the agent needs to be able to
run the code, test it, and then verify
it's actually working correctly, which
is why we're exploring giving the agent
computer use. They can then control a
browser to view network requests or take
snapshots of the DOM and even give
feedback about the design of the page.
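Below is a rough sketch of that kind of browser-based verification using Playwright; the library choice and the specific checks are assumptions for illustration, not Cursor's implementation of computer use.

```python
# Illustrative sketch of browser-based verification for an agent, using
# Playwright (an assumed library choice): load the page, log network requests,
# and snapshot the DOM so the agent can check its own work.
from playwright.sync_api import sync_playwright

def verify_page(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        requests = []
        page.on("request", lambda req: requests.append(req.url))  # record network traffic
        page.goto(url)
        dom_snapshot = page.content()  # serialized DOM for the agent to inspect
        browser.close()
    return {"requests": requests, "dom": dom_snapshot}

# e.g. verify_page("http://localhost:3000") after an edit, then check for
# failed requests or missing elements before accepting the change.
```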
As you can tell, there's still a lot to
figure out on the right interface, the
right product experience for managing
multiple coding agents. Some of the
things I just showed are available in
cursor today in beta. So go try them out
if you're curious. And we'll have a
stable release later this month. But I
would love to hear your feedback on how
you want to work with coding agents in
the future. So come find me later and we
can talk about it. And speaking of the
future, I'd like to welcome Michael to
the stage to talk about where software
engineering is headed next.
Thanks, Lee. Our goal with Cursor is to automate coding. We think that half of that is a model problem and an autonomy problem, and we think that the other half is a human-computer interaction problem
of what the act of building software
looks like. We want engineers to be more
ambitious, more inventive, and more
fulfilled. And today I want to hint a
little bit at the picture of the future
that I think we can create together, one where AI frees up more time to work on
the parts of building software that you
love.
Imagine waking up in the morning, opening Cursor, and seeing that all of your tedious work has already been handled. On-call issues were fixed and triaged overnight. Boilerplate you never wanted to write was generated, tested, and ready to merge. A world
where code review is actually fun, too.
Instead of being buried in your busy
work, your energy goes toward the things
that drew you to engineering in the
first place: solving hard problems, designing beautiful systems, and
building things that matter.
Imagine agents that deeply understand
your codebase, your team style, and your
product sense. Agents that come back to
you after working for long, long, long
periods of time and show their work in
higher level programming languages.
Agents that propose ideas, help you
explore new directions, break down
complex projects into pieces you can
accept, reject, or refine. Ones that
extend your ambition, but never take
away your thinking and judgment. When
you have a problem too complex for
agents, they show you what they tried, pulling in runtime logs or debugging tools. You'll never start from scratch.
This is the future we're working
towards: a world where building software
feels less like toil and much more like
play and where creativity is the focus.
Uh, and I think it's possible sooner
than even some of the most ambitious
people in this room think.
Uh, if this vision excites you, we'd
love to chat. And if you haven't tried
cursor, we've been shipping lots of
improvements to our agent and to our
editor. We'd love to hear what you
think. Thank you.
[Applause]