Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
By Lex Fridman
Summary
## Key takeaways
- **AI will change programming's core nature.** The future of programming involves AI-assisted coding, shifting the focus from boilerplate to high-level design and nuanced decision-making, empowering programmers with greater agency and speed. (36:41, 01:12:23)
- **Cursor's fork of VS Code enables rapid AI innovation.** By forking VS Code, Cursor can deeply integrate AI capabilities, allowing for faster iteration and the development of novel features that extensions alone cannot achieve. (10:34:14, 11:34:17)
- **AI's role: predicting and executing programmer intent.** Cursor aims to predict a programmer's next action, from code completion to multi-line edits and even navigating between files, by making AI a seamless partner in the coding process. (16:11:15, 17:37:39)
- **Optimizing AI for coding requires specialized models.** Custom models, like those using Mixture of Experts (MoE) and speculative edits, are crucial for achieving the low latency and high accuracy needed for AI-powered coding assistance. (19:40:13, 34:31:35)
- **AI struggles with bug detection, needs better training.** While AI excels at code generation, it currently struggles with bug detection due to a lack of relevant training data, highlighting the need for specialized models and potentially synthetic data generation. (33:24:27, 41:11:16)
- **Human-AI collaboration amplifies engineering capabilities.** The future of programming lies in a hybrid engineer who leverages AI for speed and scale while retaining human judgment for complex design decisions and trade-offs, leading to more effective and creative problem-solving. (47:44:48, 02:17:37)
Topics Covered
- The future of programming is zero entropy actions.
- Why are AI models so bad at finding bugs?
- Formal verification will replace unit tests.
- AI should be a partner, not an order-taker.
Full Transcript
the following is a conversation with the
founding members of the cursor team
Michael Truell, Sualeh Asif, Arvid Lunnemark,
and Aman Sanger. Cursor is a code editor
based on VS Code that adds a lot of
powerful features for AI assisted coding
it has captivated the attention and
excitement of the programming and AI
communities so I thought this is an
excellent opportunity to dive deep into
the role of AI in programming this is a
super technical conversation that is
bigger than just about one code editor
it's about the future of programming and
in general the future of human AI
collaboration in designing and
Engineering complicated and Powerful
systems. This is the Lex Fridman Podcast. To
support it please check out our sponsors
in the description and now dear friends
here's Michael, Sualeh, Arvid, and
Aman all right this is awesome we have
Michael, Aman, Sualeh, and Arvid here from the
cursor team first up big ridiculous
question what's the point of a code
editor so the the code editor is largely
the place where you build software and
today or for a long time that's meant
the place where you text edit uh a
formal programming language and for
people who aren't programmers the way to
think of a code editor is like a really
souped up word processor for programmers
where the reason it's it's souped up is
code has a lot of structure and so the
the quote unquote word processor the
code editor can actually do a lot for
you that word processors you know sort
of in the writing space haven't been
able to do for for people editing text
there and so you know that's everything
from giving you visual differentiation
of like the actual tokens in the code to
so you can like scan it quickly to
letting you navigate around the code
base sort of like you're navigating
around the internet with like hyperlinks
you're going to sort of definitions of
things you're using to error checking um
to you know to catch rudimentary B
um and so traditionally that's what a
code editor has meant and I think that
what a code editor is is going to change
a lot over the next 10 years um as what
it means to build software maybe starts
to look a bit different I I think also
code editors should just be fun yes that is
very important that is very important
and it's actually sort of an underated
aspect of how we decide what to build
like a lot of the things that we build
and then we we try them out we do an
experiment and then we actually throw
them out because they're not fun and and
so a big part of being fun is like being
fast a lot of the time fast is fun yeah
fast
is uh yeah that should be a
t-shirt like like
fundamentally I think one of the things
that draws a lot of people to to
building stuff on computers is this like
insane integration speed where you know
in other disciplines you might be sort
of gated by resources or the
ability even the ability you know to get
a large group together and coding is
just like amazing thing where it's you
and the computer and uh that alone you
can you can build really cool stuff
really quickly so for people don't know
cursor is this super cool new editor
that's a fork of vs code it would be
interesting to get your kind of
explanation of your own journey of
editors how did you I think all of you
are were big fans of vs code with
Copilot how did you arrive to VS Code
and how did that lead to your journey
with cursor yeah um
so I think a lot of us well all of us
originally Vim users. Pure Vim? Pure
Vim, yeah, no Neovim, just pure Vim in a
terminal. And at least for myself
it was around the time that Copilot
came out so
2021 that I really wanted to try it so I
went into vs code the only platform the
only code editor in which it was
available
and even though I you know really
enjoyed using Vim just the experience of
co-pilot with with vs code was more than
good enough to convince me to switch and
so that kind of was the default until we
started working on cursor and uh maybe
we should explain what Copilot does it's
like a really nice
autocomplete it suggests as you start
writing a thing it suggests one or two
or three lines how to complete the thing
and there's a fun experience in that you
know like when you have a close
friendship and your friend completes
your
sentences like when it's done well
there's an intimate feeling uh there's
probably a better word than intimate but
there's a there's a cool feeling of like
holy it gets
me now and then there's an unpleasant
feeling when it doesn't get you uh and
so there's that that kind of friction
but I would say for a lot of people the
feeling that it gets me over powers that
it doesn't and I think actually one of
the underrated aspects of GitHub Copilot
is that even when it's wrong it's
like a little bit annoying but it's not
that bad because you just type another
character and then maybe then it gets
you or you type another character and
then then it gets you so even when it's
wrong it's not that bad yeah you you can
sort of iterate iterate and fix it I
mean the other underrated part of uh
Copilot for me sort of was just the first
real real AI product it's like the first
language model consumer product so
Copilot was kind of like the first killer
app for LLMs yeah and like the beta was
out in 2021 right okay mhm uh so what's
the the origin story of cursor so around
2020 the scaling laws papers came out
from OpenAI and that was a moment
where this looked like clear predictable
progress for the field where even if we
didn't have any more ideas looked like
you could make these models a lot better
if you had more compute and more data
uh by the way we'll probably talk uh for
three to four hours on on the topic of
scaling laws but just just to summarize
it's a paper and a set of papers and set
of ideas that say bigger might be better
for model size and data size in the in
the realm of machine learning it's
bigger is better but predictably
better okay this is another topic of
conversation but anyway yeah so around
that time for some of us there were like
a lot of conceptual conversations about
what's this going to look like what's
the the story going to be for all these
different knowledge worker Fields about
how they're going to be um made better U
by this technology getting better and
then um I think there were a couple of
moments where like the theoretical gains
predicted in that paper uh started to
feel really concrete and it started to
feel like a moment where you could
actually go and not you know do a PhD if
you wanted to work on uh do useful work
in AI actually felt like now there was
this this whole set of systems one could
built that were really useful and I
think that the first moment we already
talked about a little bit which was
playing with the early bit of copell
like that was awesome and magical um I
think that the next big moment where
everything kind of clicked together was
actually getting early access to GPT-4 so
sort of end of 2022 was when we were um
tinkering with that model and the Step
Up in capabilities felt enormous and
previous to that we had been working on
a couple of different projects we had
been um because of co-pilot because of
scaling laws because of our prior
interest in the technology we had been
uh tinkering around with tools for
programmers but things that are like
very specific so you know we were
building tools for uh Financial
professionals who have to work within a
Jupyter notebook or like you know playing
around with can you do static analysis
with these models and then the Step Up
in GPT-4 felt like look that really made
concrete the theoretical gains that um
we had predicted before felt like you
could build a lot more just immediately
at that point in time and
also if we were being consistent it
really felt like um this wasn't just
going to be a point solution thing this
was going to be all of programming was
going to flow through these models it
felt like that demanded a different type
of programming environment, a different
type of programming and so we set off to
build that that sort of larger Vision
around then there's one that I
distinctly remember. So my roommate is an IMO gold winner, and there's a competition in the US called the Putnam, which is sort of the IMO for college people. It's this math competition, and he's exceptionally good. So Shengtong and Aman, I remember, sort of June of 2022, had this bet on whether, like, by 2024 June or July you were going to win a gold medal in the IMO with, like, models.

IMO is the International Math Olympiad.

Yeah, IMO is the International Math Olympiad. And so Arvid and I both, you know, also competed in it, so it was sort of personal, and I remember thinking this is just not going to happen. Even though I sort of believed in progress, I thought, you know, IMO gold, like, Aman is just delusional. And to be honest, I mean, I was, to be clear, very wrong, but that was maybe the most prescient bet in the group.

So the new results from DeepMind, it turned out that you were correct.

Well, technically not.

Technically incorrect, but one point away.

Aman was very
enthusiastic about this stuff back then
and before Aman had this like scaling
laws T-shirt that he would walk around
with where it had like charts and like
the formulas on it oh so you like felt
the AI or you felt the scaling yeah I i
l remember there was this one
conversation uh I had with with Michael
where before I hadn't thought super
deeply and critically about scaling laws
and he kind of posed the question why
isn't scaling all you need or why isn't
scaling going to result in massive gains
in progress and I think I went through
like the like the stages of grief there
is anger denial and then finally at the
end just thinking about it uh acceptance
um and I think I've been quite hopeful
and uh optimistic about progress since I
think one thing I'll caveat is I think
it also depends on like which domains
you're going to see progress like math
is a great domain because especially
like formal theorem proving because you
get this fantastic signal of actually
verifying if the thing was correct and
so this means something like RL can work
really really well and I think like you
could have systems that are perhaps very
superhuman in math and still not
technically have AGI okay so can we take
it off all the way to cursor mhm and
what is cursor it's a fork of vs code
and VS Code is one of the most popular
editors for a long time like everybody
fell in love with it everybody left Vim
I left Emacs for it
sorry
uh uh so it unified in some fun
fundamental way the uh the developer
community and then that you look at the
space of things you look at the scaling
laws AI is becoming amazing and you
decided okay it's not enough to
just write an extension over VS
code because there's a lot of
limitations to that we we need if AI is
going to keep getting better and better
and better we need to really like
rethink how the the AI is going to be
part of the editing process and so you
decided to Fork vs code and start to
build a lot of the amazing features
we'll be able to to to talk about but
what was that decision like because
there's a lot of extensions including
Copilot, for VS Code, that are doing sort of AI
type stuff what was the decision like to
just Fork vs code so the decision to do
an editor seemed kind of self-evident to
us for at least what we wanted to do and
Achieve because when we started working
on the editor the idea was these models
are going to get much better their
capabilities are going to improve and
it's going to entirely change how you
build software both in a you will have
big productivity gains but also radical
in how like the active building software
is going to change a lot and so you're
very limited in the control you have
over a code editor if you're a plugin to
an existing coding environment um and we
didn't want to get locked in by those
limitations we wanted to be able to um
just build the most useful stuff okay
well then the natural question
is you know VS Code is kind of with
copilot a competitor so how do you win
is is it basically just the speed and
the quality of the features yeah I mean
I think this is a space that is quite
interesting perhaps quite unique where
if you look at previous Tech waves
maybe there's kind of one major thing
that happened and unlocked a new wave of
companies but every single year every
single model capability uh or jump you
get model capabilities you now unlock
this new wave of features things that
are possible especially in programming
and so I think in AI programming being
even just a few months ahead let alone a
year ahead makes your product much much
much more useful I think the cursor a
year from now will need to make the
cursor of today look
Obsolete and I think you know Microsoft
has done a number of like fantastic
things but I don't think they're in a
great place to really keep innovating
and pushing on this in the way that a
startup can just rapidly implementing
features and and push yeah like and and
kind of doing the research
experimentation
necessary um to really push the ceiling
I don't I don't know if I think of it in
terms of features as I think of it in
terms of like capabilities for for
programmers it's that like you know as
you know the new o1 model came out and
I'm sure there are going to be more more
models of different types like longer
context and maybe faster like there's
all these crazy ideas that you can try
and hopefully 10% of the crazy ideas
will make it into something kind of cool
and useful and uh we want people to have
that sooner to rephrase it's like an
underrated fact is we're making it for
ourselves when we started cursor you really
felt this frustration that you know
models you could see models getting
better uh but the Copilot experience had
not changed it was like man these these
guys like the ceiling is getting higher
like why are they not making new things
like they should be making new things
they should be like you like like
where's where's where's all the alpha
features there there were no Alpha
features it was like uh I I'm sure it it
was selling well I'm sure it was a great
business but it didn't feel I I'm I'm
one of these people that really want to
try and use new things and was just
there's no new thing for like a very
long while yeah it's interesting uh I
don't know how you put that into words
but when you compare a cursor with
copilot copilot pretty quickly became
started to feel stale for some reason
yeah I think one thing that I think uh
helps us is that we're sort of doing it
all in one where we're developing the
the ux and the way you interact with the
model and at the same time as we're
developing like how we actually make the
model give better answers so like how
you build up the The Prompt or or like
how do you find the context and for a
cursor tab like how do you train the
model um so I think that helps us to
have all of it like sort of like the
same people working on the entire
experience on end yeah it's like the the
person making the UI and the person
training the model like sit to like 18
ft away so often the same person even
yeah often often even the same person so
you you can you create things that that
are sort of not possible if you're not
you're not talking you're not
experimenting and you're using like you
said cursor to write cursor of course oh
yeah yeah well let's talk about some of
these features let's talk about the all-
knowing the all-powerful praise be to the
tab so the you know autocomplete on
steroids basically so what how does tab
work what is tab to highlight and
summarize it a high level I'd say that
there are two things that curser is
pretty good at right now there there are
other things that it does um but two
things it it helps programmers with one
is this idea of looking over your
shoulder and being like a really fast
colleague who can kind of jump ahead of
you and type and figure out what you're
what you're going to do next and that
was the original idea behind that was
kind kind of the kernel the idea behind
a good autocomplete was predicting what
you're going to do next you can make
that concept even more ambitious by not
just predicting the characters after
cursor but actually predicting the next
entire change you're going to make the
next diff the next place you're going to
jump to um and the second thing cursor
is is pretty good at right now too is
helping you sometimes jump ahead of the
AI and tell it what to do and go from
instructions to code and on both of
those we've done a lot of work on making
the editing experience for those things
ergonomic um and also making those
things smart and fast one of the things
we really wanted was we wanted the model
to be able to edit code for us uh that
was kind of a wish and we had multiple
attempts at it before before we had a
sort of a good model that could edit
code for
you U then after after we had a good
model I think there there have been a
lot of effort to you know make the
inference fast for you know uh having
having a good good
experience and uh we've been starting to
incorporate I mean Michael sort of
mentioned this like ability to jump to
different places and that jump to
different places I think came from a
feeling off you know once you once you
accept an edit um was like man it should
be just really obvious where to go next
it's like it's like I I made this change
the model should just know that like the
next place to go to is like 18 lines
down like uh if you're if you're a Vim
user you could press 18jj or
whatever but like why why even why am I
doing this like the model the model
should just know it and then so so the
idea was you you just press tab it would
go 18 lines down and then make it would
show you show you the next edit and you
would press tab so it's just you as long
as you could keep pressing Tab and so
the internal competition was how many
tabs can we make them pressive once you
have like the idea uh more more uh sort
of abstractly the the thing to think
about is sort of like once how how how
are the edit sort of zero zero entropy
so once You' sort of expressed your
intent and the edit is there's no like
new bits of information to finish your
thought but you still have to type some
characters to like make the computer
understand what you're actually thinking
then maybe the model should just sort of
read your mind and and all the zero
entropy bits should just be like tabbed
away yeah that was that was sort of the
abstract there's this interesting thing
where if you look at language model loss
on different domains um I believe the bits per byte, which is kind of character-normalized loss, for code is lower than language, which means in general there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable.
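For reference, a common way to define the bits-per-byte metric mentioned here (conventions differ slightly between papers) is the total cross-entropy converted from nats to bits and normalized by the raw byte count rather than the token count:

$$\text{bits per byte} \;=\; \frac{1}{N_{\text{bytes}}\,\ln 2}\sum_{i=1}^{N_{\text{tokens}}} \mathcal{L}_i$$

where $\mathcal{L}_i$ is the cross-entropy loss in nats on the $i$-th token. Normalizing by bytes instead of tokens is what makes losses comparable across tokenizers and domains.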
And this is I
think even magnified when you're not
just trying to autocomplete code but
predicting what the user is going to do
next in their editing of existing code
and so you know the goal of Cursor Tab is
let's eliminate all the low entropy
actions you take inside of the editor
when the intent is effectively
determined let's just jump you forward
in time skip you forward well well
what's the intuition and what's the
technical details of how to do next
cursor prediction that jump that's not
that's not so intuitive I think to
people yeah I think I can speak to a few
of the details on how how to make these
things work they're incredibly low
latency so you need to train small
models on this on this task um in
particular they're incredibly pre-fill
token hungry what that means is they
have these really really long prompts
where they see a lot of your code and
they're not actually generating that
many tokens and so the perfect fit for
that is using a sparse model, meaning an MoE
model. So that was kind of one
breakthrough we made that
substantially improved its performance
at longer context the other being um a
variant of speculative decoding that we
we kind of built out called speculative
edits um these are two I think important
pieces of what make it quite high
quality um and very fast okay so MoE,
mixture of experts the input is huge the
output is small yeah okay so like what
what what else can you say about how to
make it like caching play a role in this
caching plays a huge role because you're dealing with this many input tokens, if every single keystroke that you're typing in a given line you had to rerun the model on all those tokens passed in, you're just going to, one, significantly degrade latency, two, you're going to kill your GPUs with load. So you need to design the actual prompts used for the model such that they're caching-aware, and then, yeah, you need to reuse the KV cache across requests just so that you're spending less work, less compute.
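To make "caching-aware" concrete, here is a minimal sketch under assumed conventions (the tag format and function are hypothetical, not Cursor's actual prompts): keep the large, slowly changing context in a stable prefix and put the volatile part last, so the KV cache computed on the previous keystroke can be reused.

```python
# Hypothetical sketch of a caching-aware prompt layout (not Cursor's actual
# prompts). The large, slowly changing context lives in a stable prefix; only
# the short suffix changes as the user types, so an inference server that
# caches KV states by prompt prefix reuses almost all of the prefill work on
# every keystroke.

def build_tab_prompt(file_context: str, recent_edits: str, current_line: str):
    """Return (stable_prefix, volatile_suffix) for one completion request."""
    stable_prefix = (
        "<file_context>\n" + file_context + "\n</file_context>\n"
        "<recent_edits>\n" + recent_edits + "\n</recent_edits>\n"
    )
    volatile_suffix = "<current_line>" + current_line + "</current_line>"
    return stable_prefix, volatile_suffix
```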
uh again what are the things that tab is
supposed to be able to do kind of in the
near term just to like sort of Linger on
that generate code like fill empty
space Also edit code across multiple
lines yeah and then jump to different
locations inside the same file yeah and
then like hopefully jump to different
files also so if you make an edit in one
file and maybe maybe you have to go
maybe you have to go to another file to
finish your thought it should it should
go to the second file also yeah and then
the full generalization is like next
next action prediction like sometimes
you need to run a command in the
terminal and it should be able to
suggest the command based on the code
that you wrote too um or sometimes you
actually need to like it suggest
something but you you it's hard for you
to know if it's correct because you
actually need some more information to
learn like you need to know the type to
be able to verify that it's correct and
so maybe it should actually take you to
a place that's like the definition of
something and then take you back so that
you have all the requisite knowledge to
be able to accept the next completion Al
also providing the human the knowledge
yes right yeah can you integrate like I
just uh gotten to know a guy named ThePrimeagen
who I believe has SSH so you can
order coffee via SSH
oh yeah oh we did that we did that uh so
can that also the model do that like
feed you and like yeah and provide you
with caffeine okay so that's the general
framework yeah and the the magic moment
would be
if it is programming is this weird
discipline where um sometimes the next
five minutes not always but sometimes
the next five minutes of what you're
going to do is actually predictable from
the stuff you've done recently and so
can you get to a world where that next 5
minutes either happens by you
disengaging and it taking you through or
maybe a little bit more of just you
seeing Next Step what it's going to do
and you're like okay that's good that's
good that's good that's good and you can
just sort of tap tap tap through these
big changes as we're talking about this
I should mention like one of the really
cool and noticeable things about cursor
is that there's this whole diff
interface situation going on so like the
model suggests with uh with the red and
the green of like here's how we're going
to modify the code and in the chat
window you can apply and it shows you
the diff and you can accept the diff so
maybe can you speak to whatever
direction of that we'll probably have
like four or five different kinds of
diffs uh so we we have optimized the
diff for for the autocomplete so that
has a different diff interface
than uh then when you're reviewing
larger blocks of code and then we're
trying to optimize uh another diff thing
for when you're doing multiple different
files uh and and sort of at a high level
the difference is for
when you're doing autocomplete it should
be really really fast to
read uh actually it should be really
fast to read in all situations but in
autocomplete it sort of you're you're
really like your eyes focused in one
area you you can't be in too many you
the humans can't look in too many
different places so you're talking about
on the interface side like on the
interface side so it currently has this
box on the side so we have the current
box and if it tries to delete code in
some place and tries to add other code
it tries to show you a box on the you
can maybe show it if we pull it up on
cursor. comom this is what we're talking
about so that it was like three or four
different attempts at trying to make
this this thing work where first the
attempt was like these blue crossed out
line so before it was a box on the side
it used to show you the code to delete
by showing you like uh like Google doc
style you would see like a line through
it then you would see the the new code
that was super distracting and then we
tried many different you know there was
there was sort of deletions there was
trying to Red highlight then the next
iteration of it which is sort of funny
Would you would hold the on Mac the
option button so it would it would sort
of highlight a region of code to show
you that there might be something coming
uh so maybe in this example like the
input and the value uh would get would
all get blue and the blue would to
highlight that the AI had a suggestion
for you uh so instead of directly
showing you the thing it would show you
that the AI it would just hint that the
AI had a suggestion and if you really
wanted to see it you would hold the
option button and then you would see the
new suggestion then if you release the
option button you would then see your
original code mhm so that's by the way
that's pretty nice but you have to know
to hold the option button yeah I by the
way I'm not a Mac User but I got it it
was it was it's a button I guess you
people
it's h you know it's again it's just
it's just nonintuitive I think that's
the that's the key thing and there's a
chance this this is also not the final
version of it I am personally very
excited for
um making a lot of improvements in this
area like uh we we often talk about it
as the verification problem where U
these diffs are great for small edits uh
for large edits or like when it's
multiple files or something it's um
actually
a little bit prohibitive to to review
these diffs and uh uh so there are like
a couple of different ideas here like
one idea that we have is okay you know
like parts of the diffs are important
they have a lot of information and then
parts of the diff um are just very low
entropy they're like exam like the same
thing over and over again and so maybe
you can highlight the important pieces
and then gray out the the not so
important pieces or maybe you can have a
model that uh looks at the the diff and
and sees oh there's a likely bug here I
will like Mark this with a little red
squiggly and say like you should
probably like review this part of the
diff um and ideas in in that vein I
think are exciting yeah that's a really
fascinating space of like ux design
engineering so you're basically trying
to guide the human programmer through
all the things they need to read and
nothing more yeah like optimally yeah
and you want an intelligent model to do
it. Like, currently, diffs, the diff algorithms, are
like, they're just like
normal algorithms uh there's no
intelligence uh there's like
intelligence that went into designing
the algorithm but then there there's no
like you don't care if the if it's about
this thing or this thing uh and so you
want a model to to do this so I think
the the the general question is like M
these models are going to get much
smarter as the models get much smarter
uh the the changes they will be able to
propose are much bigger so as the
changes gets bigger and bigger and
bigger the humans have to do more and
more and more verification work it gets
more and more more hard like it's just
you need you need to help them out it
sort of I I don't want to spend all my
time reviewing
code uh can you say a little more across
multiple files div yeah I mean so GitHub
tries to solve this right with code
review when you're doing code review
you're reviewing multiple diffs across
multiple files but like Arvid said
earlier I think you can do much better
than code review you know code review
kind of sucks like you spend a lot of
time trying to grok this code that's
often quite unfamiliar to you and it
often like doesn't even actually catch
that many bugs and I think you can
signific significantly improve that
review experience using language models
for example using the kinds of tricks
that AR had described of maybe uh
pointing you towards the regions that
matter
um I think also if the code is produced
by these language models uh and it's not
produced by someone else like the code
review experience is designed for both
the reviewer and the person that
produced the code in the case where the
person that produced the code is a
language model you don't have to care
that much about their experience and you
can design the entire thing around the
reviewer such that the reviewer's job is
as fun as easy as productive as possible
um and I think that that feels like the
issue with just kind of naively trying
to make these things look like code
review I think you can be a lot more
creative and and push the boundary and
what's possible just one one idea there
is I think ordering matters generally
when you review a PR you you have this
list of files and you're reviewing them
from top to bottom but actually like you
actually want to understand this part
first because that came like logically
first and then you want understand the
next part and um you don't want to have
to figure out that yourself you want a
model to guide you through the thing and
is the step of creation going to be more
and more natural language is the goal
versus with actual uh I think sometimes
I don't think it's going to be the case
that all of programming will be natural
language and the reason for that is you
know if I'm pair programming with Sualeh
and Sualeh is at the computer and the
keyboard uh and sometimes if I'm like
driving I want to say to Sualeh hey
like implement this function and that
that works and then sometimes it's just
so annoying to explain to Sualeh what I
want him to do and so I actually take
over the keyboard and I show him I I
write like part of the example and then
it makes sense and that's the easiest
way to communicate and so I think that's
also the case for AI like sometimes the
easiest way to communicate with the AI
will be to show an example and then it
goes and does the thing everywhere else
or sometimes if you're making a website
for example the easiest way to show to
the a what you want is not to tell it
what to do but you know drag things
around or draw things um and yeah and
and like maybe eventually we will get to
like brain machine interfaces or
whatever and can of like understand what
you're thinking and so I think natural
language will have a place I think it
will not definitely not be the way most
people program most of the time I'm
really feeling the AGI with this editor
uh it feels like there's a lot of
machine learning going on underneath
tell tell me about some of the ml stuff
that makes it all work. Cursor really
works via this ensemble of custom models
that that that we've trained alongside
you know the frontier models that are
fantastic at the reasoning intense
things and so cursor tab for example is
is a great example of where you can
specialize this model to be even better
than even Frontier models if you look at
evals on the task we set it at
the other domain which it's kind of
surprising that it requires custom
models but but it's kind of necessary
and works quite well is in apply
um
so I think these models are like the
frontier models are quite good at
sketching out plans for code and
generating like rough sketches of like
the change but
actually creating diffs is quite hard um
for frontier models, for your frontier
models. Like you try to do this with
Sonnet, with o1, any frontier model, and it
it really messes up stupid things like
counting line numbers um especially in
super super large file
um and so what we've done to alleviate
this is we let the model kind of sketch
out this rough code block that indicates
what the change will be and we train a
model to then apply that change to the
file and we should say that apply is the
model looks at your code it gives you a
really damn good suggestion of what new
things to do and the seemingly for
humans trivial step of combining the two
you're saying is not so trivial contrary
to popular perception it is not a
deterministic algorithm yeah I I I think
like you see shallow copies of apply um
elsewhere and it just breaks like most
of the time because you think you can
kind of try to do some deterministic
matching and then it fails you know at
least 40% of the time and that just
results in a terrible product
experience um I think in general this
this regime of you are going to get
smarter models and like so one other
thing that apply lets you do is it lets
you use fewer tokens with the most
intelligent models uh this is both
expensive in terms of latency for
generating all these tokens um and cost
so you can give this very very rough
sketch and then have your smaller models
go and implement it because it's a much
easier task to implement this very very
sketched out code and I think that this
this regime will continue where you can
use smarter and smarter models to do the
planning and then maybe the
implementation details uh can be handled
by the less intelligent ones perhaps
you'll have you know maybe o1, maybe
it'll be even more capable models,
given an even higher-level plan that is
kind of recursively applied by Sonnet
and then the apply model.
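As a rough sketch of the two-stage regime just described (a big model plans, a smaller apply model merges the change), here is what the flow might look like, with a made-up call_model placeholder rather than any real API:

```python
# Illustrative sketch of the plan-then-apply regime described above.
# `call_model` is a generic placeholder, not a real API; model names are made up.

def propose_and_apply(file_text: str, instruction: str, call_model) -> str:
    # Stage 1: a large frontier model sketches the change. The sketch may elide
    # unchanged code and get details like line numbers wrong.
    rough_sketch = call_model(
        model="frontier-model",
        prompt=(
            f"File:\n{file_text}\n\nTask: {instruction}\n"
            "Sketch only the code that should change."
        ),
    )
    # Stage 2: a smaller, specialized "apply" model merges the sketch into the
    # original file and emits the full, updated file contents.
    return call_model(
        model="apply-model",
        prompt=(
            f"Original file:\n{file_text}\n\nProposed change:\n{rough_sketch}\n"
            "Rewrite the entire file with the change applied."
        ),
    )
```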
Maybe we should talk about how to make
it fast. Yeah, I feel like fast is always
an interesting detail. Fast is good. Yeah, how
do you make it fast yeah so one big
component of making it it fast is
speculative edits so speculative edits
are a variant of speculative decoding
and maybe be helpful to briefly describe
speculative decoding um with speculative
decoding what you do is you you can kind
of take advantage of the fact that you
know most of the time and I I'll add the
caveat that this applies when you're
memory-bound in language model
generation: if you process multiple
tokens at once, it is faster than
generating one token at a time so this is
like the same reason why if you look at
tokens per second uh with prompt tokens
versus generated tokens it's much much
faster for prompt tokens um so what we
do is instead of using what speculative
decoding normally does which is using a
really small model to predict these
draft tokens that your larger model
would then go in and and verify um with
code edits we have a very strong prior
of what the existing code will look like
and that prior is literally the same
exact code so you can do is you can just
feed chunks of the original code back
into the into the model um and then the
model will just pretty much agree most
of the time that okay I'm just going to
spit this code back out and so you can
process all of those lines in parallel
and you just do this with sufficiently
many chunks and then eventually you'll
reach a point of disagreement where the
model will now predict text that is
different from the ground truth original
code it'll generate those tokens and
then we kind of will decide after enough
tokens match
uh the original code to re start
speculating in chunks of code what this
actually ends up looking like is just a
much faster version of normal editing
code so it's just like it looks like a
much faster version of the model
rewriting all the code so just we we can
use the same exact interface that we use
for for diffs but it will just stream
down a lot faster and then and then the
advantage is that while it's streaming you can just also be reviewing, start reviewing the code before it's done, so there's no big loading screen. So maybe that is part of the advantage: the human can start reading before the thing is done.
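Here is a minimal sketch of speculative edits as described, assuming a hypothetical model interface (greedy_tokens and generate_until_resync are placeholders, not a real API): the original file serves as the draft, chunks of it are verified in parallel, and normal one-token-at-a-time generation happens only where the rewrite diverges.

```python
# Minimal sketch of speculative edits, using the original code as the draft.
# `model.greedy_tokens` and `model.generate_until_resync` are hypothetical
# placeholders: the first scores every draft position in one batched forward
# pass, the second falls back to normal decoding until the output lines up
# with the original file again and returns the new alignment position.

def speculative_edit(model, prompt, original_tokens, chunk_size=64):
    out, i = [], 0
    while i < len(original_tokens):
        draft = original_tokens[i:i + chunk_size]
        # One forward pass: what the model would emit at each draft position.
        predicted = model.greedy_tokens(prompt, out, draft)
        agree = 0
        while agree < len(draft) and predicted[agree] == draft[agree]:
            agree += 1
        out.extend(draft[:agree])   # accepted tokens stream out almost instantly
        i += agree
        if agree < len(draft):
            # Divergence: decode normally until the output re-synchronizes with
            # the original code, then resume speculating in chunks.
            new_tokens, i = model.generate_until_resync(prompt, out, original_tokens, i)
            out.extend(new_tokens)
    return out
```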
I think the interesting riff here
is something like like speculation is a
fairly common idea nowadays it's like
not only in language models I mean
there's obviously speculation in CPUs
and there's there like speculation for
databases and like speculation all over
the place let me ask the sort of the
ridiculous question of uh which llm is
better at coding GPT Claude who wins in
the context of programming and I'm sure
the answer is much more Nuance because
it sounds like every single part of this
involves a different
model yeah I think they there's no model
that Pareto dominates others, meaning it
is better in all categories that we
think matter the categories being
speed
um ability to edit code ability to
process lots of code long context you
know a couple of other things and kind
of coding
capabilities the one that I'd say right
now is just kind of net best is Sonnet. I
think this is a consensus opinion.
o1 is really interesting and it's really
good at reasoning so if you give it
really hard uh programming interview
style problems or lead code problems it
can do quite quite well on them um but
it doesn't feel like it kind of
understands your rough intent as well as
Sonnet
does like if you look at a lot of the
other frontier models um one qualm I have
is it feels like they're not necessarily
overfit, I'm not saying they train on
benchmarks um but they perform really
well in benchmarks relative to kind of
everything that's kind of in the middle
so if you tried on all these benchmarks
and things that are in the distribution
of the benchmarks they're valuated on
you know they'll do really well but when
you push them a little bit outside of
that, Sonnet is I think the one that
kind of does best at at kind of
maintaining that same capability like
you kind of have the same capability in
The Benchmark as when you try to
instruct it to do anything with coding
what another ridiculous question is the
difference between the normal
programming experience versus what
benchmarks represent like where do
benchmarks fall short do you think when
we're evaluating these models by the way
that's like a really really hard it's
like like critically important detail
like how how different like benchmarks
are versus where is like real coding
where real
coding it's not interview style coding
it's you're you're doing these you know
humans are saying like half broken
English sometimes and sometimes you're
saying like oh do what I did
before sometimes you're saying uh you
know go add this thing and then do this
other thing for me and then make this UI
element and then you know it's it's just
like a lot of things are sort of context
dependent
you really want to like understand the
human and then do do what the human
wants as opposed to sort of this maybe
the the way to put it is sort of
abstractly is uh the interview problems
are
very well
specified they lean a lot on
specification while the human stuff is
less
specified yeah I think that this sort of question
is both complicated by what
Sualeh just mentioned and then also to
what Aman was getting into is that even
if you like you know there's this
problem of like the skew between what
can you actually model in a benchmark
versus uh real programming and that can
be sometimes hard to encapsulate because
it's like real programming is like very
messy and sometimes things aren't super
well specified what's correct or what
isn't but then uh it's also doubly hard
because of this public Benchmark problem
and that's both because public
benchmarks are sometimes kind of Hill
climbed on then it's like really really
hard to also get the data from the
public benchmarks out of the models and
so for instance like one of the most
popular like agent benchmarks, SWE-bench,
is really really contaminated
in the training data of uh these
Foundation models and so if you ask
these foundation models to do a SWE-bench
problem but you actually don't give
them the context of a codebase they can
like hallucinate the right file paths
they can hallucinate the right function
names um and so the the it's it's also
just the public aspect of these things
is tricky yeah like in that case it
could be trained on the literal issues
or pull requests themselves and maybe
the labs will start to do a better job
um or they've already done a good job at
decontaminating those things but they're
not going to emit the actual training
data of the repository itself like these
are all like some of the most popular
python repositories like SymPy is one
example I don't think they're going to
handicap their models on SymPy and all
these popular Python repositories in
order to get uh true evaluation scores
in these benchmarks yeah I think that
given the dearth in benchmarks
um there have been like a few
interesting crutches that uh places that
build systems with these models or build
these models actually use to get a sense
of are they going in the right direction
or not and uh in a lot of places uh
people will actually just have humans
play with the things and give
qualitative feedback on these um like
one or two of the foundation model
companies they they have people who
that's that's a big part of their role
and you know internally we also uh you
know qualitatively assess these models
and actually lean on that a lot in
addition to like private evals that we
have. It's like the vibe.
The vibe, yeah, the vibe benchmark, human benchmark.
You pull in the humans to do a vibe check. Yeah, okay, I
mean that's that's kind of what I do
like just like reading online forums and
Reddit and X just like well I don't know
how
to properly load in people's opinions
because they'll say things like I feel
like Claude or gpt's gotten Dumber or
something they'll say I feel like
and then I sometimes feel like that too
but I wonder if it's the model's problem
or mine yeah with Claude there's an
interesting take I heard where I think
AWS has different chips um and I I
suspect they have slightly different numerics than Nvidia GPUs, and someone speculated that Claude's degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus whatever was running on Anthropic's GPUs. I interview a
bunch of people that have conspiracy
theories so I'm glad spoke spoke to this
conspiracy well it's it's not not like
conspiracy theory as much as they're
just they're like they're you know
humans humans are humans and there's
there's these details and you know
you're
doing like this crazy amount of flops
and you know chips are messy and man you
can just have bugs like bugs are it's
it's hard to overstate how how hard bugs
are to avoid what's uh the role of a
good prompt in all this see you mention
that benchmarks have
really uh structured well formulated
prompts what what should a human be
doing to maximize success and what's the
importance of what the humans you wrote
a blog post on you called it prompt
design yeah uh I think it depends on
which model you're using and all of them
are likly different and they respond
differently to different prompts but um
I think the original GPT-4 and the
original sort of breed of models last
year they were quite sensitive to the
prompts and they also had a very small
context window and so we have all of
these pieces of information around the
codebase that would maybe be relevant in
the prompt like you have the docs you
have the files that you add you have the
conversation history and then there's a
problem like how do you decide what you
actually put in the prompt and when you
have a a limited space and even for
today's models even when you have long
context filling out the entire context
window means that it's slower it means
that sometimes a model actually gets
confused and some models get more
confused than others and we have this
one system internally that we call Priompt
which helps us with that a little bit um
and I think it was built for the era
before where we had
8,000 uh token context Windows uh and
it's a little bit similar to when you're
making a website you you sort of you you
want it to work on mobile you want it to
work on a desktop screen and you have
this uh Dynamic information which you
don't have for example if you're making
like designing a print magazine you have
like you know exactly where you can put
stuff but when you have a website or
when you have a prompt you have these
inputs and then you need to format them so they
will always work even if the input is
really big then you might have to cut
something down uh and and and so the
idea was okay like let's take some
inspiration what's the best way to
design websites well um the thing that
we really like is is react and the
declarative approach where you um you
use jsx in in in JavaScript uh and then
you declare this is what I want and I
think this has higher priority or like
this has higher Z index than something
else um and
then you have this rendering engine in
web design it's it's like Chrome and uh
in our case it's a Priompt renderer uh which
then fits everything onto the page and
and so you declaratively decide what you
want and then it figures out what you
want um and and so we have found that to
be uh quite helpful and I think the role
of it has has sort of shifted over time
um where initially was to fit to these
small context Windows now it's really
useful because you know it helps us with
splitting up the data that goes into the
prompt and the actual rendering of it
and so um it's easier to debug because
you can change the rendering of the
prompt and then try it on Old prompts
because you have the raw data that went
into the prompt and then you can see did
my change actually improve it for for
like this entire eval set so do you
literally prompt with jsx yes yes so it
kind of looks like react there are
components like we have one component
that's a file component and it takes in
like the cursor
like usually there's like one line where
the cursor is in your file and that's
like probably the most important line
because that's the one you're looking at
and so then you can give priorities so
like that line has the highest priority
and then you subtract one for every line
that uh is farther away and then
eventually when it's rendered it figures
out how many lines can I actually fit
and it centers around that thing that's
amazing. Yeah, and you can do other fancy things where, if you have lots of code blocks from the entire codebase, you could use retrieval and things like embedding and re-ranking scores to add priorities for each of these components.
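As a toy illustration of the declarative, priority-based rendering idea (a minimal Python sketch rather than JSX, and not the actual library Cursor uses), each piece of context declares a priority and the renderer keeps the highest-priority pieces that fit the token budget, emitting them in their original order:

```python
# Toy sketch of priority-based prompt rendering (not Cursor's actual library).
# Each piece of context declares a priority; the renderer keeps the most
# important pieces that fit the token budget and emits them in document order,
# so the same prompt "source" degrades gracefully as the budget shrinks.

from dataclasses import dataclass

@dataclass
class Piece:
    text: str
    priority: float  # higher = more important

def count_tokens(text: str) -> int:
    # Crude stand-in tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)

def render(pieces: list[Piece], budget: int) -> str:
    order = {id(p): i for i, p in enumerate(pieces)}
    kept, used = [], 0
    for piece in sorted(pieces, key=lambda p: -p.priority):  # most important first
        cost = count_tokens(piece.text)
        if used + cost <= budget:
            kept.append(piece)
            used += cost
    kept.sort(key=lambda p: order[id(p)])  # restore original (document) order
    return "\n".join(p.text for p in kept)

# Example: prioritize file lines by distance from the cursor line, as described.
lines = ["def f():", "    x = 1", "    y = 2", "    return x + y"]
cursor_line = 2
pieces = [Piece(t, priority=100 - abs(i - cursor_line)) for i, t in enumerate(lines)]
print(render(pieces, budget=6))  # keeps the lines closest to the cursor
```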
So should humans, when they ask questions, also try to use something like that? Like, would it be beneficial to write JSX in the prompt, or should the whole idea be that it's loose and messy?

I think our goal is
kind of that you should just uh do
whatever is the most natural thing for
you and then our job is to figure out
how do we actually like retrieve the
relevant things so that your thing
actually makes sense well this is sort
of the discussion I had with Aravind of
perplexity is like his whole idea is
like you should let the person be as
lazy as he want but like yeah that's a
beautiful thing but I feel like you're
allowed to ask more of programmers right
so like if you say just do what you want
I mean humans are lazy there's a kind of
tension between just being lazy versus
like provide more is uh be prompted
almost like the system
pressuring you or inspiring you to be
articulate not in terms of the grammar
of the sentences but in terms of the
depth of thoughts that you convey inside
the prompts I think even as a
system gets closer to some level of
perfection often when you ask the model
for something you just are not not
enough intent is conveyed to know what
to do and there are like a few ways to
resolve that intent one is the simple
thing of having model just ask you I'm
not sure how to do these parts based on
your query could you clarify that um I
think the other could be
maybe if you there are five or six
possible Generations given the
uncertainty present in your query so far
why don't we just actually show you all
of those and let you pick
them. How hard is it for the model to
choose to talk back versus
generating? That's hard, sort of, like
how to deal with the
uncertainty do I do I choose to ask for
more information to reduce the ambiguity
so I mean one of the things we we do is
um it's like a recent addition is try to
suggest files that you can add so and
while you're typing uh one can guess
what the uncertainty is and maybe
suggest that like you know maybe maybe
you're writing your API
and uh we can guess using the
commits uh that you've made previously
in the same file that the client and the
server is super useful and uh there's
like a hard technical problem of how do
you resolve it across all commits which
files are the most important given your
current prompt and we still sort of uh
initial version is rolled out and I'm
sure we can make it much more
accurate uh it's it's it's very
experimental but then the idea is we show
you like do you just want to add this
file this file this file also to tell
you know the model to edit those files
for you uh because if if you're maybe
you're making the API like you should
also edit the client and the server that
is using the API and the other one
resolving the API and so that would be
kind of cool as both there's the phase
where you're writing the prompt and
there's before you even click enter
maybe we can help resolve some of the
uncertainty to what degree do you use uh
agentic approaches how useful are agents
we think agents are really really cool
like I I I think agents is like uh it's
like resembles sort of like a human it's
sort of like the like you can kind of
feel that it like you're getting closer
to AGI because you see a demo where um
it acts as as a human would and and it's
really really cool I think um agents are
not yet super useful for many things
they I think we're we're getting close
to where they will actually be useful
and so I think uh there are certain
types of tasks where having an agent
would be really nice like I would love
to have an agent for example if like we
have a bug where you sometimes can't
command C and command V uh inside our
chat input box and that's a task that's
super well specified I just want to say
like in two sentences this does not work
please fix it and then I would love to
have an agent that just goes off does it
and then uh a day later I I come back
and I review the the thing you mean it
goes finds the right file yeah it finds
the right files it like tries to
reproduce the bug it like fixes the bug
and then it verifies that it's correct
and this is could be a process that
takes a long time um and so I think I
would love to have that uh and then I
think a lot of programming like there is
often this belief that agents will take
over all of programming um I don't think
we think that that's the case because a
lot of programming a lot of the value is
in iterating or you don't actually want
to specify something upfront because you
don't really know what you want until
youve seen an initial version and then
you want to iterate on that and then you
provide more information and so for a
lot of programming I think you actually
want a system that's instant that gives
you an initial version instantly back
and then you can iterate super super
quickly uh what about something like
that recently came out, Replit Agent, that
does also like setting up the
development environment installing
software packages configuring everything
configuring the databases and actually
deploying the app yeah is that also in
the set of things you dream about I
think so I think that would be really
cool for for certain types of
programming uh it it would be really
cool is that within scope of cursor yeah
we're aren't actively working on it
right now um but it's definitely like we
want to make the programmer's life
easier and more fun and some things are
just really tedious and you need to go
through a bunch of steps and you want to
delegate that to an agent um and then
some things you can actually have an
agent in the background while you're
working like let's say you have a PR
that's both backend and front end and
you're working in the front end and then
you can have a background agent that
does some work and figures out kind of what
you're doing and then when you get to
the backend part of your PR then you
have some like initial piece of code
that you can iterate on um and and so
that that would also be really cool one
of the things we already talked about is
speed but I wonder if we can just uh
Linger on that some more in the the
various places that uh the technical
details involved in making this thing
really fast so every single aspect of
cursor most aspects of cursor feel
really fast like I mentioned the apply
is probably the slowest thing and for me
sorry, the
pain. I know, it's a pain, it's a pain
that we're feeling and we're working on
fixing it uh
yeah I mean it says something that
something that feels I don't know what
it is like 1 second or two seconds that
feels slow that means that's actually
shows that everything else is just
really really fast um so is there some
technical details about how to make some
of these models so how to make the chat
fast how to make the diffs fast is there
something that just jumps to mind yeah I
mean so we can go over a lot of the
strategies that we use one interesting
thing is cache warming um and so what you
can do is, as the user is typing, you can know that you're probably going to use some piece of context, and you can know that before the user's done typing. So, you know, as we discussed before, reusing the KV cache results in lower latency, lower cost across requests. So as a user starts typing, you can immediately warm the cache with, let's say, the current file contents, and then when they've pressed enter, there are very few tokens it actually has to prefill and compute before starting the generation. This will significantly lower the time to first token.
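A small sketch of what cache warming can look like, with a hypothetical inference client (the session and method names are made up, not a real API):

```python
# Hypothetical sketch of cache warming. While the user is still typing, send a
# prefill-only request containing the context we already know will be in the
# prompt; by the time they press enter, only the short final message needs to
# be prefilled, so time to first token drops. `client` and its methods are
# illustrative placeholders, not a real API.

async def on_user_typing(client, session_id: str, file_contents: str):
    # Warm the KV cache for the stable context now; generate nothing yet.
    await client.prefill(
        session=session_id,
        prompt=f"<file>\n{file_contents}\n</file>\n",
    )

async def on_user_submit(client, session_id: str, user_message: str):
    # Only the user's message still needs prefilling before generation starts.
    return await client.generate(
        session=session_id,
        prompt_suffix=f"<user>{user_message}</user>\n<assistant>",
    )
```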
Can you explain how the KV cache works?

Yeah, so the way
Transformers work, one of the mechanisms that allows Transformers to not just independently look at each token, but to see previous tokens, is the keys and values in attention. And generally the way attention works is you have, at your current token, some query, and then you have all the keys and values of all your previous tokens, which are some kind of representation that the model stores internally of all the previous tokens in the prompt
and like by default when you're doing a
chat the model has to for every single
token do this forward pass through the
entire uh model that's a lot of Matrix
multiplies that happen and that is
really really slow instead if you have
already done that and you stored the
keys and values and you keep that in the
GPU then when I'm let's say I have
stored it for the last N tokens, if I now want to compute the output token for the N-plus-one-th token, I don't need to pass those first N tokens through the
entire model because I already have all
those keys and values and so you just
need to do the forward pass through that
last token and then when you're doing
attention you're reusing those keys and values that have been computed, which is the only kind of sequential, or sequentially dependent, part of the Transformer.
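As a rough sketch of what the KV cache buys you during decoding (illustrative only, not Cursor's or any provider's actual code): each new token computes its query, key, and value once, appends the key and value to the cache, and attends over everything stored instead of re-running earlier tokens through the model.

```python
# Minimal single-head illustration of KV caching during decoding (illustrative only).
import numpy as np

d = 64  # head dimension (arbitrary for the sketch)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []  # keys/values of previous tokens; in practice these stay on the GPU

def decode_step(x):
    """x: embedding of the newest token only, shape (d,)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                          # store this token's key and value once
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(q @ K.T / np.sqrt(d))     # attend over everything cached
    return scores @ V                          # earlier tokens never go through the model again

for _ in range(5):
    out = decode_step(rng.normal(size=d))
```

Prefill (processing the prompt) fills this cache in one pass, which is why warming it early makes the first generated token arrive sooner.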
is there like higher level caching, like caching of the prompts or that kind of stuff, that could help? yeah, there are other types of caching
you can kind of do um one interesting
thing that you can do for cursor tab
is you can basically predict ahead as if
the user would have accepted the
suggestion and then trigger another uh
request
and so then you've cached, you've done the speculative... it's a mix of
speculation and caching right because
you're speculating what would happen if
they accepted it and then you have this
value that is cached, this
suggestion and then when they press tab
the next one would be waiting for them
immediately. it's a kind of clever heuristic slash trick that uses higher level caching, and it can feel fast despite there not actually being any changes in the model.
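A rough sketch of that speculation-plus-caching trick (hypothetical names; Cursor's real request and cache plumbing is obviously more involved): while a suggestion is on screen, fire a second request as if the user had already accepted it, so the follow-up suggestion is waiting when they hit tab.

```python
# Hypothetical sketch of speculating one step past the current suggestion.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
speculative_cache = {}

def request_completion(document_text: str) -> str:
    """Stand-in for the actual model call (assumed interface)."""
    return " next_edit()"

def show_suggestion(document_text: str) -> str:
    suggestion = request_completion(document_text)
    # speculate: pretend the user already pressed tab and warm up the next suggestion now
    accepted_doc = document_text + suggestion
    speculative_cache[accepted_doc] = executor.submit(request_completion, accepted_doc)
    return suggestion

def on_tab_accepted(document_text: str, suggestion: str) -> str:
    accepted_doc = document_text + suggestion
    future = speculative_cache.pop(accepted_doc, None)
    # if the speculation matched, the next suggestion is already (nearly) ready
    return future.result() if future else request_completion(accepted_doc)
```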
and if you can make the KV cache smaller, one of the advantages you get is maybe you can speculate even more, maybe you can get, say, 10
things that you know could be useful I
like, predict the next 10, and then it's possible the user hits one of the 10. it's a much higher chance that the user hits one of those than the exact one that you show them. maybe they type another character and sort of hit something else
in the cache yeah so there's there's all
these tricks where um the the general
phenomena here is uh I think it's it's
also super useful for RL is you know may
maybe a single sample from the model
isn't very good but if you
predict like 10 different things uh
turns out that one of the 10 uh that's
right is the probability is much higher
there's these pass@k curves, and part of what RL does is you can exploit this pass@k phenomenon to make many different predictions.
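For reference, the standard unbiased pass@k estimator (the convention popularized by the Codex paper) captures the phenomenon being described: even if any single sample is unlikely to be right, the chance that at least one of k samples is right can be much higher.

```python
# Unbiased pass@k estimate: draw n samples, c of them are correct, budget is k.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=20, k=1))   # ~0.20: a single sample is right 20% of the time
print(pass_at_k(n=100, c=20, k=10))  # ~0.89: one of ten samples is right far more often
```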
and one way to think about this is the model sort
of knows internally has like has some
uncertainty over like which of the key
things is correct or like which of the
key things does the human want. when we RL our cursor tab model, one
of the things we're doing is we're
predicting which like which of the
hundred different suggestions the model produces is more amenable to humans, like which of them do humans like more
than other things uh maybe maybe like
there's something with the model can
predict very far ahead versus like a
little bit and maybe somewhere in the
middle and and you just and then you can
give a reward to the things that humans
would like more and sort of punish the things that humans wouldn't like, and sort
of then train the model to Output the
suggestions that humans would like more
you have these RL loops that are very useful, that exploit these pass@k curves. Aman can maybe go into even more detail. yeah, it's a
little it is a little different than
speed um but I mean like technically you
tie it back in because you can get away
with a smaller model if you RL your smaller model and it gets the same performance as the bigger one. and Sualeh was mentioning stuff about
KV about reducing the size of your KV
cach there there are other techniques
there as well that are really helpful
for Speed um so kind of back in the day
like all the way two years ago uh people
mainly used multi-head attention, and I
think there's been a migration towards
more uh efficient attention schemes like
group query um or multiquery attention
and this is really helpful for then uh
with larger batch sizes being able to
generate the tokens much faster the
interesting thing here is um this now
has no effect on that uh time to First
token pre-fill speed uh the thing this
matters for is uh now generating tokens
and and why is that because when you're
generating tokens instead of uh being
bottlenecked by doing the super parallelizable matrix multiplies across all your tokens, you're bottlenecked, for long context with large batch sizes, by how quickly you can read those cached keys and values. and
so then how that that's memory bandwidth
and how can we make this faster we can
try to compress the size of these keys
and values so multiquery attention is
the most aggressive of these um where
normally with multi-head attention you
have some number of quote unquote attention heads and some number of query heads. multi-query just preserves the query heads and gets rid of all the key-value heads, so there's only one kind of key-value head and there's all the remaining query heads.
with group query um you instead you know
preserve all the query heads and then
your keys and values, there are fewer heads for the keys and values, but you're not reducing it to just one. but anyways, the whole
point here is you're just reducing the
size of your KV cache.
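A back-of-the-envelope way to see why fewer key-value heads shrink the cache; the numbers below are purely illustrative, not any particular model's real configuration.

```python
# KV cache size scales linearly with the number of KV heads (illustrative numbers only).
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; bytes_per_elem=2 assumes fp16 storage
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

cfg = dict(n_layers=32, head_dim=128, seq_len=8192, batch=8)
print("multi-head (32 KV heads):  ", kv_cache_gb(n_kv_heads=32, **cfg), "GB")
print("grouped-query (8 KV heads):", kv_cache_gb(n_kv_heads=8, **cfg), "GB")
print("multi-query (1 KV head):   ", kv_cache_gb(n_kv_heads=1, **cfg), "GB")
```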
and then there is MLA? yeah, MLA, multi-head latent attention. that's a
little more complicated and the way that
this works is it kind of turns the
entirety of your keys and values across
all your heads into this kind of one latent vector that is then kind of expanded at inference time. MLA is from this company called DeepSeek. it's
it's quite an interesting algorithm uh
maybe the key idea is, in both MQA and in other places, what you're doing is sort of reducing the number of KV heads. the
advantage you get from that is is you
know there's less of them but uh maybe
the theory is that you actually want a
lot of different uh like you want each
of the the keys and values to actually
be different so one way to reduce the
size is you keep
uh one big shared Vector for all the
keys and values and then you have
smaller vectors for every single token, so that you can store only the smaller thing, as some sort of low-rank reduction. and at the end, when you eventually want to compute the final thing, remember that you're memory bound,
which means that like you still have
some some compute left that you can use
for these things and so if you can
expand the um the latent vector
back out and and somehow like this is
far more efficient because you're reducing, for example, maybe by like 32x or something, the size of the vector that you're keeping. yeah, there's perhaps
some richness in having a separate uh
set of keys and values and query that
kind of pairwise match up, versus compressing that all into one, and that interaction at least.
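A very rough sketch of the low-rank idea just described (illustrative shapes only; DeepSeek's actual MLA formulation also handles rotary embeddings and per-head details differently): store one small latent per token, and spend a little of the spare compute at attention time expanding it back into keys and values.

```python
# Store a small per-token latent; expand it to full keys/values only when attention runs.
import numpy as np

d_model, d_latent, n_tokens = 1024, 64, 5
rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, d_latent))   # shared compression
W_up_k = rng.normal(size=(d_latent, d_model))   # expand latent -> keys
W_up_v = rng.normal(size=(d_latent, d_model))   # expand latent -> values

# what actually sits in the cache: one 64-dim latent per token instead of full K and V
latent_cache = [rng.normal(size=d_model) @ W_down for _ in range(n_tokens)]

# decoding is memory-bandwidth bound, so this extra matmul at attention time is relatively cheap
K = np.stack([z @ W_up_k for z in latent_cache])
V = np.stack([z @ W_up_v for z in latent_cache])
print(K.shape, V.shape)  # (5, 1024) (5, 1024), reconstructed from 64-dim latents
```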
okay, and all of that is dealing with being
memory bound yeah
and what I mean ultimately how does that
map to the user experience trying to get
the yeah the the two things that it maps
to is you can now make your cache a lot larger, because you've got less space allocated for the KV cache. you can maybe cache a lot more aggressively and a lot more things, so you get more cache hits,
which are helpful for reducing the time
to First token for the reasons that were
kind of described earlier and then the
second being when you start doing
inference with more and more requests
and larger and larger batch sizes you
don't see much of a slowdown in the speed of generating the tokens.
what it also allows you to make your
prompt bigger for certain yeah yeah so
like the basic the size of your KV cache
is uh both the size of all your prompts
multiply by the number of prompts being
processed in parallel so you could
increase either those Dimensions right
the batch size or the size of your
prompts without degrading the latency of
generating tokens Arvid you wrote a blog
post Shadow workspace iterating on code
in the background yeah so what's going
on uh so to be clear we want there to be
a lot of stuff stuff happening in the
background and we're experimenting with
a lot of things uh right now uh we don't
have much of that happening other than the cache warming, or figuring out the right context that goes into your Cmd+K prompts, for
example uh but the idea is if you can
actually spend computation in the
background then you can help um help the
user maybe like at a slightly longer
time Horizon than just predicting the
next few lines that you're going to make
but actually, in the next 10 minutes, what are you going to make? and by doing it in the background, you can spend more computation doing that. and so
the idea of the Shadow workspace that
that we implemented and we use it
internally for like experiments um is
that to actually get advantage of doing
stuff in the background you want some
kind of feedback signal to give give
back to the model because otherwise like
you can get higher performance by just
letting the model think for longer um
and and so like o1 is a good example of
that but another way you can improve
performance is by letting the model
iterate and get feedback and and so one
very important piece of feedback when
you're a programmer is um the language
server, which is this thing that exists for most different languages, and there's a separate language server per language. it can tell you, you know, you're using the wrong type here and then gives you an error, or it can allow
you to go to definition and sort of
understands the structure of your code
so language servers are extensions
developed by, like, there's a TypeScript language server developed by the TypeScript people, a Rust language server developed by the Rust people, and then they all interface over the language server
protocol to vs code so that vs code
doesn't need to have all of the
different languages built into vs code
but rather uh you can use the existing
compiler infrastructure for linting
purposes what it's for it's for linting
it's for going to definition uh and for
like seeing the the right types that
you're using uh so it's doing like type
checking also yes type checking and and
going to references um and that's like
when you're working in a big project you
you kind of need that if you if you
don't have that it's like really hard to
to code in a big project can you say
again how that's being used inside
cursor the the language server protocol
communication thing so it's being used
in cursor to show to the programmer just like in VS Code, but then the idea is you want to show that same information to the models, the AI models, and you want
to do that in a way that doesn't affect
the user because you wanted to do it in
background and so the idea behind the
shadow workspace was, okay, one way we can do this is we spawn a separate window of cursor that's hidden. so you can set this flag in Electron so it's hidden; there is a window, but you don't
actually see it and inside of this
window uh the AI agents can modify code
however they want um as long as they
don't save it because it's still the
same folder, and then they can get feedback from the linters and go to definition and iterate on their code.
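A hedged sketch of the kind of loop the shadow workspace enables (the function names here are assumptions for illustration, not Cursor's internals): the agent edits an in-memory copy, asks the language server for diagnostics, and keeps iterating until things are clean.

```python
# Illustrative feedback loop: edit in memory, check language-server diagnostics, repeat.
def propose_edit(task: str, text: str) -> str:
    """Stand-in for a model call that returns a modified file (assumed interface)."""
    return text

def get_diagnostics(text: str) -> list[str]:
    """Stand-in for querying the language server (lints, type errors) over LSP."""
    return []

def refine_in_shadow_workspace(task: str, file_text: str, max_iters: int = 5) -> str:
    current = file_text
    for _ in range(max_iters):
        current = propose_edit(task, current)   # the edit only ever exists in memory, never saved
        problems = get_diagnostics(current)     # same signal a human sees: lints, wrong types, etc.
        if not problems:
            break                               # clean: surface the result to the user
        task = task + "\nThe language server reports: " + "; ".join(problems)
    return current
```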
so, like, literally run everything in the background, as if... right? yeah, maybe
even run the code so that's the eventual
version okay that's what you want and a
lot of the blog post is actually about
how do you make that happen because it's
a little bit tricky you want it to be on
the user's machine so that it exactly
mirrors the user's environment
and then on Linux you can do this cool
thing where you can actually mirror the
file system and have the AI make changes
to the files and and it thinks that it's
operating on the file level but actually
that's stored in in memory and you you
can uh create this kernel extension to
to make it work um whereas on Mac and
windows it's a little bit more difficult
but it's a fun technical problem. one maybe hacky but interesting idea that I
like is holding a lock on saving and so
basically you can then have the language
model kind of hold the lock on on saving
to disk and then instead of you
operating in the ground truth version of
the files that are saved to disk, you actually are operating on what was the shadow workspace before, and these unsaved things that only exist in memory that you still get lint errors for, and
you can code in and then when you try to
maybe run code it's just like there's a
small warning that there's a lock and
then you kind of will take back the lock
from the language server if you're
trying to do things concurrently or from
the the shadow workspace if you're
trying to do things concurrently that's
such an exciting feature by the way. it's
a bit of a tangent but like to allow a
model to change files it's scary for
people but like it's really cool to be
able to just like let the agent do a set
of tasks and you come back the next day
and kind of observe like it's a
colleague or something like that yeah
yeah and I think there may be different
versions of like runability
where for the simple things where you're
doing things in the span of a few
minutes on behalf of the user as they're
programming it makes sense to make
something work locally in their machine
I think for the more aggressive things
where you're making larger changes that
take longer periods of time you'll
probably want to do this in some sandbox
remote environment and that's another
incredibly tricky problem of how do you
exactly reproduce or mostly reproduce to
the point of it being effectively
equivalent for running code the user's
environment which is remote remote
sandbox I'm curious what kind of Agents
you want for for coding oh do you want
them to find bugs do you want them to
like Implement new features like what
agents do you want so by the way when I
think about agents I don't think just
about coding. so for this particular podcast, in practice, there's video editing, and if you look in Adobe, there's a lot of code behind it. it's very poorly documented code, but you
can interact with premiere for example
using code and basically all the
uploading everything I do on YouTube
everything as you could probably imagine
I do all of that through code and so and
including translation and overdubbing
all this so I Envision all those kinds
of tasks so automating many of the tasks
that don't have to do directly with the
editing so that okay that's what I was
thinking about but in terms of coding I
would be fundamentally thinking about
bug
finding like many levels of kind of bug
finding and also bug finding like
logical bugs, not logical, like spiritual bugs or something, ones like sort of big directions of implementation, that kind of stuff. so let's stay on bug finding. yeah, I mean, it's really interesting that
these models are so bad at bug finding
uh when just naively prompted to find a
bug they're incredibly poorly calibrated
even the the smartest models exactly
even o1. how do you explain that?
is there a good
intuition I think these models are a
really strong reflection of the
pre-training distribution and you know I
do think they they generalize as the
loss gets lower and lower but I don't
think the the loss and the scale is
quite or the loss is low enough such
that they're like really fully
generalizing in code like the things
that we use these things for uh the
frontier models that that they're quite
good at are really code generation and
question answering these things exist in
massive quantities and pre-training with
all of the code on GitHub on the scale
of many many trillions of tokens and
questions and answers on things like
stack Overflow and maybe GitHub issues
and so when you try to push some of
these things that really don't exist uh
very much online like for example the
cursor tab objective of predicting the next edit given the edits done so far,
uh the brittleness kind of shows and
then bug detection is another great
example where there aren't really that
many examples of like actually detecting
real bugs and then proposing fixes um
and the models just kind of like really
struggle at it but I think it's a
question of transferring the model like
in the same way that you get this
fantastic transfer um from pre-trained
Models uh just on code in general to the
cursor tab objective uh you'll see a
very similar thing with generalized models that are really good at code, to bug detection. it just takes a
little bit of kind of nudging in that
direction like to be clear I think they
sort of understand code really well like
while they're being pre-trained like the
representation that's being built up
like almost certainly like you know
somewhere in the stream, the model knows that maybe there's something sketchy going on. it sort of has some sense of the sketchiness, but actually eliciting the sketchiness... part of it
is that humans are really calibrated on
which bugs are really important it's not
just actually saying there's something sketchy, it's: is it sketchy-and-trivial, or is it sketchy-like-you're-going-to-take-the-server-down? part of
it is maybe the cultural knowledge of uh
like why is a staff engineer a staff
engineer a staff engineer is is good
because they know that three years ago
like someone wrote a really you know
sketchy piece of code that took took the
server down, as opposed to, maybe it's like, you know, this thing is just an experiment, so
like a few bugs are fine like you're
just trying to experiment and get the
feel of the thing and so if the model
gets really annoying when you're writing
an experiment that's really bad but if
you're writing something for super
production, you're writing a database, right, you're writing code in Postgres or Linux or whatever, like you're Linus Torvalds, it's sort of unacceptable to have even an edge case. and just having the
calibration of
like how paranoid is the user like but
even then like if you're putting in a
maximum paranoia it still just like
doesn't quite get it yeah yeah yeah I
mean but this is hard for humans too to
understand what which line of code is
important which is not it's like you I
think one of your principles on the website says: if code can do a lot of damage, one should add a comment that says this line of code is dangerous, in all caps, repeated 10 times. no, you say it for every single line of code inside the function, you have to. and that's quite profound.
that says something about human beings
because the the engineers move on even
the same person might just forget how it can sink the Titanic. a single function like that, you might not intuit that quite clearly by looking at the single piece of code. yeah, and I think that
one is also uh partially also for
today's AI models where uh if you
actually write dangerous dangerous
dangerous in every single line like uh
the models will pay more attention to
that and will be more likely to find bugs in that region. that's actually
just straight up a really good practice
of a labeling code of how much damage
this can do yeah I mean it's
controversial like some people think
it's ugly. Sualeh? well, I actually think, in fact, one of the things I learned from Arvid
is you know like I sort of aesthetically
I don't like it but I think there's
certainly something where like it's it's
useful for the models and and humans
just forget a lot and it's really easy
to make a small mistake and cause
like bring down you know like just bring
down the server and like you like of
course we we like test a lot and
whatever but there there's always these
things that you have to be very careful
yeah like with just normal dock strings
I think people will often just skim it
when making a change and think oh this I
I know how to do this um and you kind of
really need to point it out to them so
that that doesn't slip through
yeah you have to be reminded that you
could do a lot of
damage that's like we don't really think
about that like yeah you think about
okay, how do I figure out how this works
so I can improve it you don't think
about the other direction that could
until we have formal verification
for everything then you can do whatever
you want and you you know for certain
that you have not introduced a bug if
the proof passes but concretely what do
you think that future would look like I
think people will just not write tests anymore, and the model will suggest,
like you write a function the model will
suggest a spec and you review the spec
and uh in the meantime a smart reasoning
model computes a proof that the implementation follows the spec. and I
think that happens for for most
functions don't you think this gets at a
little bit some of the stuff you were
talking about earlier with the
difficulty of specifying intent for what
you want with software um where
sometimes it might be because the intent
is really hard to specify it's also then
going to be really hard to prove that
it's actually matching whatever your
intent is like you think that spec is
hard to
generate yeah or just like for a given
spec maybe you can I think there is a
question of like can you actually do the
formal verification like that's like is
that possible I think that there's like
more to dig into there but then also
even if you have this spec, how do you... is the spec written in natural language? the spec would be formal,
but how easy would that be so then I
think that you care about things that
are not going to be easily well
specified in the spec language I see I
see. that would maybe be an argument against formal verification is all you need. yeah, the worry is there's this massive document replacing something like unit tests. sure. yeah, yeah.
I think you can probably also evolve the
the spec languages to capture some of
the things that they don't really
capture right now um but yeah I don't
know I think it's very exciting and
you're speaking not just about like
single functions you're speaking about
entire code bases I think entire code
bases is harder but that that is what I
would love to have and I think it should
be possible, because there's a lot of work recently where you can formally verify down to the hardware: you formally verify the C code, and then you formally verify through the GCC compiler, and then through the Verilog down to the hardware. and that's an incredibly big system,
but it actually works and I think big
code bases are are sort of similar in
that they're like multi-layered system
and um if you can decompose it and
formally verify each part then I think
it should be possible I think the
specification problem is a real problem
but how do you handle side effects or
how do you handle I guess external
dependencies like calling the stripe API
maybe Stripe would write a spec for their API. but you can't do this for everything. can you do this for everything you
use like how do you how do you do it for
if there are language models... like, maybe people use language models as primitives in the programs they write, and there's a dependence on it, and how do you now include that? I
think you might be able to still prove that. prove what about language
models I think it it feels possible that
you could actually prove that a language
model is aligned for example or like you
can prove that it actually gives the the
right answer um that's the dream yeah
that is I mean that's if it's possible
your I Have a Dream speech if it's
possible that that will certainly help
with you know uh making sure your code
doesn't have bugs and making sure AI
doesn't destroy all of human
civilization so the the full spectrum of
AI safety to just bug finding uh so you
said the models struggle with bug
finding. what's the hope? you know, my hope initially is, and I can let Michael chime in on it, but it was like this: it should, you know, first help with
the stupid bugs like it should very
quickly catch the stupid bugs like off
by-one errors, like sometimes you write
something in a comment and do the other
way it's like very common like I do this
I write like less than in a comment and
like I maybe write it greater than or
something like that and the model is
like yeah it looks sketchy like you sure
you want to do that. but eventually it should be able to catch harder bugs too. yeah, and I think that it's also
important to note that having good bug-finding models feels necessary
to get to the highest reaches of having
AI do more and more programming for you
where you're going to you know if the AI
is building more and more of the system
for you you need to not just generate
but also verify and without that some of
the problems that we've talked about
before with programming with these
models um will just become untenable um
so it's not just for humans like you
write a bug I write a bug find the bug
for me but it's also being able to to
verify the AI code and check it um is
really important yeah and then how do
you actually do this like we have had a
lot of contentious dinner discussions of
how do you actually train a bug model
but one very popular idea is, you know, it's potentially much easier to introduce a bug than to actually find the bug. and so you can train a model to introduce bugs in existing code, and then you can train a reverse bug model that can find bugs using this synthetic data. so that's one example.
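A toy sketch of that synthetic-data idea (the bug injector below is a crude stand-in; in the discussion it would itself be a model): corrupt known-good code, and keep the (buggy, original) pairs as training data for a reverse, bug-finding model.

```python
# Generate (buggy, original) training pairs by injecting bugs into known-good snippets.
def inject_bug(snippet: str) -> str:
    """Crude stand-in for a model that introduces a plausible bug (here: flip a comparison)."""
    return snippet.replace("<", ">", 1) if "<" in snippet else snippet

clean_snippets = [
    "if i < len(items):\n    process(items[i])",
]

training_pairs = []
for snippet in clean_snippets:
    buggy = inject_bug(snippet)
    if buggy != snippet:
        # the bug-finding model learns to map the corrupted code back to the original
        training_pairs.append({"input": buggy, "target": snippet})

print(training_pairs)
```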
but yeah, there are lots of other ideas for how to do this. you can also do
a bunch of work not even at the model
level of taking the biggest models and
then maybe giving them access to a lot
of information that's not just the code
like it's kind of a hard problem to like
stare at a file and be like where's the
bug and you know that's that's hard for
humans often right and so often you have
to to run the code and being able to see
things like traces and step through a
debugger um there's another whole
another Direction where it like kind of
tends toward that and it could also be
that there are kind of two different
product form factors here it could be
that you have a really specialty model
that's quite fast that's kind of running
in the background and trying to spot
bugs and it might be that sometimes sort
of to arvid's earlier example about you
know some nefarious input box bug might
be that sometimes you want to like
there's you know there's a bug you're
not just like checking hypothesis free
you're like this is a problem I really
want to solve it and you zap that with
tons and tons and tons of compute and
you're willing to put in like $50 to
solve that bug or something even more
have you thought about integrating money
into this whole thing like I would pay
probably a large amount of money for if
you found a bug or even generated a code
that I really appreciated like I had a
moment a few days ago when I was using cursor and it generated, like, three perfect functions
for interacting with the YouTube API to
update captions and uh for localization
like different in different languages
the API documentation is not very good
and the code across... like, I Googled it for a while and I couldn't find exactly what I needed, there's a lot of confusing information, and cursor generated it perfectly. and I was
like, I just sat back, I read the code, I
was like this is correct I tested it
it's correct. I was like, I want a tip button that goes, yeah, here's $5,
one that's really good just to support
the company and support what the the
interface is and the other is that
probably sends a strong signal like good
job right so there much stronger signal
than just accepting the code right you
just actually send like a strong good
job that and for bug finding obviously
like there's a lot of people
you know that would pay a huge amount of
money for a bug like a bug bug Bounty
thing right is that you guys think about
that yeah it's a controversial idea
inside the the company I think it sort
of depends on how much uh you believe in
humanity almost you know like uh I think
it would be really cool if like uh you
spend nothing to try to find a bug and
if it doesn't find a bug, you spend zero,
and then if it does find a bug uh and
you click accept then it also shows like
in parenthesis like $1 and so you spend
$1 to accept a bug uh and then of course
there's worry like okay we spent a lot
of computation like maybe people will
just copy paste um I think that's a
worry um and then there is also the
worry that like introducing money into
the product makes it like kind of you
know like it doesn't feel as fun anymore
like you have to like think about money
and and you all you want to think about
is like the code and so maybe it
actually makes more sense to separate it
out and like you pay some fee like every
month and then you get all of these
things for free but there could be a
tipping component which is not like it
it it still has that like dollar symbol
I think it's fine but I I also see the
point where like maybe you don't want to
introduce it yeah I was going to say the
moment that feels like people do this is
when they share it when they have this
fantastic example they just kind of
share it with their friends there is
also a potential world where there's a
technical solution to this like honor
System problem too where if we can get
to a place where we understand the
output of the system more I mean to the
stuff we were talking about with like
you know error checking with the LSP and
then also running the code but if you
could get to a place where you could
actually somehow verify oh I have fixed
the bug maybe then the the bounty system
doesn't need to rely on the honor System
Too how much interaction is there
between the terminal and the code like
how much information is gained from if
you if you run the code in the terminal
like can you use can you do like a a
loop where it runs runs the code and
suggests how to change the code if if
the code at runtime gives an error? right now they're separate worlds completely. like, I know you can do Ctrl+K inside the terminal to help you write the code, and you can use terminal context as well, inside of Cmd+K, kind of everything. we don't
have the looping part yet though we
suspect something like this could make a
lot of sense there's a question of
whether it happens in the foreground too
or if it happens in the background like
what we've been discussing sure the
background is pretty cool like we do
running the code in different ways plus
there's a database side to this which
how do you protect it from not modifying
the database but
okay I mean there's there's certainly
cool Solutions there uh there's this new
API that is being developed for it. it's not in AWS, but, you know, I think it's in PlanetScale. I don't know if PlanetScale was the first one to add it. it's the ability to sort of add branches to a database,
which is uh like if you're working on a
feature and you want to test against the
prod database but you don't actually
want to test against the pr database you
could sort of add a branch to the
database in the way to do that is to add
a branch to the WR ahead log uh and
there's obviously a lot of technical
complexity in doing it correctly I I
guess database companies need need need
new things to do uh because they have
they have they have good databases now
uh and and I I think like you know turbo
buffer which is which is one of the
databases we use as is is going to add
hope maybe braning to the to the rad log
and and so so maybe maybe the the AI
agents will use we'll use branching
they'll like test against some branch
and it's sort of going to be a
requirement for the database to like
support branching or something it would
be really interesting if you could
Branch a file system right yeah I feel
like everything needs branching it's
like that yeah yeah like that's the
problem with the Multiverse
right like if you branch on everything
that's like a lot I mean there's there's
obviously these like super clever
algorithms to make sure that you don't
actually sort of use a lot of space or
CPU or whatever okay this is a good
place to ask about infrastructure so you
guys mostly use AWS what what are some
interesting details what are some
interesting challenges why' you choose
AWS why is why is AWS still winning
hashtag AWS is just really really good
it's really good like um whenever you
use an AWS product you just know that
it's going to work like it might be
absolute hell to go through the steps to
set it up um why is the interface so
horrible? because it's just so good, it doesn't need to... it's the nature of winning. I think it's exactly that, it's just the nature of winning. yeah, yeah. but AWS
you can always trust like it will always
work and if there is a problem it's
probably your
problem yeah okay is there some
interesting like challenges to you guys
have pretty new startup to get scaling
to like to so many people and yeah I
think that they're uh it has been an
interesting Journey adding you know each
extra zero to the request per second you
run into all of these with like you know
the general components you're using for
for caching and databases run into
issues as you make things bigger and
bigger and now we're at the scale where
we get like you know int overflows on
our tables and things like that um and
then also there have been some custom
systems that we've built like for
instance our retrieval system for computing
a semantic index of your codebase and
answering questions about a codebase
that have continually I feel like been
one of the the trickier things to scale
I I have a few friends who are who are
super super senior engineers and one of
their sort of lines is like it's it's
very hard to predict where systems will
break when when you scale them you you
you can sort of try to predict in
advance but like there's there's always
something something weird that's going
to happen when you add this extra zero,
and you you thought you thought through
everything but you didn't actually think
through everything uh but I think for
that particular system
we've... so, for concrete details, the thing we do is obviously we chunk up all of your code, and then we send up sort of the code for embedding, and we embed
the code and then we store the
embeddings uh in a in a database but we
don't actually store any of the code and
then there's reasons around making sure
that
we don't introduce client bugs because
we're very very paranoid about client
bugs. we store much of the details on the server, like everything is sort
of
encrypted so one one of the technical
challenges is is always making sure that
the local index the local codebase state
is the same as the state that is on the
server and and the way sort of
technically we ended up doing that is so
for every single file you can you can
sort of keep this hash and then for
every folder you can sort of keep a hash
which is the hash of all of its children
and you can sort of recursively do that
until the top. and why do something complicated? one thing you
could do is you could keep a hash for
every file then every minute you could
try to download the hashes that are on
the server figure out what are the files
that don't exist on the server maybe
just created a new file maybe you just
deleted a file maybe you checked out a
new branch and try to reconcile the
state between the client and the
server but that introduces like
absolutely ginormous Network overhead
both uh both on the client side I mean
nobody really wants us to hammer their
Wi-Fi all the time if you're using
cursor uh but also like I mean it would
introduce like ginormous overhead in the
database it would sort of be reading
this tens-of-terabytes database, sort of approaching like 20 terabytes or something, every second.
that's just just kind of crazy you
definitely don't want to do that so what
you do you sort of you just try to
reconcile the single hash which is at
the root of the project and then if if
something mismatches then you go you
find where all the things disagree maybe
you look at the children and see if the
hashes match and if the hashes don't
match go look at their children and so
on but you only do that in the scenario
where things don't match and for most
people most of the time the hashes match
so it's a kind of, like, hierarchical reconciliation? yeah, something like that. it's called a Merkle tree? yeah, Merkle.
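A small sketch of that hierarchical reconciliation (simplified; real Merkle trees and Cursor's indexer handle renames, deletions, and much more): compare the root hashes first, and only walk into children whose hashes disagree.

```python
# Merkle-style reconciliation: descend only where hashes mismatch.
import hashlib

def sha(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def tree_hash(node):
    """node is ('file', contents) or ('dir', {name: node})."""
    kind, payload = node
    if kind == "file":
        return sha(payload)
    return sha("".join(name + tree_hash(child) for name, child in sorted(payload.items())))

def changed_paths(local, remote, path=""):
    if tree_hash(local) == tree_hash(remote):
        return []                               # the common case: hashes match, stop here
    kind, payload = local
    if kind == "file":
        return [path]
    out = []
    for name, child in payload.items():
        remote_child = remote[1].get(name, ("file", ""))
        out += changed_paths(child, remote_child, path + "/" + name)
    return out

local  = ("dir", {"a.py": ("file", "print(1)"), "b.py": ("file", "print(2)")})
remote = ("dir", {"a.py": ("file", "print(1)"), "b.py": ("file", "print(3)")})
print(changed_paths(local, remote))             # ['/b.py']
```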
I mean, it's cool to see that you kind of have to think
through all these problems and I mean
the the point of like the reason it's
gotten hard is just because like the
number of people using it and you know
if some of your customers have really
really large code bases uh to the point
where we you know we we originally
reordered our code base which is which
is big but I mean just just not the size
of some company that's been there for 20
years and sort of has to train enormous
number of files and you sort of want to
scale that across programmers there's
there's all these details where like
building the simple thing is easy but
scaling it to a lot of people like a lot
of companies is is obviously a difficult
problem which is sort of you know
independent of actually so that's
there's part of this scaling our current
solution is also you know coming up with
new ideas that obviously we're working
on, but then scaling all of that in the last few weeks. yeah, and
there are a lot of clever things like
additional things that that go into this
indexing system
um for example the bottleneck in terms
of costs is not storing things in the
vector database or the database it's
actually embedding the code and you
don't want to re-embed the code base for
every single person in a company that is
using the same exact code except for
maybe they're in a different branch with
a few different files or they've made a
few local changes and so because again
embeddings are the bottleneck you can do
this one clever trick and not have to
worry about like the complexity of like
dealing with branches and and the other
databases, where you just have some cache on the actual vectors computed from the hash of a given chunk. and so this
means that when the nth person at a
company goes into their code base it's
it's really really fast and you do all
this without actually storing any code
on our servers at all no code data
stored we just store the vectors in the
vector database and the vector cache
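A small sketch of that trick (embed() below is a placeholder for the real embedding model): key the vector cache by a hash of the chunk's contents, so identical chunks across users and branches are only embedded once, and no source code needs to be stored server-side.

```python
# Cache embeddings by content hash so the expensive embedding call runs at most once per chunk.
import hashlib

embedding_cache = {}   # chunk-hash -> vector; note: no source code is kept here

def embed(chunk: str) -> list[float]:
    """Placeholder for the real embedding model call."""
    return [float(len(chunk))]

def get_embedding(chunk: str) -> list[float]:
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = embed(chunk)
    return embedding_cache[key]

# the nth person at a company indexing the same code hits the cache instead of re-embedding
get_embedding("def add(a, b): return a + b")
get_embedding("def add(a, b): return a + b")
print(len(embedding_cache))  # 1
```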
what's the biggest gain, at this time, you get from indexing the code base? just out of curiosity, what benefit do users get? it seems like
longer term there'll be more and more
benefit but in the short term just
asking questions of the code
base uh what what's the use what's the
usefulness of that I think the most
obvious one is um just you want to find
out where something is happening in your
large code base and you sort of have a
fuzzy memory of okay I want to find the
place where we do X um but you don't
exactly know what to search for in a
normal text search and to ask a chat uh
you hit command enter to ask with with
the codebase chat and then uh very often
it finds the the right place that you
were thinking of I think like you like
you mentioned in the future I think this
only going to get more and more powerful
where we're working a lot on improving
the quality of our retrieval um and I
think the ceiling for that is really, really much higher than people give it credit for. one question that's good to
ask here have you considered and why
haven't you much done sort of local
stuff to where you can do the it seems
like everything we just discussed is exceptionally difficult to do. to go to the cloud, you have to think about
all these things with the caching and
the
uh you know large code Bas with a large
number of programmers are using the same
code base you have to figure out the
puzzle of that, a lot of it. you know, most software just does this heavy computational stuff locally. so have you considered doing sort of embeddings
locally yeah we thought about it and I
think it would be cool to do it locally
I think it's just really hard and and
one thing to keep in mind is that you
know uh some of our users use the latest
MacBook Pro, but most of our users, like more than 80% of our users, are on Windows machines, and many of them are not very powerful. and so local models really only work on
the on the latest computers and it's
also a big overhead to to to build that
in and so even if we would like to do
that um it's currently not something
that we are able to focus on and I think
there there are some uh people that that
that do that and I think that's great um
but especially as models get bigger and
bigger and you want to do fancier things
with like bigger models it becomes even
harder to do it locally yeah and it's
not a problem of like weaker computers
it's just that for example if you're
some big company you have big company
code base, it's just really hard to process a big company code base even on the beefiest MacBook Pros. so it's not even a matter of, like, if you're just a student
or something I think if you're like the
best programmer at at a big company
you're still going to have a horrible
experience if you do everything locally
when you could you could do it and sort
of scrape by but like again it wouldn't
be fun anymore. yeah, like, doing approximate nearest neighbors on this massive code
base is going to just eat up your memory
and your CPU, and that's just that. like, let's talk about also the modeling side, where, as was said, there are these massive headwinds against local models, where, one, things seem to move towards MoEs, which,
like one benefit is maybe they're more
memory bandwidth bound which plays in
favor of local uh versus uh using gpus
um or using Nvidia gpus but the downside
is these models are just bigger in total
and you know they're going to need to
fit often not even on a single node but
multiple nodes um there's no way that's
going to fit inside of even really good
MacBooks um and I think especially for
coding it's not a question as much of
like does it clear some bar of like the
model's good enough to do these things
and then like we're satisfied which may
may be the case for other other problems
and maybe where local models shine but
people are always going to want the best
the most intelligent the most capable
things and that's going to be really
really hard to run for almost all people
locally. don't you want the most capable model? like, you want Sonnet? and also with o1? I like how you're pitching me. would you be satisfied with an inferior model? listen, yes, I'm
one of those but there's some people
that like to do stuff locally especially
like yeah really there's a whole
obviously open source movement that kind
of resists and it's good that they exist
actually because you want to resist the
power centers that are growing. there's actually an alternative to local models that I'm particularly fond of.
I think it's still very much in the
research stage but you could imagine um
to do homomorphic encryption for
language model inference so you encrypt
your input on your local machine then
you send that up and then um the server
uh can use lots of computation they can
run models that you cannot run locally
on this encrypted data um but they
cannot see what the data is and then
they send back the answer and you
decrypt the answer and only you can see
the answer uh so I think uh that's still
very much research and all of it is
about trying to make the overhead lower
because right now the overhead is really
big uh but if you can make that happen I
think that would be really really cool
and I think it would be really really
impactful um because I think one thing
that's actually kind of worrisome is
that as these models get better and
better uh they're going to become more
and more economically useful and so more
and more of the world's information and
data will flow through, you know,
one or two centralized actors um and
then there are worries about you know
there can be traditional hacker attempts
but it also creates this kind of scary
part where if all of the world's
information is flowing through one node
in plain text, you can have surveillance
in very bad ways and sometimes that will
happen, you know, initially for good reasons, like people will want to try to protect against bad actors using AI models in bad ways, and then
you will add in some surveillance code
and then someone else will come in and
you know you're in a slippery slope and
then you start uh doing bad things with
a lot of the world's data and so I I'm
very hopeful that uh we can solve
homomorphic encryption for doing privacy
preserving machine learning but I would
say like that's the challenge we have
with all software these days it's
like there's so many features that can
be provided from the cloud and all of us
increasingly rely on it and make our
life awesome but there's downsides and
that's that's why you rely on really
good security to protect from basic
attacks but there's also only a small
set of companies that are controlling
that data you know and they they
obviously have leverage and they could
be infiltrated in all kinds of ways
that's the world we live in yeah I mean
the thing I'm just actually quite
worried about is sort of the world where, I mean, Anthropic has this responsible scaling policy, and so we're on, like, the low ASLs, which is the Anthropic security level or whatever, of the models. but as we get to, like, ASL-3, ASL-4, whatever models, which are sort of very
powerful
but for for mostly reasonable security
reasons you would want to monitor all
the prompts uh but I think I think
that's sort of reasonable and
understandable where where everyone is
coming from but man it'd be really
horrible if if sort of like all the
world's information is sort of monitored
that heavily it's way too centralized
it's like it's like sort of this like
really fine line you're walking where on
the one side like you don't want the
models to go Rogue on the other side
like man humans like I I don't know if I
if I trust like all the world's
information to pass through like three
three model providers yeah why do you
think it's different than Cloud
providers? because I think a lot of this data would never have gone to the cloud providers in the first place. this is often, like, you want to give more data to the AI models, you
want to give personal data that you
would never have put online in the first
place uh to these companies or or or to
these models um and it also centralizes
control uh where right now um for for
cloud you can often use your own
encryption keys and it like it can't
really do much um but here it's just
centralized actors that see the exact
plain text of
everything on the topic of context that
that's actually been a friction for me
when I'm writing code you know in Python
there's a bunch of stuff imported, and you could probably intuit the kind of stuff I would like to include in the context. how hard is it to automatically figure out the
context It's Tricky um I think we can do
a lot better um at uh Computing the
context automatically in the future one
thing that's important to note is there
are trade-offs with including automatic
context so the more context you include
for these models um first of all the
slower they are and um the more
expensive those requests are which means
you can then do less model calls and do
less fancy stuff in the background also
for a lot of these models they get
confused if you have a lot of
information in the prompt so the bar for
um accuracy and for relevance of the
context you include should be quite High
um but this is already we do some
automatic context in some places within
the product it's definitely something we
want to get a lot better at and um I
think that there are a lot of cool ideas
to try there um both on the learning
better retrieval systems like better
embedding models, better rankers. I think
that there are also cool academic ideas
you know stuff we've tried out
internally, but also things the field at large is grappling with: can you get language models to a place where you
can actually just have the model itself
like understand a new Corpus of
information and the most popular talked
about version of this is can you make
the context Windows infinite then if you
make the context Windows infinite can
make the model actually pay attention to
the infinite context and then after you
can make it pay attention to the
infinite context to make it somewhat
feasible to actually do it can you then
do caching for that infinite context you
don't have to recompute that all the
time but there are other cool ideas that
are being tried that are a little bit
more analogous to fine-tuning of
actually learning this information and
the weights of the model and it might be
that you actually get sort of a
qualitatively different type of
understanding if you do it more at the
weight level than if you do it at the in-context learning level. I think the jury is still a little bit out on how this is all going to work in the end, but in the interim, us as a
company we are really excited about
better retrieval systems and um picking
the parts of the code base that are most
relevant to what you're doing uh we
could do that a lot better like one
interesting proof of concept for the
learning this knowledge directly in the
weights is with vs code so we're in a vs
code fork and vs code the code is all
public so these models in pre-training
have seen all the code um they probably
also seen questions and answers about it
and then they've been fine-tuned and RLHF'd to be able to answer questions
about code in general so when you ask it
a question about vs code you know
sometimes it'll hallucinate but
sometimes it actually does a pretty good
job at answering the question and I
think this is just, like, it happens to be okay at it. but what if you could
actually like specifically train or Post
train a model such that it really was
built to understand this code base um
it's an open research question one that
we're quite interested in and then
there's also uncertainty of like do you
want the model to be the thing that end
to end is doing everything I.E it's
doing the retrieval in its internals and
then kind of answering your question
creating the code or do you want to
separate the retrieval from the Frontier
Model where maybe you know you'll get
some really capable models that are much
better than like the best open source
ones in a handful of months um and then
you'll want to separately train a really
good open source model to be the
retriever to be the thing that feeds in
the context um to these larger models
can you speak a little more to post-training a model to understand the code base? like, what do you mean by that? is this the synthetic data direction? yeah, I mean, there are
many possible ways you could try doing
it there's certainly no shortage of
ideas um it's just a question of going
in and like trying all of them and being
empirical about which one works best um
you know one one very naive thing is to
try to replicate What's Done uh with
vscode uh and these Frontier models so
let's like continue pre-training some
kind of continued pre-training that
includes General code data but also
throws in a lot of the data of some
particular repository that you care
about. and then in post-training, meaning, let's just start with instruction fine-tuning: you have a
normal instruction fine tuning data set
about code then you throw in a lot of
questions about code in that repository
um so you could either get ground truth
ones which might be difficult or you
could do what you kind of hinted at or
suggested using synthetic data um I.E
kind of having the model uh ask
questions about various pieces of the
code um so you kind of take the pieces
of the code then prompt the model or
have a model propose a question for that
piece of code and then add those as
instruction fine-tuning data points, and
then in theory this might unlock the
model's ability to answer questions about that code base.
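A sketch of that synthetic-data recipe (ask_model below is an assumed stand-in for a model call, not a real API): for each piece of the repository, have a model propose a question and a code-grounded answer, and use the pairs as instruction fine-tuning data.

```python
# Build synthetic (question, answer) fine-tuning examples grounded in a specific repository.
def ask_model(prompt: str) -> str:
    """Stand-in for a model call (assumed interface)."""
    return "placeholder output"

def build_finetune_examples(code_chunks: list[str]) -> list[dict]:
    examples = []
    for chunk in code_chunks:
        question = ask_model(
            "Ask one question a developer might have about this code:\n" + chunk
        )
        answer = ask_model(
            "Answer the question using only this code:\n" + chunk + "\n\nQ: " + question
        )
        examples.append({"instruction": question, "response": answer})
    return examples

print(build_finetune_examples(["def add(a, b):\n    return a + b"]))
```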
let me ask you about OpenAI o1. what do you think is the role of
that kind of test time compute system in
programming I think test time compute is
really really interesting so there's
been the pre-training regime which will
kind of as you scale up the amount of
data and the size of your model get you
better and better performance both on
loss and then on Downstream benchmarks
um and just general performance when we
use it for coding or or other tasks um
we're starting to hit uh a bit of a data
wall meaning it's going to be hard to
continue scaling up this regime and so
scaling up test-time compute is an
interesting way of now you know
increasing the number of inference time
flops that we use, but still getting, as you increase the number of flops you use at inference time, corresponding improvements in the
performance of these models
traditionally we just had to literally
train a bigger model that always uses uh
that always used that many more flops
but now we could perhaps use the same
size model and run it for longer to
be able to get uh an answer at the
quality of a much larger model and so
the really interesting thing I like
about this is there are some problems
that perhaps require
100 trillion parameter model
intelligence trained on 100 trillion
tokens um but that's like maybe 1% maybe
like 0.1% of all queries so are you
going to spend all of this effort all
this compute training a model uh that
cost that much and then run it so
infrequently it feels completely
wasteful, when instead you train the model that's capable of doing the 99.9%
of queries then you have a way of
inference time running it longer for
those few people that really really want
Max
intelligence how do you figure out which
problem requires what level of
intelligence is that possible to
dynamically figure out when to use GPT-4, when to use a small model, and when you need o1? I mean, yeah, that's an open
research problem certainly uh I don't
think anyone's actually cracked this
model routing problem quite well uh we'd
like to we we have like kind of initial
implementations of this for something like cursor tab, but at the level of going between 4o, Sonnet, o1, it's a bit trickier. perhaps, like,
there's also a question of like what
level of intelligence do you need to
determine if the thing is uh too hard
for for the the four level model maybe
you need the 01 level model um it's
really unclear but but you mentioned so
there's a pre-training process, then there's post-training, and then there's test-time compute. is that fair, does that sort of separate them? where's the biggest gains? well, it's weird,
because like test time compute there's
like a whole training strategy needed to
get test time compute to work and the
Really the other really weird thing
about this is no one like outside of the
big labs and maybe even just open AI no
one really knows how it works like there
have been some really interesting papers
that uh show hints of what they might be
doing and so perhaps they're doing
something with research using process
reward models but yeah I just I think
the issue is we don't quite know exactly
what it looks like so it would be hard
to kind of comment on like where it fits
in I I would put it in post training but
maybe like the compute spent for this
kind of for getting test time compute to
work for a model is going to dwarf
pre-training
eventually. so we don't even know if o1 is using just, like, chain of thought, RL,
we don't know how they're using any of
these we don't know anything it's fun to
speculate like if you were to uh build a
competing model what would you do yeah
so one thing to do would be I I think
you probably need to train a process
reward model which is so maybe we can
get into reward models and outcome
reward models versus process reward
models outcome reward models are the
kind of traditional reward models that
people train for
language modeling and
it's just looking at the final thing so
if you're doing some math problem let's
look at that final thing you've done
everything and let's assign a grade to
it How likely we think uh like what's
the reward for this this this outcome
process reward models Instead try to
grade The Chain of Thought and so open
AI had some preliminary paper on this I
think uh last summer where they use
human labelers to get this pretty large
several hundred thousand data set of
grading chains of thought um
ultimately it feels like I haven't seen
anything interesting in the ways that
people use process reward models outside
of just using it as a means of uh
affecting how we choose between a bunch
of samples so like what people do uh in
all these papers is they sample a bunch
of outputs from the language model and
then use the process reward models to
grade uh all those Generations alongside
maybe some other heuristics and then use
that to choose the best answer the
really interesting thing that people
think might work and people want to work
is tree search with these process reward
models because if you really can grade
every single step of the Chain of
Thought then you can kind of Branch out
and you know explore multiple Paths of
this Chain of Thought and then use these
process reward models to evaluate how good
is this branch that you're
taking yeah when the when the quality of
the branch is somehow strongly
correlated with the quality of the
outcome at the very end so like you have
a good model of knowing which should
take so not just this in the short term
and like in the long term yeah and like
the interesting work that I think has
been done is figuring out how to
properly train the process reward models or the
interesting work that has been open-
sourced and people I think uh talk about
is uh how to train the process reward
models um maybe in a more automated way
um I I could be wrong here might not be
mentioning some papers I haven't seen
anything super uh that seems to work
really well for using the process reward
models creatively to do tree search in code
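A minimal sketch of the best-of-n pattern described above, where a process reward model scores each reasoning step and the highest-scoring chain is kept. The `sample_chain_of_thought` and `score_step` callables are hypothetical stand-ins for a policy model and a PRM, not any lab's actual implementation.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample_chain_of_thought: Callable[[str], List[str]],  # policy: prompt -> reasoning steps
    score_step: Callable[[str, List[str]], float],        # PRM: score of the latest step
    n: int = 16,
) -> List[str]:
    """Sample n chains of thought and keep the one the PRM likes best."""
    best_chain: List[str] = []
    best_score = float("-inf")
    for _ in range(n):
        steps = sample_chain_of_thought(prompt)
        # Score every prefix of the chain; take the minimum so one bad step
        # sinks the whole chain (other aggregations, e.g. product, also work).
        step_scores = [score_step(prompt, steps[: i + 1]) for i in range(len(steps))]
        chain_score = min(step_scores) if step_scores else float("-inf")
        if chain_score > best_score:
            best_chain, best_score = steps, chain_score
    return best_chain
```

Tree search, as discussed above, would branch at individual steps and prune low-scoring branches instead of only scoring complete chains.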
um this is kind of an AI safety
maybe a bit of a philosophy question so
open AI says that they're hiding the
Chain of Thought from the user and
they've said that that was a difficult
decision to make they instead of showing
the Chain of Thought they're asking the
model to summarize the Chain of Thought
they're also in the background saying
they're going to monitor the Chain of
Thought to make sure the model is not
trying to manipulate the user which is a
fascinating possibility but anyway what
do you think about hiding the Chain of
Thought one consideration for open Ai
and this is completely speculative could
be that they want to make it hard for
people to distill these capabilities out
of their model it might actually be
easier if you had access to that hidden
Chain of Thought uh to replicate the
technology um because that's pretty
important data like seeing seeing the
steps that the model took to get to the
final result so you can probably train
on that also and there was sort of a
mirror situation with this with some of
the large language model providers and
also this is speculation but um some of
these apis um used to offer easy access
to log probabilities for the tokens that
they're generating um and also log
probabilities over the prompt tokens and
then some of these apis took those away
uh and again complete speculation but um
one of the thoughts is that the the
reason those were taken away is if you
have access to log probabilities um
similar to this hidden chain of thought
that can give you even more information
to to try and distill these capabilities
out of the apis out of these biggest
models into models you control as an
asterisk on the previous
discussion about us integrating o1 I
think that we're still learning how to
use this model so we made o1 available
in cursor because like we were when we
got the model we were really interested
in trying it out I think a lot of
programmers are going to be interested
in trying it out but um uh o1 is not
part of the default cursor experience in
any way up um and we still haven't found
a way to yet integrate it into an editor
in uh into the editor in a way that we
we we reach for sort of you know every
hour maybe even every day and so I think
that the jury's still out on how to how
to use the model um and uh I we haven't
seen examples yet of of people releasing
things where it seems really clear like
oh that's that's like now the use case
um the obvious one to to turn to is
maybe this can make it easier for you to
have these background things running
right to have these models in Loops to
have these models be agentic um but we're
still um still discovering to be clear
we have ideas we just need to we need to
try and get something incredibly useful
before we we put it out there but it has
these significant limitations like even
like barring capabilities uh it does not
stream and that means it's really really
painful to use for things where you want
to supervise the output um and instead
you're just waiting for the wall of text to
show up um also it does feel like the
early innings of test time compute and
search where it's just very
much a v0 um and there's so many
things that like like don't feel quite
right and I suspect um in parallel to
people increasing uh the amount of
pre-training data and the size of the
models and pre-training and finding
tricks there you'll now have this other
thread of getting search to work better
and
better so let me ask you
about strawberry tomorrow
eyes so it looks like GitHub um co-pilot
might be integrating 01 in some kind of
way and I think some of the comments are
saying this this mean cursor is
done I think I saw one comment saying
that I saw time to shut down cursor
time to shut down
cursor so is it time to shut down cursor
I think this space is a little bit
different from past software spaces over
the the 2010s um where I think that the
ceiling here is really really really
incredibly high and so I think that the
best product in 3 to four years will
just be so much more useful than the
best product today and you can like wax
poetic about moats this and brand that and
you know this is our uh advantage but I
think in the end just if you don't have
like if you stop innovating on the
product you will you will lose and
that's also great for startups um that's
great for people trying to to enter this
Market um because it means you have an
opportunity um to win against people who
have you know lots of users already by
just building something better um and so
I think yeah over the next few years
it's just about building the best
product building the best system and
that both comes down to the modeling
engine side of things and it also comes
down to the to the editing experience
yeah I think most of the additional
value from cursor versus everything else
out there is not just integrating the
new model fast like o1 it comes from all
of the kind of depth that goes into
these custom models that you don't
realize are working for you in kind of
every facet of the product as well as
like the really uh thoughtful ux with
every single
feature all right uh from that profound
answer let's descend back down to the
technical you mentioned you have a
taxonomy of synthetic data oh yeah uh
can you please explain yeah I think uh
there are three main kinds of synthetic
data the first is so so what is
synthetic data first so there's normal
data like non-synthetic data which is
just data that's naturally created i.e.
usually it'll be from humans having done
things so uh from some human process you
get this data synthetic data uh the
first one would be distillation so
having a language model kind of output
tokens or probability distributions over
tokens um and then you can train some
less capable model on this uh this
approach is not going to get you a net
like more capable model than the
original one that has produced The
Tokens um
but it's really useful for if there's
some capability you want to elicit from
some really expensive High latency model
you can then distill that down into
some smaller task specific model um the
second kind is when like One Direction
of the problem is easier than the
reverse and so a great example of this
is bug detection like we mentioned
earlier where it's a lot easier to
introduce reasonable looking bugs
than it is to actually detect them and
this is this is probably the case for
humans too um and so what you can do is
you can get a model that's not trained on
that much data that's not that smart to
introduce a bunch of bugs in code and
then you can use that synthetic
data to train a model that
can be really good at detecting bugs um
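A hedged sketch of that bug-injection idea: use a weaker model to mutate known-good snippets and label the results, producing synthetic training pairs for a bug detector. The `llm_complete` call and the prompt below are hypothetical stand-ins, not a particular provider's API or Cursor's pipeline.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical completion call; swap in any text-generation client.
LlmComplete = Callable[[str], str]

INJECTION_PROMPT = (
    "Introduce one subtle, realistic bug into the following code. "
    "Return only the modified code.\n\n{code}"
)

def make_bug_dataset(
    clean_snippets: List[str],
    llm_complete: LlmComplete,
    buggy_fraction: float = 0.5,
) -> List[Tuple[str, int]]:
    """Return (snippet, label) pairs where label 1 means 'contains a bug'."""
    dataset: List[Tuple[str, int]] = []
    for code in clean_snippets:
        if random.random() < buggy_fraction:
            dataset.append((llm_complete(INJECTION_PROMPT.format(code=code)), 1))
        else:
            dataset.append((code, 0))
    return dataset
```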
the last category I think is I guess the
main one that it feels like the big labs
are doing for synthetic data which is um
producing texts with language models
that can then be verified easily um so
like you know extreme example of this is
if you have a verification system that
can detect if language is Shakespeare
level and then you have a bunch of
monkeys typing on typewriters like you
can eventually get enough training data
to train a Shakespeare level language
model and I mean this is the case like
very much the case for math where
verification is is is actually really
really easy for formal um formal
language
and then what you can do is you can have
an OKAY model uh generate a ton of roll
outs and then choose the ones that you
know have actually proved the ground
truth theorems and train that further uh
there's similar things you can do for
code with LeetCode-like problems or uh
where if you have some set of tests that
you know correspond to if if something
passes these tests it has actually
solved a problem you could do the same
thing where you verify that it's passed
the tests and then train the model on the
outputs that have passed the tests um
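A minimal sketch of that verification-filtered generation loop: sample many candidate solutions, keep only those that pass the problem's tests, and fine-tune on the survivors. `sample_solution` is a hypothetical model call, and running untrusted generated code would of course need proper sandboxing in practice.

```python
import subprocess
import tempfile
from typing import Callable, List, Tuple

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run the candidate plus its tests in a subprocess; exit code 0 means pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=30)
    return result.returncode == 0

def verified_samples(
    problems: List[Tuple[str, str]],           # (problem_prompt, test_code)
    sample_solution: Callable[[str], str],     # hypothetical model call
    samples_per_problem: int = 8,
) -> List[Tuple[str, str]]:
    """Keep only (prompt, solution) pairs whose solution passes the tests."""
    kept: List[Tuple[str, str]] = []
    for prompt_text, tests in problems:
        for _ in range(samples_per_problem):
            candidate = sample_solution(prompt_text)
            if passes_tests(candidate, tests):
                kept.append((prompt_text, candidate))
    return kept
```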
I think it's going to be a little
tricky getting this to work in all
domains or just in general like having
the perfect verifier feels really really
hard to do with just like open-ended
miscellaneous tasks you give the model
or more like long Horizon tasks even in
coding that's cuz you're not as
optimistic as Arvid but yeah uh so yeah
so that that that third category
requires having a verifier yeah
verification is it feels like it's best
when you know for a fact that it's
correct and like then like it wouldn't
be like using a language model to verify
it would be using tests or uh formal
systems or running the thing too doing
like the human form of verification
where you just do manual quality control
yeah yeah but like the the language
model version of that where it's like
running the thing it's actually
understands yeah but yeah no that's sort
of somewhere between yeah yeah I think
that that's the category that is um most
likely to to result in like massive
gains what about RL with feedback so
RLHF versus RLAIF
um what's the role of that in um
getting better performance on the
models yeah so
RLHF is when the reward model you use uh
is trained from some labels you've
collected from humans giving
feedback um I think this works if you
have the ability to get a ton of human
feedback for this kind of task that you
care about RLAIF is interesting uh
because you're kind of depending on like
this is actually kind of uh going to
it's depending on the constraint that
verification is actually a decent bit
easier than generation because it feels
like okay like what are you doing you're
using this language model to look at the
language model outputs and then improve
the language model but no it actually
may work if the language model uh has a
much easier time verifying some solution
uh than it does generating it then you
actually could perhaps get this kind of
recursively but I don't think it's going
to look exactly like that um the other
the other thing you could do
is that we kind of do is like a little
bit of a mix of RLAIF and RLHF where
usually the model is actually quite
correct and this is in the case of
cursor tab at at picking uh between like
two possible generations of what is what
is what is the better one and then it
just needs like a hand a little bit of
human nudging with only like on the on
the order of 50 100 uh examples um to
like kind of align that prior the model
has with exactly with what what you want
it looks different than I think normal
RLHF where you're usually training these
reward models on tons of examples
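For context, a reward model in the RLHF setup described above is typically trained on pairwise preference labels with a Bradley-Terry style loss, pushing the reward of the preferred generation above the rejected one. This is the generic recipe, not Cursor's tab-ranking setup; the tensors below are stand-ins for a scoring head's outputs.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected), batch-averaged."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative step with made-up values; in practice these rewards come from a
# scoring head applied to (prompt, completion) pairs labeled by humans (RLHF)
# or by another model (RLAIF).
reward_chosen = torch.randn(32, requires_grad=True)
reward_rejected = torch.randn(32, requires_grad=True)
loss = preference_loss(reward_chosen, reward_rejected)
loss.backward()
```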
what's your intuition when
you compare generation and verification
or generation and
ranking is is ranking way easier than
generation my intuition would just say
yeah it should be like this is kind
of going going back
to like if you if you believe P does not
equal NP then there's this massive class
of problems that are much much easier to
verify given a proof than actually
proving it I wonder if the same thing
will prove P not equal to NP or P equal
to NP that would be that would be really
cool that'd be whatever a Fields
Medal by AI who gets the credit another
open philosophical
question I'm
I'm I'm actually surprisingly curious
what like a good bet for when
an AI will get the Fields Medal will
be I actually don't know is this Aman's specialty
uh I don't know what Aman's bet here
is oh sorry Nobel Prize or Fields Medal
first Fields Medal Fields Medal level Fields
Medal I think Fields Medal comes first
well you would say that of course but
it's also this like isolated system you
can verify and no sure like I don't even
know if I you don't need to do have much
more I felt like the path to get to IMO
was a little bit more clear because it
already could get a few IMO problems and
there are a bunch of like there's a
bunch of low-hanging fruit given the
literature at the time of like what what
tactics people could take I think I'm
much less versed in the space of theorem
proving now and so yeah less intuition
about how close we are to solving these
really really hard open problems so you
think you'll be feels mod first it won't
be like in U physics or in oh 100% I
think I I think I think that's probably
more likely like it's probably much more
likely that it'll get in yeah yeah yeah
well I think it goes to like I don't
know like BSD which is the Birch and Swinnerton-Dyer
conjecture like the Riemann hypothesis or
any one of these like hard hard math
problems which just like actually really
hard it's sort of unclear what the path
to to get even a solution looks like
like we we don't even know what a path
looks like let alone um and you don't
buy the idea that this is like an
isolated system and you can actually you
have a good reward system and
uh it feels like it's easier to train
for that I think we might get a Fields
Medal before AGI I think I mean I'd be
very
happy be very happy but I don't know if
I I think 2028 to
2030 Fields Medal Fields Medal all right
it's uh it feels like forever from now
given how fast things have been going um
speaking of how fast things have been
going let's talk about scaling laws so
for people who don't know uh maybe it's
good to talk about this
whole uh idea of scaling laws what are
they where do things stand and where do
you think things are going I think it
was interesting the original scaling
laws paper by open AI was slightly wrong
because I think of some uh issues they
had with uh learning rate schedules uh
and then Chinchilla showed a more
correct version and then from then
people have again kind of deviated from
doing the compute-optimal thing because
people people start now optimizing more
so for uh making the thing work really
well given a given an inference budget
and I think there are a lot more
Dimensions to these curves than what we
originally used of just compute number
of uh parameters and data like inference
compute is is the obvious one I think
context length is another obvious one so
if you care like let's say you care
about the two things of inference
compute and and then uh context window
maybe the thing you want to train is
some kind of SSM because they're much
much cheaper and faster at super super
long context and even if maybe it has 10x
worse scaling properties during training
meaning you have to spend 10x more
compute to train the thing to get the
same same level of capabilities um it's
worth it because you care most about
that inference budget for really long
context windows so it'll be interesting
to see how people kind of play with all these dimensions
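The compute-optimal point referenced here is often summarized with two rules of thumb: training compute C ≈ 6·N·D, and the Chinchilla finding of roughly 20 training tokens per parameter. The helper below just turns a compute budget into that ratio; the numbers are ballpark illustrations, not exact fits from the paper.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training-compute budget into (params, tokens) using C ~ 6*N*D and D ~ 20*N."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A 1e24-FLOP budget, purely for illustration: roughly ~90B params / ~1.8T tokens.
params, tokens = chinchilla_optimal(1e24)
print(f"~{params:.2e} parameters, ~{tokens:.2e} tokens")
```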
so yeah I mean you
speak to the multiple Dimensions
obviously the original conception was
just looking at the variables of the
size of the model as measured by
parameters and the size of the data as
measured by the number of tokens and
looking at the ratio of the two yeah and
it's it's kind of a compelling notion
that there is a number or at least a
minimum and it seems like one was
emerging um do you still believe that
there is a kind of bigger is
better I mean I think bigger is
certainly better for just raw
performance and raw intelligence and raw
intelligence I think the the path that
people might take is I'm particularly
bullish on distillation and like yeah
how many knobs can you turn to if we
spend like a ton ton of money on
training like get the most capable uh
cheap model right like really really
caring as much as you can because like
the the the naive version of caring as
much as you can about inference time
Compu is what people have already done
with like the Llama models are just
overtraining the hell out of 7B models
um on way way way more tokens than is Chinchilla-
optimal right but if you really care
about it maybe the thing to do is what Gemma
did which is let's not
just train on tokens let's literally
train on
minimizing the KL divergence
with the distribution of Gemma 27B
right so knowledge distillation there um
and you're spending the compute of
literally training this 27 billion model
uh billion parameter model on all these
tokens just to get out this I don't know
smaller model and the distillation gives
just a faster model smaller means faster
yeah distillation in theory is um I
think getting out more signal from the
data that you're training on and it's
like another it's it's perhaps another
way of getting over not like completely
over but like partially helping with the
data wall where like you only have so
much data to train on let's like train
this really really big model on all
these tokens and we'll distill it into
this smaller one and maybe we can get
more signal uh per token uh for this for
this much smaller model than we would
have originally if we trained it
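A minimal sketch of the kind of logit distillation being described, assuming PyTorch: train a smaller student to minimize the KL divergence between its per-token distribution and a frozen teacher's. This shows the generic objective, not Gemma's actual training recipe; the shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary at every token position."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Illustrative shapes: (batch, sequence_length, vocab_size).
student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)   # produced by the frozen, larger teacher
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```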
so if I gave you $10 trillion how would
you spend it I mean you can't buy
an island or whatever um how would you
allocate it in terms of improving the
big model
versus maybe paying for HF in the RLHF or
yeah I think there's a lot of these
secrets and details about training these
large models that I I I just don't know
and are only privy to the large labs and
the issue is I would waste a lot of that
money if I even attempted this because I
wouldn't know those things uh suspending
a lot of disbelief and assuming like you
had the
knowhow um and operate or or if you're
saying like you have to operate with
like the The Limited information you
have now no no no actually I would say
you swoop in and you get all the
information all the little
characteristics all the little
parameters all the all the parameters
that Define how the thing is trained mhm
if we look
and how to invest money for the next 5
years in terms of maximizing what you
called raw intelligence I mean isn't the
answer like really simple you just you
just try to get as much compute as
possible like like at the end of the day
all all you need to buy is the gpus and
then the researchers can find find all
the all like they they can sort of you
you can tune whether you want between a
big model or a small model like well
this gets into the question of like are
you really limited by compute and money
or are you limited by these other things
and I'm more prone to Arvid's
belief that we're sort of idea
limited but there's always that like but
if you have a lot of computes you can
run a lot of experiments so you would
run a lot of experiments versus like use
that compute to train a gigantic model I
would but I I do believe that we are
limited in terms of ideas that we have I
think yeah because even with all this
compute and like you know all the data
you could collect in the world than you
really are ultimately limited by not
even ideas but just like really good
engineering like even with all the
capital in the world would you really be
able to assemble like there aren't that
many people in the world who really can
like make the difference here um and and
there's so much work that goes into
research that is just like pure really
really hard engineering work um as like
a very kind of handwavy example if you
look at the original Transformer paper
you know how much work was kind of
joining together a lot of these really
interesting Concepts embedded in the
literature versus then going in and
writing all the code like maybe the Cuda
kernels maybe whatever else I don't know
if it ran on gpus or tpus originally
such that it actually saturated the
GPU performance right getting Noam Shazeer
to go in and do all this code right
and Noam is like probably one of the
best engineers in the world or maybe
going a step further like the next
generation of models having these things
like getting model parallelism to work and
scaling it on like you know thousands of
or maybe tens of thousands of like V100s
which I think GPT-3 may have been um
there's just so much engineering effort
that has to go into all of these things
to make it work um if you really brought
that cost down
to like you know maybe not zero but just
made it 10x easier made it super easy
for someone with really fantastic ideas
to immediately get to the version of
like the new architecture they dreamed
up that is like getting 50 40% uh
utilization on the gpus I think that
would just speed up research by a ton I
mean I think I think if if you see a
clear path to Improvement you you should
always sort of take the low hanging
fruit first right and I think probably
open eye and and all the other labs it
did the right thing to pick off the low
hanging fruit where the low hanging
fruit is like sort
of you could scale up to a GPT-4.25
scale um and you just keep scaling
and and like things things keep getting
better and as long as like you there's
there's no point of experimenting with
new ideas when like everything
everything is working and you should
sort of bang on and try try to get as
much juice out as possible
and then and then maybe maybe when you
really need new ideas for I think I
think if you're if you're spending $10
trillion you probably want to spend some
you know then actually like reevaluate
your ideas like probably your idea
Limited at that point I think all of us
believe new ideas are probably needed to
get you know all the way there to
AGI
and all of us also probably believe
there exist ways of testing out those
ideas at smaller
scales um and being fairly confident
that they'll play out it's just quite
difficult for the labs in their current
position to dedicate their very limited
research and Engineering talent to
exploring all these other ideas when
there's like this core thing that will
probably improve performance um for some
like decent amount of
time yeah but also these big Labs like
winning so they're just going wild
okay so how uh big question looking out
into the future you're now at the the
center of the programming world how do
you think programming the nature of
programming changes in the next few
months in the next year in the next two
years the next 5 years 10 years I think
we're really excited about a future
where the programmer is in the driver's
seat for a long time and you've heard us
talk about this a little bit but one
that
emphasizes speed and agency for the
programmer and control the ability to
modify anything you want to modify the
ability to iterate really fast on what
you're
building
and this is a little different I think
than where some people um are are
jumping to uh in the space where I think
one idea that's captivated people is can
you talk to your um computer can you
have it build software for you as if
you're talking to like an engineering
department or an engineer over slack and
can it just be this this sort of
isolated text box and um part of the
reason we're not excited about that is
you know some of the stuff we've talked
about with latency but then a big piece
a reason we're not excited about that is
because that comes with giving up a lot
of control it's much harder to be really
specific when you're talking in the text
box and um if you're necessarily just
going to communicate with a thing like
you would be communicating with an
engineering department you're actually
abdicating tons of tons of really
important decisions um to this bot um
and this kind of gets at fundamentally
what engineering is um I think that some
some people who are a little bit more
removed from engineering might think of
it as you know the spec is completely
written out and then the engineers just
come and they just Implement and it's
just about making the thing happen in
code and making the thing um exists um
but I think a lot of the the best
engineering the engineering we
enjoy um involves tons of tiny micro
decisions about what exactly you're
building and about really hard
trade-offs between you know speed and
cost and all the other uh things
involved in a system and uh we want as
long as humans are actually the ones
making you know designing the software
and the ones um specifying what they
want to be built and it's not just like
company run by all AIS we think you'll
really want the human in the
driver's seat um dictating these decisions
and so there's the jury still out on
kind of what that looks like I think
that you know one weird idea for what
that could look like is it could look
like you kind of you can control the
level of abstraction you view a codebase
at and you can point at specific parts
of a codebase that um like maybe you
digest a code Base by looking at it in
the form of pseudo code and um you can
actually edit that pseudo code too and
then have changes get made down at the
the sort of formal programming level and
you keep the like you know you can
gesture at any piece of logic uh in your
software component of programming you
keep the inflow text editing component
of programming you keep the control of
you can even go down into the code you
can go at higher levels of abstraction
while also giving you these big
productivity gains it would be nice if
you can go up and down the the
abstraction stack yeah and there are a
lot of details to figure out there
that's sort of a fuzzy idea time will
tell if it actually works but these
these principles of of control and speed
in the human and the driver seat we
think are really important um we think
for some things like Arvid mentioned
before for some styles of programming
you can kind of hand it off chatbot style
you know if you have a bug that's really
well specified but that's not most of
programming and that's also not most of
the programming we think a lot of people
value uh what about like the fundamental
skill of programming there's a lot of
people
like young people right now kind of
scared like thinking because they like
love programming but they're scared
about like will I be able to have a
future if I pursue this career path do
you think the very skill of programming
will change fundamentally I actually
think this is a really really exciting
time to be building software yeah like
we remember what programming was like in
you know 2013 2012 whatever it was um
and there was just so much more cruft and
boilerplate and and you know looking up
something really gnarly and you know
that stuff still exists it's definitely
not at zero but programming today is way
more fun than back then um it's like
we're really getting down to the the
Delight concentration and all all the
things that really draw people to
programming like for instance this
element of being able to build things
really fast and um speed and also
individual control like all those are
just being turned up a ton um and so I
think it's just going to be I think it's
going to be a really really fun time for
people who build software um I think
that the skills will probably change too
I I think that people's taste and
creative ideas will be magnified and it
will be less
about maybe less a little bit about
boilerplate text editing maybe even a
little bit less about carefulness which
I think is really important today if
you're a programmer I think it'll be a
lot more fun what do you guys think I
agree I'm I'm very excited to be able to
change like just what one thing that
that happened recently was like we
wanted to do a relatively big migration
to our codebase we were using async
local storage in Node.js which is
known to be not very performant and we
wanted to migrate to our context object
and this is a big migration it affects
the entire code base and Sualeh and I
spent I don't know five days uh working
through this even with today's AI tools
and I am really excited for a future
where I can just show a couple of
examples and then the AI applies that to
all of the locations and then it
highlights oh this is a new example like
what should I do and then I show exactly
what to do there and then that can be
done in like 10 minutes uh and then you
can iterate much much faster then you
can then you don't have to think as much
up front and stand at the
blackboard and like think exactly like
how are we going to do this because the
cost is so high but you can just try
something first and you realize oh this
is not actually exactly what I want and
then you can change it instantly again
after and so yeah I think being a
programmer in the future is going to be
a lot of fun yeah I I really like that
point about it feels like a lot of the
time with programming there are two ways
you can go about it one is like you
think really hard carefully upfront
about the best possible way to do it and
then you spend your limited time of
engineering to actually implement it uh
but I much prefer just getting in the
code and like you know taking a crack at
seeing how it how how it kind of lays
out and then
iterating really quickly on that that
feels more fun um yeah like just
speaking to generating the boiler plate
is great so you just focus on the
difficult design nuanced difficult
design decisions migration I feel like
this is this is a cool one like it seems
like large language models able to
basically translate from one programming
language to another or like translate
like migrate in the general sense of
what migrate is um but that's in the
current moment so I mean the fear has to
do with like okay as these models get
better and better then you're doing less
and less creative decisions and is it
going to kind of move to a place where
it's uh you're operating in the design
space of natural language where natural
language is the main programming
language and I guess I could ask that by
way of advice like if somebody's
interested in programming now what do
you think they should
learn like to say you guys started in
some
Java and uh I forget the oh some PHP PHP
Objective C Objective C there you go um
I mean in the end we all know JavaScript
is going to
win uh and not typescript it's just it's
going to be like vanilla JavaScript it's
just going to eat the world and maybe a
little bit of PHP and I mean it also
brings up the question of like I think
Don Knuth has this idea that some percent of
the population is Geeks and like there's
a particular kind of psychology in mind
required for programming and it feels
like more and more that expands the
kind of person that is able to
do great programming might
expand I think different people do
programming for different reasons but I
think the true maybe like the best
programmers um are the ones that really
love just like absolutely love
programming for example there are folks on
our team who
literally when they're they get back
from work they go and then they boot up
cursor and then they start coding on
their side projects for the entire night
and they stay till 3:00 a.m. doing that
um and when they're sad they
say I just really need to
code and I I I think like you know
there's there's that level of programmer
where like this Obsession and love of
programming um I think makes really the
best programmers and I think the these
types of people
will really get into the details of how
things work I guess the question I'm
asking that exact programmer I think about
that
person when the super tab
the super awesome praise be to the tab
succeeds you keep pressing tab that
person on the team loves cursor tab
more than anybody else right yeah and
it's also not just like like pressing
tab is like the just pressing tab that's
like the easy way to say it in the The
Catch catchphrase you know uh but what
you're actually doing when you're
pressing tab is that you're you're
injecting intent uh all the time while
you're doing it you're you're uh
sometimes you're rejecting it sometimes
you're typing a few more characters um
and and that's the way that you're um
you're sort of shaping the things that's
being created and I I think programming
will change a lot to just what is it
that you want to make it's sort of
higher bandwidth the communication to
the computer just becomes higher and
higher bandwidth as opposed to like like
just typing is much lower bandwidth than
than communicating intent I mean this
goes to your uh
Manifesto titled engineering genius we
are an applied research lab building
extraordinarily productive human AI
systems So speaking to this like hybrid
element mhm uh to start we're building
the engineer of the future a human AI
programmer that's an order of magnitude
more effective than any one engineer
this hybrid engineer will have
effortless control over their code base
and no low entropy keystrokes they will
iterate at the speed of their judgment
even in the most complex systems using a
combination of AI and human Ingenuity
they will outsmart and out engineer the
best pure AI systems we are a group of
researchers and Engineers we build
software and models to invent at the
edge of what's useful and what's
possible our work has already improved
the lives of hundreds of thousands of
programmers
and on the way to that we'll at least
make programming more fun so thank you
for talking today thank you thanks for
having us thank you thank you thanks for
listening to this conversation with
Michael swall Arvid and Aman to support
this podcast please check out our
sponsors in the description and now let
me leave you with a random funny and
perhaps profound programming quote I saw
on
Reddit nothing is as permanent as a
temporary solution that
works thank you for listening and hope
to see you next time