Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
By Lex Fridman
Summary
## Key takeaways
- **AI will change programming's core nature.** The future of programming involves AI-assisted coding, shifting the focus from boilerplate to high-level design and nuanced decision-making, empowering programmers with greater agency and speed. (36:41, 01:12:23)
- **Cursor's fork of VS Code enables rapid AI innovation.** By forking VS Code, Cursor can deeply integrate AI capabilities, allowing for faster iteration and the development of novel features that extensions alone cannot achieve. (10:34:14, 11:34:17)
- **AI's role: predicting and executing programmer intent.** Cursor aims to predict a programmer's next action, from code completion to multi-line edits and even navigating between files, by making AI a seamless partner in the coding process. (16:11:15, 17:37:39)
- **Optimizing AI for coding requires specialized models.** Custom models, like those using Mixture of Experts (MoE) and speculative edits, are crucial for achieving the low latency and high accuracy needed for AI-powered coding assistance. (19:40:13, 34:31:35)
- **AI struggles with bug detection, needs better training.** While AI excels at code generation, it currently struggles with bug detection due to a lack of relevant training data, highlighting the need for specialized models and potentially synthetic data generation. (33:24:27, 41:11:16)
- **Human-AI collaboration amplifies engineering capabilities.** The future of programming lies in a hybrid engineer who leverages AI for speed and scale while retaining human judgment for complex design decisions and trade-offs, leading to more effective and creative problem-solving. (47:44:48, 02:17:37)
Topics Covered
- The future of programming is zero entropy actions.
- Why are AI models so bad at finding bugs?
- Formal verification will replace unit tests.
- AI should be a partner, not an order-taker.
Full Transcript
the following is a conversation with the
founding members of the cursor team
Michael Truell, Sualeh Asif, Arvid Lunnemark,
and Aman Sanger. Cursor is a code editor
based on VS Code that adds a lot of
powerful features for AI assisted coding
it has captivated the attention and
excitement of the programming and AI
communities so I thought this is an
excellent opportunity to dive deep into
the role of AI in programming this is a
super technical conversation that is
bigger than just about one code editor
it's about the future of programming and
in general the future of human AI
collaboration in designing and
Engineering complicated and Powerful
systems. This is the Lex Fridman Podcast. To
support it please check out our sponsors
in the description and now dear friends
here's Michael, Sualeh, Arvid, and
Aman all right this is awesome we have
Michael, Aman, Sualeh, and Arvid here from the
cursor team first up big ridiculous
question what's the point of a code
editor so the the code editor is largely
the place where you build software and
today or for a long time that's meant
the place where you text edit uh a
formal programming language and for
people who aren't programmers the way to
think of a code editor is like a really
souped up word processor for programmers
where the reason it's it's souped up is
code has a lot of structure and so the
the quote unquote word processor the
code editor can actually do a lot for
you that word processors you know sort
of in the writing space haven't been
able to do for for people editing text
there and so you know that's everything
from giving you visual differentiation
of like the actual tokens in the code to
so you can like scan it quickly to
letting you navigate around the code
base sort of like you're navigating
around the internet with like hyperlinks
you're going to sort of definitions of
things you're using to error checking um
to you know to catch rudimentary B
um and so traditionally that's what a
code editor has meant and I think that
what a code editor is is going to change
a lot over the next 10 years um as what
it means to build software maybe starts
to look a bit different I I think also
code editors should just be fun yes that is
very important that is very important
and it's actually sort of an underated
aspect of how we decide what to build
like a lot of the things that we build
and then we we try them out we do an
experiment and then we actually throw
them out because they're not fun and and
so a big part of being fun is like being
fast a lot of the time fast is fun yeah
fast
is uh yeah that should be a
t-shirt like like
fundamentally I think one of the things
that draws a lot of people to to
building stuff on computers is this like
insane integration speed where you know
in other disciplines you might be sort
of gated by resources or the
ability even the ability you know to get
a large group together and coding is
just like amazing thing where it's you
and the computer and uh that alone you
can you can build really cool stuff
really quickly so for people don't know
cursor is this super cool new editor
that's a fork of vs code it would be
interesting to get your kind of
explanation of your own journey of
editors how did you I think all of you
are were big fans of vs code with
Copilot how did you arrive to VS Code
and how did that lead to your journey
with cursor yeah um
so I think a lot of us well all of us
originally Vim users. Pure Vim? Pure
Vim, yeah, no Neovim, just pure Vim in a
terminal. And at least for myself
it was around the time that Copilot
came out so
2021 that I really wanted to try it so I
went into vs code the only platform the
only code editor in which it was
available
and even though I you know really
enjoyed using Vim just the experience of
co-pilot with with vs code was more than
good enough to convince me to switch and
so that kind of was the default until we
started working on cursor and uh maybe
we should explain what Copilot does it's
like a really nice
autocomplete it suggests as you start
writing a thing it suggests one or two
or three lines how to complete the thing
and there's a fun experience in that you
know like when you have a close
friendship and your friend completes
your
sentences like when it's done well
there's an intimate feeling uh there's
probably a better word than intimate but
there's a there's a cool feeling of like
holy it gets
me now and then there's an unpleasant
feeling when it doesn't get you uh and
so there's that that kind of friction
but I would say for a lot of people the
feeling that it gets me over powers that
it doesn't and I think actually one of
the underrated aspects of GitHub Copilot
is that even when it's wrong it's
like a little bit annoying but it's not
that bad because you just type another
character and then maybe then it gets
you or you type another character and
then then it gets you so even when it's
wrong it's not that bad yeah you you can
sort of iterate iterate and fix it I
mean the other underrated part of uh
Copilot for me sort of was just the first
real real AI product it's like the first
language model consumer product so
Copilot was kind of like the first killer
app for LLMs yeah and like the beta was
out in 2021 right okay mhm uh so what's
the the origin story of cursor so around
2020 the scaling laws papers came out
from OpenAI and that was a moment
where this looked like clear predictable
progress for the field where even if we
didn't have any more ideas looked like
you could make these models a lot better
if you had more compute and more data
uh by the way we'll probably talk uh for
three to four hours on on the topic of
scaling laws but just just to summarize
it's a paper and a set of papers and set
of ideas that say bigger might be better
for model size and data size in the in
the realm of machine learning it's
bigger is better but predictably
better okay this is another topic of
conversation but anyway yeah so around
that time for some of us there were like
a lot of conceptual conversations about
what's this going to look like what's
the the story going to be for all these
different knowledge worker Fields about
how they're going to be um made better U
by this technology getting better and
then um I think there were a couple of
moments where like the theoretical gains
predicted in that paper uh started to
feel really concrete and it started to
feel like a moment where you could
actually go and not you know do a PhD if
you wanted to work on uh do useful work
in AI actually felt like now there was
this this whole set of systems one could
built that were really useful and I
think that the first moment we already
talked about a little bit which was
playing with the early bit of copell
like that was awesome and magical um I
think that the next big moment where
everything kind of clicked together was
actually getting early access to GPT-4 so
sort of end of 2022 was when we were um
tinkering with that model and the Step
Up in capabilities felt enormous and
previous to that we had been working on
a couple of different projects we had
been um because of co-pilot because of
scaling laws because of our prior
interest in the technology we had been
uh tinkering around with tools for
programmers but things that are like
very specific so you know we were
building tools for uh Financial
professionals who have to work within a
Jupyter notebook or like you know playing
around with can you do static analysis
with these models and then the Step Up
in GPT-4 felt like look that really made
concrete the theoretical gains that um
we had predicted before felt like you
could build a lot more just immediately
at that point in time and
also if we were being consistent it
really felt like um this wasn't just
going to be a point solution thing this
was going to be all of programming was
going to flow through these models it
felt like that demanded a different type
of programming environment, a different
type of programming and so we set off to
build that that sort of larger Vision
around then there's one that I
distinctly remember. So my roommate is an IMO gold winner, and there's a competition in the US called the Putnam, which is sort of the IMO for college people. It's this math competition, and he's exceptionally good. So Shengtong and Aman, I remember, sort of June of 2022, had this bet on whether, like, by 2024 June or July you were going to win a gold medal in the IMO with, like, models.

IMO is the International Math Olympiad.

Yeah, IMO is the International Math Olympiad. And so Arvid and I both, you know, also competed in it, so it was sort of personal, and I remember thinking this is just not going to happen. Even though I sort of believed in progress, I thought, you know, IMO gold, like, Aman is just delusional. And to be honest, I mean, I was, to be clear, very wrong, but that was maybe the most prescient bet in the group.

So the new results from DeepMind, it turned out that you were correct.

Well, technically not.

Technically incorrect, but one point away.

Aman was very
enthusiastic about this stuff back then
and before Aman had this like scaling
laws T-shirt that he would walk around
with where it had like charts and like
the formulas on it oh so you like felt
the AI or you felt the scaling yeah I i
l remember there was this one
conversation uh I had with with Michael
where before I hadn't thought super
deeply and critically about scaling laws
and he kind of posed the question why
isn't scaling all you need or why isn't
scaling going to result in massive gains
in progress and I think I went through
like the like the stages of grief there
is anger denial and then finally at the
end just thinking about it uh acceptance
um and I think I've been quite hopeful
and uh optimistic about progress since I
think one thing I'll caveat is I think
it also depends on like which domains
you're going to see progress like math
is a great domain because especially
like formal theorem proving because you
get this fantastic signal of actually
verifying if the thing was correct and
so this means something like RL can work
really really well and I think like you
could have systems that are perhaps very
superhuman in math and still not
technically have AGI okay so can we take
it off all the way to cursor mhm and
what is cursor it's a fork of vs code
and VS Code is one of the most popular
editors for a long time like everybody
fell in love with it everybody left Vim
I left Emacs for it
sorry
uh uh so it unified in some fun
fundamental way the uh the developer
community and then that you look at the
space of things you look at the scaling
laws AI is becoming amazing and you
decided okay it's not enough to
just write an extension over VS
code because there's a lot of
limitations to that we we need if AI is
going to keep getting better and better
and better we need to really like
rethink how the the AI is going to be
part of the editing process and so you
decided to Fork vs code and start to
build a lot of the amazing features
we'll be able to to to talk about but
what was that decision like because
there's a lot of extensions including
Copilot, for VS Code, that are doing sort of AI
type stuff what was the decision like to
just Fork vs code so the decision to do
an editor seemed kind of self-evident to
us for at least what we wanted to do and
Achieve because when we started working
on the editor the idea was these models
are going to get much better their
capabilities are going to improve and
it's going to entirely change how you
build software both in a you will have
big productivity gains but also radical
in how like the active building software
is going to change a lot and so you're
very limited in the control you have
over a code editor if you're a plugin to
an existing coding environment um and we
didn't want to get locked in by those
limitations we wanted to be able to um
just build the most useful stuff okay
well then the natural question
is you know VS Code is kind of with
copilot a competitor so how do you win
is is it basically just the speed and
the quality of the features yeah I mean
I think this is a space that is quite
interesting perhaps quite unique where
if you look at previous Tech waves
maybe there's kind of one major thing
that happened and unlocked a new wave of
companies but every single year every
single model capability uh or jump you
get model capabilities you now unlock
this new wave of features things that
are possible especially in programming
and so I think in AI programming being
even just a few months ahead let alone a
year ahead makes your product much much
much more useful I think the cursor a
year from now will need to make the
cursor of today look
Obsolete and I think you know Microsoft
has done a number of like fantastic
things but I don't think they're in a
great place to really keep innovating
and pushing on this in the way that a
startup can just rapidly implementing
features and and push yeah like and and
kind of doing the research
experimentation
necessary um to really push the ceiling
I don't I don't know if I think of it in
terms of features as I think of it in
terms of like capabilities for for
programmers it's that like you know as
you know the new o1 model came out and
I'm sure there are going to be more more
models of different types like longer
context and maybe faster like there's
all these crazy ideas that you can try
and hopefully 10% of the crazy ideas
will make it into something kind of cool
and useful and uh we want people to have
that sooner to rephrase it's like an
underrated fact is we're making it for
ourselves when we started cursor you really
felt this frustration that you know
models you could see models getting
better uh but the Copilot experience had
not changed it was like man these these
guys like the ceiling is getting higher
like why are they not making new things
like they should be making new things
they should be like you like like
where's where's where's all the alpha
features there there were no Alpha
features it was like uh I I'm sure it it
was selling well I'm sure it was a great
business but it didn't feel I I'm I'm
one of these people that really want to
try and use new things and was just
there's no new thing for like a very
long while yeah it's interesting uh I
don't know how you put that into words
but when you compare a cursor with
copilot copilot pretty quickly became
started to feel stale for some reason
yeah I think one thing that I think uh
helps us is that we're sort of doing it
all in one where we're developing the
the ux and the way you interact with the
model and at the same time as we're
developing like how we actually make the
model give better answers so like how
you build up the The Prompt or or like
how do you find the context and for a
cursor tab like how do you train the
model um so I think that helps us to
have all of it like sort of like the
same people working on the entire
experience on end yeah it's like the the
person making the UI and the person
training the model like sit to like 18
ft away so often the same person even
yeah often often even the same person so
you you can you create things that that
are sort of not possible if you're not
you're not talking you're not
experimenting and you're using like you
said cursor to write cursor of course oh
yeah yeah well let's talk about some of
these features let's talk about the all-
knowing the all-powerful praise be to the
tab so the you know autocomplete on
steroids basically so what how does tab
work what is tab to highlight and
summarize it a high level I'd say that
there are two things that curser is
pretty good at right now there there are
other things that it does um but two
things it it helps programmers with one
is this idea of looking over your
shoulder and being like a really fast
colleague who can kind of jump ahead of
you and type and figure out what you're
what you're going to do next and that
was the original idea behind that was
kind kind of the kernel the idea behind
a good autocomplete was predicting what
you're going to do next you can make
that concept even more ambitious by not
just predicting the characters after
cursor but actually predicting the next
entire change you're going to make the
next diff the next place you're going to
jump to um and the second thing cursor
is is pretty good at right now too is
helping you sometimes jump ahead of the
AI and tell it what to do and go from
instructions to code and on both of
those we've done a lot of work on making
the editing experience for those things
ergonomic um and also making those
things smart and fast one of the things
we really wanted was we wanted the model
to be able to edit code for us uh that
was kind of a wish and we had multiple
attempts at it before before we had a
sort of a good model that could edit
code for
you U then after after we had a good
model I think there there have been a
lot of effort to you know make the
inference fast for you know uh having
having a good good
experience and uh we've been starting to
incorporate I mean Michael sort of
mentioned this like ability to jump to
different places and that jump to
different places I think came from a
feeling off you know once you once you
accept an edit um was like man it should
be just really obvious where to go next
it's like it's like I I made this change
the model should just know that like the
next place to go to is like 18 lines
down like uh if you're if you're a Vim
user you could press 18jj or
whatever but like why why even why am I
doing this like the model the model
should just know it and then so so the
idea was you you just press tab it would
go 18 lines down and then make it would
show you show you the next edit and you
would press tab so it's just you as long
as you could keep pressing Tab and so
the internal competition was how many
tabs can we make them pressive once you
have like the idea uh more more uh sort
of abstractly the the thing to think
about is sort of like once how how how
are the edit sort of zero zero entropy
so once You' sort of expressed your
intent and the edit is there's no like
new bits of information to finish your
thought but you still have to type some
characters to like make the computer
understand what you're actually thinking
then maybe the model should just sort of
read your mind and and all the zero
entropy bits should just be like tabbed
away yeah that was that was sort of the
abstract there's this interesting thing
where if you look at language model loss
on different domains um I believe the bits per byte, which is kind of character-normalized loss, for code is lower than language, which means in general there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable.
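For reference, a common way to define the bits-per-byte metric mentioned here (conventions differ slightly between papers) is the total cross-entropy converted from nats to bits and normalized by the raw byte count rather than the token count:

$$\text{bits per byte} \;=\; \frac{1}{N_{\text{bytes}}\,\ln 2}\sum_{i=1}^{N_{\text{tokens}}} \mathcal{L}_i$$

where $\mathcal{L}_i$ is the cross-entropy loss in nats on the $i$-th token. Normalizing by bytes instead of tokens is what makes losses comparable across tokenizers and domains.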
And this is I
think even magnified when you're not
just trying to autocomplete code but
predicting what the user is going to do
next in their editing of existing code
and so you know the goal of Cursor Tab is
let's eliminate all the low entropy
actions you take inside of the editor
when the intent is effectively
determined let's just jump you forward
in time skip you forward well well
what's the intuition and what's the
technical details of how to do next
cursor prediction that jump that's not
that's not so intuitive I think to
people yeah I think I can speak to a few
of the details on how how to make these
things work they're incredibly low
latency so you need to train small
models on this on this task um in
particular they're incredibly pre-fill
token hungry what that means is they
have these really really long prompts
where they see a lot of your code and
they're not actually generating that
many tokens and so the perfect fit for
that is using a sparse model, meaning an MoE
model. So that was kind of one
breakthrough we made that
substantially improved its performance
at longer context the other being um a
variant of speculative decoding that we
we kind of built out called speculative
edits um these are two I think important
pieces of what make it quite high
quality um and very fast okay so MoE,
mixture of experts the input is huge the
output is small yeah okay so like what
what what else can you say about how to
make it like caching play a role in this
caching plays a huge role because you're dealing with this many input tokens, if every single keystroke that you're typing in a given line you had to rerun the model on all those tokens passed in, you're just going to, one, significantly degrade latency, two, you're going to kill your GPUs with load. So you need to design the actual prompts used for the model such that they're caching-aware, and then, yeah, you need to reuse the KV cache across requests just so that you're spending less work, less compute.
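To make "caching-aware" concrete, here is a minimal sketch under assumed conventions (the tag format and function are hypothetical, not Cursor's actual prompts): keep the large, slowly changing context in a stable prefix and put the volatile part last, so the KV cache computed on the previous keystroke can be reused.

```python
# Hypothetical sketch of a caching-aware prompt layout (not Cursor's actual
# prompts). The large, slowly changing context lives in a stable prefix; only
# the short suffix changes as the user types, so an inference server that
# caches KV states by prompt prefix reuses almost all of the prefill work on
# every keystroke.

def build_tab_prompt(file_context: str, recent_edits: str, current_line: str):
    """Return (stable_prefix, volatile_suffix) for one completion request."""
    stable_prefix = (
        "<file_context>\n" + file_context + "\n</file_context>\n"
        "<recent_edits>\n" + recent_edits + "\n</recent_edits>\n"
    )
    volatile_suffix = "<current_line>" + current_line + "</current_line>"
    return stable_prefix, volatile_suffix
```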
uh again what are the things that tab is
supposed to be able to do kind of in the
near term just to like sort of Linger on
that generate code like fill empty
space Also edit code across multiple
lines yeah and then jump to different
locations inside the same file yeah and
then like hopefully jump to different
files also so if you make an edit in one
file and maybe maybe you have to go
maybe you have to go to another file to
finish your thought it should it should
go to the second file also yeah and then
the full generalization is like next
next action prediction like sometimes
you need to run a command in the
terminal and it should be able to
suggest the command based on the code
that you wrote too um or sometimes you
actually need to like it suggest
something but you you it's hard for you
to know if it's correct because you
actually need some more information to
learn like you need to know the type to
be able to verify that it's correct and
so maybe it should actually take you to
a place that's like the definition of
something and then take you back so that
you have all the requisite knowledge to
be able to accept the next completion Al
also providing the human the knowledge
yes right yeah can you integrate like I
just uh gotten to know a guy named ThePrimeagen
who I believe has SSH so you can
order coffee via SSH
oh yeah oh we did that we did that uh so
can that also the model do that like
feed you and like yeah and provide you
with caffeine okay so that's the general
framework yeah and the the magic moment
would be
if it is programming is this weird
discipline where um sometimes the next
five minutes not always but sometimes
the next five minutes of what you're
going to do is actually predictable from
the stuff you've done recently and so
can you get to a world where that next 5
minutes either happens by you
disengaging and it taking you through or
maybe a little bit more of just you
seeing Next Step what it's going to do
and you're like okay that's good that's
good that's good that's good and you can
just sort of tap tap tap through these
big changes as we're talking about this
I should mention like one of the really
cool and noticeable things about cursor
is that there's this whole diff
interface situation going on so like the
model suggests with uh with the red and
the green of like here's how we're going
to modify the code and in the chat
window you can apply and it shows you
the diff and you can accept the diff so
maybe can you speak to whatever
direction of that we'll probably have
like four or five different kinds of
diffs uh so we we have optimized the
diff for for the autocomplete so that
has a different diff interface
than uh then when you're reviewing
larger blocks of code and then we're
trying to optimize uh another diff thing
for when you're doing multiple different
files uh and and sort of at a high level
the difference is for
when you're doing autocomplete it should
be really really fast to
read uh actually it should be really
fast to read in all situations but in
autocomplete it sort of you're you're
really like your eyes focused in one
area you you can't be in too many you
the humans can't look in too many
different places so you're talking about
on the interface side like on the
interface side so it currently has this
box on the side so we have the current
box and if it tries to delete code in
some place and tries to add other code
it tries to show you a box on the you
can maybe show it if we pull it up on
cursor. comom this is what we're talking
about so that it was like three or four
different attempts at trying to make
this this thing work where first the
attempt was like these blue crossed out
line so before it was a box on the side
it used to show you the code to delete
by showing you like uh like Google doc
style you would see like a line through
it then you would see the the new code
that was super distracting and then we
tried many different you know there was
there was sort of deletions there was
trying to Red highlight then the next
iteration of it which is sort of funny
Would you would hold the on Mac the
option button so it would it would sort
of highlight a region of code to show
you that there might be something coming
uh so maybe in this example like the
input and the value uh would get would
all get blue and the blue would to
highlight that the AI had a suggestion
for you uh so instead of directly
showing you the thing it would show you
that the AI it would just hint that the
AI had a suggestion and if you really
wanted to see it you would hold the
option button and then you would see the
new suggestion then if you release the
option button you would then see your
original code mhm so that's by the way
that's pretty nice but you have to know
to hold the option button yeah I by the
way I'm not a Mac User but I got it it
was it was it's a button I guess you
people
it's h you know it's again it's just
it's just nonintuitive I think that's
the that's the key thing and there's a
chance this this is also not the final
version of it I am personally very
excited for
um making a lot of improvements in this
area like uh we we often talk about it
as the verification problem where U
these diffs are great for small edits uh
for large edits or like when it's
multiple files or something it's um
actually
a little bit prohibitive to to review
these diffs and uh uh so there are like
a couple of different ideas here like
one idea that we have is okay you know
like parts of the diffs are important
they have a lot of information and then
parts of the diff um are just very low
entropy they're like exam like the same
thing over and over again and so maybe
you can highlight the important pieces
and then gray out the the not so
important pieces or maybe you can have a
model that uh looks at the the diff and
and sees oh there's a likely bug here I
will like Mark this with a little red
squiggly and say like you should
probably like review this part of the
diff um and ideas in in that vein I
think are exciting yeah that's a really
fascinating space of like ux design
engineering so you're basically trying
to guide the human programmer through
all the things they need to read and
nothing more yeah like optimally yeah
and you want an intelligent model to do
it. Like, currently, diffs, the diff algorithms, are
like, they're just like
normal algorithms uh there's no
intelligence uh there's like
intelligence that went into designing
the algorithm but then there there's no
like you don't care if the if it's about
this thing or this thing uh and so you
want a model to to do this so I think
the the the general question is like M
these models are going to get much
smarter as the models get much smarter
uh the the changes they will be able to
propose are much bigger so as the
changes gets bigger and bigger and
bigger the humans have to do more and
more and more verification work it gets
more and more more hard like it's just
you need you need to help them out it
sort of I I don't want to spend all my
time reviewing
code uh can you say a little more across
multiple files div yeah I mean so GitHub
tries to solve this right with code
review when you're doing code review
you're reviewing multiple diffs across
multiple files but like Arvid said
earlier I think you can do much better
than code review you know code review
kind of sucks like you spend a lot of
time trying to grok this code that's
often quite unfamiliar to you and it
often like doesn't even actually catch
that many bugs and I think you can
signific significantly improve that
review experience using language models
for example using the kinds of tricks
that AR had described of maybe uh
pointing you towards the regions that
matter
um I think also if the code is produced
by these language models uh and it's not
produced by someone else like the code
review experience is designed for both
the reviewer and the person that
produced the code in the case where the
person that produced the code is a
language model you don't have to care
that much about their experience and you
can design the entire thing around the
reviewer such that the reviewer's job is
as fun as easy as productive as possible
um and I think that that feels like the
issue with just kind of naively trying
to make these things look like code
review I think you can be a lot more
creative and and push the boundary and
what's possible just one one idea there
is I think ordering matters generally
when you review a PR you you have this
list of files and you're reviewing them
from top to bottom but actually like you
actually want to understand this part
first because that came like logically
first and then you want understand the
next part and um you don't want to have
to figure out that yourself you want a
model to guide you through the thing and
is the step of creation going to be more
and more natural language is the goal
versus with actual uh I think sometimes
I don't think it's going to be the case
that all of programming will be natural
language and the reason for that is you
know if I'm pair programming with Sualeh
and Sualeh is at the computer and the
keyboard uh and sometimes if I'm like
driving I want to say to Sualeh hey
like implement this function and that
that works and then sometimes it's just
so annoying to explain to Sualeh what I
want him to do and so I actually take
over the keyboard and I show him I I
write like part of the example and then
it makes sense and that's the easiest
way to communicate and so I think that's
also the case for AI like sometimes the
easiest way to communicate with the AI
will be to show an example and then it
goes and does the thing everywhere else
or sometimes if you're making a website
for example the easiest way to show to
the a what you want is not to tell it
what to do but you know drag things
around or draw things um and yeah and
and like maybe eventually we will get to
like brain machine interfaces or
whatever and can of like understand what
you're thinking and so I think natural
language will have a place I think it
will not definitely not be the way most
people program most of the time I'm
really feeling the AGI with this editor
uh it feels like there's a lot of
machine learning going on underneath
tell tell me about some of the ml stuff
that makes it all work. Cursor really
works via this ensemble of custom models
that that that we've trained alongside
you know the frontier models that are
fantastic at the reasoning intense
things and so cursor tab for example is
is a great example of where you can
specialize this model to be even better
than even Frontier models if you look at
evals on the task we set it at
the other domain which it's kind of
surprising that it requires custom
models but but it's kind of necessary
and works quite well is in apply
um
so I think these models are like the
frontier models are quite good at
sketching out plans for code and
generating like rough sketches of like
the change but
actually creating diffs is quite hard um
for frontier models, for your frontier
models. Like you try to do this with
Sonnet, with o1, any frontier model, and it
it really messes up stupid things like
counting line numbers um especially in
super super large file
um and so what we've done to alleviate
this is we let the model kind of sketch
out this rough code block that indicates
what the change will be and we train a
model to then apply that change to the
file and we should say that apply is the
model looks at your code it gives you a
really damn good suggestion of what new
things to do and the seemingly for
humans trivial step of combining the two
you're saying is not so trivial contrary
to popular perception it is not a
deterministic algorithm yeah I I I think
like you see shallow copies of apply um
elsewhere and it just breaks like most
of the time because you think you can
kind of try to do some deterministic
matching and then it fails you know at
least 40% of the time and that just
results in a terrible product
experience um I think in general this
this regime of you are going to get
smarter models and like so one other
thing that apply lets you do is it lets
you use fewer tokens with the most
intelligent models uh this is both
expensive in terms of latency for
generating all these tokens um and cost
so you can give this very very rough
sketch and then have your smaller models
go and implement it because it's a much
easier task to implement this very very
sketched out code and I think that this
this regime will continue where you can
use smarter and smarter models to do the
planning and then maybe the
implementation details uh can be handled
by the less intelligent ones perhaps
you'll have you know maybe o1, maybe
it'll be even more capable models,
given an even higher-level plan that is
kind of recursively applied by Sonnet
and then the apply model.
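As a rough sketch of the two-stage regime just described (a big model plans, a smaller apply model merges the change), here is what the flow might look like, with a made-up call_model placeholder rather than any real API:

```python
# Illustrative sketch of the plan-then-apply regime described above.
# `call_model` is a generic placeholder, not a real API; model names are made up.

def propose_and_apply(file_text: str, instruction: str, call_model) -> str:
    # Stage 1: a large frontier model sketches the change. The sketch may elide
    # unchanged code and get details like line numbers wrong.
    rough_sketch = call_model(
        model="frontier-model",
        prompt=(
            f"File:\n{file_text}\n\nTask: {instruction}\n"
            "Sketch only the code that should change."
        ),
    )
    # Stage 2: a smaller, specialized "apply" model merges the sketch into the
    # original file and emits the full, updated file contents.
    return call_model(
        model="apply-model",
        prompt=(
            f"Original file:\n{file_text}\n\nProposed change:\n{rough_sketch}\n"
            "Rewrite the entire file with the change applied."
        ),
    )
```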
Maybe we should talk about how to make
it fast. Yeah, I feel like fast is always
an interesting detail. Fast is good. Yeah, how
do you make it fast yeah so one big
component of making it it fast is
speculative edits so speculative edits
are a variant of speculative decoding
and maybe be helpful to briefly describe
speculative decoding um with speculative
decoding what you do is you you can kind
of take advantage of the fact that you
know most of the time and I I'll add the
caveat that this applies when you're
memory-bound in language model
generation: if you process multiple
tokens at once, it is faster than
generating one token at a time so this is
like the same reason why if you look at
tokens per second uh with prompt tokens
versus generated tokens it's much much
faster for prompt tokens um so what we
do is instead of using what speculative
decoding normally does which is using a
really small model to predict these
draft tokens that your larger model
would then go in and and verify um with
code edits we have a very strong prior
of what the existing code will look like
and that prior is literally the same
exact code so you can do is you can just
feed chunks of the original code back
into the into the model um and then the
model will just pretty much agree most
of the time that okay I'm just going to
spit this code back out and so you can
process all of those lines in parallel
and you just do this with sufficiently
many chunks and then eventually you'll
reach a point of disagreement where the
model will now predict text that is
different from the ground truth original
code it'll generate those tokens and
then we kind of will decide after enough
tokens match
uh the original code to re start
speculating in chunks of code what this
actually ends up looking like is just a
much faster version of normal editing
code so it's just like it looks like a
much faster version of the model
rewriting all the code so just we we can
use the same exact interface that we use
for for diffs but it will just stream
down a lot faster and then and then the
advantage is that while it's streaming you can just also be reviewing, start reviewing the code before it's done, so there's no big loading screen. So maybe that is part of the advantage: the human can start reading before the thing is done.
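Here is a minimal sketch of speculative edits as described, assuming a hypothetical model interface (greedy_tokens and generate_until_resync are placeholders, not a real API): the original file serves as the draft, chunks of it are verified in parallel, and normal one-token-at-a-time generation happens only where the rewrite diverges.

```python
# Minimal sketch of speculative edits, using the original code as the draft.
# `model.greedy_tokens` and `model.generate_until_resync` are hypothetical
# placeholders: the first scores every draft position in one batched forward
# pass, the second falls back to normal decoding until the output lines up
# with the original file again and returns the new alignment position.

def speculative_edit(model, prompt, original_tokens, chunk_size=64):
    out, i = [], 0
    while i < len(original_tokens):
        draft = original_tokens[i:i + chunk_size]
        # One forward pass: what the model would emit at each draft position.
        predicted = model.greedy_tokens(prompt, out, draft)
        agree = 0
        while agree < len(draft) and predicted[agree] == draft[agree]:
            agree += 1
        out.extend(draft[:agree])   # accepted tokens stream out almost instantly
        i += agree
        if agree < len(draft):
            # Divergence: decode normally until the output re-synchronizes with
            # the original code, then resume speculating in chunks.
            new_tokens, i = model.generate_until_resync(prompt, out, original_tokens, i)
            out.extend(new_tokens)
    return out
```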
I think the interesting riff here
is something like like speculation is a
fairly common idea nowadays it's like
not only in language models I mean
there's obviously speculation in CPUs
and there's there like speculation for
databases and like speculation all over
the place let me ask the sort of the
ridiculous question of uh which llm is
better at coding GPT Claude who wins in
the context of programming and I'm sure
the answer is much more Nuance because
it sounds like every single part of this
involves a different
model yeah I think they there's no model
that Pareto dominates others, meaning it
is better in all categories that we
think matter the categories being
speed
um ability to edit code ability to
process lots of code long context you
know a couple of other things and kind
of coding
capabilities the one that I'd say right
now is just kind of net best is Sonnet. I
think this is a consensus opinion.
o1 is really interesting and it's really
good at reasoning so if you give it
really hard uh programming interview
style problems or lead code problems it
can do quite quite well on them um but
it doesn't feel like it kind of
understands your rough intent as well as
Sonnet
does like if you look at a lot of the
other frontier models um one qualm I have
is it feels like they're not necessarily
overfit, I'm not saying they train on
benchmarks um but they perform really
well in benchmarks relative to kind of
everything that's kind of in the middle
so if you tried on all these benchmarks
and things that are in the distribution
of the benchmarks they're valuated on
you know they'll do really well but when
you push them a little bit outside of
that, Sonnet is I think the one that
kind of does best at at kind of
maintaining that same capability like
you kind of have the same capability in
The Benchmark as when you try to
instruct it to do anything with coding
what another ridiculous question is the
difference between the normal
programming experience versus what
benchmarks represent like where do
benchmarks fall short do you think when
we're evaluating these models by the way
that's like a really really hard it's
like like critically important detail
like how how different like benchmarks
are versus where is like real coding
where real
coding it's not interview style coding
it's you're you're doing these you know
humans are saying like half broken
English sometimes and sometimes you're
saying like oh do what I did
before sometimes you're saying uh you
know go add this thing and then do this
other thing for me and then make this UI
element and then you know it's it's just
like a lot of things are sort of context
dependent
you really want to like understand the
human and then do do what the human
wants as opposed to sort of this maybe
the the way to put it is sort of
abstractly is uh the interview problems
are
very well
specified they lean a lot on
specification while the human stuff is
less
specified yeah I think that this sort of question
is both complicated by what
Sualeh just mentioned and then also to
what Aman was getting into is that even
if you like you know there's this
problem of like the skew between what
can you actually model in a benchmark
versus uh real programming and that can
be sometimes hard to encapsulate because
it's like real programming is like very
messy and sometimes things aren't super
well specified what's correct or what
isn't but then uh it's also doubly hard
because of this public Benchmark problem
and that's both because public
benchmarks are sometimes kind of Hill
climbed on then it's like really really
hard to also get the data from the
public benchmarks out of the models and
so for instance like one of the most
popular like agent benchmarks, SWE-bench,
is really really contaminated
in the training data of uh these
Foundation models and so if you ask
these foundation models to do a SWE-bench
problem but you actually don't give
them the context of a codebase they can
like hallucinate the right file paths
they can hallucinate the right function
names um and so the the it's it's also
just the public aspect of these things
is tricky yeah like in that case it
could be trained on the literal issues
or pull requests themselves and maybe
the labs will start to do a better job
um or they've already done a good job at
decontaminating those things but they're
not going to emit the actual training
data of the repository itself like these
are all like some of the most popular
python repositories like SymPy is one
example I don't think they're going to
handicap their models on SymPy and all
these popular Python repositories in
order to get uh true evaluation scores
in these benchmarks yeah I think that
given the dearth in benchmarks
um there have been like a few
interesting crutches that uh places that
build systems with these models or build
these models actually use to get a sense
of are they going in the right direction
or not and uh in a lot of places uh
people will actually just have humans
play with the things and give
qualitative feedback on these um like
one or two of the foundation model
companies they they have people who
that's that's a big part of their role
and you know internally we also uh you
know qualitatively assess these models
and actually lean on that a lot in
addition to like private evals that we
have. It's like the vibe.
The vibe, yeah, the vibe benchmark, human benchmark.
You pull in the humans to do a vibe check. Yeah, okay, I
mean that's that's kind of what I do
like just like reading online forums and
Reddit and X just like well I don't know
how
to properly load in people's opinions
because they'll say things like I feel
like Claude or gpt's gotten Dumber or
something they'll say I feel like
and then I sometimes feel like that too
but I wonder if it's the model's problem
or mine yeah with Claude there's an
interesting take I heard where I think
AWS has different chips um and I I
suspect they have slightly different numerics than Nvidia GPUs, and someone speculated that Claude's degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus whatever was running on Anthropic's GPUs. I interview a
bunch of people that have conspiracy
theories so I'm glad spoke spoke to this
conspiracy well it's it's not not like
conspiracy theory as much as they're
just they're like they're you know
humans humans are humans and there's
there's these details and you know
you're
doing like this crazy amount of flops
and you know chips are messy and man you
can just have bugs like bugs are it's
it's hard to overstate how how hard bugs
are to avoid what's uh the role of a
good prompt in all this see you mention
that benchmarks have
really uh structured well formulated
prompts what what should a human be
doing to maximize success and what's the
importance of what the humans you wrote
a blog post on you called it prompt
design yeah uh I think it depends on
which model you're using and all of them
are likly different and they respond
differently to different prompts but um
I think the original GPT-4 and the
original sort of breed of models last
year they were quite sensitive to the
prompts and they also had a very small
context window and so we have all of
these pieces of information around the
codebase that would maybe be relevant in
the prompt like you have the docs you
have the files that you add you have the
conversation history and then there's a
problem like how do you decide what you
actually put in the prompt and when you
have a a limited space and even for
today's models even when you have long
context filling out the entire context
window means that it's slower it means
that sometimes a model actually gets
confused and some models get more
confused than others and we have this
one system internally that we call Priompt
which helps us with that a little bit um
and I think it was built for the era
before where we had
8,000 uh token context Windows uh and
it's a little bit similar to when you're
making a website you you sort of you you
want it to work on mobile you want it to
work on a desktop screen and you have
this uh Dynamic information which you
don't have for example if you're making
like designing a print magazine you have
like you know exactly where you can put
stuff but when you have a website or
when you have a prompt you have these
inputs and then you need to format them so they
will always work even if the input is
really big then you might have to cut
something down uh and and and so the
idea was okay like let's take some
inspiration what's the best way to
design websites well um the thing that
we really like is is react and the
declarative approach where you um you
use jsx in in in JavaScript uh and then
you declare this is what I want and I
think this has higher priority or like
this has higher Z index than something
else um and
then you have this rendering engine in
web design it's it's like Chrome and uh
in our case it's a Priompt renderer uh which
then fits everything onto the page and
and so you declaratively decide what you
want and then it figures out what you
want um and and so we have found that to
be uh quite helpful and I think the role
of it has has sort of shifted over time
um where initially was to fit to these
small context Windows now it's really
useful because you know it helps us with
splitting up the data that goes into the
prompt and the actual rendering of it
and so um it's easier to debug because
you can change the rendering of the
prompt and then try it on Old prompts
because you have the raw data that went
into the prompt and then you can see did
my change actually improve it for for
like this entire eval set so do you
literally prompt with jsx yes yes so it
kind of looks like react there are
components like we have one component
that's a file component and it takes in
like the cursor
like usually there's like one line where
the cursor is in your file and that's
like probably the most important line
because that's the one you're looking at
and so then you can give priorities so
like that line has the highest priority
and then you subtract one for every line
that uh is farther away and then
eventually when it's rendered it figures
out how many lines can I actually fit
and it centers around that thing that's
amazing. Yeah, and you can do other fancy things where, if you have lots of code blocks from the entire codebase, you could use retrieval and things like embedding and re-ranking scores to add priorities for each of these components.
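As a toy illustration of the declarative, priority-based rendering idea (a minimal Python sketch rather than JSX, and not the actual library Cursor uses), each piece of context declares a priority and the renderer keeps the highest-priority pieces that fit the token budget, emitting them in their original order:

```python
# Toy sketch of priority-based prompt rendering (not Cursor's actual library).
# Each piece of context declares a priority; the renderer keeps the most
# important pieces that fit the token budget and emits them in document order,
# so the same prompt "source" degrades gracefully as the budget shrinks.

from dataclasses import dataclass

@dataclass
class Piece:
    text: str
    priority: float  # higher = more important

def count_tokens(text: str) -> int:
    # Crude stand-in tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)

def render(pieces: list[Piece], budget: int) -> str:
    order = {id(p): i for i, p in enumerate(pieces)}
    kept, used = [], 0
    for piece in sorted(pieces, key=lambda p: -p.priority):  # most important first
        cost = count_tokens(piece.text)
        if used + cost <= budget:
            kept.append(piece)
            used += cost
    kept.sort(key=lambda p: order[id(p)])  # restore original (document) order
    return "\n".join(p.text for p in kept)

# Example: prioritize file lines by distance from the cursor line, as described.
lines = ["def f():", "    x = 1", "    y = 2", "    return x + y"]
cursor_line = 2
pieces = [Piece(t, priority=100 - abs(i - cursor_line)) for i, t in enumerate(lines)]
print(render(pieces, budget=6))  # keeps the lines closest to the cursor
```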
So should humans, when they ask questions, also try to use something like that? Like, would it be beneficial to write JSX in the prompt, or should the whole idea be that it's loose and messy?

I think our goal is
kind of that you should just uh do
whatever is the most natural thing for
you and then our job is to figure out
how do we actually like retrieve the
relevant things so that your thing
actually makes sense well this is sort
of the discussion I had with Aravind of
perplexity is like his whole idea is
like you should let the person be as
lazy as he want but like yeah that's a
beautiful thing but I feel like you're
allowed to ask more of programmers right
so like if you say just do what you want
I mean humans are lazy there's a kind of
tension between just being lazy versus
like provide more is uh be prompted
almost like the system
pressuring you or inspiring you to be
articulate not in terms of the grammar
of the sentences but in terms of the
depth of thoughts that you convey inside
the prompts I think even as a
system gets closer to some level of
perfection often when you ask the model
for something you just are not not
enough intent is conveyed to know what
to do and there are like a few ways to
resolve that intent one is the simple
thing of having model just ask you I'm
not sure how to do these parts based on
your query could you clarify that um I
think the other could be
maybe if you there are five or six
possible Generations given the
uncertainty present in your query so far
why don't we just actually show you all
of those and let you pick
them. How hard is it for the model to
choose to talk back versus
generating? That's hard, sort of, like
how to deal with the
uncertainty do I do I choose to ask for
more information to reduce the ambiguity
so I mean one of the things we we do is
um it's like a recent addition is try to
suggest files that you can add so and
while you're typing uh one can guess
what the uncertainty is and maybe
suggest that like you know maybe maybe
you're writing your API
and uh we can guess using the
commits uh that you've made previously
in the same file that the client and the
server is super useful and uh there's
like a hard technical problem of how do
you resolve it across all commits which
files are the most important given your
current prompt and we still sort of uh
initial version is rolled out and I'm
sure we can make it much more
accurate uh it's it's it's very
experimental but then the idea is we show
you like do you just want to add this
file this file this file also to tell
you know the model to edit those files
for you uh because if if you're maybe
you're making the API like you should
also edit the client and the server that
is using the API and the other one
resolving the API and so that would be
kind of cool as both there's the phase
where you're writing the prompt and
there's before you even click enter
maybe we can help resolve some of the
uncertainty to what degree do you use uh
agentic approaches how useful are agents
we think agents are really really cool
like I I I think agents is like uh it's
like resembles sort of like a human it's
sort of like the like you can kind of
feel that it like you're getting closer
to AGI because you see a demo where um
it acts as as a human would and and it's
really really cool I think um agents are
not yet super useful for many things
they I think we're we're getting close
to where they will actually be useful
and so I think uh there are certain
types of tasks where having an agent
would be really nice like I would love
to have an agent for example if like we
have a bug where you sometimes can't
command C and command V uh inside our
chat input box and that's a task that's
super well specified I just want to say
like in two sentences this does not work
please fix it and then I would love to
have an agent that just goes off does it
and then uh a day later I I come back
and I review the the thing you mean it
goes finds the right file yeah it finds
the right files it like tries to
reproduce the bug it like fixes the bug
and then it verifies that it's correct
and this is could be a process that
takes a long time um and so I think I
would love to have that uh and then I
think a lot of programming like there is
often this belief that agents will take
over all of programming um I don't think
we think that that's the case because a
lot of programming a lot of the value is
in iterating or you don't actually want
to specify something upfront because you
don't really know what you want until
youve seen an initial version and then
you want to iterate on that and then you
provide more information and so for a
lot of programming I think you actually
want a system that's instant that gives
you an initial version instantly back
and then you can iterate super super
quickly uh what about something like
that recently came out, Replit Agent, that
does also like setting up the
development environment installing
software packages configuring everything
configuring the databases and actually
deploying the app yeah is that also in
the set of things you dream about I
think so I think that would be really
cool for for certain types of
programming uh it it would be really
cool is that within scope of cursor yeah
we're aren't actively working on it
right now um but it's definitely like we
want to make the programmer's life
easier and more fun and some things are
just really tedious and you need to go
through a bunch of steps and you want to
delegate that to an agent um and then
some things you can actually have an
agent in the background while you're
working like let's say you have a PR
that's both backend and front end and
you're working in the front end and then
you can have a background agent that
does some work and figures out kind of what
you're doing and then when you get to
the backend part of your PR then you
have some like initial piece of code
that you can iterate on um and and so
that that would also be really cool one
of the things we already talked about is
speed but I wonder if we can just uh
Linger on that some more in the the
various places that uh the technical
details involved in making this thing
really fast so every single aspect of
cursor most aspects of cursor feel
really fast like I mentioned the apply
is probably the slowest thing and for me
sorry, the
pain. I know, it's a pain, it's a pain
that we're feeling and we're working on
fixing it uh
yeah I mean it says something that
something that feels I don't know what
it is like 1 second or two seconds that
feels slow that means that's actually
shows that everything else is just
really really fast um so is there some
technical details about how to make some
of these models so how to make the chat
fast how to make the diffs fast is there
something that just jumps to mind yeah I
mean so we can go over a lot of the
strategies that we use one interesting
thing is cache warming um and so what you
can do is, as the user is typing, you can know that you're probably going to use some piece of context, and you can know that before the user's done typing. So, you know, as we discussed before, reusing the KV cache results in lower latency, lower cost across requests. So as a user starts typing, you can immediately warm the cache with, let's say, the current file contents, and then when they've pressed enter, there are very few tokens it actually has to prefill and compute before starting the generation. This will significantly lower the time to first token.
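A small sketch of what cache warming can look like, with a hypothetical inference client (the session and method names are made up, not a real API):

```python
# Hypothetical sketch of cache warming. While the user is still typing, send a
# prefill-only request containing the context we already know will be in the
# prompt; by the time they press enter, only the short final message needs to
# be prefilled, so time to first token drops. `client` and its methods are
# illustrative placeholders, not a real API.

async def on_user_typing(client, session_id: str, file_contents: str):
    # Warm the KV cache for the stable context now; generate nothing yet.
    await client.prefill(
        session=session_id,
        prompt=f"<file>\n{file_contents}\n</file>\n",
    )

async def on_user_submit(client, session_id: str, user_message: str):
    # Only the user's message still needs prefilling before generation starts.
    return await client.generate(
        session=session_id,
        prompt_suffix=f"<user>{user_message}</user>\n<assistant>",
    )
```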
Can you explain how the KV cache works?

Yeah, so the way
Transformers work, one of the mechanisms that allows Transformers to not just independently look at each token, but to see previous tokens, is the keys and values in attention. And generally the way attention works is you have, at your current token, some query, and then you have all the keys and values of all your previous tokens, which are some kind of representation that the model stores internally of all the previous tokens in the prompt
and like by default when you're doing a
chat the model has to for every single
token do this forward pass through the
entire uh model that's a lot of Matrix
multiplies that happen and that is
really really slow instead if you have
already done that and you stored the
keys and values and you keep that in the
GPU then when I'm let's say I have
stored it for the last N tokens, if I now want to compute the output token for the N-plus-one-th token, I don't need to pass those first N tokens through the
entire model because I already have all
those keys and values and so you just
need to do the forward pass through that
last token and then when you're doing
attention you're reusing those keys and values that have been computed, which is the only kind of sequential, or sequentially dependent, part of the Transformer.
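As a rough sketch of what the KV cache buys you during decoding (illustrative only, not Cursor's or any provider's actual code): each new token computes its query, key, and value once, appends the key and value to the cache, and attends over everything stored instead of re-running earlier tokens through the model.

```python
# Minimal single-head illustration of KV caching during decoding (illustrative only).
import numpy as np

d = 64  # head dimension (arbitrary for the sketch)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []  # keys/values of previous tokens; in practice these stay on the GPU

def decode_step(x):
    """x: embedding of the newest token only, shape (d,)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                          # store this token's key and value once
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(q @ K.T / np.sqrt(d))     # attend over everything cached
    return scores @ V                          # earlier tokens never go through the model again

for _ in range(5):
    out = decode_step(rng.normal(size=d))
```

Prefill (processing the prompt) fills this cache in one pass, which is why warming it early makes the first generated token arrive sooner.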
is there like higher level caching, like caching of the prompts or that kind of stuff, that could help? yeah, there are other types of caching
you can kind of do um one interesting
thing that you can do for cursor tab
is you can basically predict ahead as if
the user would have accepted the
suggestion and then trigger another uh
request
and so then you've cached, you've done the speculative... it's a mix of
speculation and caching right because
you're speculating what would happen if
they accepted it and then you have this
value that is cached, this
suggestion and then when they press tab
the next one would be waiting for them
immediately. it's a kind of clever heuristic slash trick that uses higher level caching, and it can feel fast despite there not actually being any changes in the model.
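A rough sketch of that speculation-plus-caching trick (hypothetical names; Cursor's real request and cache plumbing is obviously more involved): while a suggestion is on screen, fire a second request as if the user had already accepted it, so the follow-up suggestion is waiting when they hit tab.

```python
# Hypothetical sketch of speculating one step past the current suggestion.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
speculative_cache = {}

def request_completion(document_text: str) -> str:
    """Stand-in for the actual model call (assumed interface)."""
    return " next_edit()"

def show_suggestion(document_text: str) -> str:
    suggestion = request_completion(document_text)
    # speculate: pretend the user already pressed tab and warm up the next suggestion now
    accepted_doc = document_text + suggestion
    speculative_cache[accepted_doc] = executor.submit(request_completion, accepted_doc)
    return suggestion

def on_tab_accepted(document_text: str, suggestion: str) -> str:
    accepted_doc = document_text + suggestion
    future = speculative_cache.pop(accepted_doc, None)
    # if the speculation matched, the next suggestion is already (nearly) ready
    return future.result() if future else request_completion(accepted_doc)
```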
and if you can make the KV cache smaller, one of the advantages you get is maybe you can speculate even more, maybe you can get, say, 10
things that you know could be useful I
like, predict the next 10, and then it's possible the user hits one of the 10. it's a much higher chance that the user hits one of those than the exact one that you show them. maybe they type another character and sort of hit something else
in the cache yeah so there's there's all
these tricks where um the the general
phenomena here is uh I think it's it's
also super useful for RL is you know may
maybe a single sample from the model
isn't very good but if you
predict like 10 different things uh
turns out that one of the 10 uh that's
right is the probability is much higher
there's these pass@k curves, and part of what RL does is you can exploit this pass@k phenomenon to make many different predictions.
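For reference, the standard unbiased pass@k estimator (the convention popularized by the Codex paper) captures the phenomenon being described: even if any single sample is unlikely to be right, the chance that at least one of k samples is right can be much higher.

```python
# Unbiased pass@k estimate: draw n samples, c of them are correct, budget is k.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=20, k=1))   # ~0.20: a single sample is right 20% of the time
print(pass_at_k(n=100, c=20, k=10))  # ~0.89: one of ten samples is right far more often
```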
and one way to think about this is the model sort
of knows internally has like has some
uncertainty over like which of the key
things is correct or like which of the
key things does the human want. when we RL our cursor tab model, one
of the things we're doing is we're
predicting which like which of the
hundred different suggestions the model produces is more amenable to humans, like which of them do humans like more
than other things uh maybe maybe like
there's something with the model can
predict very far ahead versus like a
little bit and maybe somewhere in the
middle and and you just and then you can
give a reward to the things that humans
would like more and sort of punish the things that humans wouldn't like, and sort
of then train the model to Output the
suggestions that humans would like more
you have these RL loops that are very useful, that exploit these pass@k curves. Aman can maybe go into even more detail. yeah, it's a
little it is a little different than
speed um but I mean like technically you
tie it back in because you can get away
with a smaller model if you RL your smaller model and it gets the same performance as the bigger one. and Sualeh was mentioning stuff about
KV about reducing the size of your KV
cach there there are other techniques
there as well that are really helpful
for Speed um so kind of back in the day
like all the way two years ago uh people
mainly used multi-head attention, and I
think there's been a migration towards
more uh efficient attention schemes like
group query um or multiquery attention
and this is really helpful for then uh
with larger batch sizes being able to
generate the tokens much faster the
interesting thing here is um this now
has no effect on that uh time to First
token pre-fill speed uh the thing this
matters for is uh now generating tokens
and and why is that because when you're
generating tokens instead of uh being
bottlenecked by doing the super parallelizable matrix multiplies across all your tokens, you're bottlenecked, for long context with large batch sizes, by how quickly you can read those cached keys and values. and
so then how that that's memory bandwidth
and how can we make this faster we can
try to compress the size of these keys
and values so multiquery attention is
the most aggressive of these um where
normally with multi-head attention you
have some number of quote unquote attention heads and some number of query heads. multi-query just preserves the query heads and gets rid of all the key-value heads, so there's only one kind of key-value head and there's all the remaining query heads.
with group query um you instead you know
preserve all the query heads and then
your keys and values, there are fewer heads for the keys and values, but you're not reducing it to just one. but anyways, the whole
point here is you're just reducing the
size of your KV cache.
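A back-of-the-envelope way to see why fewer key-value heads shrink the cache; the numbers below are purely illustrative, not any particular model's real configuration.

```python
# KV cache size scales linearly with the number of KV heads (illustrative numbers only).
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; bytes_per_elem=2 assumes fp16 storage
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

cfg = dict(n_layers=32, head_dim=128, seq_len=8192, batch=8)
print("multi-head (32 KV heads):  ", kv_cache_gb(n_kv_heads=32, **cfg), "GB")
print("grouped-query (8 KV heads):", kv_cache_gb(n_kv_heads=8, **cfg), "GB")
print("multi-query (1 KV head):   ", kv_cache_gb(n_kv_heads=1, **cfg), "GB")
```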
and then there is MLA? yeah, MLA, multi-head latent attention. that's a
little more complicated and the way that
this works is it kind of turns the
entirety of your keys and values across
all your heads into this kind of one latent vector that is then kind of expanded at inference time. MLA is from this company called DeepSeek. it's
it's quite an interesting algorithm uh
maybe the key idea is, in both MQA and in other places, what you're doing is sort of reducing the number of KV heads. the
advantage you get from that is is you
know there's less of them but uh maybe
the theory is that you actually want a
lot of different uh like you want each
of the the keys and values to actually
be different so one way to reduce the
size is you keep
uh one big shared Vector for all the
keys and values and then you have
smaller vectors for every single token, so that you can store only the smaller thing, as some sort of low-rank reduction. and at the end, when you eventually want to compute the final thing, remember that you're memory bound,
which means that like you still have
some some compute left that you can use
for these things and so if you can
expand the um the latent vector
back out and and somehow like this is
far more efficient because you're reducing, for example, maybe by like 32x or something, the size of the vector that you're keeping. yeah, there's perhaps
some richness in having a separate uh
set of keys and values and query that
kind of pairwise match up, versus compressing that all into one, and that interaction at least.
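A very rough sketch of the low-rank idea just described (illustrative shapes only; DeepSeek's actual MLA formulation also handles rotary embeddings and per-head details differently): store one small latent per token, and spend a little of the spare compute at attention time expanding it back into keys and values.

```python
# Store a small per-token latent; expand it to full keys/values only when attention runs.
import numpy as np

d_model, d_latent, n_tokens = 1024, 64, 5
rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, d_latent))   # shared compression
W_up_k = rng.normal(size=(d_latent, d_model))   # expand latent -> keys
W_up_v = rng.normal(size=(d_latent, d_model))   # expand latent -> values

# what actually sits in the cache: one 64-dim latent per token instead of full K and V
latent_cache = [rng.normal(size=d_model) @ W_down for _ in range(n_tokens)]

# decoding is memory-bandwidth bound, so this extra matmul at attention time is relatively cheap
K = np.stack([z @ W_up_k for z in latent_cache])
V = np.stack([z @ W_up_v for z in latent_cache])
print(K.shape, V.shape)  # (5, 1024) (5, 1024), reconstructed from 64-dim latents
```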
okay, and all of that is dealing with being
memory bound yeah
and what I mean ultimately how does that
map to the user experience trying to get
the yeah the the two things that it maps
to is you can now make your cache a lot larger, because you've got less space allocated for the KV cache. you can maybe cache a lot more aggressively and a lot more things, so you get more cache hits,
which are helpful for reducing the time
to First token for the reasons that were
kind of described earlier and then the
second being when you start doing
inference with more and more requests
and larger and larger batch sizes you
don't see much of a slowdown in the speed of generating the tokens.
what it also allows you to make your
prompt bigger for certain yeah yeah so
like the basic the size of your KV cache
is uh both the size of all your prompts
multiply by the number of prompts being
processed in parallel so you could
increase either those Dimensions right
the batch size or the size of your
prompts without degrading the latency of
generating tokens Arvid you wrote a blog
post Shadow workspace iterating on code
in the background yeah so what's going
on uh so to be clear we want there to be
a lot of stuff stuff happening in the
background and we're experimenting with
a lot of things uh right now uh we don't
have much of that happening other than the cache warming, or figuring out the right context that goes into your Cmd+K prompts, for
example uh but the idea is if you can
actually spend computation in the
background then you can help um help the
user maybe like at a slightly longer
time Horizon than just predicting the
next few lines that you're going to make
but actually, in the next 10 minutes, what are you going to make? and by doing it in the background, you can spend more computation doing that. and so
the idea of the Shadow workspace that
that we implemented and we use it
internally for like experiments um is
that to actually get advantage of doing
stuff in the background you want some
kind of feedback signal to give give
back to the model because otherwise like
you can get higher performance by just
letting the model think for longer um
and and so like o1 is a good example of
that but another way you can improve
performance is by letting the model
iterate and get feedback and and so one
very important piece of feedback when
you're a programmer is um the language
server, which is this thing that exists for most different languages, and there's a separate language server per language. it can tell you, you know, you're using the wrong type here and then gives you an error, or it can allow
you to go to definition and sort of
understands the structure of your code
so language servers are extensions
developed by, like, there's a TypeScript language server developed by the TypeScript people, a Rust language server developed by the Rust people, and then they all interface over the language server
protocol to vs code so that vs code
doesn't need to have all of the
different languages built into vs code
but rather uh you can use the existing
compiler infrastructure for linting
purposes what it's for it's for linting
it's for going to definition uh and for
like seeing the the right types that
you're using uh so it's doing like type
checking also yes type checking and and
going to references um and that's like
when you're working in a big project you
you kind of need that if you if you
don't have that it's like really hard to
to code in a big project can you say
again how that's being used inside
cursor the the language server protocol
communication thing so it's being used
in cursor to show to the programmer just like in VS Code, but then the idea is you want to show that same information to the models, the AI models, and you want
to do that in a way that doesn't affect
the user because you wanted to do it in
background and so the idea behind the
shadow workspace was, okay, one way we can do this is we spawn a separate window of cursor that's hidden. so you can set this flag in Electron so it's hidden; there is a window, but you don't
actually see it and inside of this
window uh the AI agents can modify code
however they want um as long as they
don't save it because it's still the
same folder, and then they can get feedback from the linters and go to definition and iterate on their code.
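A hedged sketch of the kind of loop the shadow workspace enables (the function names here are assumptions for illustration, not Cursor's internals): the agent edits an in-memory copy, asks the language server for diagnostics, and keeps iterating until things are clean.

```python
# Illustrative feedback loop: edit in memory, check language-server diagnostics, repeat.
def propose_edit(task: str, text: str) -> str:
    """Stand-in for a model call that returns a modified file (assumed interface)."""
    return text

def get_diagnostics(text: str) -> list[str]:
    """Stand-in for querying the language server (lints, type errors) over LSP."""
    return []

def refine_in_shadow_workspace(task: str, file_text: str, max_iters: int = 5) -> str:
    current = file_text
    for _ in range(max_iters):
        current = propose_edit(task, current)   # the edit only ever exists in memory, never saved
        problems = get_diagnostics(current)     # same signal a human sees: lints, wrong types, etc.
        if not problems:
            break                               # clean: surface the result to the user
        task = task + "\nThe language server reports: " + "; ".join(problems)
    return current
```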
so, like, literally run everything in the background, as if... right? yeah, maybe
even run the code so that's the eventual
version okay that's what you want and a
lot of the blog post is actually about
how do you make that happen because it's
a little bit tricky you want it to be on
the user's machine so that it exactly
mirrors the user's environment
and then on Linux you can do this cool
thing where you can actually mirror the
file system and have the AI make changes
to the files and and it thinks that it's
operating on the file level but actually
that's stored in in memory and you you
can uh create this kernel extension to
to make it work um whereas on Mac and
windows it's a little bit more difficult
but it's a fun technical problem. one maybe hacky but interesting idea that I
like is holding a lock on saving and so
basically you can then have the language
model kind of hold the lock on on saving
to disk and then instead of you
operating in the ground truth version of
the files that are saved to disk, you actually are operating on what was the shadow workspace before, and these unsaved things that only exist in memory that you still get lint errors for, and
you can code in and then when you try to
maybe run code it's just like there's a
small warning that there's a lock and
then you kind of will take back the lock
from the language server if you're
trying to do things concurrently or from
the the shadow workspace if you're
trying to do things concurrently that's
such an exciting feature by the way. it's
a bit of a tangent but like to allow a
model to change files it's scary for
people but like it's really cool to be
able to just like let the agent do a set
of tasks and you come back the next day
and kind of observe like it's a
colleague or something like that yeah
yeah and I think there may be different
versions of like runability
where for the simple things where you're
doing things in the span of a few
minutes on behalf of the user as they're
programming it makes sense to make
something work locally in their machine
I think for the more aggressive things
where you're making larger changes that
take longer periods of time you'll
probably want to do this in some sandbox
remote environment and that's another
incredibly tricky problem of how do you
exactly reproduce or mostly reproduce to
the point of it being effectively
equivalent for running code the user's
environment which is remote remote
sandbox I'm curious what kind of Agents
you want for for coding oh do you want
them to find bugs do you want them to
like Implement new features like what
agents do you want so by the way when I
think about agents I don't think just
about coding. so for this particular podcast, in practice, there's video editing, and if you look in Adobe, there's a lot of code behind it. it's very poorly documented code, but you
can interact with premiere for example
using code and basically all the
uploading everything I do on YouTube
everything as you could probably imagine
I do all of that through code and so and
including translation and overdubbing
all this so I Envision all those kinds
of tasks so automating many of the tasks
that don't have to do directly with the
editing so that okay that's what I was
thinking about but in terms of coding I
would be fundamentally thinking about
bug
finding like many levels of kind of bug
finding and also bug finding like
logical bugs, not logical, like spiritual bugs or something, ones like sort of big directions of implementation, that kind of stuff. so let's stay on bug finding. yeah, I mean, it's really interesting that
these models are so bad at bug finding
uh when just naively prompted to find a
bug they're incredibly poorly calibrated
even the the smartest models exactly
even o1. how do you explain that?
is there a good
intuition I think these models are a
really strong reflection of the
pre-training distribution and you know I
do think they they generalize as the
loss gets lower and lower but I don't
think the the loss and the scale is
quite or the loss is low enough such
that they're like really fully
generalizing in code like the things
that we use these things for uh the
frontier models that that they're quite
good at are really code generation and
question answering these things exist in
massive quantities and pre-training with
all of the code on GitHub on the scale
of many many trillions of tokens and
questions and answers on things like
stack Overflow and maybe GitHub issues
and so when you try to push some of
these things that really don't exist uh
very much online like for example the
cursor tab objective of predicting the next edit given the edits done so far,
uh the brittleness kind of shows and
then bug detection is another great
example where there aren't really that
many examples of like actually detecting
real bugs and then proposing fixes um
and the models just kind of like really
struggle at it but I think it's a
question of transferring the model like
in the same way that you get this
fantastic transfer um from pre-trained
Models uh just on code in general to the
cursor tab objective uh you'll see a
very similar thing with generalized models that are really good at code, to bug detection. it just takes a
little bit of kind of nudging in that
direction like to be clear I think they
sort of understand code really well like
while they're being pre-trained like the
representation that's being built up
like almost certainly like you know
somewhere in the stream, the model knows that maybe there's something sketchy going on. it sort of has some sense of the sketchiness, but actually eliciting the sketchiness... part of it
is that humans are really calibrated on
which bugs are really important it's not
just actually saying there's something sketchy, it's: is it sketchy-and-trivial, or is it sketchy-like-you're-going-to-take-the-server-down? part of
it is maybe the cultural knowledge of uh
like why is a staff engineer a staff
engineer a staff engineer is is good
because they know that three years ago
like someone wrote a really you know
sketchy piece of code that took took the
server down, as opposed to, maybe it's like, you know, this thing is just an experiment, so
like a few bugs are fine like you're
just trying to experiment and get the
feel of the thing and so if the model
gets really annoying when you're writing
an experiment that's really bad but if
you're writing something for super
production, you're writing a database, right, you're writing code in Postgres or Linux or whatever, like you're Linus Torvalds, it's sort of unacceptable to have even an edge case. and just having the
calibration of
like how paranoid is the user like but
even then like if you're putting in a
maximum paranoia it still just like
doesn't quite get it yeah yeah yeah I
mean but this is hard for humans too to
understand what which line of code is
important which is not it's like you I
think one of your principles on the website says: if code can do a lot of damage, one should add a comment that says this line of code is dangerous, in all caps, repeated 10 times. no, you say it for every single line of code inside the function, you have to. and that's quite profound.
that says something about human beings
because the the engineers move on even
the same person might just forget how it can sink the Titanic. a single function like that, you might not intuit that quite clearly by looking at the single piece of code. yeah, and I think that
one is also uh partially also for
today's AI models where uh if you
actually write dangerous dangerous
dangerous in every single line like uh
the models will pay more attention to
that and will be more likely to find bugs in that region. that's actually
just straight up a really good practice
of a labeling code of how much damage
this can do yeah I mean it's
controversial like some people think
it's ugly. Sualeh? well, I actually think, in fact, one of the things I learned from Arvid
is you know like I sort of aesthetically
I don't like it but I think there's
certainly something where like it's it's
useful for the models and and humans
just forget a lot and it's really easy
to make a small mistake and cause
like bring down you know like just bring
down the server and like you like of
course we we like test a lot and
whatever but there there's always these
things that you have to be very careful
yeah like with just normal dock strings
I think people will often just skim it
when making a change and think oh this I
I know how to do this um and you kind of
really need to point it out to them so
that that doesn't slip through
yeah you have to be reminded that you
could do a lot of
damage that's like we don't really think
about that like yeah you think about
okay, how do I figure out how this works
so I can improve it you don't think
about the other direction that could
until we have formal verification
for everything then you can do whatever
you want and you you know for certain
that you have not introduced a bug if
the proof passes but concretely what do
you think that future would look like I
think people will just not write tests anymore, and the model will suggest,
like you write a function the model will
suggest a spec and you review the spec
and uh in the meantime a smart reasoning
model computes a proof that the implementation follows the spec. and I
think that happens for for most
functions don't you think this gets at a
little bit some of the stuff you were
talking about earlier with the
difficulty of specifying intent for what
you want with software um where
sometimes it might be because the intent
is really hard to specify it's also then
going to be really hard to prove that
it's actually matching whatever your
intent is like you think that spec is
hard to
generate yeah or just like for a given
spec maybe you can I think there is a
question of like can you actually do the
formal verification like that's like is
that possible I think that there's like
more to dig into there but then also
even if you have this spec, how do you... is the spec written in natural language? the spec would be formal,
but how easy would that be so then I
think that you care about things that
are not going to be easily well
specified in the spec language I see I
see. that would maybe be an argument against formal verification is all you need. yeah, the worry is there's this massive document replacing something like unit tests. sure. yeah, yeah.
I think you can probably also evolve the
the spec languages to capture some of
the things that they don't really
capture right now um but yeah I don't
know I think it's very exciting and
you're speaking not just about like
single functions you're speaking about
entire code bases I think entire code
bases is harder but that that is what I
would love to have and I think it should
be possible, because there's a lot of work recently where you can formally verify down to the hardware: you formally verify the C code, and then you formally verify through the GCC compiler, and then through the Verilog down to the hardware. and that's an incredibly big system,
but it actually works and I think big
code bases are are sort of similar in
that they're like multi-layered system
and um if you can decompose it and
formally verify each part then I think
it should be possible I think the
specification problem is a real problem
but how do you handle side effects or
how do you handle I guess external
dependencies like calling the stripe API
maybe Stripe would write a spec for their API. but you can't do this for everything. can you do this for everything you
use like how do you how do you do it for
if there are language models... like, maybe people use language models as primitives in the programs they write, and there's a dependence on it, and how do you now include that? I
think you might be able to still prove that. prove what about language
models I think it it feels possible that
you could actually prove that a language
model is aligned for example or like you
can prove that it actually gives the the
right answer um that's the dream yeah
that is I mean that's if it's possible
your I Have a Dream speech if it's
possible that that will certainly help
with you know uh making sure your code
doesn't have bugs and making sure AI
doesn't destroy all of human
civilization so the the full spectrum of
AI safety to just bug finding uh so you
said the models struggle with bug
finding. what's the hope? you know, my hope initially is, and I can let Michael chime in on it, but it was like this: it should, you know, first help with
the stupid bugs like it should very
quickly catch the stupid bugs like off
by-one errors, like sometimes you write
something in a comment and do the other
way it's like very common like I do this
I write like less than in a comment and
like I maybe write it greater than or
something like that and the model is
like yeah it looks sketchy like you sure
you want to do that. but eventually it should be able to catch harder bugs too. yeah, and I think that it's also
important to note that having good bug-finding models feels necessary
to get to the highest reaches of having
AI do more and more programming for you
where you're going to you know if the AI
is building more and more of the system
for you you need to not just generate
but also verify and without that some of
the problems that we've talked about
before with programming with these
models um will just become untenable um
so it's not just for humans like you
write a bug I write a bug find the bug
for me but it's also being able to to
verify the AI code and check it um is
really important yeah and then how do
you actually do this like we have had a
lot of contentious dinner discussions of
how do you actually train a bug model
but one very popular idea is, you know, it's potentially much easier to introduce a bug than to actually find the bug. and so you can train a model to introduce bugs in existing code, and then you can train a reverse bug model that can find bugs using this synthetic data. so that's one example.
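A toy sketch of that synthetic-data idea (the bug injector below is a crude stand-in; in the discussion it would itself be a model): corrupt known-good code, and keep the (buggy, original) pairs as training data for a reverse, bug-finding model.

```python
# Generate (buggy, original) training pairs by injecting bugs into known-good snippets.
def inject_bug(snippet: str) -> str:
    """Crude stand-in for a model that introduces a plausible bug (here: flip a comparison)."""
    return snippet.replace("<", ">", 1) if "<" in snippet else snippet

clean_snippets = [
    "if i < len(items):\n    process(items[i])",
]

training_pairs = []
for snippet in clean_snippets:
    buggy = inject_bug(snippet)
    if buggy != snippet:
        # the bug-finding model learns to map the corrupted code back to the original
        training_pairs.append({"input": buggy, "target": snippet})

print(training_pairs)
```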
but yeah, there are lots of other ideas for how to do this. you can also do
a bunch of work not even at the model
level of taking the biggest models and
then maybe giving them access to a lot
of information that's not just the code
like it's kind of a hard problem to like
stare at a file and be like where's the
bug and you know that's that's hard for
humans often right and so often you have
to to run the code and being able to see
things like traces and step through a
debugger um there's another whole
another Direction where it like kind of
tends toward that and it could also be
that there are kind of two different
product form factors here it could be
that you have a really specialty model
that's quite fast that's kind of running
in the background and trying to spot
bugs and it might be that sometimes sort
of to arvid's earlier example about you
know some nefarious input box bug might
be that sometimes you want to like
there's you know there's a bug you're
not just like checking hypothesis free
you're like this is a problem I really
want to solve it and you zap that with
tons and tons and tons of compute and
you're willing to put in like $50 to
solve that bug or something even more
have you thought about integrating money
into this whole thing like I would pay
probably a large amount of money for if
you found a bug or even generated a code
that I really appreciated like I had a
moment a few days ago when I was using cursor and it generated, like, three perfect functions
for interacting with the YouTube API to
update captions and uh for localization
like different in different languages
the API documentation is not very good
and the code across... like, I Googled it for a while and I couldn't find exactly what I needed, there's a lot of confusing information, and cursor generated it perfectly. and I was
like, I just sat back, I read the code, I
was like this is correct I tested it
it's correct. I was like, I want a tip button that goes, yeah, here's $5,
one that's really good just to support
the company and support what the the
interface is and the other is that
probably sends a strong signal like good
job right so there much stronger signal
than just accepting the code right you
just actually send like a strong good
job that and for bug finding obviously
like there's a lot of people
you know that would pay a huge amount of
money for a bug like a bug bug Bounty
thing right is that you guys think about
that yeah it's a controversial idea
inside the the company I think it sort
of depends on how much uh you believe in
humanity almost you know like uh I think
it would be really cool if like uh you
spend nothing to try to find a bug and
if it doesn't find a bug, you spend zero,
and then if it does find a bug uh and
you click accept then it also shows like
in parenthesis like $1 and so you spend
$1 to accept a bug uh and then of course
there's worry like okay we spent a lot
of computation like maybe people will
just copy paste um I think that's a
worry um and then there is also the
worry that like introducing money into
the product makes it like kind of you
know like it doesn't feel as fun anymore
like you have to like think about money
and and you all you want to think about
is like the code and so maybe it
actually makes more sense to separate it
out and like you pay some fee like every
month and then you get all of these
things for free but there could be a
tipping component which is not like it
it it still has that like dollar symbol
I think it's fine but I I also see the
point where like maybe you don't want to
introduce it yeah I was going to say the
moment that feels like people do this is
when they share it when they have this
fantastic example they just kind of
share it with their friends there is
also a potential world where there's a
technical solution to this like honor
System problem too where if we can get
to a place where we understand the
output of the system more I mean to the
stuff we were talking about with like
you know error checking with the LSP and
then also running the code but if you
could get to a place where you could
actually somehow verify oh I have fixed
the bug maybe then the the bounty system
doesn't need to rely on the honor System
Too how much interaction is there
between the terminal and the code like
how much information is gained from if
you if you run the code in the terminal
like can you use can you do like a a
loop where it runs runs the code and
suggests how to change the code if if
the code at runtime gives an error? right now they're separate worlds completely. like, I know you can do Ctrl+K inside the terminal to help you write the code, and you can use terminal context as well, inside of Cmd+K, kind of everything. we don't
have the looping part yet though we
suspect something like this could make a
lot of sense there's a question of
whether it happens in the foreground too
or if it happens in the background like
what we've been discussing sure the
background is pretty cool like we do
running the code in different ways plus
there's a database side to this which
how do you protect it from not modifying
the database but
okay I mean there's there's certainly
cool Solutions there uh there's this new
API that is being developed for it. it's not in AWS, but, you know, I think it's in PlanetScale. I don't know if PlanetScale was the first one to add it. it's the ability to sort of add branches to a database,
which is uh like if you're working on a
feature and you want to test against the
prod database but you don't actually
want to test against the pr database you
could sort of add a branch to the
database in the way to do that is to add
a branch to the WR ahead log uh and
there's obviously a lot of technical
complexity in doing it correctly I I
guess database companies need need need
new things to do uh because they have
they have they have good databases now
uh and and I I think like you know turbo
buffer which is which is one of the
databases we use as is is going to add
hope maybe braning to the to the rad log
and and so so maybe maybe the the AI
agents will use we'll use branching
they'll like test against some branch
and it's sort of going to be a
requirement for the database to like
support branching or something it would
be really interesting if you could
Branch a file system right yeah I feel
like everything needs branching it's
like that yeah yeah like that's the
problem with the Multiverse
right like if you branch on everything
that's like a lot I mean there's there's
obviously these like super clever
algorithms to make sure that you don't
actually sort of use a lot of space or
CPU or whatever okay this is a good
place to ask about infrastructure so you
guys mostly use AWS what what are some
interesting details what are some
interesting challenges why' you choose
AWS why is why is AWS still winning
hashtag AWS is just really really good
it's really good like um whenever you
use an AWS product you just know that
it's going to work like it might be
absolute hell to go through the steps to
set it up um why is the interface so
horrible? because it's just so good, it doesn't need to... it's the nature of winning. I think it's exactly that, it's just the nature of winning. yeah, yeah. but AWS
you can always trust like it will always
work and if there is a problem it's
probably your
problem yeah okay is there some
interesting like challenges to you guys
have pretty new startup to get scaling
to like to so many people and yeah I
think that they're uh it has been an
interesting Journey adding you know each
extra zero to the request per second you
run into all of these with like you know
the general components you're using for
for caching and databases run into
issues as you make things bigger and
bigger and now we're at the scale where
we get like you know int overflows on
our tables and things like that um and
then also there have been some custom
systems that we've built like for
instance our retrieval system for computing
a semantic index of your codebase and
answering questions about a codebase
that have continually I feel like been
one of the the trickier things to scale
I I have a few friends who are who are
super super senior engineers and one of
their sort of lines is like it's it's
very hard to predict where systems will
break when when you scale them you you
you can sort of try to predict in
advance but like there's there's always
something something weird that's going
to happen when you add this extra zero,
and you you thought you thought through
everything but you didn't actually think
through everything uh but I think for
that particular system
we've... so, for concrete details, the thing we do is obviously we chunk up all of your code, and then we send up sort of the code for embedding, and we embed
the code and then we store the
embeddings uh in a in a database but we
don't actually store any of the code and
then there's reasons around making sure
that
we don't introduce client bugs because
we're very very paranoid about client
bugs. we store much of the details on the server, like everything is sort
of
encrypted so one one of the technical
challenges is is always making sure that
the local index the local codebase state
is the same as the state that is on the
server and and the way sort of
technically we ended up doing that is so
for every single file you can you can
sort of keep this hash and then for
every folder you can sort of keep a hash
which is the hash of all of its children
and you can sort of recursively do that
until the top. and why do something complicated? one thing you
could do is you could keep a hash for
every file then every minute you could
try to download the hashes that are on
the server figure out what are the files
that don't exist on the server maybe
just created a new file maybe you just
deleted a file maybe you checked out a
new branch and try to reconcile the
state between the client and the
server but that introduces like
absolutely ginormous Network overhead
both uh both on the client side I mean
nobody really wants us to hammer their
Wi-Fi all the time if you're using
cursor uh but also like I mean it would
introduce like ginormous overhead in the
database it would sort of be reading
this tens-of-terabytes database, sort of approaching like 20 terabytes or something, every second.
that's just just kind of crazy you
definitely don't want to do that so what
you do you sort of you just try to
reconcile the single hash which is at
the root of the project and then if if
something mismatches then you go you
find where all the things disagree maybe
you look at the children and see if the
hashes match and if the hashes don't
match go look at their children and so
on but you only do that in the scenario
where things don't match and for most
people most of the time the hashes match
so it's a kind of, like, hierarchical reconciliation? yeah, something like that. it's called a Merkle tree? yeah, Merkle.
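A small sketch of that hierarchical reconciliation (simplified; real Merkle trees and Cursor's indexer handle renames, deletions, and much more): compare the root hashes first, and only walk into children whose hashes disagree.

```python
# Merkle-style reconciliation: descend only where hashes mismatch.
import hashlib

def sha(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def tree_hash(node):
    """node is ('file', contents) or ('dir', {name: node})."""
    kind, payload = node
    if kind == "file":
        return sha(payload)
    return sha("".join(name + tree_hash(child) for name, child in sorted(payload.items())))

def changed_paths(local, remote, path=""):
    if tree_hash(local) == tree_hash(remote):
        return []                               # the common case: hashes match, stop here
    kind, payload = local
    if kind == "file":
        return [path]
    out = []
    for name, child in payload.items():
        remote_child = remote[1].get(name, ("file", ""))
        out += changed_paths(child, remote_child, path + "/" + name)
    return out

local  = ("dir", {"a.py": ("file", "print(1)"), "b.py": ("file", "print(2)")})
remote = ("dir", {"a.py": ("file", "print(1)"), "b.py": ("file", "print(3)")})
print(changed_paths(local, remote))             # ['/b.py']
```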
I mean, it's cool to see that you kind of have to think
through all these problems and I mean
the the point of like the reason it's
gotten hard is just because like the
number of people using it and you know
if some of your customers have really
really large code bases uh to the point
where we you know we we originally
reordered our code base which is which
is big but I mean just just not the size
of some company that's been there for 20
years and sort of has to train enormous
number of files and you sort of want to
scale that across programmers there's
there's all these details where like
building the simple thing is easy but
scaling it to a lot of people like a lot
of companies is is obviously a difficult
problem which is sort of you know
independent of actually so that's
there's part of this scaling our current
solution is also you know coming up with
new ideas that obviously we're working
on, but then scaling all of that in the last few weeks. yeah, and
there are a lot of clever things like
additional things that that go into this
indexing system
um for example the bottleneck in terms
of costs is not storing things in the
vector database or the database it's
actually embedding the code and you
don't want to re-embed the code base for
every single person in a company that is
using the same exact code except for
maybe they're in a different branch with
a few different files or they've made a
few local changes and so because again
embeddings are the bottleneck you can do
this one clever trick and not have to
worry about like the complexity of like
dealing with branches and and the other
databases, where you just have some cache on the actual vectors computed from the hash of a given chunk. and so this
means that when the nth person at a
company goes into their code base it's
it's really really fast and you do all
this without actually storing any code
on our servers at all no code data
stored we just store the vectors in the
vector database and the vector cache
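A small sketch of that trick (embed() below is a placeholder for the real embedding model): key the vector cache by a hash of the chunk's contents, so identical chunks across users and branches are only embedded once, and no source code needs to be stored server-side.

```python
# Cache embeddings by content hash so the expensive embedding call runs at most once per chunk.
import hashlib

embedding_cache = {}   # chunk-hash -> vector; note: no source code is kept here

def embed(chunk: str) -> list[float]:
    """Placeholder for the real embedding model call."""
    return [float(len(chunk))]

def get_embedding(chunk: str) -> list[float]:
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = embed(chunk)
    return embedding_cache[key]

# the nth person at a company indexing the same code hits the cache instead of re-embedding
get_embedding("def add(a, b): return a + b")
get_embedding("def add(a, b): return a + b")
print(len(embedding_cache))  # 1
```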
what's the biggest gain, at this time, you get from indexing the code base? just out of curiosity, what benefit do users get? it seems like
longer term there'll be more and more
benefit but in the short term just
asking questions of the code
base uh what what's the use what's the
usefulness of that I think the most
obvious one is um just you want to find
out where something is happening in your
large code base and you sort of have a
fuzzy memory of okay I want to find the
place where we do X um but you don't
exactly know what to search for in a
normal text search and to ask a chat uh
you hit command enter to ask with with
the codebase chat and then uh very often
it finds the the right place that you
were thinking of I think like you like
you mentioned in the future I think this
only going to get more and more powerful
where we're working a lot on improving
the quality of our retrieval um and I
think the ceiling for that is really, really much higher than people give it credit for. one question that's good to
ask here have you considered and why
haven't you much done sort of local
stuff to where you can do the it seems
like everything we just discussed is exceptionally difficult to do. to go to the cloud, you have to think about
all these things with the caching and
the
uh you know large code Bas with a large
number of programmers are using the same
code base you have to figure out the
puzzle of that, a lot of it. you know, most software just does this heavy computational stuff locally. so have you considered doing sort of embeddings
locally yeah we thought about it and I
think it would be cool to do it locally
I think it's just really hard and and
one thing to keep in mind is that you
know uh some of our users use the latest
MacBook Pro, but most of our users, like more than 80% of our users, are on Windows machines, and many of them are not very powerful. and so local models really only work on
the on the latest computers and it's
also a big overhead to to to build that
in and so even if we would like to do
that um it's currently not something
that we are able to focus on and I think
there there are some uh people that that
that do that and I think that's great um
but especially as models get bigger and
bigger and you want to do fancier things
with like bigger models it becomes even
harder to do it locally yeah and it's
not a problem of like weaker computers
it's just that for example if you're
some big company you have big company
code base, it's just really hard to process a big company code base even on the beefiest MacBook Pros. so it's not even a matter of, like, if you're just a student
or something I think if you're like the
best programmer at at a big company
you're still going to have a horrible
experience if you do everything locally
when you could you could do it and sort
of scrape by but like again it wouldn't
be fun anymore. yeah, like, doing approximate nearest neighbors on this massive code
base is going to just eat up your memory
and your CPU, and that's just that. like, let's talk about also the modeling side, where, as was said, there are these massive headwinds against local models, where, one, things seem to move towards MoEs, which,
like one benefit is maybe they're more
memory bandwidth bound which plays in
favor of local uh versus uh using gpus
um or using Nvidia gpus but the downside
is these models are just bigger in total
and you know they're going to need to
fit often not even on a single node but
multiple nodes um there's no way that's
going to fit inside of even really good
MacBooks um and I think especially for
coding it's not a question as much of
like does it clear some bar of like the
model's good enough to do these things
and then like we're satisfied which may
may be the case for other other problems
and maybe where local models shine but
people are always going to want the best
the most intelligent the most capable
things and that's going to be really
really hard to run for almost all people
locally. don't you want the most capable model? like, you want Sonnet? and also with o1? I like how you're pitching me. would you be satisfied with an inferior model? listen, yes, I'm
one of those but there's some people
that like to do stuff locally especially
like yeah really there's a whole
obviously open source movement that kind
of resists and it's good that they exist
actually because you want to resist the
power centers that are growing. there's actually an alternative to local models that I'm particularly fond of.
I think it's still very much in the
research stage but you could imagine um
to do homomorphic encryption for
language model inference so you encrypt
your input on your local machine then
you send that up and then um the server
uh can use lots of computation they can
run models that you cannot run locally
on this encrypted data um but they
cannot see what the data is and then
they send back the answer and you
decrypt the answer and only you can see
the answer uh so I think uh that's still
very much research and all of it is
about trying to make the overhead lower
because right now the overhead is really
big uh but if you can make that happen I
think that would be really really cool
and I think it would be really really
impactful um because I think one thing
that's actually kind of worrisome is
that as these models get better and
better uh they're going to become more
and more economically useful and so more
and more of the world's information and
data will flow through, you know,
one or two centralized actors um and
then there are worries about you know
there can be traditional hacker attempts
but it also creates this kind of scary
part where if all of the world's
information is flowing through one node
in plain text, you can have surveillance
in very bad ways and sometimes that will
happen, you know, initially for good reasons, like people will want to try to protect against bad actors using AI models in bad ways, and then
you will add in some surveillance code
and then someone else will come in and
you know you're in a slippery slope and
then you start uh doing bad things with
a lot of the world's data and so I I'm
very hopeful that uh we can solve
homomorphic encryption for doing privacy
preserving machine learning but I would
say like that's the challenge we have
with all software these days it's
like there's so many features that can
be provided from the cloud and all of us
increasingly rely on it and make our
life awesome but there's downsides and
that's that's why you rely on really
good security to protect from basic
attacks but there's also only a small
set of companies that are controlling
that data you know and they they
obviously have leverage and they could
be infiltrated in all kinds of ways
that's the world we live in yeah I mean
the thing I'm just actually quite
worried about is sort of the world where, I mean, Anthropic has this responsible scaling policy, and so we're on, like, the low ASLs, which is the Anthropic security level or whatever, of the models. but as we get to, like, ASL-3, ASL-4, whatever models, which are sort of very
powerful
but for for mostly reasonable security
reasons you would want to monitor all
the prompts uh but I think I think
that's sort of reasonable and
understandable where where everyone is
coming from but man it'd be really
horrible if if sort of like all the
world's information is sort of monitored
that heavily it's way too centralized
it's like it's like sort of this like
really fine line you're walking where on
the one side like you don't want the
models to go Rogue on the other side
like man humans like I I don't know if I
if I trust like all the world's
information to pass through like three
three model providers yeah why do you
think it's different than Cloud
providers? because I think a lot of this data would never have gone to the cloud providers in the first place. this is often, like, you want to give more data to the AI models, you
want to give personal data that you
would never have put online in the first
place uh to these companies or or or to
these models um and it also centralizes
control uh where right now um for for
cloud you can often use your own
encryption keys and it like it can't
really do much um but here it's just
centralized actors that see the exact
plain text of
everything on the topic of context that
that's actually been a friction for me
when I'm writing code you know in Python
there's a bunch of stuff imported, and you could probably intuit the kind of stuff I would like to include in the context. how hard is it to automatically figure out the
context It's Tricky um I think we can do
a lot better um at uh Computing the
context automatically in the future one
thing that's important to note is there
are trade-offs with including automatic
context so the more context you include
for these models um first of all the
slower they are and um the more
expensive those requests are which means
you can then do less model calls and do
less fancy stuff in the background also
for a lot of these models they get
confused if you have a lot of
information in the prompt so the bar for
um accuracy and for relevance of the
context you include should be quite High
um but this is already we do some
automatic context in some places within
the product it's definitely something we
want to get a lot better at and um I
think that there are a lot of cool ideas
to try there um both on the learning
better retrieval systems like better
embedding models, better rankers. I think
that there are also cool academic ideas
you know stuff we've tried out
internally, but also things the field at large is grappling with: can you get language models to a place where you
can actually just have the model itself
like understand a new Corpus of
information and the most popular talked
about version of this is can you make
the context Windows infinite then if you
make the context Windows infinite can
make the model actually pay attention to
the infinite context and then after you
can make it pay attention to the
infinite context to make it somewhat
feasible to actually do it can you then
do caching for that infinite context you
don't have to recompute that all the
time but there are other cool ideas that
are being tried that are a little bit
more analogous to fine-tuning of
actually learning this information and
the weights of the model and it might be
that you actually get sort of a
qualitatively different type of
understanding if you do it more at the
weight level than if you do it at the in-context learning level. I think the jury is still a little bit out on how this is all going to work in the end, but in the interim, us as a
company we are really excited about
better retrieval systems and um picking
the parts of the code base that are most
relevant to what you're doing uh we
could do that a lot better like one
interesting proof of concept for the
learning this knowledge directly in the
weights is with vs code so we're in a vs
code fork and vs code the code is all
public so these models in pre-training
have seen all the code um they probably
also seen questions and answers about it
and then they've been fine-tuned and RLHF'd to be able to answer questions
about code in general so when you ask it
a question about vs code you know
sometimes it'll hallucinate but
sometimes it actually does a pretty good
job at answering the question and I
think this is just, like, it happens to be okay at it. but what if you could
actually like specifically train or Post
train a model such that it really was
built to understand this code base um
it's an open research question one that
we're quite interested in and then
there's also uncertainty of like do you
want the model to be the thing that end
to end is doing everything I.E it's
doing the retrieval in its internals and
then kind of answering your question
creating the code or do you want to
separate the retrieval from the Frontier
Model where maybe you know you'll get
some really capable models that are much
better than like the best open source
ones in a handful of months um and then
you'll want to separately train a really
good open source model to be the
retriever to be the thing that feeds in
the context um to these larger models
can you speak a little more to post-training a model to understand the code base? like, what do you mean by that? is this the synthetic data direction? yeah, I mean, there are
many possible ways you could try doing
it there's certainly no shortage of
ideas um it's just a question of going
in and like trying all of them and being
empirical about which one works best um
you know one one very naive thing is to
try to replicate What's Done uh with
vscode uh and these Frontier models so
let's like continue pre-training some
kind of continued pre-training that
includes General code data but also
throws in a lot of the data of some
particular repository that you care
about. and then in post-training, meaning, let's just start with instruction fine-tuning: you have a
normal instruction fine tuning data set
about code then you throw in a lot of
questions about code in that repository
um so you could either get ground truth
ones which might be difficult or you
could do what you kind of hinted at or
suggested using synthetic data um I.E
kind of having the model uh ask
questions about various pieces of the
code um so you kind of take the pieces
of the code then prompt the model or
have a model propose a question for that
piece of code and then add those as
instruction fine-tuning data points, and
then in theory this might unlock the
model's ability to answer questions about that code base.
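A sketch of that synthetic-data recipe (ask_model below is an assumed stand-in for a model call, not a real API): for each piece of the repository, have a model propose a question and a code-grounded answer, and use the pairs as instruction fine-tuning data.

```python
# Build synthetic (question, answer) fine-tuning examples grounded in a specific repository.
def ask_model(prompt: str) -> str:
    """Stand-in for a model call (assumed interface)."""
    return "placeholder output"

def build_finetune_examples(code_chunks: list[str]) -> list[dict]:
    examples = []
    for chunk in code_chunks:
        question = ask_model(
            "Ask one question a developer might have about this code:\n" + chunk
        )
        answer = ask_model(
            "Answer the question using only this code:\n" + chunk + "\n\nQ: " + question
        )
        examples.append({"instruction": question, "response": answer})
    return examples

print(build_finetune_examples(["def add(a, b):\n    return a + b"]))
```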
let me ask you about OpenAI o1. what do you think is the role of
that kind of test time compute system in
programming I think test time compute is
really really interesting so there's
been the pre-training regime which will
kind of as you scale up the amount of
data and the size of your model get you
better and better performance both on
loss and then on Downstream benchmarks
um and just general performance when we
use it for coding or or other tasks um
we're starting to hit uh a bit of a data
wall meaning it's going to be hard to
continue scaling up this regime and so
scaling up test-time compute is an
interesting way of now you know
increasing the number of inference time
flops that we use, but still getting, as you increase the number of flops you use at inference time, corresponding improvements in the
performance of these models
traditionally we just had to literally
train a bigger model that always uses uh
that always used that many more flops
but now we could perhaps use the same
size model and run it for longer to
be able to get uh an answer at the
quality of a much larger model and so
the really interesting thing I like
about this is there are some problems
that perhaps require
100 trillion parameter model
intelligence trained on 100 trillion
tokens um but that's like maybe 1% maybe
like 0.1% of all queries so are you
going to spend all of this effort all
this compute training a model uh that
cost that much and then run it so
infrequently it feels completely
wasteful, when instead you train the model that's capable of doing the 99.9%
of queries then you have a way of
inference time running it longer for
those few people that really really want
Max
intelligence how do you figure out which
problem requires what level of
intelligence is that possible to
dynamically figure out when to use GPT-4, when to use a small model, and when you need o1? I mean, yeah, that's an open
research problem certainly uh I don't
think anyone's actually cracked this
model routing problem quite well uh we'd
like to we we have like kind of initial
implementations of this for something like cursor tab, but at the level of going between 4o, Sonnet, o1, it's a bit trickier. perhaps, like,
there's also a question of like what
level of intelligence do you need to
determine if the thing is uh too hard
for for the the four level model maybe
you need the 01 level model um it's
really unclear but but you mentioned so
there's a pre-training process, then there's post-training, and then there's test-time compute. is that fair, does that sort of separate them? where's the biggest gains? well, it's weird,
because like test time compute there's
like a whole training strategy needed to
get test time compute to work and the
Really the other really weird thing
about this is no one like outside of the
big labs and maybe even just open AI no
one really knows how it works like there
have been some really interesting papers
that uh show hints of what they might be
doing and so perhaps they're doing
something with research using process
reward models but yeah I just I think
the issue is we don't quite know exactly
what it looks like so it would be hard
to kind of comment on like where it fits
in I I would put it in post training but
maybe like the compute spent for this
kind of for getting test time compute to
work for a model is going to dwarf
pre-training
eventually. so we don't even know if o1 is using just, like, chain of thought, RL,
we don't know how they're using any of
these we don't know anything it's fun to
speculate like if you were to uh build a
competing model what would you do yeah
so one thing to do would be I I think
you probably need to train a process
reward model which is so maybe we can
get into reward models and outcome
reward models versus process reward
models outcome reward models are the
kind of traditional reward models that
people train for
language modeling and
it's just looking at the final thing so
if you're doing some math problem let's
look at that final thing you've done
everything and let's assign a grade to
it How likely we think uh like what's
the reward for this this this outcome
process reward models Instead try to
grade The Chain of Thought and so open
AI had some preliminary paper on this I
think uh last summer where they use
human labelers to get this pretty large
several hundred thousand data set of
grading chains of thought um
ultimately it feels like I haven't seen
anything interesting in the ways that
people use process reward models outside
of just using it as a means of uh
affecting how we choose between a bunch
of samples so like what people do uh in
all these papers is they sample a bunch
of outputs from the language model and
then use the process reward models to
grade uh all those Generations alongside
maybe some other heuristics and then use
that to choose the best answer the
really interesting thing that people
think might work and people want to work
is tree search with these process reward
models because if you really can grade
every single step of the Chain of
Thought then you can kind of Branch out
and you know explore multiple Paths of
this Chain of Thought and then use these
process reward models to evaluate how good
is this branch that you're
taking yeah when the when the quality of
the branch is somehow strongly
correlated with the quality of the
outcome at the very end so like you have
a good model of knowing which should
take so not just this in the short term
and like in the long term yeah and like
the interesting work that I think has
been done is figuring out how to
properly train the process reward models or the
interesting work that has been open-
sourced and people I think uh talk about
is uh how to train the process reward
models um maybe in a more automated way
um I I could be wrong here might not be
mentioning some papers I haven't seen
anything super uh that seems to work
really well for using the process reward
models creatively to do tree search in code
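A minimal sketch of the best-of-n pattern described above, where a process reward model scores each reasoning step and the highest-scoring chain is kept. The `sample_chain_of_thought` and `score_step` callables are hypothetical stand-ins for a policy model and a PRM, not any lab's actual implementation.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample_chain_of_thought: Callable[[str], List[str]],  # policy: prompt -> reasoning steps
    score_step: Callable[[str, List[str]], float],        # PRM: score of the latest step
    n: int = 16,
) -> List[str]:
    """Sample n chains of thought and keep the one the PRM likes best."""
    best_chain: List[str] = []
    best_score = float("-inf")
    for _ in range(n):
        steps = sample_chain_of_thought(prompt)
        # Score every prefix of the chain; take the minimum so one bad step
        # sinks the whole chain (other aggregations, e.g. product, also work).
        step_scores = [score_step(prompt, steps[: i + 1]) for i in range(len(steps))]
        chain_score = min(step_scores) if step_scores else float("-inf")
        if chain_score > best_score:
            best_chain, best_score = steps, chain_score
    return best_chain
```

Tree search, as discussed above, would branch at individual steps and prune low-scoring branches instead of only scoring complete chains.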
um this is kind of an AI safety
maybe a bit of a philosophy question so
open AI says that they're hiding the
Chain of Thought from the user and
they've said that that was a difficult
decision to make they instead of showing
the Chain of Thought they're asking the
model to summarize the Chain of Thought
they're also in the background saying
they're going to monitor the Chain of
Thought to make sure the model is not
trying to manipulate the user which is a
fascinating possibility but anyway what
do you think about hiding the Chain of
Thought one consideration for open Ai
and this is completely speculative could
be that they want to make it hard for
people to distill these capabilities out
of their model it might actually be
easier if you had access to that hidden
Chain of Thought uh to replicate the
technology um because that's pretty
important data like seeing seeing the
steps that the model took to get to the
final result so you can probably train
on that also and there was sort of a
mirror situation with this with some of
the large language model providers and
also this is speculation but um some of
these apis um used to offer easy access
to log probabilities for the tokens that
they're generating um and also log
probabilities over the prompt tokens and
then some of these apis took those away
uh and again complete speculation but um
one of the thoughts is that the the
reason those were taken away is if you
have access to log probabilities um
similar to this hidden chain of thought
that can give you even more information
to to try and distill these capabilities
out of the apis out of these biggest
models into models you control as an
asterisk on the previous
discussion about us integrating o1 I
think that we're still learning how to
use this model so we made o1 available
in cursor because like we were when we
got the model we were really interested
in trying it out I think a lot of
programmers are going to be interested
in trying it out but um uh o1 is not
part of the default cursor experience in
any way up um and we still haven't found
a way to yet integrate it into an editor
in uh into the editor in a way that we
we we reach for sort of you know every
hour maybe even every day and so I think
that the jury's still out on how to how
to use the model um and uh I we haven't
seen examples yet of of people releasing
things where it seems really clear like
oh that's that's like now the use case
um the obvious one to to turn to is
maybe this can make it easier for you to
have these background things running
right to have these models in Loops to
have these models be agentic um but we're
still um still discovering to be clear
we have ideas we just need to we need to
try and get something incredibly useful
before we we put it out there but it has
these significant limitations like even
like barring capabilities uh it does not
stream and that means it's really really
painful to use for things where you want
to supervise the output um and instead
you're just waiting for the wall of text to
show up um also it does feel like the
early innings of test time compute and
search where it's just very
much a v0 um and there's so many
things that like like don't feel quite
right and I suspect um in parallel to
people increasing uh the amount of
pre-training data and the size of the
models and pre-training and finding
tricks there you'll now have this other
thread of getting search to work better
and
better so let me ask you
about strawberry tomorrow
eyes so it looks like GitHub um co-pilot
might be integrating 01 in some kind of
way and I think some of the comments are
saying this this mean cursor is
done I think I saw one comment saying
that I saw time to shut down cursor
time to shut down
cursor so is it time to shut down cursor
I think this space is a little bit
different from past software spaces over
the the 2010s um where I think that the
ceiling here is really really really
incredibly high and so I think that the
best product in 3 to four years will
just be so much more useful than the
best product today and you can like wax
poetic about moats this and brand that and
you know this is our uh advantage but I
think in the end just if you don't have
like if you stop innovating on the
product you will you will lose and
that's also great for startups um that's
great for people trying to to enter this
Market um because it means you have an
opportunity um to win against people who
have you know lots of users already by
just building something better um and so
I think yeah over the next few years
it's just about building the best
product building the best system and
that both comes down to the modeling
engine side of things and it also comes
down to the to the editing experience
yeah I think most of the additional
value from cursor versus everything else
out there is not just integrating the
new model fast like o1 it comes from all
of the kind of depth that goes into
these custom models that you don't
realize are working for you in kind of
every facet of the product as well as
like the really uh thoughtful ux with
every single
feature all right uh from that profound
answer let's descend back down to the
technical you mentioned you have a
taxonomy of synthetic data oh yeah uh
can you please explain yeah I think uh
there are three main kinds of synthetic
data the first is so so what is
synthetic data first so there's normal
data like non-synthetic data which is
just data that's naturally created i.e.
usually it'll be from humans having done
things so uh from some human process you
get this data synthetic data uh the
first one would be distillation so
having a language model kind of output
tokens or probability distributions over
tokens um and then you can train some
less capable model on this uh this
approach is not going to get you a net
like more capable model than the
original one that has produced The
Tokens um
but it's really useful for if there's
some capability you want to elicit from
some really expensive High latency model
you can then distill that down into
some smaller task specific model um the
second kind is when like One Direction
of the problem is easier than the
reverse and so a great example of this
is bug detection like we mentioned
earlier where it's a lot easier to
introduce reasonable looking bugs
than it is to actually detect them and
this is this is probably the case for
humans too um and so what you can do is
you can get a model that's not trained on
that much data that's not that smart to
introduce a bunch of bugs in code and
then you can use that synthetic
data to train a model that
can be really good at detecting bugs um
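A hedged sketch of that bug-injection idea: use a weaker model to mutate known-good snippets and label the results, producing synthetic training pairs for a bug detector. The `llm_complete` call and the prompt below are hypothetical stand-ins, not a particular provider's API or Cursor's pipeline.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical completion call; swap in any text-generation client.
LlmComplete = Callable[[str], str]

INJECTION_PROMPT = (
    "Introduce one subtle, realistic bug into the following code. "
    "Return only the modified code.\n\n{code}"
)

def make_bug_dataset(
    clean_snippets: List[str],
    llm_complete: LlmComplete,
    buggy_fraction: float = 0.5,
) -> List[Tuple[str, int]]:
    """Return (snippet, label) pairs where label 1 means 'contains a bug'."""
    dataset: List[Tuple[str, int]] = []
    for code in clean_snippets:
        if random.random() < buggy_fraction:
            dataset.append((llm_complete(INJECTION_PROMPT.format(code=code)), 1))
        else:
            dataset.append((code, 0))
    return dataset
```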
the last category I think is I guess the
main one that it feels like the big labs
are doing for synthetic data which is um
producing texts with language models
that can then be verified easily um so
like you know extreme example of this is
if you have a verification system that
can detect if language is Shakespeare
level and then you have a bunch of
monkeys typing on typewriters like you
can eventually get enough training data
to train a Shakespeare level language
model and I mean this is the case like
very much the case for math where
verification is is is actually really
really easy for formal um formal
language
and then what you can do is you can have
an OKAY model uh generate a ton of roll
outs and then choose the ones that you
know have actually proved the ground
truth theorems and train that further uh
there's similar things you can do for
code with LeetCode-like problems or uh
where if you have some set of tests that
you know correspond to if if something
passes these tests it has actually
solved a problem you could do the same
thing where you verify that it's passed
the tests and then train the model on the
outputs that have passed the tests um
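A minimal sketch of that verification-filtered generation loop: sample many candidate solutions, keep only those that pass the problem's tests, and fine-tune on the survivors. `sample_solution` is a hypothetical model call, and running untrusted generated code would of course need proper sandboxing in practice.

```python
import subprocess
import tempfile
from typing import Callable, List, Tuple

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run the candidate plus its tests in a subprocess; exit code 0 means pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=30)
    return result.returncode == 0

def verified_samples(
    problems: List[Tuple[str, str]],           # (problem_prompt, test_code)
    sample_solution: Callable[[str], str],     # hypothetical model call
    samples_per_problem: int = 8,
) -> List[Tuple[str, str]]:
    """Keep only (prompt, solution) pairs whose solution passes the tests."""
    kept: List[Tuple[str, str]] = []
    for prompt_text, tests in problems:
        for _ in range(samples_per_problem):
            candidate = sample_solution(prompt_text)
            if passes_tests(candidate, tests):
                kept.append((prompt_text, candidate))
    return kept
```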
I think it's going to be a little
tricky getting this to work in all
domains or just in general like having
the perfect verifier feels really really
hard to do with just like open-ended
miscellaneous tasks you give the model
or more like long Horizon tasks even in
coding that's cuz you're not as
optimistic as Arvid but yeah uh so yeah
so that that that third category
requires having a verifier yeah
verification is it feels like it's best
when you know for a fact that it's
correct and like then like it wouldn't
be like using a language model to verify
it would be using tests or uh formal
systems or running the thing too doing
like the human form of verification
where you just do manual quality control
yeah yeah but like the the language
model version of that where it's like
running the thing it's actually
understands yeah but yeah no that's sort
of somewhere between yeah yeah I think
that that's the category that is um most
likely to to result in like massive
gains what about RL with feedback so
RLHF versus RLAIF
um what's the role of that in um
getting better performance on the
models yeah so
RLHF is when the reward model you use uh
is trained from some labels you've
collected from humans giving
feedback um I think this works if you
have the ability to get a ton of human
feedback for this kind of task that you
care about RLAIF is interesting uh
because you're kind of depending on like
this is actually kind of uh going to
it's depending on the constraint that
verification is actually a decent bit
easier than generation because it feels
like okay like what are you doing you're
using this language model to look at the
language model outputs and then improve
the language model but no it actually
may work if the language model uh has a
much easier time verifying some solution
uh than it does generating it then you
actually could perhaps get this kind of
recursively but I don't think it's going
to look exactly like that um the other
the other thing you could do
is that we kind of do is like a little
bit of a mix of RLAIF and RLHF where
usually the model is actually quite
correct and this is in the case of
cursor tab at at picking uh between like
two possible generations of what is what
is what is the better one and then it
just needs like a hand a little bit of
human nudging with only like on the on
the order of 50 100 uh examples um to
like kind of align that prior the model
has with exactly with what what you want
it looks different than I think normal
RLHF where you're usually training these
reward models on tons of examples
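For context, a reward model in the RLHF setup described above is typically trained on pairwise preference labels with a Bradley-Terry style loss, pushing the reward of the preferred generation above the rejected one. This is the generic recipe, not Cursor's tab-ranking setup; the tensors below are stand-ins for a scoring head's outputs.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected), batch-averaged."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative step with made-up values; in practice these rewards come from a
# scoring head applied to (prompt, completion) pairs labeled by humans (RLHF)
# or by another model (RLAIF).
reward_chosen = torch.randn(32, requires_grad=True)
reward_rejected = torch.randn(32, requires_grad=True)
loss = preference_loss(reward_chosen, reward_rejected)
loss.backward()
```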
what's your intuition when
you compare generation and verification
or generation and
ranking is is ranking way easier than
generation my intuition would just say
yeah it should be like this is kind
of going going back
to like if you if you believe P does not
equal NP then there's this massive class
of problems that are much much easier to
verify given a proof than actually
proving it I wonder if the same thing
will prove P not equal to NP or P equal
to NP that would be that would be really
cool that'd be whatever a Fields
Medal by AI who gets the credit another
open philosophical
question I'm
I'm I'm actually surprisingly curious
what like a good bet for when
an AI will get the Fields Medal will
be I actually don't know is this Aman's specialty
uh I don't know what Aman's bet here
is oh sorry Nobel Prize or Fields Medal
first Fields Medal Fields Medal level Fields
Medal I think Fields Medal comes first
well you would say that of course but
it's also this like isolated system you
can verify and no sure like I don't even
know if I you don't need to do have much
more I felt like the path to get to IMO
was a little bit more clear because it
already could get a few IMO problems and
there are a bunch of like there's a
bunch of low-hanging fruit given the
literature at the time of like what what
tactics people could take I think I'm
much less versed in the space of theorem
proving now and so yeah less intuition
about how close we are to solving these
really really hard open problems so you
think you'll be feels mod first it won't
be like in U physics or in oh 100% I
think I I think I think that's probably
more likely like it's probably much more
likely that it'll get in yeah yeah yeah
well I think it goes to like I don't
know like BSD which is the Birch and Swinnerton-Dyer
conjecture like the Riemann hypothesis or
any one of these like hard hard math
problems which just like actually really
hard it's sort of unclear what the path
to to get even a solution looks like
like we we don't even know what a path
looks like let alone um and you don't
buy the idea that this is like an
isolated system and you can actually you
have a good reward system and
uh it feels like it's easier to train
for that I think we might get a Fields
Medal before AGI I think I mean I'd be
very
happy be very happy but I don't know if
I I think 2028 to
2030 Fields Medal Fields Medal all right
it's uh it feels like forever from now
given how fast things have been going um
speaking of how fast things have been
going let's talk about scaling laws so
for people who don't know uh maybe it's
good to talk about this
whole uh idea of scaling laws what are
they where do things stand and where do
you think things are going I think it
was interesting the original scaling
laws paper by open AI was slightly wrong
because I think of some uh issues they
had with uh learning rate schedules uh
and then Chinchilla showed a more
correct version and then from then
people have again kind of deviated from
doing the compute-optimal thing because
people people start now optimizing more
so for uh making the thing work really
well given a given an inference budget
and I think there are a lot more
Dimensions to these curves than what we
originally used of just compute number
of uh parameters and data like inference
compute is is the obvious one I think
context length is another obvious one so
if you care like let's say you care
about the two things of inference
compute and and then uh context window
maybe the thing you want to train is
some kind of SSM because they're much
much cheaper and faster at super super
long context and even if maybe it has 10x
worse scaling properties during training
meaning you have to spend 10x more
compute to train the thing to get the
same same level of capabilities um it's
worth it because you care most about
that inference budget for really long
context windows so it'll be interesting
to see how people kind of play with all these dimensions
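The compute-optimal point referenced here is often summarized with two rules of thumb: training compute C ≈ 6·N·D, and the Chinchilla finding of roughly 20 training tokens per parameter. The helper below just turns a compute budget into that ratio; the numbers are ballpark illustrations, not exact fits from the paper.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training-compute budget into (params, tokens) using C ~ 6*N*D and D ~ 20*N."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A 1e24-FLOP budget, purely for illustration: roughly ~90B params / ~1.8T tokens.
params, tokens = chinchilla_optimal(1e24)
print(f"~{params:.2e} parameters, ~{tokens:.2e} tokens")
```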
so yeah I mean you
speak to the multiple Dimensions
obviously the original conception was
just looking at the variables of the
size of the model as measured by
parameters and the size of the data as
measured by the number of tokens and
looking at the ratio of the two yeah and
it's it's kind of a compelling notion
that there is a number or at least a
minimum and it seems like one was
emerging um do you still believe that
there is a kind of bigger is
better I mean I think bigger is
certainly better for just raw
performance and raw intelligence and raw
intelligence I think the the path that
people might take is I'm particularly
bullish on distillation and like yeah
how many knobs can you turn to if we
spend like a ton ton of money on
training like get the most capable uh
cheap model right like really really
caring as much as you can because like
the the the naive version of caring as
much as you can about inference time
Compu is what people have already done
with like the Llama models are just
overtraining the hell out of 7B models
um on way way way more tokens than is Chinchilla-
optimal right but if you really care
about it maybe the thing to do is what Gemma
did which is let's not
just train on tokens let's literally
train on
minimizing the KL divergence
with the distribution of Gemma 27B
right so knowledge distillation there um
and you're spending the compute of
literally training this 27 billion model
uh billion parameter model on all these
tokens just to get out this I don't know
smaller model and the distillation gives
just a faster model smaller means faster
yeah distillation in theory is um I
think getting out more signal from the
data that you're training on and it's
like another it's it's perhaps another
way of getting over not like completely
over but like partially helping with the
data wall where like you only have so
much data to train on let's like train
this really really big model on all
these tokens and we'll distill it into
this smaller one and maybe we can get
more signal uh per token uh for this for
this much smaller model than we would
have originally if we trained it
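A minimal sketch of the kind of logit distillation being described, assuming PyTorch: train a smaller student to minimize the KL divergence between its per-token distribution and a frozen teacher's. This shows the generic objective, not Gemma's actual training recipe; the shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary at every token position."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Illustrative shapes: (batch, sequence_length, vocab_size).
student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)   # produced by the frozen, larger teacher
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```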
so if I gave you $10 trillion how would
you spend it I mean you can't buy
an island or whatever um how would you
allocate it in terms of improving the
big model
versus maybe paying for HF in the RLHF or
yeah I think there's a lot of these
secrets and details about training these
large models that I I I just don't know
and are only privy to the large labs and
the issue is I would waste a lot of that
money if I even attempted this because I
wouldn't know those things uh suspending
a lot of disbelief and assuming like you
had the
knowhow um and operate or or if you're
saying like you have to operate with
like the The Limited information you
have now no no no actually I would say
you swoop in and you get all the
information all the little
characteristics all the little
parameters all the all the parameters
that Define how the thing is trained mhm
if we look
and how to invest money for the next 5
years in terms of maximizing what you
called raw intelligence I mean isn't the
answer like really simple you just you
just try to get as much compute as
possible like like at the end of the day
all all you need to buy is the gpus and
then the researchers can find find all
the all like they they can sort of you
you can tune whether you want between a
big model or a small model like well
this gets into the question of like are
you really limited by compute and money
or are you limited by these other things
and I'm more prone to Arvid's
belief that we're sort of idea
limited but there's always that like but
if you have a lot of computes you can
run a lot of experiments so you would
run a lot of experiments versus like use
that compute to train a gigantic model I
would but I I do believe that we are
limited in terms of ideas that we have I
think yeah because even with all this
compute and like you know all the data
you could collect in the world than you
really are ultimately limited by not
even ideas but just like really good
engineering like even with all the
capital in the world would you really be
able to assemble like there aren't that
many people in the world who really can
like make the difference here um and and
there's so much work that goes into
research that is just like pure really
really hard engineering work um as like
a very kind of handwavy example if you
look at the original Transformer paper
you know how much work was kind of
joining together a lot of these really
interesting Concepts embedded in the
literature versus then going in and
writing all the code like maybe the Cuda
kernels maybe whatever else I don't know
if it ran on gpus or tpus originally
such that it actually saturated the
GPU performance right getting Noam Shazeer
to go in and do all this code right
and Noam is like probably one of the
best engineers in the world or maybe
going a step further like the next
generation of models having these things
like getting model parallelism to work and
scaling it on like you know thousands of
or maybe tens of thousands of like V100s
which I think GPT-3 may have been um
there's just so much engineering effort
that has to go into all of these things
to make it work um if you really brought
that cost down
to like you know maybe not zero but just
made it 10x easier made it super easy
for someone with really fantastic ideas
to immediately get to the version of
like the new architecture they dreamed
up that is like getting 50 40% uh
utilization on the gpus I think that
would just speed up research by a ton I
mean I think I think if if you see a
clear path to Improvement you you should
always sort of take the low hanging
fruit first right and I think probably
open eye and and all the other labs it
did the right thing to pick off the low
hanging fruit where the low hanging
fruit is like sort
of you could scale up to a GPT-4.25
scale um and you just keep scaling
and and like things things keep getting
better and as long as like you there's
there's no point of experimenting with
new ideas when like everything
everything is working and you should
sort of bang on and try try to get as
much juice out as possible
and then and then maybe maybe when you
really need new ideas for I think I
think if you're if you're spending $10
trillion you probably want to spend some
you know then actually like reevaluate
your ideas like probably your idea
Limited at that point I think all of us
believe new ideas are probably needed to
get you know all the way there to
AGI
and all of us also probably believe
there exist ways of testing out those
ideas at smaller
scales um and being fairly confident
that they'll play out it's just quite
difficult for the labs in their current
position to dedicate their very limited
research and Engineering talent to
exploring all these other ideas when
there's like this core thing that will
probably improve performance um for some
like decent amount of
time yeah but also these big Labs like
winning so they're just going wild
okay so how uh big question looking out
into the future you're now at the the
center of the programming world how do
you think programming the nature of
programming changes in the next few
months in the next year in the next two
years the next 5 years 10 years I think
we're really excited about a future
where the programmer is in the driver's
seat for a long time and you've heard us
talk about this a little bit but one
that
emphasizes speed and agency for the
programmer and control the ability to
modify anything you want to modify the
ability to iterate really fast on what
you're
building
and this is a little different I think
than where some people um are are
jumping to uh in the space where I think
one idea that's captivated people is can
you talk to your um computer can you
have it build software for you as if
you're talking to like an engineering
department or an engineer over slack and
can it just be this this sort of
isolated text box and um part of the
reason we're not excited about that is
you know some of the stuff we've talked
about with latency but then a big piece
a reason we're not excited about that is
because that comes with giving up a lot
of control it's much harder to be really
specific when you're talking in the text
box and um if you're necessarily just
going to communicate with a thing like
you would be communicating with an
engineering department you're actually
abdicating tons of tons of really
important decisions um to this bot um
and this kind of gets at fundamentally
what engineering is um I think that some
some people who are a little bit more
removed from engineering might think of
it as you know the spec is completely
written out and then the engineers just
come and they just Implement and it's
just about making the thing happen in
code and making the thing um exists um
but I think a lot of the the best
engineering the engineering we
enjoy um involves tons of tiny micro
decisions about what exactly you're
building and about really hard
trade-offs between you know speed and
cost and all the other uh things
involved in a system and uh we want as
long as humans are actually the ones
making you know designing the software
and the ones um specifying what they
want to be built and it's not just like
company run by all AIS we think you'll
really want the human in the
driver's seat um dictating these decisions
and so there's the jury still out on
kind of what that looks like I think
that you know one weird idea for what
that could look like is it could look
like you kind of you can control the
level of abstraction you view a codebase
at and you can point at specific parts
of a codebase that um like maybe you
digest a code Base by looking at it in
the form of pseudo code and um you can
actually edit that pseudo code too and
then have changes get made down at the
the sort of formal programming level and
you keep the like you know you can
gesture at any piece of logic uh in your
software component of programming you
keep the inflow text editing component
of programming you keep the control of
you can even go down into the code you
can go at higher levels of abstraction
while also giving you these big
productivity gains it would be nice if
you can go up and down the the
abstraction stack yeah and there are a
lot of details to figure out there
that's sort of a fuzzy idea time will
tell if it actually works but these
these principles of of control and speed
in the human and the driver seat we
think are really important um we think
for some things like Arvid mentioned
before for some styles of programming
you can kind of hand it off chatbot style
you know if you have a bug that's really
well specified but that's not most of
programming and that's also not most of
the programming we think a lot of people
value uh what about like the fundamental
skill of programming there's a lot of
people
like young people right now kind of
scared like thinking because they like
love programming but they're scared
about like will I be able to have a
future if I pursue this career path do
you think the very skill of programming
will change fundamentally I actually
think this is a really really exciting
time to be building software yeah like
we remember what programming was like in
you know 2013 2012 whatever it was um
and there was just so much more cruft and
boilerplate and and you know looking up
something really gnarly and you know
that stuff still exists it's definitely
not at zero but programming today is way
more fun than back then um it's like
we're really getting down to the the
Delight concentration and all all the
things that really draw people to
programming like for instance this
element of being able to build things
really fast and um speed and also
individual control like all those are
just being turned up a ton um and so I
think it's just going to be I think it's
going to be a really really fun time for
people who build software um I think
that the skills will probably change too
I I think that people's taste and
creative ideas will be magnified and it
will be less
about maybe less a little bit about
boilerplate text editing maybe even a
little bit less about carefulness which
I think is really important today if
you're a programmer I think it'll be a
lot more fun what do you guys think I
agree I'm I'm very excited to be able to
change like just what one thing that
that happened recently was like we
wanted to do a relatively big migration
to our codebase we were using async
local storage in Node.js which is
known to be not very performant and we
wanted to migrate to our context object
and this is a big migration it affects
the entire code base and Sualeh and I
spent I don't know five days uh working
through this even with today's AI tools
and I am really excited for a future
where I can just show a couple of
examples and then the AI applies that to
all of the locations and then it
highlights oh this is a new example like
what should I do and then I show exactly
what to do there and then that can be
done in like 10 minutes uh and then you
can iterate much much faster then you
can then you don't have to think as much
up front and stand at the
blackboard and like think exactly like
how are we going to do this because the
cost is so high but you can just try
something first and you realize oh this
is not actually exactly what I want and
then you can change it instantly again
after and so yeah I think being a
programmer in the future is going to be
a lot of fun yeah I I really like that
point about it feels like a lot of the
time with programming there are two ways
you can go about it one is like you
think really hard carefully upfront
about the best possible way to do it and
then you spend your limited time of
engineering to actually implement it uh
but I much prefer just getting in the
code and like you know taking a crack at
seeing how it how how it kind of lays
out and then
iterating really quickly on that that
feels more fun um yeah like just
speaking to generating the boiler plate
is great so you just focus on the
difficult design nuanced difficult
design decisions migration I feel like
this is this is a cool one like it seems
like large language models able to
basically translate from one programming
language to another or like translate
like migrate in the general sense of
what migrate is um but that's in the
current moment so I mean the fear has to
do with like okay as these models get
better and better then you're doing less
and less creative decisions and is it
going to kind of move to a place where
it's uh you're operating in the design
space of natural language where natural
language is the main programming
language and I guess I could ask that by
way of advice like if somebody's
interested in programming now what do
you think they should
learn like to say you guys started in
some
Java and uh I forget the oh some PHP PHP
Objective C Objective C there you go um
I mean in the end we all know JavaScript
is going to
win uh and not typescript it's just it's
going to be like vanilla JavaScript it's
just going to eat the world and maybe a
little bit of PHP and I mean it also
brings up the question of like I think
Don Knuth has this idea that some percent of
the population is Geeks and like there's
a particular kind of psychology in mind
required for programming and it feels
like more and more that expands the
kind of person that is able to
do great programming might
expand I think different people do
programming for different reasons but I
think the true maybe like the best
programmers um are the ones that really
love just like absolutely love
programming for example there are folks on
our team who
literally when they're they get back
from work they go and then they boot up
cursor and then they start coding on
their side projects for the entire night
and they stay till 3:00 a.m. doing that
um and when they're sad they
say I just really need to
code and I I I think like you know
there's there's that level of programmer
where like this Obsession and love of
programming um I think makes really the
best programmers and I think the these
types of people
will really get into the details of how
things work I guess the question I'm
asking that exact programmer I think about
that
person when the super tab
the super awesome praise be to the tab
succeeds you keep pressing tab that
person on the team loves cursor tab
more than anybody else right yeah and
it's also not just like like pressing
tab is like the just pressing tab that's
like the easy way to say it in the The
Catch catchphrase you know uh but what
you're actually doing when you're
pressing tab is that you're you're
injecting intent uh all the time while
you're doing it you're you're uh
sometimes you're rejecting it sometimes
you're typing a few more characters um
and and that's the way that you're um
you're sort of shaping the things that's
being created and I I think programming
will change a lot to just what is it
that you want to make it's sort of
higher bandwidth the communication to
the computer just becomes higher and
higher bandwidth as opposed to like like
just typing is much lower bandwidth than
than communicating intent I mean this
goes to your uh
Manifesto titled engineering genius we
are an applied research lab building
extraordinarily productive human AI
systems So speaking to this like hybrid
element mhm uh to start we're building
the engineer of the future a human AI
programmer that's an order of magnitude
more effective than any one engineer
this hybrid engineer will have
effortless control over their code base
and no low entropy keystrokes they will
iterate at the speed of their judgment
even in the most complex systems using a
combination of AI and human Ingenuity
they will outsmart and out engineer the
best pure AI systems we are a group of
researchers and Engineers we build
software and models to invent at the
edge of what's useful and what's
possible our work has already improved
the lives of hundreds of thousands of
programmers
and on the way to that we'll at least
make programming more fun so thank you
for talking today thank you thanks for
having us thank you thank you thanks for
listening to this conversation with
Michael swall Arvid and Aman to support
this podcast please check out our
sponsors in the description and now let
me leave you with a random funny and
perhaps profound programming quote I saw
on
Reddit nothing is as permanent as a
temporary solution that
works thank you for listening and hope
to see you next time