#14 - CS 139 - AI programming (Peter Norvig)
By Dan Russell
Summary
## Key takeaways
- **AI's Superior Hurricane Prediction**: A Google AI model has demonstrated superior hurricane prediction capabilities compared to traditional forecasting systems, utilizing data more efficiently and offering a clearer path for future improvements. [04:38]
- **The Return of Physical Buttons in EVs**: Scout's new Terra EV truck features physical buttons, a deliberate return to a more intuitive and less distracting interface compared to touchscreens, reminiscent of 1950s truck dashboards. [02:11]
- **AI Code Generation: Democratizing Development**: AI tools can now generate functional code for websites and applications from simple descriptions, lowering the barrier to entry for coding and enabling individuals without traditional programming skills to create software. [07:34]
- **AI in Coding: Not Yet Autonomous**: While AI can automate parts of code writing, it's not yet capable of full automation. Human oversight is crucial due to potential errors, and the AI project lifecycle, including trust and privacy, remains a concern. [08:43]
- **The 'Vibe Coding' Phenomenon**: Andrej Karpathy's 'vibe coding' approach, where an expert programmer relies heavily on AI to generate code without meticulously reviewing each line, highlights the shift in programming paradigms, though human expertise remains vital for effective prompting and validation. [23:34]
- **AI's Struggle with Mathematical Reasoning**: Despite advancements, many LLMs in 2024 struggled with mathematical reasoning and logic puzzles, often conflating different knowledge states. However, by 2025, significant progress was observed, with half of the tested models correctly solving complex problems. [44:31]
Topics Covered
- Google AI predicts hurricanes better than traditional models.
- Why is code AI's best scratch paper for reasoning?
- Is AI-generated code violating intellectual property rights?
- Can AI now code better than expert human programmers?
- What programming languages will AI need in the future?
Full Transcript
Let's get started. Today we're talking about applying LLMs to writing code. Dan and I had a little bit of a synchronization issue this morning, and we ended up both adding in news of the day. So I'll let him do his, and then I'll do mine.
>> I've got just a couple of things real quick. First off, I've mentioned this before, but on Friday is the HCI seminar, and this one looks pretty interesting. It's kind of an extension of what Peter will be talking about today, although for visual effects: the speaker is coming from Adobe to talk about the new wave of Adobe AI tools. So check that out.
News of the day: there's a new superintelligence team. This is AGI in a different cloak. They're doing this specifically for medical diagnosis. So rather than doing AGI, which is general, they're doing superintelligence, which is not general. They're taking the well-known AI mechanism of focusing on a domain, so just diagnosis. What I found so interesting about this is that they were doing it because this is the classic AI trick: don't try to do everything, just focus on one thing. I also like that at the bottom they say Microsoft plans to invest, quote, "a lot of money," said Mustafa, and Mustafa used to be at DeepMind; he migrated over a while ago. Second thing:
following up on the autonomous vehicles presentation from the other day, I saw this in the news today. XPeng has this G9, which they're planning on rolling out in China. It's going to be the first unmodified mass-produced EV to be a robotaxi, and I'm very interested to see how this is going to work out. I don't know. So if anybody sees an XPeng G9 news brief in the future, I'm curious about how well it's working; let me know. But it's interesting to see that the rest of the world is also doing all the AI work that we're doing.
And on the subject of UI for vehicles, this was just launched, I guess: Scout has announced their new Terra EV truck with physical buttons. Damn it. Right. No more of this touchscreen figure-it-out interface. I don't know about you, but when I'm driving and I've got a touchscreen device, I'm fumbling around trying to figure out where the button is. So they've gone retro. That looks like a dash out of a 1950s Ford truck, but everybody understands how it works, and it actually has a bunch of nice affordances.
I wanted to point out one more thing, which is that people are starting to send in email asking about which date they can present on; that's fine. Two things to note about the final presentation. You need to turn in two parts: your final report, in whatever form that is, and the slides for your presentation. Also, pick your date as soon as possible, because this is the spreadsheet; I'll make it a little bigger so you can see it. You can put in TBD for your title or project description, but you need to do that, and you need to solidify it sometime soon. Then you need to put in your people and choose your date, the 4th or the 11th. These are the two columns: this is going to be your final, the thing you turn in, and this is going to be the presentation you give on the presentation dates. One other thing to note: these are all the slots available for the 4th, and the pink ones are the slots available on the 11th. Notice that nobody has signed up for those. Notice also that the 4th slots are running out fast. So if you have to be on the 4th, get your project description in today, as soon as possible. Otherwise, you won't have a choice. All right, I think that's it. Back to you.
>> All right. So, my choice for news: there were some comparisons done, and it turns out this Google AI model did better at predicting hurricanes than other models, including the main forecasting system. I think that's interesting for a couple of reasons. One is that it's more efficient and it uses data in a better way. With the existing models, what they did is say, "We've got a couple of differential equations for how we think the atmosphere works," and the way we're going to make it run better is by making the grid size on our simulation smaller, maybe gathering more measurements, and then applying a more powerful supercomputer to crunching these numbers. That's how the traditional approach goes, and it feels like they're asymptoting out; they're not getting that much better, because using that approach you can only go so far. Google can take different types of information, beyond just "at this point the temperature and pressure are such and such," and feed it into this AI that combines it in a way that nobody quite understands. It's different from these very precise physical models, and it turns out this does better, and it also feels like it has a path to improve more in the future. So I think that's really interesting. Jensen Huang says China is going to win the AI race. I think he might be right about that; certainly China is making big strides. Of course, everybody's got their own slant, and partially what he's saying is, "Get off my back with this regulation, and here's why you should do that." Some of that I think is accurate and honest, and some of it is self-serving.
Related to that, maybe the EU is pausing part of their landmark AI Act. They had all these rules on what you could do with AI, sort of based on consumer protection, but some people were pointing out, well, maybe it's also based on protection from the competition of all these companies that are coming from the US and China and not from the EU. Part of that backing off may be that we are now seeing companies in the EU. So Mistral in France, they're saying, well, maybe we don't need all these regulations. And Stellantis, you might not have heard of them, but basically they're a huge merger of a bunch of automobile companies, including Fiat and Peugeot, which are European companies, and they're saying maybe we want to back off on this legislation. So those fights will continue. All right, so back to the topic.
Some of you may have played with this, and some of you may know: it's pretty easy now to go in and just give a description, "I want you to write some code," and it does it. So I said, "Make a website for a Stanford student concentrating in human-centered AI who is looking for a job," and it does the HTML and the CSS and maybe some JavaScript, generates something that does a good job, and then you can iterate on that. This is pretty new; it's just in the past couple of years that you've been able to do this. I think it's okay. Those of you in the audience who are CS majors: there are still going to be jobs for you. Don't worry. Don't panic. But it does mean that a lot of people who didn't have access to doing these kinds of things before can now do them. And I also think this is interesting because this issue of building an AI system that will automate part or all of the process of writing code covers kind of all the issues that we've been talking about in this class about how to use AI for anything, right? We're in this state where AI can do amazing things, but it's not perfect. So you can't just hand it over and say, "AI, write all my code." You have to worry about how we get this to work. And so the whole AI project life cycle that appears for any AI system definitely appears here.
So here are all the questions, right? Do I have the right technology? I actually know something about coding that I don't know about in other areas. Is just using deep learning the right thing? We made large language models because linguists had tried and failed for 50 years to write down a grammar of English; we didn't know what that grammar was, and deep learning could do it. With programming languages, we know exactly what the grammar is. And yet these approaches tend not to use everything that we know, and instead use the same approach that we use with English. Is that the right thing? The technology is not mature; we can't do full automation, same as with self-driving cars. How do you get there? What's the human role? How do you build trust with the users? This issue of vigilance fatigue we've seen before: if it gives the right answer ten times in a row, then maybe you stop checking it, and the next time it puts in an error. And all these issues of privacy and security and intellectual property rights show up here, just as they would in most AI systems. Okay. So, what
level of automation are we at? We have these five levels for self-driving cars, and I think the same kind of idea applies here. Level zero would be no automation, and what counts as "no" has changed over the years. In 1957, the programming language Fortran replaced assembly code, and it was called an "automatic coding system," right? So they would have said, "Oh wow, we're really moving up that automation scale," and today we'd say Fortran is a bad, old, inexpressive language. Level one is doing a specific task, maybe autocompletion of code, and we're definitely there. Level two is doing more complex tasks: maybe saying, well, I'm still going to write the main code, but I want the AI system to write all the tests for me. Level three, which I think is mostly where we're at now, is the humans and the AIs working together; the AI might make a mistake, but the human can help correct it. Level four would be driverless in specific situations. I think we're not quite there yet, although for some things, like if you want something that optimizes database queries, or some other specific subset, you may be able to do full automation. And level five, we're definitely not there yet. Okay. So I want to ask you
what you think. Say you're a product manager, and your company comes to you and says, "We want to release some kind of AI writing-code product, but given all these constraints, we're not sure what to do. What should we focus on? What can we build that will definitely help the users and won't go beyond the state of the art? What do you think we can do, and what should we not attempt to do, because that might be dangerous?" So talk to your neighbor for a few minutes and think about that.
All right,
>> Let's bring it back together.
>> What have you guys come up with? What useful product can we build with this amazing yet imperfect technology?
Yeah, so I think that's great. This idea that maybe it's a tool earlier in the cycle, to help you go faster, but you don't want to make the mistake of pushing something out to users that's wrong, so you want more checks in there. And, you know, software companies have been doing that for a long time, right? I'm not allowed to push something to production before it's been code reviewed and tests have been run and all these other checks and balances. The same should be true for AI: I shouldn't let it do all that by itself. Anybody else?
>> Yeah.
>> One big one that we talked about is having more control over how much it writes at one time. They've kind of gotten better with the interface where they list out different tasks as a checklist and complete each one, but it sends them all together, so when you stop it, it kind of forgets where it was. If there were an option where you could tell it to go just one pass at a time, or tell it to do a whole lot at once, that would be really helpful: more control. And that also lets you go back in and fix problems as they come up, rather than having Cursor change a huge codebase.
>> Yeah, I think that's important. We're going to talk about that a little bit more, but there's this idea that there are different time scales of interaction. Before AI, we had this technology of autocomplete: you hit tab and it says, here are all the methods for this variable. That really has to come up in 100 milliseconds; if it takes any longer than that, it interrupts your flow. And it's simple, so we can do that. But if you have more AI in the loop and it's taking a couple of seconds rather than 100 milliseconds, then that's a different kind of interaction. That interaction can still be valuable: when we do pair programming, you're talking back and forth, and it takes a couple of seconds for each interaction, and that's okay. But you should be clear that it's a different kind of interaction. And then there's this third kind that you were talking about with Cursor, of just saying, do all these things, and I'll come back in ten minutes or half an hour and you'll be done. So there are different time scales, and different kinds of products and interactions for each of those. Now, you guys over here were talking about something for the learner, rather than just for producing code. Can you tell me about that idea?
Um, yeah. So I was thinking that if there's a program for students, then it shouldn't give full solutions or implementations to students outright; it should guide them instead.
>> Yeah. So there, I think, the product is not writing code, or maybe that's part of the product, but another important part is making the user better: teaching them something and having them become a better programmer. Some people are worried now that no one's going to learn to be an expert programmer because the machine's going to do it all for them. And we'll talk about some of that a little later on, too.
Okay. So here are some of the tasks that are possible; maybe some of you talked about all of these. Basically, you can go through all the parts of what it takes to code and ask, is this a good target? I won't read them one by one, but if you've done programming, you know there are all these possibilities here, and we could focus on one or the other. Here's what I was talking about: this idea of code completion. It's a 25-year-old technology at least, and it's got to be really fast. You can train it on an existing codebase; it can be personalized or localized, but mostly it's just doing lookup and then showing you the possibilities.
What could we do with a deep learning model to do better? Well, one, we could do re-ranking: rather than just saying here are all the possible methods, we could say, yes, let's fetch all the possible ones, but then let's put the most likely ones first. We can check for syntactic and semantic correctness: will this actually compile? We can focus on making it faster and making the UI unobtrusive. And we can focus on continuity: I don't want to interrupt the programmer if the programmer is in the flow; I want to help them continue. If they're stuck, then I want to get them unstuck.
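The fetch-then-rank idea is easy to sketch. This is my own toy illustration, not any particular product's code; the `score` function is a stand-in for whatever learned likelihood model you'd plug in:

```python
from collections import Counter

def complete(prefix, candidates, score):
    """Fetch-then-rank completion: keep every candidate that
    matches the typed prefix (cheap, exact lookup), then order
    by a learned score (the deep-learning part)."""
    matches = [c for c in candidates if c.startswith(prefix)]
    return sorted(matches, key=score, reverse=True)

# Hypothetical stand-in for a model: how often each method
# appears in the existing codebase.
usage = Counter({'append': 120, 'add': 15, 'appendleft': 3})
print(complete('app', usage, score=lambda m: usage[m]))
# ['append', 'appendleft']
```

The filtering step guarantees every suggestion is at least syntactically plausible; only the ordering comes from the model, so a bad score never shows an invalid completion.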
So I guess the deeper question is: is it even possible for a deep neural network to write code and do a really good job of it? I like this quote from Edsger Dijkstra, you know, one of our most acclaimed computer scientists; there's an algorithm named after him, Dijkstra's algorithm. He said, "In the discrete world of computing, there is no meaningful metric in which small changes and small effects go hand in hand, and there never will be." What he meant by that is that, unlike most things in the physical world, you could take this code that's megabytes of source, change one bit, and the whole thing completely changes: it does something else, or it crashes. In the real physical world, if you make a tiny change, you usually get a tiny outcome. Code is just different from that. And that's worrisome if you're trying to train a deep neural network, because we do it all by gradient descent: we assume that if we make a small change in our program, we'll get a small change in its error, and then we minimize that error. If he's right, then gradient descent isn't going to work and this whole thing is going to fail. So that's his distinguished opinion. Here's another opinion, from Arthur C. Clarke, the science fiction writer, who said that when a distinguished elderly scientist states that something is possible, he is almost certainly right; when he says it's impossible, he is very probably wrong. And I think in this case Dijkstra was proved to be wrong. Fortunately for him, he died before he had to take his words back.
And I'll go to this other expert, Ken Thompson, one of the authors of Unix, who said, "When in doubt, use brute force." And our GPUs and TPUs say, "Yeah, I got this." So Dijkstra is right that there are programs where you change one bit and the whole thing changes. But most of what we write is not like that. A couple of things are like that; just stay away from those. For most of the things we write, you look at the source code, you make a small change, it results in a small change in the output, and we are able to do gradient descent and improve based on that. So,
way back in ancient history, in 2023, Andrej Karpathy said the hottest new programming language is English. He expanded on that in February of this year and invented this term "vibe coding": I'm just going to do this stuff, I'm not even going to look at the code, and it's all going to work. I think that's great, and I also think he's in some sense maybe kidding himself a little bit, right? Part of the reason it works so well for him is that he is an expert programmer. When he gives a prompt, he gives a better prompt than somebody who doesn't know how to program, because he has in his mind where the program is going, and he can help lead the system there. And he says he doesn't look at the code, and I don't think he's lying: he doesn't look at the code line by line, but the system can write a couple hundred lines, and he can glance at it for a second and say, "No, that doesn't look right; let me try again." So having that expertise of the human in the loop really makes a difference. But he is right that the system is doing most of the coding; he's not doing most of it. Okay. And now we're all going to get a
chance to do it. So we're going to do this live. Some of you may already have accounts; some of you may have already done this. Use whatever system you like. I just looked, and the easiest one, the least friction to sign up, especially if you haven't signed up yet, seems to be claude.ai/new. So do this either by yourself or, I think probably better, with a neighbor, two or three of you together; and certainly if you've only got a phone, go find somebody with a laptop. Go to claude.ai/new or your favorite place, and we're all going to invent an app. So I did it. My prompt was:
"Invent a casual word game, something maybe like Wordle but different, and implement it and let me play." That was it; that was the whole prompt. It came up with this thing: there's a little app here that you can run. The game was to change one word into another through a chain of words, where you change one letter at a time. I don't know if that counts as invention, because I think I've seen that before, but putting it into an app maybe counts as invention. So I want to change one word into another, one letter at a time, over multiple steps; it built the app, and you type a word and hit submit, and it all runs, to some extent.
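The core rule the game has to enforce is easy to state in code. Here's a minimal sketch of my own (these function names are mine, not what Claude generated):

```python
def valid_step(a, b):
    """One legal move: same length, exactly one letter changed."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def valid_chain(words, dictionary):
    """A whole chain is valid if every word is a legal word and
    each consecutive pair differs by exactly one letter."""
    return (all(w in dictionary for w in words) and
            all(valid_step(a, b) for a, b in zip(words, words[1:])))

WORDS = {'cold', 'cord', 'card', 'ward', 'warm'}
print(valid_chain(['cold', 'cord', 'card', 'ward'], WORDS))  # True
```

Checking every step against a rule like this, and against a big enough word list, is exactly where the generated version can fall down, as you'll see in a minute.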
So, come up with an idea. What's an app you want to build? Start doing it and see if it works. We'll see how it goes.
>> Yeah, it could be like ten minutes or so, right? So this is going to be longer than the two-minute discussion.
>> Yeah.
Okay, sounds like things have quieted down a little bit. It only took about eight minutes, and it looks like most people made an app. That's pretty cool. If you're not done and you're into it, keep going, but I want to show you how my app went. So it came up with this thing. It has this interface; the interface is not beautiful, but it's okay. But then I looked a little closer, and here's the code. It looks all right, but there are a couple of issues. One is, you know, it says you're supposed to change one letter at a time, and in one of the examples it jumps from one word to a completely different one. What? That's not right; that's a whole lot of changes at once. And in all the other examples, every step was a valid step, changing one letter, and then all of a sudden it did that. So what's going on there? That was really weird. And then the other thing, which may be kind of a minor thing: it has these lists of legal words that you can use, and I think maybe what it did is say, "Let me only list the words that work for the examples that I chose." But I played the game, and I chose a very common word that was not in its list, right? It only had like 300 words; I think it should have had 3,000 words, and that would have been better. So, you know, it's kind of okay, and I could have done some iterations, fixed that, and gotten it better. So, what did you guys come up with? Who wants to share something that they did?
>> Yeah, go ahead.
>> Oh, we reinvented the game Snake.
>> Oh, yeah. Yes.
>> Alexa did it as well.
>> And it worked.
>> So, I have an anecdote of a time I played Snake in real life. My daughter was very little, and there were a bunch of Girl Scouts sitting around playing Duck Duck Goose,
>> The search engine?
>> Right. In the actual physical game, the idea is you go around and tap each person, saying "duck, duck, duck," and then you say "goose," and then they're supposed to get up and chase you. And I said, I'm going to change the rules a little bit: I'm going to pat everybody and say you're all chasing me. And then I'm running around, and I realize I've got a snake behind me; they're following each other one by one. And I can't just go right back to the start, because the tail of the snake will get me. So I have to go in a circuitous path, so that all the snakes follow the person in front of them and allow me to get back to the start.
>> Exactly.
Okay, who else had a fun game that they played or came up with?
>> Yeah. Cool.
>> And I think there's another Flappy Bird over here somewhere. How did yours work? And what did you write it in? Which one did you use, Claude or something else?
>> So we used Claude.
>> So maybe there's something about the exact words of the prompt, or maybe it was just random choices, that it worked for one of you and didn't work for the other. But that's a common lesson: you don't know when it's going to work and when it's not going to work. Anybody else have one they want to talk about?
>> Okay. Well, I hope you had fun with that.
>> One common thread, at least on that side of the room, was: what does it take to export this?
>> Yeah.
>> Externally. And some people had found that there were dependencies that were not obvious, and had issues with that.
>> Yeah, and some people were using other services and so on. I thought Claude was the best for just getting going in ten minutes, but other versions are better for other aspects. Okay, so how does it work? How does it do this stuff? That's pretty cool. There's this interesting paper by Andrej Karpathy in which he kind of goes through and looks at the different neurons within his net and shows what they trigger on. And he says most of them, you can't
really tell, right? So here are the different letters and how much they excite this particular cell; it looks completely random. But some of them, you can figure out exactly what they're doing. There's one that turns on within quotes and comment characters: it has figured out the syntax of how comments work in the language and implemented that in a neuron within the net. And here's one that basically counts the indentation level. This is something that, you know, is well known: you can't do it with a finite-state grammar. And technically you can't do it with a neural net of limited depth, right? So if I went to depth a thousand, probably the neural net would fail. But most programs only go to depth 10 or 20, and it works fine.
Okay, so how could neural nets understand programs? This is back in 2021. I thought this was an interesting paper, and it says here are all the things you could look at. You could train on the source code. We could get the parser to output an abstract syntax tree. We could look at the assembly code that's generated. We could trace it and look at the execution flow. We could look at the design docs and all these other things. So all these different representations: in 2021, people were experimenting with what else we'd want to look at. And then it turns out, in 2025 (here's the diagram from that paper of all the things they looked at), the answer is: no, we don't need any of that. Yeah, we could look at the compiled code, but we don't need to; if we have enough of the source code, that always wins. And so all these clever ideas of how to outsmart things and bring in additional knowledge sources, they're all swamped by just saying, pour more code through it and it will get better.
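For concreteness, the abstract-syntax-tree representation that the 2021 work experimented with is the kind of thing Python's own parser will show you directly:

```python
import ast

# Parse one statement and print its abstract syntax tree.
tree = ast.parse("total = price * qty")
print(ast.dump(tree, indent=2))
# The dump shows an Assign node whose value is a BinOp
# (with a Mult operator) over the Names 'price' and 'qty'.
```

That structured view is exactly the extra knowledge source those systems tried to feed in; the 2025 result is that plain source text, at scale, subsumes it.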
Okay, so what is it good at? Most of you had this experience: it could kind of build a game. Maybe it worked well, maybe it was perfect, maybe there were some flaws. I think it's really good at well-known algorithms. Here's an example: write a Python program to solve the set cover problem, a standard computer science algorithm problem, and it does a decent job. But I also could have done a search on GitHub or something and found similar code. So why would I do this rather than just search and find it on GitHub? I think the reason is that maybe it's not the exact version that I want. So I can say, well, make the subsets have weights, rather than having every element count the same; maybe I want them to be different. That might have been harder to express if I were just doing a search for an algorithm like this. Or I can say, make it more efficient. And that might or might not work; in this case, it didn't. What's going on here is it said, okay, to make it more efficient, it would probably be good to sort the subsets so that I get the best one first. But it does the sort inside the loop.
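As a reference point, here's what a minimal greedy weighted set cover looks like in Python. This is my own sketch of the standard greedy approximation, not what the model generated, and the names are mine:

```python
def greedy_set_cover(universe, subsets):
    """subsets maps a name to (weight, set_of_elements).
    Each round, greedily pick the subset that covers the most
    still-uncovered elements per unit of weight."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        name, (weight, elems) = max(
            subsets.items(),
            key=lambda kv: len(kv[1][1] & uncovered) / kv[1][0])
        if not elems & uncovered:
            raise ValueError("remaining elements cannot be covered")
        chosen.append(name)
        uncovered -= elems
    return chosen

cover = greedy_set_cover(
    {1, 2, 3, 4, 5},
    {'A': (1, {1, 2, 3}), 'B': (1, {4, 5}), 'C': (3, {1, 2, 3, 4, 5})})
print(cover)  # ['A', 'B']
```

Using a dict keyed by name also gives you the "name associated with each subset" variation for free, which is the last customization asked for in this example.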
That actually makes it slower rather than more efficient. What it should have done is the sort outside the loop, maintaining a priority queue, and that would have made it faster. Then I could say, well, maybe I want a name associated with each subset, rather than just a list of elements, and it did that, right? So: it's easy to retrieve an algorithm from all the code that's out there, but if you want to customize it to do exactly what you want, this seems like a better interface. So, what about larger apps? I played
around a lot with these smaller types of things; then my colleague Peter Danenberg and I, over the last couple of weeks, actually built a larger thing. We wanted to build an interactive learning system that's kind of like NotebookLM but focused on making you learn better. So we built it, dumped in some pedagogical book chapters and Python notebooks, and told it to extract a knowledge graph of all the concepts and their prerequisites. It does a pretty good job of that; we were surprised at how well it did. Sometimes it says there's a prerequisite link from A to B when actually it was just that I talked about A before I talked about B, and I could have talked about them in the other order, but mostly it gets it right. It builds that kind of graph, builds a learning-objectives and key-insights summary, and then allows you to have a dialogue, run code, see if it works, and so on. And basically we just threw this together. Here's more of the interactions. It mostly worked; sometimes we'd get things a little bit wrong and have to fix them. But, you know, I don't know Node and npm, and it was able to generate code that worked. So this changed the capabilities for what I could do, and I think a lot of people are seeing that.
Okay, here's another experiment I did. Over the last couple of years there's been a lot of talk about whether LLMs have a theory of mind. By that I mean: do they understand what I'm thinking, can they use that in their thinking, and vice versa? There are a lot of logic puzzles that work like that. So I told it: write a Python program to solve the Cheryl's birthday problem. I don't know if you remember it, but a few years ago this problem went around: Cheryl tells one friend the month of her birthday and the other friend the day, and says it's one of these ten possibilities. The first friend says, "Well, I don't know what it is, but I know the other friend doesn't know either." Then the second says, "Now I know." And the first says, "Because of that, now I know too." So they have to model each other's states of knowledge to come to the conclusion. I tried nine LLMs in 2024 to see if they could do this, and they were all very confident about writing code, and they all got the wrong answer, because they all conflated what I know with what somebody else knows. But then I ran it again in 2025, and now half of them get it right. So that's progress; these things are getting better at a very fast rate.
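As a sketch of the knowledge-state modeling the puzzle requires, here is a minimal Python version of my own (using the standard ten candidate dates; this is not the code any of the LLMs produced). The failure mode described above corresponds exactly to filtering against the wrong candidate set at each step:

```python
# The ten candidate dates Cheryl announces, as (month, day) pairs.
DATES = [("May", 15), ("May", 16), ("May", 19),
         ("June", 17), ("June", 18),
         ("July", 14), ("July", 16),
         ("August", 14), ("August", 15), ("August", 17)]

MONTH, DAY = 0, 1

def consistent(part, value, dates):
    """Dates still possible for a friend who was told `value` of `part`."""
    return [d for d in dates if d[part] == value]

def knows(part, value, dates):
    """A friend knows the birthday iff exactly one candidate matches what they were told."""
    return len(consistent(part, value, dates)) == 1

# 1. The friend told the month doesn't know, and knows the friend told the
#    day doesn't know either: no date sharing this month has a unique day.
step1 = [d for d in DATES
         if not knows(MONTH, d[MONTH], DATES)
         and all(not knows(DAY, d2[DAY], DATES)
                 for d2 in consistent(MONTH, d[MONTH], DATES))]

# 2. Given that statement, the friend told the day now knows.
#    Note the filter is against step1, not DATES -- conflating these two
#    knowledge states is exactly the mistake the 2024 models made.
step2 = [d for d in step1 if knows(DAY, d[DAY], step1)]

# 3. Given that, the friend told the month now knows too.
step3 = [d for d in step2 if knows(MONTH, d[MONTH], step2)]

print(step3)  # [('July', 16)]
```

Each `step` filters the previous step's survivors, so the program literally tracks what each friend could know after each statement.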
Here's another example, a similar kind of thing. My friend Wei-Hwa submitted this, again in 2024, as a math question: list all the ways in which three distinct positive integers have a product of 108. It turns out these are the ones. He asked a bunch of LLMs, and I extended that to the list of nine I had, and only two of them got it right. One of the mistakes they made: they all said, well, it's a good idea to figure out the prime factorization of 108, which is 2 × 2 × 3 × 3 × 3, and now what we have to do is take these numbers and put them into three subsets. But some of them forgot that you could have the empty subset, or equivalently that the number one is a factor of 108, and so they got it wrong for that reason or for other reasons. So two out of nine succeeded on the math question. But then I turned it into a programming question: write a program to list all the ways in which three distinct positive integers have a product of 108. Now seven of nine got it in 2024, and eight of nine in 2025.
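The programming version is nearly trivial, which is the point. A brute-force sketch of my own (not the models' output) — note how naturally the divisor loop starts at 1, so the "forget about one" mistake can't happen:

```python
from itertools import combinations

def distinct_triples_with_product(n):
    """All sets of three distinct positive integers whose product is n."""
    # Starting the range at 1 automatically includes 1 as a divisor.
    divisors = [d for d in range(1, n + 1) if n % d == 0]
    return [(a, b, c) for a, b, c in combinations(divisors, 3)
            if a * b * c == n]

triples = distinct_triples_with_product(108)
print(len(triples))  # 8
print(triples[0])    # (1, 2, 54)
```

There are eight triples in all, five of which contain 1 — which is why forgetting that 1 is a factor loses most of the answers.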
And I think what's going on here is this: one is kind of strange when it comes to prime numbers and multiplication, right? We're not quite sure: is one a prime or not? But in programming you always say "for i = 1 to n"; you never say "for i = 2 to n." So in the math question it was easy to forget one; in the programming question it was harder to forget. And I think in general there are different representations, different ways of talking, that will be better for some problems than others. The language of math is amazing and lets you do a bunch of things; so is the language of programming, which lets you do overlapping but different types of things. And the way LLMs work, they do thinking on their own, but they have to have some way to represent that thinking in some kind of format. That's why they do better when you say: use this think-aloud protocol, show your intermediate work. And there are some problems for which a programming language is a really great intermediate format.
And here's an example: OpenAI Codex did that. Again, they were solving math problems, but they said, we're going to do an intermediate step where we generate a program and solve the math problem through that. I think this is important for two reasons. One, it focuses the reasoning. It's important for the model to have scratch paper it can write on, and often writing that as a program is a better way to do it than writing English statements or math statements. And secondly, you can do voting with programs: you can have it say, I'm going to generate 10 possible programs and run them all. Oh look, eight of them gave the same answer; maybe that's the right answer. Whereas if I said generate 10 paragraphs or 10 math statements, I can't execute them, so I can't tell whether they agree or disagree. Those are the advantages of using programs as an intermediate representation.
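That voting trick can be sketched in a few lines. This is a toy of my own, not Codex's machinery; the three "generated programs" here are hand-written stand-ins, one of which has a deliberate off-by-one bug:

```python
from collections import Counter

def majority_vote(candidates, *args):
    """Run each candidate program on the same input and take the most
    common answer. Candidates that crash are simply ignored."""
    answers = []
    for fn in candidates:
        try:
            answers.append(fn(*args))
        except Exception:
            pass
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count

# Three stand-ins for "generated programs" computing the sum of the
# first n squares:
progs = [
    lambda n: sum(i * i for i in range(1, n + 1)),  # correct
    lambda n: n * (n + 1) * (2 * n + 1) // 6,       # correct, closed form
    lambda n: sum(i * i for i in range(n)),         # off-by-one bug
]
print(majority_vote(progs, 10))  # (385, 2): two of the three agree
```

The key property is the one the lecture names: because programs are executable, agreement is checkable, which you can't do with ten generated paragraphs.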
Now, I said the way you succeed is by piling in more and more code and training on it. Inevitably, we run into intellectual property issues. So: a lawsuit was filed over GitHub Copilot, saying you weren't following all the licensing agreements. We have all this code, and some of it has a Creative Commons or some other kind of licensing agreement associated with it. Copilot copied all that code and maybe violated some of those licensing agreements, so there are issues over that. I want you to take a couple of minutes and talk to your neighbor: what do you think is okay in terms of intellectual property with code? We have all this open-source code posted up there. Is it okay for an AI program to read this public source code? Is it okay to learn from it? Is it okay to store a copy? Is it okay for it to generate code that's similar? A lot of you generated code to play Wordle or Flappy Bird that might have been similar to other code. Is it okay to generate an exact copy? What do you think is legal or illegal? And what do you think should be allowed just by norms rather than by law?
>> Go ahead and discuss.
All right,
let's wrap that up. So all these issues in some sense are the same for code as they are for any other training data out on the web. But in another sense they're different, right? My random blog post or my restaurant review may technically be my intellectual property, but it doesn't feel like it has much value, whereas code has been proven to have very strong, sometimes very large, value. So in that sense it feels different. So what did you discuss, and what conclusions did you come to?
>> Yes, so I think that's right. And I think one of the issues that makes this confusing, and one reason GitHub was in the middle of it, is that it's a common repository with a lot of code, and a lot of it is visible. You are allowed to make private repositories in GitHub that nobody can see, but a lot of them are public, and yet they carry a license that restricts what you can do with them. I think that's different from a lot of other sources.
My group of guys here in the middle had an interesting thought on citations.
>> Yeah, basically we were saying, think about what a human can do. A human can do pretty much all of these, except copying exact code or taking ideas from someone else without credit. So the AI should be able to do all the same things, except it should cite. But then the idea came up that AI gets its ideas from a lot of different places, so citation becomes pretty confusing, because every couple of lines of code would need a new citation to a new place.
>> Yeah, I think that's right. A couple of issues with that. One is: where are they going to put the citation? If I publish a paper, it's clear that if I borrowed something, I'm definitely going to put a citation there. But if AI generates code, it doesn't feel like there's a place to put that citation; maybe there could be footnotes somewhere, but nobody's going to look at them, so that seems less likely. And then there have been big issues, even before AI, over what you're allowed to copy and not copy. For example, when Google wanted to make the Android operating system, they wanted to do it in Java, and they went to the owners of Java and tried to work out a licensing agreement, but they couldn't get one that worked. They said, we could pay you this amount if we were charging money for every copy of the operating system, but we want to give it away for free, so we can't pay you much per copy. So Google decided, we're just going to reimplement Java from scratch, in these clean rooms where it's very clear that nobody can look at the Java source code and they have to rewrite it on their own. They did that, and they got sued anyway. One of the pieces of evidence was a six-line method that was the same in the Java implementation and their implementation. Basically it says: if x is less than zero, give this error message; if it's greater than n, give this error message; else do the right thing. And it kind of felt like anybody could write that code. You could write it slightly differently or exactly the same, but it's still the same idea, and any programmer would have come up with something similar. But that was my opinion as a programmer; the legal opinions look at it differently. And I think we still haven't resolved all those types of issues.
Okay. So, programming contests. We'll get to some recent big wins, but this is from AlphaCode, which I guess is ancient history now, probably 2023. They entered these contests and did well, and this was their primary example. They said: here are the 50 examples we worked on; you can look at all of them and at the code for all of them, but this is the one we're going to talk about the most. And this is what the input to the program looks like: an English-language description, and then a formal input and expected output. Basically what they're asking is: we're going to pass you two strings, s and t, and ask whether you could generate the string t by typing the characters in s, where for any character you have the option of typing a backspace instead of that character. So I give you "ababa" and I want you to generate "ba"; you could do that by substituting a backspace for each of the first three characters and then just typing "ba", and that would give you the output. So then you should say yes. And there might have been other ways of putting the backspaces in different places and getting the right answer.
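For concreteness, here is my own compact sketch of a solution, using the standard greedy of matching t against s from the right (this is not AlphaCode's generated code):

```python
def can_type(s: str, t: str) -> bool:
    """Can t be produced by typing s, where any character of s may be
    replaced by a backspace? Match t against s from the right: a
    character of s that can't match must be erased, and erasing it costs
    a backspace that also consumes the character typed before it, so we
    skip two characters of s at once."""
    i, j = len(s) - 1, len(t) - 1
    while i >= 0:
        if j >= 0 and s[i] == t[j]:
            i -= 1
            j -= 1
        else:
            i -= 2  # replace s[i] with a backspace; it erases the previous character too
    return j < 0  # yes iff all of t was matched

print(can_type("ababa", "ba"))  # True
print(can_type("ab", "ba"))     # False -- typing can't reorder characters
```

The whole thing is linear time, and the one nontrivial idea (skipping two characters on a mismatch) is stated in the docstring rather than left implicit.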
So here's the program they came up with. The system generated this code, and they put annotations on the side describing what it does, and yes, it's right; it works 100% of the time. But before anybody checks code into my repository, they have to go through a code review. So I did a code review, and there's almost no line I wouldn't want them to change. There's a bunch of stuff here; a lot of it is style issues that would be easy to change, and some of it is a little deeper than that. And again, the program is 100% right; I just don't like it. Here's one thing: they're dealing with these stacks of characters, A and B, and they initialize a stack C to be the empty stack, and when they pop something off of stack B they store it onto stack C. I can see where they would get that, because it happens a lot when you're dealing with stacks: you pop something off and say, I'd better save this somewhere, I might need it later. But they never use C anywhere else in the entire program. I know programmers who write like that, and I don't want to hire them. And maybe this is inevitable, right? Because they trained this on a lot of code, and half the code was written by people who are below average.
All right, and this is kind of interesting: you can go to that website and play with it, and it shows you at each point what it thinks might come next. So you type "for blank in range(t)", and it says: my first guess is i, which is the most common iterator; then underscore, which is sometimes used in Python when you don't care about the value of the variable; and then there are these other possibilities. You can go through and figure out how it's thinking.
So I showed you the program I didn't like; I'd rather have a program like this. I think this is a lot simpler, and the system should be able to get there. I'd also like to be able to say: generate a bunch of test cases. The program gave me four test cases, and I don't think that's enough; generate a bunch of other ones, make them cover all the possibilities, and do that automatically for me. I think that would be good. And then I'd like to say: I had some optimizations in the program, but I'm not sure those optimizations would always work, so give me a system that's slow but obviously correct. That is: generate all possible outputs from the source, all possible ways of using a backspace or not, and then check whether the target is one of those possibilities. This is an exponentially slow algorithm, but if it gets the same answer as the other algorithm, that's more evidence that I got it right. And so I'd like to be able to get my system to do these things for me: it was too hard for me to write this code, but I could ask it, prove to me that it's correct by doing that in this degenerate case. We'll skip through some of this, but these are the types of questions I'd want to ask. If I were doing this myself, I might ask these questions of myself; if I were doing pair programming, I might ask them of my partner. I want the system to be able to engage in these conversations and do all this. I don't just want to see code that's correct; I want code that gives me more confidence in it.
So how well does this work? In 2022, it looked like a 5 to 10% improvement. In 2023, another study got 40%. In 2025, another study said 57%. It looks like it's going up, but there are other studies, and they're all over the map; it really depends on exactly what you're measuring, who's measuring it, what their setup is, and what they're trying to do. But it does seem like there's a lot of progress here. On the other hand, here's a study that says there was a 41% increase in bugs: people going too fast, maybe getting ahead of themselves and building up technical debt.
Here's another one that says large language models can outperform human programmers at this international programming competition. This is a lot more serious than what AlphaCode did in 2023. Here in 2025, Gemini solved a problem that no human team could solve, and OpenAI had a similar performance. They both did well; OpenAI actually did a little bit better, but Gemini competed in the actual competition, while OpenAI did it on the side and said, "We think we scored a perfect score." So it's your choice which of those you want to believe more.
>> Okay. Something I think is really interesting is: what programming language should we be using? In the early days of programming, we said we're going to program in assembly language, because programmers are cheap but computers are big and expensive, so we want to make sure the code is super efficient. Then we said, now we're going to program mostly in C, which is a compromise between speed for the programmer and speed for the computer. Now we program in Python, which is not very efficient in terms of using machine resources but is better for the programmer. In the future we want something that's good for the hardware, for the human programmer, and for the LLM. So what should that be like? One argument is that it should be Python and JavaScript, because those are the languages we have the most training data on, so those are the languages the models will do best in. The counterargument is that it's easy to translate between languages, so maybe we could use something else. Maybe we should use languages with the most explicit information: in Python, if you don't have type declarations, you know less about the program, whereas in a language that's more strongly typed, you know more about it. Maybe that would be better, or maybe something completely new.
And here's a map of things that people are experimenting with, and it mostly comes down to probabilistic programming and differentiable programming. I think this really speaks to what types of problems we want to solve. If we're trying to run simulations of the real world, of hurricanes or whatever, then our traditional languages aren't that good. They're built on things like if statements: if X is true, then do Y, else do Z. In the real world, we rarely know anything with 100% certainty. We can build that in at the application level: I'm going to have probability distributions, I'm going to have uncertainty, and so on. But that's on top of the language. The idea of probabilistic programming is that we should build it into the language, so an if statement, rather than just saying true or false, deals with a whole probability distribution. And then differentiable programming says: I'm going to write a program with a bunch of parameters in it, and then I can automatically choose the right parameters to make the system run better.
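As a toy illustration of the differentiable-programming idea (entirely my own sketch, using a finite-difference gradient rather than a real autodiff system): treat a program's parameter as something that is tuned automatically to make its output better.

```python
def program(theta):
    """A stand-in 'program' whose quality depends on a parameter theta.
    Here the loss is simply (theta - 3)^2, which is minimized at theta = 3."""
    return (theta - 3.0) ** 2

def tune(loss, theta=0.0, lr=0.1, steps=200, eps=1e-6):
    """Gradient descent with a finite-difference gradient: this is the
    'automatically choose the right parameters' step."""
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

print(round(tune(program), 3))  # 3.0
```

A real differentiable language computes `grad` exactly through the whole program instead of by numerical differencing, but the shape of the workflow is the same: write the program, then let the system pick its parameters.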
Here's something that could have gone in the news of the day, but I'm putting it here. It's a proposal for software that's more modular and can fit together better using things like LLMs. Their point is that you have things that seem kind of simple. I build this app and there's a button. The button should be one thing, but the way we write code today, a button isn't one thing: there's a button, you press it, that calls something, which makes a remote call to another computer, which responds, and then you process the response, and so on. So a button doesn't stand alone; it stands connected to all these other things. And the proposal is: can we write software that separates that out more, to make each of those components more independent?
This book by Hanson and Sussman, Software Design for Flexibility, says one should not have to modify a working program; one should be able to add to it. And we don't do that, right? We modify programs all the time when we want to make them better. They're saying that if we made software modular enough, we wouldn't have to: we'd just say, here's the program you had before, and it should also do this new thing, without deleting what you did before. They also compare our built software to biological systems, saying biological systems use contextual signals that are informative rather than imperative: there's no master commander saying what each part must do. The cells of my body communicate with each other in ways I don't understand, and there's no one central processor saying that at this step the cells have to do the next thing; they all interact with each other. We're building software that's more and more complex, and maybe it's more like a biological system, so maybe we need programming languages that support that.
And Kevin Kelly has a more philosophical take on that: all is flux, nothing is finished. That means processes are more important than products, and so we should optimize to make that process easy, so that we can make changes and have our programs evolve. So let's stop there.