#14 - CS 139 - AI programming (Peter Norvig)
By Dan Russell
Summary
## Key takeaways
- **AI's Superior Hurricane Prediction**: A Google AI model has demonstrated superior hurricane prediction capabilities compared to traditional forecasting systems, utilizing data more efficiently and offering a clearer path for future improvements. [04:38]
- **The Return of Physical Buttons in EVs**: Scout's new Terra EV truck features physical buttons, a deliberate return to a more intuitive and less distracting interface compared to touchscreens, reminiscent of 1950s truck dashboards. [02:11]
- **AI Code Generation: Democratizing Development**: AI tools can now generate functional code for websites and applications from simple descriptions, lowering the barrier to entry for coding and enabling individuals without traditional programming skills to create software. [07:34]
- **AI in Coding: Not Yet Autonomous**: While AI can automate parts of code writing, it's not yet capable of full automation. Human oversight is crucial due to potential errors, and the AI project lifecycle, including trust and privacy, remains a concern. [08:43]
- **The 'Vibe Coding' Phenomenon**: Andrej Karpathy's 'vibe coding' approach, where an expert programmer relies heavily on AI to generate code without meticulously reviewing each line, highlights the shift in programming paradigms, though human expertise remains vital for effective prompting and validation. [23:34]
- **AI's Struggle with Mathematical Reasoning**: Despite advancements, many LLMs in 2024 struggled with mathematical reasoning and logic puzzles, often conflating different knowledge states. However, by 2025, significant progress was observed, with half of the tested models correctly solving complex problems. [44:31]
Topics Covered
- Google AI predicts hurricanes better than traditional models.
- Why is code AI's best scratch paper for reasoning?
- Is AI-generated code violating intellectual property rights?
- Can AI now code better than expert human programmers?
- What programming languages will AI need in the future?
Full Transcript
Let's get started. Today we're talking about applying LLMs to writing code. Dan and I had a little bit of a synchronization issue this morning, and we ended up both adding in news of the day. So I'll let him do his, and then I'll do mine.
>> I've got just a couple of things real quick. First off, I've mentioned this before, but on Friday is the HCI seminar, and this one looks pretty interesting. It's kind of an extension of what Peter will be talking about today, although for visual effects: the speaker is coming from Adobe to talk about the new wave of Adobe AI tools. So check that out.
News of the day: there's a new superintelligence team. This is AGI in a different cloak. They're doing this specifically for medical diagnosis. So rather than doing AGI, which is general, they're doing superintelligence, which is not general. They're taking the well-known AI mechanism of focusing on a domain, so just diagnosis. What I found so interesting about this is that they were doing it because this is the classic AI trick: don't try to do everything, just focus on one thing. I also like that at the bottom they say Microsoft plans to invest, quote, "a lot of money," said Mustafa, and Mustafa used to be at DeepMind; he migrated over a while ago. Second thing:
following up on the autonomous vehicles presentation from the other day, I saw this in the news today. XPeng has this G9, which they're planning on rolling out in China. It's going to be the first unmodified mass-produced EV to be a robotaxi, and I'm very interested to see how this is going to work out. I don't know. So if anybody sees an XPeng G9 news brief in the future, I'm curious about how well it's working; let me know. But it's interesting to see that the rest of the world is also doing all the AI work that we're doing.
And on the subject of UI for vehicles, this was just launched, I guess: Scout has announced their new Terra EV truck with physical buttons. Damn it. Right. No more of this touchscreen figure-it-out interface. I don't know about you, but when I'm driving and I've got a touchscreen device, I'm fumbling around trying to figure out where the button is. So they've gone retro. That looks like a dash out of a 1950s Ford truck, but everybody understands how it works, and it actually has a bunch of nice affordances.
I wanted to point out one more thing, which is that people are starting to send in email asking about which date they can present on; that's fine. Two things to note about the final presentation. You need to turn in two parts: your final report, in whatever form that is, and the slides for your presentation. Also, pick your date as soon as possible, because this is the spreadsheet; I'll make it a little bigger so you can see it. You can put in TBD for your title or project description, but you need to do that, and you need to solidify it sometime soon. Then you need to put in your people and choose your date, the 4th or the 11th. These are the two columns: this is going to be your final, the thing you turn in, and this is going to be the presentation you give on the presentation dates. One other thing to note: these are all the slots available for the 4th, and the pink ones are the slots available on the 11th. Notice that nobody has signed up for those. Notice also that the 4th slots are running out fast. So if you have to be on the 4th, get your project description in today, as soon as possible. Otherwise, you won't have a choice. All right, I think that's it. Back to you.
>> All right. So, my choice for news: there were some comparisons done, and it turns out this Google AI model did better at predicting hurricanes than other models, including the main forecasting system. I think that's interesting for a couple of reasons. One is that it's more efficient and it uses data in a better way. With the existing models, what they did is say, "We've got a couple of differential equations for how we think the atmosphere works," and the way we're going to make it run better is by making the grid size on our simulation smaller, maybe gathering more measurements, and then applying a more powerful supercomputer to crunching these numbers. That's how the traditional approach goes, and it feels like they're asymptoting out; they're not getting that much better, because using that approach you can only go so far. Google can take different types of information, beyond just "at this point the temperature and pressure are such and such," and feed it into this AI that combines it in a way that nobody quite understands. It's different from these very precise physical models, and it turns out this does better, and it also feels like it has a path to improve more in the future. So I think that's really interesting. Jensen Huang says China is going to win the AI race. I think he might be right about that; certainly China is making big strides. Of course, everybody's got their own slant, and partially what he's saying is, "Get off my back with this regulation, and here's why you should do that." Some of that I think is accurate and honest, and some of it is self-serving.
Related to that, maybe the EU is pausing part of their landmark AI Act. They had all these rules on what you could do with AI, sort of based on consumer protection, but some people were pointing out, well, maybe it's also based on protection from the competition of all these companies that are coming from the US and China and not from the EU. Part of that backing off may be that we are now seeing companies in the EU. So Mistral in France, they're saying, well, maybe we don't need all these regulations. And Stellantis, you might not have heard of them, but basically they're a huge merger of a bunch of automobile companies, including Fiat and Peugeot, which are European companies, and they're saying maybe we want to back off on this legislation. So those fights will continue. All right, so back to the topic.
Some of you may have played with this, and some of you may know: it's pretty easy now to go in and just give a description, "I want you to write some code," and it does it. So I said, "Make a website for a Stanford student concentrating in human-centered AI who is looking for a job," and it does the HTML and the CSS and maybe some JavaScript, generates something that does a good job, and then you can iterate on that. This is pretty new; it's just in the past couple of years that you've been able to do this. I think it's okay. Those of you in the audience who are CS majors: there are still going to be jobs for you. Don't worry. Don't panic. But it does mean that a lot of people who didn't have access to doing these kinds of things before can now do them. And I also think this is interesting because this issue of building an AI system that will automate part or all of the process of writing code covers kind of all the issues that we've been talking about in this class about how to use AI for anything, right? We're in this state where AI can do amazing things, but it's not perfect. So you can't just hand it over and say, "AI, write all my code." You have to worry about how we get this to work. And so the whole AI project life cycle that appears for any AI system definitely appears here.
So here are all the questions, right? Do I have the right technology? I actually know something about coding that I don't know about in other areas. Is just using deep learning the right thing? We made large language models because linguists had tried and failed for 50 years to write down a grammar of English; we didn't know what that grammar was, and deep learning could do it. With programming languages, we know exactly what the grammar is. And yet these approaches tend not to use everything that we know, and instead use the same approach that we use with English. Is that the right thing? The technology is not mature; we can't do full automation, same as with self-driving cars. How do you get there? What's the human role? How do you build trust with the users? This issue of vigilance fatigue we've seen before: if it gives the right answer ten times in a row, then maybe you stop checking it, and the next time it puts in an error. And all these issues of privacy and security and intellectual property rights show up here, just as they would in most AI systems. Okay. So, what
level of automation are we at? We have these five levels for self-driving cars, and I think the same kind of idea applies here. Level zero would be no automation, and what counts as "no" has changed over the years. In 1957, the programming language Fortran replaced assembly code, and it was called an "automatic coding system," right? So they would have said, "Oh wow, we're really moving up that automation scale," and today we'd say Fortran is a bad, old, inexpressive language. Level one is doing a specific task, maybe autocompletion of code, and we're definitely there. Level two is doing more complex tasks: maybe saying, well, I'm still going to write the main code, but I want the AI system to write all the tests for me. Level three, which I think is mostly where we're at now, is the humans and the AIs working together; the AI might make a mistake, but the human can help correct it. Level four would be driverless in specific situations. I think we're not quite there yet, although for some things, like if you want something that optimizes database queries, or some other specific subset, you may be able to do full automation. And level five, we're definitely not there yet. Okay. So I want to ask you
what you think. Say you're a product manager, and your company comes to you and says, "We want to release some kind of AI writing-code product, but given all these constraints, we're not sure what to do. What should we focus on? What can we build that will definitely help the users and won't go beyond the state of the art? What do you think we can do, and what should we not attempt to do, because that might be dangerous?" So talk to your neighbor for a few minutes and think about that.
All right,
>> Let's bring it back together.
>> What have you guys come up with? What useful product can we build with this amazing yet imperfect technology?
Yeah, so I think that's great. This idea that maybe it's a tool earlier in the cycle, to help you go faster, but you don't want to make the mistake of pushing something out to users that's wrong, so you want more checks in there. And, you know, software companies have been doing that for a long time, right? I'm not allowed to push something to production before it's been code reviewed and tests have been run and all these other checks and balances. The same should be true for AI: I shouldn't let it do all that by itself. Anybody else?
>> Yeah.
>> One big one that we talked about is having more control over how much it writes at one time. They've kind of gotten better with the interface where they list out different tasks as a checklist and complete each one, but it sends them all together, so when you stop it, it kind of forgets where it was. If there were an option where you could tell it to go just one pass at a time, or tell it to do a whole lot at once, that would be really helpful: more control. And that also lets you go back in and fix problems as they come up, rather than having Cursor change a huge codebase.
>> Yeah, I think that's important. We're going to talk about that a little bit more, but there's this idea that there are different time scales of interaction. Before AI, we had this technology of autocomplete: you hit tab and it says, here are all the methods for this variable. That really has to come up in 100 milliseconds; if it takes any longer than that, it interrupts your flow. And it's simple, so we can do that. But if you have more AI in the loop and it's taking a couple of seconds rather than 100 milliseconds, then that's a different kind of interaction. That interaction can still be valuable: when we do pair programming, you're talking back and forth, and it takes a couple of seconds for each interaction, and that's okay. But you should be clear that it's a different kind of interaction. And then there's this third kind that you were talking about with Cursor, of just saying, do all these things, and I'll come back in ten minutes or half an hour and you'll be done. So there are different time scales, and different kinds of products and interactions for each of those. Now, you guys over here were talking about something for the learner, rather than just for producing code. Can you tell me about that idea?
Um, yeah. So I was thinking that if there's a program for students, then it shouldn't give full solutions or implementations to students outright; it should guide them instead.
>> Yeah. So there, I think, the product is not writing code, or maybe that's part of the product, but another important part is making the user better: teaching them something and having them become a better programmer. Some people are worried now that no one's going to learn to be an expert programmer because the machine's going to do it all for them. And we'll talk about some of that a little later on, too.
Okay. So here are some of the tasks that are possible; maybe some of you talked about all of these. Basically, you can go through all the parts of what it takes to code and ask, is this a good target? I won't read them one by one, but if you've done programming, you know there are all these possibilities here, and we could focus on one or the other. Here's what I was talking about: this idea of code completion. It's a 25-year-old technology at least, and it's got to be really fast. You can train it on an existing codebase; it can be personalized or localized, but mostly it's just doing lookup and then showing you the possibilities.
What could we do with a deep learning model to do better? Well, one, we could do re-ranking: rather than just saying here are all the possible methods, we could say, yes, let's fetch all the possible ones, but then let's put the most likely ones first. We can check for syntactic and semantic correctness: will this actually compile? We can focus on making it faster and making the UI unobtrusive. And we can focus on continuity: I don't want to interrupt the programmer if the programmer is in the flow; I want to help them continue. If they're stuck, then I want to get them unstuck.
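The fetch-then-rank idea is easy to sketch. This is my own toy illustration, not any particular product's code; the `score` function is a stand-in for whatever learned likelihood model you'd plug in:

```python
from collections import Counter

def complete(prefix, candidates, score):
    """Fetch-then-rank completion: keep every candidate that
    matches the typed prefix (cheap, exact lookup), then order
    by a learned score (the deep-learning part)."""
    matches = [c for c in candidates if c.startswith(prefix)]
    return sorted(matches, key=score, reverse=True)

# Hypothetical stand-in for a model: how often each method
# appears in the existing codebase.
usage = Counter({'append': 120, 'add': 15, 'appendleft': 3})
print(complete('app', usage, score=lambda m: usage[m]))
# ['append', 'appendleft']
```

The filtering step guarantees every suggestion is at least syntactically plausible; only the ordering comes from the model, so a bad score never shows an invalid completion.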
So I guess the deeper question is: is it even possible for a deep neural network to write code and do a really good job of it? I like this quote from Edsger Dijkstra, you know, one of our most acclaimed computer scientists; there's an algorithm named after him, Dijkstra's algorithm. He said, "In the discrete world of computing, there is no meaningful metric in which small changes and small effects go hand in hand, and there never will be." What he meant by that is that, unlike most things in the physical world, you could take this code that's megabytes of source, change one bit, and the whole thing completely changes: it does something else, or it crashes. In the real physical world, if you make a tiny change, you usually get a tiny outcome. Code is just different from that. And that's worrisome if you're trying to train a deep neural network, because we do it all by gradient descent: we assume that if we make a small change in our program, we'll get a small change in its error, and then we minimize that error. If he's right, then gradient descent isn't going to work and this whole thing is going to fail. So that's his distinguished opinion. Here's another opinion, from Arthur C. Clarke, the science fiction writer, who said that when a distinguished elderly scientist states that something is possible, he is almost certainly right; when he says it's impossible, he is very probably wrong. And I think in this case Dijkstra was proved to be wrong. Fortunately for him, he died before he had to take his words back.
And I'll go to this other expert, Ken Thompson, one of the authors of Unix, who said, "When in doubt, use brute force." And our GPUs and TPUs say, "Yeah, I got this." So Dijkstra is right that there are programs where you change one bit and the whole thing changes. But most of what we write is not like that. A couple of things are like that; just stay away from those. For most of the things we write, you look at the source code, you make a small change, it results in a small change in the output, and we are able to do gradient descent and improve based on that. So,
way back in ancient history, in 2023, Andrej Karpathy said the hottest new programming language is English. He expanded on that in February of this year and invented this term "vibe coding": I'm just going to do this stuff, I'm not even going to look at the code, and it's all going to work. I think that's great, and I also think he's in some sense maybe kidding himself a little bit, right? Part of the reason it works so well for him is that he is an expert programmer. When he gives a prompt, he gives a better prompt than somebody who doesn't know how to program, because he has in his mind where the program is going, and he can help lead the system there. And he says he doesn't look at the code, and I don't think he's lying: he doesn't look at the code line by line, but the system can write a couple hundred lines, and he can glance at it for a second and say, "No, that doesn't look right; let me try again." So having that expertise of the human in the loop really makes a difference. But he is right that the system is doing most of the coding; he's not doing most of it. Okay. And now we're all going to get a
chance to do it. So we're going to do this live. Some of you may already have accounts; some of you may have already done this. Use whatever system you like. I just looked, and the easiest one, the least friction to sign up, especially if you haven't signed up yet, seems to be claude.ai/new. So do this either by yourself or, I think probably better, with a neighbor, two or three of you together; and certainly if you've only got a phone, go find somebody with a laptop. Go to claude.ai/new or your favorite place, and we're all going to invent an app. So I did it. My prompt was:
"Invent a casual word game, something maybe like Wordle but different, and implement it and let me play." That was it; that was the whole prompt. It came up with this thing: there's a little app here that you can run. The game was to change one word into another through a chain of words, where you change one letter at a time. I don't know if that counts as invention, because I think I've seen that before, but putting it into an app maybe counts as invention. So I want to change one word into another, one letter at a time, over multiple steps; it built the app, and you type a word and hit submit, and it all runs, to some extent.
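The core rule the game has to enforce is easy to state in code. Here's a minimal sketch of my own (these function names are mine, not what Claude generated):

```python
def valid_step(a, b):
    """One legal move: same length, exactly one letter changed."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def valid_chain(words, dictionary):
    """A whole chain is valid if every word is a legal word and
    each consecutive pair differs by exactly one letter."""
    return (all(w in dictionary for w in words) and
            all(valid_step(a, b) for a, b in zip(words, words[1:])))

WORDS = {'cold', 'cord', 'card', 'ward', 'warm'}
print(valid_chain(['cold', 'cord', 'card', 'ward'], WORDS))  # True
```

Checking every step against a rule like this, and against a big enough word list, is exactly where the generated version can fall down, as you'll see in a minute.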
So, come up with an idea. What's an app you want to build? Start doing it and see if it works. We'll see how it goes.
>> Yeah, it could be like ten minutes or so, right? So this is going to be longer than the two-minute discussion.
>> Yeah.
Okay, sounds like things have quieted down a little bit. It only took about eight minutes, and it looks like most people made an app. That's pretty cool. If you're not done and you're into it, keep going, but I want to show you how my app went. So it came up with this thing. It has this interface; the interface is not beautiful, but it's okay. But then I looked a little closer, and here's the code. It looks all right, but there are a couple of issues. One is, you know, it says you're supposed to change one letter at a time, and in one of the examples it jumps from one word to a completely different one. What? That's not right; that's a whole lot of changes at once. And in all the other examples, every step was a valid step, changing one letter, and then all of a sudden it did that. So what's going on there? That was really weird. And then the other thing, which may be kind of a minor thing: it has these lists of legal words that you can use, and I think maybe what it did is say, "Let me only list the words that work for the examples that I chose." But I played the game, and I chose a very common word that was not in its list, right? It only had like 300 words; I think it should have had 3,000 words, and that would have been better. So, you know, it's kind of okay, and I could have done some iterations, fixed that, and gotten it better. So, what did you guys come up with? Who wants to share something that they did?
>> Yeah, go ahead.
>> Oh, we reinvented the game Snake.
>> Oh, yeah. Yes.
>> Alexa did it as well.
>> And it worked.
>> So, I have an anecdote of a time I played Snake in real life. My daughter was very little, and there were a bunch of Girl Scouts sitting around playing Duck Duck Goose,
>> The search engine?
>> Right. In the actual physical game, the idea is you go around and tap each person, saying "duck, duck, duck," and then you say "goose," and then they're supposed to get up and chase you. And I said, I'm going to change the rules a little bit: I'm going to pat everybody and say you're all chasing me. And then I'm running around, and I realize I've got a snake behind me; they're following each other one by one. And I can't just go right back to the start, because the tail of the snake will get me. So I have to go in a circuitous path, so that all the snakes follow the person in front of them and allow me to get back to the start.
>> Exactly.
Okay, who else had a fun game that they played or came up with?
>> Yeah. Cool.
>> And I think there's another Flappy Bird over here somewhere. How did yours work? And what did you write it in? Which one did you use, Claude or something else?
>> So we used Claude.
>> So maybe there's something about the exact words of the prompt, or maybe it was just random choices, that it worked for one of you and didn't work for the other. But that's a common lesson: you don't know when it's going to work and when it's not going to work. Anybody else have one they want to talk about?
>> Okay. Well, I hope you had fun with that.
>> One common thread, at least on that side of the room, was: what does it take to export this?
>> Yeah.
>> Externally. And some people had found that there were dependencies that were not obvious, and had issues with that.
>> Yeah, and some people were using other services and so on. I thought Claude was the best for just getting going in ten minutes, but other versions are better for other aspects. Okay, so how does it work? How does it do this stuff? That's pretty cool. There's this interesting paper by Andrej Karpathy in which he kind of goes through and looks at the different neurons within his net and shows what they trigger on. And he says most of them, you can't
really tell, right? So here are the different letters and how much they excite this particular cell; it looks completely random. But some of them, you can figure out exactly what they're doing. There's one that turns on within quotes and comment characters: it has figured out the syntax of how comments work in the language and implemented that in a neuron within the net. And here's one that basically counts the indentation level. This is something that, you know, is well known: you can't do it with a finite-state grammar. And technically you can't do it with a neural net of limited depth, right? So if I went to depth a thousand, probably the neural net would fail. But most programs only go to depth 10 or 20, and it works fine.
Okay, so how could neural nets understand programs? This is back in 2021. I thought this was an interesting paper, and it says here are all the things you could look at. You could train on the source code. We could get the parser to output an abstract syntax tree. We could look at the assembly code that's generated. We could trace it and look at the execution flow. We could look at the design docs and all these other things. So all these different representations: in 2021, people were experimenting with what else we'd want to look at. And then it turns out, in 2025 (here's the diagram from that paper of all the things they looked at), the answer is: no, we don't need any of that. Yeah, we could look at the compiled code, but we don't need to; if we have enough of the source code, that always wins. And so all these clever ideas of how to outsmart things and bring in additional knowledge sources, they're all swamped by just saying, pour more code through it and it will get better.
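For concreteness, the abstract-syntax-tree representation that the 2021 work experimented with is the kind of thing Python's own parser will show you directly:

```python
import ast

# Parse one statement and print its abstract syntax tree.
tree = ast.parse("total = price * qty")
print(ast.dump(tree, indent=2))
# The dump shows an Assign node whose value is a BinOp
# (with a Mult operator) over the Names 'price' and 'qty'.
```

That structured view is exactly the extra knowledge source those systems tried to feed in; the 2025 result is that plain source text, at scale, subsumes it.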
Okay, so what is it good at? Most of you had this experience: it could kind of build a game. Maybe it worked well, maybe it was perfect, maybe there were some flaws. I think it's really good at well-known algorithms. Here's an example: write a Python program to solve the set cover problem, a standard computer science algorithm problem, and it does a decent job. But I also could have done a search on GitHub or something and found similar code. So why would I do this rather than just search and find it on GitHub? I think the reason is that maybe it's not the exact version that I want. So I can say, well, make the subsets have weights, rather than having every element count the same; maybe I want them to be different. That might have been harder to express if I were just doing a search for an algorithm like this. Or I can say, make it more efficient. And that might or might not work; in this case, it didn't. What's going on here is it said, okay, to make it more efficient, it would probably be good to sort the subsets so that I get the best one first. But it does the sort inside the loop.
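As a reference point, here's what a minimal greedy weighted set cover looks like in Python. This is my own sketch of the standard greedy approximation, not what the model generated, and the names are mine:

```python
def greedy_set_cover(universe, subsets):
    """subsets maps a name to (weight, set_of_elements).
    Each round, greedily pick the subset that covers the most
    still-uncovered elements per unit of weight."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        name, (weight, elems) = max(
            subsets.items(),
            key=lambda kv: len(kv[1][1] & uncovered) / kv[1][0])
        if not elems & uncovered:
            raise ValueError("remaining elements cannot be covered")
        chosen.append(name)
        uncovered -= elems
    return chosen

cover = greedy_set_cover(
    {1, 2, 3, 4, 5},
    {'A': (1, {1, 2, 3}), 'B': (1, {4, 5}), 'C': (3, {1, 2, 3, 4, 5})})
print(cover)  # ['A', 'B']
```

Using a dict keyed by name also gives you the "name associated with each subset" variation for free, which is the last customization asked for in this example.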
That actually makes it slower rather than more efficient. What it should have done is the sort outside the loop, maintaining a priority queue, and that would have made it faster. Then I could say, well, maybe I want a name associated with each subset, rather than just a list of elements, and it did that, right? So: it's easy to retrieve an algorithm from all the code that's out there, but if you want to customize it to do exactly what you want, this seems like a better interface. So, what about larger apps? I played
around a lot with these smaller types of things; then my colleague Peter Danenberg and I, over the last couple of weeks, actually built a larger thing. We wanted to build an interactive learning system that's kind of like NotebookLM but focused on making you learn better. So we built it, dumped in some pedagogical book chapters and Python notebooks, and told it to extract a knowledge graph of all the concepts and their prerequisites. It does a pretty good job of that; we were surprised at how well it did. Sometimes it says there's a prerequisite link from A to B when actually it was just that I talked about A before I talked about B, and I could have talked about them in the other order, but mostly it gets it right. It builds that kind of graph, builds a learning-objectives and key-insights summary, and then allows you to have a dialogue, run code, see if it works, and so on. And basically we just threw this together. Here's more of the interactions. It mostly worked; sometimes we'd get things a little bit wrong and have to fix them. But, you know, I don't know Node and npm, and it was able to generate code that worked. So this changed the capabilities for what I could do, and I think a lot of people are seeing that.
Okay, here's another experiment I did. Over the last couple of years there's been a lot of talk about whether LLMs have a theory of mind. By that I mean: do they understand what I'm thinking, can they use that in their thinking, and vice versa? There are a lot of logic puzzles that work like that. So I told it: write a Python program to solve the Cheryl's birthday problem. I don't know if you remember it, but a few years ago this problem went around: Cheryl tells one friend the month of her birthday and the other friend the day, and says it's one of these ten possibilities. The first friend says, "Well, I don't know what it is, but I know the other friend doesn't know either." Then the second says, "Now I know." And the first says, "Because of that, now I know too." So they have to model each other's states of knowledge to come to the conclusion. I tried nine LLMs in 2024 to see if they could do this, and they were all very confident about writing code, and they all got the wrong answer, because they all conflated what I know with what somebody else knows. But then I ran it again in 2025, and now half of them get it right. So that's progress; these things are getting better at a very fast rate.
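As a sketch of the knowledge-state modeling the puzzle requires, here is a minimal Python version of my own (using the standard ten candidate dates; this is not the code any of the LLMs produced). The failure mode described above corresponds exactly to filtering against the wrong candidate set at each step:

```python
# The ten candidate dates Cheryl announces, as (month, day) pairs.
DATES = [("May", 15), ("May", 16), ("May", 19),
         ("June", 17), ("June", 18),
         ("July", 14), ("July", 16),
         ("August", 14), ("August", 15), ("August", 17)]

MONTH, DAY = 0, 1

def consistent(part, value, dates):
    """Dates still possible for a friend who was told `value` of `part`."""
    return [d for d in dates if d[part] == value]

def knows(part, value, dates):
    """A friend knows the birthday iff exactly one candidate matches what they were told."""
    return len(consistent(part, value, dates)) == 1

# 1. The friend told the month doesn't know, and knows the friend told the
#    day doesn't know either: no date sharing this month has a unique day.
step1 = [d for d in DATES
         if not knows(MONTH, d[MONTH], DATES)
         and all(not knows(DAY, d2[DAY], DATES)
                 for d2 in consistent(MONTH, d[MONTH], DATES))]

# 2. Given that statement, the friend told the day now knows.
#    Note the filter is against step1, not DATES -- conflating these two
#    knowledge states is exactly the mistake the 2024 models made.
step2 = [d for d in step1 if knows(DAY, d[DAY], step1)]

# 3. Given that, the friend told the month now knows too.
step3 = [d for d in step2 if knows(MONTH, d[MONTH], step2)]

print(step3)  # [('July', 16)]
```

Each `step` filters the previous step's survivors, so the program literally tracks what each friend could know after each statement.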
Here's another example, a similar kind of thing. My friend Wei-Hwa submitted this, again in 2024, as a math question: list all the ways in which three distinct positive integers have a product of 108. It turns out these are the ones. He asked a bunch of LLMs, and I extended that to the list of nine I had, and only two of them got it right. One of the mistakes they made: they all said, well, it's a good idea to figure out the prime factorization of 108, which is 2 × 2 × 3 × 3 × 3, and now what we have to do is take these numbers and put them into three subsets. But some of them forgot that you could have the empty subset, or equivalently that the number one is a factor of 108, and so they got it wrong for that reason or for other reasons. So two out of nine succeeded on the math question. But then I turned it into a programming question: write a program to list all the ways in which three distinct positive integers have a product of 108. Now seven of nine got it in 2024, and eight of nine in 2025.
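The programming version is nearly trivial, which is the point. A brute-force sketch of my own (not the models' output) — note how naturally the divisor loop starts at 1, so the "forget about one" mistake can't happen:

```python
from itertools import combinations

def distinct_triples_with_product(n):
    """All sets of three distinct positive integers whose product is n."""
    # Starting the range at 1 automatically includes 1 as a divisor.
    divisors = [d for d in range(1, n + 1) if n % d == 0]
    return [(a, b, c) for a, b, c in combinations(divisors, 3)
            if a * b * c == n]

triples = distinct_triples_with_product(108)
print(len(triples))  # 8
print(triples[0])    # (1, 2, 54)
```

There are eight triples in all, five of which contain 1 — which is why forgetting that 1 is a factor loses most of the answers.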
And I think what's going on here is this: one is kind of strange when it comes to prime numbers and multiplication, right? We're not quite sure: is one a prime or not? But in programming you always say "for i = 1 to n"; you never say "for i = 2 to n." So in the math question it was easy to forget one; in the programming question it was harder to forget. And I think in general there are different representations, different ways of talking, that will be better for some problems than others. The language of math is amazing and lets you do a bunch of things; so is the language of programming, which lets you do overlapping but different types of things. And the way LLMs work, they do thinking on their own, but they have to have some way to represent that thinking in some kind of format. That's why they do better when you say: use this think-aloud protocol, show your intermediate work. And there are some problems for which a programming language is a really great intermediate format.
And here's an example: OpenAI Codex did that. Again, they were solving math problems, but they said, we're going to do an intermediate step where we generate a program and solve the math problem through that. I think this is important for two reasons. One, it focuses the reasoning. It's important for the model to have scratch paper it can write on, and often writing that as a program is a better way to do it than writing English statements or math statements. And secondly, you can do voting with programs: you can have it say, I'm going to generate 10 possible programs and run them all. Oh look, eight of them gave the same answer; maybe that's the right answer. Whereas if I said generate 10 paragraphs or 10 math statements, I can't execute them, so I can't tell whether they agree or disagree. Those are the advantages of using programs as an intermediate representation.
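That voting trick can be sketched in a few lines. This is a toy of my own, not Codex's machinery; the three "generated programs" here are hand-written stand-ins, one of which has a deliberate off-by-one bug:

```python
from collections import Counter

def majority_vote(candidates, *args):
    """Run each candidate program on the same input and take the most
    common answer. Candidates that crash are simply ignored."""
    answers = []
    for fn in candidates:
        try:
            answers.append(fn(*args))
        except Exception:
            pass
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count

# Three stand-ins for "generated programs" computing the sum of the
# first n squares:
progs = [
    lambda n: sum(i * i for i in range(1, n + 1)),  # correct
    lambda n: n * (n + 1) * (2 * n + 1) // 6,       # correct, closed form
    lambda n: sum(i * i for i in range(n)),         # off-by-one bug
]
print(majority_vote(progs, 10))  # (385, 2): two of the three agree
```

The key property is the one the lecture names: because programs are executable, agreement is checkable, which you can't do with ten generated paragraphs.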
Now, I said the way you succeed is by piling in more and more code and training on it. Inevitably, we run into intellectual property issues. So: a lawsuit was filed over GitHub Copilot, saying you weren't following all the licensing agreements. We have all this code, and some of it has a Creative Commons or some other kind of licensing agreement associated with it. Copilot copied all that code and maybe violated some of those licensing agreements, so there are issues over that. I want you to take a couple of minutes and talk to your neighbor: what do you think is okay in terms of intellectual property with code? We have all this open-source code posted up there. Is it okay for an AI program to read this public source code? Is it okay to learn from it? Is it okay to store a copy? Is it okay for it to generate code that's similar? A lot of you generated code to play Wordle or Flappy Bird that might have been similar to other code. Is it okay to generate an exact copy? What do you think is legal or illegal? And what do you think should be allowed just by norms rather than by law?
>> Go ahead and discuss.
All right,
let's wrap that up. So all these issues in some sense are the same for code as they are for any other training data out on the web. But in another sense they're different, right? My random blog post or my restaurant review may technically be my intellectual property, but it doesn't feel like it has much value, whereas code has been proven to have very strong, sometimes very large, value. So in that sense it feels different. So what did you discuss, and what conclusions did you come to?
>> Yes, so I think that's right. And I think one of the issues that makes this confusing, and one reason GitHub was in the middle of it, is that it's a common repository with a lot of code, and a lot of it is visible. You are allowed to make private repositories in GitHub that nobody can see, but a lot of them are public, and yet they carry a license that restricts what you can do with them. I think that's different from a lot of other sources.
My group of guys here in the middle had an interesting thought on citations.
>> Yeah, basically we were saying, think about what a human can do. A human can do pretty much all of these, except copying exact code or taking ideas from someone else without credit. So the AI should be able to do all the same things, except it should cite. But then the idea came up that AI gets its ideas from a lot of different places, so citation becomes pretty confusing, because every couple of lines of code would need a new citation to a new place.
>> Yeah, I think that's right. A couple of issues with that. One is: where are they going to put the citation? If I publish a paper, it's clear that if I borrowed something, I'm definitely going to put a citation there. But if AI generates code, it doesn't feel like there's a place to put that citation; maybe there could be footnotes somewhere, but nobody's going to look at them, so that seems less likely. And then there have been big issues, even before AI, over what you're allowed to copy and not copy. For example, when Google wanted to make the Android operating system, they wanted to do it in Java, and they went to the owners of Java and tried to work out a licensing agreement, but they couldn't get one that worked. They said, we could pay you this amount if we were charging money for every copy of the operating system, but we want to give it away for free, so we can't pay you much per copy. So Google decided, we're just going to reimplement Java from scratch, in these clean rooms where it's very clear that nobody can look at the Java source code and they have to rewrite it on their own. They did that, and they got sued anyway. One of the pieces of evidence was a six-line method that was the same in the Java implementation and their implementation. Basically it says: if x is less than zero, give this error message; if it's greater than n, give this error message; else do the right thing. And it kind of felt like anybody could write that code. You could write it slightly differently or exactly the same, but it's still the same idea, and any programmer would have come up with something similar. But that was my opinion as a programmer; the legal opinions look at it differently. And I think we still haven't resolved all those types of issues.
Okay. So, programming contests. We'll get to some recent big wins, but this is from AlphaCode, which I guess is ancient history now, probably 2023. They entered these contests and did well, and this was their primary example. They said: here are the 50 examples we worked on; you can look at all of them and at the code for all of them, but this is the one we're going to talk about the most. And this is what the input to the program looks like: an English-language description, and then a formal input and expected output. Basically what they're asking is: we're going to pass you two strings, s and t, and ask whether you could generate the string t by typing the characters in s, where for any character you have the option of typing a backspace instead of that character. So I give you "ababa" and I want you to generate "ba"; you could do that by substituting a backspace for each of the first three characters and then just typing "ba", and that would give you the output. So then you should say yes. And there might have been other ways of putting the backspaces in different places and getting the right answer.
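For concreteness, here is my own compact sketch of a solution, using the standard greedy of matching t against s from the right (this is not AlphaCode's generated code):

```python
def can_type(s: str, t: str) -> bool:
    """Can t be produced by typing s, where any character of s may be
    replaced by a backspace? Match t against s from the right: a
    character of s that can't match must be erased, and erasing it costs
    a backspace that also consumes the character typed before it, so we
    skip two characters of s at once."""
    i, j = len(s) - 1, len(t) - 1
    while i >= 0:
        if j >= 0 and s[i] == t[j]:
            i -= 1
            j -= 1
        else:
            i -= 2  # replace s[i] with a backspace; it erases the previous character too
    return j < 0  # yes iff all of t was matched

print(can_type("ababa", "ba"))  # True
print(can_type("ab", "ba"))     # False -- typing can't reorder characters
```

The whole thing is linear time, and the one nontrivial idea (skipping two characters on a mismatch) is stated in the docstring rather than left implicit.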
So here's the program they came up with. The system generated this code, and they put annotations on the side describing what it does, and yes, it's right; it works 100% of the time. But before anybody checks code into my repository, they have to go through a code review. So I did a code review, and there's almost no line I wouldn't want them to change. There's a bunch of stuff here; a lot of it is style issues that would be easy to change, and some of it is a little deeper than that. And again, the program is 100% right; I just don't like it. Here's one thing: they're dealing with these stacks of characters, A and B, and they initialize a stack C to be the empty stack, and when they pop something off of stack B they store it onto stack C. I can see where they would get that, because it happens a lot when you're dealing with stacks: you pop something off and say, I'd better save this somewhere, I might need it later. But they never use C anywhere else in the entire program. I know programmers who write like that, and I don't want to hire them. And maybe this is inevitable, right? Because they trained this on a lot of code, and half the code was written by people who are below average.
All right, and this is kind of interesting: you can go to that website and play with it, and it shows you at each point what it thinks might come next. So you type "for blank in range(t)", and it says: my first guess is i, which is the most common iterator; then underscore, which is sometimes used in Python when you don't care about the value of the variable; and then there are these other possibilities. You can go through and figure out how it's thinking.
So I showed you the program I didn't like; I'd rather have a program like this. I think this is a lot simpler, and the system should be able to get there. I'd also like to be able to say: generate a bunch of test cases. The program gave me four test cases, and I don't think that's enough; generate a bunch of other ones, make them cover all the possibilities, and do that automatically for me. I think that would be good. And then I'd like to say: I had some optimizations in the program, but I'm not sure those optimizations would always work, so give me a system that's slow but obviously correct. That is: generate all possible outputs from the source, all possible ways of using a backspace or not, and then check whether the target is one of those possibilities. This is an exponentially slow algorithm, but if it gets the same answer as the other algorithm, that's more evidence that I got it right. And so I'd like to be able to get my system to do these things for me: it was too hard for me to write this code, but I could ask it, prove to me that it's correct by doing that in this degenerate case. We'll skip through some of this, but these are the types of questions I'd want to ask. If I were doing this myself, I might ask these questions of myself; if I were doing pair programming, I might ask them of my partner. I want the system to be able to engage in these conversations and do all this. I don't just want to see code that's correct; I want code that gives me more confidence in it.
So how well does this work? In 2022, it looked like a 5 to 10% improvement. In 2023, another study got 40%. In 2025, another study said 57%. It looks like it's going up, but there are other studies, and they're all over the map; it really depends on exactly what you're measuring, who's measuring it, what their setup is, and what they're trying to do. But it does seem like there's a lot of progress here. On the other hand, here's a study that says there was a 41% increase in bugs: people going too fast, maybe getting ahead of themselves and building up technical debt.
Here's another one that says large language models can outperform human programmers at this international programming competition. This is a lot more serious than what AlphaCode did in 2023. Here in 2025, Gemini solved a problem that no human team could solve, and OpenAI had a similar performance. They both did well; OpenAI actually did a little bit better, but Gemini competed in the actual competition, while OpenAI did it on the side and said, "We think we scored a perfect score." So it's your choice which of those you want to believe more.
>> Okay. Something I think is really interesting is: what programming language should we be using? In the early days of programming, we said we're going to program in assembly language, because programmers are cheap but computers are big and expensive, so we want to make sure the code is super efficient. Then we said, now we're going to program mostly in C, which is a compromise between speed for the programmer and speed for the computer. Now we program in Python, which is not very efficient in terms of using machine resources but is better for the programmer. In the future we want something that's good for the hardware, for the human programmer, and for the LLM. So what should that be like? One argument is that it should be Python and JavaScript, because those are the languages we have the most training data on, so those are the languages the models will do best in. The counterargument is that it's easy to translate between languages, so maybe we could use something else. Maybe we should use languages with the most explicit information: in Python, if you don't have type declarations, you know less about the program, whereas in a language that's more strongly typed, you know more about it. Maybe that would be better, or maybe something completely new.
And here's a map of things that people are experimenting with, and it mostly comes down to probabilistic programming and differentiable programming. I think this really speaks to what types of problems we want to solve. If we're trying to run simulations of the real world, of hurricanes or whatever, then our traditional languages aren't that good. They're built on things like if statements: if X is true, then do Y, else do Z. In the real world, we rarely know anything with 100% certainty. We can build that in at the application level: I'm going to have probability distributions, I'm going to have uncertainty, and so on. But that's on top of the language. The idea of probabilistic programming is that we should build it into the language, so an if statement, rather than just saying true or false, deals with a whole probability distribution. And then differentiable programming says: I'm going to write a program with a bunch of parameters in it, and then I can automatically choose the right parameters to make the system run better.
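As a toy illustration of the differentiable-programming idea (entirely my own sketch, using a finite-difference gradient rather than a real autodiff system): treat a program's parameter as something that is tuned automatically to make its output better.

```python
def program(theta):
    """A stand-in 'program' whose quality depends on a parameter theta.
    Here the loss is simply (theta - 3)^2, which is minimized at theta = 3."""
    return (theta - 3.0) ** 2

def tune(loss, theta=0.0, lr=0.1, steps=200, eps=1e-6):
    """Gradient descent with a finite-difference gradient: this is the
    'automatically choose the right parameters' step."""
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

print(round(tune(program), 3))  # 3.0
```

A real differentiable language computes `grad` exactly through the whole program instead of by numerical differencing, but the shape of the workflow is the same: write the program, then let the system pick its parameters.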
Here's something that could have gone in the news of the day, but I'm putting it here. It's a proposal for software that's more modular and can fit together better using things like LLMs. Their point is that you have things that seem kind of simple. I build this app and there's a button. The button should be one thing, but the way we write code today, a button isn't one thing: there's a button, you press it, that calls something, which makes a remote call to another computer, which responds, and then you process the response, and so on. So a button doesn't stand alone; it stands connected to all these other things. And the proposal is: can we write software that separates that out more, to make each of those components more independent?
This book by Hanson and Sussman, Software Design for Flexibility, says one should not have to modify a working program; one should be able to add to it. And we don't do that, right? We modify programs all the time when we want to make them better. They're saying that if we made software modular enough, we wouldn't have to: we'd just say, here's the program you had before, and it should also do this new thing, without deleting what you did before. They also compare our built software to biological systems, saying biological systems use contextual signals that are informative rather than imperative: there's no master commander saying what each part must do. The cells of my body communicate with each other in ways I don't understand, and there's no one central processor saying that at this step the cells have to do the next thing; they all interact with each other. We're building software that's more and more complex, and maybe it's more like a biological system, so maybe we need programming languages that support that.
And Kevin Kelly has a more philosophical take on that: all is flux, nothing is finished. That means processes are more important than products, and so we should optimize to make that process easy, so that we can make changes and have our programs evolve. So let's stop there.