Ep. 1 - The History of Machine Learning with Tom Mitchell
By Stanford Digital Economy Lab
Summary
Topics Covered
- Train, Don't Program
- Induction Remains Unjustified
- Learning Beats Computation
- Paradigms Replace Each Other
- Question Authority
Full Transcript
Welcome to machine learning. How did we get here? I'm Tom Mitchell, your podcast host. Now, many people ask, how did we get to this point where today we have these amazing AI systems? I have a one-sentence answer to that question.
We tried for 50 years to write by hand intelligent programs, but we discovered about a decade ago that it was actually
much easier and much more successful to use machine learning methods to instead train them to become intelligent.
So the real question is how did machine learning get here?
What were the successes along the way and the failures? Who were the people involved? What were they thinking? What even made them want to get into this field in the first place?
This first episode will set the stage for the podcast. It is a recording of a lecture I gave this month, in February 2026, at Carnegie Mellon University, and it attempts to cover in one hour the 75-year history of the field of machine learning.
Most of the rest of the episodes in the podcast involve interviews with various pioneers in the field who made very
significant contributions along the way.
Before we start, I want to thank Carnegie Mellon University and also the Stanford University Digital Economy Lab for supporting the podcast. And I want to thank Maddie Smith, our podcast producer. I hope you enjoy the podcast.
If we're going to talk about machine learning, it's only fair to start with the first people who talked about how on earth learning is possible, which were the philosophers. As early as Aristotle, people were talking about the question of how it is that we can look at examples of things and learn their general essence, in his words. About a century later, there was a school of philosophers called the Pyrrhonists who really zeroed in on the problem of induction and how it can be justified. When we say induction, what we really mean is the process of coming up with a general rule from looking at specific examples. And so
they talked about questions like, well, if all of the swans we've seen so far in our life are white, should we conclude that all swans are white? What would be
the justification for that? Maybe
there's a black swan out there that we haven't seen.
And that debate went on for some time. Around 1300, William of Ockham suggested something that we now call Occam's razor, the policy that we should prefer the simplest hypothesis.
So indeed, if all the swans we've seen so far are white, then the simplest hypothesis is all swans are white. That
was his prescription.
Later on, around 1600, Francis Bacon brought up the importance of data collection, of actively experimenting to collect data that could falsify hypotheses that weren't correct.
And then in the 1700s, the philosopher David Hume really nailed the problem of induction. He argued very persuasively that it's impossible to generalize from examples if you don't have some additional assumption that you're making. And he pointed out that even the assumption that the future will be like the past is itself not a provable assumption. It's just a guess that we use. So his point was that people do induction, but it's a habit. It's not a justified, rational, provably correct process.
So, they had plenty to say. Around the 1940s, when computers became available, Alan Turing, who's often called the father of computing, suggested that maybe computers could learn. He said, instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain. So he had the idea that maybe computers could learn, but he did not have an algorithm by which they would learn. That waited until the 1950s, when there were two important seminal events.
One was a computer program written by an IBM researcher named Arthur Samuel, and his program learned to play checkers. I'll just read you a couple of sentences from the abstract of his paper. He said, "Two machine learning procedures have been investigated in some detail using the game of checkers.
Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by
the person who wrote the program."
And then he went on to point out that the principles of machine learning verified by these experiments are, of course, applicable to many other situations.
So he had really one of maybe the first demonstrations of a program that learned to do something interesting, and he understood that the techniques he was using were very general. Now, how did he get the computer to learn to play checkers? His program learned an evaluation function that would assign a numerical score to any checkers position. And that score would be higher the better the checkers position was from your point of view as you're playing the game. And then you would use that to control a look-ahead search for which move to take. That evaluation function was a linear weighted combination of board features that he made up, things like: how many checkers are on the board that are mine? How many are on the board that are yours? And so forth.
So what his program learned was that evaluation function. How did it learn it? By playing games against itself. And he points out that in 8 to 10 hours it could learn well enough to beat him.
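The kind of evaluation function Samuel's program learned can be sketched in a few lines; the feature names and weight values below are illustrative, not Samuel's actual ones:

```python
# Sketch of a Samuel-style evaluation function: a linear weighted
# combination of hand-crafted board features. Feature names and weight
# values are illustrative, not Samuel's actual ones.

def evaluate(features, weights):
    """Score a checkers position as a weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())

# What the program "learns" is this weight vector, adjusted over self-play.
weights = {"my_pieces": 1.0, "opp_pieces": -1.0, "my_kings": 1.5}

position = {"my_pieces": 8, "opp_pieces": 7, "my_kings": 1}
score = evaluate(position, weights)  # higher = better for the player to move
print(score)
```

In Samuel's program, a score like this was computed for positions reached by look-ahead search, and the weights were tuned so that the scores matched the outcomes of self-play games.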
Those ideas persisted through the decades. They became reused over and over, including in the computer programs that finally beat the world chess champion, the world backgammon champion, and the world Go champion. So those ideas were really seminal. A second
thing that happened in the '50s was the invention of the first early version of neural networks by Frank Rosenblatt from Cornell. He was interested in neuroscience: how can neurons in the brain be used to learn? And he ended up building a simple, at least by today's standards, neural network that consisted of one layer of neurons, where there would be a receptive field of input, say an image, and then the neurons would respond to that and produce an output set of neuron firings.
What got learned in that case were the connection strengths between the input to the neuron and the probability that it would fire.
And the way he trained it was what we now call supervised learning: you show an input and what the output should be, and he had schemes for updating those weights to fit the data.
Now, the importance of this work is that it catalyzed a whole bunch of work in the 1960s, for the next decade, looking at different algorithms for tuning the weights of perceptron-style systems.
That work proceeded for a decade or so, and at the end of the 1960s two MIT scientists, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons. But unfortunately, that book proved that a single-layer perceptron, which is the only thing we knew how to train at that point, could never even represent many of the functions that we wanted to learn. It could only represent linear functions, not even exclusive-or, where the output would be one if one input is a one and the other is a zero, but zero if both inputs are the same. You can't even represent that simple function with a perceptron, no matter how you train it.
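This limitation is easy to check by brute force: search a grid of weights for a single linear threshold unit and confirm that some setting computes AND, but no setting computes exclusive-or. This is just an illustrative sketch of the Minsky-Papert observation, not their proof:

```python
# Brute-force check that a single linear threshold unit (a one-layer
# perceptron) can represent AND but not exclusive-or (XOR).
import itertools

def matches(w1, w2, b, target):
    """True if the unit w1*x1 + w2*x2 + b > 0 agrees with target on all inputs."""
    return all((w1 * x1 + w2 * x2 + b > 0) == target(x1, x2)
               for x1, x2 in itertools.product([0, 1], repeat=2))

grid = [v / 2 for v in range(-8, 9)]  # candidate weights/biases: -4.0 .. 4.0

AND = lambda a, b: bool(a and b)
XOR = lambda a, b: a != b

and_ok = any(matches(w1, w2, b, AND) for w1, w2, b in itertools.product(grid, repeat=3))
xor_ok = any(matches(w1, w2, b, XOR) for w1, w2, b in itertools.product(grid, repeat=3))

print(and_ok, xor_ok)  # AND is linearly separable; XOR is not
```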
So this really put the kibosh on work on perceptrons following the publication of this book.
Now, if we're not going to be able, or don't want, to spend our time figuring out how to learn perceptrons, then what's next? Well, it turned out one of Minsky's PhD students, Patrick Winston, the next year published his thesis. And Winston suggested that instead of learning perceptron-type representations of information, we should learn symbolic descriptions. In his thesis, he showed how his program could learn descriptions of different physical structures, like an arch or a tower. And he would train the program by showing it line drawings of positive and negative examples of, in this case, arches. And then the program would process those incrementally arriving examples to produce a symbolic description that would describe the different parts and the relations among them. For example, an arch could be two rectangles which don't touch each other but which jointly support a roof of any shape.
So this was an important step because it shifted the focus onto learning a much richer kind of representation, symbolic descriptions, and this became the new paradigm which dominated the 1970s.
So during the '70s there were a number of people working on learning symbolic descriptions. My favorite is the Meta-DENDRAL program developed by Bruce Buchanan at Stanford. This program, again, was a symbolic learning program. What it learned was rules that would predict how molecules would shatter inside a mass spectrometer, and therefore predict what the mass spectrum of a new molecule would be. And those rules, again, symbolically described a subgraph of atoms within the molecular graph. And the rules would say: if you find this subgraph, then specific bonds in that subgraph are likely to fragment when you put this in a mass spectrometer.
And this was an important step forward.
I asked Bruce Buchanan how well it worked. What was this program able to do?
>> Well, for one small class of steroid molecules, the keto-androstanes, if you will, we had fewer than a dozen spectra, and we were able to tease out the rules that determined how a new keto-androstane would fragment in a mass spectrometer. And we were able to publish that set of rules in a refereed chemistry journal. And it was, to our knowledge, the first time that the result of a machine learning program, symbolic learning, had been published in a refereed journal.
So that was an important milestone for machine learning, really the first time that a program discovered some knowledge that was useful enough to get published in that domain. Now, it turned out, on a personal note, I was a PhD student at Stanford at the time, and Bruce became my PhD advisor. So my PhD thesis was also built around this same data set. And for my thesis I developed a system called version spaces that was the first symbolic learning algorithm where you could prove that it would converge, and furthermore that the learner would know when it had converged, so it would know it was done. And it did that by maintaining not just one hypothesis that it would modify, but by keeping track of every hypothesis consistent with the data that it had seen. And this also opened up the possibility of what we call today active learning. It made it easy for the system to play 20 questions with the teacher. It could ask the teacher, please label this example, so that it could reduce the set of hypotheses as quickly as possible. So by the end of
the '70s there seemed to be enough work going on in the field that it was time to hold a meeting. And so we organized the first workshop in machine learning, held here at CMU at Wean Hall, a couple of buildings in that direction. It was organized by Jaime Carbonell, who was an assistant professor here at the time, Ryszard Michalski, who was a more senior professor at Illinois, and myself; I was at the time an assistant professor at Rutgers University. And so we held this meeting and pulled together some people. One of the people who attended was a student of Ryszard Michalski named Tom Dietterich, and Tom went on to make many contributions in the field of machine learning. And so I asked Tom what the field was like in 1980, say.
>> It was really chaotic. You know, I attended that very first machine learning workshop that was organized, I think you were one of the core organizers, at CMU, and there were probably 30 people in the room, and probably 30 completely different talks. I remember I had done a sort of algorithm-comparison paper that I published at IJCAI '79, I think, so just before that workshop, in which I was by hand executing these very simple algorithms for this kind of subgraph learning problem and comparing how many subgraph isomorphism calculations they had to do. But it was like the first attempt to actually compare multiple machine learning algorithms that were more or less trying to do the same thing. There
were a couple of them there. And I think John Anderson was there talking about cognitive models. You were there talking about the beginnings of EBL and the LEX system for calculus, symbolic integration. And I remember the most interesting talk, I thought, was Ross Quinlan's talk on ID3, where he was trying to take these reverse-enumerated chess endgames and learn decision trees that would completely, exactly, losslessly compress those giant tables into a small decision tree. A
really important thing people should understand about those days is that we believed there was a right answer for our machine learning problems. And it would often happen that I would run, say, Michalski's algorithms and it would not get the right answer. It would not get the logical expression that we thought was the right answer. You would get something that was really actually equally accurate on the training data, and actually worked pretty well, although we didn't really have the idea of a separate test set in those days. I mean, it was not a field of statistics. The idea was, we were coming out of really the John McCarthy program of programs with common sense, which didn't have a lot to do with common sense, but it was about: we're going to represent everything in logic, and we're going to use logical inference as the execution engine.
>> So there's Tom's take on what things were like. He mentioned that he thought the most interesting talk was Ross Quinlan's talk. I agree; I thought that was the most interesting talk. Ross's talk presented the idea that we should learn decision trees. A decision tree is something where you classify your example by putting it at the root of the tree, and then you sort it down to a leaf in the tree based on its features, and the leaf tells you what the output classification label should be. That's what gets learned. So I asked Ross how he came up with this idea.
>> I had done a PhD under a psychologist, Earl Hunt, and part of his work involved decision trees, which I learned about, of course, as a student, but then put in the back of my mind for 15 years or so. And then I was at Stanford on sabbatical at the same time as Donald Michie was teaching a course on learning, and he had a challenge for the class, which, you know, I sat in on the class, and the challenge was to work out a way of predicting a win in a very simple chess endgame: king-rook versus king-knight. So I remembered Earl Hunt's work on decision trees, and I thought, well, maybe that would be the way to go. So I developed a thing called ID3, which was just a simple decision tree program, no pruning, just a straight decision tree, and that seemed to solve the problem pretty well, up to about 95%. And then I got that up to 100 the next year. And I remember the first real time I talked about this was at that conference you organized, the workshop in 1980 at Pittsburgh at Carnegie Mellon. You, Ryszard, and Jaime all set up that workshop, and then I gave a talk on decision tree learning.
>> So there's Ross's story. He got the idea of decision trees from his thesis advisor many years earlier. But it turns out Ross was the one who came up with the algorithm that actually successfully discovered useful decision trees. And that whole idea of decision tree learning became very important in the field. By 2010 it was probably one of the most commercially used approaches in machine learning.
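Classification with a learned decision tree, as described above, is just a walk from the root to a leaf. A minimal sketch, using a made-up weather tree rather than Quinlan's chess endgame trees:

```python
# Sketch of decision-tree classification: internal nodes test a feature,
# leaves hold the predicted label. The tree below is made up for
# illustration; it is not a tree learned by ID3.

def classify(node, example):
    """Sort an example down the tree until reaching a leaf (a plain label)."""
    while isinstance(node, dict):            # internal node: test a feature
        value = example[node["feature"]]
        node = node["children"][value]       # follow the branch for that value
    return node                              # leaf: the output label

tree = {
    "feature": "outlook",
    "children": {
        "overcast": "play",
        "rain": "don't play",
        "sunny": {"feature": "windy",
                  "children": {True: "don't play", False: "play"}},
    },
}

print(classify(tree, {"outlook": "sunny", "windy": False}))  # play
```

Learning, in ID3, is the separate step of choosing which feature to test at each node so that the resulting tree fits the training examples.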
So in the early '80s there were various experiments like these trying to build machine learning systems, but really no theory, no theory that could tell us, for example, how many examples we would have to present to a learner in order for it to reliably learn. And that changed in 1984, when Les Valiant published a paper on what he calls probably approximately correct learning. And the idea is, it really was the first practical theory to tell us how many examples you would need. In particular, the number of examples you need depends on three things. The complexity of your hypothesis space: for example, if you're going to learn decision trees of depth two, that's a lot less complex than if you're learning decision trees of depth 12. It depends on the error rate you're willing to tolerate in the final hypothesis: 1% error, 5% error. And it also depends on the probability you're willing to put up with that, if you do choose that many randomly provided training examples, you'll still fail. You can't guarantee that you won't fail, but you can reduce that probability.
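For a finite hypothesis space, these three quantities appear directly in the standard PAC sample-complexity bound, m >= (1/epsilon)(ln|H| + ln(1/delta)). A sketch, with illustrative hypothesis-space sizes standing in for the depth-two versus depth-12 trees:

```python
# The classic PAC bound for a finite hypothesis space H: drawing
#     m >= (1/epsilon) * (ln|H| + ln(1/delta))
# random examples guarantees that, with probability at least 1 - delta,
# every hypothesis consistent with them has true error at most epsilon.
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for error <= epsilon with confidence 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# A more complex hypothesis space (e.g. deeper decision trees) needs more data:
simple = pac_sample_bound(10**3, epsilon=0.05, delta=0.05)
complex_ = pac_sample_bound(10**9, epsilon=0.05, delta=0.05)
print(simple, complex_)
```

Loosening the tolerated error or the failure probability shrinks the bound, exactly as described above.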
So this was a breakthrough in the area of theoretical characterization of algorithms. So I asked Les what he thought was the key idea there.
>> It's a kind of a model of computation, but it makes sense because it's got some application. So the particular result which persuaded people that there was something there is this result that if you take a conjunctive normal form formula, which, you know, from NP-completeness at the time you already knew there's some hardness in it, because if someone gave you the formula, it was completely difficult to find out whether it's null, whether it's equivalent to a formula which is always zero, which is never satisfiable. On the other hand, this was this conjunctive normal form formula with three variables in each clause. So this was PAC learnable, and so it was a bit striking that something which is very hard is learnable. But then this highlighted the difference between computing and learning, because with the learning model, the idea was that there was a distribution of inputs, and you learned from this distribution, but you only have to be good on this distribution when you have to predict. So if, for example, in this formula there were some very rare cases, then the learner wouldn't have to know about them. So in this sense this was easier than the NP-completeness.
>> So I was actually quite surprised at that answer. What he's saying, put another way, is that what was really interesting there is that for this one kind of hypothesis, conjunctive normal form, which is a kind of logical expression: if your hypotheses are of that form, then it's easier to learn them than it is to compute them. When he says compute them, what he means is the cost of answering the question, can you find a positive example of this? And it was known at the time that the computational cost of answering that question, is there a positive example of this formula, was exponential in the size of the formula. And then he discovered that learning a formula, if somebody gives you positive and negative examples, only takes polynomial, less than exponential, time. So I agree with him that that's a fascinating theoretical fact. But that would not be the answer I would give about why this revolutionized the field of machine learning. It revolutionized the field, in my view, because he was the first person really to be able to come up with a framing, a new framing of the machine learning problem, that even allowed this kind of theoretical analysis. In particular, his framing included assumptions like: the training data would come from some source that would give you random examples according to some probability distribution, and then later, when you wanted to test your hypothesis on new data, you would get more random examples from that same source. And so he reframed the problem in a way that made theory possible. The consequence of that was it catalyzed a huge amount of theoretical work in machine learning, and it continues to this day and keeps branching further and further. There are conferences specifically designed to cover this area of theoretical computer science.
So the '80s was really a very generative decade. There were a lot of things going on. Another thing going on was that some people were looking at human learning and how that might inspire our models of AI and machine learning. One such effort was here at CMU, by Allen Newell and his two PhD students, John Laird and Paul Rosenbloom. They built a system they called Soar, which was really one of the first AI agents designed to capture the full breadth of what humans do: play games, solve problems, many different tasks. So they framed their machine learning problem as one of getting a general agent to learn, and their architecture had very interesting properties that I think are relevant today, now that agents are again a topic of hot activity. I won't go into the details, but in the podcast there's an interview with John Laird who goes into detail on this.
Another item that can't be overlooked in the '80s was the rebirth of neural networks. Remember, at the end of the '60s Minsky and Papert published that book that killed off work on perceptrons. Well, in the mid-'80s people finally came up with an algorithm that could train not just one-layer perceptrons but multi-layer perceptrons. And that allowed learning functions that were highly nonlinear.
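To see what the extra layers buy: the exclusive-or function that a single-layer perceptron provably cannot represent is computed easily by a two-layer network of the same threshold units. The weights here are wired by hand for illustration, not learned by backpropagation:

```python
# A hand-wired two-layer network of linear threshold units computing XOR,
# illustrating the representational power that multi-layer networks add.
# Weights are chosen by hand, not learned.

def unit(weights, bias, inputs):
    """A linear threshold unit: outputs 1 iff the weighted sum plus bias is positive."""
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

def xor_net(x1, x2):
    h1 = unit([1, 1], -0.5, [x1, x2])      # hidden unit: OR(x1, x2)
    h2 = unit([1, 1], -1.5, [x1, x2])      # hidden unit: AND(x1, x2)
    return unit([1, -1], -0.5, [h1, h2])   # output: OR but not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```

What backpropagation contributed was a way to find weights like these automatically, by propagating errors back through the hidden layer.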
And Dave Rumelhart, Jay McClelland, and Jeff Hinton were three of the ringleaders of this effort. So I asked Jeff about that period.
>> Now we're up to the mid-'80s, when really neural nets are reborn. Is that the right word?
>> Yes, with backpropagation. I mean, we didn't invent it. It was invented by several different groups, but we showed that it really worked to learn representations. And as you know, sort of one of the big problems in AI is how do you learn new representations?
How do you avoid having to put them all in by hand? And my particular example was the family trees example, where you take all the information in some family trees, you convert it into triples of symbols, like John has-father Mary. And then you train a neural net to predict the last term in a triple given the first two terms. So it's just like the big language models: you're predicting the next word given the context. It's just much simpler. I had 112 total examples, of which 104 were training examples and eight were test examples, which is a bit less than the trillion examples they have nowadays.
But it was the same idea: you convert a symbol into a feature vector. You then have the feature vectors of the context interact via a hidden layer. They then predict the features of the next symbol, and from those features you guess what the next symbol should be, and you try to maximize the probability of predicting the next symbol, and you then backpropagate through the feature interactions and through the process that converts the symbol into features. And that way you learn feature vectors to represent the symbols, and how these vectors should interact to predict the features of the next symbol. And that's what these big language models do.
>> So there's Jeff on the mid-1980s work on backpropagation. Another
personal note: in 1986, while this was going on, I came to spend a year at CMU as a visiting professor, and I got to meet Allen Newell at the time. And Allen said, "Hey, do you want to team-teach a course? We'll teach a course on architectures for intelligent agents." And of course I said, "Yes," the opportunity to teach with Allen. And he said, "By the way, there will be another assistant professor working with us. The three of us will team-teach it. That's Jeff Hinton." So, Allen, Jeff, and I team-taught this course in the spring of 1986. It was one of the best experiences of my career up to that point. And so it was a large part of the reason why I ended up staying at CMU.
But when I came, I was here for about a year, and then Jeff moved on. He moved up to the University of Toronto and started building up a group there. One of the people who joined his group was a person named Yann LeCun, who went on to win the Turing Award jointly with Jeff and Yoshua Bengio for their work in neural networks. So I asked Yann about this period.
>> And then, mid-1987, I moved to Toronto to do a postdoc with Jeff, and I completed the simulator. Jeff thought I was not doing anything, because I was just basically hacking all the time. And this system was kind of interesting, because we had to build a front-end language to interact with it. And that language was a Lisp interpreter that Leon and I wrote. And so we were using Lisp as a front end to kind of a neural net simulator. And I implemented the weight-sharing abilities and all that stuff, and started experimenting with what became convolutional nets when I was a postdoc in Toronto, early 1988 roughly, and started to get really good results: very simple shape recognition, like handwritten characters that I had drawn with my mouse or something like that, right?
>> So, as you just heard, Yann was experimenting with: can we apply neural networks to the problem of character recognition, handwritten characters? People
were experimenting with many different uses of neural nets at the time. My favorite, the one I would vote application of the decade, was done, surprisingly, in the area of self-driving cars. There was a PhD student here at CMU named Dean Pomerleau. He trained a neural network where the input was an image taken by a camera looking out the front windshield of a vehicle, and the output of the neural network was a steering command telling the car which direction to steer. So I asked Dean about that work.
>> How much training data did you have?
>> So the interesting thing was, to begin with it was all batch training. I'd have a person drive the vehicle along the Flagstaff Hill path in Schenley Park, and then I would go off and crunch it overnight. But in the end, what we were able to do is real-time learning. So, one drive up the hill, with a human behind the wheel steering and the neural network learning to pair camera images with the steering command that the human was giving, was able to train it in about five minutes to take over and steer on its own from there on, on that road and on similar roads. So it was one of the first real-time, real-world vision applications of artificial neural networks, going beyond just Flagstaff Hill, you know, the little paths on there. And we went out on real roads, first through the Schenley golf course, on the road there. And then we went on, you know, the local highways. In fact, the longest trip we did as part of my PhD was, I think, about 100 miles at the time, basically up I-79 from Pittsburgh all the way up to Erie. And it drove basically the whole way. And it was getting up to 55 miles per hour after we got a faster vehicle.
>> It turns out he didn't ask for permission.
So this was all happening in the 1980s. Really, it was a decade of amazing invention and innovation and exploration. Another important thing that happened in that decade was the development of reinforcement learning. The way to understand that is to first realize that supervised learning was the kind of standard way of framing the machine learning question. When Dean talked about training a system, he would input an image. He had people drive the car. So he got a lot of training examples of the form: here's the image, and here's the correct steering command. So he could tell the neural network, for this input, here's the correct output. That's called
supervised learning. But reinforcement
learning reframes the problem. It takes
into account that sometimes we don't know what the right output is. For
example, if you're learning to play chess, you might not have a person who tells you at every step, given this board position, here's the right move.
Instead, you might have to wait until the end of the game after you've made many moves to get the feedback signal that says you lost or you won. And then
you have to figure out what to do about that, because you actually took many moves. So that's what reinforcement learning is about. And Rich Sutton and Andy Barto were instrumental in framing that problem and working on it. They recently won the Turing Award for this work. So I asked Rich how reinforcement learning fit into the field.
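The contrast just described, a win/lose signal only at the end rather than a correct output per step, is the heart of reinforcement learning, and can be sketched with tabular Q-learning on a toy problem. The five-state chain environment and all hyperparameters below are illustrative assumptions, not anything from the lecture.

```python
import random

# Toy chain environment: states 0..4; reaching state 4 ends the episode
# with reward 1, every other step gives reward 0. So, as in chess, the
# learner only finds out at the very end how the episode went.
N_STATES = 5
ACTIONS = [-1, +1]  # move left / move right

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    done = (nxt == N_STATES - 1)
    reward = 1.0 if done else 0.0
    return nxt, reward, done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the current estimates,
            # occasionally explore a random action.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            nxt, r, done = step(s, a)
            # Temporal-difference update: the end-of-episode reward is
            # propagated backward through the states visited on the way.
            best_next = 0.0 if done else max(q[(nxt, act)] for act in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = nxt
    return q

q = q_learning()
# The learned policy should be "move right" in every non-terminal state.
policy = {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

The reward arrives only at the final state, yet the update rule spreads it backward so every earlier state learns which move was good, exactly the "figure out what to do about that" problem described above.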
>> The field of machine learning has always been dominated by the more straightforward supervised approach. And, as I mentioned at the very beginning, the rewards and penalties were very much part of it. But then, as things became clearer and better defined and it became a more clearly framed learning problem, the focus became pattern recognition and supervised learning. And this strange fellow, Harry Klopf, recognized this more than other people, and wrote some reports and ultimately a book saying that something had been lost. And Andy Barto and I picked up on his work and eventually realized that he was right, that something had been left out. And in some sense it was obvious that something had been left out from the point of view of psychology, where I'd been studying how animals learn, and animals learn really in both ways, both a supervised way and a reinforcement way. And so we picked up on that and made it into a well-defined area.
>> When was that?
>> That would have been in the 80s.
>> And then finally you wrote a book on it in '98. So then it became a clear subfield of machine learning.
Yeah. But the key thing is, the way I say it to myself is: why is reinforcement learning potentially powerful? And it's powerful because it's really learning from experience, learning from the normal data that an animal or person would get, and it doesn't require specially prepared data like you of course need in supervised learning.
So during the 80s there were a lot of other really interesting things going on: people experimenting with the idea that maybe machines should learn by simulating evolution, an entire set of conferences on genetic algorithms and genetic programming, which had to do with that sort of thing, a cluster of work on studying human learning, and other areas. But we don't have time for all of those. Let's move on to the 1990s, when again there was, I would say, a sea change in terms of the style of work that went on. The theme of the 1990s was really the integration of statistical and probabilistic methods into the field of machine learning. And a lot of that took the concrete form of learning a new kind of object, which people called either graphical models or Bayes nets.
But what got learned in that case was again a network, where each node would represent a variable. For example, maybe you would be interested in predicting whether somebody has lung cancer. You'd make that a variable. And maybe you'd have evidence, like: are they a smoker? Do they have a normal or abnormal X-ray result? You'd make those variables. And then the edges in the graph represent probabilistic dependencies among the variables, in such a way that in the end the whole graph represents the full joint probability distribution over the entire collection of variables.
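The lung-cancer example can be sketched as a tiny Bayes net, Smoker -> Cancer -> AbnormalXray, where the graph structure lets the full joint distribution factor into one small term per node. The probability numbers below are made up purely for illustration.

```python
from itertools import product

# Three boolean variables: Smoker -> Cancer -> AbnormalXray.
# These conditional probability tables are invented for illustration,
# not real medical numbers.
p_smoker = {True: 0.3, False: 0.7}
p_cancer_given_smoker = {True: 0.05, False: 0.01}    # P(Cancer=T | Smoker)
p_abnormal_given_cancer = {True: 0.9, False: 0.2}    # P(Xray abnormal | Cancer)

def joint(s, c, x):
    # The graph says the full joint factors into one term per node,
    # each conditioned only on that node's parents.
    pc = p_cancer_given_smoker[s] if c else 1 - p_cancer_given_smoker[s]
    px = p_abnormal_given_cancer[c] if x else 1 - p_abnormal_given_cancer[c]
    return p_smoker[s] * pc * px

# Sanity check: the factored terms define a proper distribution.
total = sum(joint(s, c, x) for s, c, x in product([True, False], repeat=3))

# Inference by enumeration: P(Cancer | Smoker=True, abnormal X-ray).
num = joint(True, True, True)
den = num + joint(True, False, True)
posterior = num / den
print(round(total, 6), round(posterior, 3))
```

The payoff of the factored form is that eight joint probabilities are specified by just five numbers, and queries about any variable can be answered by summing out the others.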
So that's what got learned; how it got learned had to wait for some algorithms to be discovered. Although Judea Pearl came up with the idea of how to represent these, one of the key people involved in inventing those algorithms was Daphne Koller, a professor at Stanford, who was one of the most active researchers in designing algorithms for learning these.
So I asked her: why do we need graphical models?
>> Graphical models for me emerged by realizing that the problems we needed to solve to address most real-world applications went beyond: you have a vector representation of an input and a single, oftentimes binary or at best continuous, output. There was so much more opportunity to think about richly structured environments, richly structured problems. So even if you think about problems like understanding what is in an image, that's not a single-label problem of "there is a dog," because images are complex and there are interrelationships between the different objects. You wanted to get beyond the yes/no "is there a dog in this image?" to something that is much richer: there's a dog and a frisbee and a beach and three kids building a sand castle. You have a rich input and a rich output. Thinking about these richly structured domains gave rise to: we have to think about multiple variables, we have to think about the interactions between those variables, and we have to leverage that structure in both our input and our output space, in order to get to much better conclusions and deal with problems that really matter.
So this work on training graphical models was really part of a bigger theme that decade, which was the integration of statistical methods with what had been pretty much statistics-free machine learning up to that point. Another person who was instrumental in that was a Berkeley professor named Mike Jordan. I asked him about the relationship between statistics and machine learning.
>> So anyway, by the time I wanted to move to Berkeley, I was realizing that I was missing the whole statistics community,
that it was just separate from machine learning. As maybe you remember, there was occasionally a little leakage, but it was way too separate. And nowadays, we're often seeing people run a machine learning method, but then it's not calibrated, it has bias, and all that. And that's the thing statisticians have talked about for a long, long time. And so nowadays I think it's a given that, yeah, there are
kind of two sides of the same coin. Machine learning is maybe a little more engineering-oriented: let's build a system and make it do great things in the world. And statistics is a little bit more: well, let's be cautious, we're going to do clinical trials, let's make sure that the answer is really trustworthy. But those are two sides of the same coin. And I think that's probably pretty much clear now.
But for a long time there was a resistance. Everyone said: this is a brand new field, this is different. And I kept annoying colleagues by saying, no, I don't believe it is. So anyway, long story short...
>> It is remarkable to me that the field of machine learning went through most of the 1980s without even noticing that statistics existed.
>> I mean, people like Leo Breiman were around to help make the passage. So ensemble methods: they were kind of invented by Leo in the stat literature, but they were independently invented in the machine learning literature. And is that machine learning or statistics? Well, clearly it's both, and it needs both perspectives. And yes, in the 1990s, the EM algorithm, you know, the graphical models... so yeah, in the '90s there was a real flourishing of that.
>> So Mike mentioned that one of the themes was ensembles. I think that's actually a very nice example of how machine learning theory and statistical theory intertwined. The idea of ensemble learning is: instead of learning one hypothesis, let's learn multiple ones. For example, instead of learning a decision tree, you might learn a whole forest of decision trees. And then when it comes to classifying a new example, you give it to all of the classifiers, you let them vote, and you take the vote of the classifiers. Well, that turned out to be very successful, and commercially very important, but it is also a beautiful example where there's a pretty interesting theory around it.
Initially, Yoav Freund and Rob Schapire, in the early 90s, started working on theory and methods for doing this kind of ensemble. Leo Breiman, who was a statistician, recognized that this echoed some of the themes of resampling in statistics, and those two things came together in a very successful way.
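A quick simulation shows why the voting idea works: several individually mediocre but roughly independent classifiers, combined by majority vote, beat any single one. The classifiers below are simulated coin flips at an assumed 70% accuracy; real ensembles such as random forests get their partial independence by training trees on resampled data, which is the connection to Breiman's resampling themes.

```python
import random

# Each simulated classifier is independently correct with probability 0.7
# (an assumed number, not from the lecture). A majority vote over many
# such classifiers is correct far more often than any single one.
def vote_accuracy(n_classifiers, accuracy, trials, seed=0):
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        votes = sum(rng.random() < accuracy for _ in range(n_classifiers))
        if votes > n_classifiers / 2:  # majority votes for the true label
            correct += 1
    return correct / trials

single = vote_accuracy(1, 0.7, 20000)      # one classifier: ~70% correct
ensemble = vote_accuracy(25, 0.7, 20000)   # 25 voters: well above 90%
print(single, ensemble)
```

The catch, of course, is the independence assumption: classifiers that all make the same mistakes gain nothing from voting, which is why ensemble methods work to decorrelate their members.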
So in the 90s and the first decade of the 2000s there were many other things going on: the development of things called support vector machines and kernel methods, which were mathematical techniques for learning very nonlinear classifiers that were actually commercially important and in many cases opened the door to machine learning for non-numerical data, data like images or text. There was work on manifold learning. There was also growing commercialization during that decade; more and more companies were starting to use machine learning commercially.
But for me, the theme of that first decade of the 2000s was really a growing awareness by many people that maybe we have good enough machine learning algorithms, and the bottleneck to more accuracy is not the algorithm. Maybe we need more data and more computation.
And this idea was crystallized in a beautiful paper written in 2009 by three authors at Google, called "The Unreasonable Effectiveness of Data," which highlighted cases where, if you want better results, you keep the same algorithm and get more data. That was kind of a theme of what was going on at the time. But things really broke open in the year 2012.
By 2012, the computer vision community had been using a data set created by Fei-Fei Li called ImageNet to test out different vision algorithms, to see who could do the best job of labeling which object was the primary object in an image. And that data set was very large. In 2012, Jeff Hinton and some of his students entered the competition, and they blew away the competition. What's interesting is they were the only neural network approach in the competition. By that time, by the way, neural networks were very scarce in the field of machine learning. They had been displaced by more recent probabilistic methods, and only a smallish number of researchers were even still working on neural networks. But nevertheless this happened. So I asked Jeff about that.
>> When Fei-Fei came up with the ImageNet data set, Yann realized they could win that competition, and he tried
to get graduate students and postdocs in his lab to do it, and they all declined.
And Ilya realized that backprop would just kill ImageNet, and he wanted Alex to work on it, and
Alex didn't really want to work on it.
Alex had already been working on recognizing small images in CIFAR-10. And Ilya pre-processed everything for Alex to make it easy. And I bought Alex two Nvidia GPUs to have in his bedroom at home. And
Alex then got on with it, and he was an absolute wizard programmer. He wrote amazing code on multiple GPUs to do convolution really efficiently, much better code than anybody else had ever written, I believe. And so it's a combination of Ilya realizing we really had to do this, and I was involved in the design of the
net and so on, but Alex's programming skills, and then I added a few ideas, like use rectified linear units instead of sigmoid units, and use big patches of the images so you can translate things around a bit to get some translation invariance, as well as using convolution, and use dropout. So that was one of the first applications of dropout, and it helped by about 1%. But it helped, and then we beat the best vision systems. The best vision systems were
sort of plateauing at 25% errors. That's errors for getting the right answer in your top five bets. And we got like 15%, 15 or 16 depending on how you count it. So we got almost half the error rate.
And what happened then was what ought to happen in science but seldom does.
Our most vigorous opponents, like Jitendra Malik and Andrew Zisserman, looked at these results and said: OK, you were right. That never happens in science. And slightly irritatingly, Andrew Zisserman then switched to doing this. He had some very good postdocs and students working with him, Simonyan and others, and after about a year they were making better networks than us. But as far as the general public was concerned, that was really the start of this big swing towards deep learning in 2012.
So that event, that competition and the fact that the neural network approach totally dominated all the other approaches, really was a wake-up call to the computer vision community, in which within a couple of years everybody was using neural networks. But it was also a wake-up call to the machine learning community, who had kind of scoffed at neural networks for several decades, that neural networks were back. And so people started experimenting again with this new generation of deep neural networks. "Deep" just meant that instead of having two layers, they could have many layers, dozens of layers, because training algorithms were available and so was computation. People started experimenting with these, primarily on perceptual-style problems. In fact, by 2016,
neural nets had taken over not only computer vision. In 2016, some scientists from Microsoft showed that they had been able to train a neural network to finally reach human-level recognition performance for individual words in a widely used data set called the Switchboard data set. So people were experimenting with neural nets for visual data, speech data, radar, LIDAR, all kinds of sensory data. People
also started asking: well, can we apply these to text data? And the answer was yes. People started inventing various architectures, things with names like long short-term memory, to analyze sequences of text, and applying them to problems like machine translation, translating English into French and so forth. And that kind of worked. And then in 2017 a very important paper was published.
The name of the paper was "Attention Is All You Need." What that was referring to was a subcircuit in a neural network called an attention mechanism, which had recently been invented and developed and was trainable. That attention mechanism was used in this paper, and it advanced the state of the art in machine translation. But even more importantly for us today, it introduced the transformer architecture, based on this attention mechanism.
And it's that transformer architecture that underlies GPT and pretty much all of the large language models that were released
around 2022.
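The attention mechanism at the core of the transformer can be sketched as scaled dot-product attention: each position's output is a softmax-weighted mix of value vectors, with the weights set by query-key dot products. This is a toy pure-Python version with made-up vectors; real implementations are batched matrix multiplications on GPUs.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy example: the query matches the first key far more strongly than the
# second, so the output is pulled almost entirely toward the first value.
queries = [[10.0, 0.0]]
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = attention(queries, keys, values)
print(out)
```

Because the weights are computed from the content of the sequence itself, every position can "look at" every other position in one step, which is what lets transformers replace the sequential recurrence of LSTMs.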
So that was a major event. Now around the same time, Yann LeCun, remember the guy who was the postdoc with Jeff in 1987, had become the head of AI research at Facebook. And so he was in a very interesting position, because he was both an academic, he retained his NYU professorship, and at the same time he had a foot in the commercial world, directing the AI strategy for Facebook. So I asked Yann about this period and what it looked like to him from inside both worlds. The first part of his answer was that, for him, a key development was realizing that you didn't have to wait for people to label all your training data. You could do something called self-supervised learning: for example, just take data like a string of words, remove a word, and force the program to predict what that removed word was. There's no human labeling you have to do for that. You can use the whole web and get a lot of training examples. So that self-supervised learning was a key development.
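The word-removal idea can be sketched in a few lines: from raw text alone you manufacture (context, missing word) training pairs, with no human labeling. The sentence and the mask token below are stand-in examples, not anything from the interview.

```python
# A mask token stands in for the removed word (the token name is arbitrary).
MASK = "<mask>"

def make_examples(sentence):
    words = sentence.split()
    examples = []
    for i in range(len(words)):
        context = " ".join(words[:i] + [MASK] + words[i + 1:])
        # The "label" is the removed word itself: supervision for free.
        examples.append((context, words[i]))
    return examples

examples = make_examples("the cat sat on the mat")
for context, target in examples[:2]:
    print(context, "->", target)
```

Every sentence on the web yields as many training pairs as it has words, which is why self-supervision unlocked data at a scale hand labeling never could.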
But then here's his description of what came next.
>> So the idea that self-supervised learning could really bring something to the table there, I think, was kind of a big change of mindset. And then there were transformers, of course, right? Before that, there was some demonstration that you could basically match the performance of classical systems for tasks like language translation using large neural nets like LSTMs. So
this was the work by Ilya when he was at Google. It was this big sequence-to-sequence model with LSTMs, some gigantic model where you can train it to do translation, and it kind of works at the same level, if not better in some cases, than the then-classical translation methods. And
then a few months later, Yoshua Bengio, Dzmitry Bahdanau, and Kyunghyun Cho, who is now a colleague at NYU, showed that you could change the architecture and use this attention mechanism that they proposed to basically get really good performance on translation with much smaller models than what Ilya had been proposing. And the entire industry jumped on this. Chris
Manning's group at Stanford used that architecture and basically won the WMT competition for a particular type of translation, and the entire industry jumped on it. So within a few months after that, all the big players in translation were using attention-type architectures, and that's when the transformer paper came out, "Attention Is All You Need." So basically, if you build a neural net with just those kinds of attention circuits, you don't need much else, and it ends up working super well. And that's what
started the transformer revolution. And then after that came BERT, which also came out of Google, which was this idea of using self-supervised learning: take a sequence of words, corrupt it, remove some of the words, and then train this big neural net to predict the words that are missing. And again, people were amazed by how good the representations learned by the system were for all kinds of NLP tasks, and that really captured the imagination of a lot of people. And then after that, the next revolution was: oh, actually the best thing to do is you remove the encoder and you just use a decoder. And you just
train a system: you feed it a sequence, and you just train it to reproduce the input sequence on its output. And because the architecture of the decoder is strictly causal, because a particular output is not connected to the corresponding input, it's only connected to the ones to the left of it, implicitly you're training the system to predict the next word that comes after a sequence of words. That's the GPT architecture that was promoted by OpenAI, and that turned out to be more scalable than BERT, in the sense that you can train gigantic networks on enormous amounts of data and you get some sort of emergent property. And that's what gave us LLMs. So that brings us up to today with
transformers, and you can see this very strange evolution and wandering path of progress and exploration over decades.
So before we leave, let's take a look at that history and ask: if this is a case study of how scientific progress was made in this field, what are the main themes we see? Well, I think the first one is that progress happens in waves. It's paradigm after paradigm, right? First there were perceptrons, but those got thrown away and replaced by symbolic representations being learned, eventually to be replaced by neural nets, which were replaced by probabilistic methods, and so forth. So there's wave after wave of paradigms.
Another theme is that a lot of these ideas really came from other fields.
Even the very notion of perceptrons came from somebody who was fundamentally a neuroscientist interested in how neurons
in the brain could even learn stuff.
PAC learning: you heard Les Valiant talk. He's very much a computational complexity researcher who found that this was an interesting theoretical result. Bayes networks: heavily influenced by statistics. And so forth.
Many of these advances really were new framings of the problem.
Winston's work on symbolic learning was really a reframing of what the problem was. The work on reinforcement learning really changed the definition of what the training signal even is for these systems. So that's another theme you see.
And finally, I think, like a lot of scientific fields, machine learning is really a blend of technical forces and social forces. Certainly in the long term, the cold hard facts of what works best come out, and those methods win. But in the shorter term, the question of who works on what kinds of problems is very much influenced by the personalities of people, their ability to persuade other people to jump in and start working with them on their problems. So these are some of the themes you see, and I think if you look around at other fields you sometimes see similar themes.
Finally, what are the lessons from all this for researchers?
I think the first lesson really is: question authority. Because if you think about the major advances, many of those came from going against what was currently the conventional wisdom in the field, inventing a new framing or taking a radically different approach.
Another lesson, don't drag your feet.
I've seen decade after decade new paradigms emerge in the field. And every
single time that happens, existing researchers take longer than they need to to recognize the benefits of the new
paradigm.
And the most guilty people are the senior researchers.
You can probably explain that by taking into account who has the most to lose if there's a new paradigm replacing the current approach.
Another lesson: learn to communicate, and learn to follow through. You heard Jeff Hinton talking about the development of backpropagation in the mid 80s. You heard him say, "We didn't invent backpropagation, but we showed that it was important."
And actually, to be fair, they thought they were inventing backpropagation. They actually reinvented it, but they had no idea that somebody had invented it before, because whoever did didn't succeed in waking up the research community to the fact that they had a really good idea. I don't know why. Maybe they didn't put in the effort or succeed in communicating. Maybe they dropped it after they did it, went some other direction, so they didn't follow through to provide the evidence.
But that kind of thing happens frequently, and successful researchers are good communicators, and they follow through to push the field to pay attention.
The final lesson I think is the philosophers were actually right.
Today, despite these amazing capabilities of our learning systems, we don't have a proof or anything like a
rational justification of why you can generalize from examples to get these general rules that work well. Despite
the success that we have, we don't really understand at this very fundamental level why. And I think that if we did pay more attention to that
question, we might have a better chance to develop algorithms that outperform what we have today.
So I'll stop there. Thank you very much.