Ep. 1 - The History of Machine Learning with Tom Mitchell
By Stanford Digital Economy Lab
Summary
Topics Covered
- Train, Don't Program
- Induction Remains Unjustified
- Learning Beats Computation
- Paradigms Replace Each Other
- Question Authority
Full Transcript
Welcome to machine learning. How did we get here? I'm Tom Mitchell, your podcast host. Now, many people ask, how did we get to this point where today we have these amazing AI systems? I have a one-sentence answer to that question.
We tried for 50 years to write by hand intelligent programs, but we discovered about a decade ago that it was actually
much easier and much more successful to use machine learning methods to instead train them to become intelligent.
So the real question is how did machine learning get here?
What were the successes along the way and the failures? Who were the people involved? What were they thinking? What even made them want to get into this field in the first place?
This first episode will set the stage for the podcast. It is a recording of a lecture I gave this month, in February 2026, at Carnegie Mellon University, and it attempts to cover in one hour the 75-year history of the field of machine learning.
Most of the rest of the episodes in the podcast involve interviews with various pioneers in the field who made very
significant contributions along the way.
Before we start, I want to thank Carnegie Mellon University and also the Stanford University Digital Economy Lab for supporting the podcast. And I want to thank Maddie Smith, our podcast producer. I hope you enjoy the podcast.
If we're going to talk about machine learning, it's only fair to start with the first people who talked about how on earth learning is possible, which were the philosophers. As early as Aristotle, people were talking about the question of how it is that we can look at examples of things and learn their general essence, in his words. About a century later, there was a school of philosophers called the Pyrrhonists who really zeroed in on the problem of induction and how it can be justified. When we say induction, what we really mean is the process of coming up with a general rule from looking at specific examples. And so
they talked about questions like, well, if all of the swans we've seen so far in our life are white, should we conclude that all swans are white? What would be
the justification for that? Maybe
there's a black swan out there that we haven't seen.
And that debate went on for some time. Around 1300, William of Ockham suggested something that we now call Occam's razor, the policy that we should prefer the simplest hypothesis.
So indeed, if all the swans we've seen so far are white, then the simplest hypothesis is all swans are white. That
was his prescription.
Later on, around 1600, Francis Bacon brought up the importance of data collection, of actively experimenting to collect data that could falsify hypotheses that weren't correct.
And then in the 1700s, the philosopher David Hume really nailed the problem of induction. He argued very persuasively that it's impossible to generalize from examples if you don't have some additional assumption that you're making. And he pointed out that even the assumption that the future will be like the past is itself not a provable assumption. It's just a guess that we use. So his point was that people do induction, but it's a habit. It's not a justified, rational, provably correct process.
So, they had plenty to say. Around the 1940s, when computers became available, Alan Turing, who's often called the father of computing, suggested that maybe computers could learn. He said, instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain. So he had the idea that maybe computers could learn, but he did not have an algorithm by which they would learn. That waited until the 1950s, when there were two important seminal events.
One was a computer program written by an IBM researcher named Arthur Samuel, and his program learned to play checkers. I'll just read you a couple of sentences from the abstract of his paper. He said, "Two machine learning procedures have been investigated in some detail using the game of checkers.
Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by
the person who wrote the program."
And then he went on to point out that the principles of machine learning verified by these experiments are, of course, applicable to many other situations.
So he had really one of maybe the first demonstrations of a program that learned to do something interesting, and he understood that the techniques he was using were very general. Now, how did he get the computer to learn to play checkers? His program learned an evaluation function that would assign a numerical score to any checkers position. And that score would be higher the better the checkers position was from your point of view as you're playing the game. And then you would use that to control a look-ahead search for which move to take. That evaluation function was a linear weighted combination of board features that he made up, things like: how many checkers are on the board that are mine? How many are on the board that are yours? And so forth.
So what his program learned was that evaluation function. How did it learn it? By playing games against itself. And he points out that in 8 to 10 hours it could learn well enough to beat him.
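The kind of evaluation function Samuel's program learned can be sketched in a few lines; the feature names and weight values below are illustrative, not Samuel's actual ones:

```python
# Sketch of a Samuel-style evaluation function: a linear weighted
# combination of hand-crafted board features. Feature names and weight
# values are illustrative, not Samuel's actual ones.

def evaluate(features, weights):
    """Score a checkers position as a weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())

# What the program "learns" is this weight vector, adjusted over self-play.
weights = {"my_pieces": 1.0, "opp_pieces": -1.0, "my_kings": 1.5}

position = {"my_pieces": 8, "opp_pieces": 7, "my_kings": 1}
score = evaluate(position, weights)  # higher = better for the player to move
print(score)
```

In Samuel's program, a score like this was computed for positions reached by look-ahead search, and the weights were tuned so that the scores matched the outcomes of self-play games.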
Those ideas persisted through the decades. They became reused over and over, including in the computer programs that finally beat the world chess champion, the world backgammon champion, and the world Go champion. So those ideas were really seminal. A second
thing that happened in the '50s was the invention of the first early version of neural networks by Frank Rosenblatt from Cornell. He was interested in neuroscience: how can neurons in the brain be used to learn? And he ended up building a simple, at least by today's standards, neural network that consisted of one layer of neurons, where there would be a receptive field of input, say an image, and then the neurons would respond to that and produce an output set of neuron firings.
What got learned in that case were the connection strengths between the input to the neuron and the probability that it would fire.
And the way he trained it was what we now call supervised learning: you show an input and what the output should be, and he had schemes for updating those weights to fit the data.
Now, the importance of this work is that it catalyzed a whole bunch of work in the 1960s, for the next decade, looking at different algorithms for tuning the weights of perceptron-style systems.
That work proceeded for a decade or so, and at the end of the 1960s two MIT scientists, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons. But unfortunately, that book proved that a single-layer perceptron, which is the only thing we knew how to train at that point, could never even represent many of the functions that we wanted to learn. It could only represent linear functions, not even exclusive-or, where the output would be one if one input is a one and the other is a zero, but zero if both inputs are the same. You can't even represent that simple function with a perceptron, no matter how you train it.
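This limitation is easy to check by brute force: search a grid of weights for a single linear threshold unit and confirm that some setting computes AND, but no setting computes exclusive-or. This is just an illustrative sketch of the Minsky-Papert observation, not their proof:

```python
# Brute-force check that a single linear threshold unit (a one-layer
# perceptron) can represent AND but not exclusive-or (XOR).
import itertools

def matches(w1, w2, b, target):
    """True if the unit w1*x1 + w2*x2 + b > 0 agrees with target on all inputs."""
    return all((w1 * x1 + w2 * x2 + b > 0) == target(x1, x2)
               for x1, x2 in itertools.product([0, 1], repeat=2))

grid = [v / 2 for v in range(-8, 9)]  # candidate weights/biases: -4.0 .. 4.0

AND = lambda a, b: bool(a and b)
XOR = lambda a, b: a != b

and_ok = any(matches(w1, w2, b, AND) for w1, w2, b in itertools.product(grid, repeat=3))
xor_ok = any(matches(w1, w2, b, XOR) for w1, w2, b in itertools.product(grid, repeat=3))

print(and_ok, xor_ok)  # AND is linearly separable; XOR is not
```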
So this really put the kibosh on work on perceptrons following the publication of this book.
Now, if we're not going to be able, or don't want, to spend our time figuring out how to learn perceptrons, then what's next? Well, it turned out one of Minsky's PhD students, Patrick Winston, the next year published his thesis. And Winston suggested that instead of learning perceptron-type representations of information, we should learn symbolic descriptions. In his thesis, he showed how his program could learn descriptions of different physical structures, like an arch or a tower. And he would train the program by showing it line drawings of positive and negative examples of, in this case, arches. And then the program would process those incrementally arriving examples to produce a symbolic description that would describe the different parts and the relations among them. For example, an arch could be two rectangles which don't touch each other but which jointly support a roof of any shape.
So this was an important step because it shifted the focus onto learning a much richer kind of representation, symbolic descriptions, and this became the new paradigm which dominated the 1970s.
So during the '70s there were a number of people working on learning symbolic descriptions. My favorite is the Meta-DENDRAL program developed by Bruce Buchanan at Stanford. This program, again, was a symbolic learning program. What it learned was rules that would predict how molecules would shatter inside a mass spectrometer, and therefore predict what the mass spectrum of a new molecule would be. And those rules, again, symbolically described a subgraph of atoms within the molecular graph. And the rules would say: if you find this subgraph, then specific bonds in that subgraph are likely to fragment when you put this in a mass spectrometer.
And this was an important step forward.
I asked Bruce Buchanan how well it worked. What was this program able to do?
>> Well, for one small class of steroid molecules, the keto-androstanes, if you will, we had fewer than a dozen spectra, and we were able to tease out the rules that determined how a new keto-androstane would fragment in a mass spectrometer. And we were able to publish that set of rules in a refereed chemistry journal. And it was, to our knowledge, the first time that the result of a machine learning program, symbolic learning, had been published in a refereed journal.
So that was an important milestone for machine learning, really the first time that a program discovered some knowledge that was useful enough to get published in that domain. Now, it turned out, on a personal note, I was a PhD student at Stanford at the time, and Bruce became my PhD advisor. So my PhD thesis was also built around this same data set. And for my thesis I developed a system called version spaces that was the first symbolic learning algorithm where you could prove that it would converge, and furthermore that the learner would know when it had converged, so it would know it was done. And it did that by maintaining not just one hypothesis that it would modify, but by keeping track of every hypothesis consistent with the data that it had seen. And this also opened up the possibility of what we call today active learning. It made it easy for the system to play 20 questions with the teacher. It could ask the teacher, please label this example, so that it could reduce the set of hypotheses as quickly as possible. So by the end of
the '70s there seemed to be enough work going on in the field that it was time to hold a meeting. And so we organized the first workshop in machine learning, held here at CMU at Wean Hall, a couple of buildings in that direction. It was organized by Jaime Carbonell, who was an assistant professor here at the time, Ryszard Michalski, who was a more senior professor at Illinois, and myself; I was at the time an assistant professor at Rutgers University. And so we held this meeting and pulled together some people. One of the people who attended was a student of Ryszard Michalski named Tom Dietterich, and Tom went on to make many contributions in the field of machine learning. And so I asked Tom what the field was like in 1980, say.
>> It was really chaotic. You know, I attended that very first machine learning workshop that was organized, I think you were one of the core organizers, at CMU, and there were probably 30 people in the room, and probably 30 completely different talks. I remember I had done a sort of algorithm-comparison paper that I published at IJCAI '79, I think, so just before that workshop, in which I was by hand executing these very simple algorithms for this kind of subgraph learning problem and comparing how many subgraph isomorphism calculations they had to do. But it was like the first attempt to actually compare multiple machine learning algorithms that were more or less trying to do the same thing. There
were a couple of them there. And I think John Anderson was there talking about cognitive models. You were there talking about the beginnings of EBL and the LEX system for calculus, symbolic integration. And I remember the most interesting talk, I thought, was Ross Quinlan's talk on ID3, where he was trying to take these reverse-enumerated chess endgames and learn decision trees that would completely, exactly, losslessly compress those giant tables into a small decision tree. A
really important thing people should understand about those days is that we believed there was a right answer for our machine learning problems. And it would often happen that I would run, say, Michalski's algorithms and it would not get the right answer. It would not get the logical expression that we thought was the right answer. You would get something that was really actually equally accurate on the training data, and actually worked pretty well, although we didn't really have the idea of a separate test set in those days. I mean, it was not a field of statistics. The idea was, we were coming out of really the John McCarthy program of programs with common sense, which didn't have a lot to do with common sense, but it was about: we're going to represent everything in logic, and we're going to use logical inference as the execution engine.
>> So there's Tom's take on what things were like. He mentioned that he thought the most interesting talk was Ross Quinlan's talk. I agree; I thought that was the most interesting talk. Ross's talk presented the idea that we should learn decision trees. A decision tree is something where you classify your example by putting it at the root of the tree, and then you sort it down to a leaf in the tree based on its features, and the leaf tells you what the output classification label should be. That's what gets learned. So I asked Ross how he came up with this idea.
>> I had done a PhD under a psychologist, Earl Hunt, and part of his work involved decision trees, which I learned about, of course, as a student, but then put in the back of my mind for 15 years or so. And then I was at Stanford on sabbatical at the same time as Donald Michie was teaching a course on learning, and he had a challenge for the class, which, you know, I sat in on the class, and the challenge was to work out a way of predicting a win in a very simple chess endgame: king-rook versus king-knight. So I remembered Earl Hunt's work on decision trees, and I thought, well, maybe that would be the way to go. So I developed a thing called ID3, which was just a simple decision tree program, no pruning, just a straight decision tree, and that seemed to solve the problem pretty well, up to about 95%. And then I got that up to 100 the next year. And I remember the first real time I talked about this was at that conference you organized, the workshop in 1980 at Pittsburgh at Carnegie Mellon. You, Ryszard, and Jaime all set up that workshop, and then I gave a talk on decision tree learning.
>> So there's Ross's story. He got the idea of decision trees from his thesis advisor many years earlier. But it turns out Ross was the one who came up with the algorithm that actually successfully discovered useful decision trees. And that whole idea of decision tree learning became very important in the field. By 2010 it was probably one of the most commercially used approaches in machine learning.
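Classification with a learned decision tree, as described above, is just a walk from the root to a leaf. A minimal sketch, using a made-up weather tree rather than Quinlan's chess endgame trees:

```python
# Sketch of decision-tree classification: internal nodes test a feature,
# leaves hold the predicted label. The tree below is made up for
# illustration; it is not a tree learned by ID3.

def classify(node, example):
    """Sort an example down the tree until reaching a leaf (a plain label)."""
    while isinstance(node, dict):            # internal node: test a feature
        value = example[node["feature"]]
        node = node["children"][value]       # follow the branch for that value
    return node                              # leaf: the output label

tree = {
    "feature": "outlook",
    "children": {
        "overcast": "play",
        "rain": "don't play",
        "sunny": {"feature": "windy",
                  "children": {True: "don't play", False: "play"}},
    },
}

print(classify(tree, {"outlook": "sunny", "windy": False}))  # play
```

Learning, in ID3, is the separate step of choosing which feature to test at each node so that the resulting tree fits the training examples.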
So in the early '80s there were various experiments like these trying to build machine learning systems, but really no theory, no theory that could tell us, for example, how many examples we would have to present to a learner in order for it to reliably learn. And that changed in 1984, when Les Valiant published a paper on what he calls probably approximately correct learning. And the idea is, it really was the first practical theory to tell us how many examples you would need. In particular, the number of examples you need depends on three things. The complexity of your hypothesis space: for example, if you're going to learn decision trees of depth two, that's a lot less complex than if you're learning decision trees of depth 12. It depends on the error rate you're willing to tolerate in the final hypothesis: 1% error, 5% error. And it also depends on the probability you're willing to put up with that, if you do choose that many randomly provided training examples, you'll still fail. You can't guarantee that you won't fail, but you can reduce that probability.
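For a finite hypothesis space, these three quantities appear directly in the standard PAC sample-complexity bound, m >= (1/epsilon)(ln|H| + ln(1/delta)). A sketch, with illustrative hypothesis-space sizes standing in for the depth-two versus depth-12 trees:

```python
# The classic PAC bound for a finite hypothesis space H: drawing
#     m >= (1/epsilon) * (ln|H| + ln(1/delta))
# random examples guarantees that, with probability at least 1 - delta,
# every hypothesis consistent with them has true error at most epsilon.
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for error <= epsilon with confidence 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# A more complex hypothesis space (e.g. deeper decision trees) needs more data:
simple = pac_sample_bound(10**3, epsilon=0.05, delta=0.05)
complex_ = pac_sample_bound(10**9, epsilon=0.05, delta=0.05)
print(simple, complex_)
```

Loosening the tolerated error or the failure probability shrinks the bound, exactly as described above.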
So this was a breakthrough in the area of theoretical characterization of algorithms. So I asked Les what he thought was the key idea there.
>> It's a kind of a model of computation, but it makes sense because it's got some application. So the particular result which persuaded people that there was something there is this result that if you take a conjunctive normal form formula, which, you know, from NP-completeness at the time you already knew there's some hardness in it, because if someone gave you the formula, it was completely difficult to find out whether it's null, whether it's equivalent to a formula which is always zero, which is never satisfiable. On the other hand, this was this conjunctive normal form formula with three variables in each clause. So this was PAC learnable, and so it was a bit striking that something which is very hard is learnable. But then this highlighted the difference between computing and learning, because with the learning model, the idea was that there was a distribution of inputs, and you learned from this distribution, but you only have to be good on this distribution when you have to predict. So if, for example, in this formula there were some very rare cases, then the learner wouldn't have to know about them. So in this sense this was easier than the NP-completeness.
>> So I was actually quite surprised at that answer. What he's saying, put another way, is that what was really interesting there is that for this one kind of hypothesis, conjunctive normal form, which is a kind of logical expression: if your hypotheses are of that form, then it's easier to learn them than it is to compute them. When he says compute them, what he means is the cost of answering the question, can you find a positive example of this? And it was known at the time that the computational cost of answering that question, is there a positive example of this formula, was exponential in the size of the formula. And then he discovered that learning a formula, if somebody gives you positive and negative examples, only takes polynomial, less than exponential, time. So I agree with him that that's a fascinating theoretical fact. But that would not be the answer I would give about why this revolutionized the field of machine learning. It revolutionized the field, in my view, because he was the first person really to be able to come up with a framing, a new framing of the machine learning problem, that even allowed this kind of theoretical analysis. In particular, his framing included assumptions like: the training data would come from some source that would give you random examples according to some probability distribution, and then later, when you wanted to test your hypothesis on new data, you would get more random examples from that same source. And so he reframed the problem in a way that made theory possible. The consequence of that was it catalyzed a huge amount of theoretical work in machine learning, and it continues to this day and keeps branching further and further. There are conferences specifically designed to cover this area of theoretical computer science.
So the '80s was really a very generative decade. There were a lot of things going on. Another thing going on was that some people were looking at human learning and how that might inspire our models of AI and machine learning. One such effort was here at CMU, by Allen Newell and his two PhD students, John Laird and Paul Rosenbloom. They built a system they called Soar, which was really one of the first AI agents designed to capture the full breadth of what humans do: play games, solve problems, many different tasks. So they framed their machine learning problem as one of getting a general agent to learn, and their architecture had very interesting properties that I think are relevant today, now that agents are again a topic of hot activity. I won't go into the details, but in the podcast there's an interview with John Laird who goes into detail on this.
Another item that can't be overlooked in the '80s was the rebirth of neural networks. Remember, at the end of the '60s Minsky and Papert published that book that killed off work on perceptrons. Well, in the mid-'80s people finally came up with an algorithm that could train not just one-layer perceptrons but multi-layer perceptrons. And that allowed learning functions that were highly nonlinear.
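To see what the extra layers buy: the exclusive-or function that a single-layer perceptron provably cannot represent is computed easily by a two-layer network of the same threshold units. The weights here are wired by hand for illustration, not learned by backpropagation:

```python
# A hand-wired two-layer network of linear threshold units computing XOR,
# illustrating the representational power that multi-layer networks add.
# Weights are chosen by hand, not learned.

def unit(weights, bias, inputs):
    """A linear threshold unit: outputs 1 iff the weighted sum plus bias is positive."""
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

def xor_net(x1, x2):
    h1 = unit([1, 1], -0.5, [x1, x2])      # hidden unit: OR(x1, x2)
    h2 = unit([1, 1], -1.5, [x1, x2])      # hidden unit: AND(x1, x2)
    return unit([1, -1], -0.5, [h1, h2])   # output: OR but not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```

What backpropagation contributed was a way to find weights like these automatically, by propagating errors back through the hidden layer.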
And Dave Rumelhart, Jay McClelland, and Jeff Hinton were three of the ringleaders of this effort. So I asked Jeff about that period.
>> Now we're up to the mid-'80s, when really neural nets are reborn. Is that the right word?
>> Yes, with backpropagation. I mean, we didn't invent it. It was invented by several different groups, but we showed that it really worked to learn representations. And as you know, sort of one of the big problems in AI is how do you learn new representations?
How do you avoid having to put them all in by hand? And my particular example was the family trees example, where you take all the information in some family trees, you convert it into triples of symbols, like John has-father Mary. And then you train a neural net to predict the last term in a triple given the first two terms. So it's just like the big language models: you're predicting the next word given the context. It's just much simpler. I had 112 total examples, of which 104 were training examples and eight were test examples, which is a bit less than the trillion examples they have nowadays.
But it was the same idea: you convert a symbol into a feature vector. You then have the feature vectors of the context interact via a hidden layer. They then predict the features of the next symbol, and from those features you guess what the next symbol should be, and you try to maximize the probability of predicting the next symbol, and you then backpropagate through the feature interactions and through the process that converts the symbol into features. And that way you learn feature vectors to represent the symbols, and how these vectors should interact to predict the features of the next symbol. And that's what these big language models do.
>> So there's Jeff on the mid-1980s work on backpropagation. Another
personal note: in 1986, while this was going on, I came to spend a year at CMU as a visiting professor, and I got to meet Allen Newell at the time. And Allen said, "Hey, do you want to team-teach a course? We'll teach a course on architectures for intelligent agents." And of course I said, "Yes," the opportunity to teach with Allen. And he said, "By the way, there will be another assistant professor working with us. The three of us will team-teach it. That's Jeff Hinton." So, Allen, Jeff, and I team-taught this course in the spring of 1986. It was one of the best experiences of my career up to that point. And so it was a large part of the reason why I ended up staying at CMU.
But when I came, I was here for about a year, and then Jeff moved on. He moved up to the University of Toronto and started building up a group there. One of the people who joined his group was a person named Yann LeCun, who went on to win the Turing Award jointly with Jeff and Yoshua Bengio for their work in neural networks. So I asked Yann about this period.
>> And then, mid-1987, I moved to Toronto to do a postdoc with Jeff, and I completed the simulator. Jeff thought I was not doing anything, because I was just basically hacking all the time. And this system was kind of interesting, because we had to build a front-end language to interact with it. And that language was a Lisp interpreter that Leon and I wrote. And so we were using Lisp as a front end to kind of a neural net simulator. And I implemented the weight-sharing abilities and all that stuff, and started experimenting with what became convolutional nets when I was a postdoc in Toronto, early 1988 roughly, and started to get really good results: very simple shape recognition, like handwritten characters that I had drawn with my mouse or something like that, right?
>> So, as you just heard, Yann was experimenting with: can we apply neural networks to the problem of character recognition, handwritten characters? People
were experimenting with many different uses of neural nets at the time. My favorite, the one I would vote application of the decade, was done, surprisingly, in the area of self-driving cars. There was a PhD student here at CMU named Dean Pomerleau. He trained a neural network where the input was an image taken by a camera looking out the front windshield of a vehicle, and the output of the neural network was a steering command telling the car which direction to steer. So I asked Dean about that work.
>> How much training data did you have?
>> So the interesting thing was, to begin with it was all batch training. I'd have a person drive the vehicle along the Flagstaff Hill path in Schenley Park, and then I would go off and crunch it overnight. But in the end, what we were able to do is real-time learning. So, one drive up the hill, with a human behind the wheel steering and the neural network learning to pair camera images with the steering command that the human was giving, was able to train it in about five minutes to take over and steer on its own from there on, on that road and on similar roads. So it was one of the first real-time, real-world vision applications of artificial neural networks, going beyond just Flagstaff Hill, you know, the little paths on there. And we went out on real roads, first through the Schenley golf course, on the road there. And then we went on, you know, the local highways. In fact, the longest trip we did as part of my PhD was, I think, about 100 miles at the time, basically up I-79 from Pittsburgh all the way up to Erie. And it drove basically the whole way. And it was getting up to 55 miles per hour after we got a faster vehicle.
>> It turns out he didn't ask for permission.
So this was all happening in the 1980s. Really, it was a decade of amazing invention and innovation and exploration. Another important thing that happened in that decade was the development of reinforcement learning. The way to understand that is to first realize that supervised learning was the kind of standard way of framing the machine learning question. When Dean talked about training a system, he would input an image. He had people drive the car. So he got a lot of training examples of the form: here's the image, and here's the correct steering command. So he could tell the neural network, for this input, here's the correct output. That's called
supervised learning. But reinforcement
learning reframes the problem. It takes
into account that sometimes we don't know what the right output is. For
example, if you're learning to play chess, you might not have a person who tells you at every step, given this board position, here's the right move.
Instead, you might have to wait until the end of the game after you've made many moves to get the feedback signal that says you lost or you won. And then
you have to figure out what to do about that, because you actually took many moves. So that's what reinforcement learning is about. And Rich Sutton and Andy Barto were instrumental in framing that problem and working on it. They recently won the Turing Award for this work. So I asked Rich how reinforcement learning fit into the field.
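The contrast just described, a win/lose signal only at the end rather than a correct output per step, is the heart of reinforcement learning, and can be sketched with tabular Q-learning on a toy problem. The five-state chain environment and all hyperparameters below are illustrative assumptions, not anything from the lecture.

```python
import random

# Toy chain environment: states 0..4; reaching state 4 ends the episode
# with reward 1, every other step gives reward 0. So, as in chess, the
# learner only finds out at the very end how the episode went.
N_STATES = 5
ACTIONS = [-1, +1]  # move left / move right

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    done = (nxt == N_STATES - 1)
    reward = 1.0 if done else 0.0
    return nxt, reward, done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the current estimates,
            # occasionally explore a random action.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            nxt, r, done = step(s, a)
            # Temporal-difference update: the end-of-episode reward is
            # propagated backward through the states visited on the way.
            best_next = 0.0 if done else max(q[(nxt, act)] for act in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = nxt
    return q

q = q_learning()
# The learned policy should be "move right" in every non-terminal state.
policy = {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

The reward arrives only at the final state, yet the update rule spreads it backward so every earlier state learns which move was good, exactly the "figure out what to do about that" problem described above.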
>> The field of machine learning has always been dominated by the more straightforward supervised approach. And, as I mentioned at the very beginning, the rewards and penalties were very much part of it. But then, as things became clearer and better defined and it became a more clearly framed learning problem, the focus became pattern recognition and supervised learning. And this strange fellow, Harry Klopf, recognized this more than other people, and wrote some reports and ultimately a book saying that something had been lost. And Andy Barto and I picked up on his work and eventually realized that he was right, that something had been left out. And in some sense it was obvious that something had been left out from the point of view of psychology, where I'd been studying how animals learn, and animals learn really in both ways, both a supervised way and a reinforcement way. And so we picked up on that and made it into a well-defined area.
>> When was that?
>> That would have been in the 80s.
>> And then finally you wrote a book on it in '98. So then it became a clear subfield of machine learning.
Yeah. But the key thing is, the way I say it to myself is: why is reinforcement learning potentially powerful? And it's powerful because it's really learning from experience, learning from the normal data that an animal or person would get, and it doesn't require specially prepared data like you of course need in supervised learning.
So during the 80s there were a lot of other really interesting things going on: people experimenting with the idea that maybe machines should learn by simulating evolution, an entire set of conferences on genetic algorithms and genetic programming, which had to do with that sort of thing, a cluster of work on studying human learning, and other areas. But we don't have time for all of those. Let's move on to the 1990s, when again there was, I would say, a sea change in terms of the style of work that went on. The theme of the 1990s was really the integration of statistical and probabilistic methods into the field of machine learning. And a lot of that took the concrete form of learning a new kind of object, which people called either graphical models or Bayes nets.
But what got learned in that case was again a network, where each node would represent a variable. For example, maybe you would be interested in predicting whether somebody has lung cancer. You'd make that a variable. And maybe you'd have evidence, like: are they a smoker? Do they have a normal or abnormal X-ray result? You'd make those variables. And then the edges in the graph represent probabilistic dependencies among the variables, in such a way that in the end the whole graph represents the full joint probability distribution over the entire collection of variables.
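The lung-cancer example can be sketched as a tiny Bayes net, Smoker -> Cancer -> AbnormalXray, where the graph structure lets the full joint distribution factor into one small term per node. The probability numbers below are made up purely for illustration.

```python
from itertools import product

# Three boolean variables: Smoker -> Cancer -> AbnormalXray.
# These conditional probability tables are invented for illustration,
# not real medical numbers.
p_smoker = {True: 0.3, False: 0.7}
p_cancer_given_smoker = {True: 0.05, False: 0.01}    # P(Cancer=T | Smoker)
p_abnormal_given_cancer = {True: 0.9, False: 0.2}    # P(Xray abnormal | Cancer)

def joint(s, c, x):
    # The graph says the full joint factors into one term per node,
    # each conditioned only on that node's parents.
    pc = p_cancer_given_smoker[s] if c else 1 - p_cancer_given_smoker[s]
    px = p_abnormal_given_cancer[c] if x else 1 - p_abnormal_given_cancer[c]
    return p_smoker[s] * pc * px

# Sanity check: the factored terms define a proper distribution.
total = sum(joint(s, c, x) for s, c, x in product([True, False], repeat=3))

# Inference by enumeration: P(Cancer | Smoker=True, abnormal X-ray).
num = joint(True, True, True)
den = num + joint(True, False, True)
posterior = num / den
print(round(total, 6), round(posterior, 3))
```

The payoff of the factored form is that eight joint probabilities are specified by just five numbers, and queries about any variable can be answered by summing out the others.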
So that's what got learned; how it got learned had to wait for some algorithms to be discovered. Although Judea Pearl came up with the idea of how to represent these, one of the key people involved in inventing those algorithms was Daphne Koller, a professor at Stanford, who was one of the most active researchers in designing algorithms for learning these.
So I asked her: why do we need graphical models?
>> Graphical models for me emerged by realizing that the problems we needed to solve to address most real-world applications went beyond: you have a vector representation of an input and a single, oftentimes binary or at best continuous, output. There was so much more opportunity to think about richly structured environments, richly structured problems. So even if you think about problems like understanding what is in an image, that's not a single-label problem of "there is a dog," because images are complex and there are interrelationships between the different objects. You wanted to get beyond the yes/no "is there a dog in this image?" to something that is much richer: there's a dog and a frisbee and a beach and three kids building a sand castle. You have a rich input and a rich output. Thinking about these richly structured domains gave rise to: we have to think about multiple variables, we have to think about the interactions between those variables, and we have to leverage that structure in both our input and our output space, in order to get to much better conclusions and deal with problems that really matter.
So this work on training graphical models was really part of a bigger theme that decade, which was the integration of statistical methods with what had been pretty much statistics-free machine learning up to that point. Another person who was instrumental in that was a Berkeley professor named Mike Jordan. I asked him about the relationship between statistics and machine learning.
>> So anyway, by the time I wanted to move to Berkeley, I was realizing that I was missing the whole statistics community,
that it was just separate from machine learning. As maybe you remember, there was occasionally a little leakage, but it was way too separate. And nowadays, we're often seeing people run a machine learning method, but then it's not calibrated, it has bias, and all that. And that's the thing statisticians have talked about for a long, long time. And so nowadays I think it's a given that, yeah, there are
kind of two sides of the same coin. Machine learning is maybe a little more engineering-oriented: let's build a system and make it do great things in the world. And statistics is a little bit more: well, let's be cautious, we're going to do clinical trials, let's make sure that the answer is really trustworthy. But those are two sides of the same coin. And I think that's probably pretty much clear now.
But for a long time there was a resistance. Everyone said: this is a brand new field, this is different. And I kept annoying colleagues by saying, no, I don't believe it is. So anyway, long story short...
>> It is remarkable to me that the field of machine learning went through most of the 1980s without even noticing that statistics existed.
>> I mean, people like Leo Breiman were around to help make the passage. So ensemble methods: they were kind of invented by Leo in the stat literature, but they were independently invented in the machine learning literature. And is that machine learning or statistics? Well, clearly it's both, and it needs both perspectives. And yes, in the 1990s, the EM algorithm, you know, the graphical models... so yeah, in the '90s there was a real flourishing of that.
>> So Mike mentioned that one of the themes was ensembles. I think that's actually a very nice example of how machine learning theory and statistical theory intertwined. The idea of ensemble learning is: instead of learning one hypothesis, let's learn multiple ones. For example, instead of learning a decision tree, you might learn a whole forest of decision trees. And then when it comes to classifying a new example, you give it to all of the classifiers, you let them vote, and you take the vote of the classifiers. Well, that turned out to be very successful, and commercially very important, but it is also a beautiful example where there's a pretty interesting theory around it.
Initially, Yoav Freund and Rob Schapire, in the early 90s, started working on theory and methods for doing this kind of ensemble. Leo Breiman, who was a statistician, recognized that this echoed some of the themes of resampling in statistics, and those two things came together in a very successful way.
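A quick simulation shows why the voting idea works: several individually mediocre but roughly independent classifiers, combined by majority vote, beat any single one. The classifiers below are simulated coin flips at an assumed 70% accuracy; real ensembles such as random forests get their partial independence by training trees on resampled data, which is the connection to Breiman's resampling themes.

```python
import random

# Each simulated classifier is independently correct with probability 0.7
# (an assumed number, not from the lecture). A majority vote over many
# such classifiers is correct far more often than any single one.
def vote_accuracy(n_classifiers, accuracy, trials, seed=0):
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        votes = sum(rng.random() < accuracy for _ in range(n_classifiers))
        if votes > n_classifiers / 2:  # majority votes for the true label
            correct += 1
    return correct / trials

single = vote_accuracy(1, 0.7, 20000)      # one classifier: ~70% correct
ensemble = vote_accuracy(25, 0.7, 20000)   # 25 voters: well above 90%
print(single, ensemble)
```

The catch, of course, is the independence assumption: classifiers that all make the same mistakes gain nothing from voting, which is why ensemble methods work to decorrelate their members.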
So in the 90s and the first decade of the 2000s there were many other things going on: the development of things called support vector machines and kernel methods, which were mathematical techniques for learning very nonlinear classifiers that were actually commercially important and in many cases opened the door to machine learning for non-numerical data, data like images or text. There was work on manifold learning. There was also growing commercialization during that decade; more and more companies were starting to use machine learning commercially.
But for me, the theme of that first decade of the 2000s was really a growing awareness by many people that maybe we have good enough machine learning algorithms, and the bottleneck to more accuracy is not the algorithm. Maybe we need more data and more computation.
And this idea was crystallized in a beautiful paper written in 2009 by three authors at Google, called "The Unreasonable Effectiveness of Data," which highlighted cases where, if you want better results, you keep the same algorithm and get more data. That was kind of a theme of what was going on at the time. But things really broke open in the year 2012.
By 2012, the computer vision community had been using a data set created by Fei-Fei Li called ImageNet to test out different vision algorithms, to see who could do the best job of labeling which object was the primary object in an image. And that data set was very large. In 2012, Jeff Hinton and some of his students entered the competition, and they blew away the competition. What's interesting is they were the only neural network approach in the competition. By that time, by the way, neural networks were very scarce in the field of machine learning. They had been displaced by more recent probabilistic methods, and only a smallish number of researchers were even still working on neural networks. But nevertheless this happened. So I asked Jeff about that.
>> When Fei-Fei came up with the ImageNet data set, Yann realized they could win that competition, and he tried
to get graduate students and postdocs in his lab to do it, and they all declined.
And Ilya realized that backprop would just kill ImageNet, and he wanted Alex to work on it, and
Alex didn't really want to work on it.
Alex had already been working on recognizing small images in CIFAR-10. And Ilya pre-processed everything for Alex to make it easy. And I bought Alex two Nvidia GPUs to have in his bedroom at home. And
Alex then got on with it, and he was an absolute wizard programmer. He wrote amazing code on multiple GPUs to do convolution really efficiently, much better code than anybody else had ever written, I believe. And so it's a combination of Ilya realizing we really had to do this, and I was involved in the design of the
net and so on, but Alex's programming skills, and then I added a few ideas, like use rectified linear units instead of sigmoid units, and use big patches of the images so you can translate things around a bit to get some translation invariance, as well as using convolution, and use dropout. So that was one of the first applications of dropout, and it helped by about 1%. But it helped, and then we beat the best vision systems. The best vision systems were
sort of plateauing at 25% errors. That's errors for getting the right answer in your top five bets. And we got like 15%, 15 or 16 depending on how you count it. So we got almost half the error rate.
And what happened then was what ought to happen in science but seldom does.
Our most vigorous opponents, like Jitendra Malik and Andrew Zisserman, looked at these results and said: OK, you were right. That never happens in science. And slightly irritatingly, Andrew Zisserman then switched to doing this. He had some very good postdocs and students working with him, Simonyan and others, and after about a year they were making better networks than us. But as far as the general public was concerned, that was really the start of this big swing towards deep learning in 2012.
So that event, that competition and the fact that the neural network approach totally dominated all the other approaches, really was a wake-up call to the computer vision community, in which within a couple of years everybody was using neural networks. But it was also a wake-up call to the machine learning community, who had kind of scoffed at neural networks for several decades, that neural networks were back. And so people started experimenting again with this new generation of deep neural networks. "Deep" just meant that instead of having two layers, they could have many layers, dozens of layers, because training algorithms were available and so was computation. People started experimenting with these, primarily on perceptual-style problems. In fact, by 2016,
neural nets had taken over not only computer vision. In 2016, some scientists from Microsoft showed that they had been able to train a neural network to finally reach human-level recognition performance for individual words in a widely used data set called the Switchboard data set. So people were experimenting with neural nets for visual data, speech data, radar, LIDAR, all kinds of sensory data. People
also started asking: well, can we apply these to text data? And the answer was yes. People started inventing various architectures, things with names like long short-term memory, to analyze sequences of text, and applying them to problems like machine translation, translating English into French and so forth. And that kind of worked. And then in 2017 a very important paper was published.
The name of the paper was "Attention Is All You Need." What that was referring to was a subcircuit in a neural network called an attention mechanism, which had recently been invented and developed and was trainable. That attention mechanism was used in this paper, and it advanced the state of the art in machine translation. But even more importantly for us today, it introduced the transformer architecture, based on this attention mechanism.
And it's that transformer architecture that underlies GPT and pretty much all of the large language models that were released
around 2022.
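The attention mechanism at the core of the transformer can be sketched as scaled dot-product attention: each position's output is a softmax-weighted mix of value vectors, with the weights set by query-key dot products. This is a toy pure-Python version with made-up vectors; real implementations are batched matrix multiplications on GPUs.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy example: the query matches the first key far more strongly than the
# second, so the output is pulled almost entirely toward the first value.
queries = [[10.0, 0.0]]
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = attention(queries, keys, values)
print(out)
```

Because the weights are computed from the content of the sequence itself, every position can "look at" every other position in one step, which is what lets transformers replace the sequential recurrence of LSTMs.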
So that was a major event. Now around the same time, Yann LeCun, remember the guy who was the postdoc with Jeff in 1987, had become the head of AI research at Facebook. And so he was in a very interesting position, because he was both an academic, he retained his NYU professorship, and at the same time he had a foot in the commercial world, directing the AI strategy for Facebook. So I asked Yann about this period and what it looked like to him from inside both worlds. The first part of his answer was that, for him, a key development was realizing that you didn't have to wait for people to label all your training data. You could do something called self-supervised learning: for example, just take data like a string of words, remove a word, and force the program to predict what that removed word was. There's no human labeling you have to do for that. You can use the whole web and get a lot of training examples. So that self-supervised learning was a key development.
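The word-removal idea can be sketched in a few lines: from raw text alone you manufacture (context, missing word) training pairs, with no human labeling. The sentence and the mask token below are stand-in examples, not anything from the interview.

```python
# A mask token stands in for the removed word (the token name is arbitrary).
MASK = "<mask>"

def make_examples(sentence):
    words = sentence.split()
    examples = []
    for i in range(len(words)):
        context = " ".join(words[:i] + [MASK] + words[i + 1:])
        # The "label" is the removed word itself: supervision for free.
        examples.append((context, words[i]))
    return examples

examples = make_examples("the cat sat on the mat")
for context, target in examples[:2]:
    print(context, "->", target)
```

Every sentence on the web yields as many training pairs as it has words, which is why self-supervision unlocked data at a scale hand labeling never could.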
But then here's his description of what came next.
>> So the idea that self-supervised learning could really bring something to the table there, I think, was kind of a big change of mindset. And then there were transformers, of course, right? Before that, there was some demonstration that you could basically match the performance of classical systems for tasks like language translation using large neural nets like LSTMs. So
this was the work by Ilya when he was at Google. It was this big sequence-to-sequence model with LSTMs, some gigantic model where you can train it to do translation, and it kind of works at the same level, if not better in some cases, than the then-classical translation methods. And
then a few months later, Yoshua Bengio, Dzmitry Bahdanau, and Kyunghyun Cho, who is now a colleague at NYU, showed that you could change the architecture and use this attention mechanism that they proposed to basically get really good performance on translation with much smaller models than what Ilya had been proposing. And the entire industry jumped on this. Chris
Manning's group at Stanford used that architecture and basically won the WMT competition for a particular type of translation, and the entire industry jumped on it. So within a few months after that, all the big players in translation were using attention-type architectures, and that's when the transformer paper came out, "Attention Is All You Need." So basically, if you build a neural net with just those kinds of attention circuits, you don't need much else, and it ends up working super well. And that's what
started the transformer revolution. And then after that came BERT, which also came out of Google, which was this idea of using self-supervised learning: take a sequence of words, corrupt it, remove some of the words, and then train this big neural net to predict the words that are missing. And again, people were amazed by how good the representations learned by the system were for all kinds of NLP tasks, and that really captured the imagination of a lot of people. And then after that, the next revolution was: oh, actually the best thing to do is you remove the encoder and you just use a decoder. And you just
train a system: you feed it a sequence, and you just train it to reproduce the input sequence on its output. And because the architecture of the decoder is strictly causal, because a particular output is not connected to the corresponding input, it's only connected to the ones to the left of it, implicitly you're training the system to predict the next word that comes after a sequence of words. That's the GPT architecture that was promoted by OpenAI, and that turned out to be more scalable than BERT, in the sense that you can train gigantic networks on enormous amounts of data and you get some sort of emergent property. And that's what gave us LLMs. So that brings us up to today with
transformers, and you can see this very strange evolution and wandering path of progress and exploration over decades.
So before we leave, let's take a look at that history and ask: if this is a case study of how scientific progress was made in this field, what are the main themes we see? Well, I think the first one is that progress happens in waves. It's paradigm after paradigm, right? First there were perceptrons, but those got thrown away and replaced by symbolic representations being learned, eventually to be replaced by neural nets, which were replaced by probabilistic methods, and so forth. So there's wave after wave of paradigms.
Another theme is that a lot of these ideas really came from other fields.
Even the very notion of perceptrons came from somebody who was fundamentally a neuroscientist interested in how neurons
in the brain could even learn stuff.
PAC learning: you heard Les Valiant talk. He's very much a computational complexity researcher who found that this was an interesting theoretical result. Bayes networks: heavily influenced by statistics. And so forth.
Many of these advances really were new framings of the problem.
Winston's work on symbolic learning was really a reframing of what the problem was. The work on reinforcement learning really changed the definition of what the training signal even is for these systems. So that's another theme you see.
And finally, I think, like a lot of scientific fields, machine learning is really a blend of technical forces and social forces. Certainly in the long term, the cold hard facts of what works best come out, and those methods win. But in the shorter term, the question of who works on what kinds of problems is very much influenced by the personalities of people, their ability to persuade other people to jump in and start working with them on their problems. So these are some of the themes you see, and I think if you look around at other fields you sometimes see similar themes.
Finally, what are the lessons from all this for researchers?
I think the first lesson really is: question authority. Because if you think about the major advances, many of those came from going against what was currently the conventional wisdom in the field, inventing a new framing or taking a radically different approach.
Another lesson, don't drag your feet.
I've seen decade after decade new paradigms emerge in the field. And every
single time that happens, existing researchers take longer than they need to to recognize the benefits of the new
paradigm.
And the most guilty people are the senior researchers.
You can probably explain that by taking into account who has the most to lose if there's a new paradigm replacing the current approach.
Another lesson: learn to communicate, and learn to follow through. You heard Jeff Hinton talking about the development of backpropagation in the mid 80s. You heard him say, "We didn't invent backpropagation, but we showed that it was important."
And actually, to be fair, they thought they were inventing backpropagation. They actually reinvented it, but they had no idea that somebody had invented it before, because whoever did didn't succeed in waking up the research community to the fact that they had a really good idea. I don't know why. Maybe they didn't put in the effort or succeed in communicating. Maybe they dropped it after they did it, went some other direction, so they didn't follow through to provide the evidence.
But that kind of thing happens frequently, and successful researchers are good communicators, and they follow through to push the field to pay attention.
The final lesson I think is the philosophers were actually right.
Today, despite these amazing capabilities of our learning systems, we don't have a proof or anything like a
rational justification of why you can generalize from examples to get these general rules that work well. Despite
the success that we have, we don't really understand at this very fundamental level why. And I think that if we did pay more attention to that
question, we might have a better chance to develop algorithms that outperform what we have today.
So I'll stop there. Thank you very much.