An Observation on Generalization
By Simons Institute for the Theory of Computing
Summary
## Key takeaways

- **Supervised learning is mathematically sound.** Supervised learning is well understood due to precise mathematical conditions that guarantee success. If training loss is low and model complexity is less than the training set size, test error is guaranteed to be low. This provides a solid foundation for collecting data and improving models with certainty. [02:56], [04:48]
- **Unsupervised learning lacks clear mathematical guarantees.** Unlike supervised learning, there's no clear mathematical framework for unsupervised learning. While we can optimize objectives like reconstruction error, it's unclear why this leads to good results for a different, desired objective, creating a high level of mystery. [07:50], [10:15]
- **Compression offers a path to understanding unsupervised learning.** The language of compression provides a more intuitive way to think about unsupervised learning than prediction. A good compressor, when applied to concatenated datasets, will naturally leverage shared structures between them, offering a clue to how patterns in unlabeled data can help with a supervised task. [15:15], [16:18]
- **Kolmogorov complexity as the ultimate, uncomputable compressor.** Kolmogorov complexity represents the shortest program to output data, acting as the ultimate compressor. While not computable, it serves as a theoretical benchmark for zero regret in unsupervised learning, implying that a perfect compressor would extract all possible predictive value from unlabeled data. [20:48], [23:33]
- **Neural nets approximate Kolmogorov compressors.** Large neural networks, trained with SGD, can be viewed as performing a search over programs. As networks grow, they approximate the ideal Kolmogorov compressor, suggesting that larger models are desirable because they get closer to this theoretical ideal of zero regret and maximum information extraction. [24:47], [30:32]
- **Next pixel prediction yields strong image representations.** Applying the next pixel prediction task, similar to next word prediction, to images has shown success. This autoregressive approach, as seen in iGPT, leads to good unsupervised learning results and better linear representations compared to models like BERT, suggesting a powerful unsupervised learning strategy for vision. [32:21], [33:41]
Topics Covered
- Why is unsupervised learning so mysteriously magical?
- A good compressor will find your data's shared structure.
- Big neural nets approximate an uncomputable ideal compressor.
- Next-pixel prediction proves compression works for vision.
- Why do autoregressive models learn better linear representations?
Full Transcript
YEJIN CHOI: It has a citation count that's six digits.
It's more than 139,000, and then many more.
So super excited to hear what Ilya has to say
about LLMs and the future.
Take it away.
ILYA SUTSKEVER: OK.
Hi.
[APPLAUSE]
Hi.
Hi everyone.
Thank you for the introduction.
When Umesh invited me to attend this event,
I got really excited.
I saw all the speaker list.
And I thought, great, I'll go and I'll talk about something.
And then I ran into a problem, which
is that a lot of the technical work
that we're doing at OpenAI, I actually can't talk about.
[LAUGHTER]
So I was really wracking my head what could I talk about.
Right now, so as of not too long ago,
I switched all my focus to working on AI alignment,
all my research focus.
And we'll have very cool results to show there, but not just
yet.
So I'm like, so this would have to be for the next talk.
But what would be good for this talk?
And I came up with something.
I'll tell you about some very old results
that we had at OpenAI many years ago, back in 2016 even,
which really affected the way I think
about unsupervised learning.
And I thought I'd share them with you.
It is possible that at this point,
you will find them obvious.
But maybe not all of them.
So there is at least a small chance
that you'll find it interesting.
So I want to set the expectations modestly, and then
hopefully exceed them.
So a theory of unsupervised learning.
How can such a thing exist?
Before we talk about unsupervised learning,
we want to talk about learning in general.
What is learning?
And why does learning work at all?
Why should learning work at all?
And why should computers be able to learn?
And now we're just used to the fact.
We take it for granted that neural networks learn.
But why do they?
Mathematically, why should they?
Why would data have regularity that our machine learning
models can capture?
So that's not an obvious question.
And one important conceptual advance
that has taken place in machine learning many years ago
by multiple people was the discovery and the formalization
of supervised learning.
So it goes under the name of PAC learning,
or statistical learning theory.
And the nice thing about supervised learning
is that it gives you a precise mathematical condition
under which learning must succeed.
You're told that if you have some data, from some data
distribution, that if you manage to achieve a low training
loss and the number of your degrees of freedom
is smaller than your training set, then
you will achieve low test error.
And you are guaranteed to do so.
So you have a mathematical condition
where you can say, well, if I did
find a function out of my function class,
which achieves low training error,
then learning will succeed.
And you can say, yeah, this is a very sensible
mathematical thing that we can reason about.
And that's why supervised learning is easy.
And then you had all these theorems,
which I thought were simple.
I found them elegant.
You've seen maybe theorems like this.
If you have your--
basically, it's like this is the sort of thing
where if you know it, you're going
to say, oh, yeah, that thing.
And if you don't know it, it will not
be possible to explain it in 30 seconds.
Though, it is possible to explain it in 5 minutes, just
not 30 seconds.
None of this stuff is complicated.
And you've got your little proof which says, well,
if you have some number of functions in your function
class, the probability that your train error will be far
from your test error for at least one function,
there is some math.
And the math is simple.
This is all the math-- three lines of math.
So three lines of math can prove all of supervised learning.
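For reference, a minimal version of the bound being sketched here, assuming a finite function class $\mathcal{F}$ and $N$ i.i.d. training examples, is the usual Hoeffding-plus-union-bound statement:

$$
\Pr\!\Big[\exists\, f \in \mathcal{F}:\ \big|\mathrm{err}_{\mathrm{train}}(f) - \mathrm{err}_{\mathrm{test}}(f)\big| > \varepsilon\Big] \;\le\; 2\,|\mathcal{F}|\, e^{-2\varepsilon^2 N},
$$

so as long as $\log|\mathcal{F}|$ is small compared to $N$, a function with low training error also has low test error with high probability.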
Well, that's nice.
That's very nice.
So supervised learning is something
that is well understood--
comparatively speaking, well understood.
We know why it must succeed.
So we can go forth and collect large supervised learning data
sets and be completely certain that models
will keep getting better.
So that's the story there.
And yeah, I forgot to mention a very important piece
of these results.
The test distribution and the training distribution
need to be the same.
If they are the same, then your theory of supervised learning
kicks in and works and will be successful.
So conceptually, it is trivial.
We have an answer for why supervised
learning works, why speech recognition should work, why
image categorization should work,
because they all reduce to supervised learning, which
works, which has this mathematical guarantee.
So this is very nice.
Here I want to make a small side comment on VC dimension
to those of you who care about such things.
There may be a small subset.
So if you want to zone out for the next 30 seconds,
feel free to do so.
So a lot of writings about statistical learning theory
emphasized the VC dimension as a key component.
But the main reason the VC dimension--
in fact, the only reason the VC dimension was invented
was to allow us to handle parameters
which have infinite precision.
The VC dimension was invented to handle parameters
with infinite precision.
So if you have a linear classifier,
every parameter is infinite precision.
But then, of course, in reality, all our floats
are finite precision, and their precision is shrinking.
So you can-- so you have your--
so the number of functions that are implemented by a computer
is actually small.
And you can reduce it back to this little formula, which
gives you pretty much all the best bounds
that you can get from supervised learning.
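Concretely, assuming $d$ parameters stored in $b$ bits each, the function class is finite with

$$
|\mathcal{F}| \;\le\; 2^{db},
$$

so plugging into the bound above gives a generalization gap on the order of $\sqrt{(db + \log(1/\delta))/N}$, with no VC machinery needed.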
So I find this to be cool.
I find this to be appealing because you
have fewer lines of proof.
Now, let's talk about unsupervised learning.
So I claim-- first of all, what is unsupervised learning?
What is it?
Supervised learning is like, yeah, you've
got your data that says, here, do this, here, do that,
and you have all this data.
You do well in your training error.
Your training error is low.
You have more training data than degrees
of freedom in your function class, parameters.
There may be other things, too-- degrees of freedom, like bits.
And you say, your supervised learning will succeed.
But what is unsupervised learning?
What can you say at all about unsupervised learning?
And I'll say that at least I have not
seen like an exposition of unsupervised learning, which
I found satisfying.
How to reason about it mathematically?
We can reason about it intuitively,
but can we reason about it mathematically?
And for some context, what is the old dream
of unsupervised learning, which, by the way,
this dream has been fulfilled, but it's fulfilled empirically.
Can we go just a tiny bit beyond the empirical results,
like the idea that you just look at the images,
or you just look at the text without being told what
to do with them, and somehow you still discover
the true hidden structure that exists in the data,
and somehow it helps you?
But why should it happen?
Should it happen?
Should we expect it to happen?
You cannot-- you don't have anything remotely similar
to the supervised learning guarantee.
The supervised learning guarantee
says, yeah, get your low training error,
and you're going to get your learning.
It's going to be a great success.
So unsupervised learning, it appears--
it appears that it's not this way.
People were talking about it for a long time, in the '80s.
The Boltzmann machine was already
talking about unsupervised learning.
And unsupervised learning also did not work at small scale,
but all the ideas were there, like denoising autoencoder,
for those of you who remember it.
It's like BERT or the diffusion model.
Within it, it's a tiny twist, the tiniest of twists.
The language models of old times,
they also generated cool samples for their time,
but their unsupervised learning performance was not as
impressive as that of today's models.
But I want to make the case that it is confusing,
because it's like--
why is it confusing?
You optimize.
How does unsupervised learning work?
You say, let's optimize some kind of reconstruction error,
or let's optimize some kind of denoising error,
or some kind of self-supervised learning error.
You optimize one objective.
Oh yes.
I just said that.
But you care about a different objective.
So then doesn't it mean that you have no reason
to expect that you will get any kind
of good unsupervised learning results,
or rather you do get them empirically,
but like, is it going to be--
the level of mystery is quite high, I claim.
It seems like a totally inaccessible phenomenon.
You optimize one objective, but you
care about another objective.
And yet it helps.
How can that be?
Magic.
The other thing, by the way, is that unsuper--
well, I guess I'm going to say something which is only 90%
true.
Unsupervised learning doesn't-- you could say, OK,
so you just learn the structure and the input distribution,
and it helped you.
But then, what if you're training
from the uniform distribution?
Then all your unsupervised learning algorithms will fail.
How should you think about that?
So what can we say?
Do we need to make assumptions?
What kind of assumptions?
So I'd like to present potentially
a way of thinking about unsupervised learning, which,
I think, I personally find it interesting.
Perhaps you'll find it interesting, too.
Let's find out.
So I want to show you one way of doing unsupervised learning,
which is not necessarily widely known, because it never
became the dominant way of doing unsupervised learning.
But it has the cool feature that similarly
to supervised learning, it has to work.
So what kind of mysterious unsupervised
learning procedure where you're not
giving any labels to any of your inputs
is still guaranteed to work?
Distribution matching, distribution matching.
So what is distribution matching?
Say I've got my data.
I've got X, and I've got Y, data sources.
There is no correspondence between them.
I've just got two data sources, data source X and data source
Y, language one, language two, text, speech--
no correspondence between them.
Let us look at this criterion.
Find the function F such that the distribution of F of X
is similar to the distribution of Y. It is a constraint on F.
And in the case of machine translation and speech
recognition, for example, this constraint might be meaningful.
You could say, yeah, if I have long sentences--
if you tell me that you have a function
such that the distribution-- you take your distribution
of English sentences.
You apply the function F to them,
and you get something which is very
similar to the distribution of French sentences, you can say,
OK, I found the true constraint about F.
If the dimensionality of X and the dimensionality of Y
is high enough, that's going to give you a lot of constraints.
In fact, you might be able to almost fully recover F
from that information.
This is an example of supervised learning--
of unsupervised learning, where it is still guaranteed
to work in the same sense that supervised learning is
guaranteed to work.
Also, substitution ciphers, like little simple encryptions,
will also fit into this framework.
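As a toy illustration of distribution matching, not from the talk, here is a minimal sketch for the substitution-cipher case: with no paired examples at all, matching the unigram letter distribution of the ciphertext to that of an unpaired reference corpus already recovers most of the mapping (a real attack would match n-gram distributions too; the names and inputs below are hypothetical).

```python
from collections import Counter
import string

def freq_order(text):
    """Letters of `text`, sorted from most to least frequent."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return [c for c, _ in counts.most_common()]

def match_distributions(ciphertext, reference_corpus):
    """Toy distribution matching: map the i-th most frequent cipher letter
    to the i-th most frequent reference letter. No correspondence between
    the two data sources is ever used, only their distributions."""
    return dict(zip(freq_order(ciphertext), freq_order(reference_corpus)))

# Usage (hypothetical inputs):
# mapping = match_distributions(ciphertext, big_english_corpus)
# decoded = "".join(mapping.get(c, c) for c in ciphertext.lower())
```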
So that's the thing.
And so I ran into this.
I independently discovered this in 2015.
And I got really fascinated by it,
because I thought, wow, maybe there
is something meaningful, mathematically meaningful
that we can say about unsupervised learning.
But let's see.
The thing about this setup is that still it's
a little bit artificial.
It's still-- real machine learning
setups aren't like this.
And the way we like to think about unsupervised learning
isn't like that either.
Now I'll present to you the meat, the meat
of what I wanted to say--
a proposed way to think about unsupervised learning
that lets you--
that puts it on par with supervised learning.
So OK, what is it doing mathematically?
How can you be sure that your unsupervised learning is good?
Compression to the rescue.
Obviously, it is well-known--
I shouldn't say obviously.
It is not obvious, but it is well-known
that compression is prediction.
Every compressor can be a predictor, and vice versa.
There is a one-to-one correspondence
between all compressors and all predictors.
However, I would argue that for the purpose of thinking
about unsupervised learning, the language of compression
offers some advantages, at least, for me, it did.
Perhaps it will for you, too.
So consider the following thought experiment.
This thought experiment is the most important slide.
Say you have two data sets, X and Y. You have two data
sets, two files on your big giant hard disk.
And say you have a really great compression algorithm
C, which takes data in, and outputs compressed objects out.
Say you compress X and Y jointly.
You concatenate them.
You take the two data sets, and you concatenate them.
And you feed them to your compressor.
What will happen?
Well, let's see.
What, and in particular, an important question
is, what will a sufficiently good compressor do?
My answer is, very intuitively, it
will use the patterns that exist inside X to help
it compress Y, and vice versa.
You could make the same claim about prediction,
but somehow it's more intuitive when
you say it about compression.
I don't know why that is, but I find it to be the case.
So that's our clue.
And you can make an equation like this, where you say, hey,
if your compression is good enough, if it's
like a real great compressor, it should
say that the compression of your concatenation
of your giant files should be no worse
than the separate compression of your two files.
So any additional compression that
was gained by concatenation is some kind of shared structure
that your compressor noticed.
And the better your compressor is, the more shared structure
it will extract.
The gap is the shared structure, or the algorithmic mutual
information.
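A minimal sketch of this thought experiment with an off-the-shelf compressor (zlib here, standing in for the "sufficiently good" compressor; as noted later in the Q&A, gzip-style compressors are weak, so the gap they find is only a crude estimate of the shared structure):

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Bytes needed by zlib, standing in for the compressor C."""
    return len(zlib.compress(data, 9))

x = b"the cat sat on the mat. " * 200   # dataset X
y = b"the dog sat on the mat. " * 200   # dataset Y, sharing structure with X

separate = compressed_size(x) + compressed_size(y)   # C(X) + C(Y)
joint = compressed_size(x + y)                       # C(concat(X, Y))
print(joint, "<=", separate, "gap:", separate - joint)
# The gap is (a crude estimate of) the shared structure the compressor noticed.
```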
So that's interesting, right?
You can see what I'm alluding to.
Y is the data of your supervised task.
X is your unsupervised task.
But suddenly, you have some kind of mathematical reason
for the information for the patterns in X to try to help Y.
Notice also how it generalizes distribution matching.
If we are in the distribution matching case, where you've got
your X is language one and Y is language two,
and you are saying--
and you know that there exists some simple function
F that transforms one distribution into the other,
surely, your compressor, if it's good,
you'll notice that, and make use of that,
and maybe even internally try to recover this function.
I think that's pretty cool.
We've closed the circle.
So how do we formalize it then?
What will be the formalization of unsupervised learning?
Let's see.
Let's see if I can do it.
Consider an ML algorithm.
So here, by the way, in what follows, I'll be a bit sloppy,
and I will use compression and prediction interchangeably.
Say you have a machine learning algorithm A. It's an algorithm
A, and it tries to compress Y. And say it has access to X. X
is file number one, and Y is file number two.
You want your machine learning algorithm, your compressor
to compress Y, and it can probe X as it sees fit.
The goal is to compress Y as well as possible.
We can ask ourselves, what is our regret
of using this particular algorithm?
You'll see what I'm getting at.
Regret relative to what?
If I do a good enough job, if I have--
being low regret means that I've gotten all the help
that I can possibly get from this unlabeled data.
This unlabeled data has helped me as much as possible.
And I don't feel bad about it.
I don't feel bad.
I don't feel like I've left some prediction value on the table
that someone else with a better compression algorithm
could have used.
That's what it means.
And in particular, it's like, yeah, you can go,
and you can sleep happily at night, knowing that if there
is some information in the unlabeled data that could
help my task, no one else--
no one could have done as good of a job as me.
I've done the best job at benefiting
from my unlabeled data.
So I think that is a step towards thinking
about unsupervised learning.
You don't know if your unsupervised data
set is actually useful.
It may be super useful.
It may have the answer.
It may be totally useless.
It may be the uniform distribution.
But if you have a low regret unsupervised
learning algorithm, you can say, whether it's
the first case or the second case, I know I've done my best.
I know I've done my best at benefiting
from my unlabeled data.
And no one could have done better than me.
Now, I want to take a detour to theory land, which
is a little obscure.
I think it is interesting.
Kolmogorov complexity as the ultimate compressor
gives us the ultimate low regret algorithm,
which is actually not an algorithm,
because it's not computable.
But you will see what I mean really quickly.
So Kolmogorov-- first of all, for some context,
who here is familiar with Kolmogorov complexity?
OK, about 50%.
Kolmogorov complexity is the kind of thing
which is easy to explain in 1 minute, so I'll just do it.
It's like imagine if I give you some data,
or you give me some data.
And I'm going to compress it by giving you
the shortest program that can possibly exist,
the shortest program that exists, which, if you run it,
outputs your data.
AUDIENCE: There's a typo.
[INAUDIBLE] Instead of Y, it should be X.
ILYA SUTSKEVER: Yes, that is correct.
You got me.
[LAUGHTER]
It is the length of the shortest program which outputs X. Yes.
Intuitively, you can see that this compressor is quite good,
because you can prove this theorem, which
is also really easy to prove, or rather, it's easy to feel it.
And once you feel it, you could kind of believe me
that it's easy to prove.
And you can basically say that your Kolmogorov compressor,
if you use that to compress your strings,
you'll have very low regret about your compression quality.
You can prove this result. It says
that if you've got your string X,
your data set, your database, whatever, then the shortest
program which outputs X is shorter than whatever
output your compressor produced, however well
your compressor compressed your data, plus a little term, which
is however many characters of code you need
to implement your compressor.
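Written out, with the constants hedged, the result being described is

$$
K(X) \;\le\; |C(X)| + K(C) + O(1),
$$

where $|C(X)|$ is the length of your compressor's output on $X$ and $K(C)$ is essentially the length of the program implementing the compressor $C$.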
Intuitively, you can see how it makes sense, the simulation
argument.
The simulation argument, if you tell me, hey, I've
got this really great compressor C, I'm going to say,
cool, does it come with a computer program?
Can you give this computer program to K?
And K is going to run your compressor,
because it runs computer programs;
you just need to pay for the program length.
So without giving you the details,
I think I gave you the feel of it.
Kolmogorov complexity, the Kolmogorov compressor,
can simulate other compressors
by running their computer programs.
This is also why it's not computable.
It's not computable because it simulates--
it feels very much at liberty to simulate all computer programs.
But it is the best compressor that exists.
And we were talking about good compression
for unsupervised learning.
Now, let us generalize a Kolmogorov complexity,
a Kolmogorov compressor to be allowed
to use side information.
I'll talk about this in more detail.
I'll reiterate this point several times,
because this point is important.
Obviously, the Kolmogorov compressor is not computable.
It's undecidable.
But like, hey, because it like searches over all programs,
did you know that if you run SGD over the parameters
of some neural net with 100 layers,
it's automatically like doing program search over a computer,
which has a certain amount of memory,
a certain number of steps.
It's kind of like a micro, micro, micro Kolmogorov compressor
that you get by fitting a neural net.
You kind of see the feel, the similarity.
It's kind of magical, right?
Neural networks can simulate real programs.
They are little computers.
They're circuits.
Circuits are computers, computing machines.
And SGD searches over the program.
And all of deep learning hinges on top of the SGD miracle
that we can actually train these computers with SGD.
That works.
We actually find those circuits from data.
Therefore, we can compute our miniature Kolmogorov
compressor.
A simulation argument applies here as well, by the way,
I just want to mention this one fact.
I don't know if you've-- if you've ever tried to design
a better neural network architecture,
what you'd find is that it's kind of hard to find a better
neural network architecture.
You say, well, let's add this connection.
Let's add that connection.
And let's modify this and that.
Why is it hard?
The simulation argument, because your new architecture
can be pretty straightforwardly simulated
by your old architecture.
Except when it can't.
Those are rare cases.
And in those rare cases, you have a big improvement,
such as when you switch from the little RNN to the transformer.
The RNN has a bottleneck, the hidden state.
So it has a hard time implementing the transformer.
However, had we found a way to engineer [INAUDIBLE]
a very, very large hidden state, perhaps it
would become as good as a transformer again.
So this is the link.
You start to see how we switch from the formal land
to neural network land.
But you see the similarity.
So conditional Kolmogorov complexity as the solution
to unsupervised learning--
you can basically have a similar theorem,
where I'm not going to define what-- well, I'm
going to define what K of Y given X is.
It's like the shortest program which outputs Y,
if it's allowed to probe X. That's
the length of that shortest program which outputs Y
if it is allowed to probe X. And you can prove the same result.
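In symbols, the conditional version is the same bound with side information, for any computable compressor $C$ that may probe $X$ while compressing $Y$:

$$
K(Y \mid X) \;\le\; |C(Y \mid X)| + K(C) + O(1).
$$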
And you can see immediately that, yeah, this
is definitionally the solution to unsupervised learning.
If you use that, you can sleep very, very safely at night--
soundly at night, knowing that no one does
unsupervised learning better than you.
It's literally that.
So this is the ultimate low regret solution
to unsupervised learning, except that it's not computable.
But I do think it's a useful framework.
And here we condition on a data set, not an example.
This thing will extract all the value out
of X for predicting Y-- the data set, the data set,
not the example.
So this is the solution to unsupervised learning--
done success.
And there is one little technicality
which I need to spend a little bit of time talking about,
which is we were talking about this conditional Kolmogorov
complexity, where you are talking about compressors that
get to try to compress one thing while having access
to another thing.
And this is a bit unnatural in a machine learning context
when you care about feeding big data sets, at least
as of today.
Although, it's changing fairly rapidly.
But I think it's still fair to say
that there is no truly good way of conditioning on a big data
set.
You can fit a big data set, but you can't condition on it,
not yet, not truly.
So this result says that, hey, if you
care about making predictions about your supervised task Y,
using the good old-fashioned Kolmogorov compressor, which
just compresses the concatenation of X and Y
is going to be just as good as using
your conditional compressor.
There are more details and a few subtleties to what I just said.
And I'm happy to talk about them offline in the event someone
is interested in them, but that basically
says that if before you were saying, hey--
the previous slide said that you can
use this conditional Kolmogorov compressor to solve
unsupervised learning.
This says that you can also use your regular Kolmogorov
compressor.
Just throw all your data.
Take all your files.
Concatenate them.
Compress them.
And that's going to make some great predictions
on your supervised task, the task that you care about.
There are some intuitions about why this is true.
This result is actually like--
proving this is slightly more tricky, so I won't do it.
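One standard ingredient behind this claim, stated loosely and with the subtleties Ilya alludes to swept into the error term, is the Kolmogorov-Levin symmetry of information:

$$
K(X, Y) \;=\; K(X) + K(Y \mid X) + O(\log K(X, Y)),
$$

so a compressor that does nearly optimally on the concatenation has implicitly done nearly optimally on $Y$ given $X$.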
And so the solution to unsupervised learning,
just give it all to your Kolmogorov complexity--
to your Kolmogorov compressor.
The final thing is I'll mention that this kind
of joint compression is maximum likelihood if we don't overfit.
If you have a data set, then the negative sum
of the log-likelihoods, given your parameters, is the cost
of compressing the data set.
You also need to pay the cost of compressing the parameters,
but you can see if you now want to compress two data
sets, no problem.
Just add more points to your training set, to your data set,
and add the terms to the sum.
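In symbols, the two-part code being described is, roughly,

$$
\mathrm{bits}(\mathrm{data}) \;\approx\; -\sum_i \log_2 p_\theta(x_i) \;+\; \mathrm{bits}(\theta),
$$

and jointly compressing two data sets just extends the sum over both of them.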
So this concatenation, this joint compression
is very natural in a machine learning context, which
is why it was worth the hassle of saying, yeah,
we have our conditional Kolmogorov complexity.
And then I made some arguments.
I made some claim without fully defending it,
but it is possible to defend it, but using regular--
just compress everything.
Kolmogorov complexity works also.
And so I think this is elegant.
I like it, because then it says, well,
if you squint hard enough, you can
say that this explains what our neural network is doing.
You can say, hey, SGD over big neural networks
is our big program search.
Bigger neural networks approximate the Kolmogorov
compressor more and more and better and better.
And so maybe this is also why we like big neural nets,
because we approach the unapproachable idea
of the Kolmogorov compressor, which truly has no regret,
and we just want to have less and less
regret as we train larger and larger neural nets
as far as extracting predictive value is concerned.
Now, the way this applies to the GPT models--
you know, I claim that it applies to the GPT models also,
this theory.
But the thing that's a little bit tricky with the GPT models
is that the theory of their behavior
can also be explained without making
any references to compression or unsupervised learning.
You just say, no, it's just the conditional distribution
of text on the internet.
Few-shot learning, just imagine a document
with some repeated patterns.
The pattern is probably going to continue.
So the GPT models can intuitively be explained,
at least their few-shot behavior can definitely
be explained without alluding to this theory.
And so I thought it would be nice-- can we
find some other direct validation of this theory?
Now, can we find a different domain, like vision?
Because vision, you have pixels.
Can you show that doing this on pixels
will lead to good unsupervised learning?
The answer is, yes, you can.
This is work we've done in 2020.
We just called it iGPT.
And it's an expensive proof of concept.
It's not meant to be--
it was not meant to be a practical procedure.
It meant to be a paper that showed
that if you have a really good next step predictor,
you're going to do great unsupervised learning.
And it was proven in the image domain.
And there, I'll just spell it out.
You have your image.
You turn it into a sequence of pixels.
Each pixel should be given some discrete value of intensity.
And then just do next pixel prediction.
With the same transformer.
That's it-- different from BERT.
Just next token prediction, because this
maximizes the likelihood, therefore compresses.
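A minimal sketch of this recipe, with made-up toy sizes (iGPT itself used much larger transformers on bigger color images): flatten each image into a pixel sequence, quantize intensities into discrete tokens, and train with next-token cross-entropy under a causal mask. Everything here (NUM_LEVELS, SEQ_LEN, the tiny model) is an illustrative assumption, not the paper's configuration.

```python
import torch
import torch.nn as nn

NUM_LEVELS = 16      # discrete intensity levels per pixel (toy choice)
SEQ_LEN = 32 * 32    # 32x32 grayscale images, flattened (toy choice)

def images_to_tokens(images):              # images: (B, 32, 32), floats in [0, 1]
    return (images * (NUM_LEVELS - 1)).round().long().flatten(1)   # (B, SEQ_LEN)

class TinyPixelGPT(nn.Module):
    """Stand-in autoregressive model; iGPT used a much larger transformer."""
    def __init__(self, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(NUM_LEVELS, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, NUM_LEVELS)

    def forward(self, tokens):             # tokens: (B, T)
        T = tokens.size(1)
        h = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.encoder(h, mask=mask)     # causal mask: each pixel sees only the past
        return self.head(h)                # (B, T, NUM_LEVELS) next-pixel logits

model = TinyPixelGPT()
images = torch.rand(8, 32, 32)
tokens = images_to_tokens(images)
logits = model(tokens[:, :-1])             # predict pixel t from pixels < t
loss = nn.functional.cross_entropy(logits.reshape(-1, NUM_LEVELS),
                                   tokens[:, 1:].reshape(-1))
loss.backward()   # minimizing this maximizes likelihood, i.e. compresses the pixel stream
```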
And like we see, one immediate result we see--
so these are results on CIFAR-10.
You have models of different sizes.
This is their next step prediction accuracy
on their pixel prediction task on their unsupervised learning
task.
And this is linear probe accuracy, linear probe,
when you pick some layer inside your neural net, the best
layer, fit a linear classifier, and see how well it does.
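A minimal sketch of such a linear probe, assuming you can extract some chosen layer's activations from the frozen pretrained model as fixed feature vectors (the function name and inputs are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit only a linear classifier on frozen features; the pretrained net is untouched."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.asarray(train_feats), np.asarray(train_labels))
    return clf.score(np.asarray(test_feats), np.asarray(test_labels))
```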
You've got these nice looking curves,
and they start to get more and more similar.
It's kind of what you want.
It's like it's working.
Pixel prediction of the same variety
as next word prediction, next--
not just pixel-- not mere pixel prediction,
but next pixel prediction leads to better and better
unsupervised learning.
I thought that was pretty cool.
And we tried it-- we scaled it up to various degrees,
and we showed that, indeed, it learns reasonably well
where-- we came close, but we didn't fully
close the gap between our unsupervised learning
and the best unsupervised learning
of the day on ImageNet.
But it was clearly a mat-- it did feel like it scaled,
and it was a matter of compute, because these things use
large, high-resolution images, and we use like tiny 64
by 64 images with giant transformers for the day--
tiny by today's standards, but giant for the day, 6
billion parameters.
So this is like, yeah, you do your unsupervised next pixel
prediction on a big image data set,
and then you fit your linear probe on ImageNet.
And you get strong results.
Like, OK, that is cool.
CIFAR, yeah-- like we got-- one of the cool things
is that if you got 99% on CIFAR-10 with this,
that was cool.
2020 was a different time, but it was cool.
I'll mention-- I'll close with a couple of comments
on linear representations.
The thing-- I like the compression theory,
because I feel like it just bothered me for a long time
that you couldn't think rigorously
about unsupervised learning.
I claim now you can, at least to some partial degree.
Still lots of waving of hands is needed,
but maybe a bit less than before.
But it doesn't say why your representations
should be linearly separable.
It doesn't say a linear [INAUDIBLE] should happen.
But linear representations happen all the time.
And the reason for their formation
must be deep and profound.
And I think we might be able to crisply articulate it
at some point.
One thing which I thought was interesting
is that these next pixel prediction
models, autoregressive models seem
to have better linear representations than BERT.
And the blue accuracy is BERT versus autoregressive.
I'm not sure why that is, or rather I can speculate.
I have some speculations.
But it would be nice to--
I think it would be nice to gain more understanding for why,
like really why those linear representations are formed.
And yeah, this is the end.
Thank you for your attention.
[APPLAUSE]
AUDIENCE: Could you provide the speculation?
ILYA SUTSKEVER: Yeah.
Oh yeah yeah.
I think the speculation is basically
that if you're doing next pixel prediction,
you're predicting the next pixel from all the previous pixels.
So you need to look at the long range structure.
But in BERT, you have your vector.
And let's say you drop 25% of the tokens, of the pixels,
in this case, then any prediction
that you have can actually be done quite well by looking
a little bit into the past and a little bit into the future.
So you remove all the hard prediction tasks,
or you make them much easier.
The hardest prediction task in the next pixel prediction
is a lot harder than the hardest prediction task in the BERT
prediction case.
So this is an argument.
We may even be able to test it if we think--
if we try to design an experiment to test it.
But yeah, that's the speculation.
AUDIENCE: Is there a more robust 2D version
of the next pixel prediction?
ILYA SUTSKEVER: I would say more robust--
I would say any formally-- anything
you do at all that turns a neural network
into a probabilistic model that assigns probabilities
to different inputs is that.
So the other big way of doing next token prediction is
diffusion models, diffusion models also like--
not the ones that are so--
the diffusion models that people use in like high quality image
generators don't really maximize the likelihood of their inputs.
They have a different objective.
But the most original formulation
is maximizing likelihood.
And by the way, the diffusion model is a counterargument to--
well, rather, the diffusion model also, I would claim,
should have worse representations than a next
token prediction model for the same reason that BERT doesn't.
So I feel this further, in my mind,
increases the mystery of what causes
linear representations to form.
AUDIENCE: Yeah.
Thanks for the talk.
I liked the analogy between Kolmogorov complexity
and neural networks.
I guess one kind of place where it
seems like there's a disanalogy is that neural networks have
training dynamics.
And if you do something like just take all the computer
programs, it's more memory.
So if you're compressing with Kolmogorov complexity,
the order of the data doesn't matter.
It clearly does for neural nets.
You have simple heuristics and features
that are learned early in training that probably
stick around late in training.
And so, I guess, do you have thoughts
on how to think about that in the context of this picture?
ILYA SUTSKEVER: Yeah.
I mean, the analogy isn't perfect.
That is true.
And the way in which you describe its--
the place where that analogy breaks down the most
is in the search procedure.
This is really what you're alluding to, like, hey,
order of data matters.
Why?
Because we are using a weak search procedure.
A Kolmogorov compressor uses the infinitely expensive search
procedure of just enumerating everything from scratch
each time.
So yeah, I mean it is--
I mean-- I would say that caution should be used
with this particular analogy.
It's very clearly not universally applicable.
But I do think that in the case--
I do think that there is some value here in explaining
where unsupervised learning comes from.
AUDIENCE: So you are-- now that your formal--
going back and forth between compression and the next bit
prediction, you are connecting that
to-- relating unsupervised learning
to supervised learning.
So if you backtrack from cryptography,
there's sort of a theory.
It goes back to the '80s, where they talk about compression
being equivalent to next bit prediction,
being equivalent to being able to distinguish
two distributions.
So if you have an [INAUDIBLE], then you have an algorithm that
can predict--
then you have an algorithm that can compress.
I mean, cryptography is the other way.
You're saying there is no way to compress [INAUDIBLE].
And therefore, there is no way [INAUDIBLE].
So I wonder if this idea of being able
to distinguish, how could that translate
to anything naturally?
ILYA SUTSKEVER: So I think I understand the question.
Let me paraphrase.
So I'm not super well-versed in the area that you mentioned.
But you're saying, if you can distinguish
through distribution, you can use that distinguishability
to make some--
to predict.
And I guess the question would be, what's the--
I have some--
AUDIENCE: You have samples from one.
You have samples from the other.
And you can tell them apart [INAUDIBLE].
ILYA SUTSKEVER: And so I guess my question
is, how well can you predict--
or you're talking about you can predict a little bit?
Is that what you're referring to.
AUDIENCE: Yeah, [INAUDIBLE].
You can predict what the next bit would be,
let's say, in a sample.
ILYA SUTSKEVER: So--
AUDIENCE: Actually, [INAUDIBLE], [INAUDIBLE] you can predict
those.
ILYA SUTSKEVER: I mean, I can mention
one thing that is related, which is energy-based models.
So energy-based models offer yet another way
of turning neural networks into probability distributions,
where an energy-based model will say, just give me
your configuration of vectors.
And I'm just going to tell you how it feels.
And then you normalize over all of them.
And I feel like when it comes to energy-based models
in particular, ratios of distributions
correspond to differences of energy.
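In symbols, for an energy-based model with energy $E$ and normalizer $Z$:

$$
p(x) \;=\; \frac{e^{-E(x)}}{Z}, \qquad \frac{p(x)}{p(x')} \;=\; e^{\,E(x') - E(x)},
$$

so ratios of probabilities correspond to differences of energies, and the intractable $Z$ cancels.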
Maybe there is some relation to what you're saying.
I think I'm probably not like precisely commenting
on the thing that you said, but I
don't think I have anything more to add, unfortunately.
AUDIENCE: Ilya, maybe I'll just put in a tiny word
to defend the honor of VC dimension.
ILYA SUTSKEVER: Go ahead.
AUDIENCE: Well, so in 2007, I had
this theorem about PAC learning of quantum states.
And this was precisely an example, where
even if you don't care about infinite precision,
even if you discretize the state space,
like you look at the VC dimension,
or the fat shattering dimension, or whatever,
and it is exponentially smaller than the log of the size
of the hypothesis class.
So I do think there are cases where you still
want VC dimension.
ILYA SUTSKEVER: That is a cool example.
[LAUGHTER]
AUDIENCE: So I didn't completely follow your notation.
Capital X was a sample from a distribution.
ILYA SUTSKEVER: Yeah.
AUDIENCE: The distributions--
ILYA SUTSKEVER: The data set.
AUDIENCE: So then the transformer, SGD--
I'm not sure if you can think of it as the best program
for compressing [INAUDIBLE].
Maybe because it's only given one sample at a time.
ILYA SUTSKEVER: That's true.
There is one other assumption that we are making.
If you look at the total training loss--
say you have some neural network.
It doesn't need to be a transformer--
some neural network which assigns
log probabilities to your data.
And you have some large number of training cases.
You can run the neural net.
You need some neural network.
You can compute its log probability on each case.
And you can take their sum.
And that's going to be the log probability
that the neural network assigns to the entire data set.
Now, the neural network is unable--
this particular formulation makes
the neural network unable to explicitly notice
maybe temporal, or any kind of structure
in the order of the data.
But I still claim it's meaningful to say
that, yeah, you can compute the entire log
probability of the data set.
And that gives you your negative log probability, literally
the number of bits you need to compress
this data set using this neural network as a compressor.
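A minimal sketch of the bookkeeping being described, assuming you have per-example natural-log probabilities from some model:

```python
import math

def dataset_bits(log_probs):
    """Code length, in bits, the model assigns to the whole data set:
    the negative sum of per-example log-probabilities, converted from nats to bits."""
    return -sum(log_probs) / math.log(2)

# e.g. dataset_bits([-2.3, -0.7, -1.1]) -> bits needed to compress these three examples
```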
AUDIENCE: If I can paraphrase, I think
you're arguing for compression as a framework
for understanding or motivating unsupervised learning.
And a point you made at the end was
that if you apply that framework to language models,
to next word prediction, it can feel a little maybe
superficial, because next word prediction, any text task
can be converted to next word prediction.
And so unsupervised learning is kind of superficially
the same as supervised learning for text.
So then we turn to image GPT, where
you can't formulate any given task as next pixel prediction.
And maybe you could, but let's just say you can't.
But then you can use the linear representation
as a way of showing that compression is a good way
to formulate unsupervised learning.
But then there are highly effective compressors
that wouldn't give you a useful linear representation.
So I'm wondering, are there any cases where
unsupervised learning and supervised learning
are not superficially the same, [INAUDIBLE],
but that you don't require your compressor
to give you an effective linear representation
to show that compression is a good unsupervised objective?
ILYA SUTSKEVER: Yes.
So that's a very--
I have thoughts on this question.
So first of all, I will say that the linear representation,
a good linear representation is a bonus.
At no point does this argument say,
and therefore linear representation should emerge.
However, I would argue that the theory does
say that good fine tuning should emerge,
because joint compression is kind
of like hacky approximate fine tuning using the bad search
algorithm, which is SGD.
Now, we know that the evidence from these old experiments
suggests that BERT learns worse linear representations
than next pixel prediction when run on images.
And perhaps the same is true for diffusion models.
That seems highly likely.
I think it would be very interesting to see
how the fine tuned diffusion models compare.
Maybe someone knows already here.
AUDIENCE: So I wonder if you can take
your ideas for unsupervised learning and bring them back.
It seems to me that there could be
some insights you could take back
to the supervised learning [INAUDIBLE].
ILYA SUTSKEVER: Well, I think here
this gives you more insight into the-- so the particular insight
that you might get here is insight into the function
class, the desired function class.
You want your neural net to have many layers.
For example, if you have many layers,
it can do more thinking.
You want those layers to be wider, bigger neural nets.
I mean, essentially, this is like what the field has already
discovered.
AUDIENCE: But you might not need more examples than the number
of [INAUDIBLE].
In supervised learning, this might
be an explanation for why you don't need
necessarily more parameters.
ILYA SUTSKEVER: That's right.
So this might, for example, be--
so there is a--
so the weakness of this theory-- by the way, this theory
has a huge practical weakness, which
is it ignores compute cost.
It only focuses on information.
So you're all familiar with the universal transformer, maybe--
basically, it's like it's a transformer.
We use the same weights at each layer.
It's a great idea, except that if you
want to have a lot of parameters,
you need to pay for it a lot with compute.
Nobody wants to do that.
In reality, you don't get to ignore compute costs.
This theory ignores compute costs.
If you ignore compute costs, it gives you
a recipe of how to proceed.
AUDIENCE: What do you think the importance is
of the autoregressive modeling in particular
of the probability distribution?
You can view BERT as sort of doing maximum likelihood
training.
I think you can find ways to sample out of BERT,
for example.
ILYA SUTSKEVER: So certain diffusion models
can be set up to be maximum likelihood models.
And so all this theory predicts, the diffusion models
should also be--
it should be possible to make diffusion models
to do equally great things, perhaps
with some constant factors of--
because this-- as the earlier answer,
this is not a compute sensitive theory.
So it's going to say, OK, maybe you
need like a factor of 10 or 15 compute,
and then things will be the same between the autoregressive
and diffusion model.
The autoregressive model, it's just simple.
It's convenient.
Maybe the energy-based model will do even greater things.
But from that perspective, they're all the same.
AUDIENCE: It seems like GPT-4 may
be the best compressor at the moment, which presumably
is the largest model out there as well.
So on one hand, it's able to compress better.
On the other hand, the size of the compressor
itself is increasing.
Or is it that it's actually not necessarily the case
from the perspective of the theory [INAUDIBLE] complexity?
ILYA SUTSKEVER: Well, I mean what matters
is that the cost of the--
what you really want is for the--
there is another way in which this theory doesn't quite
correspond to reality, because this theory says,
you have a fixed data set you want to compress.
But in the way we train the GPT models,
you say, you have a big training set,
and then you have a test set, which
is assumed to be infinite.
If the test set is infinite and you
care about compressing the test set,
then you don't care about the size of your compressor,
so long as this test set size is much larger.
So I would say that this is also, like, I would say,
a disanalogy, where some more careful thinking
would be useful to get clarity.
AUDIENCE: Is it enough to use a validation
set that's independent as long as it's separate?
That kind of closes the gap?
ILYA SUTSKEVER: Is it enough?
It's a good thing.
It's a good thing to do.
Is it truly-- the thing is I think it's also--
actually, like in the single epoch case,
if you are in a single epoch case,
then you can just compute your log probs as you train,
and that will be something like the model's
compression of the data set.
And that gives you also a good estimate of the validation set.
So in the single epoch case, I think things are quite similar.
AUDIENCE: I just wanted to make you aware
that there was a paper last month in [INAUDIBLE].
It used gzip, followed by a k-nearest neighbor
classifier, for text classification.
So no training, no parameters.
It compresses strings, just like you showed-- concatenate
two strings together, [INAUDIBLE] individually,
then compute the distance.
There is a way of using the output of gzip
as the [INAUDIBLE] [? code ?] [INAUDIBLE].
ILYA SUTSKEVER: Yeah.
My only comment on that is that gzip is not
a very strong compressor of text.
So I think it does show that things
are possible to some degree, but all the really meaty stuff
happens when you squeeze the last bits,
if you see what I mean.
AUDIENCE: It seems like it works like a toy proof of concept,
but it does work.
AUDIENCE: I think there was some bug there before that--
ILYA SUTSKEVER: It doesn't work.
AUDIENCE: [INAUDIBLE] results for now.
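For reference, the approach the questioner appears to be describing is usually built on the normalized compression distance; a minimal sketch, not the paper's exact code, with hypothetical function names:

```python
import gzip

def clen(s: str) -> int:
    """Compressed length of a string under gzip."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: compare separate vs. joint compression."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_classify(query, labeled_examples, k=3):
    """labeled_examples: list of (text, label) pairs. No training, no parameters."""
    neighbors = sorted(labeled_examples, key=lambda ex: ncd(query, ex[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)   # majority vote
```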
AUDIENCE: Since Jacob brought up curriculum effects,
and how they appear for neural nets, but not for-- you
don't see them with Kolmogorov complexity.
We were just talking at lunch about what
is the current empirical situation, about how much do
curriculum effects actually matter.
And none of us knew the answer.
ILYA SUTSKEVER: Well, it's like--
so it's a bit--
I have a take on this, which is the easier--
so we've done a lot of work as a field to make the transformer,
like the architecture that we are using, to be relatively
easy to optimize.
We found good initializations.
We found good hyperparameters.
We made changes to the architecture
so that training is as easy as possible.
The easier the training optimization
problem is, the less susceptible you are to curriculum effects.
And it's known.
For example, people who were training
in all kinds of exotic architectures,
like neural Turing machines, for example,
which are these really complicated things.
And it's super, super huge numbers
of heterogeneous layers, which are different.
And that thing, you need to be very, very careful
with your curriculum, because if you give it the full data set
right away, it'll just fail.
YEJIN CHOI: All right, let's thank Ilya again.
[APPLAUSE]
[SIDE CONVERSATION]