An Observation on Generalization
By Simons Institute for the Theory of Computing
Summary
## Key takeaways

- **Supervised learning is mathematically sound.** Supervised learning is well understood due to precise mathematical conditions that guarantee success. If training loss is low and model complexity is less than the training set size, test error is guaranteed to be low. This provides a solid foundation for collecting data and improving models with certainty. [02:56], [04:48]
- **Unsupervised learning lacks clear mathematical guarantees.** Unlike supervised learning, there's no clear mathematical framework for unsupervised learning. While we can optimize objectives like reconstruction error, it's unclear why this leads to good results for a different, desired objective, creating a high level of mystery. [07:50], [10:15]
- **Compression offers a path to understanding unsupervised learning.** The language of compression provides a more intuitive way to think about unsupervised learning than prediction. A good compressor, when applied to concatenated datasets, will naturally leverage shared structures between them, offering a clue to how patterns in unlabeled data can help with a supervised task. [15:15], [16:18]
- **Kolmogorov complexity as the ultimate, uncomputable compressor.** Kolmogorov complexity represents the shortest program to output data, acting as the ultimate compressor. While not computable, it serves as a theoretical benchmark for zero regret in unsupervised learning, implying that a perfect compressor would extract all possible predictive value from unlabeled data. [20:48], [23:33]
- **Neural nets approximate Kolmogorov compressors.** Large neural networks, trained with SGD, can be viewed as performing a search over programs. As networks grow, they approximate the ideal Kolmogorov compressor, suggesting that larger models are desirable because they get closer to this theoretical ideal of zero regret and maximum information extraction. [24:47], [30:32]
- **Next pixel prediction yields strong image representations.** Applying the next pixel prediction task, similar to next word prediction, to images has shown success. This autoregressive approach, as seen in iGPT, leads to good unsupervised learning results and better linear representations compared to models like BERT, suggesting a powerful unsupervised learning strategy for vision. [32:21], [33:41]
Topics Covered
- Why is unsupervised learning so mysteriously magical?
- A good compressor will find your data's shared structure.
- Big neural nets approximate an uncomputable ideal compressor.
- Next-pixel prediction proves compression works for vision.
- Why do autoregressive models learn better linear representations?
Full Transcript
YEJIN CHOI: It has a citation count that's six digits.
It's more than 139,000, and then many more.
So super excited to hear what Ilya has to say
about LLMs and the future.
Take it away.
ILYA SUTSKEVER: OK.
Hi.
[APPLAUSE]
Hi.
Hi everyone.
Thank you for the introduction.
When Umesh invited me to attend this event,
I got really excited.
I saw all the speaker list.
And I thought, great, I'll go and I'll talk about something.
And then I ran into a problem, which
is that a lot of the technical work
that we're doing at OpenAI, I actually can't talk about.
[LAUGHTER]
So I was really wracking my head what could I talk about.
Right now, so as of not too long ago,
I switched all my focus to working on AI alignment,
all my research focus.
And we'll have very cool results to show there, but not just
yet.
So I'm like, so this would have to be for the next talk.
But what would be good for this talk?
And I came up with something.
I'll tell you about some very old results
that we had at OpenAI many years ago, back in 2016 even,
which really affected the way I think
about unsupervised learning.
And I thought I'd share them with you.
It is possible that at this point,
you will find them obvious.
But maybe not all of them.
So there is at least a small chance
that you'll find it interesting.
So I want to set the expectations modestly, and then
hopefully exceed them.
So a theory of unsupervised learning.
How can such a thing exist?
Before we talk about unsupervised learning,
we want to talk about learning in general.
What is learning?
And why does learning work at all?
Why should learning work at all?
And why should computers be able to learn?
And now we're just used to the fact.
We take it for granted that neural networks learn.
But why do they?
Mathematically, why should they?
Why would data have regularity that our machine learning
models can capture?
So that's not an obvious question.
And one important conceptual advance
that has taken place in machine learning many years ago
by multiple people was the discovery and the formalization
of supervised learning.
So it goes under the name of PAC learning,
or statistical learning theory.
And the nice thing about supervised learning
is that it gives you a precise mathematical condition
under which learning must succeed.
You're told that if you have some data, from some data
distribution, that if you manage to achieve a low training
loss and the number of your degrees of freedom
is smaller than your training set, then
you will achieve low test error.
And you are guaranteed to do so.
So you have a mathematical condition
where you can say, well, if I did
find a function out of my function class,
which achieves low training error,
then learning will succeed.
And you can say, yeah, this is a very sensible
mathematical thing that we can reason about.
And that's why supervised learning is easy.
And then you had all these theorems,
which I thought were simple.
I found them elegant.
You've seen maybe theorems like this.
If you have your--
basically, it's like this is the sort of thing
where if you know it, you're going
to say, oh, yeah, that thing.
And if you don't know it, it will not
be possible to explain it in 30 seconds.
Though, it is possible to explain it in 5 minutes, just
not 30 seconds.
None of this stuff is complicated.
And you've got your little proof which says, well,
if you have some number of functions in your function
class, the probability that your train error will be far
from your test error for at least one function,
there is some math.
And the math is simple.
This is all the math-- three lines of math.
So three lines of math can prove all of supervised learning.
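For reference, a minimal version of the bound being sketched here, assuming a finite function class $\mathcal{F}$ and $N$ i.i.d. training examples, is the usual Hoeffding-plus-union-bound statement:

$$
\Pr\!\Big[\exists\, f \in \mathcal{F}:\ \big|\mathrm{err}_{\mathrm{train}}(f) - \mathrm{err}_{\mathrm{test}}(f)\big| > \varepsilon\Big] \;\le\; 2\,|\mathcal{F}|\, e^{-2\varepsilon^2 N},
$$

so as long as $\log|\mathcal{F}|$ is small compared to $N$, a function with low training error also has low test error with high probability.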
Well, that's nice.
That's very nice.
So supervised learning is something
that is well understood--
comparatively speaking, well understood.
We know why it must succeed.
So we can go forth and collect large supervised learning data
sets and be completely certain that models
will keep getting better.
So that's the story there.
And yeah, I forgot to mention a very important piece
of these results.
The test distribution and the training distribution
need to be the same.
If they are the same, then your theory of supervised learning
kicks in and works and will be successful.
So conceptually, it is trivial.
We have an answer for why supervised
learning works, why speech recognition should work, why
image categorization should work,
because they all reduce to supervised learning, which
works, which has this mathematical guarantee.
So this is very nice.
Here I want to make a small side comment on VC dimension
to those of you who care about such things.
There may be a small subset.
So if you want to zone out for the next 30 seconds,
feel free to do so.
So a lot of writings about statistical learning theory
emphasized the VC dimension as a key component.
But the main reason the VC dimension--
in fact, the only reason the VC dimension was invented
was to allow us to handle parameters
which have infinite precision.
The VC dimension was invented to handle parameters
with infinite precision.
So if you have a linear classifier,
every parameter is infinite precision.
But then, of course, in reality, all our floats
are finite precision, and their precision is shrinking.
So you can-- so you have your--
so the number of functions that are implemented by a computer
is actually small.
And you can reduce it back to this little formula, which
gives you pretty much all the best bounds
that you can get from supervised learning.
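Concretely, assuming $d$ parameters stored in $b$ bits each, the function class is finite with

$$
|\mathcal{F}| \;\le\; 2^{db},
$$

so plugging into the bound above gives a generalization gap on the order of $\sqrt{(db + \log(1/\delta))/N}$, with no VC machinery needed.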
So I find this to be cool.
I find this to be appealing because you
have fewer lines of proof.
Now, let's talk about unsupervised learning.
So I claim-- first of all, what is unsupervised learning?
What is it?
Supervised learning is like, yeah, you've
got your data that says, here, do this, here, do that,
and you have all this data.
You do well in your training error.
Your training error is low.
You have more training data than degrees
of freedom in your function class, parameters.
There may be other things, too-- degrees of freedom, like bits.
And you say, your supervised learning will succeed.
But what is unsupervised learning?
What can you say at all about unsupervised learning?
And I'll say that at least I have not
seen like an exposition of unsupervised learning, which
I found satisfying.
How to reason about it mathematically?
We can reason about it intuitively,
but can we reason about it mathematically?
And for some context, what is the old dream
of unsupervised learning, which, by the way,
this dream has been fulfilled, but it's fulfilled empirically.
Can we go just a tiny bit beyond the empirical results,
like the idea that you just look at the images,
or you just look at the text without being told what
to do with them, and somehow you still discover
the true hidden structure that exists in the data,
and somehow it helps you?
But why should it happen?
Should it happen?
Should we expect it to happen?
You cannot-- you don't have anything remotely similar
to the supervised learning guarantee.
The supervised learning guarantee
says, yeah, get your low training error,
and you're going to get your learning.
It's going to be a great success.
So unsupervised learning, it appears--
it appears that it's not this way.
People were talking about it for a long time, in the '80s.
The Boltzmann machine was already
talking about unsupervised learning.
And unsupervised learning also did not work at small scale,
but all the ideas were there, like denoising autoencoder,
for those of you who remember it.
It's like BERT or the diffusion model.
Within it, it's a tiny twist, the tiniest of twists.
The language models of old times,
they also generated cool samples for their time,
but their unsupervised learning performance was not as
impressive as that of today's models.
But I want to make the case that it is confusing,
because it's like--
why is it confusing?
You optimize.
How does unsupervised learning work?
You say, let's optimize some kind of reconstruction error,
or let's optimize some kind of denoising error,
or some kind of self-supervised learning error.
You optimize one objective.
Oh yes.
I just said that.
But you care about a different objective.
So then doesn't it mean that you have no reason
to expect that you will get any kind
of good unsupervised learning results,
or rather you do get them empirically,
but like, is it going to be--
the level of mystery is quite high, I claim.
It seems like a totally inaccessible phenomenon.
You optimize one objective, but you
care about another objective.
And yet it helps.
How can that be?
Magic.
The other thing, by the way, is that unsuper--
well, I guess I'm going to say something which is only 90%
true.
Unsupervised learning doesn't-- you could say, OK,
so you just learn the structure and the input distribution,
and it helped you.
But then, what if you're training
from the uniform distribution?
Then all your unsupervised learning algorithms will fail.
How should you think about that?
So what can we say?
Do we need to make assumptions?
What kind of assumptions?
So I'd like to present potentially
a way of thinking about unsupervised learning, which,
I think, I personally find it interesting.
Perhaps you'll find it interesting, too.
Let's find out.
So I want to show you one way of doing unsupervised learning,
which is not necessarily widely known, because it never
became the dominant way of doing unsupervised learning.
But it has the cool feature that similarly
to supervised learning, it has to work.
So what kind of mysterious unsupervised
learning procedure where you're not
giving any labels to any of your inputs
is still guaranteed to work?
Distribution matching, distribution matching.
So what is distribution matching?
Say I've got my data.
I've got X, and I've got Y, data sources.
There is no correspondence between them.
I've just got two data sources, data source X and data source
Y, language one, language two, text, speech--
no correspondence between them.
Let us look at this criterion.
Find the function F such that the distribution of F of X
is similar to the distribution of Y. It is a constraint on F.
And in the case of machine translation and speech
recognition, for example, this constraint might be meaningful.
You could say, yeah, if I have long sentences--
if you tell me that you have a function
such that the distribution-- you take your distribution
of English sentences.
You apply the function F to them,
and you get something which is very
similar to the distribution of French sentences, you can say,
OK, I found the true constraint about F.
If the dimensionality of X and the dimensionality of Y
is high enough, that's going to give you a lot of constraints.
In fact, you might be able to almost fully recover F
from that information.
This is an example of supervised learning--
of unsupervised learning, where it is still guaranteed
to work in the same sense that supervised learning is
guaranteed to work.
Also, substitution ciphers, like little simple encryptions,
will also fit into this framework.
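As a toy illustration of distribution matching, not from the talk, here is a minimal sketch for the substitution-cipher case: with no paired examples at all, matching the unigram letter distribution of the ciphertext to that of an unpaired reference corpus already recovers most of the mapping (a real attack would match n-gram distributions too; the names and inputs below are hypothetical).

```python
from collections import Counter
import string

def freq_order(text):
    """Letters of `text`, sorted from most to least frequent."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return [c for c, _ in counts.most_common()]

def match_distributions(ciphertext, reference_corpus):
    """Toy distribution matching: map the i-th most frequent cipher letter
    to the i-th most frequent reference letter. No correspondence between
    the two data sources is ever used, only their distributions."""
    return dict(zip(freq_order(ciphertext), freq_order(reference_corpus)))

# Usage (hypothetical inputs):
# mapping = match_distributions(ciphertext, big_english_corpus)
# decoded = "".join(mapping.get(c, c) for c in ciphertext.lower())
```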
So that's the thing.
And so I ran into this.
I independently discovered this in 2015.
And I got really fascinated by it,
because I thought, wow, maybe there
is something meaningful, mathematically meaningful
that we can say about unsupervised learning.
But let's see.
The thing about this setup is that still it's
a little bit artificial.
It's still-- real machine learning
setups aren't like this.
And the way we like to think about unsupervised learning
isn't like that either.
Now I'll present to you the meat, the meat
of what I wanted to say--
a proposed way to think about unsupervised learning
that lets you--
that puts it on par with supervised learning.
So OK, what is it doing mathematically?
How can you be sure that your unsupervised learning is good?
Compression to the rescue.
Obviously, it is well-known--
I shouldn't say obviously.
It is not obvious, but it is well-known
that compression is prediction.
Every compressor can be a predictor, and vice versa.
There is a one-to-one correspondence
between all compressors and all predictors.
However, I would argue that for the purpose of thinking
about unsupervised learning, the language of compression
offers some advantages, at least, for me, it did.
Perhaps it will for you, too.
So consider the following thought experiment.
This thought experiment is the most important slide.
Say you have two data sets, X and Y. You have two data
sets, two files on your big giant hard disk.
And say you have a really great compression algorithm
C, which takes data in, and outputs compressed objects out.
Say you compress X and Y jointly.
You concatenate them.
You take the two data sets, and you concatenate them.
And you feed them to your compressor.
What will happen?
Well, let's see.
What, and in particular, an important question
is, what will a sufficiently good compressor do?
My answer is, very intuitively, it
will use the patterns that exist inside X to help
it compress Y, and vice versa.
You could make the same claim about prediction,
but somehow it's more intuitive when
you say it about compression.
I don't know why that is, but I find it to be the case.
So that's our clue.
And you can make an equation like this, where you say, hey,
if your compression is good enough, if it's
like a real great compressor, it should
say that the compression of your concatenation
of your giant files should be no worse
than the separate compression of your two files.
So any additional compression that
was gained by concatenation is some kind of shared structure
that your compressor noticed.
And the better your compressor is, the more shared structure
it will extract.
The gap is the shared structure, or the algorithmic mutual
information.
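A minimal sketch of this thought experiment with an off-the-shelf compressor (zlib here, standing in for the "sufficiently good" compressor; as noted later in the Q&A, gzip-style compressors are weak, so the gap they find is only a crude estimate of the shared structure):

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Bytes needed by zlib, standing in for the compressor C."""
    return len(zlib.compress(data, 9))

x = b"the cat sat on the mat. " * 200   # dataset X
y = b"the dog sat on the mat. " * 200   # dataset Y, sharing structure with X

separate = compressed_size(x) + compressed_size(y)   # C(X) + C(Y)
joint = compressed_size(x + y)                       # C(concat(X, Y))
print(joint, "<=", separate, "gap:", separate - joint)
# The gap is (a crude estimate of) the shared structure the compressor noticed.
```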
So that's interesting, right?
You can see what I'm alluding to.
Y is the data of your supervised task.
X is your unsupervised task.
But suddenly, you have some kind of mathematical reason
for the information for the patterns in X to try to help Y.
Notice also how it generalizes distribution matching.
If we are in the distribution matching case, where you've got
your X is language one and Y is language two,
and you are saying--
and you know that there exists some simple function
F that transforms one distribution into the other,
surely, your compressor, if it's good,
you'll notice that, and make use of that,
and maybe even internally try to recover this function.
I think that's pretty cool.
We've closed the circle.
So how do we formalize it then?
What will be the formalization of unsupervised learning?
Let's see.
Let's see if I can do it.
Consider an ML algorithm.
So here, by the way, in what follows, I'll be a bit sloppy,
and I will use compression and prediction interchangeably.
Say you have a machine learning algorithm A. It's an algorithm
A, and it tries to compress Y. And say it has access to X. X
is file number one, and Y is file number two.
You want your machine learning algorithm, your compressor
to compress Y, and it can probe X as it sees fit.
The goal is to compress Y as well as possible.
We can ask ourselves, what is our regret
of using this particular algorithm?
You'll see what I'm getting at.
Regret relative to what?
If I do a good enough job, if I have--
being low regret means that I've gotten all the help
that I can possibly get from this unlabeled data.
This unlabeled data has helped me as much as possible.
And I don't feel bad about it.
I don't feel bad.
I don't feel like I've left some prediction value on the table
that someone else with a better compression algorithm
could have used.
That's what it means.
And in particular, it's like, yeah, you can go,
and you can sleep happily at night, knowing that if there
is some information in the unlabeled data that could
help my task, no one else--
no one could have done as good of a job as me.
I've done the best job at benefiting
from my unlabeled data.
So I think that is a step towards thinking
about unsupervised learning.
You don't know if your unsupervised data
set is actually useful.
It may be super useful.
It may have the answer.
It may be totally useless.
It may be the uniform distribution.
But if you have a low regret unsupervised
learning algorithm, you can say, whether it's
the first case or the second case, I know I've done my best.
I know I've done my best at benefiting
from my unlabeled data.
And no one could have done better than me.
Now, I want to take a detour to theory land, which
is a little obscure.
I think it is interesting.
Kolmogorov complexity as the ultimate compressor
gives us the ultimate low regret algorithm,
which is actually not an algorithm,
because it's not computable.
But you will see what I mean really quickly.
So Kolmogorov-- first of all, for some context,
who here is familiar with Kolmogorov complexity?
OK, about 50%.
Kolmogorov complexity is the kind of thing
which is easy to explain in 1 minute, so I'll just do it.
It's like imagine if I give you some data,
or you give me some data.
And I'm going to compress it by giving you
the shortest program that can possibly exist,
the shortest program that exists, which, if you run it,
outputs your data.
AUDIENCE: There's a typo.
[INAUDIBLE] Instead of Y, it should be X.
ILYA SUTSKEVER: Yes, that is correct.
You got me.
[LAUGHTER]
It is the length of the shortest program which outputs X. Yes.
Intuitively, you can see that this compressor is quite good,
because you can prove this theorem, which
is also really easy to prove, or rather, it's easy to feel it.
And once you feel it, you could kind of believe me
that it's easy to prove.
And you can basically say that your Kolmogorov compressor,
if you use that to compress your strings,
you'll have very low regret about your compression quality.
You can prove this result. It says
that if you've got your string X,
your data set, your database, whatever, then the shortest
program which outputs X is shorter than whatever
output your compressor produced, however well
your compressor compressed your data, plus a little term, which
is however many characters of code you need
to implement your compressor.
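Written out, with the constants hedged, the result being described is

$$
K(X) \;\le\; |C(X)| + K(C) + O(1),
$$

where $|C(X)|$ is the length of your compressor's output on $X$ and $K(C)$ is essentially the length of the program implementing the compressor $C$.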
Intuitively, you can see how it makes sense, the simulation
argument.
The simulation argument, if you tell me, hey, I've
got this really great compressor C, I'm going to say,
cool, does it come with a computer program?
Can you give this computer program to K?
And K is going to run your compressor,
because it runs computer programs;
you just need to pay for the program length.
So without giving you the details,
I think I gave you the feel of it.
Kolmogorov complexity, the Kolmogorov compressor,
can simulate other compressors
by running their computer programs.
This is also why it's not computable.
It's not computable because it simulates--
it feels very much at liberty to simulate all computer programs.
But it is the best compressor that exists.
And we were talking about good compression
for unsupervised learning.
Now, let us generalize a Kolmogorov complexity,
a Kolmogorov compressor to be allowed
to use side information.
I'll talk about this in more detail.
I'll reiterate this point several times,
because this point is important.
Obviously, the Kolmogorov compressor is not computable.
It's undecidable.
But like, hey, because it like searches over all programs,
did you know that if you run SGD over the parameters
of some neural net with 100 layers,
it's automatically like doing program search over a computer,
which has a certain amount of memory,
a certain number of steps.
It's kind of like a micro, micro, micro Kolmogorov compressor
that you get by fitting a neural net.
You kind of see the feel, the similarity.
It's kind of magical, right?
Neural networks can simulate real programs.
They are little computers.
They're circuits.
Circuits are computers, computing machines.
And SGD searches over the program.
And all of deep learning hinges on top of the SGD miracle
that we can actually train these computers with SGD.
That works.
We actually find those circuits from data.
Therefore, we can compute our miniature Kolmogorov
compressor.
A simulation argument applies here as well, by the way,
I just want to mention this one fact.
I don't know if you've-- if you've ever tried to design
a better neural network architecture,
what you'd find is that it's kind of hard to find a better
neural network architecture.
You say, well, let's add this connection.
Let's add that connection.
And let's modify this and that.
Why is it hard?
The simulation argument, because your new architecture
can be pretty straightforwardly simulated
by your old architecture.
Except when it can't.
Those are rare cases.
And in those rare cases, you have a big improvement,
such as when you switch from the little RNN to the transformer.
The RNN has a bottleneck, the hidden state.
So it has a hard time implementing the transformer.
However, had we found a way to engineer [INAUDIBLE]
a very, very large hidden state, perhaps it
would become as good as a transformer again.
So this is the link.
You start to see how we switch from the formal land
to neural network land.
But you see the similarity.
So conditional Kolmogorov complexity as the solution
to unsupervised learning--
you can basically have a similar theorem,
where I'm not going to define what-- well, I'm
going to define what K of Y given X is.
It's like the shortest program which outputs Y,
if it's allowed to probe X. That's
the length of that shortest program which outputs Y
if it is allowed to probe X. And you can prove the same result.
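In symbols, the conditional version is the same bound with side information, for any computable compressor $C$ that may probe $X$ while compressing $Y$:

$$
K(Y \mid X) \;\le\; |C(Y \mid X)| + K(C) + O(1).
$$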
And you can see immediately that, yeah, this
is definitionally the solution to unsupervised learning.
If you use that, you can sleep very, very safely at night--
soundly at night, knowing that no one does
unsupervised learning better than you.
It's literally that.
So this is the ultimate low regret solution
to unsupervised learning, except that it's not computable.
But I do think it's a useful framework.
And here we condition on a data set, not an example.
This thing will extract all the value out
of X for predicting Y-- the data set, the data set,
not the example.
So this is the solution to unsupervised learning--
done success.
And there is one little technicality
which I need to spend a little bit of time talking about,
which is we were talking about this conditional Kolmogorov
complexity, where you are talking about compressors that
get to try to compress one thing while having access
to another thing.
And this is a bit unnatural in a machine learning context
when you care about feeding big data sets, at least
as of today.
Although, it's changing fairly rapidly.
But I think it's still fair to say
that there is no truly good way of conditioning on a big data
set.
You can fit a big data set, but you can't condition on it,
not yet, not truly.
So this result says that, hey, if you
care about making predictions about your supervised task Y,
using the good old-fashioned Kolmogorov compressor, which
just compresses the concatenation of X and Y
is going to be just as good as using
your conditional compressor.
There are more details and a few subtleties to what I just said.
And I'm happy to talk about them offline in the event someone
is interested in them, but that basically
says that if before you were saying, hey--
the previous slide said that you can
use this conditional Kolmogorov compressor to solve
unsupervised learning.
This says that you can also use your regular Kolmogorov
compressor.
Just throw all your data.
Take all your files.
Concatenate them.
Compress them.
And that's going to make some great predictions
on your supervised task, the task that you care about.
There are some intuitions about why this is true.
This result is actually like--
proving this is slightly more tricky, so I won't do it.
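One standard ingredient behind this claim, stated loosely and with the subtleties Ilya alludes to swept into the error term, is the Kolmogorov-Levin symmetry of information:

$$
K(X, Y) \;=\; K(X) + K(Y \mid X) + O(\log K(X, Y)),
$$

so a compressor that does nearly optimally on the concatenation has implicitly done nearly optimally on $Y$ given $X$.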
And so the solution to unsupervised learning,
just give it all to your Kolmogorov complexity--
to your Kolmogorov compressor.
The final thing is I'll mention that this kind
of joint compression is maximum likelihood if we don't overfit.
If you have a data set, then the negative sum
of the log-likelihoods, given your parameters, is the cost
of compressing the data set.
You also need to pay the cost of compressing the parameters,
but you can see if you now want to compress two data
sets, no problem.
Just add more points to your training set, to your data set,
and add the terms to the sum.
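In symbols, the two-part code being described is, roughly,

$$
\mathrm{bits}(\mathrm{data}) \;\approx\; -\sum_i \log_2 p_\theta(x_i) \;+\; \mathrm{bits}(\theta),
$$

and jointly compressing two data sets just extends the sum over both of them.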
So this concatenation, this joint compression
is very natural in a machine learning context, which
is why it was worth the hassle of saying, yeah,
we have our conditional Kolmogorov complexity.
And then I made some arguments.
I made some claim without fully defending it,
but it is possible to defend it, but using regular--
just compress everything.
Kolmogorov complexity works also.
And so I think this is elegant.
I like it, because then it says, well,
if you squint hard enough, you can
say that this explains what our neural network is doing.
You can say, hey, SGD over big neural networks
is our big program search.
Bigger neural networks approximate the Kolmogorov
compressor more and more and better and better.
And so maybe this is also why we like big neural nets,
because we approach the unapproachable idea
of the Kolmogorov compressor, which truly has no regret,
and we just want to have less and less
regret as we train larger and larger neural nets
as far as extracting predictive value is concerned.
Now, the way this applies to the GPT models--
you know, I claim that it applies to the GPT models also,
this theory.
But the thing that's a little bit tricky with the GPT models
is that the theory of their behavior
can also be explained without making
any references to compression or unsupervised learning.
You just say, no, it's just the conditional distribution
of text on the internet.
Few-shot learning, just imagine a document
with some repeated patterns.
The pattern is probably going to continue.
So the GPT models can intuitively be explained,
at least their few-shot behavior can definitely
be explained without alluding to this theory.
And so I thought it would be nice-- can we
find some other direct validation of this theory?
Now, can we find a different domain, like vision?
Because vision, you have pixels.
Can you show that doing this on pixels
will lead to good unsupervised learning?
The answer is, yes, you can.
This is work we've done in 2020.
We just called it iGPT.
And it's an expensive proof of concept.
It's not meant to be--
it was not meant to be a practical procedure.
It meant to be a paper that showed
that if you have a really good next step predictor,
you're going to do great unsupervised learning.
And it was proven in the image domain.
And there, I'll just spell it out.
You have your image.
You turn it into a sequence of pixels.
Each pixel should be given some discrete value of intensity.
And then just do next pixel prediction.
With the same transformer.
That's it-- different from BERT.
Just next token prediction, because this
maximizes the likelihood, therefore compresses.
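A minimal sketch of this recipe, with made-up toy sizes (iGPT itself used much larger transformers on bigger color images): flatten each image into a pixel sequence, quantize intensities into discrete tokens, and train with next-token cross-entropy under a causal mask. Everything here (NUM_LEVELS, SEQ_LEN, the tiny model) is an illustrative assumption, not the paper's configuration.

```python
import torch
import torch.nn as nn

NUM_LEVELS = 16      # discrete intensity levels per pixel (toy choice)
SEQ_LEN = 32 * 32    # 32x32 grayscale images, flattened (toy choice)

def images_to_tokens(images):              # images: (B, 32, 32), floats in [0, 1]
    return (images * (NUM_LEVELS - 1)).round().long().flatten(1)   # (B, SEQ_LEN)

class TinyPixelGPT(nn.Module):
    """Stand-in autoregressive model; iGPT used a much larger transformer."""
    def __init__(self, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(NUM_LEVELS, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, NUM_LEVELS)

    def forward(self, tokens):             # tokens: (B, T)
        T = tokens.size(1)
        h = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.encoder(h, mask=mask)     # causal mask: each pixel sees only the past
        return self.head(h)                # (B, T, NUM_LEVELS) next-pixel logits

model = TinyPixelGPT()
images = torch.rand(8, 32, 32)
tokens = images_to_tokens(images)
logits = model(tokens[:, :-1])             # predict pixel t from pixels < t
loss = nn.functional.cross_entropy(logits.reshape(-1, NUM_LEVELS),
                                   tokens[:, 1:].reshape(-1))
loss.backward()   # minimizing this maximizes likelihood, i.e. compresses the pixel stream
```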
And like we see, one immediate result we see--
so these are results on CIFAR-10.
You have models of different sizes.
This is their next step prediction accuracy
on their pixel prediction task on their unsupervised learning
task.
And this is linear probe accuracy, linear probe,
when you pick some layer inside your neural net, the best
layer, fit a linear classifier, and see how well it does.
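A minimal sketch of such a linear probe, assuming you can extract some chosen layer's activations from the frozen pretrained model as fixed feature vectors (the function name and inputs are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit only a linear classifier on frozen features; the pretrained net is untouched."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.asarray(train_feats), np.asarray(train_labels))
    return clf.score(np.asarray(test_feats), np.asarray(test_labels))
```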
You've got these nice looking curves,
and they start to get more and more similar.
It's kind of what you want.
It's like it's working.
Pixel prediction of the same variety
as next word prediction, next--
not just pixel-- not mere pixel prediction,
but next pixel prediction leads to better and better
unsupervised learning.
I thought that was pretty cool.
And we tried it-- we scaled it up to various degrees,
and we showed that, indeed, it learns reasonably well
where-- we came close, but we didn't fully
close the gap between our unsupervised learning
and the best unsupervised learning
of the day on ImageNet.
But it was clearly a mat-- it did feel like it scaled,
and it was a matter of compute, because these things use
large, high-resolution images, and we use like tiny 64
by 64 images with giant transformers for the day--
tiny by today's standards, but giant for the day, 6
billion parameters.
So this is like, yeah, you do your unsupervised next pixel
prediction on a big image data set,
and then you fit your linear probe on ImageNet.
And you get strong results.
Like, OK, that is cool.
CIFAR, yeah-- like we got-- one of the cool things
is that if you got 99% on CIFAR-10 with this,
that was cool.
2020 was a different time, but it was cool.
I'll mention-- I'll close with a couple of comments
on linear representations.
The thing-- I like the compression theory,
because I feel like it just bothered me for a long time
that you couldn't think rigorously
about unsupervised learning.
I claim now you can, at least to some partial degree.
Still lots of waving of hands is needed,
but maybe a bit less than before.
But it doesn't say why your representations
should be linearly separable.
It doesn't say a linear [INAUDIBLE] should happen.
But linear representations happen all the time.
And the reason for their formation
must be deep and profound.
And I think we might be able to crisply articulate it
at some point.
One thing which I thought was interesting
is that these next pixel prediction
models, autoregressive models seem
to have better linear representations than BERT.
And the blue accuracy is BERT versus autoregressive.
I'm not sure why that is, or rather I can speculate.
I have some speculations.
But it would be nice to--
I think it would be nice to gain more understanding for why,
like really why those linear representations are formed.
And yeah, this is the end.
Thank you for your attention.
[APPLAUSE]
AUDIENCE: Could you provide the speculation?
ILYA SUTSKEVER: Yeah.
Oh yeah yeah.
I think the speculation is basically
that if you're doing next pixel prediction,
you're predicting the next pixel from all the previous pixels.
So you need to look at the long range structure.
But in BERT, you have your vector.
And let's say you drop 25% of the tokens, of the pixels,
in this case, then any prediction
that you have can actually be done quite well by looking
a little bit into the past and a little bit into the future.
So you remove all the hard prediction tasks,
or you make them much easier.
The hardest prediction task in the next pixel prediction
is a lot harder than the hardest prediction task in the BERT
prediction case.
So this is an argument.
We may even be able to test it if we think--
if we try to design an experiment to test it.
But yeah, that's the speculation.
AUDIENCE: Is there a more robust 2D version
of the next pixel prediction?
ILYA SUTSKEVER: I would say more robust--
I would say any formally-- anything
you do at all that turns a neural network
into a probabilistic model that assigns probabilities
to different inputs is that.
So the other big way of doing next token prediction is
diffusion models, diffusion models also like--
not the ones that are so--
the diffusion models that people use in like high quality image
generators don't really maximize the likelihood of their inputs.
They have a different objective.
But the most original formulation
is maximizing likelihood.
And by the way, the diffusion model is a counterargument to--
well, rather, the diffusion model also, I would claim,
should have worse representations than a next
token prediction model for the same reason that BERT doesn't.
So I feel this further, in my mind,
increases the mystery of what causes
linear representations to form.
AUDIENCE: Yeah.
Thanks for the talk.
I liked the analogy between Kolmogorov complexity
and neural networks.
I guess one kind of place where it
seems like there's a disanalogy is that neural networks have
training dynamics.
And if you do something like just take all the computer
programs, it's more memory.
So if you're compressing with Kolmogorov complexity,
the order of the data doesn't matter.
It clearly does for neural nets.
You have simple heuristics and features
that are learned early in training that probably
stick around late in training.
And so, I guess, do you have thoughts
on how to think about that in the context of this picture?
ILYA SUTSKEVER: Yeah.
I mean, the analogy isn't perfect.
That is true.
And the way in which you describe its--
the place where that analogy breaks down the most
is in the search procedure.
This is really what you're alluding to, like, hey,
order of data matters.
Why?
Because we are using a weak search procedure.
A Kolmogorov compressor uses the infinitely expensive search
procedure of just enumerating everything from scratch
each time.
So yeah, I mean it is--
I mean-- I would say that caution should be used
with this particular analogy.
It's very clearly not universally applicable.
But I do think that in the case--
I do think that there is some value here in explaining
where unsupervised learning comes from.
AUDIENCE: So you are-- now that your formal--
going back and forth between compression and the next bit
prediction, you are connecting that
to-- relating unsupervised learning
to supervised learning.
So if you backtrack from cryptography,
there's sort of a theory.
It goes back to the '80s, where they talk about compression
being equivalent to next bit prediction,
being equivalent to being able to distinguish
two distributions.
So if you have an [INAUDIBLE], then you have an algorithm that
can predict--
then you have an algorithm that can compress.
I mean, cryptography is the other way.
You're saying there is no way to compress [INAUDIBLE].
And therefore, there is no way [INAUDIBLE].
So I wonder if this idea of being able
to distinguish, how could that translate
to anything naturally?
ILYA SUTSKEVER: So I think I understand the question.
Let me paraphrase.
So I'm not super well-versed in the area that you mentioned.
But you're saying, if you can distinguish
through distribution, you can use that distinguishability
to make some--
to predict.
And I guess the question would be, what's the--
I have some--
AUDIENCE: You have samples from one.
You have samples from the other.
And you can tell them apart [INAUDIBLE].
ILYA SUTSKEVER: And so I guess my question
is, how well can you predict--
or you're talking about you can predict a little bit?
Is that what you're referring to.
AUDIENCE: Yeah, [INAUDIBLE].
You can predict what the next bit would be,
let's say, in a sample.
ILYA SUTSKEVER: So--
AUDIENCE: Actually, [INAUDIBLE], [INAUDIBLE] you can predict
those.
ILYA SUTSKEVER: I mean, I can mention
one thing that is related, which is energy-based models.
So energy-based models offer yet another way
of turning neural networks into probability distributions,
where an energy-based model will say, just give me
your configuration of vectors.
And I'm just going to tell you how it feels.
And then you normalize over all of them.
And I feel like when it comes to energy-based models
in particular, ratios of distributions
correspond to differences of energy.
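In symbols, for an energy-based model with energy $E$ and normalizer $Z$:

$$
p(x) \;=\; \frac{e^{-E(x)}}{Z}, \qquad \frac{p(x)}{p(x')} \;=\; e^{\,E(x') - E(x)},
$$

so ratios of probabilities correspond to differences of energies, and the intractable $Z$ cancels.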
Maybe there is some relation to what you're saying.
I think I'm probably not like precisely commenting
on the thing that you said, but I
don't think I have anything more to add, unfortunately.
AUDIENCE: Ilya, maybe I'll just put in a tiny word
to defend the honor of VC dimension.
ILYA SUTSKEVER: Go ahead.
AUDIENCE: Well, so in 2007, I had
this theorem about PAC learning of quantum states.
And this was precisely an example, where
even if you don't care about infinite precision,
even if you discretize the state space,
like you look at the VC dimension,
or the fat shattering dimension, or whatever,
and it is exponentially smaller than the log of the size
of the hypothesis class.
So I do think there are cases where you still
want VC dimension.
ILYA SUTSKEVER: That is a cool example.
[LAUGHTER]
AUDIENCE: So I didn't completely follow your notation.
Capital X was a sample from a distribution.
ILYA SUTSKEVER: Yeah.
AUDIENCE: The distributions--
ILYA SUTSKEVER: The data set.
AUDIENCE: So then the transformer, SGD--
I'm not sure if you can think of it as the best program
for compressing [INAUDIBLE].
Maybe because it's only given one sample at a time.
ILYA SUTSKEVER: That's true.
There is one other assumption that we are making.
If you look at the total training loss--
say you have some neural network.
It doesn't need to be a transformer--
some neural network which assigns
log probabilities to your data.
And you have some large number of training cases.
You can run the neural net.
You need some neural network.
You can compute its log probability on each case.
And you can take their sum.
And that's going to be the log probability
that the neural network assigns to the entire data set.
Now, the neural network is unable--
this particular formulation makes
the neural network unable to explicitly notice
maybe temporal, or any kind of structure
in the order of the data.
But I still claim it's meaningful to say
that, yeah, you can compute the entire log
probability of the data set.
And that gives you your negative log probability, literally
the number of bits you need to compress
this data set using this neural network as a compressor.
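A minimal sketch of the bookkeeping being described, assuming you have per-example natural-log probabilities from some model:

```python
import math

def dataset_bits(log_probs):
    """Code length, in bits, the model assigns to the whole data set:
    the negative sum of per-example log-probabilities, converted from nats to bits."""
    return -sum(log_probs) / math.log(2)

# e.g. dataset_bits([-2.3, -0.7, -1.1]) -> bits needed to compress these three examples
```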
AUDIENCE: If I can paraphrase, I think
you're arguing for compression as a framework
for understanding or motivating unsupervised learning.
And a point you made at the end was
that if you apply that framework to language models,
to next word prediction, it can feel a little maybe
superficial, because next word prediction, any text task
can be converted to next word prediction.
And so unsupervised learning is kind of superficially
the same as supervised learning for text.
So then we turn to image GPT, where
you can't formulate any given task as next pixel prediction.
And maybe you could, but let's just say you can't.
But then you can use the linear representation
as a way of showing that compression is a good way
to formulate unsupervised learning.
But then there are highly effective compressors
that wouldn't give you a useful linear representation.
So I'm wondering, are there any cases where
unsupervised learning and supervised learning
are not superficially the same, [INAUDIBLE],
but that you don't require your compressor
to give you an effective linear representation
to show that compression is a good unsupervised objective?
ILYA SUTSKEVER: Yes.
So that's a very--
I have thoughts on this question.
So first of all, I will say that the linear representation,
a good linear representation is a bonus.
At no point does this argument say,
and therefore linear representation should emerge.
However, I would argue that the theory does
say that good fine tuning should emerge,
because joint compression is kind
of like hacky approximate fine tuning using the bad search
algorithm, which is SGD.
Now, we know that the evidence from these old experiments
suggests that BERT learns worse linear representations
than next pixel prediction when run on images.
And perhaps the same is true for diffusion models.
That seems highly likely.
I think it would be very interesting to see
how the fine tuned diffusion models compare.
Maybe someone knows already here.
AUDIENCE: So I wonder if you can take
your ideas for unsupervised learning and bring them back.
It seems to me that there could be
some insights you could take back
to the supervised learning [INAUDIBLE].
ILYA SUTSKEVER: Well, I think here
this gives you more insight into the-- so the particular insight
that you might get here is insight into the function
class, the desired function class.
You want your neural net to have many layers.
For example, if you have many layers,
it can do more thinking.
You want those layers to be wider, bigger neural nets.
I mean, essentially, this is like what the field has already
discovered.
AUDIENCE: But you might not need more examples than the number
of [INAUDIBLE].
In supervised learning, this might
be an explanation for why you don't need
necessarily more parameters.
ILYA SUTSKEVER: That's right.
So this might, for example, be--
so there is a--
so the weakness of this theory-- by the way, this theory
has a huge practical weakness, which
is it ignores compute cost.
It only focuses on information.
So you're all familiar with the universal transformer, maybe--
basically, it's like it's a transformer.
We use the same weights at each layer.
It's a great idea, except that if you
want to have a lot of parameters,
you need to pay for it a lot with compute.
Nobody wants to do that.
In reality, you don't get to ignore compute costs.
This theory ignores compute costs.
If you ignore compute costs, it gives you
a recipe of how to proceed.
AUDIENCE: What do you think the importance is
of the autoregressive modeling in particular
of the probability distribution?
You can view BERT as sort of doing maximum likelihood
training.
I think you can find ways to sample out of BERT,
for example.
ILYA SUTSKEVER: So certain diffusion models
can be set up to be maximum likelihood models.
And so all this theory predicts, the diffusion models
should also be--
it should be possible to make diffusion models
to do equally great things, perhaps
with some constant factors of--
because this-- as the earlier answer,
this is not a compute sensitive theory.
So it's going to say, OK, maybe you
need like a factor of 10 or 15 compute,
and then things will be the same between the autoregressive
and diffusion model.
The autoregressive model, it's just simple.
It's convenient.
Maybe the energy-based model will do even greater things.
But from that perspective, they're all the same.
AUDIENCE: It seems like GPT-4 may
be the best compressor at the moment, which presumably
is the largest model out there as well.
So on one hand, it's able to compress better.
On the other hand, the size of the compressor
itself is increasing.
Or is it that it's actually not necessarily the case
from the perspective of the theory [INAUDIBLE] complexity?
ILYA SUTSKEVER: Well, I mean what matters
is that the cost of the--
what you really want is for the--
there is another way in which this theory doesn't quite
correspond to reality, because this theory says,
you have a fixed data set you want to compress.
But in the way we train the GPT models,
you say, you have a big training set,
and then you have a test set, which
is assumed to be infinite.
If the test set is infinite and you
care about compressing the test set,
then you don't care about the size of your compressor,
so long as this test set size is much larger.
So I would say that this is also, like, I would say,
a disanalogy, where some more careful thinking
would be useful to get clarity.
AUDIENCE: Is it enough to use a validation
set that's independent as long as it's separate?
That kind of closes the gap?
ILYA SUTSKEVER: Is it enough?
It's a good thing.
It's a good thing to do.
Is it truly-- the thing is I think it's also--
actually, like in the single epoch case,
if you are in a single epoch case,
then you can just compute your log probs as you train,
and that will be something like the model's
compression of the data set.
And that gives you also a good estimate of the validation set.
So in the single epoch case, I think things are quite similar.
AUDIENCE: I just wanted to make you aware
that there was a paper last month in [INAUDIBLE].
It used gzip, followed by a k-nearest neighbor
classifier, for text classification.
So no training, no parameters.
It compresses strings, just like you showed-- concatenate
two strings together, [INAUDIBLE] individually,
then compute the distance.
There is a way of using the output of gzip
as the [INAUDIBLE] [? code ?] [INAUDIBLE].
ILYA SUTSKEVER: Yeah.
My only comment on that is that gzip is not
a very strong compressor of text.
So I think it does show that things
are possible to some degree, but all the really meaty stuff
happens when you squeeze the last bits,
if you see what I mean.
AUDIENCE: It seems like it works like a toy proof of concept,
but it does work.
AUDIENCE: I think there was some bug there before that--
ILYA SUTSKEVER: It doesn't work.
AUDIENCE: [INAUDIBLE] results for now.
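For reference, the approach the questioner appears to be describing is usually built on the normalized compression distance; a minimal sketch, not the paper's exact code, with hypothetical function names:

```python
import gzip

def clen(s: str) -> int:
    """Compressed length of a string under gzip."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: compare separate vs. joint compression."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_classify(query, labeled_examples, k=3):
    """labeled_examples: list of (text, label) pairs. No training, no parameters."""
    neighbors = sorted(labeled_examples, key=lambda ex: ncd(query, ex[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)   # majority vote
```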
AUDIENCE: Since Jacob brought up curriculum effects,
and how they appear for neural nets, but not for-- you
don't see them with Kolmogorov complexity.
We were just talking at lunch about what
is the current empirical situation, about how much do
curriculum effects actually matter.
And none of us knew the answer.
ILYA SUTSKEVER: Well, it's like--
so it's a bit--
I have a take on this, which is the easier--
so we've done a lot of work as a field to make the transformer,
like the architecture that we are using, to be relatively
easy to optimize.
We found good initializations.
We found good hyperparameters.
We made changes to the architecture
so that training is as easy as possible.
The easier the training optimization
problem is, the less susceptible you are to curriculum effects.
And it's known.
For example, people who were training
in all kinds of exotic architectures,
like neural Turing machines, for example,
which are these really complicated things.
And it's super, super huge numbers
of heterogeneous layers, which are different.
And that thing, you need to be very, very careful
with your curriculum, because if you give it the full data set
right away, it'll just fail.
YEJIN CHOI: All right, let's thank Ilya again.
[APPLAUSE]
[SIDE CONVERSATION]