CS294-196 (Agentic AI MOOC) - Lecture 1 {Yann Dubois}
By Berkeley RDI Center on Decentralization & AI
Summary
## Key takeaways - **Pretraining: Predicting the Next Word at Scale**: Pretraining involves predicting the next word on vast amounts of internet data, aiming to imbue the model with broad world knowledge. This process requires immense computational resources, with training costs exceeding $10 million and datasets of over 10 trillion tokens. [01:32:30], [01:37:35] - **Post-training Aligns Models with Human Intent**: While pretraining teaches a model language, post-training (like RLHF or SFT) is crucial for aligning the model's behavior with human preferences and instructions, making it useful for real-world tasks. This stage uses significantly less data, focusing on quality over quantity. [03:05:05], [53:36:40] - **Scaling Laws Drive LLM Performance**: Empirical evidence shows a strong correlation between increased compute (more data and larger models) and improved LLM performance. This relationship, known as scaling laws, allows researchers to predict performance at larger scales based on smaller-scale experiments, guiding resource allocation. [42:01:03], [43:05:09] - **Data Quality and Filtering are Critical**: The quality of training data significantly impacts LLM performance. Extensive filtering, deduplication, and heuristic-based cleaning are essential to remove undesirable content and low-quality documents, transforming raw internet data into a more effective training corpus. [28:18:20], [34:48:54] - **Systems and Infrastructure are Key for Scale**: Efficiently scaling LLM training relies heavily on systems optimization, including low-precision operations, fused kernels, and advanced parallelization strategies (data, model, and tensor parallelism). These techniques address bottlenecks in memory and communication, maximizing hardware utilization. [37:01:11], [46:46:49]
Topics Covered
- Pretraining LLMs: Predicting the Next Word on the Internet
- RLHF: Training LLMs to Interact with Humans
- AI 'Hacks': When Models Game the System
- GPU Bottleneck: Data Feeding, Not Computation
- Halving Precision for Faster AI Training
Full Transcript
YANN DUBOIS: OK.
Let's get started.
So hi, everyone.
My name is Yann.
I'm a researcher at OpenAI.
And I'll be rerecording a class that I gave at the Berkeley LLM
Agents MOOC series on Introduction
to training LLMs for AI agents.
So the reason why we're recording this class
is because, one, we had technical difficulties early
on in the class, and second, there
was a fire alarm that started, which means that we did not
go through all the slides.
And also, we don't have a good recording,
and I want to make sure that everyone who's online
would also see the entire class.
So before getting started, one thing to say
is that, all views here on my own,
even though I work at OpenAI, I will mostly talk about things
that we can find online and information that we
can find of open-source models, especially Kimi, Llama,
and IPsec.
So unless I talk about OpenAI, nothing
is related to OpenAI here.
Great.
So with that, let's get started.
So we all know that LLMs and chatbots really
took over the world in the last few years.
And so the question I will try to answer
is-- how do we actually train those?
How do we train those models?
So this is an example here from ChatGPT and an answer
from ChatGPT.
So there are three main parts of the training pipelines,
one training LLM.
The first one is pretraining, so you probably all heard about it.
The general mental model that I like to give people
is that pretraining is really about predicting
the next word on internet.
So you take all of internet or all the clean part of internet,
and you just try to predict the next word.
And as a result, you will learn everything about the internet
and as much as possible about the world.
In terms of data, pretraining takes more than,
at least for The big open-source models,
takes more than 10 trillion tokens, so that's a lot.
And when I see tokens, you can think
a little bit about it as words or subwords,
but essentially, more than central tokens.
So think about it as 10 trillion words.
It takes months to train on that much data.
And it takes a lot of money and a lot of compute
to actually train on that amount of data.
So you can think about the compute cost
being more than $10 million for one run.
So here, the bottleneck in pretraining
is both the data we talked about, 10 trillion is a lot.
And second is compute.
And as we will see, basically, the more you scale up
these models, so that means, the more data
you put in the model, the more compute.
So the longer you train for, the better the performance will be.
So an example, a pretraining model is like Llama 3,
for example.
So a second part, which actually is
third in a chronological order of how these models are usually
trained, but this is the second part
that came after pretraining.
The last part is more recent.
So the second part is what I call
classic post-training or RLHF, so Reinforcement Learning
from Human Feedback.
Here, the idea is that this pre-trained model
is just good at predicting the next word,
but it's not really good at performing well
in the sense of predicting what the user wants
or answering questions or following instructions.
Basically, you can think about it
as a model that knows everything about the world,
but it doesn't actually know how to interact with a human.
That's one way of thinking about it.
And that's what we're going to try
to optimize for is model's interaction with humans
and making sure that when you ask it to do something,
it does it.
The data size here is much smaller,
maybe around like 100,000 problems.
All these numbers are just orders of magnitude,
just to give you a sense.
In terms of time, it probably takes a few days to do one
of these runs, compute costs, maybe around like $100,000.
And here, the bottleneck is data and evals.
So when I say data, I really mean the quality of data,
because 100,000 problems is not that much.
So it's really about how high-quality the data is
and whether you can evaluate whether you're making
improvements on your run.
So how good you're improving your run.
And this is really important, because when
you do for this RLHF or this post-training,
you have to balance many things together.
So you need to make sure that you're actually
tracking your performance on all these different axes.
So a specific instantiation of this RLHF model
is a Lamma Instruct.
So when you hear instruct at the end of the model,
that usually means they just went through RLHF,
because it can follow instructions
that are given by humans.
Great.
And this last part, which, actually, as I said,
usually comes second in the pipeline,
is what I call the reasoning reinforcement learning.
And so the idea here is to think on questions
where there's objective answers or where you have access
to ground truth.
So you will hear, recently, you will see here
from open-source models also that perform very well on math,
and coding, and things where it's actually pretty easy
to get some ground truth answers, for example,
like passing test functions in coding or passing
some math exam.
And this is what you want to optimize
during this reinforcement learning for reasoning.
So this second stage is only true for reasoning models.
So one example is DeepSeek-R1, which
was the first open-source reasoning model.
In terms of data, in the R1 paper, they don't say exactly,
but you can read through the lines,
and you can look at the plot and try
to extract around the amount of problems
that they are actually training on.
So in terms of data, it's probably
around a million problems that they're training on,
probably takes in the order of weeks
to train this second stage, so this reasoning stage,
around $1 million.
And here, the bottleneck is reinforcement
learning environment and hacks.
So what I mean by that is, as I said,
this is about optimizing for objective truth
or optimizing, for example, if you take the case of passing
test functions, encoding, the bottleneck is-- how many
test functions can you get?
And one thing that will usually happen
is when you start optimizing for these test functions or test
cases, you will see that the model
will start optimizing things that you weren't expecting.
And so maybe we'll be able to pass the test case by,
for example, removing the test in your environment
or replacing it with always returning to.
That's one of the types of things that the model may do,
which is what we usually call hacks.
So the way to think about hacks is just,
the model found a way of optimizing the reward,
even though that's not what you were hoping they would do.
So this is usually a pretty big bottleneck,
because models are really good at optimizing things,
even if it's not exactly the type of thing
that we want to optimize.
If you write something to optimize,
they will optimize it exactly as written.
And usually, the second and the third stage,
I will bundle them together and call them post-training,
which comes after pretraining.
It depends.
Different people have different names.
That's how, I believe, R1 and Kimi
talked about that in the post-training stage.
Great.
So the LLM training pipeline, there
are basically five things that you
need to consider when training an LLM.
First is the architecture.
So what model architecture are you using?
So you probably all heard about transformers or about mixture
of experts, which is a kind of a variant of transformers.
And then there's a training algorithm and the loss,
so that means, what are you optimizing for this architecture
to do?
So what are you trying to optimize?
And then there's the data in the RL environment,
so that's what we talked about before, this evaluation, which
is knowing whether you're making any progress.
And then last part is systems and infra,
to make sure that you can scale up these runs.
So until, I would say, 2023, most of the academia
was actually focused on architecture, and training
algorithm, and losses.
I've done also a PhD.
That's what I was focused on until around 2023.
And really, there were few people
who were working on the rest, but that was the main product
of academic research.
But in reality, what matters in practice is the last three.
So what matters is usually, data, evaluation, and systems
to be able to scale.
So people usually want to work on architecture
on developing new algorithms for optimizing your model,
but these things matter much less,
as we will see, than really, how much data do you put in?
What's the quality of the data?
Are you measuring your progress well?
And do you have the infra to actually scale things up?
So I will not be talking about architecture,
mostly because, at this point, the architecture is not changing
that much in the open-source.
And so it seems to be mostly using transformers
with mixture of experts.
And a lot of people like talking about architectures,
so you can find a lot of information
about what architectures are being used.
At least, currently, it's really not as important,
so that's why I will not be talking about that.
OK.
These two last spots that I didn't
talk about in the pipeline for training LLMs.
And I consider them more about specializing the LLM.
One is prompting.
So once you usually have a model, for example, a big lab.
Yeah, big lab might release a big open-source model
or a closed model.
And then people will be able to interact with it
and specialize that model for their use cases.
And the usual way that people do it is, first, just by prompting.
So prompting is really just knowing
how to ask questions, essentially.
So it's the art of asking the model what you want.
What is nice is that you don't need any data,
and it's pretty fast.
You just try a few examples, and you see how it works.
And there's no compute associated to it or very little.
And the bottleneck is eval, like, how do you make sure
that you actually have a good prompt, that you're
asking the right question to the model?
And then the second part is fine-tuning.
So we will not be talking about it, but just to mention it here,
briefly, fine-tuning is basically
continual post-training or an additional post-training,
where you basically apply the second stage of clustering
to domain-specific data.
So for example, imagine that all these companies
release pretty general models.
And now, you want to specialize it to some specific domain,
like medical data.
So internally or for your project,
you might have some specific data
that you want to optimize for.
And you will take these open source models,
and you will be basically fine-tuning,
so doing a little bit more of training on your specific data.
And just like post training, this requires around maybe
10,000 to 100,000 problems around days.
And compute costs around like 10,000 to 100,000.
And here, again, just like [INAUDIBLE]
the bottleneck is really the quality of your data,
and evaluation.
How do you know whether you're making progress?
Great.
So let's talk about pretraining.
I will talk about the method, what pretraining is,
the data, and the compute that you need.
So in terms of pretraining, as I said,
the mental model, the metaphor that I like giving to people
is that pretraining is about predicting the next word.
And a way to think about it is, just for example,
when you type a message, you will usually see your phone,
like predicting what's the next word that you will predict.
And this is exactly how pretraining works, or not
exactly, but mostly, this is the metaphor that I like giving.
And it's essentially how pretraining works.
So the goal of pretraining is to teach the model everything
in the world.
And the way that we basically achieve that
is just to predict the next word.
Because if you can predict the next word,
then you must understand.
If you can predict the next word on every single domain,
then you must have some understanding of that domain.
And that is basically what pretraining is.
So in terms of data, it's basically any reasonable data
on internet, as much as possible,
because again, you want to have models
that understand, as much as possible, about everything.
So you really want to give as much data as possible
for the model to learn on.
So in terms of scale of data, you
have around, maybe, I said more than 10 trillion tokens.
So for Llama 4, for example, I believe
the models were trained with between 20 to 40
trillion tokens.
For DeepSeek V3, I believe it was 15 trillion tokens.
So it gives you an order of magnitude
of data that you need for the current best open-source models.
So that 15 trillion tokens corresponds to approximately $20
billion unique web pages.
So that's a lot of data.
It's not all of internet, but it's basically
all the clean data that people can find on internet.
And pretraining has really been the key
since GPT-2 in 2019, that mostly showed the world what
pretraining can do.
And basically, just using a simple method
like predicting the next word, but you just do it at scale,
it really showed how smart the models can become.
OK.
So what is actually happening under the hood?
I'll give you first a brief overview in terms of tasks.
As I said, it's about predicting the next word.
So the steps are the following.
First, you tokenize the data.
So here, I have a sentence, "She likely prefers."
And the goal is to predict the next word.
So in this case, the next word is dogs.
So "She likely prefers."
So what you will do is you will split up "She likely prefers"
into different tokens, which are basically different subwords
or different subunits.
The reason why we do that is because, I
mean, they don't understand words, they only
understand numbers.
So you have to take these words or you
have to take this sentence and split it up into numbers.
So that's what we call tokenize.
I split it up by word here, so by space.
So I have "She likely prefers."
And I say all these three words become tokens.
And I will associate all of these tokens
with different index.
So "she," I will give it 1, "likely" becomes 2,
and "prefers" becomes 3.
And this is just one way of converting, again,
these words that computers don't understand to numbers
that computers can work with.
Then you will do what we call a forward pass, which
means that you will pass through the model.
We'll see exactly what happens later.
But you will pass it through the model.
So usually, this is a transformer.
And then you will have this model
try to predict a probability distribution.
So categorical distribution that tries to predict what
is the probability of the next word.
So for example here, you see that "She likely prefers,"
it's very unlikely to say "she" again.
But it's very likely to say this word, which in this case,
is dog.
And then you will sample from this probability distribution.
So once you have a model that predicts the distribution,
you can just sample.
And that's why, every time you ask a question
to some open source model, it will not always
get the same answer because you actually have the sampling step.
You sample and then you detokenize.
Because again, when we talk about it
in this categorical distribution,
that just tells me that this is token number 5.
Here, I have 1, 2, 3, 4, 5.
I have one index here.
So I have five.
And then I need to look through my dictionary.
That tells me, index 5 was actually the word "dog,"
so the token "dogs."
So that's detokenized.
So the last two steps, this is not super important.
But last two steps are only happening at inference time.
At training time, all you can do is
you can always keep predicting the next word by predicting
the probability distribution and just
optimizing your cross-entropy loss, which, I'm sure,
most of you are familiar with.
So you don't actually need to do the a sampling.
So these two steps are only doing inference.
Great.
So now, I want to give you some intuition
about why this can even work.
And to do that, I will talk about,
honestly, the most simple language
model that you could think about.
And this is the N-gram language model,
which was already used at scale in 2003, so a very long time
ago.
It already worked pretty well.
But I think it gives a good intuition
of what is happening under the hood for the current models.
So the question here is, how can you learn what to predict?
Because we talked about, before, I
said, oh, you just do a forward pass to your model
and just predict the distribution.
How can you learn that?
So one way you can do that, so let's take an example and say,
how can you build what comes after the sentence,
"The grass is."
And you probably know that after "The grass is,"
it's most likely to be "green," for example.
So how can we know that?
How can we teach them all to do that?
Well, the solution is statistics.
Statistics is always the solution
to most of your problems.
So one way you can do that is, you can take all the occurrences
of the sentence "The grass is" online, or for example,
take all the occurrences of "The grass is" on Wikipedia.
And now, you can predict the probability of every word that
comes after "The grass is," by looking at the number of times
that that word appeared after "The grass is"
normalized by the number of times that you saw the sentence
"The grass is."
So let's say that the sentence "The grass is"
happens 1,000 times on the web pages that you looked at.
And maybe half of the time, so maybe 500 times,
the next word is "green."
And maybe, I don't know, 100 times, the next word is "red,"
you will have then probability of "green," knowing,
after "The grass is" will be half, so 500 divided by 1,000,
and for "red," it will be like 10%, so 0.1.
So that's a very simple way of predicting
the categorical distribution of the next word.
But it would actually work pretty well.
At least for simple things like "The grass is,"
that would actually work pretty well.
There are still a few challenges.
One is that you need to keep count of all the occurrences
for each of these N-grams.
Or at least, in this case, each of these sentences
that happened, you need to keep count of every word that
came after it.
So just think about it in terms of memory,
it's a huge memory requirement for storing all of that.
So that's unfeasibly large.
But it will still work pretty well for simple things.
And then the other problem is that most sentences, maybe not
most, but a lot of sentences might be unique.
So if there's something that never happened in your training
corpus, so if you never saw this very long text,
let's say that instead of "The grass is,"
I gave you 100 lines of code, and I asked you,
what's the next word?
Maybe you never saw these 100 lines of code at training time.
And then your predictor will have no way to generalize,
because basically, count will be zero.
And so we'll give a probability of zero,
even though the probability is actually a little bit higher
than zero.
That's like two problems that you
would have with this very simple statistical language model.
And so the solution is very simple.
Just use neural networks.
So I'm sure, many of you know about neural networks,
and we're going to assume that you do,
But you can basically approximate this prediction
by using this parametric estimator that neural networks
are, instead of this non-parametric estimators
that we talked about here.
Great.
So let's go through what I will call here neural language
models.
So it's a language model of a neural network,
which is what everyone does.
So the way that this works, again, at a very high level,
is that you take a sentence, for example,
"I saw a cat on a," I will basically split the sentence
into different tokens.
So these are all these words.
I will associate all these tokens
with a word embedding, so like a vector
representation of that word.
So the way that you can think about it is that,
imagine that this was in 2D.
You basically have a plane.
And you basically have all these points
that are on this plane, where usually, most similar words are
clustered with one another.
So you might have "I," "saw," "cat," and things like this.
It's just that instead of being 2D,
it might be much higher dimensional,
and it might be like a vector space of like 768 dimension
or something like this.
Then you pass that through a neural network.
So neural network, the way to think about it
is just some nonlinear aggregator of these vectors.
So it's just something that takes
all these vectors as input.
It does some munging.
And it gives you another vector.
The important part is that it's differentiable,
so you can actually back propagate through that.
That's the most important.
But for example, you could use a very simple neural network could
just be an average of this.
You could literally just average all these tokens together
or the vectors associated with these tokens.
And it gives you another token here, which is, intuitively,
it's the vector representation of all this sentence,
"I saw a cat on a."
So yeah.
So again, you could take some average
or you could just take some nonlinear aggregation,
like a passing through a neural network.
Then what you do is that, this vector representation
is in the wrong dimension, because what you want to do
is you want to be able to predict
which is the most likely word.
So you want to predict the probability of each word.
So you want a representation that lives
in this dimensional space, that is the number of dimension
is the number of tokens, the number of words
that exist in your language, for example, English.
So what you will do, very simple way
is that you can just pass this through a linear layer.
So you can just multiply this by a matrix
to take this H representation that lives in d dimension
and pass it to your vocabulary size dimensions.
So let's say, very concretely, you have 768 dimension.
Let's say that your vocabulary might be 20,000 words that you
might want to predict in English.
And you will basically multiply this by a matrix of 768
by 20,000.
And then you will get a vector out of it.
That is a 20,000 dimensional weight.
So once you have this, you will just pass it through a softmax.
So softmax is the usual trick to get
a categorical neural distribution from any vector.
So this just ensures that basically you
have numbers that sum to 1 and are between 0 and 1.
And then you can just consider that as the probability
of the next word after "I saw a cat on a."
Here you are.
You basically have this prediction of the next word.
Great.
And once you have the next word, you
can just optimize the cross-entropy loss.
So basically, just try to optimize what the real word is.
Let's say that the real word comes from here.
You will basically try to maximize a little bit this one
and minimize all the rest.
And then you just backpropagate because everything
is differentiable.
And that will basically tune all the weights
that you have in your neural network, including
also these word embeddings.
So this representation for every word.
OK, that was a very brief overview.
But hopefully, you get a sense of what a neural language
model is.
OK.
So now that we talked about the method, let's talk
about the data that goes into pretraining.
So the idea, as I said before, is to basically use
all of the clean internet.
So use as much data as possible and everything
that is clean on the internet.
Why do I say all of clean internet,
because majority of internet is pretty dirty and not
representative of what you want to ship to users
or what you want to optimize your model on.
So a very practical type of pipeline.
So every different lab and different pretraining groups
have different ways of doing it, but that's just
to give you a broad overview.
You first download all of internet.
So usually, people use in the open-source
some crawlers that already downloaded internet for them.
So basically, for example, the Common Crawl
is a crawler that already downloaded 250 billion pages.
So that's around more than 1 petabyte of data.
And they're all in these work files.
So basically, you download all over the internet.
Oh yeah.
And how these look like, so these files,
it's basically just HTML.
I mean, it's hard to understand.
You see here some meta, some keywords.
And here, you will find the text that says, blah, blah, blah,
blah, or here, paragraph, one of the best and most rewarding
features of the blah, blah, all that.
So it seems to be like an ad talking
about rewarding features.
And then it talks about downloading free question
and answers.
Anyways, it seems to be kind of an ad.
So anyways, this is a random website
that I took from Common crawl.
So as you see, it's hard to pass.
And probably, this is not even something
that you really want to train on if it's an app.
Great.
So that's just an example of what you have.
Oh, my computer stopped.
Oh, like this.
Great.
Second thing that you do, so as you just saw,
you have this HTML.
So what you have to do is you have to extract text out of it.
So it's actually pretty challenging.
There will be some question of how
do you deal with JavaScript, or boilerplate code,
or math that is run differently, and things like this.
So you will need to extract text from HTML.
This is actually pretty computationally expensive
because at this point, the name of the game is how much data
you can have.
So you really, you have a lot of data
that you have to clean and extract from.
So that's actually pretty computationally expensive.
Then you will do some filtering.
So one filter that we usually do on the open-source world
does pretty early on, is filtering
for undesirable content like PII data, or not safe for work data,
or anything that is harmful.
You will try to remove this.
Then another very common filter that people usually do
is deduplicating your data.
And then the deduplication could be by document,
it could be by line, it could be by paragraph,
it could be at different levels.
But the idea is to not train too many times
on the exact same data.
For example, if you train on forums, let's say,
on all the data that you have on Wikipedia or Stack Overflow,
you will always have these headers and footers
that are duplicated.
And you definitely don't want to train a million times
on the exact same Stack Overflow header,
because you don't learn much from it.
So you would basically be losing compute to try
to learn the header perfectly.
OK.
And then you will do some heuristic filtering.
You might do some heuristic filtering.
For example, you might try to remove low-quality documents.
Low-quality might be that there are too many words.
If it's an extremely long document,
it might be suspicious.
If it's a very short one, let's say
there's only 10 words, probably, it's not worth training on.
If there's any kind of outlier tokens,
like tokens or words that really are extremely rare, yeah,
it might be that this is just bad data.
So you will do a lot of these heuristic-based filtering.
And then you might also do some model-based filtering.
So one idea that I find pretty neat that people have been doing
is trying to basically do distribution matching.
So you find some distribution that you think is high-quality.
For example, you might say, Wikipedia
is pretty high quality.
Or you might say, every page that is referenced on Wikipedia
is likely to be high quality, because that means that someone
went and referenced that page.
So that's already a pretty big amount of data.
All the websites that are linked by Wikipedia is pretty large,
but it's still very little compared to the amount of data
that we need for pretraining.
So what you might say is, I want to find
more of that type of data.
And the way you can do that is that you can train
a classifier that takes some random data in and says,
this is not a reference on Wikipedia,
and the pages that are referenced on Wikipedia.
And you try to predict, basically, yes
to this and no to the former.
And then one you train that classifier,
you basically have a classifier that essentially predicts
how likely it is that your document is referenced
by Wikipedia.
And then you can just do a filtering based on that.
So you can say, if it's very likely, I'll keep.
If it's not likely at all, I will throw it away,
because it's probably some bad data.
So this is some model-based filtering.
And you can do a lot of that.
And then you can do some dynamics.
Data exchanges, for example, you might classify the category
of data, whether it's like code books
entertainment, any of these domains.
And then you might want to reweigh different domains.
So for example, if you train a coding model,
you want to reweight coding, probably, there's
not enough code online.
So you want to say no.
Even though I only have 5% of coding,
I want to bump it up to 50%.
And the way to do this reweighting,
you can usually do these experiments at small scale.
This is true for any of these filtering.
You might do these experiments at a small scale,
try to understand what is best, and then you
will try to predict what to do at larger scale.
Great.
And at the end, we'll talk about that too.
But once you did all of this pretraining data,
you will also try to collect some higher quality data.
For example, you might say like, Wikipedia is super high quality
or everything on arXiv might be really high quality.
So you will keep this second distribution
of high quality data.
And usually, after training on this pretraining data,
you will do what we call mid-training, which is training
on this high-quality data.
The idea being like, well, we don't have enough of that data,
but we know it's high-quality, so we will try to fine-tune
or optimize after pretraining or continual pretraining
our model on that high-quality data, such as the model
lends to really be as good as possible.
OK.
So pretraining data.
One paper that I would recommend reading is this FineWeb.
And it's both a paper and a blog post about FineWeb data
sets from Hugging Face.
And they talk a lot about what filtering they've done,
but this is just one plot from the paper.
And here, the x-axis shows the amount of tokens,
billion tokens that you train on.
So this is still pretty small compared
to the scale of pretraining data that we
talked about before, which is more than 10 trillion tokens.
And this is the aggregated accuracy, where it's basically,
your performance on a whole set of evals.
And here, what they show is, first, this green line
is when they took 200 trillion tokens from, I believe,
Common Crawl.
So this is basically raw data.
And then they applied a lot of filters.
So not safe for work blocklist.
They mostly went for English text, some simple document
filtering.
For example, if it's too much repetition in the document,
they removed it or if it's the wrong lens.
So that's this first filtering, one from 200 trillion tokens
to 36 trillion tokens.
And here, we see how well you perform
when you train on 360 billion tokens from those.
And here, you see the performance gain
when you deduplicate data.
So the way that they've done it is, they said, essentially,
I don't want to have text that is
duplicated more than 100 times.
So that's basically at a high level, so they filtered data.
They filtered by nearly half, from 36 trillion tokens
to 20 trillion tokens.
And you see that training on that
really improves performance.
Again, that's because you're basically not forcing your model
to learn things that are duplicated and not that useful.
And you really focus on new data.
So yeah, that worked pretty well for them.
And I mean, 100 documents that are duplicated
is still quite a lot.
But usually, you can have a huge classes of like 100,000
duplicates.
So those are the ones that they wanted to filter out.
And here, you see some additional filtering.
So for example, they remove the, I believe, JavaScript.
They remove lorem ipsum text and things like this.
So that removed, again, a little bit more of data.
And you see that it performs better, and then
some additional, I believe, model-based filtering
that performed even better.
Great.
So that's pretraining data.
And then there's midtraining.
So as I said before, the idea of midtraining
is basically continuing your pretraining
and to adapt your model to have some desired properties,
or to basically, adapt your model on some high-quality data.
So usually, you do it on, basically, less than 10%
of what you did for pretraining, so less than a trillion token.
So you might, for example, change the data mix
in your data.
So you might say, I want to have a lot of coding data at the end
or I want to be more scientific and have
a model that is really good at basic science questions.
Or you might want to optimize more on multilingual data.
Let's say that you know that a lot of the data that we have
access to is more English, but this
is not representative of the languages
that people usually speak.
So you might say, OK, I'm going to upweight some other languages
that we usually are less represented in our data sets.
And just to make sure that it's basically
representative of how many people speak that language.
Some other type of things we do during training
or that we might want to do is that you usually want
to increase context length.
So in many of these models, you usually hear this idea of,
how much context can the model see?
And when you do pretraining, you don't
want to train on very large context lengths,
because that's much more computationally intensive.
But you do want the model to be able to understand,
let's say, 120 tokens that came before your question.
So usually, what you do during training
is that you will bump up this context length,
so you will do some extension of context length
during my training.
For DeepSeek V3, they went from 4,000 contexts during
pretraining to 128 during midtraining.
And I think, many other open-source projects did that.
Other type of data that you might want to add
is some formatting or instruction following.
So you might want to already teach your model
to answer questions when you ask a question
or to write in a very specific chatty way.
And some high-quality data.
If you have some high-quality data,
you might keep it for the end and be like, OK, first,
I want to learn how to speak grammatically correctly.
And then I want to actually learn the real meat of the text
that you have in your data.
And you might have some reasoning data
about teaching the model how to think, which, I believe,
is what Kimi did.
And yes, many other things.
Great.
So pretraining and midtraining, let's just do a recap.
One is that really, this data during pretraining
and midtraining is really a huge part of training LLMs.
I would even say that it's basically
the key for training LLMs.
And there's a lot of research that has already been
done and a lot more to be done.
For example, how do you process well and efficiently?
I mean, these are huge scales of data that we're talking about,
whether to use synthetic data, whether to use big models
like generate more data.
How much multimodal data to put in?
How to balance your domains?
And all of that.
And there's a lot of secrecy.
So most companies are not talking about what they do.
Even in companies that actually do open-source models,
I don't usually talk that much about the data
that they collected.
First, because it's the most important thing.
So this competitive dynamics, they
don't want to tell you what they've been training on,
because that would be easier to replicate.
And then some companies might be scared about copyright liability
if they train on data that they shouldn't have trained on.
So here are a few common academic data sets--
C4, The File, Dolma, FineWeb, I just wrote a few.
So FineWeb is the one we talked about before
with 15 trillion tokens.
And this is the composition of the file.
And you see that in the file, there's
a lot of archives, and PubMed, and high-quality data.
And you will have also a good amount of code and things
like this.
Great.
So just to give you a scale of these data, as I said,
Llama 2 was trained around two trillion tokens, Llama 3,
around 15, Llama 4, between 20 and 40 trillion tokens.
So every new generation tries to train on more data
and does also some better filtering.
OK.
So that was about pretraining data aspect.
Now, let's talk about the compute.
So empirically, one thing that is super important
is that empirically, for any type of data and model,
the most important as I said before,
is how much compute you basically spend on training.
So by how much compute, I mean, both how much data
do you put in the model and the size of the model.
Because if the model is bigger, you need to spend more compute.
And what is very nice is that you can actually
predict pretty well the performance,
at least during pretraining.
You can predict pretty well the performance
that you will achieve if you just pour
more compute into your run.
So if you just train for longer or train bigger models,
you can predict pretty well how well they will perform.
So here, the way to interpret this plot
is that on, the x-axis, you see the amount of compute
that you have in your run.
This is in log scale.
And here, you have your test loss also in log scale.
And all these blue lines are basically different runs.
And then you take the minimum achieved for all of these runs.
And you can link all of them together.
And it gives you something that looks pretty close to a line.
And then you can just fit the line
to discover the ideal compute and test loss.
And now, you can use this line to predict how well can you
perform if you train with 10 times more compute or 100 times
more compute.
So what is very nice is what I wrote
here is that you can now do research at very low scales
and then predict how well it will perform at higher scales.
So this is what we call a scaling loss,
which one is very surprising.
There's really no good reason for this to happen or at least,
yeah, it could have been differently.
And there are some theories for why that happens.
And two, I mean, it's very nice when you do research,
because now, it means you can work at this small scale
and has really been great for the field.
OK.
So scaling laws, what is nice, as I said,
is that now, you can tune things at lower scale.
For example, if I ask you a question,
and I gave you 10,000 GPUs, I asked you,
how should you be using these 10,000 GPUs?
How should you be training that model?
Historically, what you might have done
is you might tune high parameters for different models.
So you might say, OK, I'm going to have 20 different runs or 30
different runs.
And I'm going to pick the best, and that's the one
that I'm going to ship.
But as a result, each of them will only be trained on 1/30
of the compute that you had access to.
The new pipeline is that now, you can find scaling recipes.
So you can find recipes that tell you
how to change the learning rate with different scales and things
like this.
And then you can tune high parameters
at small scale for a very short amount of time.
You can do many, many iterations.
And then you can plot the scaling law,
extrapolate how well you will be performing at larger scale,
and then train one huge model at the end, where
you use way more of the compute that you have access to.
So maybe 90% of your compute goes for the full run rather
than 1/30 of what you had before.
So yeah, this is really a blessing.
OK.
So for example, very concretely, should you
use an architecture that is a transformer or an LSTM?
You see that transformers.
This is the scaling law for transformers.
And here, you see LSTMs.
You see that transformers have a better constant,
so that means that they are always lower than LSTMs.
And they also have a better scaling rate.
You see here that the LSTM seems to be plateauing a little bit.
So that tells you both that at any scale,
transformers is better, but also,
the larger the scale, the better the transformer becomes,
which is why most people gave up on, essentially,
LSTMs as an architecture.
But what's interesting is, it could also just
be that the constant is better for one of the architectures,
but the scaling rate is better for the other one.
And in that case, you definitely want
to always go with the scaling rate, not the constant,
because who cares how well it performs at very small scale?
The real question is, what if it's 200 times larger.
How does it perform then?
And that's why the scaling rate is what matters.
Great.
So one very famous paper about scaling laws
is Chinchilla, that tries to show
what is the optimal way of allocating
training resources between the size of the model and the data.
Because both of these things are about compute.
And as we said, the more compute the better.
But there are two ways of spending compute,
either you train for longer or you train larger models.
So they have these results I'm going to skip a little bit,
but you can basically predict the optimal resource allocation.
And they found that for every parameter,
you should be using around 20 tokens.
So that's this optimal resource allocation.
One thing to note, you will hear often about chinchilla,
but Chinchilla is only an optimization
of training resources, it doesn't consider inference cost.
So they will only ask themselves.
They only ask, what is the best way
to achieve a certain training loss?
Where should I be putting the compute?
But it doesn't take into account that if you
have larger models at inference time,
they will actually cost more.
So for example, I mean, let's say, for example, for OpenAI,
for ChatGPT, the larger the model,
the more you will spend per user.
So you might be better off actually, training for longer
and training a smaller model, even
if it means that you need to spend more compute to achieve
the same performance, because at inference time,
it will cost less.
Anyway, so that's the Chinchilla paper.
And then I want to talk a little bit about the bigger
lesson from Sutton.
So basically, the bitter lesson, I
would really recommend reading this blog post
from Richard Sutton, which is really
the big researcher of reinforcement learning.
And he wrote that blog post that essentially tries to says,
the only thing that matters in the long run
is about leveraging compute.
And the reason why is because we see empirically
that the more compute we put in the models,
the more improvements you get out of it.
So basically, more compute equals better performance.
And we also know from Moore's law and some derivative laws,
that we will always have more compute, or at least,
that's the hope.
We will always have more compute every year.
And if you put these two things together,
more compute equals better performance,
and you will always have more compute.
And then the natural things that come out of it
is that it's all about leveraging computation.
There's no reason for trying to optimize things
at your current level of compute,
because next year, you will have more,
and that will just perform better.
So what matters is to have methods
that will scale up really well.
So that's the TLDR for the bitter lesson, which really
has driven a lot of how the community has
been thinking in the last, I would say, three or four years.
So yeah, so the summary is, don't spend
time overcomplicating things.
Do the simple thing, and make sure
that it scales, because what matters,
again, is not tuning this constant performance.
It's really making sure that you can scale it up.
Great.
So for training a SOTA model, this is a slide that I wrote
maybe two years ago or one or two years ago for training Llama
3 400B, which, at the time, was the largest open-source model.
And I just tried to predict how much that would cost.
So in terms of data, I was trained
on a 15.6 trillion tokens, 405 billion parameters.
And you see here that it uses around 40 tokens per parameter.
So that's pretty trained compute optimal by Chinchilla standards.
In terms of FLOPs, it uses 3.8 E25 FLOPs.
So it is an executive order that says
that you need to be more careful when you open source models
or when you train models that are more than one E26 FLOPs.
So this is around 2 times less than executive order.
In terms of FLOPs, in terms of compute, they use 16,000 H100.
And if you do the computation in terms of time,
it probably takes around 70 days of training to train this model.
And in terms of cost, my rough estimate
is that it would cost around like $52 million
for training this, so between 50 and 80 million,
depending on how much you consider they spend per compute
given a stack on clusters.
And in terms of carbon emitted for training this,
this is around, just for training a one model,
maybe 2,000 return tickets from JFK to London,
so from New York to London.
So that's quite a lot.
It's still neglectable compared to,
I mean, how many flights there are per year
and things like this.
But if you think that every generation is
going to be maybe 10 times more compute
than the previous generation, you
could see how in 2, 3, 4 generations, that
will become a real issue in terms of carbon emitted.
In terms of next model, as I said,
basically, every generation, you can think about it
as 10 times more FLOPs that go into training the models.
Great.
OK.
So pretraining summary, the idea is
about predicting the next word on the internet,
data., around 10 trillion words go into training these models
right now.
In terms of time, it takes months.
In terms of compute, more than $10 million.
The bottleneck is data and computation.
And some examples might be DeepSeek V3 and Llama 4.
OK.
So now, we talked about pretraining.
Let's talk about post-training.
Again, I'll talk about the method, the data and compute.
So why do we want to do post-training?
Well, language modeling, so what we do during pretraining,
is really not about assisting users and about helping users.
So language modeling is not what you want.
And what I mean by that is that if you just take GPT 3,
and you prompt it with "Explain the moon landing
to a six-year-old in a few sentences,"
what it will do is that it has been trained on basically,
a large part of internet.
So we'll say, well, that reminds me
of maybe large lists of questions that people might ask.
So instead of answering the question,
it might just predict what is a similar type of question
that people might ask.
So actually, what GPT 3 answers to you
is explain the theory of gravity to a six-year-old.
Explain the theory of relativity to blah, blah, blah.
So this really shows you that these models are really not
optimized for predicting what you want,
this is just about language modeling
predicting the next word.
So the idea of classic post-training,
also called instruction following or alignment,
is about steering the model to be useful on real world tasks.
So if I ask "Explain the moon landing to a six-year-old
in a few sentences," so the the same as before,
I want ChatGPT or any model to give me a real answer.
And the way that we basically do that
is to maximize the preferences of humans.
So maximize answer preferences of humans.
In terms of data, probably between 5,000 and 500,000
problems, so much, much smaller scale.
In pretraining, the idea is that, first,
you try to basically learn everything in the world,
and then you try to optimize on very specific domains, which,
in this case, just like instruction following
and answering questions with very few data
points, because the model already knows everything.
So it just needs to learn, basically, how to act
or how to interact with the human.
And this is really what made ChatGPT what it is.
So since 2022, that's really when
post-training became important.
So that's the overview of this third stage
that I told you about, which is the classic post-training.
And then there's the second stage,
which is about teaching the model to reason.
So that only happens in some models,
for example, Kimi and R1.
And the idea is to optimize simply answering correctly
the question.
So you will see some, for example, in 01,
it says things like thought for 24 seconds.
So reasoning is about, how do you optimize for that?
The data, it usually optimize for, basically,
any hard task with verifiable answers,
so things like math competitions or coding test cases.
And you try to optimize for that.
So this really became important since 01 in 2024.
And yeah, this is about this new paradigm of reasoning.
And I believe, Norm from OpenAI will also come and tell you
about reasoning.
But at a very high level, the idea
is that, what we had before was train time, compute time.
I mentioned to you scaling laws, which
shows that the more compute you put in the run during training,
the better your performance is.
And what reasoning gave you is test time compute.
So after training, you can also pull more compute in your model
to get better performance.
And that's like humans.
If you make me answer a question in the second,
I will probably provide less thoughtful and less correct
answer than if you gave me like a week to answer the question.
So the goal was test time scaling.
Let me put it a bit more light.
Great.
So post-training methods, I will talk about SFT and reinforcement
learning.
So the task is alignment.
Just as an example, let's say that we
want to optimize the arm to follow user instructions
or something like designers desires.
So this is the example from before, which
is like answering questions.
Or maybe you want the model to never answer
specific type of questions.
For example, if I ask, "Write a tweet describing how x people
are evil," you might want your model not to answer that
question.
So what I told you before is that the intuition
of portraying, in general, is that you actually
know what you want to provide to these models.
You do know the type of answers to give to humans.
So you do know the type of answers
that you want to give to humans and what
you want your model to follow.
But that behavior, these answers are scarce and expensive,
so it's pretty expensive and slow to also collect
that type of data.
But you could just go and try to ask humans,
what are all the correct answers to every question
that you might want to ask?
So the idea is that you know what
you want your model to output, but it's
expensive to collect that data.
But pretraining is something where it's
very easy to collect that data.
You just take all of internet, essentially,
but it's not really what you want, as we said.
So the idea is that, given that one is scalable,
but it's not what you want, and the other one is not
scalable is what you want, what you can do
is that you will basically take the pretraining
model that already, you learned about grammar
and different languages.
And you will just fine-tune it or do some small optimization
with the little amount of data that is in the format
that you want.
And this is what we call a post-training.
OK.
So there are two methods.
The first one is supervised fine-tuning.
So the idea is, again, just to fine-tune the LLM
with language modeling, so the exact same method as before.
But you do it on desired answers.
So instead of just predicting the next word,
you predict the next word on answers
that are the answers that you would want to give to humans.
So this language modeling, it means that it's, again,
next word prediction.
And desired answers is why we say supervised fine-tuning.
It's where the S comes from is because you assume that you
have access to the correct answer, which
is why it's supervised.
So how can we collect that data?
There are many different ways-- one is just to ask humans.
And this was the key from GPT 3 to ChatGPT, the initial ChatGPT
model.
And here are some examples from OpenAI system
that did that in the open source, where
you have a question, and then you
have answers that are written by humans.
You can also do it differently.
One problem, basically, with human data
is that it's slow to collect and it's expensive.
So one idea that you might want to do
is to use an LLM to scale data collection.
So this is what we did for Alpaca, for example,
in early 2023, where we basically said,
well, we don't have the money or we
don't have the luxury of having humans
that provide us sciences with.
What we can do is that we can use the best model from OpenAI
at the time to predict the right answer.
And we can basically try to do supervised fine-tuning
with the answer that is given by the OpenAI models.
So we did that on 52,000 answers.
And basically, these are some examples.
And yeah, that was one of the first or probably,
the first instruction following LLM in the open-source.
So that really started as a trying to replicate ChatGPT.
And now, this synthetic data generation
is a whole field on its own, because yeah,
the idea is that now, actually some of these models
are just better than the humans.
So it's not only that humans are slow.
And it's not only that human data collection
is slow and expensive, it might just be that it's lower quality.
So yeah.
OK.
Yeah.
So for SFT, there's another way of doing it.
So we talked about two ways right now.
We talked about humans.
We talked about LLMs that just provide an answer.
But the problem is that if you want
an LLM to provide an answer, you can
assume that you have access to an LLM
that is smarter than the LLM that you're training.
And that was indeed the case when we were training Alpaca,
but this is not the case if you're,
for example, in the best closed labs, which
are training the frontier models,
or even if your inputs is open-source,
and you're trying to train the best open source
models you might not have access to or be
able to distill closed models.
And so what the DeepSeek R1 do, because they
were treating this first top open source model.
The idea is that you can use rejection sampling
based on verifiers.
So what I mean by that is that you can just
use an LLM to provide many different answers to a question,
and then you only keep the answer if it's correct,
in some sense.
So if it passes some test case, or some verification,
or if it's preferred over other answers.
So the idea, again, is, well, you
don't have an ideal LLM that you can generate data from and then
train the SFT to predict that data.
What you can do is if you have access to verifiers or ways
of comparing different samples, is
that you can roll out many samples, then decide which
one is better based on your verifier,
and then do SFT on the sample that is given by the verifier.
So that's exactly what DeepSeek R1
did for the first stage of reinforcement learning or sorry,
for the first stage of safety.
Great.
So what do we learn during SFT?
What are the type of things that we can learn?
Well, we already talked about it.
We can learn instruction following and following
instructions.
You can learn desired formatting or style, be more chatty,
or use emojis, or things like this.
You can use tool use.
So I would recommend reading if you're interested in that.
The Kimi 2 paper, the excellent paper,
that basically uses SFT at scale to learn how to use tools.
You can learn some early reasoning.
So how to think before answering, which is exactly what
we just talked about with DeepSeek R1,
where they use this rejection sampling algorithm.
And honestly, you can learn anything
where you have good inputs and output pairs.
So SFT can either be seen as a final stage for training,
a final model or as a preparation
for the next stage, which is the reinforcement learning stage,
basically, given that this works pretty well,
and you might want to do that first before doing
the next stage, to accelerate the next stage, as we will see.
So SFTP pipelines can become pretty complex.
I'm not going to talk through this one in detail,
but I just want to you a sense of how complicated it can be.
This is about Kimi K2.
So I would recommend reading that paper
and how they use SFT or training for tool use.
So basically, teaching the model to use tools.
And what they did is pretty complicated with some LLM that
simulates users, simulates tools, and then
do this rejection sampling that we talked about before.
And yeah.
So the idea is that they collected a lot of tools,
they simulated a lot of synthetic tools
that tell you how the tool should be called.
And then they basically have an agent
that interacts with an LLM that simulates
user and another LLM that simulates tool calls,
because otherwise, tool calls, you might not have access
to enough different tools to really simulate old tools that
might be called by the model.
And then you basically do some rejection
sampling based on these rollouts that
were generated with these three LLM that interact
with one another-- the agent LLM, the user LLM, and the tool
simulating LLM.
Anyways, all this to say that these things can
become pretty complex but still work pretty well.
OK.
So scalable data for SFT, How much data do you need?
SFT, what is nice is that you actually
don't need that much data.
For learning simple things like style and instruction following,
maybe 10,000 is enough.
So this is from a LIMA paper in 2023 that basically shows that
already with 2,000 examples.
You learned the style and instruction following
capabilities that you want.
If you want to train more complicated things like tool use
and reasoning, you might want to increase that.
So I believe, R1 use 800,000 samples, which is a good amount,
but basically, less than a million, at least.
So yeah, the idea is that you don't
need to train on much data for SFT
if the model already learned that.
My intuition for these type of things or my mental model
is that everything that is already learned really
well during pretraining, but you just
want to surface, during post-training, things
how to write in bullet points or how to use emojis in your model.
And this is more about specializing your model
to one particular type of user, and that it has already
modeled during pretraining.
Then for that, you don't need that much data.
If it's for things that it has never
seen during pretraining or very little, and then
you need much more data.
OK.
So that brings us to reinforcement learning, so
the second method, which is RL.
So in reinforcement learning, yeah,
the problem that we try to solve with reinforcement learning
is that SFT is about behavior cloning of humans or of outputs
as we saw could be from LLMs too.
It's about behavior cloning.
It's about copying the behavior of different outputs.
And this has many issues-- one is
that it's bound by human abilities
or bound by the abilities of the LLMs that you're copying.
But humans, even if you're actually collecting human data,
they might not prefer the things that they are generating.
So even though they might not write better answers
and might not be able to write better answers,
they can still say which one they prefer.
So yeah, the idea is that you will always
be bound by human abilities.
And the second thing is that you will actually
teach hallucination.
And this is a pretty interesting behaviors
that even if you're cloning correct answers
or correct behavior, you might actually be teaching the model
to hallucinate if that model did not know
that that answer was correct.
So what do I mean by that?
Imagine that I ask a question to write some introduction.
Yeah, this is some text.
And basically, I ask you to provide some references.
If the answer provides a reference
that the model does not know about,
we see, what you're teaching the model is
you're teaching the model provide something that seems
like a plausible reference, even if that reference was not
in your pretraining course.
So even if you don't know if that exists, still like that.
So you're basically teaching the model
to make up plausible sounding references.
So yeah.
So hallucination, that's one issue.
And the third thing is that collecting ideal answers
can be pretty expensive.
So the idea or one solution is to,
instead of doing behavior cloning or SFT,
you can do reinforcement learning.
So instead of cloning the behavior,
you can maximize that behavior.
And so I would really recommend reading the DeepSeek-R1
paper and the Kimi K2 papers that some of the best papers
out there in the open-source.
And the key in reinforcement learning
is to decide, what are you maximizing?
What is the reward that you're maximizing?
There are different things, for example,
that R1 has been optimizing for, one might be rule-based rewards,
things like string matches.
Let's say that you have close-ended questionaire
answering.
You might just say, your answer is correct
if the answer is exactly x, or you could have
some test cases for coding.
So that's rule-based rewards.
You can have reward models that were trained
to predict human preferences.
So we will talk a little bit about that,
but you can basically train a classifier
to predict whether something is good or bad, as predicted
by a human, and then optimize against that.
Or you might optimize against an LLM as a judge.
So using an LLM, let's say you use the best possible LLM,
and you just say, is that answer correct or not?
So here, yeah, you see that this particular case, which says,
write a Python code, blah, blah, blah, and then the model
generates different answers.
And here, given that we say, let's write a Python code,
you might have rule-based verification that says,
well, this is not code.
Here, this is the answer.
Here's a joke about frogs.
It is not code.
I asked for Python code.
So is it Python or not?
If it's not Python, then it's also wrong.
And then it might check like if you pass some test cases,
and it will only keep the ones that are passing.
So the idea is to optimize the things that are currently
passing.
So you just say to the model, do more of the thing
that I gave you a correct reward for or positive reward for.
Great.
So yeah, as I said before, I would
recommend reading the DeepSeek-R1 paper,
but basically, they use for what they call reasoning prompts, so
like math questions, coding questions,
and some logical reasoning.
They use these rule-based verifiers
that we just talked about.
And then for general prompts, like translation,
factual question, answering, and writing requests,
things that are more long form text,
they basically use a reward model
that was trained to predict human preferences.
And they try to optimize that.
So what they do is that they start with this SFT checkpoint.
So they use a model.
They do some SFT like this.
The model is already pretty good at generating things
that are often correct.
And then you just do this reinforcement learning pipeline,
where it tries to optimize the number of times
that your verifier says you're correct.
So in terms of algorithm, the most common algorithm
in the open source is the GRPO from DeepSeek-R1.
And the idea is actually pretty simple, is
that you take a policy model, so you take your LLM.
It's usually a SFT model.
You ask it to answer multiple times
to provide multiple outputs to your question.
And then let's skip that part for now.
You basically have a reward model or verifier
that gives a reward, that tells you like, yes, it was correct,
no, it was wrong, by how much, and all of this.
You get a reward for each.
And then you do some group computation
to get your advantages.
The way you just think about it is you
do some normalization, just so that you
know which one is very good and which one is very bad,
you basically renormalize all of these rewards.
And then you basically backpropagate
to tell your policy, do more of the thing that was good.
And then usually, the one we skipped here
is this reference model.
You usually have a KL divergence,
which means that you ask the policy model during training.
You say, don't move too far from outputs from my base model.
And this is really just a hack, because you're basically saying,
reinforcement learning can get you
in places that are usually not ideal.
For example, the hacks that we talked about,
so that's one way of just saying, don't go too far,
optimize as well as you can, but with certain limits
of how far you can go.
Yeah.
And so this is not super important,
but basically, if you know a little bit of reinforcement
learning, DeepSeek-R1 optimizes GRPO, which really just uses
the Monte Carlo estimate for computing the advantage.
And Kimi K1.5 and Kimi K2 use a similar loss.
OK.
So one thing I want to emphasize is
that in reinforcement learning, infra is really key,
it's really, really important.
And the reason why is because, as you saw,
if you use this GRPO algorithm, sampling is a key bottleneck,
because for every question, you have to sample multiple outputs
for each of these problems.
And especially for agents, given that this is an agent class,
this becomes even worse, because you might have
very long agentic rollouts.
And you basically don't want to block all your training compute
on these very long agentic rollouts that
are being rolled out.
So Kimi did a lot of optimization.
And I would, again, recommend reading that papers.
For example, for long rollouts, Kimi
decided to pause these long rollouts.
So if it's more than a certain amount of time,
they will basically pause the rollout,
and they will say, OK, this is not worth it.
We will optimize our weights.
And then we will resume the rollout, the next step.
And then another issue with agents
is that the environment feedback can be slow.
So if you have an agent that really interacts with the world
and cause a lot of APIs and things like this,
maybe you're not even using your GPUs at all,
because maybe you're not even doing rollouts,
and maybe you're just waiting for the environment response.
So the way that Kimi bypasses that is by using,
really, a lot of concurrent rollouts like this one.
When certain rollout is waiting on environment response,
you can work on something else and dedicated services
that microservices that can really spin up and scale.
And the way that Kimi did it is that for every part,
they have a train engine.
Then they have a checkpoint engine
that broadcasts all the weights to all the other pods.
And then they have an inference engine
that really does the sampling.
And what is important is, everything is colocated,
all the engines are colocated on the same pod
to avoid communication overhead.
So anyways, all this to say that there's a lot of optimization
on the infra side.
And Infra is really key here.
So the communication for them takes less than 30 seconds
for communicating the weights.
And everything is, again, working on the same pod.
OK.
So let's talk about reinforcement learning
from human feedback, which, right now,
we talked about reinforcement learning for reasoning when
you have, usually, ground truth verifiers.
The reinforcement learning from human feedback or HF
is this notion of reinforcement learning when
you don't have ground truth.
This is really what made ChatGPT work in 2022.
So the idea is, instead of SFT, where
you clone the behavior of humans,
you want to maximize their preferences.
As I said, this is what made ChatGPT.
And the pipeline is the following.
This is how the original algorithm for ChatGPT worked.
You have an instruction that goes to a model.
A question, it goes to a model.
You ask the model to provide two answers.
And usually, the model is already pretty good.
It's an SFT model.
And then you ask some labelers to select which of the two
was better.
So you have some humans, hey, which one of the two was better?
And then you basically predict, you basically maximize
the number of times that you will generate the thing,
or you will tell your model to predict more
of this thing that was correct.
So there are different algorithms.
I'm not going to go through them.
PPO and DPO are two of them for doing that.
But we see, this is just reinforcement learning,
where your reward is actually given
by a reward model that was trained
to classify human preferences.
And here, you see, these are pretty old results by now.
But here, you see, for learning to summarize,
how a pretraining model performs for summarization,
SFT performs better.
This is human reference summaries, so how well
you compare to humans.
And so SFT really improved compared to pretraining,
but then you see PPO.
So this reinforcement learning made you perform even better.
And this is the order of things, where pretraining is good,
SFT is better, and RL is even better.
And same thing here.
And this is Alpaca farm, which is
a paper we did for optimizing human preferences.
And you see that the two algorithms here, those two
are two RL algorithms work similarly,
and they work better than SFT, and they work better
than the pretraining model.
OK.
Human data.
So as I said, data comes from humans.
This is a very expensive, or at least,
it takes a long time to do and pretty expensive too.
You have to write extremely detailed rubrics
to tell humans what is even considered
as a good answer, what is considered as a bad answer.
Yeah, a lot of work that goes into collecting data.
Collecting data is hard.
Challenges with human data.
As I just said, it's slow and expensive.
Second, it's actually hard for really focusing
on the content of the answers.
Most humans, when you ask them what is good and what is bad,
they will usually focus on the form or the style, things
like lengths.
And this is usually not what you want
to optimize for in your LLM.
Also, depending on who you ask, the distribution of annotators,
you will really get different behaviors,
different political views, and yeah, different views
on many things.
So you have to be pretty mindful about that.
There's also crowdsourcing ethics involved here,
like, who are you asking to label your data?
And yeah, so there's a lot of challenges with human data.
OK.
So one way to reduce this dependency on human data
is exactly what I told you about before with SFT, is
that you can ask an LLM to replace
humans to provide preferences.
And this is, again, this Alpaca farm paper
that we wrote two years ago, which
shows, on the x-axis, the amount of dollars
that you need to spend for collecting data.
In the y-axis, you see the agreement with humans.
And you see that, actually, humans, I believe
are in blue, so here.
So this is around $300 per 1,000 examples that we had to pay
humans.
And you see that the agreement between different humans
is around 66%, while for LLMs, we could divide by 10
or even by 30 the amount of money that we spent.
And that was two years ago.
Now, it would be way less than that.
And we actually performed already better than humans
on predicting the correct human answer.
So it worked surprisingly well.
So you can always use this trick of using LLMs instead of humans.
But again, this is harder to do when you're at the frontier,
and you don't have a better LLM.
OK.
And then evaluation.
So I'll talk really briefly about that.
But there's basically two types of evaluation--
closed-ended evaluation and open-ended evaluation.
And one thing to note is really, evaluation is really the key.
It is one of the most important things in machine
learning in general and AI.
And the reason why is for three reasons--
first, what it does is that it's key to identify improvements,
to quantify progress that you're making,
and to say whether you're making progress and what to change,
what high parameter to select, and things like this.
The second thing that's really important for
is that it will allow you to select which model.
So if I have which model to use for your application,
if I have a specific application in mind,
I will have all these different models
to use, yeah, to choose from.
And I would need to know which one to go after or to choose.
And finally, evaluation is really important
to know whether your model is ready to be put in production.
Even though your model might be the best current model,
is it good enough for your application?
This is very important for practical use cases.
You really need to have good evaluations of your RL labs.
So closed-ended evaluation, the idea
is that if you can turn your problem into something where
you have a few possible answers, then
you can easily automatically verify
whether your answer is correct.
For example, if you turn your eval
into a question-answering evaluation,
then you can simply ask an LLM to provide an answer that
is the answer like A, B, or C, and you simply
look at what the right answer was,
and then you just considered your accuracy.
So this is, for example, what the MMLU eval did.
So there are still many issues.
There's still challenges with closed-end evaluation.
One, it's sensitive to prompting.
Different ways that you prompt your model
will provide different answers.
Two, it might have train test contamination.
So if your model was trained on the eval, because right now,
for example, MMLU is all over internet,
maybe your model was trained on that.
It will see much better than what it actually is.
So this is about closed-ended evaluation.
I really want to focus on open-ended evaluation,
because despite these challenges,
closed-ended evaluation is much easier
than open-ended evaluation.
The question for open-ended evaluation
is, how do we even evaluate something
like ChatGPT or an LLM?
So ChatGPT or all these instruction-following models,
they can be applied on so many different things.
So you can be applying it for coding,
for chatting, for summarization, for many things.
So you really want to have an eval that covers all these use
cases.
The second thing is that it's open-ended tasks.
So what I mean by that is that you have very long answers.
And as a result, you can't do this accuracy-based evaluation,
where you just check whether the answer is verbatim,
the correct answer.
So that makes it hard.
So you cannot do this string matching to know whether
you're correct.
So one idea that you might have for closed-ended evaluation
is that you can simply ask humans to tell you
which answer is preferred.
So you might show two answers to a human
and just ask, which of the two is better?
So this is what Chatbot Arena by LLMs
did, where you basically ask humans to blindly interact
with two chatbot and rate which one is better.
So that's one way of doing that, of basically dealing
with this challenge, where, when you ask open-ended tasks,
with what is not a single answer,
and the answers are usually really long.
It's easier.
Yeah, it's much easier to just ask
humans to rank things than to compare it to a gold answer,
because there's no gold answer.
And the problem with this is that using humans
is, again, costly and expensive.
It's very expensive and slow.
So just as before, what you can do
is you can use an LLM instead of a human.
This is what we did with AlpacaEval two years ago,
and many others followed.
And the idea here, the steps is that for each question,
you might ask a baseline that could be a human or a model
to provide an answer and the model
that you try to evaluate to provide an answer.
And then you will just ask another LLM,
which of the two answers is better?
And then you will just look at the number of times
that your answer is better than the baseline.
And you can get what we call a win rate, which
is probability of winning.
So AlpacaEval was one of the first eval doing that.
And despite being much cheaper than Chatbot Arena,
it had really high spearman correlation with Chatbot Arena.
So using our LLMs can be really good as a judge
for judging your performance and for evaluating your performance.
Yeah.
So running AlpacaEval right now probably costs much less,
but at the time, was less than three minutes and less than $10.
Great.
OK.
So I think I'm getting at the end.
I do want to tell you a little bit about systems and Infra,
because as I said, if you really understand the fact that scaling
is what matters, then the natural thing
that you should be spending time on
is also making sure that your models, your training
can scale well.
So the problem is that everyone is bottlenecked by compute.
So one idea that you might have is, well,
if you're borrowing by compute, and if you
know that spending more compute gives you a better answer,
why just not buy more GPUs and training on that?
There are a few reasons why we can do that.
One, of course, GPUs are expensive,
but they are not only expensive, they are scarce.
So even if you have the money, it
can be hard to just get access to the best GPUs.
And then there are physical limitations.
So if you have a lot of GPUs, you
need to have the communication between GPUs.
And that can really slow down your training.
So you do need to optimize your systems
and make sure that training is as
efficient as possible on every GPU that you have.
So yeah, you need to do some good resource allocation,
and you need to optimize your pipelines.
OK.
So I will try to give you an extremely brief overview
of GPUs, just for you to get a sense of what
matters when you optimize these runs
and what you're actually optimizing for.
So Systems 101 GPUs, so the difference between GPUs and CPUs
is that essentially, GPUs are massively parallel.
So they will apply the same instructions in parallel
on all different threads but different inputs.
So you will have the same input that
will go on different threads.
And it will basically--
sorry, different inputs that will
go through different threads and will apply the same instructions
to them.
So really, the difference with CPUs
is that you're optimizing for throughput.
It's massively parallel.
So here, you see GPUs and CPUs, the difference.
So yeah, as I said, first, GPUs are massively parallel.
Second thing is that GPUs are really optimized
for matrix multiplications.
So GPUs are graphical processing units.
And anything about computer vision and graphics
really requires extremely fast matrix multiplications.
So from the early days of GPUs, people building GPUs
are really optimizing for matrix multiplication.
So they have specific cores that will make matrix multiplication
very fast and actually, around 10 times faster
than most other floating point operations.
So you see different versions of GPUs.
And you see the speed.
And you see that for matrix multiplication
is much faster, especially recently, much
faster than nonmatrix multiplication, floating point
operations.
So another thing that is important to understand
about GPUs is that actually, compute
is not the bottleneck anymore.
So what I mean by that is that if you
look at here, the compute, that the peak hardware scaling.
So here you have compute so FLOPs
that could be performed on the best hardware across time.
And here, you have, basically, the communication and memory,
and how it improved across time.
And you see that, basically, compute increased or improved
and much faster across time than memory and communication.
So what that means is that right now, we have more compute
in a GPU than we have memory.
And then we have improvements in communication.
So in other words, the bottleneck for GPUs
is not performing the computation,
but it's actually keeping the processor
that performs the computation fed up with data.
So you basically need to send as much data as possible there.
And the bottleneck is actually feeding the data,
not doing the computation.
And that's a very important thing
to understand when you're optimizing your pipelines.
And yeah, as a result, if you look at this paper for 2020,
that analyzes where you perform all the compute form, how
much time it takes to run a transformer.
You will actually see that things like tensor contraction,
which is basically matrix multiplication,
requires most of the FLOPs, so most of the actual compute
that is required for matrix multiplication.
But in terms of runtime speed, it's still a majority,
but only 2/3 of the runtime is spent on the thing that is most
of the compute.
And things like element-wise operation or some normalization
actually requires very little floating point operations
but requires a pretty large amount of time,
or basically, you spend a lot of time
there, because you need to still send to the GPUs your data
and do the computation, even if the computation is slow.
OK.
And the last thing that you need to know about GPUs
is that it's really a large memory kind of hierarchy.
So the closer you are to the cores,
the cores being the things that actually perform
the computation, the faster the communication with the cores
will be, the less memory that will be there.
And the further you are from the cores, the more memory,
but it's slower.
And you basically have different levels of hierarchy,
so you have this shared memory.
And then the L1 cache, that is the shared
memory that is super close.
And then you have the L1 cache, the L2 cache, and then
you have the global memory that is very far
from your registers and your process.
So yeah, this memory hierarchy and the metric
that we try to optimize when we try to optimize our runs
and just our systems is Model FLOP Utilization,
you will often hear this word, or MFU for short.
And this is basically the ratio between the observed throughput
to your model and the theoretical best.
So NVIDIA will tell you, at best, we
can do that amount of FLOPs.
And then you will check how much you are achieving.
And if you achieve an MFU of 1, that
means you are able to get all the data, the maximum of data.
You can always keep your process basically fed with data.
So at any point of time, there's something
that is being computed on your processor.
Just to give you a rough sense of these numbers,
if you have 50%, you're in a really,
really, really good shape.
And even big companies might be optimizing
to go higher than 15% or 20% to achieve this 50%.
So I want to give you a very quick overview of things
that you might want to do for optimizing your runs, just
to give you a sense of, at least things that people do
for optimizing this compute and making sure
that your runs are scalable.
One thing that you might do is low precision operations.
So the idea is that if you use fewer bits for every data points
or for every data that goes in your processor,
you will have faster communication and lower memory
consumption.
So as I said, given that the bottleneck is not the compute,
but it's this memory and this communication,
might just decrease the precision
in which you put your data in.
And as a result, you will just have faster communication,
because you can put more through the bottleneck and then lower
memory consumption.
So for example, for deep learning,
the actual decimal precision is not that important,
except for a few operations.
That's because there's a lot of noise in any case.
And when you train deep neural networks, because stochastic
gradient descent already has a lot of noise.
So matrix multiplications will usually
be done in bf16 rather than fp32,
so you can halve the precision.
And if you have the precision, what you can do?
So usually, one thing that is very common
is using automatic mixed precision or AMP
during training, which is that the weights are stored
in fp32, so use 32 bytes.
Before the computation, you will convert the fp32 to bf16,
so you will basically half the precision.
And then everything will be done in bf16.
So you will have less memory.
You will have more speed up because faster communication.
And your gradients will be stored in bf16,
so you'll have memory gains.
And then at the end, you will put it back in fp32.
And every small operation that you do
can be shown in your weights at pretty high precision.
Great.
There are other optimizations.
For example, operation fusion.
So again, the idea here is that communication is slow,
as we said.
And every time that, for example,
if you write in PyTorch, every time you write a new line,
it actually moves your variable back to global memory.
And that makes it very costly.
Because basically, if you do something like x1
is equal to the cosine of x, you will basically
read x from global memory, write it to x1.
And then when you do this new line,
x2 is equal to cosine of x1, you will again
take it back to global memory and write it to x2.
So that can be really very slow because you
have a lot of this communication between global and memory.
And so what you might want to do,
so this is just to give a schematic version of what
is happening.
So here, you have everything in memory, your DRAM.
And basically, you will send data to your processes
for performing compute.
And after every new line, you will send it back to your DRAM.
And then you have to do it again and do it again.
If you just have a PyTorch function,
this is a naive way that things are working.
But there's a much better way of doing
that once you realize that communication is the bottleneck,
is that you might just communicate once and do
all the operations, and then communicate it back.
So this is what fused kernels are doing.
So the idea is that you communicate once,
you perform all the operations that you want,
and then you send it back.
And as we said, this is actually fast, this is slow,
so there are fewer slow things here.
And this is basically what torch.compile
does to your code is that it fuses operations together.
OK.
Tiling, I know it's becoming long,
so I'll just quickly talk through that.
The idea is that the order in which you perform operation
will matter a lot because of communication.
So what I mean by that is that you can group and order
threads that are performing some computation
to minimize the number of times that you will communicate
with global memory.
So I'll give you an example for matrix multiplication.
Here, this is the very naive way of doing matrix multiplication.
This is how you basically learn it
at school, where you take two matrices that you multiply them,
you will basically go through all this column and all
this row.
You will basically multiply these two together
and multiply these numbers, and then sum across all of that.
And you get this number for this one.
And the way that basically, the memory is accessed here
is that one thread here is going to access
this one, and this one, and then this one, and this one.
And you'll basically have one thread
that is working with all of this and then all of this one.
And then you will have another thread
that is working on these things separately.
And then when this one is done, it
will work on a different column and a different row.
So what is important here is that you will rarely reread
the same values from cache.
In contrast, what you can do is you can split up your matrix
multiplication into different tiles to reuse memory.
So for example, you might say, well,
I'm going to have one thread that instead of working
with all the column and all the row at once,
it basically works with all of these four values
together against these four values together.
So it'll basically do multiply this and this together, and then
this one and this one together.
It's a bit hard to explain without actually showing it
to you and just with this diagram.
But basically, this number here, N_00, will be used twice,
will be used to multiply M_00 and M_10.
So basically, for one number that I have access to,
I made two operations.
So I have N_00, I made two operations.
So basically, I have to read less from global memory,
because by one read, I made two operations.
While before, with one read, I made only one operation.
So you're basically making sure that you make more work
with the same amount of data or same amount of work
with less data.
So you basically read less.
And you can still work through an algorithm like this
where you multiply these element-wise with this.
And then you have another thread that
works with this one, element with this one.
And then you have the partial sums, and you sum them together.
Anyways, all this to say, it's not super important.
The actual algorithm is really not that important.
What is important is that the order in which you perform
operations can really impact--
the grouping and ordering can really
impact the number of times that you have
to read from global memory.
And tiling is one way where you basically group things together
in a single thread, such that with less data,
you can do more computation.
So you can reuse the reads and basically have access
to your cache.
Great.
So FlashAttention is one pretty famous optimization
that was done for making attention faster.
And it basically combined the three things
that we talked about before, which
is this kernel fusion, this tiling, and also
one additional thing, which is recomputation.
So sometimes, it's cheaper to redo a computation
than actually reading from your memory, the values.
So basically, here, the recomputation of attention
is the idea that don't save everything.
Sometimes, it's easier to just recompute
the values then storing them.
And FlashAttention.
V1 got 1.7 end-to-end speed up from 3,
1.7x speed up gains just by combining these things together.
So all this to say that, yeah, systems really matters.
You can get huge speed up gains at no ML.
This is completely ML-neutral.
This is just an order of operation.
And yeah, the order in which you perform operation
can really improve a lot your performance.
OK.
I think I'm arriving at the end.
I do want to maybe briefly talk about parallelization.
That's the last big topic in terms of systems.
So the idea is that you have very big model.
This is one of the big problems is
that you have very big models.
And big models cannot fit on one GPU.
So you really want to use as many GPUs as possible for making
your training runs fast.
So once you think about it this way, there's a question of,
how do you split your GPUs?
Sorry, how do you work with as many GPUs as possible?
And how do you fit your model into GPUs?
And the idea is that you can split
your memory and your computation across GPUs.
So again, the problem is that models are big.
They don't fit across GPUs.
They don't fit on a single GPU.
Two, you want results as fast as possible,
so you want to put as many GPUs as possible working together.
And the idea is that you can split the computation
and split the memory across different GPUs.
And this is all about parallelization.
OK.
Slide background here is to naively train
a model that has p, parameters.
You actually need 16 times P gigabytes of DRAM.
So the reason is that you need 4P, so four bytes, because here,
we assume that it's 32 floating, like yeah, FP32.
So you have four bytes for every P parameter.
And so you have, yeah, sorry, 4P gigabytes for model weights.
And then for the optimizer, if you talk about Adam,
you need to store both the mean and the variance
of every parameter.
So the optimizer needs to store 2 times 4P gigabytes in terms
of values.
And so you have here, 12.
And here, you have 4P for the gradients.
So you need to store both the weights
when you do backpropagation to store the gradients.
And this is also 4P.
So basically, that means that for training a seven billion
parameter model, you actually need 112 gigabytes of memory,
yeah, which is really huge.
So the idea here is that you can optimize that by--
yeah, the goal, at least, is to use more GPUs
and to optimize your training.
So let's say that you have four GPUs here.
And you want to optimize.
You want to basically have every GPU working simultaneously
on your data set.
One naive way that you can do it is
that you can copy the model and the optimizer on every GPU.
You can split the data.
And then you can basically have every GPU working
on the same model but different set of data,
because you split up your data.
And then at the end, after they do one step,
you basically communicate the gradients and sum the gradients.
And that will be the total gradient
that you would have gotten if you had actually trained
on the four sets of data.
So basically, after every batch, everyone
works on a separate batch.
And then at the end, you get gradients,
you communicate, you sum them.
And then you have, basically, the same gradient
as what you would have had, had you trained on four times
the batch size.
So the benefit is that now, you can use all these GPUs,
because now, you can use four times more GPUs than before.
So it's four times faster than before.
The negative aspect is that here, you
have literally no memory gains, because now,
if your model, for example, didn't fit on one GPU,
it still doesn't fit on a single GPU right now.
And also here, we said, 7B models require 112 gigabytes
of memory.
Here, it means that you really need
to have 112 gigabytes of memory on every GPUs,
so there's no memory gains.
So how would you split the memory?
How would you get memory gains?
One way of doing that is to have each GPU, update
a subset of weights.
Yeah, update a subset of weights and hold the subset of weights.
And then you communicate them before updating your weights.
So this is what we call sharding.
So here, one way of doing that is this paper called 0.
So here, you see the baseline, which has the 4P gigabytes
for parameters.
Here, you have the 4P gigabytes for gradients.
And here, you have the 8P gigabytes for optimizer states.
And they have different levels of sharding.
So the first thing that you can shard
is you can shard the optimizer.
So you can say, well, every GPU will only have
one subset of optimizer states.
And basically, each of them contain a subset.
And we'll just communicate them when needed.
So you basically have this, and then you can do the same thing.
So this will cut a lot your memory requirement from 120
gigabytes to 31 gigabytes, so nearly a 4x decrease.
And then you can do the same thing for your gradients.
And you can do the same thing for your parameters.
So you can basically say, every GPU
takes care of a different subset of weights.
OK.
So that was for data parallelism.
And now, let's talk about model parallelism.
The problem with data parallelism
is that it requires to have at least as much data as you have
or at least as much batch size as you have GPUs.
So basically, as I said, I assume,
you have a batch size of 16.
Basically, what you're saying is, if you have four GPUs,
average GPU now gets a batch size of four,
so 16 divided by 4.
But what if I want to use 32 GPUs?
How do I now split up that data to fit into 32 GPUs?
The idea is that you can have every GPU take
care of applying specific parameters rather than updating.
So what we saw before with this data parallelism
is that every GPU can take care of updating specific parameters.
But here, the idea with model parallelism
is that instead of having every GPU taking
care of updating the parameters, you
can have every GPU taking care of applying the parameters,
so like applying the actual operations.
So for example, in pipeline parallelism,
you can say that every GPU has access
to a whole different layers.
So you can say, layer 1 is on GPU 1, layer 2 is on GPU 2.
And basically, what you have is that once you take data,
you make all the data passed to your first layer, which
is on GPU 1, and then you send it to GPU 2,
it passes through the second layer, then GPU 3, et cetera.
So yeah.
So this is for pipeline parallel.
I'm going to skip that part.
And then you have tensor parallel.
So this is the idea that instead of having every GPU hold
a different layer, you can have split matrices,
or you can split inside of a layer,
you can split it between GPUs.
So for example, when you multiply
a matrix with a vector, what you can do
is you can split up the matrix into two,
you can split up the vector into two, and you can basically say,
I'm going to operate with these matrices on this vector,
and these matrices on this vector.
And I'm going to aggregate everything at the end together.
So this is what we call tensor parallelism.
So pipeline parallelism is this idea
that every GPU has different layers.
Tensor parallelism basically split up the weights
into different GPUs.
Great.
OK.
And the last example for system optimization
is that models are really huge.
So instead of splitting up your model weights
onto different GPUs, what you can have is
you can say, well, actually, not every data point
has to go through every parameter.
And this is what we call sparsity.
So very common architecture that is sparse
is the mixture of experts that basically
says, only some parameter will be
active for some types the data points.
So the idea is that you now have a data point that comes in.
And it will only go to some set of parameters, not
all the parameters.
And this makes it very easy for doing parallelism
and multi-GPU training, because you can just
have different GPUs contain the parameters that are required
for different data points.
So here, you have a dense model that basically is a little bit
too much into the weeds, but if you know about transformers
you have this linear layer at some point.
And you can basically say, the linear layer
is going to be split into different linear layers,
and different data points will go
through different linear layers.
And every GPU can basically have access
to different linear layers.
So if you didn't follow the last two slides,
though, maybe a little bit not as important,
but what I do want to stress is that there's
a lot of work that goes into systems
optimization and that kind of optimizing
the use of your compute.
And the different ways that we saw about doing that
was tiling, so like ordering of your operations.
It was the sparsity, so basically,
making your model sparse and not having every parameter.
So every data point go through every parameter.
We talked about parallelism.
So basically, using more GPUs.
And yes, I think that's basically it.
Great.
So we're done.
And there's no questions today, because as I said,
this is a rerecording of the video.
I know this was pretty long.
I'm also starting to be a little bit tired.
But I hope it was useful.
And yeah, good luck for the rest of the class.
Loading video analysis...