CS294-196 (Agentic AI MOOC) - Lecture 1 {Yann Dubois}

By Berkeley RDI Center on Decentralization & AI

Summary

## Key takeaways - **Pretraining: Predicting the Next Word at Scale**: Pretraining involves predicting the next word on vast amounts of internet data, aiming to imbue the model with broad world knowledge. This process requires immense computational resources, with training costs exceeding $10 million and datasets of over 10 trillion tokens. [01:32:30], [01:37:35] - **Post-training Aligns Models with Human Intent**: While pretraining teaches a model language, post-training (like RLHF or SFT) is crucial for aligning the model's behavior with human preferences and instructions, making it useful for real-world tasks. This stage uses significantly less data, focusing on quality over quantity. [03:05:05], [53:36:40] - **Scaling Laws Drive LLM Performance**: Empirical evidence shows a strong correlation between increased compute (more data and larger models) and improved LLM performance. This relationship, known as scaling laws, allows researchers to predict performance at larger scales based on smaller-scale experiments, guiding resource allocation. [42:01:03], [43:05:09] - **Data Quality and Filtering are Critical**: The quality of training data significantly impacts LLM performance. Extensive filtering, deduplication, and heuristic-based cleaning are essential to remove undesirable content and low-quality documents, transforming raw internet data into a more effective training corpus. [28:18:20], [34:48:54] - **Systems and Infrastructure are Key for Scale**: Efficiently scaling LLM training relies heavily on systems optimization, including low-precision operations, fused kernels, and advanced parallelization strategies (data, model, and tensor parallelism). These techniques address bottlenecks in memory and communication, maximizing hardware utilization. [37:01:11], [46:46:49]

Topics Covered

Pretraining LLMs: Predicting the Next Word on the Internet
RLHF: Training LLMs to Interact with Humans
AI 'Hacks': When Models Game the System
GPU Bottleneck: Data Feeding, Not Computation
Halving Precision for Faster AI Training

Full Transcript

YANN DUBOIS: OK.

Let's get started.

So hi, everyone.

My name is Yann.

I'm a researcher at OpenAI.

And I'll be rerecording a class that I gave at the Berkeley LLM

Agents MOOC series on Introduction

to training LLMs for AI agents.

So the reason why we're recording this class

is because, one, we had technical difficulties early

on in the class, and second, there

was a fire alarm that started, which means that we did not

go through all the slides.

And also, we don't have a good recording,

and I want to make sure that everyone who's online

would also see the entire class.

So before getting started, one thing to say

is that, all views here on my own,

even though I work at OpenAI, I will mostly talk about things

that we can find online and information that we

can find of open-source models, especially Kimi, Llama,

and IPsec.

So unless I talk about OpenAI, nothing

is related to OpenAI here.

Great.

So with that, let's get started.

So we all know that LLMs and chatbots really

took over the world in the last few years.

And so the question I will try to answer

is-- how do we actually train those?

How do we train those models?

So this is an example here from ChatGPT and an answer

from ChatGPT.

So there are three main parts of the training pipelines,

one training LLM.

The first one is pretraining, so you probably all heard about it.

The general mental model that I like to give people

is that pretraining is really about predicting

the next word on internet.

So you take all of internet or all the clean part of internet,

and you just try to predict the next word.

And as a result, you will learn everything about the internet

and as much as possible about the world.

In terms of data, pretraining takes more than,

at least for The big open-source models,

takes more than 10 trillion tokens, so that's a lot.

And when I see tokens, you can think

a little bit about it as words or subwords,

but essentially, more than central tokens.

So think about it as 10 trillion words.

It takes months to train on that much data.

And it takes a lot of money and a lot of compute

to actually train on that amount of data.

So you can think about the compute cost

being more than $10 million for one run.

So here, the bottleneck in pretraining

is both the data we talked about, 10 trillion is a lot.

And second is compute.

And as we will see, basically, the more you scale up

these models, so that means, the more data

you put in the model, the more compute.

So the longer you train for, the better the performance will be.

So an example, a pretraining model is like Llama 3,

for example.

So a second part, which actually is

third in a chronological order of how these models are usually

trained, but this is the second part

that came after pretraining.

The last part is more recent.

So the second part is what I call

classic post-training or RLHF, so Reinforcement Learning

from Human Feedback.

Here, the idea is that this pre-trained model

is just good at predicting the next word,

but it's not really good at performing well

in the sense of predicting what the user wants

or answering questions or following instructions.

Basically, you can think about it

as a model that knows everything about the world,

but it doesn't actually know how to interact with a human.

That's one way of thinking about it.

And that's what we're going to try

to optimize for is model's interaction with humans

and making sure that when you ask it to do something,

it does it.

The data size here is much smaller,

maybe around like 100,000 problems.

All these numbers are just orders of magnitude,

just to give you a sense.

In terms of time, it probably takes a few days to do one

of these runs, compute costs, maybe around like $100,000.

And here, the bottleneck is data and evals.

So when I say data, I really mean the quality of data,

because 100,000 problems is not that much.

So it's really about how high-quality the data is

and whether you can evaluate whether you're making

improvements on your run.

So how good you're improving your run.

And this is really important, because when

you do for this RLHF or this post-training,

you have to balance many things together.

So you need to make sure that you're actually

tracking your performance on all these different axes.

So a specific instantiation of this RLHF model

is a Lamma Instruct.

So when you hear instruct at the end of the model,

that usually means they just went through RLHF,

because it can follow instructions

that are given by humans.

Great.

And this last part, which, actually, as I said,

usually comes second in the pipeline,

is what I call the reasoning reinforcement learning.

And so the idea here is to think on questions

where there's objective answers or where you have access

to ground truth.

So you will hear, recently, you will see here

from open-source models also that perform very well on math,

and coding, and things where it's actually pretty easy

to get some ground truth answers, for example,

like passing test functions in coding or passing

some math exam.

And this is what you want to optimize

during this reinforcement learning for reasoning.

So this second stage is only true for reasoning models.

So one example is DeepSeek-R1, which

was the first open-source reasoning model.

In terms of data, in the R1 paper, they don't say exactly,

but you can read through the lines,

and you can look at the plot and try

to extract around the amount of problems

that they are actually training on.

So in terms of data, it's probably

around a million problems that they're training on,

probably takes in the order of weeks

to train this second stage, so this reasoning stage,

around $1 million.

And here, the bottleneck is reinforcement

learning environment and hacks.

So what I mean by that is, as I said,

this is about optimizing for objective truth

or optimizing, for example, if you take the case of passing

test functions, encoding, the bottleneck is-- how many

test functions can you get?

And one thing that will usually happen

is when you start optimizing for these test functions or test

cases, you will see that the model

will start optimizing things that you weren't expecting.

And so maybe we'll be able to pass the test case by,

for example, removing the test in your environment

or replacing it with always returning to.

That's one of the types of things that the model may do,

which is what we usually call hacks.

So the way to think about hacks is just,

the model found a way of optimizing the reward,

even though that's not what you were hoping they would do.

So this is usually a pretty big bottleneck,

because models are really good at optimizing things,

even if it's not exactly the type of thing

that we want to optimize.

If you write something to optimize,

they will optimize it exactly as written.

And usually, the second and the third stage,

I will bundle them together and call them post-training,

which comes after pretraining.

It depends.

Different people have different names.

That's how, I believe, R1 and Kimi

talked about that in the post-training stage.

Great.

So the LLM training pipeline, there

are basically five things that you

need to consider when training an LLM.

First is the architecture.

So what model architecture are you using?

So you probably all heard about transformers or about mixture

of experts, which is a kind of a variant of transformers.

And then there's a training algorithm and the loss,

so that means, what are you optimizing for this architecture

to do?

So what are you trying to optimize?

And then there's the data in the RL environment,

so that's what we talked about before, this evaluation, which

is knowing whether you're making any progress.

And then last part is systems and infra,

to make sure that you can scale up these runs.

So until, I would say, 2023, most of the academia

was actually focused on architecture, and training

algorithm, and losses.

I've done also a PhD.

That's what I was focused on until around 2023.

And really, there were few people

who were working on the rest, but that was the main product

of academic research.

But in reality, what matters in practice is the last three.

So what matters is usually, data, evaluation, and systems

to be able to scale.

So people usually want to work on architecture

on developing new algorithms for optimizing your model,

but these things matter much less,

as we will see, than really, how much data do you put in?

What's the quality of the data?

Are you measuring your progress well?

And do you have the infra to actually scale things up?

So I will not be talking about architecture,

mostly because, at this point, the architecture is not changing

that much in the open-source.

And so it seems to be mostly using transformers

with mixture of experts.

And a lot of people like talking about architectures,

so you can find a lot of information

about what architectures are being used.

At least, currently, it's really not as important,

so that's why I will not be talking about that.

OK.

These two last spots that I didn't

talk about in the pipeline for training LLMs.

And I consider them more about specializing the LLM.

One is prompting.

So once you usually have a model, for example, a big lab.

Yeah, big lab might release a big open-source model

or a closed model.

And then people will be able to interact with it

and specialize that model for their use cases.

And the usual way that people do it is, first, just by prompting.

So prompting is really just knowing

how to ask questions, essentially.

So it's the art of asking the model what you want.

What is nice is that you don't need any data,

and it's pretty fast.

You just try a few examples, and you see how it works.

And there's no compute associated to it or very little.

And the bottleneck is eval, like, how do you make sure

that you actually have a good prompt, that you're

asking the right question to the model?

And then the second part is fine-tuning.

So we will not be talking about it, but just to mention it here,

briefly, fine-tuning is basically

continual post-training or an additional post-training,

where you basically apply the second stage of clustering

to domain-specific data.

So for example, imagine that all these companies

release pretty general models.

And now, you want to specialize it to some specific domain,

like medical data.

So internally or for your project,

you might have some specific data

that you want to optimize for.

And you will take these open source models,

and you will be basically fine-tuning,

so doing a little bit more of training on your specific data.

And just like post training, this requires around maybe

10,000 to 100,000 problems around days.

And compute costs around like 10,000 to 100,000.

And here, again, just like [INAUDIBLE]

the bottleneck is really the quality of your data,

and evaluation.

How do you know whether you're making progress?

Great.

So let's talk about pretraining.

I will talk about the method, what pretraining is,

the data, and the compute that you need.

So in terms of pretraining, as I said,

the mental model, the metaphor that I like giving to people

is that pretraining is about predicting the next word.

And a way to think about it is, just for example,

when you type a message, you will usually see your phone,

like predicting what's the next word that you will predict.

And this is exactly how pretraining works, or not

exactly, but mostly, this is the metaphor that I like giving.

And it's essentially how pretraining works.

So the goal of pretraining is to teach the model everything

in the world.

And the way that we basically achieve that

is just to predict the next word.

Because if you can predict the next word,

then you must understand.

If you can predict the next word on every single domain,

then you must have some understanding of that domain.

And that is basically what pretraining is.

So in terms of data, it's basically any reasonable data

on internet, as much as possible,

because again, you want to have models

that understand, as much as possible, about everything.

So you really want to give as much data as possible

for the model to learn on.

So in terms of scale of data, you

have around, maybe, I said more than 10 trillion tokens.

So for Llama 4, for example, I believe

the models were trained with between 20 to 40

trillion tokens.

For DeepSeek V3, I believe it was 15 trillion tokens.

So it gives you an order of magnitude

of data that you need for the current best open-source models.

So that 15 trillion tokens corresponds to approximately $20

billion unique web pages.

So that's a lot of data.

It's not all of internet, but it's basically

all the clean data that people can find on internet.

And pretraining has really been the key

since GPT-2 in 2019, that mostly showed the world what

pretraining can do.

And basically, just using a simple method

like predicting the next word, but you just do it at scale,

it really showed how smart the models can become.

OK.

So what is actually happening under the hood?

I'll give you first a brief overview in terms of tasks.

As I said, it's about predicting the next word.

So the steps are the following.

First, you tokenize the data.

So here, I have a sentence, "She likely prefers."

And the goal is to predict the next word.

So in this case, the next word is dogs.

So "She likely prefers."

So what you will do is you will split up "She likely prefers"

into different tokens, which are basically different subwords

or different subunits.

The reason why we do that is because, I

mean, they don't understand words, they only

understand numbers.

So you have to take these words or you

have to take this sentence and split it up into numbers.

So that's what we call tokenize.

I split it up by word here, so by space.

So I have "She likely prefers."

And I say all these three words become tokens.

And I will associate all of these tokens

with different index.

So "she," I will give it 1, "likely" becomes 2,

and "prefers" becomes 3.

And this is just one way of converting, again,

these words that computers don't understand to numbers

that computers can work with.

Then you will do what we call a forward pass, which

means that you will pass through the model.

We'll see exactly what happens later.

But you will pass it through the model.

So usually, this is a transformer.

And then you will have this model

try to predict a probability distribution.

So categorical distribution that tries to predict what

is the probability of the next word.

So for example here, you see that "She likely prefers,"

it's very unlikely to say "she" again.

But it's very likely to say this word, which in this case,

is dog.

And then you will sample from this probability distribution.

So once you have a model that predicts the distribution,

you can just sample.

And that's why, every time you ask a question

to some open source model, it will not always

get the same answer because you actually have the sampling step.

You sample and then you detokenize.

Because again, when we talk about it

in this categorical distribution,

that just tells me that this is token number 5.

Here, I have 1, 2, 3, 4, 5.

I have one index here.

So I have five.

And then I need to look through my dictionary.

That tells me, index 5 was actually the word "dog,"

so the token "dogs."

So that's detokenized.

So the last two steps, this is not super important.

But last two steps are only happening at inference time.

At training time, all you can do is

you can always keep predicting the next word by predicting

the probability distribution and just

optimizing your cross-entropy loss, which, I'm sure,

most of you are familiar with.

So you don't actually need to do the a sampling.

So these two steps are only doing inference.

Great.

So now, I want to give you some intuition

about why this can even work.

And to do that, I will talk about,

honestly, the most simple language

model that you could think about.

And this is the N-gram language model,

which was already used at scale in 2003, so a very long time

ago.

It already worked pretty well.

But I think it gives a good intuition

of what is happening under the hood for the current models.

So the question here is, how can you learn what to predict?

Because we talked about, before, I

said, oh, you just do a forward pass to your model

and just predict the distribution.

How can you learn that?

So one way you can do that, so let's take an example and say,

how can you build what comes after the sentence,

"The grass is."

And you probably know that after "The grass is,"

it's most likely to be "green," for example.

So how can we know that?

How can we teach them all to do that?

Well, the solution is statistics.

Statistics is always the solution

to most of your problems.

So one way you can do that is, you can take all the occurrences

of the sentence "The grass is" online, or for example,

take all the occurrences of "The grass is" on Wikipedia.

And now, you can predict the probability of every word that

comes after "The grass is," by looking at the number of times

that that word appeared after "The grass is"

normalized by the number of times that you saw the sentence

"The grass is."

So let's say that the sentence "The grass is"

happens 1,000 times on the web pages that you looked at.

And maybe half of the time, so maybe 500 times,

the next word is "green."

And maybe, I don't know, 100 times, the next word is "red,"

you will have then probability of "green," knowing,

after "The grass is" will be half, so 500 divided by 1,000,

and for "red," it will be like 10%, so 0.1.

So that's a very simple way of predicting

the categorical distribution of the next word.

But it would actually work pretty well.

At least for simple things like "The grass is,"

that would actually work pretty well.

There are still a few challenges.

One is that you need to keep count of all the occurrences

for each of these N-grams.

Or at least, in this case, each of these sentences

that happened, you need to keep count of every word that

came after it.

So just think about it in terms of memory,

it's a huge memory requirement for storing all of that.

So that's unfeasibly large.

But it will still work pretty well for simple things.

And then the other problem is that most sentences, maybe not

most, but a lot of sentences might be unique.

So if there's something that never happened in your training

corpus, so if you never saw this very long text,

let's say that instead of "The grass is,"

I gave you 100 lines of code, and I asked you,

what's the next word?

Maybe you never saw these 100 lines of code at training time.

And then your predictor will have no way to generalize,

because basically, count will be zero.

And so we'll give a probability of zero,

even though the probability is actually a little bit higher

than zero.

That's like two problems that you

would have with this very simple statistical language model.

And so the solution is very simple.

Just use neural networks.

So I'm sure, many of you know about neural networks,

and we're going to assume that you do,

But you can basically approximate this prediction

by using this parametric estimator that neural networks

are, instead of this non-parametric estimators

that we talked about here.

Great.

So let's go through what I will call here neural language

models.

So it's a language model of a neural network,

which is what everyone does.

So the way that this works, again, at a very high level,

is that you take a sentence, for example,

"I saw a cat on a," I will basically split the sentence

into different tokens.

So these are all these words.

I will associate all these tokens

with a word embedding, so like a vector

representation of that word.

So the way that you can think about it is that,

imagine that this was in 2D.

You basically have a plane.

And you basically have all these points

that are on this plane, where usually, most similar words are

clustered with one another.

So you might have "I," "saw," "cat," and things like this.

It's just that instead of being 2D,

it might be much higher dimensional,

and it might be like a vector space of like 768 dimension

or something like this.

Then you pass that through a neural network.

So neural network, the way to think about it

is just some nonlinear aggregator of these vectors.

So it's just something that takes

all these vectors as input.

It does some munging.

And it gives you another vector.

The important part is that it's differentiable,

so you can actually back propagate through that.

That's the most important.

But for example, you could use a very simple neural network could

just be an average of this.

You could literally just average all these tokens together

or the vectors associated with these tokens.

And it gives you another token here, which is, intuitively,

it's the vector representation of all this sentence,

"I saw a cat on a."

So yeah.

So again, you could take some average

or you could just take some nonlinear aggregation,

like a passing through a neural network.

Then what you do is that, this vector representation

is in the wrong dimension, because what you want to do

is you want to be able to predict

which is the most likely word.

So you want to predict the probability of each word.

So you want a representation that lives

in this dimensional space, that is the number of dimension

is the number of tokens, the number of words

that exist in your language, for example, English.

So what you will do, very simple way

is that you can just pass this through a linear layer.

So you can just multiply this by a matrix

to take this H representation that lives in d dimension

and pass it to your vocabulary size dimensions.

So let's say, very concretely, you have 768 dimension.

Let's say that your vocabulary might be 20,000 words that you

might want to predict in English.

And you will basically multiply this by a matrix of 768

by 20,000.

And then you will get a vector out of it.

That is a 20,000 dimensional weight.

So once you have this, you will just pass it through a softmax.

So softmax is the usual trick to get

a categorical neural distribution from any vector.

So this just ensures that basically you

have numbers that sum to 1 and are between 0 and 1.

And then you can just consider that as the probability

of the next word after "I saw a cat on a."

Here you are.

You basically have this prediction of the next word.

Great.

And once you have the next word, you

can just optimize the cross-entropy loss.

So basically, just try to optimize what the real word is.

Let's say that the real word comes from here.

You will basically try to maximize a little bit this one

and minimize all the rest.

And then you just backpropagate because everything

is differentiable.

And that will basically tune all the weights

that you have in your neural network, including

also these word embeddings.

So this representation for every word.

OK, that was a very brief overview.

But hopefully, you get a sense of what a neural language

model is.

OK.

So now that we talked about the method, let's talk

about the data that goes into pretraining.

So the idea, as I said before, is to basically use

all of the clean internet.

So use as much data as possible and everything

that is clean on the internet.

Why do I say all of clean internet,

because majority of internet is pretty dirty and not

representative of what you want to ship to users

or what you want to optimize your model on.

So a very practical type of pipeline.

So every different lab and different pretraining groups

have different ways of doing it, but that's just

to give you a broad overview.

You first download all of internet.

So usually, people use in the open-source

some crawlers that already downloaded internet for them.

So basically, for example, the Common Crawl

is a crawler that already downloaded 250 billion pages.

So that's around more than 1 petabyte of data.

And they're all in these work files.

So basically, you download all over the internet.

Oh yeah.

And how these look like, so these files,

it's basically just HTML.

I mean, it's hard to understand.

You see here some meta, some keywords.

And here, you will find the text that says, blah, blah, blah,

blah, or here, paragraph, one of the best and most rewarding

features of the blah, blah, all that.

So it seems to be like an ad talking

about rewarding features.

And then it talks about downloading free question

and answers.

Anyways, it seems to be kind of an ad.

So anyways, this is a random website

that I took from Common crawl.

So as you see, it's hard to pass.

And probably, this is not even something

that you really want to train on if it's an app.

Great.

So that's just an example of what you have.

Oh, my computer stopped.

Oh, like this.

Great.

Second thing that you do, so as you just saw,

you have this HTML.

So what you have to do is you have to extract text out of it.

So it's actually pretty challenging.

There will be some question of how

do you deal with JavaScript, or boilerplate code,

or math that is run differently, and things like this.

So you will need to extract text from HTML.

This is actually pretty computationally expensive

because at this point, the name of the game is how much data

you can have.

So you really, you have a lot of data

that you have to clean and extract from.

So that's actually pretty computationally expensive.

Then you will do some filtering.

So one filter that we usually do on the open-source world

does pretty early on, is filtering

for undesirable content like PII data, or not safe for work data,

or anything that is harmful.

You will try to remove this.

Then another very common filter that people usually do

is deduplicating your data.

And then the deduplication could be by document,

it could be by line, it could be by paragraph,

it could be at different levels.

But the idea is to not train too many times

on the exact same data.

For example, if you train on forums, let's say,

on all the data that you have on Wikipedia or Stack Overflow,

you will always have these headers and footers

that are duplicated.

And you definitely don't want to train a million times

on the exact same Stack Overflow header,

because you don't learn much from it.

So you would basically be losing compute to try

to learn the header perfectly.

OK.

And then you will do some heuristic filtering.

You might do some heuristic filtering.

For example, you might try to remove low-quality documents.

Low-quality might be that there are too many words.

If it's an extremely long document,

it might be suspicious.

If it's a very short one, let's say

there's only 10 words, probably, it's not worth training on.

If there's any kind of outlier tokens,

like tokens or words that really are extremely rare, yeah,

it might be that this is just bad data.

So you will do a lot of these heuristic-based filtering.

And then you might also do some model-based filtering.

So one idea that I find pretty neat that people have been doing

is trying to basically do distribution matching.

So you find some distribution that you think is high-quality.

For example, you might say, Wikipedia

is pretty high quality.

Or you might say, every page that is referenced on Wikipedia

is likely to be high quality, because that means that someone

went and referenced that page.

So that's already a pretty big amount of data.

All the websites that are linked by Wikipedia is pretty large,

but it's still very little compared to the amount of data

that we need for pretraining.

So what you might say is, I want to find

more of that type of data.

And the way you can do that is that you can train

a classifier that takes some random data in and says,

this is not a reference on Wikipedia,

and the pages that are referenced on Wikipedia.

And you try to predict, basically, yes

to this and no to the former.

And then one you train that classifier,

you basically have a classifier that essentially predicts

how likely it is that your document is referenced

by Wikipedia.

And then you can just do a filtering based on that.

So you can say, if it's very likely, I'll keep.

If it's not likely at all, I will throw it away,

because it's probably some bad data.

So this is some model-based filtering.

And you can do a lot of that.

And then you can do some dynamics.

Data exchanges, for example, you might classify the category

of data, whether it's like code books

entertainment, any of these domains.

And then you might want to reweigh different domains.

So for example, if you train a coding model,

you want to reweight coding, probably, there's

not enough code online.

So you want to say no.

Even though I only have 5% of coding,

I want to bump it up to 50%.

And the way to do this reweighting,

you can usually do these experiments at small scale.

This is true for any of these filtering.

You might do these experiments at a small scale,

try to understand what is best, and then you

will try to predict what to do at larger scale.

Great.

And at the end, we'll talk about that too.

But once you did all of this pretraining data,

you will also try to collect some higher quality data.

For example, you might say like, Wikipedia is super high quality

or everything on arXiv might be really high quality.

So you will keep this second distribution

of high quality data.

And usually, after training on this pretraining data,

you will do what we call mid-training, which is training

on this high-quality data.

The idea being like, well, we don't have enough of that data,

but we know it's high-quality, so we will try to fine-tune

or optimize after pretraining or continual pretraining

our model on that high-quality data, such as the model

lends to really be as good as possible.

OK.

So pretraining data.

One paper that I would recommend reading is this FineWeb.

And it's both a paper and a blog post about FineWeb data

sets from Hugging Face.

And they talk a lot about what filtering they've done,

but this is just one plot from the paper.

And here, the x-axis shows the amount of tokens,

billion tokens that you train on.

So this is still pretty small compared

to the scale of pretraining data that we

talked about before, which is more than 10 trillion tokens.

And this is the aggregated accuracy, where it's basically,

your performance on a whole set of evals.

And here, what they show is, first, this green line

is when they took 200 trillion tokens from, I believe,

Common Crawl.

So this is basically raw data.

And then they applied a lot of filters.

So not safe for work blocklist.

They mostly went for English text, some simple document

filtering.

For example, if it's too much repetition in the document,

they removed it or if it's the wrong lens.

So that's this first filtering, one from 200 trillion tokens

to 36 trillion tokens.

And here, we see how well you perform

when you train on 360 billion tokens from those.

And here, you see the performance gain

when you deduplicate data.

So the way that they've done it is, they said, essentially,

I don't want to have text that is

duplicated more than 100 times.

So that's basically at a high level, so they filtered data.

They filtered by nearly half, from 36 trillion tokens

to 20 trillion tokens.

And you see that training on that

really improves performance.

Again, that's because you're basically not forcing your model

to learn things that are duplicated and not that useful.

And you really focus on new data.

So yeah, that worked pretty well for them.

And I mean, 100 documents that are duplicated

is still quite a lot.

But usually, you can have a huge classes of like 100,000

duplicates.

So those are the ones that they wanted to filter out.

And here, you see some additional filtering.

So for example, they remove the, I believe, JavaScript.

They remove lorem ipsum text and things like this.

So that removed, again, a little bit more of data.

And you see that it performs better, and then

some additional, I believe, model-based filtering

that performed even better.

Great.

So that's pretraining data.

And then there's midtraining.

So as I said before, the idea of midtraining

is basically continuing your pretraining

and to adapt your model to have some desired properties,

or to basically, adapt your model on some high-quality data.

So usually, you do it on, basically, less than 10%

of what you did for pretraining, so less than a trillion token.

So you might, for example, change the data mix

in your data.

So you might say, I want to have a lot of coding data at the end

or I want to be more scientific and have

a model that is really good at basic science questions.

Or you might want to optimize more on multilingual data.

Let's say that you know that a lot of the data that we have

access to is more English, but this

is not representative of the languages

that people usually speak.

So you might say, OK, I'm going to upweight some other languages

that we usually are less represented in our data sets.

And just to make sure that it's basically

representative of how many people speak that language.

Some other type of things we do during training

or that we might want to do is that you usually want

to increase context length.

So in many of these models, you usually hear this idea of,

how much context can the model see?

And when you do pretraining, you don't

want to train on very large context lengths,

because that's much more computationally intensive.

But you do want the model to be able to understand,

let's say, 120 tokens that came before your question.

So usually, what you do during training

is that you will bump up this context length,

so you will do some extension of context length

during my training.

For DeepSeek V3, they went from 4,000 contexts during

pretraining to 128 during midtraining.

And I think, many other open-source projects did that.

Other type of data that you might want to add

is some formatting or instruction following.

So you might want to already teach your model

to answer questions when you ask a question

or to write in a very specific chatty way.

And some high-quality data.

If you have some high-quality data,

you might keep it for the end and be like, OK, first,

I want to learn how to speak grammatically correctly.

And then I want to actually learn the real meat of the text

that you have in your data.

And you might have some reasoning data

about teaching the model how to think, which, I believe,

is what Kimi did.

And yes, many other things.

Great.

So pretraining and midtraining, let's just do a recap.

One is that really, this data during pretraining

and midtraining is really a huge part of training LLMs.

I would even say that it's basically

the key for training LLMs.

And there's a lot of research that has already been

done and a lot more to be done.

For example, how do you process well and efficiently?

I mean, these are huge scales of data that we're talking about,

whether to use synthetic data, whether to use big models

like generate more data.

How much multimodal data to put in?

How to balance your domains?

And all of that.

And there's a lot of secrecy.

So most companies are not talking about what they do.

Even in companies that actually do open-source models,

I don't usually talk that much about the data

that they collected.

First, because it's the most important thing.

So this competitive dynamics, they

don't want to tell you what they've been training on,

because that would be easier to replicate.

And then some companies might be scared about copyright liability

if they train on data that they shouldn't have trained on.

So here are a few common academic data sets--

C4, The File, Dolma, FineWeb, I just wrote a few.

So FineWeb is the one we talked about before

with 15 trillion tokens.

And this is the composition of the file.

And you see that in the file, there's

a lot of archives, and PubMed, and high-quality data.

And you will have also a good amount of code and things

like this.

Great.

So just to give you a scale of these data, as I said,

Llama 2 was trained around two trillion tokens, Llama 3,

around 15, Llama 4, between 20 and 40 trillion tokens.

So every new generation tries to train on more data

and does also some better filtering.

OK.

So that was about pretraining data aspect.

Now, let's talk about the compute.

So empirically, one thing that is super important

is that empirically, for any type of data and model,

the most important as I said before,

is how much compute you basically spend on training.

So by how much compute, I mean, both how much data

do you put in the model and the size of the model.

Because if the model is bigger, you need to spend more compute.

And what is very nice is that you can actually

predict pretty well the performance,

at least during pretraining.

You can predict pretty well the performance

that you will achieve if you just pour

more compute into your run.

So if you just train for longer or train bigger models,

you can predict pretty well how well they will perform.

So here, the way to interpret this plot

is that on, the x-axis, you see the amount of compute

that you have in your run.

This is in log scale.

And here, you have your test loss also in log scale.

And all these blue lines are basically different runs.

And then you take the minimum achieved for all of these runs.

And you can link all of them together.

And it gives you something that looks pretty close to a line.

And then you can just fit the line

to discover the ideal compute and test loss.

And now, you can use this line to predict how well can you

perform if you train with 10 times more compute or 100 times

more compute.

So what is very nice is what I wrote

here is that you can now do research at very low scales

and then predict how well it will perform at higher scales.

So this is what we call a scaling loss,

which one is very surprising.

There's really no good reason for this to happen or at least,

yeah, it could have been differently.

And there are some theories for why that happens.

And two, I mean, it's very nice when you do research,

because now, it means you can work at this small scale

and has really been great for the field.

OK.

So scaling laws, what is nice, as I said,

is that now, you can tune things at lower scale.

For example, if I ask you a question,

and I gave you 10,000 GPUs, I asked you,

how should you be using these 10,000 GPUs?

How should you be training that model?

Historically, what you might have done

is you might tune high parameters for different models.

So you might say, OK, I'm going to have 20 different runs or 30

different runs.

And I'm going to pick the best, and that's the one

that I'm going to ship.

But as a result, each of them will only be trained on 1/30

of the compute that you had access to.

The new pipeline is that now, you can find scaling recipes.

So you can find recipes that tell you

how to change the learning rate with different scales and things

like this.

And then you can tune high parameters

at small scale for a very short amount of time.

You can do many, many iterations.

And then you can plot the scaling law,

extrapolate how well you will be performing at larger scale,

and then train one huge model at the end, where

you use way more of the compute that you have access to.

So maybe 90% of your compute goes for the full run rather

than 1/30 of what you had before.

So yeah, this is really a blessing.

OK.

So for example, very concretely, should you

use an architecture that is a transformer or an LSTM?

You see that transformers.

This is the scaling law for transformers.

And here, you see LSTMs.

You see that transformers have a better constant,

so that means that they are always lower than LSTMs.

And they also have a better scaling rate.

You see here that the LSTM seems to be plateauing a little bit.

So that tells you both that at any scale,

transformers is better, but also,

the larger the scale, the better the transformer becomes,

which is why most people gave up on, essentially,

LSTMs as an architecture.

But what's interesting is, it could also just

be that the constant is better for one of the architectures,

but the scaling rate is better for the other one.

And in that case, you definitely want

to always go with the scaling rate, not the constant,

because who cares how well it performs at very small scale?

The real question is, what if it's 200 times larger.

How does it perform then?

And that's why the scaling rate is what matters.

Great.

So one very famous paper about scaling laws

is Chinchilla, that tries to show

what is the optimal way of allocating

training resources between the size of the model and the data.

Because both of these things are about compute.

And as we said, the more compute the better.

But there are two ways of spending compute,

either you train for longer or you train larger models.

So they have these results I'm going to skip a little bit,

but you can basically predict the optimal resource allocation.

And they found that for every parameter,

you should be using around 20 tokens.

So that's this optimal resource allocation.

One thing to note, you will hear often about chinchilla,

but Chinchilla is only an optimization

of training resources, it doesn't consider inference cost.

So they will only ask themselves.

They only ask, what is the best way

to achieve a certain training loss?

Where should I be putting the compute?

But it doesn't take into account that if you

have larger models at inference time,

they will actually cost more.

So for example, I mean, let's say, for example, for OpenAI,

for ChatGPT, the larger the model,

the more you will spend per user.

So you might be better off actually, training for longer

and training a smaller model, even

if it means that you need to spend more compute to achieve

the same performance, because at inference time,

it will cost less.

Anyway, so that's the Chinchilla paper.

And then I want to talk a little bit about the bigger

lesson from Sutton.

So basically, the bitter lesson, I

would really recommend reading this blog post

from Richard Sutton, which is really

the big researcher of reinforcement learning.

And he wrote that blog post that essentially tries to says,

the only thing that matters in the long run

is about leveraging compute.

And the reason why is because we see empirically

that the more compute we put in the models,

the more improvements you get out of it.

So basically, more compute equals better performance.

And we also know from Moore's law and some derivative laws,

that we will always have more compute, or at least,

that's the hope.

We will always have more compute every year.

And if you put these two things together,

more compute equals better performance,

and you will always have more compute.

And then the natural things that come out of it

is that it's all about leveraging computation.

There's no reason for trying to optimize things

at your current level of compute,

because next year, you will have more,

and that will just perform better.

So what matters is to have methods

that will scale up really well.

So that's the TLDR for the bitter lesson, which really

has driven a lot of how the community has

been thinking in the last, I would say, three or four years.

So yeah, so the summary is, don't spend

time overcomplicating things.

Do the simple thing, and make sure

that it scales, because what matters,

again, is not tuning this constant performance.

It's really making sure that you can scale it up.

Great.

So for training a SOTA model, this is a slide that I wrote

maybe two years ago or one or two years ago for training Llama

3 400B, which, at the time, was the largest open-source model.

And I just tried to predict how much that would cost.

So in terms of data, I was trained

on a 15.6 trillion tokens, 405 billion parameters.

And you see here that it uses around 40 tokens per parameter.

So that's pretty trained compute optimal by Chinchilla standards.

In terms of FLOPs, it uses 3.8 E25 FLOPs.

So it is an executive order that says

that you need to be more careful when you open source models

or when you train models that are more than one E26 FLOPs.

So this is around 2 times less than executive order.

In terms of FLOPs, in terms of compute, they use 16,000 H100.

And if you do the computation in terms of time,

it probably takes around 70 days of training to train this model.

And in terms of cost, my rough estimate

is that it would cost around like $52 million

for training this, so between 50 and 80 million,

depending on how much you consider they spend per compute

given a stack on clusters.

And in terms of carbon emitted for training this,

this is around, just for training a one model,

maybe 2,000 return tickets from JFK to London,

so from New York to London.

So that's quite a lot.

It's still neglectable compared to,

I mean, how many flights there are per year

and things like this.

But if you think that every generation is

going to be maybe 10 times more compute

than the previous generation, you

could see how in 2, 3, 4 generations, that

will become a real issue in terms of carbon emitted.

In terms of next model, as I said,

basically, every generation, you can think about it

as 10 times more FLOPs that go into training the models.

Great.

OK.

So pretraining summary, the idea is

about predicting the next word on the internet,

data., around 10 trillion words go into training these models

right now.

In terms of time, it takes months.

In terms of compute, more than $10 million.

The bottleneck is data and computation.

And some examples might be DeepSeek V3 and Llama 4.

OK.

So now, we talked about pretraining.

Let's talk about post-training.

Again, I'll talk about the method, the data and compute.

So why do we want to do post-training?

Well, language modeling, so what we do during pretraining,

is really not about assisting users and about helping users.

So language modeling is not what you want.

And what I mean by that is that if you just take GPT 3,

and you prompt it with "Explain the moon landing

to a six-year-old in a few sentences,"

what it will do is that it has been trained on basically,

a large part of internet.

So we'll say, well, that reminds me

of maybe large lists of questions that people might ask.

So instead of answering the question,

it might just predict what is a similar type of question

that people might ask.

So actually, what GPT 3 answers to you

is explain the theory of gravity to a six-year-old.

Explain the theory of relativity to blah, blah, blah.

So this really shows you that these models are really not

optimized for predicting what you want,

this is just about language modeling

predicting the next word.

So the idea of classic post-training,

also called instruction following or alignment,

is about steering the model to be useful on real world tasks.

So if I ask "Explain the moon landing to a six-year-old

in a few sentences," so the the same as before,

I want ChatGPT or any model to give me a real answer.

And the way that we basically do that

is to maximize the preferences of humans.

So maximize answer preferences of humans.

In terms of data, probably between 5,000 and 500,000

problems, so much, much smaller scale.

In pretraining, the idea is that, first,

you try to basically learn everything in the world,

and then you try to optimize on very specific domains, which,

in this case, just like instruction following

and answering questions with very few data

points, because the model already knows everything.

So it just needs to learn, basically, how to act

or how to interact with the human.

And this is really what made ChatGPT what it is.

So since 2022, that's really when

post-training became important.

So that's the overview of this third stage

that I told you about, which is the classic post-training.

And then there's the second stage,

which is about teaching the model to reason.

So that only happens in some models,

for example, Kimi and R1.

And the idea is to optimize simply answering correctly

the question.

So you will see some, for example, in 01,

it says things like thought for 24 seconds.

So reasoning is about, how do you optimize for that?

The data, it usually optimize for, basically,

any hard task with verifiable answers,

so things like math competitions or coding test cases.

And you try to optimize for that.

So this really became important since 01 in 2024.

And yeah, this is about this new paradigm of reasoning.

And I believe, Norm from OpenAI will also come and tell you

about reasoning.

But at a very high level, the idea

is that, what we had before was train time, compute time.

I mentioned to you scaling laws, which

shows that the more compute you put in the run during training,

the better your performance is.

And what reasoning gave you is test time compute.

So after training, you can also pull more compute in your model

to get better performance.

And that's like humans.

If you make me answer a question in the second,

I will probably provide less thoughtful and less correct

answer than if you gave me like a week to answer the question.

So the goal was test time scaling.

Let me put it a bit more light.

Great.

So post-training methods, I will talk about SFT and reinforcement

learning.

So the task is alignment.

Just as an example, let's say that we

want to optimize the arm to follow user instructions

or something like designers desires.

So this is the example from before, which

is like answering questions.

Or maybe you want the model to never answer

specific type of questions.

For example, if I ask, "Write a tweet describing how x people

are evil," you might want your model not to answer that

question.

So what I told you before is that the intuition

of portraying, in general, is that you actually

know what you want to provide to these models.

You do know the type of answers to give to humans.

So you do know the type of answers

that you want to give to humans and what

you want your model to follow.

But that behavior, these answers are scarce and expensive,

so it's pretty expensive and slow to also collect

that type of data.

But you could just go and try to ask humans,

what are all the correct answers to every question

that you might want to ask?

So the idea is that you know what

you want your model to output, but it's

expensive to collect that data.

But pretraining is something where it's

very easy to collect that data.

You just take all of internet, essentially,

but it's not really what you want, as we said.

So the idea is that, given that one is scalable,

but it's not what you want, and the other one is not

scalable is what you want, what you can do

is that you will basically take the pretraining

model that already, you learned about grammar

and different languages.

And you will just fine-tune it or do some small optimization

with the little amount of data that is in the format

that you want.

And this is what we call a post-training.

OK.

So there are two methods.

The first one is supervised fine-tuning.

So the idea is, again, just to fine-tune the LLM

with language modeling, so the exact same method as before.

But you do it on desired answers.

So instead of just predicting the next word,

you predict the next word on answers

that are the answers that you would want to give to humans.

So this language modeling, it means that it's, again,

next word prediction.

And desired answers is why we say supervised fine-tuning.

It's where the S comes from is because you assume that you

have access to the correct answer, which

is why it's supervised.

So how can we collect that data?

There are many different ways-- one is just to ask humans.

And this was the key from GPT 3 to ChatGPT, the initial ChatGPT

model.

And here are some examples from OpenAI system

that did that in the open source, where

you have a question, and then you

have answers that are written by humans.

You can also do it differently.

One problem, basically, with human data

is that it's slow to collect and it's expensive.

So one idea that you might want to do

is to use an LLM to scale data collection.

So this is what we did for Alpaca, for example,

in early 2023, where we basically said,

well, we don't have the money or we

don't have the luxury of having humans

that provide us sciences with.

What we can do is that we can use the best model from OpenAI

at the time to predict the right answer.

And we can basically try to do supervised fine-tuning

with the answer that is given by the OpenAI models.

So we did that on 52,000 answers.

And basically, these are some examples.

And yeah, that was one of the first or probably,

the first instruction following LLM in the open-source.

So that really started as a trying to replicate ChatGPT.

And now, this synthetic data generation

is a whole field on its own, because yeah,

the idea is that now, actually some of these models

are just better than the humans.

So it's not only that humans are slow.

And it's not only that human data collection

is slow and expensive, it might just be that it's lower quality.

So yeah.

OK.

Yeah.

So for SFT, there's another way of doing it.

So we talked about two ways right now.

We talked about humans.

We talked about LLMs that just provide an answer.

But the problem is that if you want

an LLM to provide an answer, you can

assume that you have access to an LLM

that is smarter than the LLM that you're training.

And that was indeed the case when we were training Alpaca,

but this is not the case if you're,

for example, in the best closed labs, which

are training the frontier models,

or even if your inputs is open-source,

and you're trying to train the best open source

models you might not have access to or be

able to distill closed models.

And so what the DeepSeek R1 do, because they

were treating this first top open source model.

The idea is that you can use rejection sampling

based on verifiers.

So what I mean by that is that you can just

use an LLM to provide many different answers to a question,

and then you only keep the answer if it's correct,

in some sense.

So if it passes some test case, or some verification,

or if it's preferred over other answers.

So the idea, again, is, well, you

don't have an ideal LLM that you can generate data from and then

train the SFT to predict that data.

What you can do is if you have access to verifiers or ways

of comparing different samples, is

that you can roll out many samples, then decide which

one is better based on your verifier,

and then do SFT on the sample that is given by the verifier.

So that's exactly what DeepSeek R1

did for the first stage of reinforcement learning or sorry,

for the first stage of safety.

Great.

So what do we learn during SFT?

What are the type of things that we can learn?

Well, we already talked about it.

We can learn instruction following and following

instructions.

You can learn desired formatting or style, be more chatty,

or use emojis, or things like this.

You can use tool use.

So I would recommend reading if you're interested in that.

The Kimi 2 paper, the excellent paper,

that basically uses SFT at scale to learn how to use tools.

You can learn some early reasoning.

So how to think before answering, which is exactly what

we just talked about with DeepSeek R1,

where they use this rejection sampling algorithm.

And honestly, you can learn anything

where you have good inputs and output pairs.

So SFT can either be seen as a final stage for training,

a final model or as a preparation

for the next stage, which is the reinforcement learning stage,

basically, given that this works pretty well,

and you might want to do that first before doing

the next stage, to accelerate the next stage, as we will see.

So SFTP pipelines can become pretty complex.

I'm not going to talk through this one in detail,

but I just want to you a sense of how complicated it can be.

This is about Kimi K2.

So I would recommend reading that paper

and how they use SFT or training for tool use.

So basically, teaching the model to use tools.

And what they did is pretty complicated with some LLM that

simulates users, simulates tools, and then

do this rejection sampling that we talked about before.

And yeah.

So the idea is that they collected a lot of tools,

they simulated a lot of synthetic tools

that tell you how the tool should be called.

And then they basically have an agent

that interacts with an LLM that simulates

user and another LLM that simulates tool calls,

because otherwise, tool calls, you might not have access

to enough different tools to really simulate old tools that

might be called by the model.

And then you basically do some rejection

sampling based on these rollouts that

were generated with these three LLM that interact

with one another-- the agent LLM, the user LLM, and the tool

simulating LLM.

Anyways, all this to say that these things can

become pretty complex but still work pretty well.

OK.

So scalable data for SFT, How much data do you need?

SFT, what is nice is that you actually

don't need that much data.

For learning simple things like style and instruction following,

maybe 10,000 is enough.

So this is from a LIMA paper in 2023 that basically shows that

already with 2,000 examples.

You learned the style and instruction following

capabilities that you want.

If you want to train more complicated things like tool use

and reasoning, you might want to increase that.

So I believe, R1 use 800,000 samples, which is a good amount,

but basically, less than a million, at least.

So yeah, the idea is that you don't

need to train on much data for SFT

if the model already learned that.

My intuition for these type of things or my mental model

is that everything that is already learned really

well during pretraining, but you just

want to surface, during post-training, things

how to write in bullet points or how to use emojis in your model.

And this is more about specializing your model

to one particular type of user, and that it has already

modeled during pretraining.

Then for that, you don't need that much data.

If it's for things that it has never

seen during pretraining or very little, and then

you need much more data.

OK.

So that brings us to reinforcement learning, so

the second method, which is RL.

So in reinforcement learning, yeah,

the problem that we try to solve with reinforcement learning

is that SFT is about behavior cloning of humans or of outputs

as we saw could be from LLMs too.

It's about behavior cloning.

It's about copying the behavior of different outputs.

And this has many issues-- one is

that it's bound by human abilities

or bound by the abilities of the LLMs that you're copying.

But humans, even if you're actually collecting human data,

they might not prefer the things that they are generating.

So even though they might not write better answers

and might not be able to write better answers,

they can still say which one they prefer.

So yeah, the idea is that you will always

be bound by human abilities.

And the second thing is that you will actually

teach hallucination.

And this is a pretty interesting behaviors

that even if you're cloning correct answers

or correct behavior, you might actually be teaching the model

to hallucinate if that model did not know

that that answer was correct.

So what do I mean by that?

Imagine that I ask a question to write some introduction.

Yeah, this is some text.

And basically, I ask you to provide some references.

If the answer provides a reference

that the model does not know about,

we see, what you're teaching the model is

you're teaching the model provide something that seems

like a plausible reference, even if that reference was not

in your pretraining course.

So even if you don't know if that exists, still like that.

So you're basically teaching the model

to make up plausible sounding references.

So yeah.

So hallucination, that's one issue.

And the third thing is that collecting ideal answers

can be pretty expensive.

So the idea or one solution is to,

instead of doing behavior cloning or SFT,

you can do reinforcement learning.

So instead of cloning the behavior,

you can maximize that behavior.

And so I would really recommend reading the DeepSeek-R1

paper and the Kimi K2 papers that some of the best papers

out there in the open-source.

And the key in reinforcement learning

is to decide, what are you maximizing?

What is the reward that you're maximizing?

There are different things, for example,

that R1 has been optimizing for, one might be rule-based rewards,

things like string matches.

Let's say that you have close-ended questionaire

answering.

You might just say, your answer is correct

if the answer is exactly x, or you could have

some test cases for coding.

So that's rule-based rewards.

You can have reward models that were trained

to predict human preferences.

So we will talk a little bit about that,

but you can basically train a classifier

to predict whether something is good or bad, as predicted

by a human, and then optimize against that.

Or you might optimize against an LLM as a judge.

So using an LLM, let's say you use the best possible LLM,

and you just say, is that answer correct or not?

So here, yeah, you see that this particular case, which says,

write a Python code, blah, blah, blah, and then the model

generates different answers.

And here, given that we say, let's write a Python code,

you might have rule-based verification that says,

well, this is not code.

Here, this is the answer.

Here's a joke about frogs.

It is not code.

I asked for Python code.

So is it Python or not?

If it's not Python, then it's also wrong.

And then it might check like if you pass some test cases,

and it will only keep the ones that are passing.

So the idea is to optimize the things that are currently

passing.

So you just say to the model, do more of the thing

that I gave you a correct reward for or positive reward for.

Great.

So yeah, as I said before, I would

recommend reading the DeepSeek-R1 paper,

but basically, they use for what they call reasoning prompts, so

like math questions, coding questions,

and some logical reasoning.

They use these rule-based verifiers

that we just talked about.

And then for general prompts, like translation,

factual question, answering, and writing requests,

things that are more long form text,

they basically use a reward model

that was trained to predict human preferences.

And they try to optimize that.

So what they do is that they start with this SFT checkpoint.

So they use a model.

They do some SFT like this.

The model is already pretty good at generating things

that are often correct.

And then you just do this reinforcement learning pipeline,

where it tries to optimize the number of times

that your verifier says you're correct.

So in terms of algorithm, the most common algorithm

in the open source is the GRPO from DeepSeek-R1.

And the idea is actually pretty simple, is

that you take a policy model, so you take your LLM.

It's usually a SFT model.

You ask it to answer multiple times

to provide multiple outputs to your question.

And then let's skip that part for now.

You basically have a reward model or verifier

that gives a reward, that tells you like, yes, it was correct,

no, it was wrong, by how much, and all of this.

You get a reward for each.

And then you do some group computation

to get your advantages.

The way you just think about it is you

do some normalization, just so that you

know which one is very good and which one is very bad,

you basically renormalize all of these rewards.

And then you basically backpropagate

to tell your policy, do more of the thing that was good.

And then usually, the one we skipped here

is this reference model.

You usually have a KL divergence,

which means that you ask the policy model during training.

You say, don't move too far from outputs from my base model.

And this is really just a hack, because you're basically saying,

reinforcement learning can get you

in places that are usually not ideal.

For example, the hacks that we talked about,

so that's one way of just saying, don't go too far,

optimize as well as you can, but with certain limits

of how far you can go.

Yeah.

And so this is not super important,

but basically, if you know a little bit of reinforcement

learning, DeepSeek-R1 optimizes GRPO, which really just uses

the Monte Carlo estimate for computing the advantage.

And Kimi K1.5 and Kimi K2 use a similar loss.

OK.

So one thing I want to emphasize is

that in reinforcement learning, infra is really key,

it's really, really important.

And the reason why is because, as you saw,

if you use this GRPO algorithm, sampling is a key bottleneck,

because for every question, you have to sample multiple outputs

for each of these problems.

And especially for agents, given that this is an agent class,

this becomes even worse, because you might have

very long agentic rollouts.

And you basically don't want to block all your training compute

on these very long agentic rollouts that

are being rolled out.

So Kimi did a lot of optimization.

And I would, again, recommend reading that papers.

For example, for long rollouts, Kimi

decided to pause these long rollouts.

So if it's more than a certain amount of time,

they will basically pause the rollout,

and they will say, OK, this is not worth it.

We will optimize our weights.

And then we will resume the rollout, the next step.

And then another issue with agents

is that the environment feedback can be slow.

So if you have an agent that really interacts with the world

and cause a lot of APIs and things like this,

maybe you're not even using your GPUs at all,

because maybe you're not even doing rollouts,

and maybe you're just waiting for the environment response.

So the way that Kimi bypasses that is by using,

really, a lot of concurrent rollouts like this one.

When certain rollout is waiting on environment response,

you can work on something else and dedicated services

that microservices that can really spin up and scale.

And the way that Kimi did it is that for every part,

they have a train engine.

Then they have a checkpoint engine

that broadcasts all the weights to all the other pods.

And then they have an inference engine

that really does the sampling.

And what is important is, everything is colocated,

all the engines are colocated on the same pod

to avoid communication overhead.

So anyways, all this to say that there's a lot of optimization

on the infra side.

And Infra is really key here.

So the communication for them takes less than 30 seconds

for communicating the weights.

And everything is, again, working on the same pod.

OK.

So let's talk about reinforcement learning

from human feedback, which, right now,

we talked about reinforcement learning for reasoning when

you have, usually, ground truth verifiers.

The reinforcement learning from human feedback or HF

is this notion of reinforcement learning when

you don't have ground truth.

This is really what made ChatGPT work in 2022.

So the idea is, instead of SFT, where

you clone the behavior of humans,

you want to maximize their preferences.

As I said, this is what made ChatGPT.

And the pipeline is the following.

This is how the original algorithm for ChatGPT worked.

You have an instruction that goes to a model.

A question, it goes to a model.

You ask the model to provide two answers.

And usually, the model is already pretty good.

It's an SFT model.

And then you ask some labelers to select which of the two

was better.

So you have some humans, hey, which one of the two was better?

And then you basically predict, you basically maximize

the number of times that you will generate the thing,

or you will tell your model to predict more

of this thing that was correct.

So there are different algorithms.

I'm not going to go through them.

PPO and DPO are two of them for doing that.

But we see, this is just reinforcement learning,

where your reward is actually given

by a reward model that was trained

to classify human preferences.

And here, you see, these are pretty old results by now.

But here, you see, for learning to summarize,

how a pretraining model performs for summarization,

SFT performs better.

This is human reference summaries, so how well

you compare to humans.

And so SFT really improved compared to pretraining,

but then you see PPO.

So this reinforcement learning made you perform even better.

And this is the order of things, where pretraining is good,

SFT is better, and RL is even better.

And same thing here.

And this is Alpaca farm, which is

a paper we did for optimizing human preferences.

And you see that the two algorithms here, those two

are two RL algorithms work similarly,

and they work better than SFT, and they work better

than the pretraining model.

OK.

Human data.

So as I said, data comes from humans.

This is a very expensive, or at least,

it takes a long time to do and pretty expensive too.

You have to write extremely detailed rubrics

to tell humans what is even considered

as a good answer, what is considered as a bad answer.

Yeah, a lot of work that goes into collecting data.

Collecting data is hard.

Challenges with human data.

As I just said, it's slow and expensive.

Second, it's actually hard for really focusing

on the content of the answers.

Most humans, when you ask them what is good and what is bad,

they will usually focus on the form or the style, things

like lengths.

And this is usually not what you want

to optimize for in your LLM.

Also, depending on who you ask, the distribution of annotators,

you will really get different behaviors,

different political views, and yeah, different views

on many things.

So you have to be pretty mindful about that.

There's also crowdsourcing ethics involved here,

like, who are you asking to label your data?

And yeah, so there's a lot of challenges with human data.

OK.

So one way to reduce this dependency on human data

is exactly what I told you about before with SFT, is

that you can ask an LLM to replace

humans to provide preferences.

And this is, again, this Alpaca farm paper

that we wrote two years ago, which

shows, on the x-axis, the amount of dollars

that you need to spend for collecting data.

In the y-axis, you see the agreement with humans.

And you see that, actually, humans, I believe

are in blue, so here.

So this is around $300 per 1,000 examples that we had to pay

humans.

And you see that the agreement between different humans

is around 66%, while for LLMs, we could divide by 10

or even by 30 the amount of money that we spent.

And that was two years ago.

Now, it would be way less than that.

And we actually performed already better than humans

on predicting the correct human answer.

So it worked surprisingly well.

So you can always use this trick of using LLMs instead of humans.

But again, this is harder to do when you're at the frontier,

and you don't have a better LLM.

OK.

And then evaluation.

So I'll talk really briefly about that.

But there's basically two types of evaluation--

closed-ended evaluation and open-ended evaluation.

And one thing to note is really, evaluation is really the key.

It is one of the most important things in machine

learning in general and AI.

And the reason why is for three reasons--

first, what it does is that it's key to identify improvements,

to quantify progress that you're making,

and to say whether you're making progress and what to change,

what high parameter to select, and things like this.

The second thing that's really important for

is that it will allow you to select which model.

So if I have which model to use for your application,

if I have a specific application in mind,

I will have all these different models

to use, yeah, to choose from.

And I would need to know which one to go after or to choose.

And finally, evaluation is really important

to know whether your model is ready to be put in production.

Even though your model might be the best current model,

is it good enough for your application?

This is very important for practical use cases.

You really need to have good evaluations of your RL labs.

So closed-ended evaluation, the idea

is that if you can turn your problem into something where

you have a few possible answers, then

you can easily automatically verify

whether your answer is correct.

For example, if you turn your eval

into a question-answering evaluation,

then you can simply ask an LLM to provide an answer that

is the answer like A, B, or C, and you simply

look at what the right answer was,

and then you just considered your accuracy.

So this is, for example, what the MMLU eval did.

So there are still many issues.

There's still challenges with closed-end evaluation.

One, it's sensitive to prompting.

Different ways that you prompt your model

will provide different answers.

Two, it might have train test contamination.

So if your model was trained on the eval, because right now,

for example, MMLU is all over internet,

maybe your model was trained on that.

It will see much better than what it actually is.

So this is about closed-ended evaluation.

I really want to focus on open-ended evaluation,

because despite these challenges,

closed-ended evaluation is much easier

than open-ended evaluation.

The question for open-ended evaluation

is, how do we even evaluate something

like ChatGPT or an LLM?

So ChatGPT or all these instruction-following models,

they can be applied on so many different things.

So you can be applying it for coding,

for chatting, for summarization, for many things.

So you really want to have an eval that covers all these use

cases.

The second thing is that it's open-ended tasks.

So what I mean by that is that you have very long answers.

And as a result, you can't do this accuracy-based evaluation,

where you just check whether the answer is verbatim,

the correct answer.

So that makes it hard.

So you cannot do this string matching to know whether

you're correct.

So one idea that you might have for closed-ended evaluation

is that you can simply ask humans to tell you

which answer is preferred.

So you might show two answers to a human

and just ask, which of the two is better?

So this is what Chatbot Arena by LLMs

did, where you basically ask humans to blindly interact

with two chatbot and rate which one is better.

So that's one way of doing that, of basically dealing

with this challenge, where, when you ask open-ended tasks,

with what is not a single answer,

and the answers are usually really long.

It's easier.

Yeah, it's much easier to just ask

humans to rank things than to compare it to a gold answer,

because there's no gold answer.

And the problem with this is that using humans

is, again, costly and expensive.

It's very expensive and slow.

So just as before, what you can do

is you can use an LLM instead of a human.

This is what we did with AlpacaEval two years ago,

and many others followed.

And the idea here, the steps is that for each question,

you might ask a baseline that could be a human or a model

to provide an answer and the model

that you try to evaluate to provide an answer.

And then you will just ask another LLM,

which of the two answers is better?

And then you will just look at the number of times

that your answer is better than the baseline.

And you can get what we call a win rate, which

is probability of winning.

So AlpacaEval was one of the first eval doing that.

And despite being much cheaper than Chatbot Arena,

it had really high spearman correlation with Chatbot Arena.

So using our LLMs can be really good as a judge

for judging your performance and for evaluating your performance.

Yeah.

So running AlpacaEval right now probably costs much less,

but at the time, was less than three minutes and less than $10.

Great.

OK.

So I think I'm getting at the end.

I do want to tell you a little bit about systems and Infra,

because as I said, if you really understand the fact that scaling

is what matters, then the natural thing

that you should be spending time on

is also making sure that your models, your training

can scale well.

So the problem is that everyone is bottlenecked by compute.

So one idea that you might have is, well,

if you're borrowing by compute, and if you

know that spending more compute gives you a better answer,

why just not buy more GPUs and training on that?

There are a few reasons why we can do that.

One, of course, GPUs are expensive,

but they are not only expensive, they are scarce.

So even if you have the money, it

can be hard to just get access to the best GPUs.

And then there are physical limitations.

So if you have a lot of GPUs, you

need to have the communication between GPUs.

And that can really slow down your training.

So you do need to optimize your systems

and make sure that training is as

efficient as possible on every GPU that you have.

So yeah, you need to do some good resource allocation,

and you need to optimize your pipelines.

OK.

So I will try to give you an extremely brief overview

of GPUs, just for you to get a sense of what

matters when you optimize these runs

and what you're actually optimizing for.

So Systems 101 GPUs, so the difference between GPUs and CPUs

is that essentially, GPUs are massively parallel.

So they will apply the same instructions in parallel

on all different threads but different inputs.

So you will have the same input that

will go on different threads.

And it will basically--

sorry, different inputs that will

go through different threads and will apply the same instructions

to them.

So really, the difference with CPUs

is that you're optimizing for throughput.

It's massively parallel.

So here, you see GPUs and CPUs, the difference.

So yeah, as I said, first, GPUs are massively parallel.

Second thing is that GPUs are really optimized

for matrix multiplications.

So GPUs are graphical processing units.

And anything about computer vision and graphics

really requires extremely fast matrix multiplications.

So from the early days of GPUs, people building GPUs

are really optimizing for matrix multiplication.

So they have specific cores that will make matrix multiplication

very fast and actually, around 10 times faster

than most other floating point operations.

So you see different versions of GPUs.

And you see the speed.

And you see that for matrix multiplication

is much faster, especially recently, much

faster than nonmatrix multiplication, floating point

operations.

So another thing that is important to understand

about GPUs is that actually, compute

is not the bottleneck anymore.

So what I mean by that is that if you

look at here, the compute, that the peak hardware scaling.

So here you have compute so FLOPs

that could be performed on the best hardware across time.

And here, you have, basically, the communication and memory,

and how it improved across time.

And you see that, basically, compute increased or improved

and much faster across time than memory and communication.

So what that means is that right now, we have more compute

in a GPU than we have memory.

And then we have improvements in communication.

So in other words, the bottleneck for GPUs

is not performing the computation,

but it's actually keeping the processor

that performs the computation fed up with data.

So you basically need to send as much data as possible there.

And the bottleneck is actually feeding the data,

not doing the computation.

And that's a very important thing

to understand when you're optimizing your pipelines.

And yeah, as a result, if you look at this paper for 2020,

that analyzes where you perform all the compute form, how

much time it takes to run a transformer.

You will actually see that things like tensor contraction,

which is basically matrix multiplication,

requires most of the FLOPs, so most of the actual compute

that is required for matrix multiplication.

But in terms of runtime speed, it's still a majority,

but only 2/3 of the runtime is spent on the thing that is most

of the compute.

And things like element-wise operation or some normalization

actually requires very little floating point operations

but requires a pretty large amount of time,

or basically, you spend a lot of time

there, because you need to still send to the GPUs your data

and do the computation, even if the computation is slow.

OK.

And the last thing that you need to know about GPUs

is that it's really a large memory kind of hierarchy.

So the closer you are to the cores,

the cores being the things that actually perform

the computation, the faster the communication with the cores

will be, the less memory that will be there.

And the further you are from the cores, the more memory,

but it's slower.

And you basically have different levels of hierarchy,

so you have this shared memory.

And then the L1 cache, that is the shared

memory that is super close.

And then you have the L1 cache, the L2 cache, and then

you have the global memory that is very far

from your registers and your process.

So yeah, this memory hierarchy and the metric

that we try to optimize when we try to optimize our runs

and just our systems is Model FLOP Utilization,

you will often hear this word, or MFU for short.

And this is basically the ratio between the observed throughput

to your model and the theoretical best.

So NVIDIA will tell you, at best, we

can do that amount of FLOPs.

And then you will check how much you are achieving.

And if you achieve an MFU of 1, that

means you are able to get all the data, the maximum of data.

You can always keep your process basically fed with data.

So at any point of time, there's something

that is being computed on your processor.

Just to give you a rough sense of these numbers,

if you have 50%, you're in a really,

really, really good shape.

And even big companies might be optimizing

to go higher than 15% or 20% to achieve this 50%.

So I want to give you a very quick overview of things

that you might want to do for optimizing your runs, just

to give you a sense of, at least things that people do

for optimizing this compute and making sure

that your runs are scalable.

One thing that you might do is low precision operations.

So the idea is that if you use fewer bits for every data points

or for every data that goes in your processor,

you will have faster communication and lower memory

consumption.

So as I said, given that the bottleneck is not the compute,

but it's this memory and this communication,

might just decrease the precision

in which you put your data in.

And as a result, you will just have faster communication,

because you can put more through the bottleneck and then lower

memory consumption.

So for example, for deep learning,

the actual decimal precision is not that important,

except for a few operations.

That's because there's a lot of noise in any case.

And when you train deep neural networks, because stochastic

gradient descent already has a lot of noise.

So matrix multiplications will usually

be done in bf16 rather than fp32,

so you can halve the precision.

And if you have the precision, what you can do?

So usually, one thing that is very common

is using automatic mixed precision or AMP

during training, which is that the weights are stored

in fp32, so use 32 bytes.

Before the computation, you will convert the fp32 to bf16,

so you will basically half the precision.

And then everything will be done in bf16.

So you will have less memory.

You will have more speed up because faster communication.

And your gradients will be stored in bf16,

so you'll have memory gains.

And then at the end, you will put it back in fp32.

And every small operation that you do

can be shown in your weights at pretty high precision.

Great.

There are other optimizations.

For example, operation fusion.

So again, the idea here is that communication is slow,

as we said.

And every time that, for example,

if you write in PyTorch, every time you write a new line,

it actually moves your variable back to global memory.

And that makes it very costly.

Because basically, if you do something like x1

is equal to the cosine of x, you will basically

read x from global memory, write it to x1.

And then when you do this new line,

x2 is equal to cosine of x1, you will again

take it back to global memory and write it to x2.

So that can be really very slow because you

have a lot of this communication between global and memory.

And so what you might want to do,

so this is just to give a schematic version of what

is happening.

So here, you have everything in memory, your DRAM.

And basically, you will send data to your processes

for performing compute.

And after every new line, you will send it back to your DRAM.

And then you have to do it again and do it again.

If you just have a PyTorch function,

this is a naive way that things are working.

But there's a much better way of doing

that once you realize that communication is the bottleneck,

is that you might just communicate once and do

all the operations, and then communicate it back.

So this is what fused kernels are doing.

So the idea is that you communicate once,

you perform all the operations that you want,

and then you send it back.

And as we said, this is actually fast, this is slow,

so there are fewer slow things here.

And this is basically what torch.compile

does to your code is that it fuses operations together.

OK.

Tiling, I know it's becoming long,

so I'll just quickly talk through that.

The idea is that the order in which you perform operation

will matter a lot because of communication.

So what I mean by that is that you can group and order

threads that are performing some computation

to minimize the number of times that you will communicate

with global memory.

So I'll give you an example for matrix multiplication.

Here, this is the very naive way of doing matrix multiplication.

This is how you basically learn it

at school, where you take two matrices that you multiply them,

you will basically go through all this column and all

this row.

You will basically multiply these two together

and multiply these numbers, and then sum across all of that.

And you get this number for this one.

And the way that basically, the memory is accessed here

is that one thread here is going to access

this one, and this one, and then this one, and this one.

And you'll basically have one thread

that is working with all of this and then all of this one.

And then you will have another thread

that is working on these things separately.

And then when this one is done, it

will work on a different column and a different row.

So what is important here is that you will rarely reread

the same values from cache.

In contrast, what you can do is you can split up your matrix

multiplication into different tiles to reuse memory.

So for example, you might say, well,

I'm going to have one thread that instead of working

with all the column and all the row at once,

it basically works with all of these four values

together against these four values together.

So it'll basically do multiply this and this together, and then

this one and this one together.

It's a bit hard to explain without actually showing it

to you and just with this diagram.

But basically, this number here, N_00, will be used twice,

will be used to multiply M_00 and M_10.

So basically, for one number that I have access to,

I made two operations.

So I have N_00, I made two operations.

So basically, I have to read less from global memory,

because by one read, I made two operations.

While before, with one read, I made only one operation.

So you're basically making sure that you make more work

with the same amount of data or same amount of work

with less data.

So you basically read less.

And you can still work through an algorithm like this

where you multiply these element-wise with this.

And then you have another thread that

works with this one, element with this one.

And then you have the partial sums, and you sum them together.

Anyways, all this to say, it's not super important.

The actual algorithm is really not that important.

What is important is that the order in which you perform

operations can really impact--

the grouping and ordering can really

impact the number of times that you have

to read from global memory.

And tiling is one way where you basically group things together

in a single thread, such that with less data,

you can do more computation.

So you can reuse the reads and basically have access

to your cache.

Great.

So FlashAttention is one pretty famous optimization

that was done for making attention faster.

And it basically combined the three things

that we talked about before, which

is this kernel fusion, this tiling, and also

one additional thing, which is recomputation.

So sometimes, it's cheaper to redo a computation

than actually reading from your memory, the values.

So basically, here, the recomputation of attention

is the idea that don't save everything.

Sometimes, it's easier to just recompute

the values then storing them.

And FlashAttention.

V1 got 1.7 end-to-end speed up from 3,

1.7x speed up gains just by combining these things together.

So all this to say that, yeah, systems really matters.

You can get huge speed up gains at no ML.

This is completely ML-neutral.

This is just an order of operation.

And yeah, the order in which you perform operation

can really improve a lot your performance.

OK.

I think I'm arriving at the end.

I do want to maybe briefly talk about parallelization.

That's the last big topic in terms of systems.

So the idea is that you have very big model.

This is one of the big problems is

that you have very big models.

And big models cannot fit on one GPU.

So you really want to use as many GPUs as possible for making

your training runs fast.

So once you think about it this way, there's a question of,

how do you split your GPUs?

Sorry, how do you work with as many GPUs as possible?

And how do you fit your model into GPUs?

And the idea is that you can split

your memory and your computation across GPUs.

So again, the problem is that models are big.

They don't fit across GPUs.

They don't fit on a single GPU.

Two, you want results as fast as possible,

so you want to put as many GPUs as possible working together.

And the idea is that you can split the computation

and split the memory across different GPUs.

And this is all about parallelization.

OK.

Slide background here is to naively train

a model that has p, parameters.

You actually need 16 times P gigabytes of DRAM.

So the reason is that you need 4P, so four bytes, because here,

we assume that it's 32 floating, like yeah, FP32.

So you have four bytes for every P parameter.

And so you have, yeah, sorry, 4P gigabytes for model weights.

And then for the optimizer, if you talk about Adam,

you need to store both the mean and the variance

of every parameter.

So the optimizer needs to store 2 times 4P gigabytes in terms

of values.

And so you have here, 12.

And here, you have 4P for the gradients.

So you need to store both the weights

when you do backpropagation to store the gradients.

And this is also 4P.

So basically, that means that for training a seven billion

parameter model, you actually need 112 gigabytes of memory,

yeah, which is really huge.

So the idea here is that you can optimize that by--

yeah, the goal, at least, is to use more GPUs

and to optimize your training.

So let's say that you have four GPUs here.

And you want to optimize.

You want to basically have every GPU working simultaneously

on your data set.

One naive way that you can do it is

that you can copy the model and the optimizer on every GPU.

You can split the data.

And then you can basically have every GPU working

on the same model but different set of data,

because you split up your data.

And then at the end, after they do one step,

you basically communicate the gradients and sum the gradients.

And that will be the total gradient

that you would have gotten if you had actually trained

on the four sets of data.

So basically, after every batch, everyone

works on a separate batch.

And then at the end, you get gradients,

you communicate, you sum them.

And then you have, basically, the same gradient

as what you would have had, had you trained on four times

the batch size.

So the benefit is that now, you can use all these GPUs,

because now, you can use four times more GPUs than before.

So it's four times faster than before.

The negative aspect is that here, you

have literally no memory gains, because now,

if your model, for example, didn't fit on one GPU,

it still doesn't fit on a single GPU right now.

And also here, we said, 7B models require 112 gigabytes

of memory.

Here, it means that you really need

to have 112 gigabytes of memory on every GPUs,

so there's no memory gains.

So how would you split the memory?

How would you get memory gains?

One way of doing that is to have each GPU, update

a subset of weights.

Yeah, update a subset of weights and hold the subset of weights.

And then you communicate them before updating your weights.

So this is what we call sharding.

So here, one way of doing that is this paper called 0.

So here, you see the baseline, which has the 4P gigabytes

for parameters.

Here, you have the 4P gigabytes for gradients.

And here, you have the 8P gigabytes for optimizer states.

And they have different levels of sharding.

So the first thing that you can shard

is you can shard the optimizer.

So you can say, well, every GPU will only have

one subset of optimizer states.

And basically, each of them contain a subset.

And we'll just communicate them when needed.

So you basically have this, and then you can do the same thing.

So this will cut a lot your memory requirement from 120

gigabytes to 31 gigabytes, so nearly a 4x decrease.

And then you can do the same thing for your gradients.

And you can do the same thing for your parameters.

So you can basically say, every GPU

takes care of a different subset of weights.

OK.

So that was for data parallelism.

And now, let's talk about model parallelism.

The problem with data parallelism

is that it requires to have at least as much data as you have

or at least as much batch size as you have GPUs.

So basically, as I said, I assume,

you have a batch size of 16.

Basically, what you're saying is, if you have four GPUs,

average GPU now gets a batch size of four,

so 16 divided by 4.

But what if I want to use 32 GPUs?

How do I now split up that data to fit into 32 GPUs?

The idea is that you can have every GPU take

care of applying specific parameters rather than updating.

So what we saw before with this data parallelism

is that every GPU can take care of updating specific parameters.

But here, the idea with model parallelism

is that instead of having every GPU taking

care of updating the parameters, you

can have every GPU taking care of applying the parameters,

so like applying the actual operations.

So for example, in pipeline parallelism,

you can say that every GPU has access

to a whole different layers.

So you can say, layer 1 is on GPU 1, layer 2 is on GPU 2.

And basically, what you have is that once you take data,

you make all the data passed to your first layer, which

is on GPU 1, and then you send it to GPU 2,

it passes through the second layer, then GPU 3, et cetera.

So yeah.

So this is for pipeline parallel.

I'm going to skip that part.

And then you have tensor parallel.

So this is the idea that instead of having every GPU hold

a different layer, you can have split matrices,

or you can split inside of a layer,

you can split it between GPUs.

So for example, when you multiply

a matrix with a vector, what you can do

is you can split up the matrix into two,

you can split up the vector into two, and you can basically say,

I'm going to operate with these matrices on this vector,

and these matrices on this vector.

And I'm going to aggregate everything at the end together.

So this is what we call tensor parallelism.

So pipeline parallelism is this idea

that every GPU has different layers.

Tensor parallelism basically split up the weights

into different GPUs.

Great.

OK.

And the last example for system optimization

is that models are really huge.

So instead of splitting up your model weights

onto different GPUs, what you can have is

you can say, well, actually, not every data point

has to go through every parameter.

And this is what we call sparsity.

So very common architecture that is sparse

is the mixture of experts that basically

says, only some parameter will be

active for some types the data points.

So the idea is that you now have a data point that comes in.

And it will only go to some set of parameters, not

all the parameters.

And this makes it very easy for doing parallelism

and multi-GPU training, because you can just

have different GPUs contain the parameters that are required

for different data points.

So here, you have a dense model that basically is a little bit

too much into the weeds, but if you know about transformers

you have this linear layer at some point.

And you can basically say, the linear layer

is going to be split into different linear layers,

and different data points will go

through different linear layers.

And every GPU can basically have access

to different linear layers.

So if you didn't follow the last two slides,

though, maybe a little bit not as important,

but what I do want to stress is that there's

a lot of work that goes into systems

optimization and that kind of optimizing

the use of your compute.

And the different ways that we saw about doing that

was tiling, so like ordering of your operations.

It was the sparsity, so basically,

making your model sparse and not having every parameter.

So every data point go through every parameter.

We talked about parallelism.

So basically, using more GPUs.

And yes, I think that's basically it.

Great.

So we're done.

And there's no questions today, because as I said,

this is a rerecording of the video.

I know this was pretty long.

I'm also starting to be a little bit tired.

But I hope it was useful.

And yeah, good luck for the rest of the class.

Loading...

Loading video analysis...