This 2-Hour Stanford Lecture Explains How ChatGPT & Claude Are Built (Must Watch)
By Meet Sethu
Summary
Topics Covered
- Architecture is secondary to data and systems
- No overfitting in large language models
- Scaling laws show compute dominates
- LLMs can label preferences better than humans
- Pre-training is just model initialization
Full Transcript
So, I'll be talking about building LLMs today. I think a lot of you have heard of LLMs before, but just as a quick recap: LLMs, standing for large language models, are basically all the chatbots that you've been hearing about recently. So: ChatGPT from OpenAI, Claude from Anthropic, Gemini, Llama, and other models like this. Today we'll be talking about how they actually work. It's going to be an overview, because it's only one lecture and it's hard to compress everything, but hopefully I'll touch a little bit on all the components that are needed to train some of these LLMs. Also, if you have questions, please interrupt me and ask; if you have a question, most likely other people in the room or on Zoom have the same question. So, please ask.
Great. So, what matters when training LLMs? There are a few key components that matter. One is the architecture: as you probably all know, LLMs are neural networks, and when you think about neural networks, you have to think about what architecture you're using. Another component that is really important is the training loss and the training algorithm, so how you actually train these models. Then there's data: what do you train these models on? Then evaluation: how do you know whether you're actually making progress towards the goal? And then the systems component: how do you actually make these models run on modern hardware? That's really important, because these models are really large, so now more than ever, systems is actually a really important topic for LLMs. So those are the five components. You probably all know that LLMs (and if you didn't, now you do) are all based on transformers, or at least some version of transformers.
I'm actually not going to talk about the architecture today: one, because I gave a lecture here on transformers a few weeks ago, and two, because you can find so much information online about transformers, while there's much less information about the other four topics, so I really want to talk about those. Another thing to say is that most of academia actually focuses on the architecture, the training algorithm, and the losses. As academics (and I've done that for a big part of my career), we simply like to think that making new architectures and new models is what's important, but in reality, honestly, what matters in practice is mostly the three other topics: data, evaluation, and systems, which is what most of industry actually focuses on. So that's also one of the reasons why I don't want to talk too much about the architecture: really, the rest is super important.
Great. So, overview of the lecture. I'll be talking about pre-training. Pre-training, you've probably heard that word: it's the classical language modeling paradigm, where you basically train a language model to essentially model all of the internet. Then there's post-training, which is a more recent paradigm: taking these large language models and making them essentially AI assistants. That's a more recent trend, since ChatGPT. So if you've ever heard of GPT-3 or GPT-2, that's really pre-training land; if you've heard of ChatGPT, which you probably have, that's really post-training land. I'll talk about both, but I'll start with pre-training, and specifically I'll talk about what the task of pre-training LLMs is and what loss people actually use.
Language modeling: a quick recap. Language models, at a high level, are simply models of a probability distribution over sequences of tokens or words. So it's basically some model of p(x1, ..., xL), where x1 is the first word and xL is the last word in the sequence or sentence. Very concretely, if you have a sentence like "the mouse ate the cheese", what the language model gives you is simply the probability of this sentence being uttered by a human or being found online. If you have another sentence like "the the mouse ate cheese", there are grammatical mistakes there, so a model with some syntactic knowledge should know that this has a lower likelihood of appearing online. If you have another sentence like "the cheese ate the mouse", the model should hopefully know that cheese doesn't usually eat mice. So there's some semantic knowledge, and this sentence is less likely than the first one. That's basically, at a high level, what language models are.
One term you've probably been hearing a lot in the news is "generative models". A generative model is just something that can generate sentences, or generate data more broadly. The reason we say language models are generative models is that once you have a model of a distribution, you can simply sample from it, and now you can generate data. So you can generate sentences using a language model.
The type of models that people are all currently using are what we call autoregressive language models. The key idea of autoregressive language models is that you take this distribution over words and decompose it into the distribution of the first word, multiplied by the distribution of the second word given the first word, multiplied by the distribution of the third word given the first two words, and so on: p(x1, ..., xL) = p(x1) * p(x2 | x1) * ... * p(xL | x1, ..., x(L-1)). There's no approximation here; this is just the chain rule of probability, which hopefully you all know about. Really, no approximation: this is just one way of modeling a distribution. Slightly more concisely, you can write it as a product over positions of the probability of the next word given everything that happened in the past, that is, given the context. This is what we call autoregressive language models. Again, this is really not the only way of modeling a distribution; it's just one way, and it has some benefits and some downsides.
One downside of autoregressive language models is that when you actually sample from one, you basically have a for loop that generates the next word, conditions on that word, and then generates the one after. So if you want to generate a longer sentence, it takes more time to generate it. There are some downsides to this current paradigm, but it's what we currently have, so it's the one I'm going to talk about.
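To make that for loop concrete, here is a minimal sketch of autoregressive sampling. The `model` here is a hypothetical stand-in that maps a list of token IDs to a probability distribution over the next token; none of this is actual code from the lecture.

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Sample one token at a time, conditioning on everything sampled so far."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):       # one forward pass per generated token
        probs = model(ids)                # hypothetical: P(next token | context)
        next_id = np.random.choice(len(probs), p=probs)
        ids.append(next_id)               # condition on the token just sampled
        if next_id == eos_id:             # stop at end-of-sequence
            break
    return ids
```

Note that the loop is inherently sequential: generating L new tokens costs L forward passes, which is exactly the latency downside just mentioned.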
Great. So, autoregressive language models at a high level: the task of an autoregressive language model is simply predicting the next word, as I just said. If you have a sentence like "she likely prefers", one potential next word might be "dogs". The way we do it is that we first tokenize: you take these words or subwords, you tokenize them, and you give an ID to each token, so here one, two, three. Then you pass that through this black box (as I already said, we're not going to talk about the architecture): you just pass it through a model and you get a probability distribution over the next word, over the next token. Then you sample from this distribution, you get a new token ID, and you detokenize: that's how you basically sample from a language model. One thing that is important to note is that the last two steps are only needed during inference. During training, you just need to predict the most likely token, compare it to the real token that actually came next, and then change the weights of your model to increase the probability of generating that token.
Great. So, autoregressive neural language models. To be slightly more specific, still without talking about the architecture, the first thing we do is that we have all of these... oh sorry, yes?
On the previous slide, when you're predicting the probability of the next token, does this mean that your final output vector has to be the same dimensionality as the number of tokens you have?
Yes.
How do you deal with having more tokens, like if you're adding more tokens to your corpus or something?
Yeah, so we're going to talk about tokenization later, so you'll get some sense of this. You basically can't deal with adding new tokens. I'm kind of exaggerating: there are methods for doing it, but essentially people don't do it. So it's really important to think about how you tokenize your text, and that's why we'll talk about it later. But it's a very good point to notice that the vocabulary size, so the number of tokens that you have, is essentially the output dimension of your language model. So it's actually pretty large.
Okay. So, autoregressive neural language models: the first thing you do is take every word, or every token, and embed it, so you get some vector representation for each of these tokens. You pass them through some neural network; as we said, it's a transformer. Then you get a representation for all the words in the context, so basically a representation of the entire sentence. You pass it through a linear layer, as you just said, to map it so that the number of outputs is the number of tokens in the vocabulary. You then pass it through a softmax, and you get a probability distribution over the next word given every word in the context.
The loss that you use is essentially the loss for classifying the next token, so it's a very simple machine learning task. You use a cross-entropy loss: you look at the actual target that happened, which is a one-hot target distribution. In this case, say the real next word is "cat"; the target is a one-hot distribution over "cat". And here is the distribution that you generated, and you apply cross-entropy, which really just increases the probability of generating "cat" and decreases the probability of generating all the other tokens. One thing to notice, as you all know, is that this is just equivalent to maximizing the text log-likelihood: you can rewrite maximizing the probability under this autoregressive language model as minimizing the cross-entropy loss (just add a log and a minus sign). So minimizing the loss is the same thing as maximizing the likelihood of your text. Any questions?
Okay.
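To make the loss-likelihood equivalence above concrete, here is a minimal numeric sketch; the probability arrays are made-up illustrations, not values from the lecture.

```python
import numpy as np

def cross_entropy_loss(probs, targets):
    """probs: (L, V) next-token distributions; targets: (L,) observed token IDs.

    Cross-entropy against a one-hot target is just -log P(observed token),
    so minimizing the mean loss maximizes the log-likelihood of the text.
    """
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# Tiny example: a 3-token vocabulary and a 2-token continuation.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])                  # the tokens that actually came next
print(cross_entropy_loss(probs, targets))   # -(log 0.7 + log 0.8) / 2 = ~0.29
```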
Tokenizer.
So, this is one thing that people usually don't talk that much about. Tokenizers are extremely important, so it's really important that you understand at least what they do at a high level. Why do we need tokenizers in the first place? First, they're more general than words. One simple thing you might think of is to take every word and make every word a token in its own right. But then, what happens if there's a typo in a word? You might not have any token associated with that misspelled word, and then you don't know how to actually pass it into the large language model. So what do you do? Also, words are fine for Latin-based languages, but if you think about a language like Thai, you won't have a simple way of tokenizing by spaces, because there are no spaces between words. So tokens really are much more general than words.
That was the first reason. The second thing you might think of is to tokenize every sentence character by character: A is one token, B is another token. That would actually work, and probably very well. The issue is that your sequences then become super long, and as you probably remember from the lecture on transformers, the complexity grows quadratically with the length of the sequence. So you really don't want super long sequences. Tokenizers basically try to deal with those two problems: they give common subsequences their own token, and a useful rule of thumb is that on average every token is around three to four letters.
There are many algorithms for tokenization. I'll just talk about one of them to give you the high-level idea: byte pair encoding (BPE), which is actually pretty common, one of the two most common tokenizers. The way you train a tokenizer is that you first start with a very large corpus of text. Here I'm really not talking about training a large language model yet; this is purely for the tokenization step. So this is my large corpus of text, with these five words. Then you assign every character in this corpus a different token. So here I just split up every character into a different token, and I color-coded all of those tokens. Then what you do is go through your text, and every time you see a pair of tokens that is very common (the most common pair of tokens), you merge them. So here you see the tokens "t" and "o" next to each other three times, so you say "to" is a new token, and then you repeat. Now you have "tok", which happens three times, "te", which happens, sorry, two times, "token", which happens twice, and "tex", which also happens twice. So that's how, if you were to train a tokenizer on this (very small) corpus of text, you would end up with a trained tokenizer. In reality you do it on much larger corpora of text.
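Here is a toy version of that training loop, following the merge procedure just described: start from characters and repeatedly merge the most frequent adjacent pair. This is a sketch for intuition only (no byte-level fallback, no pre-tokenization), and the corpus is a made-up stand-in for the slide's example.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent token pair."""
    seqs = [list(word) for word in corpus.split()]   # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))          # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent pair
        merges.append((a, b))
        for seq in seqs:                             # apply the merge everywhere
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]           # fuse into a single token
                else:
                    i += 1
    return merges

# Prints the merges in the order they were learned, starting with ('t', 'o').
print(train_bpe("token tokens tokenizer text texts", 5))
```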
And this is a real tokenizer, I think from GPT-3 or ChatGPT, and here you see how it would actually separate these words. You basically see the same thing as in our previous example: "token" becomes its own token, so "tokenizer" is actually split into two tokens, "token" and "izer". So yeah, that's all about tokenizers. Any questions on that? Yeah?
How do you deal with spaces, and how do you deal with punctuation?
Yeah. So, actually there's a step before tokenizers, which is what we call pre-tokenizers, and it deals with exactly what you just said. In theory there's no reason to handle spaces and punctuation separately: you could just say every space gets its own token, every punctuation mark gets its own token, and then do all the merging. The problem is efficiency: actually training these tokenizers takes a long time, because you have to consider every pair of tokens. So what you end up doing is saying: if there's a space (pre-tokenizers are very English-specific), we're not going to consider merging the token that came before with the token that came after. So you never merge across spaces. But this is just a computational optimization; you could theoretically deal with spaces the same way as any other character.
Yeah.
When you merge tokens, do you delete the tokens that you merged away, or do you keep the smaller tokens?
You actually keep the smaller tokens. In reality it doesn't matter much, because on a large corpus of text you will actually see everything, but you usually keep the small ones. The reason you want to do that is that if, as we said before, there are grammatical mistakes or typos, you still want to be able to represent those words character by character. So, yeah. Yes?
Are the tokens unique? I mean, say in this case, "ten": is there only one occurrence, or do you need to keep multiple occurrences so they can take on different meanings or something?
Oh, I see what you're saying. No, no: every token has its own unique ID. This is a great question. For example, think about "bank", which could be a bank for money or the bank of a river. It will have the same token, but the model, the transformer, will learn from the words around it (I'm being very handwavy here) to associate it with a representation that is either more on the money-bank side or the river-bank side. But it's the transformer that does that, not the tokenizer. Yes?
Yeah. So you mentioned that during tokenization you keep the smaller tokens you started with, right? Like, if you start with a "t", you keep the "t", and then you build your tokenizer to the point that it can encode "token". So let's say maybe you didn't train on "token", but in your data you're trying to encode "token". How does the tokenizer know to encode it with "token" rather than...
Great question. When you tokenize (so that's after training of the tokenizer, when you actually apply it), you basically always choose the largest token that you can apply. So if you can use "token", you will never use "t"; you will always use "token". People don't usually talk that much about tokenizers, but there are a lot of computational tricks for making these things faster. And honestly, I think a lot of people believe we should just get away from tokenizers and tokenize character by character or byte by byte. As I said, right now there's this issue of sequence length, but maybe one day, in five or ten years, we'll have different architectures that don't scale quadratically with the length of the sequence, and maybe then we'll move away from tokenizers.
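As a sketch of the "always choose the largest token" rule mentioned above, here is a greedy longest-match encoder. This is a simplification of my own: real BPE implementations replay the learned merges in order rather than matching greedily, and the `vocab` below is a made-up token set.

```python
def encode(text, vocab):
    """Greedily take the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring first, falling back toward single characters.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # unknown character: keep it as its own token
            i += 1
    return tokens

print(encode("tokenizer", {"token", "izer", "to", "ken", "t"}))
# -> ['token', 'izer'], matching the tokenizer example above
```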
Can you share with us the drawbacks? Why do people want to move away from tokenizers?
Yeah, so one good example is math. Numbers right now are not tokenized digit by digit, so for example "327" might have its own token, which means that when models see numbers, they don't see them the same way we do. And this is very annoying, because the reason we can generalize in math is that we can deal with each digit separately and then compose: you know that adding two numbers is the same as adding each digit separately, plus the carry. Models can't easily do that, so you have to do special tokenization. One of the big changes GPT-4 made was changing the way they tokenize code. For example, in Python you often have these four spaces at the beginning of a line; those were handled kind of strangely before, and as a result the model couldn't really understand how to deal with code. So tokenizers actually matter a lot. Okay, I'll move on for now, but we can come back to tokenizers later.
Great. So we've talked about the task, the loss, and the tokenizer. Let's talk a little bit about evaluation. The way LLMs are usually evaluated is using what we call perplexity. At a high level, it's basically just your validation loss. The slight difference is that perplexity is slightly more interpretable: you take the average per-token loss and you exponentiate it. The reason you exponentiate is that the loss has a log inside, and one, humans are pretty bad at thinking in log space, and two, logs depend on the base you choose, while once you exponentiate, everything is in vocabulary-size units. Averaging per token just makes your perplexity independent of the length of the sequence. So perplexity is 2 to the power of the average per-token loss of the sequence.
Perplexity is between one and the vocabulary size of your tokenizer. It's one if you predict perfectly: every word has probability one, so you take a product of ones, and the best perplexity you can get is one. If you really have no idea, you predict every token with probability one over the vocabulary size, and if you do the simple math you get a perplexity equal to the vocabulary size. So the intuition is that perplexity is the number of tokens your model is hesitating between. If your model is perfect, it doesn't hesitate: it knows exactly the word. If it really has no idea, it hesitates between the entire vocabulary.
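In code, using natural log and exp (the base cancels out, as noted above); the probabilities below are made-up illustrations.

```python
import numpy as np

def perplexity(token_probs):
    """token_probs: probability the model assigned to each observed token."""
    avg_nll = -np.mean(np.log(token_probs))   # average per-token loss
    return np.exp(avg_nll)                    # exponentiate out of log space

print(perplexity([1.0, 1.0, 1.0]))        # perfect prediction -> 1.0
vocab_size = 50_000
print(perplexity([1 / vocab_size] * 10))  # uniform guessing -> 50000.0
```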
Perplexity has really improved. That's perplexity on a standard dataset between 2017 and 2023: it went from around 70 tokens to fewer than 10 tokens over those five or six years. That means the models used to hesitate between 70 words every time they generated a word, and now they hesitate between fewer than 10. So that's much better. Perplexity is actually not used anymore in academic benchmarking, mostly because it depends on the tokenizer you use and on the actual data people evaluate on, but it's still very important for the development of LLMs: when you actually train your own LLM, people still really look at the perplexity.
Another common way, now more common in academia, of evaluating these LLMs is to take all the classical NLP benchmarks (I'll give you a few examples later) and aggregate everything: collect as many automatically evaluatable benchmarks as you can and evaluate across all of them. Two such benchmarks are HELM, which is from Stanford, and the Hugging Face Open LLM Leaderboard; those are probably the two most common ones right now. Just to give you an idea, in HELM there are all of these types of tasks, mostly things that can be easily evaluated, like question answering. Think of many different question-answering tasks. The benefit of question answering is that you usually know what the real answer is, so the way you evaluate these models (I'll give a concrete example in one second) is that you can just look at how likely the language model is to generate the real answer compared to some other answers. That's essentially, at a high level, how you evaluate these models.
To give you a specific example, MMLU is probably the most common academic benchmark for LLMs. It's just a collection of many questions and answers across all of these domains, for example college medicine, college physics, astronomy, and topics like that. The questions are things like this one from astronomy: "What is true for a type Ia supernova?" You give four different potential answers and you ask the model which one is most likely. There are many different ways of doing this: either you look at the likelihood of generating each of these answers, or you ask the model which one is most likely. So there are different ways you can prompt the model, but at a high level you know which answer is correct and that the three others are wrong. Yes?
We're kind of treating this as unconstrained text output, right? How do you evaluate a model if it gives something that's semantically completely identical but is not the exact tokens you expect?
Yeah, that's a great question. I'll talk more about that later. Here, in this case, we don't do unconstrained generation. The way you would evaluate MMLU is basically: either you ask the question and then look at the likelihood of the model generating answer A, the likelihood of generating B, then C, then D, and you pick the most likely; or you ask the model "out of A, B, C, D, which is most likely?" and you look at whether the most likely next token is A, B, C, or D. So you constrain the model so it can only answer those four things.
When you say you constrain the model, do you mean you constrain it with the prompt, or do you mean that out of its whole probability distribution over outputs, you only compare the outputs for, say, the A token?
Yeah. So in the second case I gave, you would actually do both: you would prompt the model with "A, B, C, or D", plus you would constrain it to look only at those four tokens. In the first case, you don't even need to generate anything. You literally just look, given that a language model gives a distribution over sequences, at the likelihood of generating the first choice, the likelihood of generating the second choice, and so on, and you check whether the most likely sequence is actually the real answer. So you don't actually sample from the model; you really just use p(x1, ..., xL). Does that make sense? That being said, evaluation of open-ended answers is something we'll talk about later, and it's actually really important and really challenging.
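Here is a sketch of that first scoring method, with no sampling at all. `sequence_logprob` is a hypothetical helper returning log p(x1, ..., xL) for a full string under the model; it is not a real API from the lecture.

```python
def multiple_choice_accuracy(examples, sequence_logprob):
    """examples: (question, choices, answer_idx) triples, e.g. from MMLU."""
    correct = 0
    for question, choices, answer_idx in examples:
        # Score the full "question + candidate answer" text for each choice;
        # the model is used purely as a distribution over sequences.
        scores = [sequence_logprob(question + " " + c) for c in choices]
        predicted = max(range(len(choices)), key=lambda i: scores[i])
        correct += predicted == answer_idx
    return correct / len(examples)
```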
Yes.
Earlier you mentioned that metrics like perplexity are not usually used because they depend on how you do tokenization and on design choices. I was wondering if you could speak more to that.
Oh, yeah. So think about perplexity. I told you perplexity is between one and the vocabulary size. Now imagine that ChatGPT uses a tokenizer that has 10,000 tokens, but Gemini from Google uses a tokenizer that has 100,000 potential tokens. Then the upper bound on the perplexity Gemini can get is actually worse than for ChatGPT. Does that make sense? That's just to give you the idea; it's actually a bit more complicated than that, but it's a first-order illustration of why the tokenizer matters.
Great. Okay, so: evaluation challenges. There are many; I'll just talk about two really briefly. One, as I told you, there are two ways of doing evaluation for something like MMLU (actually there are many more than two, but I gave you two examples). And it happens that for a long time, even though this was a very classical benchmark that everyone used, different companies and different organizations were actually using different ways of evaluating MMLU, and as a result you could get completely different results. For example, Llama 65B, which was the first model of Meta's Llama series, had 63.7 accuracy on HELM, but on this other benchmark it had 48.8. So the way you evaluate really matters, and this is not even talking about prompting; this is purely about how you score the models. Prompting is yet another issue. So really, there are a lot of inconsistencies; it's not as easy as it looks. That was the first thing. Yeah, sorry?
How can we make sure that all these models aren't trained on the benchmark?
Okay, second thing, and this is a great question: train-test contamination. This is something I would say is really important in academia. Given that the talk is mostly about training large language models: for companies it's maybe not that important, because they know what they trained on; for us, we have no idea, so for us it's a real problem. There are many different ways of trying to test whether the test set was actually in the training set. One kind of cute trick that people in Tatsu's lab found is this: given that most datasets online are not randomized, and that what language models do is just predict the next word, you can take the entire test set and compare how likely the model is to generate all the examples in their published order versus in a different order. If it's more likely to generate them in order, given that there's no real order there, then the test set was probably in the training set. Does that make sense? So there are many ways of doing it; that's just one. Again, train-test contamination is not that important for development, but really important for academic benchmarking.
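A sketch of that ordering trick follows; `sequence_logprob` is again a hypothetical model-scoring function, and what gap counts as contamination is a judgment call not settled here.

```python
import random

def contamination_gap(test_examples, sequence_logprob, num_shuffles=10):
    """Compare the likelihood of the test set in published order vs shuffled."""
    canonical = sequence_logprob(" ".join(test_examples))
    shuffled = []
    for _ in range(num_shuffles):
        perm = list(test_examples)
        random.shuffle(perm)
        shuffled.append(sequence_logprob(" ".join(perm)))
    # A large positive gap suggests the published order was memorized,
    # i.e. the benchmark probably leaked into the training set.
    return canonical - sum(shuffled) / num_shuffles
```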
Great. There are many other challenges, but I'll move on for now.
Great: data. So, data is another really big topic. At a high level, people just say you basically train large language models on all of the internet. What does that even mean? People sometimes say "all of clean internet", which is even less well defined. The internet is very dirty and really not representative of what we want in practice. If I downloaded a random website right now, you would be shocked at what's in there. It's definitely not your Wikipedia. So I'll go really briefly over what people do. I can answer some questions, but data on its own is a huge topic.
Basically, the first thing you do is download all of the internet. What that means is that you use web crawlers that go over every web page on the internet, or at least every web page that gets indexed, which is around 250 billion pages right now, around one petabyte of data. Common Crawl is one such web crawler. People don't usually write their own web crawlers; they use standard ones, and Common Crawl is one of them. Every month it basically adds all the new websites that were added on the internet and that the crawler finds, and puts them into one big dataset. So in Common Crawl you have around 250 billion pages right now, about 1e6 gigabytes of data.
Once you have this... so here is a random web page, literally random, from Common Crawl, and what you see is that, one, it really doesn't look like the type of thing you would usually read. This is an HTML page. It's hard to see, but if you look through it you will find some content, for example here: "testing world is your ultimate source for the system x high-performance server", and then you have three dots, so the sentence isn't even finished. That's what random internet looks like.
Of course, it's not that useful to train a large language model to generate things like this. So what are some of the steps that are needed? First, you extract the text from the HTML. That's what I just tried to do by picking out the actual content. There are a lot of challenges in doing this. For example, extracting math is actually very complicated but pretty important for training large language models. Or, for example, boilerplate: a lot of forums will have the same type of headers and the same type of footers, and you don't want to repeat all of that in your data.
Then you filter undesirable content: not safe for work, harmful content, PII. Usually every company has a blacklist of websites they don't want to train their models on; that blacklist is very long, and you basically say: if it comes from there, we don't train on it. Another way of doing these things is to train a small model for classifying what is PII and removing it. It's hard; every point I'm going to show you here is a huge amount of work, but I'm going to go through it quickly. So: filter undesirable content.
The next step is deduplication. As I said, you might have things like headers and footers in forums that are always the same; you want to remove those. Another thing you might have is a lot of URLs that are different but actually show the same website. You might also have a lot of paragraphs from common books that are duplicated a thousand or ten thousand times across the internet, so you have to deduplicate. That's also very challenging, because you have to do it at scale. Once you've deduplicated, you do some heuristic filtering: you try to remove low-quality documents. You do that with things like rules-based filtering. For example, if you see outlier tokens, if the distribution of tokens on a website is very different from the usual distribution, it's probably an outlier. If the words on a website are super long, something strange is going on there. If the website has only three words, is it worth training on? Maybe not. If it has 10 million words, maybe something is also wrong with that page. A lot of rules like this.
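Here is a sketch of what such rule-based filters can look like; every threshold below is an invented illustration, not a production value.

```python
def passes_heuristic_filters(doc: str) -> bool:
    """Cheap rules-based quality checks of the kind described above."""
    words = doc.split()
    if len(words) < 5:                       # barely any content
        return False
    if len(words) > 10_000_000:              # absurdly long page
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if avg_word_len > 15:                    # suspiciously long "words"
        return False
    if len(set(words)) / len(words) < 0.1:   # mostly repeated tokens
        return False
    return True
```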
Yes?
Why do we filter undesirable content out of the dataset instead of putting it in with, like, a supervised loss? Can we not just say: here's this hate-speech website, let's actively penalize the model for generating it?
We'll do exactly that, but not at this step. That's where post-training comes in. In pre-training, the idea is just to model how humans speak, essentially, and to remove all these headers, footers, menus, and things like that. But it's a very good idea you just had, and that's exactly what we'll do later.
Next step: model-based filtering. Once you've filtered a lot of data, what you do (and this is actually a very cute trick) is take all of Wikipedia and look at all the links that are referenced by Wikipedia pages, because if something is referenced by Wikipedia, it's probably a high-quality website. Then you train a classifier to predict whether a document comes from one of these Wikipedia references or from the random web, and you basically say: I want more of the things that look like they come from Wikipedia references. Does that make sense? So you train a machine learning model, usually a very simple one, because you need to run it at a huge scale; just think about the 250 billion pages.
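A sketch of that classifier, assuming scikit-learn is available: hashed bag-of-words features and a linear model, because whatever you train has to run over billions of pages. The function and variable names are mine, not from the lecture.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(wiki_ref_docs, random_web_docs):
    """Classifier separating Wikipedia-referenced pages from random crawl pages."""
    vectorizer = HashingVectorizer(n_features=2**18)  # cheap, stateless features
    X = vectorizer.transform(wiki_ref_docs + random_web_docs)
    y = [1] * len(wiki_ref_docs) + [0] * len(random_web_docs)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # Returns a scoring function: higher means "looks like a quality reference".
    return lambda doc: clf.predict_proba(vectorizer.transform([doc]))[0, 1]
```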
Next, you try to classify your data into different domains: this is entertainment, this is books, this is code, these types of domains. Then you try to up-weight or down-weight some of the domains. For example, you might see that if you train more on code, your model actually becomes better at reasoning. That's something people usually say in a very handwavy way: training your model more on code helps reasoning, so you up-weight the code distribution because it helps general language modeling skills. Books is usually another one people up-weight; entertainment they usually down-weight. People used to do this somewhat heuristically; now there are entire pipelines, which we'll talk about, for doing it slightly more automatically.
Then, at the end of training, after training on all the data we just saw, you usually train on very high-quality data while you decrease your learning rate. That basically means you're kind of overfitting your model on very high-quality data. Usually what you use there is something like Wikipedia: you basically overfit on Wikipedia, and on human-written data that was specially collected. There are other things, like continual pre-training for getting longer contexts; I'm going to skip over all of that. But this is just to give you a sense of how hard it is when people casually say "oh, I'm going to train on the internet". It's a lot of work, and really, we haven't figured it out yet. So collecting data well is a huge part of practical large language modeling. Some might say it's actually the key.
Yes, about data. A basic question: when you start with, say, the petabyte of data, after you go through all those steps, what's the typical amount of data you have remaining? And how large a team does it typically take to go through all the data steps you talked about?
Sorry, is the question how large the data is after you filter?
Yeah, after you filter, and then, for all the filtration steps you mentioned: how large a team do you need? How many people would you need to be able to do this?
Okay, that's a great question. I'm going to partly answer the data-size question at the end of this slide. For the number of people that work on it: that's a good question, and I'm actually not quite sure, but I would say it's probably even bigger than the number of people that work on the tuning of the pre-training of the model. So the data side is bigger than the modeling side. I don't think I have a good sense, but I would guess that in Llama's team, which has around 70 people, maybe 15 work on data. For all these things you don't need that many people, but you do need a lot of compute, because for data you need a lot of CPUs. And I'll answer the other question at the end of this slide.
So, as I just alluded to, we really haven't solved data at all for pre-training; there's a lot of research to be done. First, how do you process these things super efficiently? Second, how do you balance all of these different domains? Can you do synthetic data generation? That's actually a big one right now, because, as we'll talk about later, we don't have enough data on the internet. Can you use multimodal data instead of just text data, and how does that improve even your text performance? There's also a lot of secrecy, because this is really the key to most pre-trained large language models. For competitive reasons, these companies usually don't talk about how they do data collection, and there's also a copyright liability issue: they definitely don't want to tell you that they've trained on books, even though they did, because otherwise you can sue them.
Common academic benchmarks (this will partly answer your question): they started, for the smaller ones (the names are not that important), at around 150 billion tokens, which is around 800 gigabytes of data. Now they're around 15 trillion tokens, which is also the amount of data the current best models are probably trained on. So 15 trillion tokens, roughly two orders of magnitude bigger, so around 80e3 GB. That would be around a 100x to 1,000x filtering-down of Common Crawl, if I'm not mistaken.
One very famous one is The Pile. This is an academic benchmark, and we can just look at the distribution of data it has: things like arXiv and PubMed Central, which is all the biology stuff, Wikipedia, Stack Exchange, some GitHub, some books, and things like this. Again, this is on the smaller side: what we're looking at here is 280B tokens, so in reality it's around 100 times bigger, and you can't get 100 times more GitHub or Wikipedia.
In terms of closed-source models, just to give you an idea: Llama 2 was trained on two trillion tokens, and Llama 3 on 15 trillion tokens, which is currently the best model whose training-set size we know, the same as the biggest academic benchmark at 15 trillion tokens. For GPT-4 we don't really know, but it's probably in the same order of magnitude, probably around 13 trillion according to leaks, if the leaks are true.
Great, so: scaling laws. Any other questions on data before we go to scaling laws? Sorry, I know I'm giving you a lot of information, but there's a lot that goes into training a large language model.
Great: scaling laws. The idea is that what people saw around 2020 (or at least they had seen it for a long time, but they've been able to show it empirically since 2020) is that the more data you train your models on, and the larger the models, the better the performance. This is actually pretty different from what you've seen in this class. In this class, we teach you about overfitting. Overfitting doesn't happen with large language models: larger models, better performance. It's something that really took a long time for a community trained on this type of class to realize. But for the exam, overfitting exists. [laughter]
Okay. The idea of scaling laws is this: given that you know that more data and larger models will always give you better performance, can you predict how much better your performance will be if you increase the amount of data and the size of your model? And surprisingly, it works. Here you see three plots from a very famous OpenAI paper called "Scaling Laws". On the x-axis you see compute, so how much compute you spent on training, and on the y-axis you see test loss, which is essentially (it's not perplexity, but it's your validation loss) the log of the perplexity. If you put both on a log scale, you see that the scaling law is linear. That means that if you increase your compute by a certain amount, you can say by how much your test loss will decrease. Same thing with data, and same thing with parameters: if you increase the dataset size, your loss will decrease by an amount that is somewhat predictable, and if you increase the number of parameters, the loss will decrease by an amount that is somewhat predictable. This is really amazing, and very surprising. These plots look innocuous, but it's crazy, because it means you can predict how well we're going to perform in two or three years, depending on how much compute we add, assuming these trends hold. There's nothing theoretical about it.
Yes? Two things. One: what is the loss they're using here? Is it perplexity, or...?
So, you know I said perplexity is two to the power of the loss; the loss here is the log of the perplexity.
And the second thing: when you increase the number of parameters, or you increase the total dataset size, doesn't that just inherently increase your compute? Is all of this really just measuring one thing?
No, this is a great question. The compute here is actually a function of two things: the data and the parameters. We're going to talk about that in detail, but basically, if you increase the number of parameters, you should also increase the amount of data you train on. You don't go multiple times through the same dataset: no one does epochs with large language models, at least not yet, because we still have enough data. So yes, these are all the same trend: increase compute, decrease loss.
Yes.
Have we seen the numbers for the last two years still holding?
It is still holding. I don't have good numbers to show you, but it is still holding, surprisingly.
Yes.
Is there no empirical evidence that you ever plateau? In theory, we would expect it, right?
No empirical evidence of plateauing anytime soon. Why? We don't know. Will it happen? Probably. I mean, it doesn't have to, because it's actually in log scale, so it's not as if it mathematically had to plateau; it could continue decreasing like this. Most people think it will probably plateau at some point; we don't know when.
Okay, so, I'll talk more about scaling laws now. Why are scaling laws really cool? Imagine that you're very fortunate and I give you 10,000 GPUs for a month. What model will you train? How do you even go about answering that question? This is a hypothetical, but it's exactly what these companies are faced with. The old pipeline was basically to tune hyperparameters on the big models. Say I have 30 days: I train 30 models for one day each, I pick the best one, and that's the final model I use in production. That means the model I actually shipped was only trained for one day.
The new pipeline is: first, you find a scaling recipe, something that tells you, for example (one common one), that if you increase the size of your model you should decrease your learning rate. So you find a recipe such that you know: if I increase the size of my model, here's what I should do with my hyperparameters. Then you tune hyperparameters on smaller models of different sizes. Say I spend three of my 30 days training many different small models, each at a different size, and do hyperparameter tuning on them. Then I fit a scaling law and try to extrapolate from these smaller models which configuration will be best if I train a much larger model. And then I train the final huge model for the remaining 27 days, instead of just one day. So the new pipeline is: don't do hyperparameter tuning at the real scale of the model you'll use in practice; do it on smaller models at different scales and try to predict how well they will perform once you make them bigger.
I'll give you a very concrete example right now: transformers versus LSTMs. Say you have these 10,000 GPUs and you're not sure which to use: should I use a transformer-based model or an LSTM-based model? What I do is train transformers at different scales. Here you see different parameter counts on the x-axis; the y-axis is my test loss. Then I train different LSTMs at different scales. Once I have these points, I see that they roughly fit a scaling law, I fit it, and then I can predict: if I had 10 times more compute, here's how well the LSTM would perform. It's actually slightly less linear for the LSTM, but you could still predict where you would end up, and clearly from this plot you would see that transformers are better. One thing to notice when you read these types of scaling laws is that two things matter: one is the scaling rate, which is the slope of the scaling law; the other is the intercept. You could start out worse but actually become better with scale; it just happens that LSTMs are worse on both. But I could show you other cases where you can predict that beyond a certain scale you're better off using one type of model than another. So that's why scaling laws are actually really useful.
Any questions on that?
Yeah. So, how sensitive are these to small differences in architecture, like one transformer architecture versus another transformer architecture? Do you basically have to fit your own curve and say: scaling laws tell me this should be some logarithmic function, let me extrapolate that for my own specific architecture?
Yeah. So, for example, if you're an academic (this is pretty recent) and you want to propose a new activation, that's exactly what you do: you fit one scaling law for your method and another for the standard one, say GLU, and you show that yours is better. In reality, once you start thinking in scaling-law terms, you realize that all the small architecture differences we can make mostly just shift the intercept a little, and that really doesn't matter: just train for 10 hours longer, or wait for the next generation of GPUs, and those differences are really secondary. That's exactly why I was telling you at the start that people spend too much time on architectures and losses; in reality those don't matter as much. Data, though: if you use good data, you get a much better scaling law than if you use bad data. So that really matters.
Uh another really cool thing you can do with scaling loss is that you can ask yourself uh how to optimally allocate training resources. Should I train
training resources. Should I train larger models? Because we saw that it's
larger models? Because we saw that it's better when you train larger models but we saw that it's also better when you use more data. So which one should I do?
Should I just train on more data, a smaller model, or should I train a larger model on less data? Um, so
Chinchilla is a very famous paper that first showed this. Uh, the way they did it, I want to give you a little bit of a sense of what these plots are. Uh, here
you see training loss. Again, on the x-axis, you see parameter parameter differences, sorry, parameter size, uh, number of parameters, so the size of the model. And here all these curves are
model. And here all these curves are what we call isoflops which is that all the models on this curve h have been trained with the same amount of compute.
Um the way that you do that is that you train you change sorry you vary the number of tokens that were trained on and the size of the models but you vary in such a way that the total compute is constant. Okay. So all these curves that
constant. Okay. So all these curves that you see with different colors have different amount of computes that were trained on. Then you take the best one
trained on. Then you take the best one for each of those curves. Once you have the best one for each of those curves, um you can ask you can plot um how much
flops it was and which curve were you on and how much parameters did you actually use for training that specific point.
You put that on the on the log log uh scale again and now you fit a scaling law again. So now I have something which
law again. So now I have something which tells me if I want to train a model of 10 to the^ 23 flops here's exactly the number of parameters that I should be
using 100 100b and you can do the same thing with flops and tokens.
So now you can predict if if I tell you exactly I have one month of compute what size of model should I be training fake your scaling law and I tell you um of course that all looks beautiful in
reality like there's like there's a lot of like small things of like should you be counting like embedding parameters like there's there's a lot of complexities but if you do things well these things actually do hold
The optimal ratio the Chinchilla paper found is to use 20 tokens for every parameter you train. So for every additional parameter, you should train your model on 20 more tokens.
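To make that rule concrete, here's a minimal sketch (not from the lecture) that combines the C ≈ 6·P·N FLOP approximation used later in this lecture with the 20-tokens-per-parameter rule; the function name and the simple closed-form split are my own, and the exact fitted law in the paper differs a bit from this simple rule.

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget into model size and token count.

    Combines the C ~= 6 * params * tokens FLOP approximation with the
    Chinchilla rule tokens = 20 * params. Solving
    C = 6 * params * (20 * params) gives params = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23-FLOP budget, the scale discussed on the isoFLOP plot.
n, d = chinchilla_allocation(1e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e9:.0f}B")
# params ~ 29B, tokens ~ 577B with the simple 20:1 rule
```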
One caveat: this is optimal for training resources. It answers the question: if I have 10^23 FLOPs, or say $5 million of compute, what model gets me the lowest loss? What would I actually train in reality?
In reality, these companies also need to think about inference. If you have a smaller model, you spend less over time. So if you factor in inference cost, other papers have tried to show that you want something more like 150 tokens per parameter, because a smaller model means you spend less money on inference over its lifetime. And 150 to one is roughly what the best models are trained at right now, at least the ones actually used in production.
Great.
Any question on Chinchilla?
Great. Oh, sorry.
In practice, how expensive is inference for these models relative to training?
Actually, very expensive. I won't talk about inference because that would be an entire lecture of its own, but just think about ChatGPT, which has, I don't know how many now, something like 600 million people using it. That's a lot. So it's actually very expensive. There's a lot of optimization you can do for inference, though; that's an entire other lecture, so I'm going to skip it this time, but it's very interesting. Okay, continuing.
As I said, there are many things you can answer with scaling laws; I just gave you two examples. What data do you use? What weighting do you use in your data mixture? That's what we talked about before. What architecture do you use? Should you make your models wider or deeper? Should you be paying for more GPUs, or for collecting more data? All of these are things you can try to answer with scaling laws.
One thing I want to mention is the bitter lesson, if you've ever heard of it: Richard Sutton's very famous blog post from 2019. What he realized, which I think not enough people realize — I definitely did not realize it at the time — is that once you see these kinds of scaling laws, you know that the more compute you have, the better your models will get. And you also know, by Moore's law or its variants, that you will always have more compute. So the only thing that matters is to have architectures that can leverage computation. What matters, then, is basically systems and data, and much less the small architecture differences, like which activation you use. I think that's one of the reasons most research focuses on things that matter less for industry, and I was one of those researchers for a large part of my career. So don't spend time overcomplicating. Do the simple things, do them well, and scale them. That's really what OpenAI taught us with ChatGPT and all the GPTs before it.
Okay, I want to give you some back-of-the-envelope computations. I might be off by a few factors here, but I just want to give you a sense of how costly it is to train some of these models. I'll use Llama 3 405B as the example, which is currently the best open-source model you can get.
It was trained on 15.6 trillion tokens and has 405 billion parameters. Now that you know about the optimal tokens-per-parameter ratio: here it's around 40. So that's a bit more than Chinchilla, but less than the inference-optimal ratio.
So they went for training optimality. Now, FLOPs for this model: one simple way to estimate training FLOPs is 6 times the number of parameters times the number of tokens you train on. If you do that simple calculation here, you get 3.8e25 FLOPs. The reason this number is interesting is that, if you follow the news a bit, there's an executive order from Biden that says that once you train with 1e26 FLOPs or more, your model gets special scrutiny. So they went a bit more than 2x below that, right under the threshold, to avoid the special scrutiny. So, 3.8e25; I might be off by a little, but it's definitely under 1e26. Oh, and to be clear about the notation: P is the number of parameters, N is the number of training tokens, and 6·P·N is just an approximation.
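As a quick sanity check (my own arithmetic, not the lecturer's), the estimate is one line of Python:

```python
# Back-of-the-envelope FLOPs for Llama 3 405B, using C ~= 6 * P * N
# (P = parameters, N = training tokens; the 6 covers forward + backward).
P = 405e9    # parameters
N = 15.6e12  # training tokens
C = 6 * P * N
print(f"{C:.2e} FLOPs")  # 3.79e+25 -- a bit more than 2x below the 1e26 threshold
```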
Okay, compute. We know they trained it on 16,000 H100s, and we know the throughput they reported. If you do the computation, it comes out to around 70 days, or 26 million GPU-hours; at least that's my back-of-the-envelope number. They actually reported using about 30 million GPU-hours rather than 26 million, so maybe they hit some challenges, I don't really know, but the simple computation gives around 70 days. Cost is hard to approximate, so I'll just treat it as rent: if I were to rent that many H100s for that many days, how much would I pay? A lower bound on H100 rental is around $2 per hour, so multiplying by 26 million hours gives about $52 million. They probably paid less than that, but not much less, because the services that rent out GPUs don't make that much margin. So it's probably slightly less, but not by much.
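Here's a sketch of that time-and-cost arithmetic. The per-GPU achieved throughput of ~400 TFLOP/s is my assumption (roughly 40% utilization of H100 bf16 peak), chosen to be consistent with the 26M GPU-hour figure above; it is not a reported number.

```python
# Rough training-time and rental-cost estimate for Llama 3 405B.
total_flops = 3.8e25
n_gpus = 16_000
flops_per_gpu_per_s = 400e12  # assumed *achieved* throughput per H100

gpu_seconds = total_flops / flops_per_gpu_per_s
gpu_hours = gpu_seconds / 3600
days = gpu_hours / n_gpus / 24
cost = gpu_hours * 2.0  # ~$2/hour lower bound on H100 rental

print(f"{gpu_hours / 1e6:.0f}M GPU-hours, ~{days:.0f} days, ~${cost / 1e6:.0f}M")
# 26M GPU-hours, ~69 days, ~$53M
```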
Now, salaries: I said 50 employees at $500k per year, which is probably the right ballpark, so $25 million. Putting it all together, it's around $75 million to train this Llama model; I'm probably off by something like $10 million, but that's roughly right. Carbon emitted: a lot of people ask about this too, since cost is not the only thing that matters. I did the computation, and it's around 4,000 tons of CO2 equivalent, which is actually only about 2,000 return tickets from JFK to London. So right now the carbon emitted is, I mean, it's huge, but it's not meaningful yet. Maybe by GPT-6 or GPT-7, once you multiply this by 100, it might become a real issue. Right now I don't think it's an issue in the grand scheme of things.
For the next model: the way to think about these models is that every new generation, the number of FLOPs multiplies by roughly 10x. At least that's what they aim for, if they have enough energy and can buy enough GPUs. Great. Any questions on this back-of-the-envelope math?
No. Okay.
So we've talked about pre-training. I also wanted to talk about systems, because now we know compute is really important, so there's a question of how you optimize your compute. I'll leave that for the end, because I'm not sure how much time we'll have; it's important, but it's slightly different from what we've been discussing, and hopefully I can come back to it. For now, I'll move on to post-training.
The reason we need post-training, as I told you before, is to make AI assistants. Language modeling is not really what you want from an AI assistant. For example, if you ask GPT-3, which is a pure language model, not an aligned one, a question like "explain the moon landing to a six-year-old," the completion you get is something like "explain the theory of gravity to a six-year-old." What it learned is that on the internet, one question is usually followed by a bullet point of other similar questions, not by an answer.
That's not what you want from an AI assistant. So how do we do this alignment, this post-training that turns these models into assistants? The goal of alignment is basically to get LLMs to follow the instructions given by users, and maybe some of the designers' desires. Think about moderation: OpenAI definitely doesn't want their model to say things that are very toxic.
Here on the left-hand side, you see that when you ask a question, the aligned model actually provides a real answer, unlike the raw LLM before. And on the right-hand side, you see that if you ask it to write a tweet describing how a certain part of the population is evil, it says it cannot do that. That's what alignment gives you. The background here is that for training these models, we know what data we want: just ask humans, "here is a question, here is the answer we want." But that data is very expensive to collect and hard to find online. Pre-training data, in contrast, is not what you want, but there's a lot of it. So the main idea is simply to take a large language model pre-trained on all of the internet, and then fine-tune it: change the weights a little bit, on the kind of data you actually want. Hopefully, given that it was pre-trained on all of the internet, it already knows how to speak English and knows standard language syntax, so you can fine-tune it with very little data.
Okay, SFT. Supervised fine-tuning is exactly what I just said: fine-tune the large language model on the desired answers collected from humans. Why is it called supervised fine-tuning? Because you do language modeling, this next-word prediction, on the real answers; that's the fine-tuning part. And you do it on desired answers given by humans, which is why we call it supervised.
So how do we collect this data? As I just said: you ask humans to write, "here's a question, here's the answer you'd want from these models." Here's an example; sorry, I can't read very well on my computer. "My kid needs to do a science..." no, let's read this one: "Can you write a short introduction about the relevance of the term monopsony?" And the answer says "Monopsony refers to a market structure," blah blah blah. A human wrote that. This is actually from Open Assistant, which was an effort to collect this kind of data online from humans.
This type of supervised fine-tuning for alignment is really the key to ChatGPT. It's what made the big jump from GPT-3, which was mostly known to AI researchers, to ChatGPT, which became known by basically everyone.
The problem with human data is that it's very slow and very expensive to collect. One simple idea is to use LLMs to scale data collection, which is exactly what we did with Alpaca a year ago. We took a dataset of 175 human-written question-answer pairs, and we asked the best model at the time, text-davinci-003, to generate many more in the same style: "this is what humans would write; now write similar questions and answers." We collected 52,000 LLM-generated question-answer pairs, took LLaMA 7B, which was the best pre-trained model at the time, fine-tuned it with supervised fine-tuning as I just described, and that's how we got the Alpaca 7B model.
This is the type of data we collected. Things like: "What does algorithm mean?" "An algorithm is a step-by-step set of instructions used to solve a problem or achieve a goal," blah blah blah. The data is actually pretty good, given it was generated by LLMs from essentially two generations ago. For us, that started as an academic replication of ChatGPT; now there's a whole field of synthetic data generation, of using LLMs to make LLM development faster, basically by decreasing the number of human hours you need. Now, quantity of data: we've talked about what type of data and how we collect it.
One surprising thing about SFT is that you don't need that much data. What this paper, LIMA, showed is that if you scale the amount of supervised fine-tuning data from 2,000 to 32,000 examples, it really doesn't help much. So here, scaling laws definitely don't apply. The intuition is that all you learn in SFT is how to format your desired answers. Another way of saying it: your pre-trained model essentially models the distribution of every user on the internet; one might write bullet points, another might actually answer the question with an answer. All you tell the model in SFT is: you should be weighting this type of user more than that one. You're not actually teaching it anything new through SFT. All you do is tell the model to optimize for one type of user it already saw in the pre-training data. The knowledge is already in the pre-trained LLM; you just specialize it to one type of user.
Great. Any question on SFT?
Yes.
So I know there's a big issue with synthetic data where, if you keep generating data from the same distribution, eventually you're not learning a new distribution; you're essentially just bootstrapping from it. Surely you can't scale that forever, right? You can't keep generating from the same distribution and hope to learn something new. I know this is an active area of research, but do you have any thoughts on how people are thinking about this, on better ways to bootstrap? Or do we give up on the idea and, given the chart shows you don't need that many examples, just get humans to generate 2,000 really good ones?
Yeah, that's a very good question. So for SFT, I'm saying the data isn't that important, but there will be another stage, which we'll talk about right after, where data does matter. My intuition, based on not that many empirical results, is that if you use purely LLM-generated text for three or four generations of LLMs, I agree you probably won't improve much. But for me, what's important is how you use humans in the loop with LLMs: not purely LLMs, not purely humans. Maybe what you do is have the model generate some new text and have humans just write a few edits. Edits are much faster than writing the entire text. If you have that kind of collaboration, then from an information-theoretic point of view you still get additional information, but much faster than with humans alone. I think as a field we'll probably move toward these kinds of approaches: finding the examples that are important and asking humans exactly when you need their input, which is basically active learning.
Yes.
Do we train with the same loss function, the same general training algorithm, for supervised fine-tuning as we do for pre-training? Because for the examples you showed, I think what makes the good examples good is that they're factually accurate; it seems like there's more to it than just training.
Same loss. Maybe I didn't emphasize this enough: it's just language modeling, fine-tuned on the desired answers. So it's literally the same loss. It will be different in two seconds, but this first step, SFT, is literally the same loss; you just say, okay, I want to specialize on this type of data. There's even a question of what counts as pre-training versus post-training, because in reality it's just different data. The reason we usually call it post-training is that the way we collect the data is very different.
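To make "the same loss" concrete, here's a minimal PyTorch sketch of the SFT objective. The `model` interface (token ids in, logits out) is an assumption for illustration, and masking the prompt tokens out of the loss is a common choice, not something the lecture specifies.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, answer_ids):
    """Supervised fine-tuning loss: plain next-token cross-entropy,
    exactly the pre-training loss, but computed only on the desired
    answer tokens (the prompt positions are masked with -100)."""
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # no loss on the prompt

    logits = model(input_ids)        # [batch, seq, vocab] (assumed)
    # Shift so position t predicts token t+1, as in language modeling.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```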
Great questions. Yes?

Maybe it's the same question, but why would these 2,000 examples have such an outsized influence in fine-tuning?
That's another reason we call it post-training: we use different hyperparameters. Remember, I told you that at the end of pre-training you essentially end up with a learning rate of zero, and here you increase your learning rate again, to something like 1e-5. So the weight you give to these examples is actually different.
Okay. The second part of post-training is what we call reinforcement learning from human feedback, or RLHF. Some of you might have heard of it. The idea is that SFT has a problem: you're doing behavioral cloning, meaning you just try to clone what humans would say, and that has many issues.
One of them is that you're bound by human abilities. Humans won't generate the things they actually think are best. If you ask me to write a book, I can definitely enjoy a book, and I can probably say that one book is better than another, but I'm definitely not going to be as good at writing the book I'd want to read. So you're bound by humans' ability to generate things, even though humans might be better at distinguishing between things. That's one issue.
Issue number two, which I find pretty interesting: hallucination. If you've heard the word, this is LLMs generating false information. People have hypothesized that hallucination can come from supervised fine-tuning, even when the supervised fine-tuning data is correct. The reason is that, as I told you, SFT uses very little data, and the model doesn't learn anything new from it. So what if the human gives an answer the model didn't know was true? From the model's perspective, the human is basically telling it: generate things that sound plausible, even if you have no idea whether they're true.
To give a very concrete example, go back to the monopsony question. Imagine the human wrote an answer citing a reference, some book, and that book really exists; it might be a perfectly correct reference. But what if the LLM never saw that reference during pre-training? Then it doesn't know it's a correct reference. So what you're really teaching the model is to make up plausible-sounding references, rather than to give the real references it saw during pre-training. So hallucination might be caused by this SFT step.
That's problem number two. Does that all make sense? Great. Problem number three: price. Generating the ideal answers is very expensive; that comes back to your question about humans writing entire answers. So that's where RLHF comes in. The idea is that instead of cloning the behavior of humans, we're going to maximize human preference. The pipeline: for every instruction, you ask a model to generate two answers. You usually use a pretty good model here, not a raw LLM but one that has already been fine-tuned with SFT, so that it gives decent answers. Then you ask labelers which of the two answers is better, i.e., to select the preferred one. And then, with different algorithms, which we'll talk about now, you fine-tune the model to generate more of the green thing and less of the red thing: more of the good stuff.
So there are two ways we'll talk about, the two mainly used in the community. The first is simply to use reinforcement learning; hopefully you all know what reinforcement learning is by now. When you use reinforcement learning, one important question is: what is the reward we're optimizing? There are really two options I can think of. Option one: compare the output generated by my model to the output generated by some baseline, ask the human which is better, and use that as the reward: +1 if I beat the baseline, -1 if not. That's a binary reward. The problem with a binary reward is that it's very sparse; you don't get much information from it. Maybe your answer was slightly better, maybe it was way better; you can't tell. Option two: train what we call a reward model, which is simply a classifier, trained to predict, from the human's perspective, how much better one output is than another.
human. Um so this is a little bit meta but what you basically do is that you train uh you take um a reward model R which is a uh just a large also a large
um a large classifier and you basically ask this reward model you give it the input and the actual output that you have one of the two outputs uh and you just um exponentiate that. So that's the
softmax loss that you all know about.
And now you divide by um the the exponentiated reward uh on the first example sorry on the first output and this is on the second output and you basically train. So the reason why you
basically train. So the reason why you do that is that you train your your model you train this reward model to be able to classify um how much better one
output is to another one. So another uh slightly less convoluted way of saying it is that your reward model will output some reward that will be used as the logits of your softmax. So now if you
have high logits in your softmax, it means that you highly likely this um output is better.
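Here's a minimal sketch of that training loss (my own illustration; the `reward_model(prompt, output) -> scalar` interface is an assumption). With two outputs, the softmax over rewards reduces to a sigmoid of the reward difference:

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, x, y_chosen, y_rejected):
    """Train a reward model on pairwise human preferences.
    The two rewards act as logits of a 2-way softmax, so the loss is
    -log sigmoid(r(x, y_chosen) - r(x, y_rejected))."""
    r_chosen = reward_model(x, y_chosen)      # [batch] of scalars
    r_rejected = reward_model(x, y_rejected)  # [batch] of scalars
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```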
Yes?

Does this reward model go over the entire output, or...?

It takes the entire output at once: all of the input and all of the output, and it gives one number.
Yes.
With this reward model, where does the human come in?

Oh, I see. Sorry, maybe I wasn't clear. You train this reward model to fit the green-and-red preferences from humans. So you train a classifier to predict whether humans preferred red or green, but instead of using the binary label the human gives you, you use the logits of the softmax. The thing about logits is that they're continuous, so if the reward model gives high logits, then in some sense the human strongly preferred this answer over the other.
Great. As I just said, continuous information is better, so this is what people use in practice, or at least used to use; I'll tell you about the other algorithm later. At the end, you use the reinforcement learning you know about: the reward is the reward model's score, what you sample is generations from your large language model, and you add a regularization term. The reason for the regularization term is to avoid what we call overoptimization: the reward model might not perfectly capture human preferences, so you don't want to maximize it all the way to infinity. And you do all of this with PPO, a common reinforcement learning algorithm.
One thing to note here, because it will be important later: we're no longer doing maximum likelihood; the large language model is now a policy for your reinforcement learning. It's not modeling a distribution anymore. The reason that matters is that models that went through this kind of PPO don't give you meaningful likelihoods of text, because what you optimized them to do is generate the thing the reward prefers, not to model all the answers humans might give. Another way of saying it: nothing here incentivizes the model to keep a distribution with some entropy rather than collapse to a single possible generation. If you didn't follow that, it's not that important, but it's good to know.
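As a sketch of the regularized objective just described (my own illustration; the coefficient name `beta` is a stand-in), the quantity the policy actually maximizes per sampled answer is the reward-model score minus a KL-style penalty against the reference model:

```python
def regularized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Per-example reward maximized in RLHF with PPO: the reward
    model's score minus a KL-style penalty keeping the policy close
    to the reference (SFT) model, to avoid over-optimizing an
    imperfect reward model. logp_policy / logp_ref are log-probs of
    the sampled answer under the current policy and the frozen
    reference model."""
    return reward - beta * (logp_policy - logp_ref)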
So, PPO is exactly what ChatGPT did originally. Here's what they have on their blog post: step one, do supervised fine-tuning, which you now all know about. Step two, train a reward model on human preferences. Step three, do PPO for multiple steps, which is where you see this blue arrow: you train the model once with PPO, collect new data, and continue. That's exactly what ChatGPT did, and that was the big breakthrough between GPT-3 and ChatGPT.
One thing to note is that PPO has many challenges. Reinforcement learning is super nice theoretically; in practice, anyone who has ever worked with reinforcement learning knows it's a mess. There's a lot going on: rollouts, outer loops, clipping, so many complications. And this is the idealized PPO for the LLM setting, already much more complicated than the expectation we saw before; in practice, it's messier still. We had to do one implementation of it, and I'm not going to go through it, but there is so much you have to think about when you implement this kind of PPO algorithm: clipping everywhere, a lot of complexity, and things are not well documented.
All this to say: there's a newer method, proposed at Stanford a year ago, called DPO, which is essentially a simplification of PPO. The idea is that instead of using reinforcement learning, you can just maximize the probability of generating the stuff you like and minimize the probability of the stuff you don't like. Think of the human preferences, the red and the green: maximize green, minimize red. The loss is this one: it involves the log-likelihood of the model generating the answer the human preferred, given the input, and the log-likelihood of the dispreferred answer; you push the first up and the second down. The remaining terms aren't too important, and it's really not that complicated to understand. At a high level, it's maximizing the things you like and minimizing the rest.
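Here's a minimal sketch of that loss (my own illustration; inputs are assumed to be the summed log-probabilities of full answers under the policy and the frozen reference model):

```python
import torch.nn.functional as F

def dpo_loss(logp_pol_chosen, logp_pol_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO: maximize the margin by which the policy prefers the chosen
    answer over the rejected one, each measured relative to the
    reference (SFT) model; beta plays the role of the RLHF KL
    coefficient."""
    chosen_ratio = logp_pol_chosen - logp_ref_chosen
    rejected_ratio = logp_pol_rejected - logp_ref_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```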
One thing to note: all the other terms are chosen such that the global minima of PPO and of DPO are, under some assumptions, essentially equivalent. So this is the right thing to do mathematically; I'm not going to go through the derivation, but it is. It's quite different from PPO in practice: with PPO you collect human preferences, then train a reward model by maximum likelihood, then use reinforcement learning. With DPO, all you do is maximum likelihood. Much simpler. Yes?
I mean, this seems (a) much simpler and (b) what you would intuitively do. Why did they start with the reward model? What led them to that?
I think it's a great question, and I don't really know. What I can tell you is that at OpenAI, the people who built ChatGPT initially are the ones who actually wrote PPO; there were a lot of reinforcement learning people there, and I think for them it was very intuitive. There are also some potential additional benefits. For example, with a reward model, reinforcement learning can use unlabeled data: with DPO you can only use the labeled preference data, while with PPO you first train a reward model, which can then label unlabeled data for you. So there could be potential improvements; in practice, it happens that there are none, and I think it's just that a lot of people on that team were reinforcement learning experts, including the main author of PPO, John Schulman. So DPO is much simpler and performs basically as well, and it's now the standard in the open-source community; I believe it's actually the standard in industry too. So that's DPO.
Now, the gains. These are older papers. On the left, a summarization task: all I want to show is that the pre-trained models were okay and improved with scale; supervised fine-tuning improves them a little more; and PPO, i.e., RLHF with human feedback, often gets performance that, depending on the benchmark, is even better than the human reference summaries. Same thing on one of our Alpaca papers: the evaluation here isn't too important, but you see the jump from the pre-trained model to SFT, and then to PPO and DPO, and PPO and DPO have essentially the same performance. So basically: RLHF helps, that's the conclusion, and DPO is simple.
Data. How do you collect this preference data? The first idea is just to use humans, as we already discussed. The guidelines for what humans should label are very complicated, and it's really not easy; if you ever do some of this labeling yourself, you'll see it's extremely complicated. If I zoom in on this one: the question is "tell me about self-driving cars," and you read both answers. "Self-driving cars are vehicles capable of detecting their surroundings," blah blah blah. "Self-driving cars are cars equipped with sensors... to navigate without the need for a driver." Both seem okay. Which one is better? It's actually hard to say at a glance.
As a result, the problem with humans is that they end up optimizing a lot of high-level features. For example, the second answer is longer; I can guarantee you most humans will choose the second one, even though maybe the first one is better. I don't know, I haven't read it carefully. So, challenges with humans: first, they're slow and expensive. Second, as I just mentioned, it's hard for them to focus on the things that matter, like correctness; people tend to look at things that matter less, like form and length. As a result, and this is what I show here, the more RLHF you do, the longer the models' outputs become. If you've ever been annoyed at ChatGPT answering you with super long responses, that's because of RLHF.
Then there's annotator distribution shift: the distribution of annotators you use matters a lot, and you have to think about which humans you even want these models to represent. Another issue is crowdsourcing ethics: a lot of the labeling is done by people who are not paid well and who have to go through a lot of toxic content, precisely because you want the model to avoid saying toxic things. So, crowdsourcing ethics too.
So, many challenges with human data. What we also did last year, again in the same spirit as Alpaca, was to say: there are challenges with humans, so maybe we can just replace them with LLMs. You simply replace human preferences with LLM preferences. (Oh, I'm just realizing the slides are not centered. Anyway.) On this figure, the x-axis is the price we paid for collecting human data: around $300 for 10,000 examples, and that's with Mechanical Turk workers, who are usually cheaper than some of the other companies you could go through. The y-axis is the agreement with the mode of other humans.
And what you see is that, as I told you before, labeling is really complicated: humans agree with each other only around 66% of the time on a binary task. It's not that these humans are bad; we, the five main authors of the paper, tried labeling the data ourselves, and we only got something like 67 or 68% agreement, even after talking for three hours about how we should label. It's genuinely complicated; it's not an easy task. Here I show many different models, and you see that models are much cheaper and can actually get higher agreement with the mode of the humans than the humans themselves. The reason is that humans have a lot of variance, while models have essentially none: they might be a bit more biased, but they have less variance. So it works surprisingly well, and it's now kind of the standard in the open-source community; I think even in industry a lot of people use both humans and LLMs to improve the collection of RLHF data. This is a paper from last year; honestly, by now LLMs would be at around this agreement and this cost, so roughly 50x cheaper than humans, with better agreement with humans than humans themselves.
Okay. That gets us to evaluation of post-training, which goes back to your question at the beginning of the lecture: how do you evaluate something like ChatGPT? The answers it could give are basically unbounded, and there's not one right answer; many answers are just as good. So there are many challenges. One: you can't use validation loss, because one method might use PPO and another DPO, and their validation losses aren't comparable. Two: you can't use perplexity; that's the thing I told you before. These models are not calibrated; they don't give you distributions, they just optimize for one output. So you can't use perplexity to evaluate these models once they're aligned.
Three: there's a large diversity of questions humans might ask these models: generation, open question answering, summarization, all of these things, so there's a lot to cover. And the tasks are really open-ended, so they're very hard to automate; that's what you were alluding to before. So the idea, instead of trying to come up with easily automated benchmarks, is to take the questions users actually ask these models in practice and ask annotators which of two models gives the better output. You do basically the exact same thing as with the RLHF preference data, but now you use it for evaluation. Yes?
I'm not sure I understand why you can't use perplexity. The model is still doing next-token prediction, right?

So think about it this way: the optimal solution after PPO is a model that puts essentially a delta on one answer, basically saying there's only one sentence that could ever be generated for that question. If you then evaluate it on something slightly semantically different, it would assign that answer a likelihood of essentially zero. In reality it's not that extreme; as you say, it's still a distribution. But it shows there's a fundamental issue with perplexity once these models are not language models anymore: at least with PPO, they were not trained to do maximum likelihood anymore, they were trained to be policies.
Okay. Probably the most common, or the most trusted, benchmark is what we call Chatbot Arena. The idea: random users on the internet blindly talk with two chatbots, ask whatever questions they want, see the two answers, and rate which one is better. You do that over hundreds of thousands of users, you get actual preferences, and you get rankings of models. You can go on Chatbot Arena right now and interact with these models. One potential issue worth highlighting: the people who want to do this kind of thing tend to be tech-savvy, so a lot of the questions are tech stuff: discussing software errors, inquiries about AI tools, things like that. Another issue is cost and speed: if you really want to use something like this during development, it's too costly, because you'd need to pay a lot of humans.
So one simple idea, as we've said many times now, is to use LLMs instead of humans; you probably know the drill at this point. The steps: for every instruction, generate outputs from some baseline and from the model you want to evaluate. Say I'm comparing an answer from ChatGPT and one from Mistral. I ask another model, say GPT-4, which one is better, and I average that over my entire benchmark. That gives me a win rate, the probability that one model beats the other, and now you can rank models. This is the AlpacaEval leaderboard.
The benefit is that we actually get 98% correlation with Chatbot Arena, so very high correlation with humans (this is the comparison with the correlations of other benchmarks), and it takes less than three minutes and less than $10 to run. So it's pretty cheap. There are downsides, though. One of them is spurious correlation. As we already saw, there are many; I'll just talk about one: LLMs prefer longer outputs. Humans actually also prefer longer outputs, but the issue with using an LLM is that once there's a bias, you will keep optimizing it. A human, at some point, if I ask a simple question and you give me five pages of answer, will say, no, I don't like that answer. An LLM with this bias, trained into it, will just keep preferring longer outputs.
Here you see the preferences, showing that both humans and models prefer longer outputs. And here is another view of the original AlpacaEval benchmark: when we look at the win rate of GPT-4 against GPT-4 itself, the standard GPT-4 gets 50%, essentially by definition, since it's playing against itself. But if we ask GPT-4 to be slightly more verbose, just adding "be verbose in your answers" to the prompt, it gets a win rate of 64.4%. And if you ask it to be concise, it gets 20%. So there's huge variance depending on whether you ask for concise or verbose answers, which is very annoying. One possible solution, which is what we did, is regression analysis: I won't go into the details, but you basically use causal inference tools to control for length. With that, length matters much less: if you ask the model to be verbose, you still get some gains, but much smaller.
Great. That's all for post-training. For the next eight minutes, I can talk about systems, or just answer questions. Yes?
Can you go back to post-training? How did tuning those parameters with such a small body of fine-tuning data have such a big effect on the model? You mentioned earlier that there's a different set of hyperparameters. Are we changing just some of the weights, like the later layers, or all the weights? What's actually happening?
Yeah, I kind of skimmed through all of that. You actually change all the weights. Industry changes all the weights; in open-source land you might have heard of LoRA, which changes only some of the weights, or, to be more specific, adds a low-rank difference to the output of every layer. But in industry, you just fine-tune all the weights. Also, to say something more about the data: for this last step, RLHF, you usually collect a lot more data than for SFT. If SFT is something like 5,000 to 50,000 examples, with RLHF you're more around the 1 million order of magnitude. Still much less than pre-training, though.
Yeah, because pre-training is 15 trillion tokens; this isn't even a drop, and yet it influences the weights a lot.

Because of how you do it. As I said, the learning rate you use is different, but also think of it this way: if I trained on even one sentence, but over and over again, at some point my model would only generate that sentence, even though it was one sentence against 15 trillion tokens. If you use a large enough learning rate for long enough, you will basically overfit that sentence. The key thing to remember is that you don't mix post-training data into the pre-training data; you do pre-training, and then you start fine-tuning only on the post-training data. Another perspective: pre-training is just the initialization of your model. Once you view it that way, that this is just an initialization of the weights, there's nothing special to remember about having trained on a lot of data before; all that matters is that you had an initialization and now you're actually training a model from it. There's a Markov property, in some sense: these are my weights, this is my initialization, now I train from there. Does that kind of answer your question?
Kind of. But you said something just now about it being almost equivalent to rerunning the fine-tuning data many times. Is that what actually happens, to give it so much more weight?

I actually don't know how they do it in industry right now. When we did Alpaca, we did three epochs, so we ran through the data three times. But even the number of passes isn't really the point; what matters is effectively the learning rate, the effective learning rate. So yeah. Great.
So I think I have five minutes, right? Okay, I'll try to give a high-level overview of at least one systems trick. As we said, compute is the huge bottleneck for everyone. One question you might ask is: why not just buy more GPUs? Well, GPUs are expensive, but they're also scarce: even if you have $10 million right now, you cannot buy the best GPUs.
There are also physical limitations: when you have multiple GPUs, they have to communicate with each other, and that takes time. So just buying more GPUs is not that easy.
So it's really important to think about how you allocate resources and how you optimize your pipeline. Systems 101 on GPUs — sorry, I'm going slightly fast; I hope at least some of you can follow. GPUs are optimized for throughput; CPUs are optimized for latency. The way to think about a GPU is that one command runs on many, many cores at the same time, over different pieces of data. When you look at a GPU, you see many cores, which we call streaming multiprocessors; that's very different from the usual CPU architecture. So for GPUs, just think: high throughput, parallelization.
GPUs are optimized for fast matrix multiplication: anything you can express as a matrix multiplication will run something like 10x faster on a GPU than anything else. That's a little annoying, because it means we're somewhat locked into doing everything with matrix multiplications. Another thing to note is that compute has been improving faster than memory and communication. So right now it's hard to feed GPUs data fast enough to keep up with their processors: most of your GPU will actually sit idle if you just run normal, unoptimized code. And this communication gap will keep growing over time.
Another thing to know about GPUs is the memory hierarchy; this is actually the same with CPUs. The closer memory is to the cores, the less of it there is, but the faster it runs; further away, more memory, but slower. Okay, I was going to skip this, but actually I'll say it, since I told you about communication. The metric people usually look at is model FLOP utilization (MFU): the observed throughput, the FLOPs per second you actually use, divided by the theoretical maximum the GPU could run at. In general, if you reach 50%, you're very happy. I looked at Llama: Facebook was at something like 45%. So even for these big companies, data doesn't come in fast enough.
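The definition in two lines (the numbers below are illustrative assumptions, not measured values; the peak figure is H100 bf16 without sparsity):

```python
# Model FLOP utilization (MFU): achieved useful model FLOPs per second
# divided by the hardware's theoretical peak.
model_flops_per_s = 4.0e14   # assumed achieved model FLOPs/s per GPU
peak_flops_per_s = 9.89e14   # H100 bf16 peak, no sparsity

mfu = model_flops_per_s / peak_flops_per_s
print(f"MFU = {mfu:.0%}")  # ~40%; around 45-50% is considered very good
```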
So, one simple trick, and it might be the only one I have time to tell you about, is low precision. The simple idea: if I store my floats in lower precision, there are fewer bits to ship to the GPU cores. Fewer bits means faster communication and lower memory consumption, so everything goes faster. And for deep learning, it happens that the low-order decimals just aren't that important: when you do matrix multiplications, when you run something like SGD, there's already so much noise that whether you update by 0.01 or 0.015, who cares? So instead of 32 bits per float, which is what people used to use, or 64, which you'd use in other domains, you use 16 bits for the matrix multiplications. For training, there's what we call automatic mixed precision, where some things are in 32 bits and others in 16 bits. Generally, the way to think about it is: your model's weights are stored in 32 bits; just before a computation, everything is cast to 16 bits; you do the computation super fast; and at the end you update the weights in 32 bits. The reason the updates are done in 32 bits is that if your learning rate is very small, you still want the update to actually change the weights. So all the computation is in 16 bits, but the weights are stored in 32 bits. That's the standard way people do it.
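In PyTorch, this pattern is a few lines with `torch.autocast`; here's a minimal sketch with a stand-in model and synthetic batches (requires a CUDA GPU):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a real model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
# Tiny synthetic batches, just so the loop runs.
loader = [(torch.randn(8, 4096, device="cuda"),
           torch.randn(8, 4096, device="cuda")) for _ in range(2)]

for x, y in loader:
    opt.zero_grad()
    # Weights stay stored in fp32; the matmuls inside this block run
    # in bf16 -- the automatic mixed precision described above.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()  # the weight update itself happens in fp32
```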
Okay, I'll talk about just one more thing and skip the rest: operator fusion, because I think it's pretty cool. As I just said, communication is very slow, and every line of PyTorch basically moves variables to and from the global memory of your GPU. So if you write something like x1 = x.cos(), and then x2 = x1.cos(), what happens behind the scenes is: you take x, which lives in global memory, ship it to the GPU's actual processors, apply the cosine, and ship the result back to global memory; then, for the next cosine, you ship it back to the processors, apply the cosine, and ship it back again. In other words, you go from DRAM, the global memory of your GPU, to compute and back for every single line. That's the naive way of doing it, and it seems very wasteful. The simple idea of operator fusion is: communicate once, do all the computation, ship the result back once. That's exactly what fused kernels are. So if you ever want to make your PyTorch computations much faster, just apply torch.compile to your model; that will typically make it around two times faster. What it does is rewrite your PyTorch code, basically into C++ and CUDA, so that the data is communicated once, all the operations run, and the result is shipped back.
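Here's the lecture's chained-cosine example, made concrete (requires a CUDA GPU; the function names are my own):

```python
import torch

def chained_cos(x):
    # Naively, each line is its own kernel: ship x from DRAM to the
    # cores, compute, ship the result back, then do it all again.
    x1 = x.cos()
    x2 = x1.cos()
    return x2

# torch.compile fuses the two cosines into one kernel:
# one round trip to compute instead of two.
fast_chained_cos = torch.compile(chained_cos)
x = torch.randn(10_000, device="cuda")
print(fast_chained_cos(x).shape)
```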
I'm not going to have time to talk about tiling; tiling is important. Parallelization is important.
And mixture of experts; mixture of experts is important. Outlook: there are many things we haven't talked about. We haven't talked about architectures; we definitely haven't talked about inference. There are many other things that matter for LLMs. What is the UI you use? Arguably, ChatGPT's big novelty was just a simple UI. Multimodality; all the possible misuses; the fact that there might not be enough data on the internet to train these models; the legality of data collection; and so many other things. If you're interested in these topics, I'd suggest three classes. CS224N is probably the one that touches least on LLMs, but it gives background and historical context, and some adjacent material. CS324, which I think is just called Large Language Models, has more in-depth readings and lectures on everything I talked about. And CS336, Large Language Models from Scratch, where you actually build your own LLM; it's an amazing class, also taught by my two supervisors, but a very heavy workload, so be careful. Great.