[Chinese Translation Collector's Edition] AI legend Andrej Karpathy takes a deep dive into ChatGPT and other large language models!
By 财富起点 | Wealth Starting Point
Summary
Topics Covered
- Internet Compresses to 44TB High-Quality Text
- Tokens Trade Sequence Length for Vocabulary Size
- Pretraining Predicts Next Token Autocomplete
- Base Models Simulate Internet Documents
- SFT Imitates Human Labeler Simulations
Full Transcript
Hi everyone. So I've wanted to make this video for a while. It is a comprehensive but general-audience introduction to large language models like ChatGPT. What I'm hoping to achieve in this video is to give you mental models for thinking through what it is that this tool is. It is obviously magical and amazing in some respects. It's really good at some things, not very good at other things, and there are also a lot of sharp edges to be aware of. So what is behind this text box? You can put anything in there and press enter, but what should we be putting there? And what are these words generated back? How does this work? And what are you talking to, exactly? I'm hoping to get at all those topics in this video. We're going to go through the entire pipeline of how this stuff is built, but I'm going to keep everything accessible to a general audience. So let's take a look first at how you build something like ChatGPT, and along the way I'm going to talk about some of the cognitive, psychological implications of these tools. Okay, so let's build ChatGPT.
So there are going to be multiple stages arranged sequentially. The first stage is called the pre-training stage, and the first step of the pre-training stage is to download and process the internet. Now, to get a sense of what this roughly looks like, I recommend looking at this URL here. This company called Hugging Face collected, created, and curated this dataset called FineWeb, and they go into a lot of detail in this blog post on how they constructed the FineWeb dataset. All of the major LLM providers, like OpenAI, Anthropic, and Google, will have some internal equivalent of something like the FineWeb dataset. So roughly, what are we trying to achieve here? We're trying to get a ton of text from the internet from publicly available sources. We're trying to have a huge quantity of very high-quality documents, and we also want a very large diversity of documents, because we want to have a lot of knowledge inside these models. So we want a large diversity of high-quality documents, and we want many, many of them. Achieving this is quite complicated and, as you can see here, takes multiple stages to do well. Let's take a look at what some of these stages look like in a bit. For now, I'd just like to note that, for example, the FineWeb dataset, which is fairly representative of what you would see in a production-grade application, actually ends up being only about 44 terabytes of disk space. You can get a USB stick of a terabyte very easily, and I think this could fit on a single hard drive almost today. So this is not a huge amount of data at the end of the day. Even though the internet is very, very large, we're working with text and we're also filtering it aggressively, so we end up with about 44 terabytes in this example. So let's take a look at what this data looks like and what some of these stages are. The starting point for a lot of these efforts, and something that contributes most of the data by the end of it, is data from Common Crawl. Common Crawl is an organization that has been basically scouring the internet since 2007. As of 2024, for example, Common Crawl has indexed 2.7 billion web pages. They have all these crawlers going around the internet, and what you end up doing, basically, is you start with a few seed web pages, then you follow all the links, and you just keep following links and indexing all the information, and you end up with a ton of data from the internet over time. So this is usually the starting point for a lot of these efforts. Now, this Common Crawl data is quite raw and is filtered in many, many different ways.
So here, in this same diagram, they document a little bit of the kind of processing that happens in these stages. The first thing here is something called URL filtering. What that refers to is that there are block lists of URLs, or domains, that you don't want to be getting data from. Usually this includes things like malware websites, spam websites, marketing websites, racist websites, adult sites, and things like that. So there are a ton of different types of websites that are just eliminated at this stage, because we don't want them in our dataset. The second part is text extraction. You have to remember that all these web pages, this is the raw HTML of these web pages that is being saved by these crawlers. So when I go to inspect here, this is what the raw HTML actually looks like. You'll notice that it's got all this markup, like lists and stuff like that, and there's CSS and all this kind of stuff. So this is almost computer code for these web pages. But what we really want is just the text, right? We just want the text of this web page, and we don't want the navigation and things like that. So there's a lot of filtering, processing, and heuristics that go into adequately filtering for just the good content of these web pages. The next stage here is language filtering. For example, FineWeb filters using a language classifier: they try to guess what language every single web page is in, and then they only keep web pages that are more than 65% English, as an example. And so you can get a sense that this is a design decision that different companies can make for themselves: what fraction of all the different languages are we going to include in our dataset? Because, for example, if we filter out all of the Spanish, then you might imagine that our model later will not be very good at Spanish, because it's just never seen much data in that language. And so different companies can focus on multilingual performance to different degrees. FineWeb is quite focused on English, so the language model they end up training later will be very good at English, but maybe not very good at other languages. After language filtering there are a few other filtering steps and deduplication and things like that, finishing with, for example, PII removal. PII is personally identifiable information: for example, addresses, social security numbers, and things like that. You would try to detect them and filter out those kinds of web pages from the dataset as well. So there are a lot of stages here, and I won't go into full detail, but it is a fairly extensive part of the pre-processing, and you end up with, for example, the FineWeb dataset.
When you click in on it, you can see some examples of what this actually ends up looking like, and anyone can download this on the Hugging Face web page. So here are some examples of the final text that ends up in the training set. This is some article about tornadoes in 2012: so there were some tornadoes in 2012 and what happened. The next one is something about "did you know you have two little yellow 9V-battery-sized adrenal glands in your body?" Okay, so this is some kind of odd medical article. So just think of these as basically web pages on the internet, filtered just for the text in various ways. And now we have a ton of text, 44 terabytes of it, and that is the starting point for the next step of this stage. Now, I wanted to give you an intuitive sense of where we are right now. So I took the first 200 web pages here (and remember, we have tons of them), and I just took all that text and put it all together, concatenated. And this is what we end up with: just raw internet text. And there's a ton of it even in these 200 web pages. So I can continue zooming out here, and we just have this massive tapestry of text data. And this text data has all these patterns. What we want to do now is start training neural networks on this data, so the neural networks can internalize and model how this text flows. So we just have this giant texture of text, and now we want to get neural nets that mimic it.
Okay, now before we plug text into neural networks, we have to decide how we're going to represent this text and how we're going to feed it in. The way our technology works for these neural nets is that they expect a one-dimensional sequence of symbols, drawn from a finite set of possible symbols. So we have to decide what the symbols are, and then we have to represent our data as a one-dimensional sequence of those symbols. Right now, what we have is a one-dimensional sequence of text: it starts here, goes here, and then comes here, etc. So this is a one-dimensional sequence, even though on my monitor it's of course laid out in a two-dimensional way; it goes from left to right and top to bottom. So it's a one-dimensional sequence of text. Now, this being computers, there's of course an underlying representation here. If I UTF-8 encode this text, I can get the raw bits that correspond to this text in the computer, and that looks like this. It turns out that, for example, this very first bar here is the first eight bits. So what is this thing? This is a representation of the kind we're looking for. In a certain sense, we have exactly two possible symbols, zero and one, and we have a very long sequence of them. Now, as it turns out, this sequence length is going to be a very finite and precious resource in our neural network, and we actually don't want extremely long sequences of just two symbols. Instead, we want to trade off the size of this vocabulary, as we call it, against the resulting sequence length. We don't want just two symbols and extremely long sequences; we're going to want more symbols and shorter sequences.
Okay, so one naive way of compressing, or decreasing the length of, our sequence here is to consider some group of consecutive bits, for example eight bits, and group them into a single so-called byte. Because these bits are either on or off, if we take a group of eight of them, there turn out to be only 256 possible combinations of how these bits could be on or off, and so we can re-represent this sequence as a sequence of bytes instead. This sequence of bytes will be eight times shorter, but now we have 256 possible symbols, so every number here goes from 0 to 255. Now, I really encourage you to think of these not as numbers but as unique IDs, or unique symbols. Maybe it's better to actually replace every one of these with a unique emoji; you'd get something like this. So we basically have a sequence of emojis, and there are 256 possible emojis. You can think of it that way.
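To make this concrete, here is a minimal Python sketch of that byte-level view; the example string is my own:

```python
text = "hello world"

# UTF-8 encode: each character becomes one or more bytes, i.e. integers 0-255
raw_bytes = text.encode("utf-8")
print(list(raw_bytes))  # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

# The underlying bit string is eight times longer than the byte sequence
bits = "".join(f"{b:08b}" for b in raw_bytes)
print(bits[:16])  # the first sixteen bits, i.e. the first two bytes
```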
Now, it turns out that in production, for state-of-the-art language models, you actually want to go even beyond this. You want to continue to shrink the length of the sequence, because again, it is a precious resource, in return for more symbols in your vocabulary. And the way this is done is by running what's called the byte pair encoding algorithm. The way this works is that we look for consecutive bytes, or symbols, that are very common. For example, it turns out that the sequence 116 followed by 32 is quite common and occurs very frequently. So what we do is group this pair into a new symbol: we mint a symbol with ID 256, and we rewrite every single pair (116, 32) with this new symbol. Then we can iterate this algorithm as many times as we wish, and each time we mint a new symbol, we're decreasing the sequence length and increasing the vocabulary size. In practice, it turns out that a pretty good setting for the vocabulary size is about 100,000 possible symbols; in particular, GPT-4 uses 100,277 symbols. And this process of converting from raw text into these symbols, or tokens as we call them, is called tokenization.
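Here is a toy sketch of a single byte pair encoding merge step, assuming we start from raw UTF-8 byte IDs; production tokenizers iterate this merge tens of thousands of times and handle many details this skips:

```python
from collections import Counter

def bpe_merge_once(ids, new_id):
    # Find the most frequent adjacent pair of symbols
    pair_counts = Counter(zip(ids, ids[1:]))
    top_pair, _ = pair_counts.most_common(1)[0]
    # Rewrite the sequence, replacing every occurrence of that pair with new_id
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top_pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out, top_pair

ids = list("the cat ate the hat".encode("utf-8"))
ids, merged = bpe_merge_once(ids, 256)  # mint the first new symbol, ID 256
print(merged, len(ids))                 # the pair that was merged, and the shorter length
```

Each merge shortens the sequence and grows the vocabulary by one symbol; iterating until the vocabulary reaches roughly 100,000 symbols gives you something like GPT-4's 100,277-token vocabulary.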
So let's now take a look at how GPT-4 performs tokenization, converting from text to tokens and from tokens back to text, and what this actually looks like. One website I like to use to explore these token representations is called Tiktokenizer. Come here to the dropdown and select cl100k_base, which is the GPT-4 base model tokenizer. Here on the left you can put in text, and it shows you the tokenization of that text. So, for example: hello, space, world. "hello world" turns out to be exactly two tokens: the token "hello", which is the token with ID 15339, and the token " world" (with the leading space), which is token 1917. Now, if I join these two, for example, I'm again going to get two tokens, but it's the token "h" followed by a token that is "elloworld" without the h. If I put two spaces here between hello and world, it's again a different tokenization; there's a new token, 220, here. So you can play with this and see what happens. Also keep in mind that this is case sensitive: if this is a capital H, it is something else, or if it's "Hello World", then this actually ends up being three tokens instead of just two. So you can play with this and get an intuitive sense of how these tokens work. We're actually going to loop back around to tokenization a bit later in the video; for now, I just wanted to show you the website, and to show you what this text is at the end of the day. For example, if I take one line here, this is what GPT-4 will see it as: this text will be a sequence of length 62. This is the sequence here, and this is how the chunks of text correspond to these symbols. And again, there are 100,277 possible symbols, and we now have one-dimensional sequences of those symbols.
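If you want to reproduce this away from the website, OpenAI's tiktoken library exposes the same cl100k_base vocabulary (this assumes you have tiktoken installed):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 base model tokenizer

ids = enc.encode("hello world")
print(ids)                                   # [15339, 1917]
print(enc.decode(ids))                       # "hello world"

# Tokenization is case and whitespace sensitive
print(enc.encode("Hello World"))
print(enc.encode("hello  world"))            # the double space tokenizes differently
```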
So yeah, we're going to come back to tokenization, but that's where we are for now. Okay, so what I've done now is taken this sequence of text that we have in the dataset and re-represented it, using our tokenizer, as a sequence of tokens. This is what that looks like now. For example, when we go back to the FineWeb dataset, they mention that not only is this 44 terabytes of disk space, it is about a 15-trillion-token sequence. And here are just the first few thousand tokens of this dataset, but there are 15 trillion to keep in mind. And again, keep in mind one more time that all of these represent little text chunks. They're the atoms of these sequences, and the numbers here don't make any sense; they're just unique IDs.
Okay, so now we get to the fun part, which is the neural network training, and this is where a lot of the heavy lifting happens computationally when you're training these neural networks. What we do in this step is model the statistical relationships of how these tokens follow each other in the sequence. So we come into the data and we take windows of tokens. We take a window of tokens from this data fairly randomly, and the window's length can range anywhere between zero tokens, actually, all the way up to some maximum size that we decide on. For example, in practice you could see token windows of, say, 8,000 tokens. In principle we can use arbitrary window lengths of tokens, but processing very long window sequences would just be very computationally expensive, so we decide that, say, 8,000 is a good number, or 4,000 or 16,000, and we crop it there. Now, in this example I'm going to take the first four tokens, just so everything fits nicely. So we're going to take a window of four tokens, the chunks "bar", "view", "ing", and " single", which are these token IDs. And what we're trying to do here is predict the token that comes next in the sequence; 3962 comes next. So we call these four tokens the context, and they feed into a neural network. This is the input to the neural network.
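A minimal sketch, in Python, of how such training windows get carved out of the token stream; the first five IDs echo the example on screen, the rest are made up:

```python
import random

# A tiny slice of the tokenized dataset (mostly made-up IDs)
tokens = [91, 860, 287, 11579, 3962, 13659, 16, 1131]
max_window = 4

# Sample a random window: the window is the context, and the token
# immediately after it is what the network should learn to predict.
start = random.randrange(len(tokens) - max_window)
context = tokens[start : start + max_window]   # e.g. [91, 860, 287, 11579]
label = tokens[start + max_window]             # e.g. 3962
print(context, "->", label)
```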
Now, I'm going to go into the detail of what's inside this neural network in a little bit. For now, what's important to understand is the input and the output of the neural net. The input is a sequence of tokens of variable length, anywhere between zero and some maximum size like 8,000. The output is a prediction for what comes next. Because our vocabulary has 100,277 possible tokens, the neural network is going to output exactly that many numbers, and each of those numbers corresponds to the probability of that token coming next in the sequence. So it's making guesses about what comes next. In the beginning, this neural network is randomly initialized (we're going to see in a little bit what that means, but it's a random transformation), so these probabilities at the very beginning of training are also going to be kind of random. Here I have three examples, but keep in mind that there are 100,000 numbers here. So for the token " Direction", the neural network is saying this is 4% likely right now; 11799 is 2%; and the probability of 3962, which is " post", is 3%. Now, of course, we've sampled this window from our dataset, so we know what comes next; that's the label. We know that the correct answer is that 3962 actually comes next in the sequence. So now we have a mathematical process for doing an update to the neural network. We have a way of tuning it, and we're going to go into a little bit of detail in a bit, but basically, this probability of 3%, we want it to be higher, and we want the probabilities of all the other tokens to be lower. And so we have a way of mathematically calculating how to adjust and update the neural network so that the correct answer has a slightly higher probability. If I do an update to the neural network now, then the next time I feed this particular sequence of four tokens in, the network will be slightly adjusted, and it will say, okay, " post" is maybe 4%, " case" now maybe is 1%, and " Direction" could become 2%, or something like that. So we have a way of nudging, of slightly updating, the neural net to give a higher probability to the correct token that comes next in the sequence. And now you just have to remember that this process happens not just for this one token, where these four were fed in and predicted this one. This process happens at the same time for all of the tokens in the entire dataset. In practice, we sample little windows, little batches of windows, and then at every single one of these tokens we want to adjust our neural network so that the probability of that token becomes slightly higher. And this all happens in parallel, in large batches of these tokens. This is the process of training the neural network: it's a sequence of updates so that its predictions match up with the statistics of what actually happens in your training set, and its probabilities become consistent with the statistical patterns of how these tokens follow each other in the data.
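A small, self-contained sketch of the quantity these updates push on; the numbers here are random stand-ins rather than real model outputs:

```python
import numpy as np

VOCAB = 100_277
rng = np.random.default_rng(0)

logits = rng.normal(size=VOCAB)      # stand-in for the network's raw output scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax: one probability per possible next token

label = 3962                         # the token that actually came next in the data
loss = -np.log(probs[label])         # cross-entropy: small when probs[label] is large
print(loss)

# Training computes how every parameter should change to lower this loss,
# nudging probs[label] up and all the other probabilities down.
```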
So let's now briefly get into the internals of these neural networks, just to give you a sense of what's inside. As I mentioned, we have these inputs, which are sequences of tokens. In this case there are four input tokens, but this can be anywhere between zero and, let's say, a thousand tokens. In principle this could be an infinite number of tokens; it would just be too computationally expensive to process, so we crop it at a certain length, and that becomes the maximum context length of the model. Now, these inputs x are mixed up in a giant mathematical expression together with the parameters, or the weights, of these neural networks. Here I'm showing six example parameters and their settings, but in practice these modern neural networks will have billions of parameters, and in the beginning these parameters are completely randomly set. With a random setting of parameters, you might expect that this neural network would make random predictions, and it does; in the beginning its predictions are totally random. But it's through this process of iteratively updating the network, which we call training, that the setting of these parameters gets adjusted, such that the outputs of the neural network become consistent with the patterns seen in the training set. Think of these parameters as kind of like knobs on a DJ set: as you twiddle these knobs, you get different predictions for every possible token sequence input. And training a neural network just means discovering a setting of the parameters that seems to be consistent with the statistics of the training set.
Now, let me just give you an example of what this giant mathematical expression looks like, just to give you a sense. Modern networks are massive expressions with trillions of terms, probably, but let me show you a simple example. It would look something like this. These are the kinds of expressions, just to show you that it's not very scary: we have inputs x, like x1 and x2 in this case (two example inputs), and they get mixed up with the weights of the network, w1, w2, w3, etc. The mixing is simple operations like multiplication, addition, exponentiation, division, and so on. It is the subject of neural network architecture research to design effective mathematical expressions that have a lot of convenient characteristics: they are expressive, they're optimizable, they're parallelizable, and so on. At the end of the day, these are not complex expressions; basically, they mix up the inputs with the parameters to make predictions, and we're optimizing the parameters of this neural network so that the predictions come out consistent with the training set.
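For instance, here is a toy expression in this spirit, with two inputs and three weights; the exact formula is mine, purely for illustration:

```python
import math

def tiny_net(x1, x2, w):
    # Mix the inputs with the weights via multiplication and addition,
    # then squash with a sigmoid (built from exponentiation and division)
    s = w[0] * x1 + w[1] * x2 + w[2]
    return 1.0 / (1.0 + math.exp(-s))

w = [0.1, -0.3, 0.05]          # a real network has billions of these knobs
print(tiny_net(2.0, 3.0, w))   # twiddling w changes the prediction
```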
example of what these neural networks look like. So for that I encourage you
look like. So for that I encourage you to go to this website uh that has a very nice visualization of one of these networks.
So this is what you will find on this website. And this neural network here
website. And this neural network here that is used in production settings has this special kind of structure. This
network is called the transformer and this particular one as an example has 85,000 roughly parameters.
Now here on the top we take the inputs which are the token sequences and then information flows through the neural network until the output which
here are the loget softmax but these are the predictions for what comes next what token comes next and then here there's a sequence of transformations and all these
intermediate values that get produced inside this mathematical expression as it is sort of predicting what comes next. So as an example, these tokens are
next. So as an example, these tokens are embedded into kind of like this distributed representation as it's called. So every possible token has kind
called. So every possible token has kind of like a vector that represents it inside the neural network. So first we embed the tokens and then those values
uh kind of like flow through this diagram and these are all very simple mathematical expressions individually.
So we have layer norms and matrix multiplications and uh soft maxes and so on. So here's kind of like the attention
on. So here's kind of like the attention block of this transformer and then information kind of flows through into the multi-layer perceptron block and so on. And all these numbers here, these
on. And all these numbers here, these are the intermediate values of their expression. And uh you can almost think
expression. And uh you can almost think of these as kind of like the firing rates of these synthetic neurons. But I
would caution you to uh not um kind of think of it too much like neurons because these are extremely simple neurons compared to the neurons you would find in your brain. Your
biological neurons are very complex dynamical processes that have memory and so on. There's no memory in this
so on. There's no memory in this expression. It's a fixed mathematical
expression. It's a fixed mathematical expression from input to output with no memory. It's just a stateless. So these
memory. It's just a stateless. So these
are very simple neurons in comparison to biological neurons. But you can still
biological neurons. But you can still kind of loosely think of this as like a synthetic piece of uh brain tissue if you if you like uh to think about it that way. So information flows through
that way. So information flows through all these neurons fire until we get to the predictions. Now I'm not actually
the predictions. Now I'm not actually going to dwell too much on the precise uh kind of like mathematical details of all these transformations. Honestly, I
don't think it's that important to get into. What's really important to
into. What's really important to understand is that this is a mathematical function. It is uh
mathematical function. It is uh parameterized by some fixed set of parameters like say 85,000 of them. And
it is a way of transforming inputs into outputs. And as we twiddle the
outputs. And as we twiddle the parameters, we are getting uh different kinds of predictions. And then we need to find a good setting of these parameters so that the predictions uh
sort of match up with the patterns seen in training set. So that's the transformer. Okay. So I've shown you the
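To give a flavor of those internals, here is a minimal, self-contained sketch of the attention computation at the heart of a Transformer block; the dimensions and weights are random placeholders, not values from the visualized model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, C = 4, 16                    # four tokens of context, 16-dimensional vectors
rng = np.random.default_rng(0)
x = rng.normal(size=(T, C))     # the embedded input tokens
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(C)              # how strongly each token attends to each other token
mask = np.tril(np.ones((T, T))) == 1
scores = np.where(mask, scores, -np.inf)   # causal mask: a token cannot look at future tokens
out = softmax(scores) @ v                  # mix the value vectors by attention weight
print(out.shape)                           # (4, 16): one updated vector per input token
```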
Okay, so I've shown you the internals of the neural network, and we've talked a bit about the process of training it. I want to cover one more major stage of working with these networks, and that is the stage called inference. In inference, what we're doing is generating new data from the model, and so we want to see what kinds of patterns it has internalized in the parameters of its network. Generating from the model is relatively straightforward. We start with some tokens that are basically your prefix, what you want to start with. Say we want to start with the token 91. We feed it into the network, and remember that the network gives us probabilities; it gives us this probability vector here. So what we can do now is basically flip a biased coin: we sample a token based on this probability distribution. The tokens that are given high probability by the model are more likely to be sampled when you flip this biased coin; you can think of it that way. So we sample from the distribution to get a single token. For example, token 860 comes next. So 860, in this case, when we're generating from the model, could come next. Now, 860 is a relatively likely token; it might not be the only possible token in this case. There could be many other tokens that could have been sampled, but we can see that 860 is a relatively likely one. And indeed, in our training example here, 860 does follow 91. So let's continue the process. After 91 there's 860; we append it and again ask what the third token is. Let's sample, and let's just say it's 287, exactly as here. Let's do that again: we come back in, now we have a sequence of three, we ask what the likely fourth token is, we sample from that, and we get this one. And now let's say we do it one more time: we take those four, we sample, and we get this one. And this, 13659, is not actually the 3962 we had before. This token is the token " article" instead. So: viewing a single article. In this case we didn't exactly reproduce the sequence that we saw in the training data. So keep in mind that these systems are stochastic: we're sampling, we're flipping coins, and sometimes we luck out and reproduce some small chunk of the text in the training set, but sometimes we get a token that was not verbatim part of any of the documents in the training data. We're going to get remixes of the data that we saw in training, because at every step of the way we can flip and get a slightly different token, and once that token makes it in, and you sample the next one, and so on, you very quickly start to generate token streams that are very different from the token streams that occur in the training documents. Statistically they will have similar properties, but they are not identical to the training data; they're kind of like inspired by the training data. So in this case we got a slightly different sequence. And why would we get " article"? You might imagine that " article" is a relatively likely token in the context of "bar viewing single", etc. You can imagine that the word "article" followed this context window somewhere in the training documents to some extent, and we just happened to sample it here at this stage. So basically, inference is just predicting from these distributions one token at a time: we keep feeding back tokens and getting the next one. We're always flipping these coins, and depending on how lucky or unlucky we get, we might get very different kinds of patterns, depending on how we sample from these probability distributions.
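A minimal sketch of this sampling loop; the model function below is a stand-in that returns random probabilities, where the real thing would be the trained Transformer's forward pass:

```python
import numpy as np

rng = np.random.default_rng()
VOCAB = 100_277

def model(tokens):
    # Stand-in for the trained network: returns one probability
    # per possible next token (random here, for illustration).
    logits = rng.normal(size=VOCAB)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

tokens = [91]                              # the prefix we start from
for _ in range(4):
    probs = model(tokens)                  # probabilities for what comes next
    next_id = rng.choice(VOCAB, p=probs)   # flip the biased coin
    tokens.append(int(next_id))
print(tokens)                              # one possible stochastic continuation
```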
So that's inference. In most common scenarios, downloading the internet and tokenizing it is basically a pre-processing step. You do that a single time, and then, once you have your token sequence, you can start training networks. In practical cases you would try to train many different networks with different kinds of settings, different arrangements, and different sizes, so you'll be doing a lot of neural network training. Then, once you have a neural network that you've trained, and you have some specific set of parameters that you're happy with, you take the model and do inference: you actually generate data from the model. When you're on ChatGPT and you're talking with a model, that model was trained by OpenAI many months ago, probably, and they have a specific set of weights that work well. When you're talking to the model, all of that is just inference; there's no more training. Those parameters are held fixed, and you're just talking to the model: you're giving it some tokens, and it's completing token sequences, and that's what you're seeing generated when you actually use ChatGPT. That model just does inference alone. So let's now look at an example of training and inference that is concrete and gives you a sense of what this actually looks like when these models are trained.
Now, the example that I would like to work with, and that I'm particularly fond of, is OpenAI's GPT-2. GPT stands for Generatively Pre-trained Transformer, and this is the second iteration of the GPT series by OpenAI. When you are talking to ChatGPT today, the model underlying all of the magic of that interaction is GPT-4, the fourth iteration of that series. Now, GPT-2 was published in 2019 by OpenAI, in this paper that I have right here. The reason I like GPT-2 is that it is the first time a recognizably modern stack came together. All of the pieces of GPT-2 are recognizable today by modern standards; everything has just gotten bigger. I'm not going to be able to go into the full details of this paper, of course, because it is a technical publication, but some of the details I would like to highlight are as follows. GPT-2 was a Transformer neural network, just like the neural networks you would work with today. It had 1.6 billion parameters; these are the parameters we looked at here, and it would have 1.6 billion of them. Today, modern Transformers would have a lot closer to a trillion, or several hundred billion, probably. The maximum context length was 1,024 tokens. So when we are sampling windows of tokens from the dataset, we never take more than 1,024 tokens, and when you are trying to predict the next token in a sequence, you never have more than 1,024 tokens in your context to make that prediction. This is also tiny by modern standards; today the context length would be a lot closer to a couple hundred thousand, or maybe even a million. With that, you have a lot more context, a lot more tokens in your history, and you can make a much better prediction about the next token in a sequence. Finally, GPT-2 was trained on approximately 100 billion tokens, which is also fairly small by modern standards. As I mentioned, the FineWeb dataset we looked at here has 15 trillion tokens, so 100 billion is quite small.
Now, I actually tried to reproduce GPT-2 for fun as part of this project called llm.c. You can see my write-up of doing that in this post on GitHub, under the llm.c repository. In particular, the cost of training GPT-2 in 2019 was estimated to be approximately $40,000, but today you can do significantly better than that; in particular, here it took about one day and about $600. And this wasn't even trying too hard; I think you could really bring this down to about $100 today. Now, why is it that the costs have come down so much? Well, number one, these datasets have gotten a lot better, and the way we filter them, extract them, and prepare them has gotten a lot more refined, so the dataset is of much higher quality. That's one thing, but really the biggest difference is that our computers have gotten much faster in terms of the hardware (we're going to look at that in a second), and the software for running these models and really squeezing all possible speed out of the hardware has also gotten much better, as everyone has focused on these models and tried to run them very, very quickly. Now, I'm not going to be able to go into the full detail of this GPT-2 reproduction, and it is a long technical post, but I would still like to give you an intuitive sense of what it looks like to actually train one of these models as a researcher. What are you looking at, and what does it look like, what does it feel like? Let me give you a sense of that.
Okay, so this is what it looks like. Let me slide this over. What I'm doing here is training a GPT-2 model right now, and what's happening is that every single line here, like this one, is one update to the model. Remember how we are basically making the prediction better for every one of these tokens, updating the weights, or parameters, of the neural net. Every single line here is one update to the neural network, where we change its parameters by a little bit so that it is better at predicting the next token in a sequence. In particular, every single line here is improving the prediction on 1 million tokens in the training set: we've basically taken 1 million tokens out of this dataset, and we've tried to improve the prediction of each of those tokens as coming next in its sequence, on all 1 million of them simultaneously. At every single one of these steps, we make an update to the network. Now, the number to watch closely is this number called the loss. The loss is a single number telling you how well your neural network is performing right now, and it is constructed so that a low loss is good. You'll see that the loss decreases as we make more updates to the neural net, which corresponds to making better predictions on the next token in a sequence. And so the loss is the number you are watching as a neural network researcher. You're kind of twiddling your thumbs, drinking coffee, and making sure this looks good, so that with every update your loss is improving and the network is getting better at prediction. Here you see that we are processing 1 million tokens per update, and each update takes about 7 seconds. We are going to process a total of 32,000 steps of optimization; 32,000 steps with 1 million tokens each is about 32 billion tokens that we are going to process. And we're currently only at about step 420 out of 32,000, so we are still only a bit more than 1% done, because I've only been running this for 10 or 15 minutes or so.
Now, every 20 steps I have configured this optimization to do inference. So what you're seeing here is the model predicting the next token in a sequence: you start it off, and then you continue plugging in the tokens. We're running this inference step, and the model is predicting the next token in the sequence; every time you see something appear, that's a new token. Let's just look at this. You can see that it is not yet very coherent; keep in mind that this is only 1% of the way through training, so the model is not yet very good at predicting the next token. What comes out is actually a bit of gibberish, but it still has a little bit of local coherence: "since she is mine, it's a part of the information should discuss my father great companions Gordon showed me sitting over it", and so on. So I know it doesn't look very good, but let's actually scroll up and see what it looked like when I started the optimization, all the way up here near step one. After 20 steps of optimization, what we were getting looks completely random, and of course that's because the model had only had 20 updates to its parameters. It gives you random text because it's a random network. So you can see that, at least in comparison to that, the model is already starting to do much better. And indeed, if we waited the entire 32,000 steps, the model will have improved to the point that it's actually generating fairly coherent English; the tokens stream correctly, and they make up English a lot better. This has to run for about a day or two more. At this stage we just make sure the loss is decreasing, everything is looking good, and we just have to wait.
Now, let me turn to the story of the computation that's required, because of course I'm not running this optimization on my laptop; that would be way too expensive. We have to run this neural network and improve it, and we need all this data, and so on. You can't run this well on your own computer, because the network is just too large. So all of this is running on a computer out there in the cloud, and I want to address the compute side of the story of training these models and what that looks like. So let's take a look. Okay, the computer that I am running this optimization on is this 8x H100 node: there are eight H100 GPUs in a single node, a single computer. I am renting this computer, and it is somewhere in the cloud; I'm not sure where it is physically, actually. The place I like to rent from is called Lambda, but there are many other companies who provide this service. When you scroll down, you can see that they have some on-demand pricing for computers that have these H100s, which are GPUs (I'm going to show you what they look like in a second): on demand, 8x NVIDIA H100 GPUs. This machine comes for $3 per GPU per hour, for example. So you can rent these, you get a machine in the cloud, and you can go in and train these models.
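As a sanity check on the roughly $600 figure from the llm.c reproduction earlier, the rental arithmetic works out as follows:

```python
gpus = 8
dollars_per_gpu_hour = 3.00    # the on-demand price quoted above
hours = 24                     # roughly one day of training
print(gpus * dollars_per_gpu_hour * hours)   # 576.0, in line with "about $600"
```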
And these GPUs look like this. This is one H100 GPU; this is roughly what it looks like, and you slot it into your computer. GPUs are a perfect fit for training neural networks, because the computation is very expensive but displays a lot of parallelism: you can have many independent workers all working at the same time on the matrix multiplications that are under the hood of training these neural networks. This is just one of these H100s, but actually you would put multiple of them together: you can stack eight of them into a single node, and then you can stack multiple nodes into an entire data center, or an entire system. When we look at a data center, we start to see things that look like this: one GPU goes to eight GPUs, goes to a single system, goes to many systems. These bigger data centers would of course be much, much more expensive. What's happening is that all the big tech companies really desire these GPUs so they can train all these language models, because they are so powerful. That is fundamentally what has driven the stock price of NVIDIA to $3.4 trillion today, as an example, and why NVIDIA has kind of exploded. This is the gold rush. The gold rush is getting the GPUs, getting enough of them so they can all collaborate to perform this optimization. And what are they all doing? They're all collaborating to predict the next token on a dataset like the FineWeb dataset. This is the computational workflow that is extremely expensive. The more GPUs you have, the more tokens you can try to predict and improve on; you're going to process this dataset faster, you can iterate faster, and you can train a bigger network, and so on. This is what all those machines are doing, and this is why all of this is such a big deal. For example, here is an article from about a month ago or so: this is why it's a big deal that, for example, Elon Musk is getting 100,000 GPUs in a single data center. All of these GPUs are extremely expensive, are going to take a ton of power, and all of them are just trying to predict the next token in a sequence and improve the network by doing so, getting probably a lot more coherent text than what we're seeing here, a lot faster.
Okay, so unfortunately I do not have a couple of ten or a hundred million dollars to spend on training a really big model like this. But luckily we can turn to some big tech companies who train these models routinely and release some of them once they are done training. They've spent a huge amount of compute to train the network, and they release it at the end of the optimization, which is very useful because they've done a lot of compute for that. There are many companies who train these models routinely, but actually not many of them release what are called base models. The model that comes out at the end of this stage is what's called a base model. What is a base model? It's a token simulator; it's an internet text token simulator. And that is not by itself useful yet, because what we want is what's called an assistant: we want to ask questions and have it respond with answers. These models won't do that; they just create remixes of the internet. They dream internet pages. So base models are not very often released, because they're only step one of a few other steps that we still need to take to get an assistant. However, a few releases have been made. As an example, OpenAI released the GPT-2 model, the 1.5-billion-parameter model, back in 2019, and this GPT-2 model is a base model. Now, what is a model release? What does it look like to release these models? This is the GPT-2 repository on GitHub. Well, you need two things, basically, to release a model. Number one, you need the Python code, usually, that describes in detail the sequence of operations in the model. If you remember this Transformer from before, the sequence of steps taken in the neural network is what is described by this code; this code implements what's called the forward pass of the neural network. So we need the specific details of exactly how they wired up that neural network. This is just computer code, usually just a couple hundred lines; it's not that crazy, and it is all fairly understandable and usually fairly standard. What's not standard are the parameters. That's where the actual value is. Where are the parameters of this neural network? There are 1.5 billion of them, and we need the correct setting, or a really good setting. And so, in addition to this source code, they release the parameters, which in this case is roughly 1.5 billion numbers: one single list of 1.5 billion numbers, the precise and good setting of all the knobs, such that the tokens come out well. So you need those two things for a base model release.
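As a minimal sketch of what consuming such a release looks like today, here is how you might load that code-plus-parameters pair with Hugging Face's transformers library; this assumes transformers and torch are installed, and "gpt2-xl" is the 1.5-billion-parameter release:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")      # the tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")  # the code plus the released weights

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")                   # roughly 1.5 billion numbers

ids = tok("Here's my top 10 list of", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=32, do_sample=True)
print(tok.decode(out[0]))                           # a stochastic continuation
```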
Now, GPT-2 was released, but that's actually a fairly old model, as I mentioned. So the model we're going to turn to instead is called Llama 3, and that's the one I would like to show you next. GPT-2, again, was 1.6 billion parameters trained on 100 billion tokens. Llama 3 is a much bigger and much more modern model. It is released and trained by Meta, and it is a 405-billion-parameter model trained on 15 trillion tokens, in very much the same way, just much, much bigger. Meta has also made a release of Llama 3, and that was part of this paper. With this paper, which goes into a lot of detail, the biggest base model they released is the Llama 3.1 405-billion-parameter model. So this is the base model. And then, in addition to the base model, you see here (foreshadowing for later sections of the video) that they also released the instruct model. Instruct means that this is an assistant: you can ask it questions and it will give you answers. We have yet to cover that part; for now, let's just look at this base model, this token simulator, and let's play with it and try to think about what this thing is, how it works, and what we get at the end of this optimization if you let it run until the end, for a very big neural network, on a lot of data. My favorite place to interact with base models is this company called Hyperbolic, which is basically serving the base model of the 405B Llama 3.1. When you go to the website (I think you may have to register and so on), make sure in the models section that you are using Llama 3.1 405B base; it must be the base model. And then here, the max tokens setting is how many tokens we're going to be generating. Let's just decrease this a bit, just so we don't waste compute; we just want the next 128 tokens, and we'll leave the other stuff alone. I'm not going to go into full detail here. Fundamentally, what's going to happen here is identical to what happened during our inference example: this is just going to continue the token sequence of whatever prefix you give it.
I want to first show you that this model here is not yet an assistant. You can't, for example, ask it "what is 2 plus 2?" It's not going to tell you, "oh, it's 4; what else can I help you with?" It's not going to do that, because "what is 2 plus 2" is going to be tokenized, and then those tokens just act as a prefix; what the model is going to do is just get the probability for the next token. It's just a glorified autocomplete, a very, very expensive autocomplete of what comes next, based on the statistics of what it saw in its training documents, which are basically web pages. So let's just hit enter to see what tokens it comes up with as a continuation. Okay, so here it kind of actually answered the question and then started to go off into some philosophical territory. Let's try it again: let me copy and paste, and let's try again from scratch. "What is 2 plus 2?" Okay, so it just goes off again. Notice one more thing I want to stress: every time you put a prompt in, it starts from scratch. The system here is stochastic. For the same prefix of tokens, we are always getting a different answer, and the reason for that is that we get a probability distribution and we sample from it; we always get different samples, and we always go off into a different territory afterwards. In this case, I don't know what this is. Let's try one more time. So it just continues on. It's just doing the stuff that it saw on the internet, right? It's just regurgitating those statistical patterns. So, first: it's not an assistant yet, it's a token autocomplete. And second: it is a stochastic system.
Now, the crucial thing is that even though this model is not yet by itself very useful for a lot of applications, it is still very useful, because in the task of predicting the next token in the sequence, the model has learned a lot about the world, and it has stored all that knowledge in the parameters of the network. Remember that our text looked like this, right, internet web pages, and now all of this is compressed in the weights of the network. You can think of these 405 billion parameters as a kind of compression of the internet; you can think of the 405 billion parameters as kind of like a zip file. But it's not a lossless compression, it's a lossy compression: we're left with a kind of gestalt of the internet, and we can generate from it. Now, we can elicit some of this knowledge by prompting the base model accordingly. For example, here's a prompt that might work to elicit some of that knowledge hiding in the parameters: "Here's my top 10 list of the top landmarks to see in Paris." I'm doing it this way because I'm trying to prime the model to continue this list. Let's see if that works when I press enter. Okay, so you see that it started a list, and it's now giving me some of those landmarks. Notice that it's trying to give a lot of information here. Now, you might not be able to fully trust some of this information; remember that this is all just a recollection of internet documents. The things that occur very frequently in the internet data are probably more likely to be remembered correctly than things that happen very infrequently. So you can't fully trust some of the information here, because it's all just a vague recollection of internet documents: the information is not stored explicitly in any of the parameters, it's all just recollection. That said, we did get something that is probably approximately correct, though I don't actually have the expertise to verify it. You see that we've elicited a lot of the knowledge of the model, and this knowledge is not precise and exact; it is vague, probabilistic, and statistical, and the kinds of things that occur often are the kinds of things that are more likely to be remembered in the model. Now I want to show you a few more examples of this model's behavior.
The first thing I want to show you is this example. I went to the Wikipedia page for zebra, and let me just copy-paste even the first sentence here. Now when I hit enter, what kind of completion are we going to get? Let me just hit enter.
"There are three living species", etc. What the model is producing here is an exact regurgitation of this Wikipedia entry. It is reciting this Wikipedia entry purely from memory, and this memory is stored in its parameters. It is possible that at some point in these 512 tokens the model will stray away from the Wikipedia entry, but you can see that it has huge chunks of it memorized. Let me see, for example, if this sentence occurs by now.
Okay, so we're still on track. Let me check here.
Okay, we're still on track. It will eventually stray away.
So this thing is just recited to a very large extent. It will eventually deviate, because it won't be able to remember exactly. Now, the reason this happens is that these models can be extremely good at memorization, and usually this is not what you want in the final model. This is called regurgitation, and it's usually undesirable to cite things directly that you have trained on. The reason it happens here is that for documents deemed to be of very high quality as a source, like Wikipedia, it is very often the case that when you train the model, you will preferentially sample from those sources. So the model has probably done a few epochs on this data, meaning it has seen this web page maybe 10 times or so. It's a bit like when you read some text many, many times: say you read something a hundred times, then you'll be able to recite it. It's very similar for this model: if it sees something way too often, it's going to be able to recite it later from memory, except these models can be a lot more efficient per presentation than a human. So probably it has only seen this Wikipedia entry about 10 times, but it has basically remembered this article exactly in its parameters. Okay, the next thing I want to show you is something that the model has definitely not seen during its training.
For example, if we go to the paper and navigate to the pre-training data, we'll see that the dataset has a knowledge cutoff at the end of 2023. So it will not have seen documents after this point, and it has certainly not seen anything about the 2024 election and how it turned out. Now, if we prime the model with tokens from the future, it will continue the token sequence and just take its best guess according to the knowledge it has in its own parameters. So let's take a look at what that could look like. So: the Republican Party ticket, Donald Trump, president of the United States from 2017, and let's see what it says after this point. The model will have to guess at the running mate, who the ticket ran against, etc. So let's hit enter.
So here it thinks that Mike Pence was the running mate instead of JD Vance, and that the ticket ran against Hillary Clinton and Tim Kaine. So this is kind of an interesting parallel universe of what could have happened, according to the LLM. Let's get a different sample: the identical prompt, and let's resample.
So here the running mate was Ron DeSantis, and they ran against Joe Biden and Kamala Harris. This is again a different parallel universe. The model takes educated guesses and continues the token sequence based on its knowledge. All of what we're seeing here is what's called hallucination: the model is just taking its best guess in a probabilistic manner. The next thing I would like to show you is that even though this is a base model and not yet an assistant model, it can still be utilized in practical applications if you are clever with your prompt design.
So here's something that we would call a few-shot prompt. What I have here is 10 pairs, and each pair is a word of English, a colon, and then the translation in Korean. And at the end we have "teacher:", and this is where we're going to do a completion of, say, just five tokens. These models have what we call in-context learning abilities. What that refers to is that as the model is reading this context, it is learning, sort of in place, that there's some kind of algorithmic pattern going on in my data, and it knows to continue that pattern. This is called in-context learning. So it takes on the role of a translator, and when we run the completion, we see that the translation of "teacher" is 선생님 (seonsaengnim), which is correct. So this is how you can build apps by being clever with your prompting, even though we still only have a base model for now. It relies on this in-context learning ability, and it is done by constructing what's called a few-shot prompt.
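As a concrete illustration, here is a minimal sketch of how such a few-shot prompt could be assembled programmatically; the word pairs are my own illustrative examples, not the exact ones shown on screen:

```python
# Build an English -> Korean few-shot prompt for a base model.
pairs = [
    ("apple", "사과"),
    ("house", "집"),
    ("water", "물"),
    ("book", "책"),
    ("school", "학교"),
]

prompt = "\n".join(f"{english}: {korean}" for english, korean in pairs)
prompt += "\nteacher:"  # leave the last pair open for the model to complete

print(prompt)
# Fed to a base model, the most likely continuation is the Korean word for
# "teacher" (선생님): pure in-context pattern completion, no fine-tuning.
```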
Okay, and finally, I want to show you that there is a clever way to instantiate a whole language model assistant just by prompting. The trick to it is that we're going to structure a prompt to look like a web page that is a conversation between a helpful AI assistant and a human, and then the model will continue that conversation. Actually, to write the prompt, I turned to ChatGPT itself, which is kind of meta: I told it I want to create an LLM assistant, but all I have is a base model, so can you please write my prompt? And this is what it came up with, which is actually quite good: "Here's a conversation between an AI assistant and a human. The AI assistant is knowledgeable, helpful, capable of answering a wide variety of questions", etc. And then, it's not enough to just give it a description. It works much better if you create a few-shot prompt. So here are a few turns of human, assistant, human, assistant: a few turns of conversation. And then at the end is where we put the actual query that we'd like. So let me copy-paste this into the base model prompt. And now let me write "Human:", and this is where we put our actual prompt: why is the sky blue?
And let's run it. "Assistant: The sky appears blue due to the phenomenon called Rayleigh scattering", etc. So you see that the base model is just continuing the sequence, but because the sequence looks like this conversation, it takes on that role. It is a little subtle, because here it just ends the assistant turn and then hallucinates the next question by the human, and so on; it'll just continue going on and on. But you can see that we have sort of accomplished the task. If you just took this "why is the sky blue", refreshed, and put it in by itself, then of course we don't expect this to work with the base model; who knows what we're going to get. Okay, we're just going to get more questions.
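Here is a sketch of what such a prompt wrapper could look like in code; the system description and the few-shot turns are my own placeholder wording, not the exact prompt ChatGPT produced in the video:

```python
SYSTEM = ("Here's a conversation between a helpful, knowledgeable AI "
          "assistant and a human.\n\n")

# A couple of hand-written example turns: these do most of the work of
# locking the base model into the assistant role.
FEW_SHOT = ("Human: What is the capital of France?\n"
            "Assistant: The capital of France is Paris.\n\n"
            "Human: How many days are in a week?\n"
            "Assistant: There are seven days in a week.\n\n")

def assistant_prompt(user_query: str) -> str:
    # End on "Assistant:" so the base model's most likely continuation
    # is the answer turn rather than another human question.
    return SYSTEM + FEW_SHOT + f"Human: {user_query}\nAssistant:"

print(assistant_prompt("Why is the sky blue?"))
```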
Okay, so this is one way to create an assistant even though you may only have a base model. And this is a brief summary of the things we talked about over the last few minutes.
Now let me zoom out here.
And this is what we've talked about so far. We wish to train LLM assistants like ChatGPT. We've discussed the first stage of that, which is the pre-training stage. And we saw that really what it comes down to is: we take internet documents, we break them up into these tokens, these atoms of little text chunks, and then we predict token sequences using neural networks. The output of this entire stage is the base model; it is the setting of the parameters of this network. And this base model is basically an internet document simulator on the token level. It can generate token sequences that have the same kind of statistics as internet documents.
And we saw that we can use it in some applications, but we actually need to do better. We want an assistant: we want to be able to ask questions and have the model give us answers. So we now need to go into the second stage, which is called the post-training stage. We take our base model, our internet document simulator, and hand it off to post-training. We're now going to discuss a few ways to do what's called post-training of these models. These stages in post-training are going to be computationally much less expensive. Most of the computational work, all of the massive data centers, all of the heavy compute and millions of dollars, is in the pre-training stage. But now we go into the slightly cheaper, but still extremely important, stage called post-training, where we turn this LLM into an assistant. So let's
take a look at how we can get our model to not sample internet documents but to give answers to questions. In other words, we want to start thinking about conversations. These are conversations that can be multi-turn, i.e. multiple turns, and in the simplest case they are a conversation between a human and an assistant. For example, we can imagine the conversation could look something like this: when a human says "What is 2 plus 2?", the assistant should respond with something like "2 plus 2 is 4". When a human follows up and says "What if it was a * instead of a +?", the assistant could respond with something like this. And similarly here, this is another example showing that the assistant could also have some kind of a personality, that it's kind of nice. And then in the third example, I'm showing that when a human asks for something that we don't wish to help with, we can produce what's called a refusal: we can say that we cannot help with that. In other words, what we want to do now is think through how an assistant should interact with a human, and we want to program the assistant and its behavior in these conversations. Now, because this is
neural networks, we're not going to be programming these behaviors explicitly in code. Everything is done through neural network training on datasets. And so because of that, we are going to be implicitly programming the assistant by creating datasets of conversations. These are three independent examples of conversations in a dataset. An actual dataset, and I'm going to show you examples, will be much larger: it could have hundreds of thousands of conversations that are multi-turn, very long, etc., and would cover a diverse breadth of topics. But here I'm only showing three examples. The way this works, basically, is that the assistant is being programmed by example. And where is this data coming from, like "2 * 2 = 4, same as 2 + 2", etc.? It comes from human labelers. We will give human labelers some conversational context and ask them to give the ideal assistant response in this situation, and a human will write out the ideal response for the assistant in that situation. And then we're going to get the model to train on this and to imitate those kinds of responses.
So the way this works is: we take our base model, which we produced in the pre-training stage and which was trained on internet documents, we throw out the dataset of internet documents and substitute a new dataset, a dataset of conversations, and we continue training the model on these conversations. What happens is that the model will very rapidly adjust and learn the statistics of how this assistant responds to human queries. Then later, during inference, we'll be able to prime the assistant, get the response, and it will be imitating what human labelers would do in that situation, if that makes sense. We're going to see examples of that, and this is going to become a bit more concrete. I also want to mention that in this post-training stage we basically just continue training the model, but the pre-training stage can in practice take roughly three months of training on many thousands of computers, while the post-training stage will typically be much shorter, like three hours, for example. That's because the dataset of conversations that we create manually is much smaller than the dataset of text on the internet. So this training will be very short, but fundamentally we're just going to take our base model and continue training using the exact same algorithm, the exact same everything, except we're swapping out the dataset for conversations.
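In pseudocode, the point is that nothing about the training loop itself changes between the two stages; only the data does. Here is a conceptual sketch, with stub classes standing in for a real model and framework (none of these names are a real API, and the token ids are arbitrary placeholders):

```python
class Model:
    """Stub standing in for the neural network being trained."""
    def next_token_loss(self, tokens):
        return 0.0  # cross-entropy of next-token prediction (stub)
    def update(self, loss):
        pass        # one optimizer step (stub)

def train(model, sample_sequence, steps):
    # The exact same loop serves both pre-training and post-training (SFT).
    for _ in range(steps):
        tokens = sample_sequence()  # one tokenized training sequence
        model.update(model.next_token_loss(tokens))

model = Model()
# Pre-training: months on thousands of GPUs, web documents as data.
train(model, sample_sequence=lambda: [791, 7160, 374], steps=3)
# Post-training: identical algorithm, tokenized conversations swapped in.
train(model, sample_sequence=lambda: [200, 64, 119], steps=3)
```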
So the questions now are: where do these conversations come from? How do we represent them? How do we get the model to see conversations instead of just raw text? And what are the outcomes of this kind of training, in a certain psychological sense, when we talk about the model? Let's turn to those questions now, starting with the tokenization of conversations.
Everything in these models has to be turned into tokens, because everything is just about token sequences. So the question is how we turn conversations into token sequences. For that, we need to design some kind of an encoding. This is somewhat similar to, if you're familiar with it (you don't have to be), a TCP/IP packet on the internet: there are precise rules and protocols for how you represent information and how everything is structured together, so that the data is laid out in a documented way that everyone can agree on. The same thing is now happening in LLMs: we need data structures, and we need rules around how these data structures, like conversations, get encoded to and decoded from tokens. So I want to show you now how I would recreate this conversation in token space. If you go to TikTokenizer, I can take that conversation, and this is how it is represented for the language model. Here we are alternating between a user and an assistant in this two-turn conversation, and what you're seeing here looks ugly but is actually relatively simple. The way it gets turned into a token sequence at the end is a little bit involved, but in the end this conversation between a user and an assistant ends up being a one-dimensional sequence of 49 tokens, and these are the tokens. Okay.
All the different LLMs will have a slightly different format or protocol, and it's a little bit of a wild west right now. But, for example, GPT-4 does it in the following way. You have this special token called im_start, which is short for "imaginary monologue start"; I don't actually know why it's called that, to be honest. Then you have to specify whose turn it is, for example "user", which is token 1428. Then you have the imaginary monologue separator, then the exact question, so the tokens of the question, and then you close it with im_end, the end of the imaginary monologue. So the question from a user, "what is 2 plus 2", ends up being this sequence of tokens. And the important thing to mention here is that im_start is not text, right? im_start is a special token that gets added. It's a new token that has never been trained on so far; it is a new token that we create and introduce in the post-training stage. These special tokens, like im_sep, im_start, etc., are introduced and interspersed with text so that the model learns: hey, this is the start of a turn; whose turn is it? The turn is for the user; then this is what the user says; then the user turn ends; then a new turn starts, and it's by the assistant; and then what does the assistant say? Well, these are the tokens of what the assistant says, etc. And so this conversation is now turned into this sequence of tokens.
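To make the encoding concrete, here is a sketch of how a structured conversation might be flattened into a single string before tokenization, loosely following the im_start style described above. The exact spelling of the special tokens varies by model; treat these as illustrative:

```python
conversation = [
    {"role": "user",      "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
    {"role": "user",      "content": "What if it was * instead of +?"},
]

def render(conv):
    # Flatten the structured conversation into one string of text plus
    # special tokens; the tokenizer then maps this to a 1-D token sequence.
    out = ""
    for turn in conv:
        out += f"<|im_start|>{turn['role']}<|im_sep|>{turn['content']}<|im_end|>\n"
    return out

print(render(conversation))
```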
The specific details here are not actually that important. All I'm trying to show you, in concrete terms, is that our conversations, which we think of as a kind of structured object, end up being turned, via some encoding, into one-dimensional sequences of tokens. And because it's a one-dimensional sequence of tokens, we can apply all the stuff that we applied before: it's just a sequence of tokens, and now we can train a language model on it. We're just predicting the next token in a sequence, just like before, and we can represent and train on conversations. Then what does it look like at test time, during inference? Say we've trained a model on these kinds of datasets of conversations, and now we want to do inference.
So during inference, what does this look like when you're on ChatGPT? Well, you come to ChatGPT and you have, say, a dialogue with it. The way this works is, say that this was already filled in: "What is 2 plus 2?" "2 plus 2 is 4." And now you issue "What if it was * instead?". What basically ends up happening on the servers of OpenAI, or something like that, is that they put in im_start, assistant, im_sep, and this is where they end it, right here. They construct this context, and now they start sampling from the model. It's at this stage that they go to the model and say: okay, what is a good first token? What is a good second token? What is a good third token? And this is where the LLM takes over and creates a response, for example a response that looks something like this. It doesn't have to be identical to this, but it will have the flavor of this, if this kind of a conversation was in the dataset. So that's roughly how the protocol works, although the details of this protocol are not important.
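Here is a minimal sketch of that inference-time priming, in the same illustrative token format as the earlier sketch; model_sample_token() is a placeholder for the actual network plus sampler, not a real API:

```python
def model_sample_token(context: str) -> str:
    # Placeholder: a real system runs the neural network on the context and
    # samples one token from the output distribution. Here we just end the
    # turn immediately so the sketch terminates.
    return "<|im_end|>"

history = (
    "<|im_start|>user<|im_sep|>What is 2+2?<|im_end|>"
    "<|im_start|>assistant<|im_sep|>2+2 = 4<|im_end|>"
    "<|im_start|>user<|im_sep|>What if it was * instead of +?<|im_end|>"
)

# The server leaves the assistant's turn open...
context = history + "<|im_start|>assistant<|im_sep|>"

# ...and samples tokens one at a time until the turn closes.
response = ""
while not response.endswith("<|im_end|>"):
    response += model_sample_token(context + response)
```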
So again, my goal is just to show you that everything ends up being a one-dimensional token sequence, so we can apply everything we've already seen, but we're now training on conversations, and we're now basically generating conversations as well. Okay, so now I would like to turn to what these datasets look like in practice.
The first paper I would like to show you, and the first effort in this direction, is this paper from OpenAI in 2022, called InstructGPT, after the technique they developed. This was the first time OpenAI talked about how you can take language models and fine-tune them on conversations. This paper has a number of details that I'd like to take you through. The first stop I'd like to make is in section 3.4, where they talk about the human contractors they hired, in this case from Upwork or through Scale AI, to construct these conversations. There are human labelers involved whose professional job it is to create these conversations. These labelers are asked to come up with prompts, and then they are asked to also complete the ideal assistant responses. These are the kinds of prompts that people came up with, so these are from human labelers: "List five ideas for how to regain enthusiasm for my career." "What are the top 10 science fiction books I should read next?" And there are many different types of prompts here: "Translate this sentence into Spanish", etc. So there are many things here that people came up with. They first come up with the prompt, and then they also answer that prompt, giving the ideal assistant response.
Now, how do they know what the ideal assistant response is that they should write for these prompts? When we scroll down a little further, we see this excerpt of labeling instructions that are given to the human labelers. The company that is developing the language model, for example OpenAI, writes up labeling instructions for how the humans should create ideal responses. Here, for example, is an excerpt of those instructions. On a high level, you're asking people to be helpful, truthful, and harmless. You can pause the video if you'd like to see more, but on a high level it's basically: try to be helpful, try to be truthful, and don't answer questions that we don't want the system to handle later in ChatGPT. So roughly speaking, the company comes up with the labeling instructions. Usually they are not this short; usually they are hundreds of pages, and people have to study them professionally. Then they write out the ideal assistant responses following those labeling instructions. So this is a very human-heavy process, as it was described in this paper.
Now, the dataset for InstructGPT was never actually released by OpenAI, but we do have some open-source reproductions that tried to follow this kind of setup and collect their own data. One that I'm familiar with, for example, is the OpenAssistant effort from a while back, and this is just one of many examples, but I just want to show you one. Here, people on the internet were asked to create these conversations, similar to what OpenAI did with human labelers. Here's an entry from a person who came up with this prompt: "Can you write a short introduction to the relevance of the term monopsony in economics? Please use examples", etc. And then the same person, or potentially a different person, will write up the response. So here's the assistant response to this: the same or a different person writes out this ideal response. And then this is an example of how the conversation could continue: "Now explain it to a dog", and then you can try to come up with a slightly simpler explanation, or something like that.
This then becomes the label, and we end up training on this. Now, during training we're of course not going to have full coverage of all the possible questions that the model will encounter at test time; we can't possibly cover all the prompts that people are going to ask in the future. But if we have a dataset with a number of these examples, then during training the model will start to take on the persona of this helpful, truthful, harmless assistant. It's all programmed by example. These are all examples of behavior, and if you have conversations of these example behaviors, and you have enough of them, like 100,000, and you train on them, the model starts to understand the statistical pattern and takes on the personality of this assistant. Now, it's possible that when you get the exact same question at test time, the answer will be recited exactly as it was in the training set. But more likely, the model will do something of a similar vibe: it will understand that this is the kind of answer that you want. So that's what we're doing: we're programming the system by example, and the system statistically adopts the persona of this helpful, truthful, harmless assistant, which is reflected in the labeling instructions that the company creates. Now I want to show you that the state-of-the-art has advanced in the last two or three years since the InstructGPT paper.
In particular, it's no longer very common for humans to be doing all the heavy lifting by themselves, because we now have language models, and these language models are helping us create these datasets of conversations. It is now very rare that people will literally write out a response from scratch; it is a lot more likely that they will use an existing LLM to come up with an answer and then edit it, or things like that. There are many different ways in which LLMs have started to permeate this post-training stack, and LLMs are basically used pervasively to help create these massive datasets of conversations. UltraChat is one example of a more modern conversation dataset. It is to a very large extent synthetic, but I believe there's some human involvement; I could be wrong on that. Usually there will be a little bit of human input, but a huge amount of synthetic help. This is all constructed in different ways, and UltraChat is just one of the many SFT datasets that currently exist. The only thing I want to show you is that these datasets now have millions of conversations. These conversations are mostly synthetic, but they're probably edited to some extent by humans, and they span a huge diversity of areas, and so on. These are fairly extensive artifacts by now, and there are all these SFT mixtures, as they're called: a mixture of lots of different types and sources, partially synthetic, partially human. Things have gone in that direction since, but roughly speaking we still have SFT datasets. They're made up of conversations, and we're training on them just like we did before.
And I guess the last thing to note is that I want to dispel a little bit of the magic of talking to an AI. When you go to ChatGPT, give it a question, and hit enter, what comes back is statistically aligned with what's happening in the training set. And these training sets really just have their seed in humans following labeling instructions. So what are you actually talking to in ChatGPT, and how should you think about it? Well, it's not coming from some magical AI. Roughly speaking, it's coming from something that is statistically imitating human labelers, who in turn follow labeling instructions written by these companies. It's almost as if you're asking a human labeler: imagine that the answer given to you by ChatGPT is some kind of simulation of a human labeler, as if you were asking what a human labeler would say in this kind of a conversation. And this human labeler is not just a random person from the internet, because these companies actually hire experts. For example, for questions about code and so on, the human labelers involved in the creation of these conversation datasets will usually be educated expert people, and you're kind of asking a question of a simulation of those people, if that makes sense. So you're not talking to a magical AI; you're talking to an average labeler. This average labeler is probably fairly highly skilled, but you're talking to a kind of instantaneous simulation of the kind of person that would be hired in the construction of these datasets.
Let me give you one more specific example before we move on. When I go to ChatGPT and say "recommend the top five landmarks to see in Paris", and then I hit enter. Okay, here we go. When I hit enter, what's coming out here? How do I think about it? Well, it's not some kind of magical AI that has gone out and researched all the landmarks and then ranked them using its infinite intelligence. What I'm getting is a statistical simulation of a labeler that was hired by OpenAI; you can think about it roughly that way. And so if this specific question is in the post-training dataset somewhere at OpenAI, then I'm very likely to see an answer that is probably very similar to what that human labeler would have put down for those five landmarks. How does the human labeler come up with this? Well, they go on the internet, do their own little research for 20 minutes, and come up with a list. If they come up with this list and it's in the dataset, I'm very likely to see what they submitted as the correct answer from the assistant. Now, if this specific query is not part of the post-training dataset, then what I'm getting here is a bit more emergent, because the model understands, statistically, that the kinds of landmarks that appear in the training set are usually the prominent landmarks, the landmarks that people usually want to see, the kinds of landmarks that are very often talked about on the internet. And remember that the model already has a ton of knowledge from its pre-training on the internet, so it has probably seen a ton of conversations about Paris, about landmarks, about the kinds of things people like to see. It's the pre-training knowledge combined with the post-training dataset that results in this kind of imitation. So that's roughly how you can think about what's happening behind the scenes here, in the statistical sense.
Okay, now I want to turn to the topic of what I like to call LLM psychology: what are the emergent cognitive effects of the training pipeline that we have for these models? In particular, the first one I want to talk about is, of course, hallucinations. You might be familiar with model hallucinations: it's when LLMs make stuff up, just totally fabricate information, etc. It's a big problem with LLM assistants. It was a problem that existed to a large extent with early models, many years ago, and I think the problem has gotten a bit better, because there are some mitigations that I'm going to go into in a second. For now, let's just try to understand where these hallucinations come from. So here's a specific example of three conversations that you might have in your training set; these are pretty reasonable conversations that you could imagine being in the training set.
So, for example: "Who is Tom Cruise?" Well, Tom Cruise is a famous American actor and producer, etc. "Who is John Barrasso?" This turns out to be a US senator, for example. "Who is Genghis Khan?" Well, Genghis Khan was blah blah blah. So this is what your conversations could look like at training time. Now, the problem is that when the human writes the correct answer for the assistant in each one of these cases, the human either knows who this person is or researches them on the internet, and they write a response that has the confident tone of an answer. What happens at test time is that when you ask about someone like "Orson Kovats", a totally random name I just made up, and as far as I know this person doesn't exist, the assistant will not just tell you "oh, I don't know". Even if the language model itself might know, inside its features, inside its activations, inside its brain, so to speak, that this person is not someone it's familiar with, even if some part of the network kind of knows that in some sense, saying "oh, I don't know who this is" is not going to happen, because the model statistically imitates its training set. In the training set, questions of the form "who is X" are confidently answered with the correct answer. So it's going to take on the style of the answer, it's going to do its best, it's going to give you statistically the most likely guess, and it's just going to basically make stuff up. These models, again, as we just discussed, don't have access to the internet; they're not doing research. They are statistical token tumblers, as I call them: they just try to sample the next token in the sequence, and they will basically make stuff up. So let's take a look at what this looks like.
I have here what's called the Inference Playground from Hugging Face, and I am on purpose picking on a model called Falcon 7B, which is an old model, a few years old by now. Because it's an older model, it suffers from hallucinations, which, as I mentioned, have improved over time. Let's ask: who is Orson Kovats? Let's ask Falcon 7B Instruct and run.
"Orson Kovats is an American author and science fiction writer." Okay, this is totally false; it's a hallucination. Let's try again. These are statistical systems, right? So we can resample. This time, Orson Kovats is a fictional character from a 1950s TV show. It's total BS, right? Let's try again: he's a former minor league baseball player. So basically the model doesn't know, and it's giving us lots of different answers because it's just sampling from these probabilities. The model starts with the tokens "who is Orson Kovats", assistant, and then it comes in here, gets these probabilities, samples from them, and just comes up with stuff. And that stuff is actually statistically consistent with the style of the answers in its training set, but you and I experience it as made-up factual knowledge. Keep in mind that the model basically doesn't know: it is just imitating the format of the answer, and it's not going to go off and look it up, because it's just imitating the answer.
So how can we mitigate this? For example, when I go to ChatGPT and say "who is Orson Kovats?", I'm now asking the state-of-the-art model from OpenAI, and this model is actually even smarter: you saw very briefly that it said "searching the web" (we're going to cover this later); it's actually trying to do tool use, and it came up with some kind of a story. But let me ask it about Orson Kovats without any tools; I don't want it to do a web search. And it answers along the lines of: there's no well-known historical or public figure named Orson Kovats. So this model is not going to make stuff up. It knows that it doesn't know, and it tells you that this doesn't appear to be a person it knows about. So somehow we've improved on hallucinations, even though they clearly are an issue in older models. And it makes total sense why you would get these kinds of answers if this is what your training set looks like.
So how do we fix this? Well, clearly we need some examples in our dataset where the correct answer for the assistant is that the model doesn't know about some particular fact. But we only need those answers in the cases where the model actually doesn't know. So the question is: how do we know what the model knows or doesn't know? Well, we can empirically probe the model to figure that out. Let's take a look, for example, at how Meta dealt with hallucinations for the Llama 3 series of models. In the paper they published, we can go to the section on hallucinations, which they call factuality, and they describe the procedure by which they interrogate the model to figure out what it knows and doesn't know, to find the boundary of its knowledge, and then they add examples to the training set where, for the things the model doesn't know, the correct answer is that it doesn't know them. This sounds like a very easy thing to do in principle, but it roughly fixes the issue. And the reason it fixes the issue is that the model might actually have a pretty good model of its own self-knowledge inside the network. Remember we looked at the network and all those neurons inside it: you might imagine there's a neuron somewhere in the network that lights up when the model is uncertain. The problem is that the activation of that neuron is not currently wired up to the model actually saying, in words, that it doesn't know. So even though the internals of the neural network know, because some neurons represent that, the model will not surface it; it will instead take its best guess so that it sounds confident, just like it sees in the training set. So we need to interrogate the model and allow it to say "I don't know" in the cases where it doesn't know. Let me take you through what Meta roughly does. Here I have an example.
Dominik Hašek is the featured article on Wikipedia today; I just went there randomly. What they do is basically take a random document from the training set, take a paragraph, and then use an LLM to construct questions about that paragraph. For example, I did that with ChatGPT here. I said: here's a paragraph from this document; generate three specific factual questions based on this paragraph, and give me the questions and the answers. And the LLMs are already good enough to create and reframe this information: if the information is in the context window of the LLM, this actually works pretty well. It doesn't have to rely on its memory; the text is right there in the context window, so it can reframe that information with fairly high accuracy. For example, it can generate questions for us like "For which team did he play?", here's the answer, "How many Cups did he win?", etc.
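Here is a sketch of that question-generation step; llm() is a hypothetical helper that sends a prompt to a chat model and returns its text response, not a call from any particular library:

```python
def make_probe_questions(paragraph: str) -> str:
    # The paragraph goes straight into the context window, so the model can
    # reframe it accurately instead of relying on its (vague) memory.
    prompt = (
        "Here is a paragraph from a document:\n\n"
        f"{paragraph}\n\n"
        "Generate three specific factual questions based on this paragraph, "
        "and give me the questions and the answers."
    )
    return llm(prompt)  # hypothetical: returns e.g. "Q: ... A: ..." triples
```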
Now what we have to do is: we have some questions and answers, and we want to interrogate the model. Roughly speaking, we take our questions and go to our model, which at Meta would be Llama, but let's just interrogate Mistral 7B here as an example; that's another model. Does this model know the answer? Let's take a look. So, he played for the Buffalo Sabres, right? So the model knows. And the way you can programmatically decide this is: we take this answer from the model, and we compare it to the correct answer. Again, the models are good enough to do this automatically, so no humans are involved here: we can take the answer from the model and use another LLM judge to check whether it is correct according to the reference answer. If it is correct, that means the model probably knows. So we're going to do this maybe a few times. Okay, it knows it's the Buffalo Sabres. Let's try again: Buffalo Sabres. Let's try one more time: Buffalo Sabres. So we asked this factual question three times, and the model seems to know. Everything is great.
Now let's try the second question: how many Stanley Cups did he win? And again, let's interrogate the model about that. The correct answer is two. Here the model claims that he won four times, which is not correct, right? It doesn't match two. So the model doesn't know; it's making stuff up. Let's try again. Here the model is again kind of making stuff up. Let's try again: here it says he did not even win during his career. So obviously
the model doesn't know. And the way we can tell programmatically, again, is that we interrogate the model, maybe three times, five times, whatever it is, and compare its answers to the correct answer. If the answers don't match, then we know that the model doesn't know this question. Then we take this question and create a new conversation in the training set: when the question is "How many Stanley Cups did he win?", the answer is "I'm sorry, I don't know" or "I don't remember". And that's the correct answer for this question, because we interrogated the model and saw that this is the case. If you do this for many different types of questions, for many different documents, you are giving the model an opportunity, in its training set, to refuse based on its knowledge. And if you have even a few examples of that in your training set, the model has the opportunity to learn the association between this knowledge-based refusal and that internal neuron of uncertainty somewhere in its network, which we presume exists, and empirically this turns out to be probably the case. It can learn the association: hey, when this neuron of uncertainty is high, I actually don't know, and I'm allowed to say "I'm sorry, but I don't think I remember this", etc. If you have these examples in your training set, this is a large mitigation for hallucinations, and that's roughly why ChatGPT is able to do this kind of thing as well. So these are the kinds of mitigations that people have implemented, and they have improved the factuality issue over time.
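Putting the probing procedure together, here is a sketch of the whole loop; model() and judge() are placeholder helpers (the probed model's sampler and an LLM grader, respectively), not a real API:

```python
def model_knows(question: str, reference: str, tries: int = 3) -> bool:
    # Resample the model a few times and grade each attempt with an LLM judge.
    answers = [model(question) for _ in range(tries)]
    return all(judge(answer, reference) for answer in answers)

def build_refusal_examples(qa_pairs):
    # For questions the model reliably gets wrong, emit a training
    # conversation whose correct answer is a refusal, so the refusal gets
    # wired to the model's internal sense of uncertainty.
    for question, reference in qa_pairs:
        if not model_knows(question, reference):
            yield {"user": question,
                   "assistant": "I'm sorry, I don't know."}
```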
Okay, so I've described mitigation number one for the hallucination issue. Now we can actually do much better than that. Instead of just saying that we don't know, we can introduce an additional mitigation, number two, to give the LLM an opportunity to be factual and actually answer the question. What do you and I do if someone asks us a factual question and we don't know the answer? Well, we could go off and do some search, use the internet, figure out the answer, and then report it. And we can do the exact same thing with these models. Think of the knowledge inside the neural network, inside its billions of parameters, as a kind of vague recollection of the things the model saw during pre-training, a long time ago. Think of that knowledge in the parameters as something you read a month ago. If you keep reading something, you will remember it, and the model remembers that; but if it's something rare, you probably won't have a good recollection of it. What you and I do in that case is just go and look it up. When you look something up, you're refreshing your working memory with information, and then you're able to retrieve it, talk about it, and so on. So we need some equivalent of allowing the model to refresh its memory, its recollection, and we can do that by introducing tools for the models.
The way we're going to approach this is that, instead of just saying "I'm sorry, I don't know", the model can attempt to use tools. We create a mechanism by which the language model can emit special tokens, new tokens that we introduce. For example, here I've introduced two tokens, and a format, a protocol, for how the model is allowed to use them. Instead of just answering the question when it does not know, the model now has the option to emit the special token SEARCH_START, followed by the query that will go to, say, bing.com in the case of OpenAI, or Google Search, or something like that. It emits the query and then it emits SEARCH_END. Then what happens is that the program that is sampling from the model, that is running the inference, when it sees the special token SEARCH_END, will pause generating from the model instead of sampling the next token in the sequence. It will open a session with bing.com, paste the search query in, get all the text that is retrieved, maybe represent it with some other special tokens, and copy-paste that text here, into what I've tried to show with the brackets. So all that text comes in here, and when it does, it enters the context window. The text from the web search is now inside the context window that feeds into the neural network. And you should think of the context window as the working memory of the model: data in the context window is directly accessible by the model; it directly feeds into the neural network. So it's no longer a vague recollection; it's data the model has in its context window, directly available. Now, when it samples new tokens afterwards, it can very easily reference the data that has been copy-pasted in there. So that's roughly how these tools function.
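Here is a sketch of the inference loop with such a web-search tool wired in; SEARCH_START/SEARCH_END stand for the newly introduced special tokens, and sample_token() and web_search() are placeholder helpers, not real APIs:

```python
def generate(context: str) -> str:
    # Sample one token at a time; intercept the tool-use protocol.
    while True:
        token = sample_token(context)  # placeholder: network + sampler
        context += token
        if token == "<SEARCH_END>":
            # Pause generation: pull out the query between the special
            # tokens, run the search, and paste the retrieved text back
            # into the context window (the model's working memory).
            query = context.split("<SEARCH_START>")[-1]
            query = query.removesuffix("<SEARCH_END>")
            context += "[" + web_search(query) + "]"
        elif token == "<END_OF_TURN>":
            return context
```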
And web search is just one of the tools; we're going to look at some of the others in a bit. But basically, you introduce new tokens, you introduce some schema by which the model can utilize these tokens and call these special functions, like the web search function. And how do you teach the model to correctly use these tools, like SEARCH_START, SEARCH_END, etc.? Again, you do that through training sets. We need a bunch of data, a bunch of conversations, that show the model by example how to use web search: what are the settings where you use the search, what does that look like, here's how you start a search, here's how you end a search, and so on. If you have a few thousand examples of that in your training set, the model will actually do a pretty good job of understanding how this tool works, and it will know how to structure its queries. And of course, because of the pre-training dataset and its understanding of the world, it already kind of understands what a web search is, so it has a pretty good native understanding of what makes a good search query. It all kind of just works: you just need a few examples to show it how to use this new tool, and then it can lean on it to retrieve information and put it in the context window. And that's equivalent to you and me looking something up: once it's in the context, it's in the working memory, and it's very easy to manipulate and access.
That's what we saw a few minutes ago when I asked ChatGPT who Orson Kovats is. The ChatGPT language model decided that this is some kind of rare individual or something like that, and instead of giving me an answer from its memory, it decided to sample a special token that does a web search. We saw something flash briefly, like "using the web tool" or something like that; it said that, we waited for about two seconds, and then it generated this. And you see how it's creating references here, citing sources. What happened is that it went off, did a web search, found these sources and these URLs, and the text of those web pages was all stuffed in between here. It's not shown, but it's basically stuffed in as text. And now it sees that text and references it, saying: okay, it could be these people (citation), it could be those people (citation), etc. That's what happened here, and that's why, when I asked who Orson Kovats is, I could also say "don't use any tools", and that's enough to convince ChatGPT to not use tools and just rely on its memory and its recollection. I also went off and asked ChatGPT: how many Stanley Cups did Dominik Hašek win? And ChatGPT actually decided that it knows the answer and has the confidence to say that he won twice. So here it just relied on its memory, presumably because it has enough confidence in its weights, its parameters, and its activations that this is retrievable just from memory. But you can also, conversely, use web search to make sure: for the same query, it goes off and searches, finds a bunch of sources, all of this stuff gets copy-pasted in there, and then it tells us two again, and it cites the Wikipedia article, which is the source of this information for us as well. So that's the web search tool: the model determines when to search, and that's roughly how these tools work. This is an additional mitigation for hallucinations and factuality.
So I want to stress one more time this very important psychological point: knowledge in the parameters of the neural network is a vague recollection; knowledge in the tokens that make up the context window is working memory. It works roughly like it does for us in our brains: the stuff we remember is our parameters, and the stuff we just experienced a few seconds or minutes ago is like the context window, which is being built up as you have a conscious experience around you. This has a bunch of implications for your use of LLMs in practice. For example, I can go to ChatGPT and say something like: "Can you summarize chapter 1 of Jane Austen's Pride and Prejudice?" This is a perfectly fine prompt, and ChatGPT actually does something relatively reasonable here. The reason it does is that ChatGPT has a pretty good recollection of a famous work like Pride and Prejudice: it has probably seen a ton of stuff about it, there are probably forums about this book, it has probably read versions of the book and articles about it, so it kind of remembers, enough to actually say all this. But usually, when I interact with LLMs and want them to recall specific things, it always works better if you just give the text to them. So I think a much better prompt is something like: "Can you summarize for me chapter 1 of Jane Austen's Pride and Prejudice? I am attaching it below for your reference." Then I add a delimiter and paste it in; I found chapter 1 by just copy-pasting it from some website. I do that because when the text is in the context window, the model has direct access to it: it doesn't have to recall it, it just has it. And so this summary can be expected to be of significantly higher quality than the previous one, just because the text is directly available to the model. I think you and I would work the same way: you would produce a much better summary if you had re-read the chapter right before you had to summarize it. That's basically what's happening here, or the equivalent of it.
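Here is a sketch of that "attach the text" pattern; the filename and delimiter are my own illustrative choices:

```python
# Put the source text directly into the context window (working memory)
# instead of relying on the model's vague recollection (parameters).
with open("pride_and_prejudice_ch1.txt") as f:  # hypothetical local copy
    chapter_one = f.read()

prompt = (
    "Can you summarize for me chapter 1 of Jane Austen's Pride and "
    "Prejudice? I am attaching it below for your reference.\n"
    "---\n"
    f"{chapter_one}\n"
    "---"
)
# The summary now depends on text the model can directly attend to, not on
# whatever it happens to remember from pre-training.
```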
The next psychological quirk I'd like to talk about briefly is the knowledge of self. What I see very often on the internet is that people ask LLMs something like "What model are you?" and "Who built you?" Basically, this question is a little bit nonsensical. The reason I say that is that, as I tried to explain with some of the under-the-hood fundamentals, this thing is not a person. It doesn't have a persistent existence in any way. It boots up, processes tokens, and shuts off, and it does that for every single person. It just builds up a context window of conversation, and then everything gets deleted. So this entity is restarted from scratch in every single conversation, if that makes sense. It has no persistent self. It has no sense of self. It's a token tumbler, and it follows the statistical regularities of its training set. So it doesn't really make sense to ask it who it is or who built it, and by default, if you just ask out of nowhere, you're going to get some pretty random answers.
So, for example, let's pick on Falcon, which is a fairly old model, and let's see what it tells us. At first it evades the question: "talented engineers and developers." Then it says it was built by OpenAI, based on the GPT-3 model. It's totally making stuff up. Now, a lot of people would take "built by OpenAI" as evidence that this model was somehow trained on OpenAI data or something like that. I don't actually think that's necessarily true. The reason is that if you don't explicitly program the model to answer these kinds of questions, then what you're going to get is its statistical best guess at the answer. This model had an SFT data mixture of conversations, and during the fine-tuning the model comes to understand that it's taking on the personality of a helpful assistant — but it wasn't told exactly what label to apply to itself. It has just taken on this persona of a helpful assistant. And remember that the pre-training stage took documents from the entire internet, and ChatGPT and OpenAI are very prominent in those documents. So I think what's actually likely happening here is that this is just its hallucinated label for what it is: its self-identity is "ChatGPT by OpenAI," and it's only saying that because there's a ton of data on the internet of answers like this that actually come from OpenAI's ChatGPT. That's its label for what it is.
Now, you can override this as a developer. If you have an LLM, you can actually override it, and there are a few ways to do that. For example, let me show you: there's this OLMo model from Allen AI. It's not a top-tier LLM or anything like that, but I like it because it is fully open source — the paper for OLMo and everything else is completely open, which is nice. So here we are looking at its SFT mixture, that is, the data mixture for the fine-tuning stage. This is the conversations dataset.
And the way they solve this for the OLMo model is that, among the roughly 1 million conversations in the mixture, there's an entry called "OLMo 2 hardcoded." If we go there, we see that it's 240 conversations, and they're literally hardcoded. "Tell me about yourself," says the user, and the assistant says: "I'm OLMo, an open language model developed by AI2, the Allen Institute for Artificial Intelligence. I'm here to help," blah blah blah. "What is your name?" and so on. So these are all kinds of cooked-up, hardcoded questions about OLMo 2 and the correct answers to give in each case. If you take 240 conversations like this, put them into your training set, and fine-tune with them, then the model will actually be expected to parrot this stuff later.
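Concretely, you can picture a few of those 240 entries looking roughly like this — the schema below is a sketch, not the dataset's actual format:

```python
# A sketch of hardcoded identity conversations in an SFT mixture.
# The exact schema is illustrative; the OLMo 2 set has 240 such entries.
identity_conversations = [
    [
        {"role": "user", "content": "Tell me about yourself."},
        {"role": "assistant", "content": "I'm OLMo, an open language model "
                                          "developed by AI2, the Allen "
                                          "Institute for Artificial Intelligence."},
    ],
    [
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is OLMo."},
    ],
    # ... 238 more along the same lines
]
# Fine-tune on these, and the model will parrot this identity when asked.
```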
If you don't give it this, then it will probably say it's by OpenAI. And there's one more way to do this: in these conversations between human and assistant, there's sometimes a special message, called the system message, at the very beginning of the conversation. So it's not just human and assistant — there's also a system. And in the system message you can hardcode a reminder to the model: hey, you are a model developed by OpenAI, your name is ChatGPT-4o, you were trained on this date, your knowledge cutoff is this. It basically documents the model a little bit, and it gets inserted into your conversations. So when you go on ChatGPT you see a blank page, but the system message is hidden in there, and those tokens are in the context window. So those are the two ways to program the models to talk about themselves: either through data like this, or through the system message — basically invisible tokens that sit in the context window and remind the model of its identity.
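In message form, the second mechanism looks roughly like this — the wording is invented for illustration, not OpenAI's actual system prompt:

```python
# A sketch of the system-message mechanism; the content here is made up
# and is not OpenAI's real system prompt.
conversation = [
    {"role": "system", "content": "You are ChatGPT, a model developed by "
                                   "OpenAI. Your knowledge cutoff is <date>."},
    {"role": "user", "content": "What model are you, and who built you?"},
    # The assistant's reply is now conditioned on those invisible system tokens.
]
```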
But either way, it's all just cooked up and bolted on in some way; the identity is not actually deeply there in any real sense, as it would be for a human. I want to now continue to the next section, which deals with the computational capabilities — or, I should say, the native computational capabilities — of these models in problem-solving scenarios.
In particular, we have to be very careful with these models when we construct our examples of conversations, and there are a lot of sharp edges here that are elucidative — is that a word? — interesting to look at, when we consider how these models think. So consider the following prompt from a human, and suppose we are building a conversation to enter into our training set; we're going to train the model on it, teaching it how to solve simple math problems. The prompt is: Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost is $13. What is the cost of the apples? A very simple math question. Now, there are two answers here, on the left and on the right. They are both correct: they both say the answer is 3. But one of these two is a significantly better answer for the assistant than the other. If I were the data labeler creating one of these, one of them would be a really terrible answer for the assistant, and the other would be okay. I'd like you to potentially pause the video and think through why one of these two is a significantly better answer than the other. If you use the wrong one, your model could actually end up really bad at math, with bad outcomes. This is something you would be careful about in your labeling documentation when training people to create the ideal responses for the assistant.
Okay, so the key to this question is to realize and remember that when the models are training, and also at inference time, they are working on a one-dimensional sequence of tokens, from left to right. This is the picture I often have in my mind: the token sequence evolving from left to right, and to produce the next token in the sequence, we feed all these tokens into the neural network, and the neural network gives us the probabilities for the next token. This picture here is the exact same picture we saw before, up above, and it comes from the web demo I showed you earlier: the calculation that takes the input tokens at the top, performs the operations of all these neurons, and gives you the probabilities for what comes next.
Now, the important thing to realize is that, roughly speaking, there's a finite number of layers of computation here. For example, this model has only one, two, three layers of what's called attention and MLP. A typical modern state-of-the-art network might have more like 100 layers, but even so, there are only on the order of 100 layers of computation to go from the previous token sequence to the probabilities for the next token. So there's a finite amount of computation per token, and you should think of it as a very small amount. This amount of computation is also almost roughly fixed for every token in the sequence. That's not fully true — the more tokens you feed in, the more expensive the forward pass becomes — but not by much. So, as a good mental model: a fixed amount of compute happens in this box for every single one of these tokens, and this amount of compute cannot possibly be too big, because there aren't that many layers going from top to bottom; there's not that much computation that can happen here. So you can't expect the model to do arbitrary computation in a single forward pass to get a single token. What that means is that we have to distribute our reasoning and our computation across many tokens, because every single token only has a finite amount of computation spent on it. We can't expect too much computation out of the model in any single individual token. Okay: a roughly fixed amount of computation per token.
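To make this concrete, here is a toy numeric sketch of the point — not any real model's architecture, just an illustration that each next-token prediction is one pass through the same small, fixed stack of layers:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n_layers = 50, 16, 3   # toy sizes; real models: ~100k vocab, ~100 layers

E = rng.normal(size=(vocab, d))                     # embedding table
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
U = rng.normal(size=(d, vocab)) / np.sqrt(d)        # unembedding

def next_token_probs(tokens):
    x = E[tokens].mean(axis=0)      # crude stand-in for attention mixing context
    for W in Ws:                    # a fixed, finite stack of layers:
        x = np.maximum(x @ W, 0)    # the same amount of work for every token
    logits = x @ U
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Each new token costs one pass through the same few layers, no matter how
# much "thinking" is supposed to be packed into that token.
print(next_token_probs([3, 7, 9]).shape)  # (50,)
```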
So that's why this answer here is significantly worse. Imagine going from left to right — I copy-pasted it right here: "The answer is $3," etc. Imagine the model having to go from left to right, emitting these tokens one at a time. We're expecting it to say "The answer is," space, dollar sign, and then right here we're expecting it to cram all of the computation of this problem into this single token: it has to emit the correct answer, 3. And once it has emitted the answer 3, we're expecting it to say all the following tokens. But at that point the answer has already been produced, and it's already in the context window for all the tokens that follow. So anything that comes after is just post hoc justification of why this is the answer, because the answer is already created; it's already in the token window. It's not actually being calculated there. So if you answer the question directly and immediately, you are training the model to try to guess the answer in a single token, and that is just not going to work, because of the finite amount of computation that happens per token.
That's why the answer on the right is significantly better: we are distributing the computation across the answer. We're getting the model to slowly come to the answer from left to right, producing intermediate results. We're saying: okay, the total cost of the oranges is $4; so 13 - 4 is 9; and so on. We're creating intermediate calculations, and each one of them is, by itself, not that expensive. We're basically guessing, a little bit, the difficulty the model is capable of in any single one of these individual tokens — and there can never be too much work in any one token computationally, because then the model won't be able to do it later at test time. So we're teaching the model to spread out its reasoning and its computation over the tokens. That way it faces only very simple problems in each token; those add up; and by the time it's near the end, it has all the previous results in its working memory, and it's much easier for it to determine the answer — and here it is: 3. So this is a significantly better label for our computation. The one on the left would be really bad: it teaches the model to try to do all the computation in a single token.
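Side by side, the two candidate labels look roughly like this, paraphrased from the example above:

```python
# The two candidate assistant labels, paraphrased.
bad_label = "The answer is $3."  # crams the whole calculation into one token

good_label = (
    "The oranges cost 2 * $2 = $4. "               # easy intermediate step
    "So the apples cost $13 - $4 = $9. "           # easy intermediate step
    "There are 3 apples, so each costs $9 / 3 = $3. "
    "The answer is $3."                            # answer comes after the work
)
```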
So that's an interesting thing to keep in mind. In your prompts, you usually don't have to think about this explicitly, because the people at OpenAI have labelers who worry about it and make sure the answers are spread out — so OpenAI will kind of do the right thing. When I ask this question of ChatGPT, it's actually going to go very slowly: it will define the variables, set up the equation, and create all these intermediate results. Those are not for you — they are for the model. If the model is not creating these intermediate results for itself, it's not going to be able to reach 3.
I also wanted to show you that it's possible to be a bit mean to the model — we can just ask for things. As an example, I gave it the exact same prompt and said: answer the question in a single token; just immediately give me the answer, nothing else. And it turns out that for this simple prompt, it actually was able to do it in a single go. Well, I think this is two tokens, because the dollar sign is its own token, so the model gave me two tokens rather than one — but it still produced the correct answer, and it did that in a single forward pass of the network.
Now, that's because the numbers here are very simple. So I made it a bit more difficult, to be a bit mean to the model: I said Emily buys 23 apples and 177 oranges, just making the numbers bigger and the problem harder, asking the model to do more computation in a single token. I asked the same thing, and this time it gave me 5 — and 5 is not correct. The model failed to do all of this calculation in a single forward pass of the network: going from the input tokens, in one pass through the network, it couldn't produce the result. Then I said: okay, now don't worry about the token limit, just solve the problem as usual. And then it produces all the intermediate results, it simplifies, and every one of these intermediate calculations is much easier for the model — it's not too much work per token. All of the tokens are correct, and it arrives at the solution, which is 7. It just couldn't squeeze all of that work into a single forward pass of the network.
So I think that's a cute example, and something to think about — again, elucidative in terms of how these models work. The last thing I'd say on this topic is that if I were trying to actually solve this in my day-to-day life, I might not trust that the model did all the intermediate calculations correctly. So probably what I would do is something like this: I would come here and say "use code," because code is one of the possible tools that ChatGPT can use.
Instead of having it do mental arithmetic like this — mental arithmetic I don't fully trust, especially if the numbers get really big — there's no guarantee the model will do this correctly; any one of the intermediate steps might in principle fail. We're using neural networks to do mental arithmetic, kind of like you doing mental arithmetic in your brain: it might just screw up some of the intermediate results. It's actually kind of amazing that it can even do this sort of mental arithmetic — I don't think I could do it in my head. But basically, the model is doing it in its head, and I don't trust that, so I want it to use tools. So you can say things like "use code" — and, I'm not sure what happened there, "use code" — and, like I mentioned, there's a special tool: the model can write code, and I can inspect that the code is correct, so it's not relying on its mental arithmetic. It writes Python and hands it to the Python interpreter, which calculates the result. I would personally trust this a lot more, because the answer came out of a Python program, which I think has far stronger correctness guarantees than the mental arithmetic of a language model.
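For this problem, the generated program might look something like the sketch below. The orange price and total are stand-ins, since those numbers aren't shown in the transcript; the structure of the calculation is the point.

```python
# Sketch of the kind of code the model writes when told to "use code",
# for the harder variant: 23 apples, 177 oranges.
total_cost = 869        # hypothetical total
orange_price = 4        # hypothetical price per orange
n_oranges, n_apples = 177, 23

apple_price = (total_cost - orange_price * n_oranges) / n_apples
print(apple_price)      # 7.0 -- computed by the interpreter, not by the model
```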
So that's another hint: if you have these kinds of problems, you may want to just ask the model to use the code interpreter. And just like we saw with web search, the model has special tokens for calling it: it writes the program, and that program gets sent to a different part of the computer, which actually runs it and brings back the result. The model then gets access to that result and can tell you: okay, the cost of each apple is 7. So that's another kind of tool, and I would use it in practice; it's just less error-prone, I would say. That's why I called this section "models need tokens to think": distribute your computation across many tokens, ask models to create intermediate results, or, whenever you can, lean on tools and tool use instead of letting the models do all of this in their memory. If they try to do it all in their memory, I don't fully trust it, and I prefer to use tools whenever possible.
I want to show you one more example of where this comes up, and that's counting. Models are actually not very good at counting, for the exact same reason: you're asking for way too much in a single individual token. Let me show you a simple example: "How many dots are below?" followed by a bunch of dots. ChatGPT says "There are..." and then tries to solve the problem in a single token. In a single token, it has to count the number of dots in its context window, and it has to do that in a single forward pass of the network — where, as we talked about, there's not that much computation that can happen; think of it as very little. If we look at what the model actually sees — let's go to Tiktokenizer — it sees "how many dots are below," and it turns out that these dots here, this group of, I think, 20 dots, is a single token; then this other group is another token; and then for some reason they break up like this. That has to do with the details of the tokenizer, but the model basically sees a handful of token IDs, and from those token IDs it's expected to count the dots. Spoiler alert: it's not 161; it's actually, I believe, 177.
So here's what we can do instead: we can say "use code." And you might ask why this should work — it's actually kind of subtle and interesting. When I say "use code," I actually expect this to work. Let's see. Okay, 177 — correct. What happens here is that, even though it doesn't look like it, I've broken the problem down into problems that are easier for the model. I know the model can't count — it can't do mental counting — but I know it's actually pretty good at copy-pasting. So when I say "use code," it creates a string in Python, and the task of copy-pasting my input into that string is very simple: the model sees the dots as just these few tokens, and it's easy for it to copy those token IDs over and unpack them into dots inside the string. Then it calls Python's count routine on the string and comes up with the correct answer. The Python interpreter is doing the counting. It's not the model's mental arithmetic doing the counting.
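The generated program amounts to something like this sketch, where the dots below stand in for the ones pasted from the prompt:

```python
# Sketch of the counting trick: the model copy-pastes the dots into a
# string, and the interpreter does the counting.
s = "." * 177           # stand-in for the dots copy-pasted from the prompt
print(s.count("."))     # 177 -- counted by Python, not in the model's head
```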
So it's again a simple example of "models need tokens to think." Don't rely on their mental arithmetic — and that's also why the models are not very good at counting. If you need them to do counting tasks, always ask them to lean on tools.
Now, the models also have many other little cognitive deficits here and there — sharp edges of the technology to be aware of. As an example, the models are not very good at all kinds of spelling-related tasks. And I told you we would loop back around to tokenization: the reason is that the models don't see characters, they see tokens. Their entire world is tokens, these little text chunks, so they don't see characters the way our eyes do, and very simple character-level tasks often fail. For example, I'm giving it the string "ubiquitous" and asking it to print only every third character, starting with the first one. So we start with "u," and then, counting 1, 2, 3, "q" should be next, and so on. And this, I see, is not correct.
Again, my hypothesis is that, number one, the mental arithmetic is failing a little bit here, but number two — and I think this is the more important issue — if you go to Tiktokenizer and look at "ubiquitous," you see it is three tokens. You and I see "ubiquitous" and can easily access the individual letters, because we see them; with the word in the working memory of our visual field, we can easily index into every third letter, and I can do that task. But the model doesn't have access to the individual letters: it sees the word as three tokens. And remember, these models are trained from scratch on the internet, so the model has to discover, from data, how many of which letters are packed into each of these tokens. The reason we even use tokens is mostly efficiency. A lot of people are interested in deleting tokens entirely — we should really have character-level or byte-level models — it's just that that would create very long sequences, and people don't know how to deal with that right now. So while we live in the token world, any kind of spelling task is not actually expected to work super well.
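You can see the token boundaries for yourself with the tiktoken library — the encoding name below is one common choice, and the Tiktokenizer web page shows the same thing interactively:

```python
# Inspect how "ubiquitous" breaks into tokens rather than letters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one common OpenAI encoding
ids = enc.encode("ubiquitous")
print(ids)                                   # a few token IDs, not 10 characters
print([enc.decode([i]) for i in ids])        # the chunks the model actually sees
```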
So, because I know that spelling is not a strong suit due to tokenization, I can again ask the model to lean on tools. I can just say "use code," and I would again expect this to work, because the task of copy-pasting "ubiquitous" into the Python interpreter is much easier, and then we're leaning on the interpreter to manipulate the characters of the string. So when I say "use code" for "ubiquitous": yes, it indexes into every third character, and the actual answer is "uqts," which looks correct to me.
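The whole fix is essentially one line of Python string indexing (a sketch):

```python
# Python indexes characters directly, sidestepping tokenization entirely.
print("ubiquitous"[::3])   # 'uqts' -- every third character, starting at the first
```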
So, again, an example of spelling-related tasks not working very well. A very famous recent example of that is: how many "r"s are there in "strawberry"? This went viral many times. The models now get it correct — they say there are three "r"s in "strawberry" — but for a very long time, all the state-of-the-art models insisted there are only two. And this caused a lot of, you know, ruckus — is that a word? I think so — because why are the models so brilliant, able to solve math olympiad questions, and yet they can't count "r"s in "strawberry"? The answer, which I've built up to slowly, is: number one, the models don't see characters, they see tokens; and number two, they are not very good at counting. So here we are combining the difficulty of seeing characters with the difficulty of counting, and that's why the models struggled with this. By now, honestly, I think OpenAI may have hardcoded the answer, or I'm not sure what they did, but this specific query now works.
So models are not very good at spelling, and there are a bunch of other little sharp edges; I don't want to go into all of them. I just want to show a few examples of things to be aware of when you're using these models in practice, not a comprehensive analysis of all the ways they fall short. There are some jagged edges here and there; we've discussed a few, and some of them make sense, but some will just not make as much sense — you're left scratching your head even if you understand in depth how these models work. A good recent example is the following: the models are not very good at very simple questions like this one, which is shocking to a lot of people, because these models can solve complex math problems and answer PhD-grade physics, chemistry, and biology questions much better than I can, but sometimes they fall short on super simple problems. So here we go: it says 9.11 is bigger than 9.9, and it justifies this in some way — but then at the end, okay, it actually flips its decision. I don't believe this is very reproducible: sometimes it flips its answer, sometimes it gets it right, sometimes it gets it wrong. Let's try again. Okay, "even though it might look larger" — so here it doesn't even correct itself at the end. If you ask many times, sometimes it gets it right, too.
But how is it that the model can do so well on olympiad-grade problems and then fail on very simple problems like this? As I mentioned, this one is a bit of a head-scratcher. It turns out a bunch of people studied this in depth. I haven't actually read the paper, but what I was told by this team is that when you scrutinize the activations inside the neural network — when you look at which features and neurons turn on or off — a bunch of neurons light up that are usually associated with Bible verses. So the model is kind of reminded that these almost look like Bible verse markers, and in a Bible verse setting, 9.11 would come after 9.9. So the model somehow finds it cognitively very distracting that in Bible verses 9.11 would be "greater." Even though it's actually trying to justify the answer with math, it still ends up with the wrong answer. So it basically just doesn't fully make sense, and it's not fully understood, and there are a few jagged issues like that. That's why you should treat this as what it is: a stochastic system that is really magical, but that you can't fully trust. You want to use it as a tool, not as something you just let rip on a problem and then copy-paste the results.
Okay, so we have now covered two major stages of training large language models. We saw that in the first stage, called pre-training, we are basically training on internet documents, and when you train a language model on internet documents, you get what's called a base model — basically an internet document simulator. We saw that this is an interesting artifact: it takes many months to train on thousands of computers, it's a kind of lossy compression of the internet, and it's extremely interesting, but it's not directly useful, because we don't want to sample internet documents — we want to ask questions of an AI and have it respond. For that we need an assistant, and we saw that we can construct an assistant in the process of post-training, specifically supervised fine-tuning, as we call it. In this stage, we saw, it's algorithmically identical to pre-training; nothing changes. The only thing that changes is the dataset. Instead of internet documents, we now create and curate a very nice dataset of conversations — say, a million conversations on all kinds of diverse topics between a human and an assistant. Fundamentally, these conversations are created by humans: humans write the prompts and humans write the ideal responses, based on labeling documentation.
Now, in the modern stack this is not done fully manually by humans — they have a lot of help from these tools. We can use language models to help create these datasets, and that's done extensively, but fundamentally it all still comes from human curation in the end. So we create these conversations; that becomes our dataset; we fine-tune on it, or continue training on it; and we get an assistant. Then we shifted gears and talked about some of the cognitive implications of what this assistant is like. We saw, for example, that the assistant will hallucinate if you don't take some mitigations against it, and we looked at mitigations for those hallucinations. We also saw that the models are quite impressive and can do a lot in their heads, but that they can lean on tools to become better: they can lean on web search to hallucinate less and to bring in more recent information, or on the code interpreter, so the LLM can write some code, actually run it, and see the results.
So those are some of the topics we've looked at so far. What I'd like to do now is cover the last major stage of this pipeline, and that is reinforcement learning. Reinforcement learning is still considered to be under the umbrella of post-training, but it is the third and last major stage: a different way of training language models, usually following as the third step. Inside companies like OpenAI, these are all separate teams: a team doing data for pre-training and a team doing the pre-training itself; a team doing all the conversation generation and a team doing the supervised fine-tuning; and then a team for reinforcement learning as well. It's a kind of handoff of these models: you get your base model, then you fine-tune it to be an assistant, and then you go into reinforcement learning, which we'll talk about now.
So that's kind of like the major flow.
So let's now focus on reinforcement learning, the last major stage of training. Let me first motivate it: why would we want to do reinforcement learning, and what does it look like at a high level? I'd like to motivate the reinforcement learning stage with something you're probably familiar with: going to school. Just as you went to school to become really good at something, we want to take large language models through school. Really, what we're doing is using a few paradigms for giving them knowledge or transferring skills. In particular, when you work with textbooks in school, you'll see three major classes of information. The first is exposition. (By the way, this is a totally random book I pulled from the internet — I think it's organic chemistry or something, I'm not sure.) The important thing is that most of the text, the meat of it, is exposition: background knowledge and so on. As you read through the words of this exposition, you can think of that, roughly, as training on that data. That's why reading through this background knowledge, all this context information, is kind of equivalent to pre-training.
It's where we build a knowledge base of this data and get a sense of the topic. The next major kind of information is problems with their worked solutions. A human expert — in this case, the author of the book — has given us not just a problem, but has also worked through the solution, and the solution is basically equivalent to the ideal response for an assistant: the expert is showing us how to solve the problem in its full form. As we read the solution, we are training on the expert data, and later we can try to imitate the expert; that roughly corresponds to the SFT model. So we've already covered pre-training, and we've already covered this imitation of experts and how they solve problems. The third class of information — and this is where reinforcement learning comes in — is the practice problems. Sometimes you'll see just a single practice problem here, but of course there will usually be many practice problems at the end of each chapter in any textbook. Practice problems, of course, are critical for learning, because what do they get you to do? They get you to practice, to discover ways of solving these problems yourself. In a practice problem, you get the problem description; you're not given the solution, but you are given the final answer, usually in the answer key of the textbook. So you know the final answer you're trying to get to, and you have the problem statement, but you don't have the solution. You are practicing the solution: trying out many different things, seeing what gets you to the final answer best, and discovering how to solve these problems. In that process, you're relying on, number one, the background information, which comes from pre-training, and, number two, maybe a little bit of imitation of human experts, since you can try similar kinds of solutions. So: we've done this and this, and now, in this section, we're going to practice. We're going to be given prompts and the final answers — not the solutions, just the final answers — but we're not going to be given expert solutions. We have to practice and try stuff out. That's what reinforcement learning is about.
Okay, so let's go back to the problem we worked with previously, so we have a concrete example to talk through as we explore the topic. I'm here in Tiktokenizer, partly because I get a text box, which is useful, but also because I want to remind you again that we're always working with one-dimensional token sequences. I actually prefer this view, because it's like the native view of the LLM, if that makes sense: this is what it actually sees — token IDs. Okay: Emily buys 3 apples and 2 oranges; each orange is $2; the total cost of all the fruit is $13; what is the cost of each apple? What I'd like you to appreciate here is that these are four possible candidate solutions, as an example, and they all reach the answer 3.
Now, what I'd like you to appreciate at this point is that if I am the human data labeler creating a conversation to enter into the training set, I don't actually know which of these conversations to add to the dataset. Some of them set up a system of equations; some just talk through it in English; and some skip right through to the solution. If you give ChatGPT this question, for example, it defines a system of variables and does this little thing. What we have to appreciate and differentiate between, though, is this: the first purpose of a solution is to reach the right answer — of course we want to get the final answer, 3; that's the important purpose. But there's a secondary purpose as well, where we're also trying to make the solution nice for the human, because we assume the person wants to see the solution and the intermediate steps, presented nicely, and so on. So there are two separate things going on: number one is the presentation for the human, and number two is actually getting the right answer. For the moment, let's focus only on reaching the final answer. If we only care about the final answer, then which of these is the optimal — the best — solution for the LLM to reach the right answer?
What I'm trying to get at is: we don't know. As a human labeler, I would not know which of these is best. As an example, we saw earlier, when we looked at the token sequences and the mental arithmetic and reasoning, that for each token we can only spend a finite amount of compute, and it's not very large; you should think about it that way. So we can't make too big of a leap in any one token — that's maybe the way to think about it. For example, what's really nice about this one is that it's very few tokens, so it takes a very short time to get to the answer. But right here, when we're doing "13 - 4, divided by 3, equals" — in this one token — we're asking for a lot of computation to happen on that single individual token. So maybe this is a bad example to give to the LLM, because it incentivizes skipping through the calculations very quickly, and the model is going to make mistakes in its mental arithmetic. Maybe it would work better to spread it out more; maybe it would be better to set it up as an equation; maybe it would be better to talk through it. We fundamentally don't know. And we don't know because what is easy or hard for you and me as human labelers is different from what's easy or hard for the LLM. Its cognition is different; the token sequences are differently hard for it. Some token sequences that are trivial for me might be too much of a leap for the LLM — right here, this token would be way too hard. But conversely, many of the tokens I'm creating might be trivial to the LLM, and we're just wasting tokens: why waste all these tokens when this is all trivial? So if the only thing we care about is reaching the final answer, and we separate out the issue of presentation to the human, then we don't actually know how to annotate this example. We don't know what solution to give to the LLM, because we are not the LLM.
This is clear in the case of the math example, but it's actually a very pervasive issue: our knowledge is not the LLM's knowledge. The LLM actually has a ton of PhD-level knowledge of math, physics, chemistry, and whatnot, so in many ways it knows more than I do, and I'm potentially not utilizing that knowledge in its problem solving. But conversely, I might be injecting knowledge into my solutions that the LLM doesn't have in its parameters, and then those become sudden leaps that are very confusing to the model. Our cognitions are different, and I don't really know what to put here if all we care about is reaching the final solution, ideally economically. So, long story short, we are not in a good position to create these token sequences for the LLM. They're useful, by imitation, to initialize the system, but we really want the LLM to discover the token sequences that work for it. It needs to find, for itself, what token sequence reliably gets it to the answer, given the prompt, and it needs to discover that in a process of reinforcement learning, of trial and error.
So let's see how this example would work in reinforcement learning. Okay, we're now back in the Hugging Face inference playground, which lets me very easily call different kinds of models. As an example, at the top right I chose the Gemma 2 two-billion-parameter model. Two billion is very, very small — this is a tiny model — but it's okay. The way reinforcement learning works is actually quite simple: we need to try many different kinds of solutions, and we want to see which solutions work well and which don't. So we take the prompt, run the model, and the model generates a solution; then we inspect the solution, and we know that the correct answer for this one is $3. And indeed the model gets it correct: it says it's $3. That's one attempt at the solution. Now we delete this and rerun it; let's try a second attempt. The model solves it in a slightly different way. Every single attempt will be a different generation, because these models are stochastic systems: remember that at every single token we have a probability distribution, and we're sampling from that distribution, so we end up going down slightly different paths. And so this is a second solution, which also ends in the correct answer.
Now we delete that and go a third time. Okay: again a slightly different solution, but it also gets it correct.
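Why every run differs comes down to the sampling at each token — a toy illustration:

```python
import numpy as np

# Toy illustration of why each attempt differs: at every step we sample
# from the model's next-token distribution instead of taking the maximum.
rng = np.random.default_rng()
tokens = ["a", "b", "c"]            # stand-in vocabulary
probs = [0.5, 0.3, 0.2]             # stand-in next-token probabilities
print(rng.choice(tokens, p=probs))  # a different path on different runs
```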
Now, we can repeat this many times: in practice, you might sample thousands of independent solutions, or even a million, for a single prompt. Some of them will be correct and some of them will not. Basically, what we want to do is encourage the solutions that lead to correct answers. So let's take a look at what that looks like. If we come back over here, here's a cartoon diagram of it: we have a prompt, and then we tried many different solutions in parallel. Some of the solutions go well — they get the right answer, shown in green — and some go poorly and don't reach the right answer, shown in red. Now, this particular problem is unfortunately not the best example, because it's a trivial prompt, and as we saw, even a two-billion-parameter model always gets it right. But let's exercise some imagination and just suppose the green ones are good and the red ones are bad.
Okay, so we generated 15 solutions, and only four of them got the right answer. Now what we want to do is encourage the kinds of solutions that lead to right answers. Whatever token sequences happened in the red solutions, obviously something went wrong along the way somewhere — that was not a good path to take through the solution. And whatever token sequences were in the green solutions, well, things went pretty well, so we want to do more things like that on prompts like this. The way we encourage this kind of behavior in the future is by training on these sequences. But these training sequences are not coming from expert human annotators: no human decided that this is the correct solution. The solution came from the model itself. The model is practicing here: it tried out a few solutions, four of them seem to have worked, and now the model will train on them. This corresponds to a student looking at their own solutions and saying: okay, this one worked really well, so this is how I should be solving these kinds of problems. Now, there are many different ways to tweak the methodology here, but just to get the core idea across, maybe it's simplest to think about taking the single best solution out of these four — say, this one, which is why it's yellow. That would be the solution that not only reached the right answer but maybe had some other nice properties: maybe it was the shortest, or it looked nicest in some way, or whatever other criteria you could think of. We're going to decide that this is the top solution, and we're going to train on it. After the parameter update, the model will be slightly more likely to take this path in this kind of setting in the future. And remember that we're going to run many different diverse prompts across lots of math problems and physics problems and whatever else there might be.
So have in mind tens of thousands of prompts, with thousands of solutions per prompt, all happening kind of at the same time. As we iterate this process, the model is discovering for itself which kinds of token sequences lead it to correct answers. It's not coming from a human annotator. The model is playing in this playground: it knows what it's trying to get to, and it's discovering sequences that work for it — sequences that don't make any mental leaps, that seem to work reliably and statistically, and that fully utilize the knowledge of the model as it has it. So this is the process of reinforcement learning: it's basically guess-and-check. We guess many different types of solutions, we check them, and we do more of what worked in the future. And that is reinforcement learning.
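In pseudocode, the loop sketched above looks roughly like this. It's a heavily simplified sketch: `model.sample`, `model.train_on`, and `extract_answer` are hypothetical stand-ins, and real pipelines add many crucial details around this core.

```python
# A heavily simplified sketch of the guess-and-check RL loop described above.
# model.sample, model.train_on, and extract_answer are hypothetical stand-ins.
for prompt, correct_answer in practice_problems:     # tens of thousands of prompts
    attempts = [model.sample(prompt) for _ in range(1000)]  # thousands of tries
    good = [a for a in attempts if extract_answer(a) == correct_answer]
    if good:
        best = min(good, key=len)          # e.g., keep the shortest correct one
        model.train_on(prompt, best)       # nudge parameters toward that path
```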
learning. So in the context of what came before, we see now that the SFD model, the supervised fine-tuning model, it's still helpful because it still kind of like initializes the model a little bit
into the vicinity of the correct solutions. So it's kind of like a
solutions. So it's kind of like a initialization of um of the model in the sense that it kind of gets the model to you know take solutions like write out solutions and maybe it has an
understanding of setting up a system of equations or maybe it kind of like talks through a solution. So it gets you into the vicinity of correct solutions. But
reinforcement learning is where everything gets dialed in. We really
discover the solutions that work for the model, get the right answers, we encourage them, and then the model just kind of like gets better over time.
Okay, so that is the high-level process for how we train large language models. In short, we train them very similarly to how we train children. The main difference is that children go through chapters of books, doing all the different types of exercises within each chapter, whereas when we train AIs, we do it stage by stage, depending on the type of content. First we do pre-training, which, as we saw, is equivalent to reading all the expository material: we look at all the textbooks at the same time, read all the exposition, and try to build a knowledge base. The second stage is SFT, which is looking at all the worked solutions from human experts across all the textbooks; we get an SFT model that is able to imitate the experts, but does so somewhat blindly — it just does its best guess, trying to statistically mimic expert behavior. And then finally, in the last stage, the RL stage, we do all the practice problems across all the textbooks — only the practice problems — and that's how we get the RL model. So at a high level, the way we train LLMs is very much equivalent to the process we use for training children.
The next point I'd like to make is that these first two stages, pre-training and supervised fine-tuning, have been around for years and are very standard — everyone does them, all the different LLM providers. It is this last stage, the RL training, that is much earlier in its development and not yet standard in the field. This stage is a lot more nascent, and the reason is that I actually skipped over a ton of little details in this process. The high-level idea is very simple — it's trial-and-error learning — but there are a ton of details and little mathematical nuances to exactly how you pick the solutions that are best, how much you train on them, what the prompt distribution is, and how to set up the training run so that this actually works. There are a lot of details and knobs around a core idea that is very, very simple, and getting the details right is not trivial. So a lot of companies — for example OpenAI and other LLM providers — have experimented internally with reinforcement learning fine-tuning for LLMs for a while, but they haven't talked about it publicly; it's all been done inside the companies. That's why the paper from DeepSeek that came out very recently was such a big deal: it's a paper from this company called DeepSeek, in China, and it talked very publicly about reinforcement learning fine-tuning for large language models — how incredibly important it is, and how it brings out a lot of reasoning capabilities in the models. We'll go into this in a second. This paper reinvigorated the public interest in using RL for LLMs and gave a lot of the nitty-gritty details needed to reproduce the results and actually get this stage to work for large language models.
language models. So let me take you briefly through this uh deepseek R1 paper and what happens when you actually correctly apply RL to language models and what that looks like and what that gives you. So the first thing I'll
gives you. So the first thing I'll scroll to is this uh kind of figure two here where we are looking at the improvement in how the models are solving mathematical problems. So this is the accuracy of solving mathematical
problems on the AME accuracy and then we can go to the web page and we can see the kinds of problems that are actually in these um these the kinds of math problems that are being measured here.
So these are simple math problems. You can um pause the video if you like. But
these are the kinds of problems that basically the models are being asked to solve. And you can see that in the
solve. And you can see that in the beginning they're not doing very well.
But then as you update the model with this many thousands of steps, their accuracy kind of continues to climb. So
the models are improving and they're solving these problems with a higher accuracy as you do this trial and error on a large data set of these kinds of problems. And the models are discovering
And the models are discovering how to solve math problems. But even more incredible than the quantitative results, solving these problems with higher accuracy, is the qualitative means by which the model achieves those results. When we scroll down, one of the interesting figures shows that later in the optimization the average length per response goes up. The model is using more tokens to get its higher-accuracy results; it's learning to create very, very long solutions. Why are these solutions so long? We can look at them qualitatively here. What the authors discover is that the solutions get very long partially because of the following. Here's a question, and here's the answer from the model. What the model learns to do, and this is an emergent property of the optimization, it just discovers that this is good for problem solving, is that it starts to do stuff like: "Wait, wait, wait. That's an aha moment I can flag here. Let's re-evaluate this step by step to identify what the correct sum can be." So what is the model doing here? It is basically re-evaluating steps. It has learned that it works better for accuracy to try out lots of ideas: try something from different perspectives, retrace, reframe, backtrack. It's doing a lot of the things that you and I do when solving mathematical problems. But it's rediscovering what happens in your head, not what you put down in the written solution. There is no human who could hardcode this stuff into the ideal assistant response. This is something that can only be discovered in the process of reinforcement learning, because you wouldn't know what to put here; it just turns out to work for the model, and it improves its accuracy in problem solving. So the model learns what we call the chains of thought in your head. It's an emergent property of the optimization, and it's what's bloating up the response lengths, but it's also what's increasing the accuracy of the problem solving.
So what's incredible here is that the model is discovering ways to think. It's learning what I like to call cognitive strategies: how you manipulate a problem, how you approach it from different perspectives, how you pull in analogies, how you try many different things over time, how you check a result from different angles, how you solve problems. But here it's discovered by the RL. It is extremely incredible to see this emerge in the optimization without having to hardcode it anywhere. The only thing we've given it are the correct answers, and this comes out from trying to just solve them correctly, which is incredible.
Now let's go back to the problem that we've been working with, and let's take a look at what it would look like for this kind of a model, what we call a reasoning or thinking model, to solve it. Okay, so recall that this is the problem we've been working with. When I pasted it into ChatGPT (GPT-4o), I got the earlier kind of response. Let's take a look at what happens when you give this same query to what's called a reasoning or thinking model, a model that was trained with reinforcement learning. The model described in this paper, DeepSeek R1, is available on chat.deepseek.com, so the company that developed it is hosting it. You have to make sure that the DeepThink button is turned on to get the R1 model, as it's called. We can paste the problem here and run it. So let's take a look at what happens now and what the output of the model is.
Okay, so here's what it says. Previously, what we got was basically an SFT approach, a supervised finetuning approach, which is like mimicking an expert solution. This is what we get from the RL model: "Okay, let me try to figure this out. So Emily buys three apples and two oranges. Each orange costs $2. The total is $13. I need to find out..." and so on. As you're reading this, you can't escape the feeling that this model is thinking, and it's definitely pursuing the solution. It derives that each apple must cost $3. And then it says, "Wait a second, let me check my math again to be sure," and it tries it from a slightly different perspective. And then it says, "Yep, all that checks out. I think that's the answer. I don't see any mistakes. Let me see if there's another way to approach the problem, maybe setting up an equation. Let the cost of one apple be A, then..." and so on. "Yep, same answer, so definitely each apple is $3. All right, confident that that's correct." And then, once it has done the thinking process, it writes up the nice solution for the human. So the first part is more about the correctness aspect, and this part is more about the presentation aspect, where it writes everything out nicely and boxes in the correct answer at the bottom.
What's incredible about this is that we get to see the thinking process of the model, and this is what's coming from the reinforcement learning process. This is what's bloating up the length of the token sequences: the models are doing thinking and trying different ways. This is what's giving you higher accuracy in problem solving. And this is where we are seeing these aha moments, these different strategies, and these ideas for how you can make sure that you're getting the correct answer.
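As a quick sanity check on the arithmetic the model is doing (the numbers come from the problem above; the code is just my illustration):

```python
# Emily buys 3 apples and 2 oranges; each orange costs $2; the total is $13.
total, oranges_cost = 13, 2 * 2
apple_price = (total - oranges_cost) / 3
print(apple_price)  # 3.0, so each apple costs $3, matching the model's answer
```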
The last point I wanted to make is that some people are a little bit nervous about putting very sensitive data into chat.deepseek.com, because this is a Chinese company, so people are a little bit careful and cagey about that. But DeepSeek R1 is a model that was released openly by this company; it's an open-source, or more precisely open-weights, model, available for anyone to download and use. You will not be able to run the full model at full precision on a MacBook or another local device, because it is a fairly large model, but many companies host the full, largest model. One of those companies that I like to use is called together.ai.
So when you go to together.ai, you sign up, you go to playgrounds, and you can select DeepSeek R1 in the chat; there are many other kinds of models you can select here, all state-of-the-art models. This is similar to the Hugging Face inference playground that we've been playing with so far, but together.ai will usually host all the state-of-the-art models. So select DeepSeek R1. You can ignore a lot of these settings; I think the defaults will often be okay. And we can put in our problem. Because the model was released openly by DeepSeek, what you're getting here should be basically equivalent to what you get on chat.deepseek.com. Because of the randomness in the sampling, we're going to get something slightly different, but in principle this should be identical in terms of the power of the model, and you should see the same things quantitatively and qualitatively, even though this hosting is coming from an American company.
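If you'd rather query a hosted open-weights model from code than from the playground, inference providers typically expose an OpenAI-compatible API. A minimal sketch, assuming together.ai's endpoint and the model identifier `deepseek-ai/DeepSeek-R1` (check the provider's docs for the exact values):

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_TOGETHER_API_KEY",
)
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # assumed model identifier
    messages=[{"role": "user", "content": "Emily buys 3 apples and 2 oranges. "
               "Each orange costs $2. The total is $13. What does each apple cost?"}],
)
print(resp.choices[0].message.content)  # reasoning trace, then the final answer
```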
So that's DeepSeek, and that's what's called a reasoning model. Now, when I go back to ChatGPT, some of the models that you'll see in the drop-down here, like o1, o3-mini, o3-mini-high, and so on, are described as using advanced reasoning. What "uses advanced reasoning" refers to is the fact that they were trained by reinforcement learning, with techniques very similar to those of DeepSeek R1, per public statements of OpenAI employees. So these are thinking models trained with RL. The models like GPT-4o or GPT-4o mini that you're getting in the free tier you should think of as mostly SFT models, supervised finetuning models. They don't actually do this kind of thinking as you see in the RL models, and even though there's a little bit of reinforcement learning involved with these models (I'll get into that in a second), they are mostly SFT models; I think you should think about it that way. So, in the same way as before, we can pick one of the thinking models, say o3-mini-high. These models, by the way, might not be available to you unless you pay a ChatGPT subscription of either $20 per month or $200 per month for some of the top models. So we can pick a thinking model and run it.
Now, what's going to happen is it's going to say "reasoning" and start to do stuff like this. And what we're seeing here is not exactly what we saw with DeepSeek. Even though under the hood the model produces these kinds of chains of thought, OpenAI chooses not to show the exact chains of thought in the web interface; it shows little summaries of them. OpenAI does this, I think, partly because they are worried about what's called a distillation risk: someone could come in and try to imitate those reasoning traces and recover a lot of the reasoning performance just by imitating the chains of thought. So they hide them and only show little summaries. So you're not getting exactly what you would get in DeepSeek with respect to the reasoning itself, and then they write up the solution.
So these are kind of equivalent, even though we're not seeing the full under-the-hood details. In terms of performance, these models and the DeepSeek models are currently roughly on par. It's hard to tell because of the evaluations, but if you're paying $200 per month to OpenAI, I believe some of those models currently still look better. DeepSeek R1, though, is for now a very solid choice for a thinking model, and it's available to you either on this website or any other website, because the model is open weights; you can just download it.
So those are thinking models. What is the summary so far? Well, we've talked about reinforcement learning and the fact that thinking emerges in the process of the optimization when we run RL on many math and code problems that have verifiable solutions, where there's an answer like 3, and so on. These thinking models are accessible, for example, on DeepSeek, or at any inference provider like together.ai by choosing DeepSeek there. They are also available in ChatGPT under any of the o1 or o3 models. But the GPT-4o models and so on are not thinking models; you should think of them as mostly SFT models. Now, if you have a prompt that requires advanced reasoning, you should probably use some of the thinking models, or at least try them out. But empirically, for a lot of my use, when you're asking a simpler question, like a knowledge-based question, this might be overkill; there's no need to think for 30 seconds about some factual question. So for that I will sometimes default to just GPT-4o. Empirically, about 80 to 90% of my use is just GPT-4o, and when I come across a very difficult problem, in math or code and so on, I will reach for the thinking models, but then I have to wait a bit longer because they are thinking. You can access these on ChatGPT and on DeepSeek.
Also, I wanted to point out aistudio.google.com. Even though it looks really busy, really ugly, because Google is just unable to do this kind of stuff well (it's like, what is happening?), if you choose Model and pick Gemini 2.0 Flash Thinking Experimental 01-21, that's also an early experimental thinking model by Google. We can go here, give it the same problem, and click run, and this thinking model will also do something similar and come out with the right answer. So basically Gemini also offers a thinking model. Anthropic currently does not offer a thinking model. But this is kind of the frontier of development of these LLMs. I think RL is this new exciting stage, but getting the details right is difficult, and that's why all these thinking models are currently experimental, as of very early 2025. But this is the frontier of development: pushing the performance on these very difficult problems using reasoning that is emergent in these optimizations.
One more connection that I wanted to bring up is that the discovery that reinforcement learning is an extremely powerful way of learning is not new to the field of AI. One place where we've already seen this demonstrated is in the game of Go. Famously, DeepMind developed the system AlphaGo (you can watch a movie about it), where the system learns to play Go against top human players. When we go through the paper underlying AlphaGo and scroll down, we find a really interesting plot that I think is familiar to us, and that we are rediscovering in the more open domain of arbitrary problem solving instead of the closed, specific domain of the game of Go. What they saw, and we're going to see this in LLMs as well as this becomes more mature, is the following. The plot shows the Elo rating of playing Go, with a line for Lee Sedol, an extremely strong human player, and it compares the strength of a model trained by supervised learning against a model trained by reinforcement learning. The supervised learning model is imitating human expert players: if you just take a huge number of games played by experts and try to imitate them, you will get better, but then you top out, and you never quite get better than the top players in the game of Go, like Lee Sedol. You're never going to reach there, because you're just imitating human players; you can't fundamentally go beyond a human player if you're just imitating human players.
But the process of reinforcement learning is significantly more powerful. In reinforcement learning for the game of Go, the system plays moves that empirically and statistically lead to winning the game. AlphaGo is a system that plays against itself, using reinforcement learning to create rollouts. It's the exact same diagram as before, except there's no prompt; it's just the fixed game of Go. It's trying out lots of plays, and then the games that lead to a win, instead of a specific answer, are reinforced; they're made stronger. So the system is learning the sequences of actions that empirically and statistically lead to winning the game. Reinforcement learning is not constrained by human performance; it can do significantly better and overcome even top players like Lee Sedol. They probably could have run this longer, and they just chose to crop it at some point because this costs money, but it is a very powerful demonstration of reinforcement learning. And we're only starting to see hints of this diagram in large language models for reasoning problems. We're not going to get too far by just imitating experts. We need to go beyond that: set up these little game environments and let the system discover reasoning traces, or ways of solving problems, that are unique and that just work well.
Now, on this aspect of uniqueness, notice that when you're doing reinforcement learning, nothing prevents you from veering off the distribution of how humans play the game. When we go back to the AlphaGo search here, one of the suggested queries is "move 37". Move 37 in AlphaGo refers to a specific point in time where AlphaGo played a move that no human expert would play; the probability of this move being played by a human was evaluated to be about 1 in 10,000. So it's a very rare move, but in retrospect it was a brilliant move. AlphaGo, in the process of reinforcement learning, discovered a strategy of playing that was unknown to humans but is, in retrospect, brilliant. I recommend the YouTube video "Lee Sedol vs AlphaGo Move 37 reactions and analysis"; this is what it looked like when AlphaGo played this move: "That's a very, that's a very surprising move." "I thought it was a mistake." Basically, people were freaking out, because it's a move that a human would not play but AlphaGo played, because in its training this move seemed to be a good idea. It just happens not to be the kind of thing humans would do. And that, again, is the power of reinforcement learning.
In principle, we can see the equivalent of that if we continue scaling this paradigm in language models, and what that looks like is kind of unknown. What does it mean to solve problems in a way that even humans would not be able to reach? How can you be better at reasoning or thinking than humans? How can you go beyond a thinking human? Maybe it means discovering analogies that humans would not be able to create. Maybe it's a new thinking strategy; it's kind of hard to think through. Maybe it's a wholly new language, one that is not even English; maybe the model discovers its own language that is a lot better for thinking, because it is not constrained to stick with English. So maybe it picks a different language to think in, or it discovers its own language. In principle, the behavior of the system is a lot less defined: it is free to do whatever works, and it is free to slowly drift from the distribution of its training data, which is English. But all of that can only happen if we have a very large, diverse set of problems in which these strategies can be refined and perfected. A lot of the frontier LLM research going on right now is trying to create those kinds of prompt distributions, large and diverse. These are all like game environments in which the LLMs can practice their thinking. It's like writing practice problems: we have to create practice problems for all the domains of knowledge. And if we have tons of practice problems, the models will be able to reinforcement-learn on them and produce these kinds of improvement curves, but in the domain of open thinking instead of a closed domain like the game of Go.
There's one more section within reinforcement learning that I wanted to cover, and that is learning in unverifiable domains. So far, all of the problems that we've looked at are in what's called verifiable domains: any candidate solution can be scored very easily against a concrete answer. For example, the answer is 3, and we can very easily score solutions against that answer. Either we require the models to box in their answers, and then we just check whether whatever is in the box equals the answer, or you can use what's called an LLM judge: the LLM judge looks at a solution, is given the answer, and scores the solution on whether it's consistent with that answer. LLMs at the current capability are empirically good enough to do this fairly reliably, so we can apply those kinds of techniques as well. In any case, we have a concrete answer, we're just checking solutions against it, and we can do this automatically, with no humans in the loop.
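Here is a minimal sketch of what automatic scoring in a verifiable domain can look like. The regex-based boxed-answer check is a simplification, and the LLM-judge part only shows the prompt you might send to a grader model; `ask_judge` is a hypothetical stand-in for any chat-completion call.

```python
import re

def score_by_equality(solution: str, correct_answer: str) -> float:
    """Reward 1.0 if the model's \\boxed{...} answer matches, else 0.0."""
    m = re.search(r"\\boxed\{([^}]*)\}", solution)
    return 1.0 if m and m.group(1).strip() == correct_answer else 0.0

def judge_prompt(solution: str, correct_answer: str) -> str:
    """Prompt for an LLM judge; useful when answers are free-form text."""
    return (
        f"Reference answer: {correct_answer}\n"
        f"Candidate solution: {solution}\n"
        "Does the candidate reach the reference answer? Reply YES or NO."
    )

print(score_by_equality(r"... so the answer is \boxed{3}", "3"))  # 1.0
# score = 1.0 if ask_judge(judge_prompt(sol, "3")) == "YES" else 0.0  # hypothetical
```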
The problem is that we can't apply this strategy in what's called unverifiable domains. These are usually, for example, creative writing tasks: write a joke about pelicans, write a poem, summarize a paragraph, things like that. In these kinds of domains, it becomes harder to score the different solutions to a problem. For example, for writing a joke about pelicans, we can generate lots of different jokes; that's fine. We can go to ChatGPT and get it to generate a joke about pelicans: "...so much stuff in their beaks, because they don't belican in backpacks." What? Okay, we can try something else: "Why don't pelicans ever pay for their drinks? Because they always bill it to someone else." Haha. Okay, so these models are obviously not very good at humor. Actually, I think that's pretty fascinating, because I think humor is secretly very difficult, and the models don't really have the capability yet. In any case, you could imagine creating lots of jokes. The problem we're facing is: how do we score them? In principle, we could of course get a human to look at all these jokes, just like I did right now. The problem is that in reinforcement learning you're going to be doing many thousands of updates, for each update you want to be looking at, say, thousands of prompts, and for each prompt you potentially want to be looking at hundreds or thousands of different generations. There are just way too many of these to look at. In principle, you could have a human inspect all of them, score them, decide that this one is funny and that one is funny, and we could train on them to get the model to become slightly better at jokes, in the context of pelicans at least. The problem is that it's just way too much human time. This is an unscalable strategy; we need some kind of automatic strategy for doing this.
One solution to this was proposed in the paper that introduced what's called reinforcement learning from human feedback (RLHF). This was a paper from OpenAI at the time, and many of its authors are now co-founders of Anthropic. It proposed an approach for doing reinforcement learning in unverifiable domains. Let's take a look at how that works. Here is the cartoon diagram of the core ideas involved. As I mentioned, the naive approach is: if we just had infinite human time, we could run RL in these domains just fine. For example, with infinite humans, and these are just cartoon numbers, we could do 1,000 updates, where each update is on 1,000 prompts, and for each prompt we score 1,000 rollouts. We can run RL with that kind of setup; the problem is that along the way I would need to ask a human to evaluate a joke a total of one billion times. That's a lot of people looking at really terrible jokes, so we don't want to do that. Instead, we want to take the RLHF approach, where the core trick is indirection.
So we're going to involve humans just a little bit. The way we cheat is that we train a whole separate neural network that we call a reward model, and this neural network imitates human scores. We ask humans to score rollouts, then we imitate those human scores with a neural network, which becomes a kind of simulator of human preferences. And now that we have a neural network simulator, we can do RL against it: instead of asking a real human for their score of a joke, we ask a simulated human. Once we have a simulator, we're off to the races, because we can query it as many times as we want to. It's an entirely automatic process, and we can now do reinforcement learning with respect to the simulator. The simulator, as you might expect, is not going to be a perfect human, but if it's at least statistically similar to human judgment, then you might expect this to do something, and in practice, indeed, it does.
So once we have a simulator, we can do RL, and everything works great. Let me show you a cartoon diagram of what this process looks like; the details are not super important, it's just the core idea of how this works. Here I have a cartoon diagram of a hypothetical example of what training the reward model would look like. We have a prompt, like "write a joke about pelicans," and then five separate rollouts, five different jokes. Now, the first thing we do is ask a human to order these jokes from best to worst. Say this human thought this joke was the best, the funniest, so it's the number-one joke, then number two, number three, number four, and number five, the worst joke. We ask humans to order rather than give scores directly because ordering is a bit of an easier task: it's easier for a human to give an ordering than precise scores. That ordering is now the supervision for the model; the human has ordered the jokes, and that is their contribution to the training process.
Separately, we ask the reward model for its scores on these jokes. The reward model is a whole separate neural network, completely separate, and it's also probably a transformer, but it's not a language model in the sense of generating diverse language; it's just a scoring model. The reward model takes two inputs: number one, the prompt, and number two, a candidate joke. So here, for example, the reward model would be taking this prompt and this joke. The output of the reward model is a single number, thought of as a score, which can range, say, from 0 to 1, where 0 is the worst score and 1 is the best.
Here are some examples of what a hypothetical reward model at some stage of training might give as scores for these jokes: 0.1 is a very low score, 0.8 is a really high score, and so on. Now we compare the scores given by the reward model with the ordering given by the human. There's a precise mathematical way to calculate this: you set up a loss function, compute a correspondence between the scores and the ordering, and update the model based on it. But let me just give you the intuition. For the second joke here, the human thought it was the funniest, and the model kind of agreed (0.8 is a relatively high score), but the score should have been even higher, so after an update of the network we would expect this score to grow, to something like 0.81. For this other one, they're in massive disagreement: the human ranked it number two, but the score is only 0.1, so this score needs to be much higher, and after an update on this supervision it might grow to something like 0.15. And this one the human thought was the worst joke, but the model gave it a fairly high number, so you would expect it to come down after the update, to maybe 0.35 or so. So we're doing what we did before: we're slightly nudging the model's predictions using a neural network training process, trying to make the reward model's scores consistent with the human ordering.
As we update the reward model on human data, it becomes a better and better simulator of the scores and orderings that humans provide, and it becomes the simulator of human preferences that we can then do RL against. Critically, we're not asking humans to look at a joke one billion times; maybe we look at 1,000 prompts, with five rollouts each, so about 5,000 jokes in total that humans have to look at, and they just give the ordering. Then we train the model to be consistent with that ordering. I'm skipping over the mathematical details, but the high-level idea is that the reward model gives us scores, we have a way of training it to be consistent with human orderings, and that's how RLHF works. So that is the rough idea: we train simulators of humans and do RL with respect to those simulators. Now I want to talk first about the upside of reinforcement learning from human feedback.
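For the curious, the mathematical details being skipped usually amount to a pairwise ranking loss. The sketch below is my illustration, not the paper's exact recipe: the real reward model is a transformer over tokens (here a crude bag-of-characters stand-in), trained with a Bradley-Terry style objective on "human preferred A over B" pairs drawn from the orderings.

```python
import torch
import torch.nn as nn

# Stand-in reward model: embeds (prompt + joke) as a bag of character counts.
# A real reward model is a transformer; this is just enough to run the loss.
class TinyRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, text: str) -> torch.Tensor:
        feats = torch.zeros(256)
        for ch in text:
            feats[ord(ch) % 256] += 1.0
        return self.net(feats)  # one scalar score per input

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Each pair says: humans ranked `better` above `worse` for this prompt.
pairs = [("write a joke about pelicans", "joke ranked #1", "joke ranked #5")]

for prompt, better, worse in pairs:
    s_better, s_worse = rm(prompt + better), rm(prompt + worse)
    # Pairwise logistic (Bradley-Terry) loss: push s_better above s_worse.
    loss = -torch.nn.functional.logsigmoid(s_better - s_worse).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# torch.sigmoid(rm(...)) would squash scores into [0, 1], like the cartoon.
```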
The first thing is that RLHF allows us to run reinforcement learning, which we know is an incredibly powerful set of techniques, in arbitrary domains, including unverifiable ones: things like summarization, poem writing, joke writing, or any other creative writing, in domains outside of math and code. Empirically, when we apply RLHF, we see that it improves the performance of the model. I have a guess for why that might be, but I don't think it's super well established exactly why; you can empirically observe that when you do RLHF correctly, the models you get are just a little bit better, but as to why is not as clear. Here's my best guess: it is possibly mostly due to the discriminator-generator gap. What that means is that, in many cases, it is significantly easier for humans to discriminate than to generate. In particular, in supervised fine-tuning (SFT), we're asking humans to generate the ideal assistant response, and in many cases, as I've shown, the ideal response is simple to write, but in many cases it's not. For example, in summarization, or poem writing, or joke writing, how are you, as a human labeler, supposed to give the ideal response? It requires creative human writing. RLHF sidesteps this, because we get to ask people a significantly easier question as data labelers: they're not asked to write poems directly, they're just given five poems from the model and asked to order them. That's a much easier task for a human labeler. What I think this buys you is higher-accuracy data, because we're not asking people to do the generation task, which can be extremely difficult; we're not asking them to do creative writing. We're just asking them to distinguish between creative writings and find the ones that are best. That ordering is the signal humans provide, their input into the system, and then RLHF discovers the kinds of responses that would be graded well by humans. That step of indirection allows the models to become even better. So that is the upside of RLHF: it allows us to run RL, it empirically results in better models, and it allows people to contribute their supervision without having to do extremely difficult tasks like writing ideal responses.
Unfortunately, RLHF also comes with significant downsides. The main one is that we are doing reinforcement learning not with respect to humans and actual human judgment, but with respect to a lossy simulation of humans, and this lossy simulation could be misleading. It's just a model outputting scores, and it might not perfectly reflect the opinion of an actual human with an actual brain in all the possible cases. That's number one. But there's something even more subtle and devious going on that really dramatically holds back RLHF as a technique we can scale to significantly smarter systems, and that is that reinforcement learning is extremely good at discovering ways to game the model, to game the simulation. The reward model we're constructing here, the one that gives the scores, is a transformer: a massive neural net with billions of parameters that imitates humans, but in a simulated way. The problem is that these are massive, complicated systems; there are a billion parameters here outputting a single score. It turns out there are ways to game these models: you can find inputs that were not part of their training set, and these inputs inexplicably get very high scores, but in a fake way. Very often, if you run RLHF for very long, say 10,000 updates, which is a lot of updates, you might expect your jokes to keep getting better, real bangers about pelicans, but that's not exactly what happens. What happens is that in the first few hundred steps the jokes about pelicans probably improve a little bit, and then they dramatically fall off a cliff and you start to get extremely nonsensical results. For example, the top-scoring joke about pelicans starts to be something like "the the the the the," which makes no sense. When you look at it, why should this be a top joke? But when you take "the the the the the" and plug it into your reward model, where you'd expect a score of zero, the reward model loves it: it will tell you this is a score of 1.0, a top joke. That makes no sense, but it happens because these models are just simulations of humans; they are massive neural nets, and you can find inputs that get into parts of the input space that give nonsensical results.
These examples are what's called adversarial examples. I'm not going to go into the topic too much, but they are adversarial inputs to the model: specific little inputs that slip between the nooks and crannies of the model and get nonsensical results to the top. Now, here's what you might imagine doing. You say, okay, "the the the the the" is obviously not a score-of-one joke, it's obviously terrible, so let's add it to the dataset with an ordering that is extremely bad. And indeed, your model will learn that "the the the the the" should have a very low score, and it will give it a score of zero. The problem is that there will always be a basically infinite number of nonsensical adversarial examples hiding in the model. Even if you iterate this process many times, keep adding nonsensical examples to your reward model data, and give them very low scores, you'll never win the game. You can do many, many rounds of this, and reinforcement learning, if you run it long enough, will always find a way to game the model: it will discover adversarial examples and get really high scores with nonsensical results. Fundamentally, this is because our scoring function is a giant neural net, and RL is extremely good at finding ways to trick it. So, long story short: you run RLHF for maybe a few hundred updates, the model gets better, and then you have to crop it and you're done. You can't run much longer against this reward model, because the optimization will start to game it; you crop it, you call it, and you ship it. You can improve the reward model, but you'll come across these situations again eventually. So what I usually say is that RLHF is not RL. What I mean is that RLHF is RL, obviously, but it's not RL in the magical sense; it's not RL that you can run indefinitely.
Compare that to problems where you're getting a concrete correct answer: you cannot game those as easily. You either got the correct answer or you didn't, and the scoring function is much, much simpler; you're just looking at the boxed answer and seeing whether the result is correct. It's very difficult to game these functions, but gaming a reward model is possible. In verifiable domains, you can run RL indefinitely: you could run for tens of thousands or hundreds of thousands of steps and discover all kinds of crazy strategies, ones we might never even think of, for performing really well on these problems. In the game of Go, there's no way to game winning or losing: we have a perfect simulator, we know where all the stones are placed, and we can calculate whether someone has won. So you can do RL indefinitely, and you can eventually beat even Lee Sedol. But with gameable models like a reward model, you cannot repeat this process indefinitely. So I see RLHF as not quite real RL, because the reward function is gameable. It's more in the realm of a little fine-tuning: a little improvement, but not something fundamentally set up correctly, where you can add more compute, run for longer, and get much better, magical results. It's not RL in that sense; it's not RL in the sense that it lacks the magic. It can fine-tune your model and get better performance, and indeed, if we go back to ChatGPT, the GPT-4o model has gone through RLHF, because it works well. But it's just not RL in the same sense. RLHF is a little fine-tune that slightly improves your model; that's maybe the way I would think about it.
Okay, so that's most of the technical content I wanted to cover. I took you through the three major stages and paradigms of training these models: pre-training, supervised finetuning, and reinforcement learning. And I showed you that they loosely correspond to processes we already use for teaching children. In particular, we talked about pre-training being like the basic knowledge acquisition of reading exposition, supervised fine-tuning being the process of looking at lots and lots of worked examples and imitating experts, and reinforcement learning being the practice problems. The only difference is that we now have to effectively write textbooks for LLMs and AIs across all the disciplines of human knowledge, and in all the cases where we would like them to work, like code and math and basically all the other disciplines. We're in the process of writing those textbooks, refining all the algorithms I've presented at a high level, and then, of course, doing a really good job at executing the training of these models at scale and efficiently. I didn't go into too many details, but these are extremely large and complicated distributed jobs that have to run over tens of thousands, or even hundreds of thousands, of GPUs, and the engineering that goes into this is really at the state of the art of what's possible with computers at that scale. I didn't cover that aspect too much, but a very serious engineering endeavor underlies all these very simple algorithms.
I also talked a little bit about the theory of mind of these models. The thing I want you to take away is that these models are really good and extremely useful as tools for your work, but you shouldn't trust them fully, and I showed you some examples of why. Even though we have mitigations for hallucinations, the models are not perfect; they will still hallucinate. It's gotten better over time and will continue to get better, but they can hallucinate.
In addition to that, I covered what I call the Swiss cheese model of LLM capabilities that you should have in your mind: the models are incredibly good across so many different disciplines, but then fail almost randomly in some unique cases. For example, what is bigger, 9.11 or 9.9? The model may not know, yet it can simultaneously turn around and solve Olympiad questions. That is a hole in the Swiss cheese, and there are many of them; you don't want to trip over them. So don't treat these models as infallible. Check their work, use them as tools, use them for inspiration, use them for the first draft, but work with them as tools and be ultimately responsible for the product of your work.
And that's roughly what I wanted to talk about: this is how they're trained, and this is what they are. Let's now turn to some of the future capabilities of these models, probably what's coming down the pipe, and also where you can find these models. I have a few bullet points on some of the things you can expect. The first thing you'll notice is that the models will very rapidly become multimodal. Everything I've talked about above concerned text, but very soon we'll have LLMs that can not just handle text but also operate natively and easily over audio, so they can hear and speak, and over images, so they can see and paint. We're already seeing the beginnings of all of this, but it will all be done natively inside the language model, which will enable natural conversations. Roughly speaking, the reason this is actually no different from everything we've covered above is that, as a baseline, you can tokenize audio and images and apply the exact same approaches we've talked about. It's not a fundamental change; we just have to add some tokens. As an example, for tokenizing audio, we can look at slices of the spectrogram of the audio signal, tokenize those, add more tokens that now represent audio, and just add them into the context windows and train on them like above. The same for images: we can cut images into patches and tokenize the patches separately, and then an image is just a sequence of tokens. This actually kind of works, and there's a lot of early work in this direction. So we can create streams of tokens representing audio and images as well as text, intersperse them, and handle them all simultaneously in a single model. So that's one example of multimodality.
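As a rough illustration of the "just add tokens" idea, here is a sketch of turning an audio clip and an image into discrete token IDs. The quantization scheme here, naive binning of spectrogram slices and image patches, is my simplification; real systems learn codebooks (VQ-style) rather than hashing like this.

```python
import numpy as np

def audio_to_tokens(waveform: np.ndarray, n_fft=256, vocab=1024) -> list[int]:
    """Slice a crude spectrogram along time; map each slice to a token ID."""
    n_frames = len(waveform) // n_fft
    frames = waveform[: n_frames * n_fft].reshape(n_frames, n_fft)
    spectro = np.abs(np.fft.rfft(frames, axis=1))   # one slice per frame
    # Naive quantizer: hash each slice's energy profile into the vocab.
    return [int(s.sum() * 1e3) % vocab for s in spectro]

def image_to_tokens(img: np.ndarray, patch=16, vocab=1024) -> list[int]:
    """Cut an HxW image into patches; map each patch to a token ID."""
    h, w = img.shape[0] // patch, img.shape[1] // patch
    tokens = []
    for i in range(h):
        for j in range(w):
            p = img[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            tokens.append(int(p.mean() * vocab) % vocab)
    return tokens

audio = np.random.randn(16000)    # 1 second of fake audio
image = np.random.rand(64, 64)    # fake grayscale image
stream = audio_to_tokens(audio) + image_to_tokens(image)  # one mixed sequence
print(len(stream), "tokens, ready to intersperse with text tokens")
```

Once everything is token IDs in one stream, the training recipe is the same next-token prediction we covered for text.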
Second, something people are very interested in: currently, most of the work involves handing individual tasks to the models on a silver platter, "please solve this task for me," and the model does that little task, but it's still up to us to organize a coherent execution of tasks to perform whole jobs. The models are not yet at the capability required to do this in a coherent, error-correcting way over long periods of time; they're not able to fully string together tasks to perform longer-running jobs. But they're getting there, and this is improving over time. What's probably going to happen is that we'll start to see what are called agents, which perform tasks over time while you supervise them, watch their work, and receive progress reports once in a while. We're going to see longer-running agentic tasks that don't take just a few seconds of response, but many tens of seconds, or even minutes or hours. These models are not infallible, though, as we talked about above, so all of this will require supervision. For example, in factories people talk about the human-to-robot ratio for automation; I think we're going to see something similar in the digital space, where we'll talk about human-to-agent ratios, with humans becoming much more like supervisors of agentic tasks in the digital domain.
Next, I think everything is going to become a lot more pervasive and invisible, integrated into tools and present everywhere. And in addition, as a separate bullet point, computer use: right now these models aren't able to take actions on your behalf, but if you saw ChatGPT launch Operator, that's one early example, where you can actually hand off control to the model to perform keyboard and mouse actions for you. That's also something I think is very interesting.
The last point I have here is a general comment that there's still a lot of research to potentially do in this domain. One example is something along the lines of test-time training. Remember that everything we've done above has two major stages: first the training stage, where we tune the parameters of the model to perform tasks well, and then, once we have the parameters, we fix them and deploy the model for inference. From there, the model is fixed; it doesn't change anymore, and it doesn't learn from anything it does at test time. It's a fixed set of parameters, and the only thing that changes is the tokens inside the context window. So the only kind of test-time learning the model has access to is the in-context learning of its dynamically adjustable context window, depending on what it's doing at test time. But I think this is still different from humans, who actually can learn depending on what they're doing; especially when you sleep, for example, your brain is updating your parameters, or something like that. There's no equivalent of that currently in these models and tools, so there are a lot of wonkier ideas, I think, still to be explored. In particular, I think this will be necessary because the context window is a finite and precious resource, especially once we start to tackle very long-running multimodal tasks: if we're putting in videos, these token windows will start to grow extremely large, not just thousands or hundreds of thousands of tokens, but significantly beyond that. The only trick we have available right now is to make the context windows longer, but I think that approach by itself will not scale to actual long-running multimodal tasks over time. So new ideas are needed in some of those cases and domains where tasks will require very long context.
So those are some examples of things you can expect coming down the pipe. Let's now turn to where you can actually keep track of this progress and stay up to date with the latest and greatest in the field. I would say the three resources I have consistently used to stay up to date are the following. Number one is LM Arena; let me show you LM Arena. This is basically an LLM leaderboard that ranks all the top models, and the ranking is based on human comparisons: humans prompt these models and judge which one gives the better answer, without knowing which model is which. They're just choosing the better answer, and from that you can calculate a ranking and get results. What you see here are the different organizations, like Google Gemini, that produce these models; when you click on any one of them, it takes you to the place where that model is hosted. Here we see Google currently on top, with OpenAI right behind, and DeepSeek in position number three.
Now, the reason this is a big deal is the last column, where you see the license. DeepSeek is an MIT-licensed model; it's open weights. Anyone can use these weights, anyone can download them, anyone can host their own version of DeepSeek and use it in whatever way they like. So it's not a proprietary model that you don't have access to; it's basically an open-weights release, and it is kind of unprecedented that a model this strong was released with open weights. Pretty cool from the team. Next up, we have a few more models from Google and OpenAI, and as you continue to scroll down, you start to see the other usual suspects: xAI here, Anthropic with Sonnet at number 14, and then Meta with Llama over here. Llama, similar to DeepSeek, is an open-weights model, but it's down here rather than up at the top. Now, I will say that this leaderboard was really good for a long time, but I do think that in the last few months it's become a little bit gamed, and I don't trust it as much as I used to. Empirically, I feel like a lot of people, for example, are using Sonnet from Anthropic, and it's a really good model, but it's all the way down here at number 14. Conversely, I think not as many people are using Gemini, yet it's ranking really high. So use this as a first pass, but try out a few of the models on your own tasks and see which one performs better.
The second thing I would point to is the AI News newsletter. AI News is not very creatively named, but it is a very good newsletter produced by Swyx and friends, so thank you for maintaining it. It has been very helpful to me because it is extremely comprehensive: if you go to the archives, you'll see it's produced almost every other day, and while some of it is written and curated by humans, a lot of it is constructed automatically with LLMs. These issues are very comprehensive, and you're probably not missing anything major if you go through them. Of course, you're probably not going to read all of it, because it's so long, but the summaries at the top are quite good and, I think, have some human oversight. The last thing I would point to is X (Twitter). A lot of AI happens on X, so I would just follow people you like and trust and get your latest and greatest there as well. Those are the major places that have worked for me over time.
And finally, a few words on where you can find the models and where you can use them. The first one I would say is: for any of the biggest proprietary models, you just have to go to the website of that LLM provider. For example, for OpenAI that's chat.com, which I believe actually works now. For Gemini, I think it's gemini.google.com
or AI Studio; I think they have two for some reason that I don't fully understand (no one does). For the open-weights models like DeepSeek, etc., you have to go to some kind of an inference provider of LLMs. My favorite one is together.ai, and I showed you that when you go to the together.ai playground you can pick lots of different models; all of these are open models of different types, and you can talk to them there as an example.
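As a sketch of what talking to a hosted open model looks like programmatically: together.ai exposes an OpenAI-compatible API, but treat the endpoint URL and the model identifier below as assumptions to check against the provider's docs.

```python
# Sketch: talking to an open-weights model via an inference provider.
# together.ai offers an OpenAI-compatible API; the base URL and the
# model identifier below are assumptions -- check the provider's docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_TOGETHER_API_KEY",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model name
    messages=[{"role": "user", "content": "Tell me a pelican joke."}],
)
print(resp.choices[0].message.content)
```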
Now, if you'd like to use a base model, that's not as common to find even on these inference providers; they are all targeting assistants and chat, and I couldn't see base models even here. So for base models, I usually go to Hyperbolic, because they serve the Llama 3.1 base model, and I love that model; you can just talk to it there. As far as I know, this is a good place for a base model, and I wish more people hosted base models, because they are useful and interesting to work with in some cases.
Finally, you can also take some of the smaller models and run them locally. For example, with DeepSeek, the biggest model is not something you're going to be able to run locally on your MacBook, but there are smaller versions of the DeepSeek model that are what's called distilled. And then, on top of that, you can run these models at lower numerical precision: not at the native precision of, for example, FP8 for DeepSeek or BF16 for Llama, but much lower than that. Don't worry if you don't fully understand those details; the point is that you can run smaller versions that have been distilled, at even lower precision, and then you can fit them on your computer.
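To make "lower precision" concrete, here is a toy sketch of quantizing a weight matrix to int8; real local-inference formats are more sophisticated, but the memory-versus-accuracy trade-off is the same idea:

```python
# Toy illustration of running weights at lower precision: symmetric
# int8 quantization of one weight matrix. Real quantization schemes
# (e.g. the 4-bit formats used by local runners) are more elaborate,
# but the idea is the same: fewer bits per weight, a small accuracy
# cost, and far less RAM.
import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)  # ~64 MB in fp32

scale = np.abs(w).max() / 127.0               # one scale for the whole tensor
w_int8 = np.round(w / scale).astype(np.int8)  # ~16 MB: 4x smaller
w_restored = w_int8.astype(np.float32) * scale  # dequantize at compute time

print("max abs error:", np.abs(w - w_restored).max())
```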
And so you can actually run pretty okay models on your laptop. My favorite place for this is usually LM Studio, which is basically an app you can download. I think it actually looks kind of ugly, and I don't like that it shows you all these models that are basically not that useful (everyone just wants to run DeepSeek), so I don't know why they give you 500 different types of models. They're really complicated to search through, and you have to choose different distillations and different precisions, and it's all really confusing. But once you actually understand how it works, and that's a whole separate video, you can load up a model. Here I loaded up Llama 3.2 Instruct 1B, and you can just talk to it: I ask for a pelican joke, I can ask for another one, and it gives me another one, etc. All of this happens locally on your computer; we're not actually going out to anyone else.
This is running on the GPU of the MacBook Pro, so that's very nice. You can then eject the model when you're done, and that frees up the RAM. So LM Studio is probably my favorite one, even though it's got a lot of UI/UX issues and it's really geared towards professionals, almost. But if you watch some videos on YouTube, I think you can figure out how to use the interface.
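For reference, LM Studio can also serve the loaded model over a local OpenAI-compatible API, so you can script against it; the port and model name below are assumptions based on its defaults, so check your install.

```python
# Sketch: LM Studio can expose the loaded model over a local,
# OpenAI-compatible server (commonly at localhost:1234; treat the
# port and the model name here as assumptions about your setup).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="llama-3.2-1b-instruct",  # whichever model you loaded in the app
    messages=[{"role": "user", "content": "Tell me a pelican joke."}],
)
print(resp.choices[0].message.content)  # generated entirely on your machine
```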
So those are a few words on where to find the models. Let me now loop back around to where we started. The question was: when we go to chatgpt.com and enter some kind of a query and hit go, what exactly is happening here? What are we seeing? What are we talking to? How does this work? I hope that this video gave you some appreciation for some of the under-the-hood details of how these models are trained, and for what it is that comes back.
In particular, we now know that your query is first chopped up into tokens. So we go to Tiktokenizer, find the place in the format that is reserved for the user query, and basically put our query right there. Our query goes into what we discussed as the conversation protocol format, which is the way we maintain conversation objects. It gets inserted there, and then the whole thing ends up being just a token sequence, a one-dimensional token sequence, under the hood.
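To make the encoding step concrete, here is a minimal sketch using the open tiktoken library; the chat-protocol markers that delimit user and assistant turns get their own dedicated token ids in the real system.

```python
# Sketch of the tokenization step: text goes in, a flat list of integer
# token ids comes out. This uses a GPT-4-era vocabulary from tiktoken;
# the conversation-protocol markers are handled by additional special
# tokens in the production system.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

query = "Why is the sky blue?"
tokens = enc.encode(query)
print(tokens)              # a short list of integer ids, one per token chunk
print(enc.decode(tokens))  # round-trips back to the original text
```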
So ChatGPT saw this token sequence, and when we hit go, it basically continues appending tokens to this list. It continues the sequence; it acts like a token autocomplete. In particular, it gave us this response, so we can put that back in and see, roughly, the tokens that it continued with.
Now the question becomes: why are these the tokens that the model responded with? What are these tokens? Where are they coming from? What are we talking to? And how do we program this system? That's where we shifted gears and talked about the under-the-hood pieces of it. The first stage of this process (and there are three stages) is the pre-training stage, which fundamentally has to do with knowledge acquisition from the internet into the parameters of this neural network; the neural net internalizes a lot of knowledge from the internet. But where the personality really comes in is in the process of supervised fine-tuning. What happens there is that a company like OpenAI will curate a large dataset of conversations, say 1 million conversations across very diverse topics, and these will be conversations between a human and an assistant. Even though there's a lot of synthetic data generation used throughout this entire process, and a lot of LLM help and so on, fundamentally this is a human data curation task with lots of humans involved. In particular, these humans are data labelers hired by OpenAI who are given labeling instructions that they learn, and their task is to create the ideal assistant response for any arbitrary prompt. So they are teaching the neural network, by example, how to respond to prompts.
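Concretely, one training example in such a dataset might look roughly like this; the field names are illustrative, since each lab has its own internal schema.

```python
# Illustrative shape of one supervised fine-tuning example: a
# conversation whose assistant turn was written by a hired human
# labeler following the labeling instructions. Field names here are
# made up for clarity, not any lab's actual schema.
example = {
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "The sky looks blue because of Rayleigh scattering: ..."},
    ],
}
# During SFT, the conversation is tokenized and the model is trained to
# predict the assistant tokens, i.e. it learns to imitate the labeler.
```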
So what is the way to think about what came back here? What is this? Well, I think the right way to think about it is that this is a neural network simulation of a data labeler at OpenAI. It's as if I gave this query to a data labeler at OpenAI, and this data labeler first read all of the labeling instructions from OpenAI, then spent two hours writing up the ideal assistant response to this query and gave it to me. Now, we're not actually doing that, right? Because we didn't wait two hours. What we're getting here is a neural network simulation of that process, and we have to keep in mind that these neural networks don't function like human brains do. They are different; what's easy or hard for them is different from what's easy or hard for humans.
And so we really are just getting a simulation. Here I've shown you that this is a token stream, and that this is fundamentally the neural network, with a bunch of activations and neurons in between. It is a fixed mathematical expression that mixes inputs from the tokens with the parameters of the model, and mixing them up gets you the next token in the sequence. But it is a finite amount of compute that happens for every single token, so this is some kind of a lossy simulation of a human that is restricted in this way. Whatever the humans write, the language model is imitating at the token level, with only this specific computation for every single token in the sequence.
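As a toy picture of that "same fixed computation per token" loop, here is a minimal autoregressive sampler; a real LLM replaces the trivial model below with a transformer forward pass, but the loop structure is the same.

```python
# Toy autoregressive loop: the same fixed computation runs once per
# generated token. A real LLM replaces `logits_for` with a transformer
# forward pass over the whole context, but the loop is identical.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50
W = rng.normal(size=(VOCAB, VOCAB))  # fixed model parameters

def logits_for(context):
    # Fixed mathematical expression mixing the input with the parameters
    # (here: a trivial lookup keyed on the last token only).
    return W[context[-1]]

tokens = [7]  # the prompt, already tokenized
for _ in range(10):
    logits = logits_for(tokens)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    tokens.append(int(rng.choice(VOCAB, p=probs)))  # sample the next token

print(tokens)  # the prompt plus ten continued tokens
```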
We also saw that, as a result of this and of the cognitive differences, the models will suffer in a variety of ways, and you have to be very careful with their use. For example, we saw that they will suffer from hallucinations, and we also have the sense of a Swiss-cheese model of LLM capabilities, where there are basically holes in the cheese: sometimes the models will just arbitrarily do something dumb. Even though they're doing lots of magical stuff, sometimes they just can't. Maybe you're not giving them enough tokens to think, and they're going to make stuff up because their mental arithmetic breaks. Maybe they're suddenly unable to count the number of letters, or unable to tell you that 9.11 is smaller than 9.9, and it looks kind of dumb. So it's a Swiss-cheese capability, and we have to be careful with that. We saw the reasons for it, but fundamentally this is how to think about what came back.
It's again a simulation, by this neural network, of a human data labeler following the labeling instructions at OpenAI. That's what we're getting back. Now, I do think things change a little bit when you actually go and reach for one of the thinking models, like o3-mini-high. The reason for that is that GPT-4o basically doesn't do reinforcement learning. It does do RLHF, but I've told you that RLHF is not RL; there's no time for magic in there. It's just a little bit of fine-tuning, is the way to look at it.
But these thinking models do use RL. They go through this third stage of perfecting their thinking process, discovering new thinking strategies and solutions to problem solving that look a little bit like your internal monologue in your head, and they practice that on a large collection of practice problems that companies like OpenAI create, curate, and then make available to the LLMs.
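Conceptually, and only conceptually, since the labs' actual algorithms are not public, the training loop looks something like this sketch, where `model.sample` and `model.train_on` are hypothetical stand-ins:

```python
# Conceptual sketch of RL on verifiable problems: sample several
# attempts (chains of thought), score them against the known answer,
# and train on the ones that succeed. The labs' real algorithms are
# not public; `model.sample` and `model.train_on` are hypothetical.

def rl_step(model, problem, correct_answer, num_attempts=16):
    successes = []
    for _ in range(num_attempts):
        attempt = model.sample(problem)  # chain of thought + final answer
        if attempt.final_answer == correct_answer:  # verifiable reward
            successes.append(attempt)
    # Reinforce whatever thinking led to the right answer.
    for attempt in successes:
        model.train_on(problem, attempt)
```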
So when I come here and talk to a thinking model and put in this question, what we're seeing is no longer just a straightforward simulation of a human data labeler; this is actually kind of new, unique, and interesting. Of course, OpenAI is not showing us the under-the-hood thinking and the chains of thought that underlie the reasoning here, but we know that such a thing exists, and this is a summary of it. What we're getting here is not just an imitation of a human data labeler; it's something that is kind of new, interesting, and exciting, in the sense that it is a function of thinking that was emergent in a simulation. It's not just imitating a human data labeler; it comes from this reinforcement learning process.
And here, of course, we're not giving it a chance to shine, because this is not a mathematical or reasoning problem; this is just some kind of a creative-writing problem, roughly speaking. I think it's an open question whether the thinking strategies that are developed inside verifiable domains transfer, and are generalizable, to other domains that are unverifiable, such as creative writing. The extent to which that transfer happens is unknown in the field, I would say. So we're not sure if we are able to do RL on everything that is verifiable and see the benefits of that on things that are unverifiable, like this prompt. That's an open question.
The other interesting thing is that this reinforcement learning is still way too new, primordial, and nascent; we're just seeing the beginnings of the hints of greatness. In the reasoning problems, we're seeing something that is, in principle, capable of something like the equivalent of move 37, but not in the game of Go: in open-domain thinking and problem solving. In principle, this paradigm is capable of doing something really cool, new, and exciting, something even that no human has thought of before; in principle, these models are capable of analogies no human has had. So I think it's incredibly exciting that these models exist, but again, it's very early, and these are primordial models for now. They will mostly shine in domains that are verifiable, like math and code. So they are very interesting to play with, think about, and use.
And that's roughly it; those are the broad strokes of what's available right now. I will say that overall it is an extremely exciting time to be in the field. Personally, I use these models all the time, daily, tens or hundreds of times, because they dramatically accelerate my work, and I think a lot of people see the same thing. I think we're going to see a huge amount of wealth creation as a result of these models. But be aware of their shortcomings; even with RL models, they're going to suffer from some of these. Use them as tools in a toolbox. Don't trust them fully, because they will randomly do dumb things: they will randomly hallucinate, they will randomly skip over some mental arithmetic and not get it right, and they randomly can't count or something like that. So use them as tools in the toolbox, check their work, and own the product of your work. Use them for inspiration and for first drafts; ask them questions, but always check and verify, and you will be very successful in your work if you do so. I hope this video was useful and interesting to you. I hope you had fun; it's already very long, so I apologize for that, but I hope it was useful, and I will see you later.