[Chinese Translation Collector's Edition] AI legend Andrej Karpathy takes a deep dive into ChatGPT and other large language models!
By 财富起点 | Wealth Starting Point
Summary
Topics Covered
- Internet Compresses to 44TB High-Quality Text
- Tokens Trade Sequence Length for Vocabulary Size
- Pretraining Predicts Next Token Autocomplete
- Base Models Simulate Internet Documents
- SFT Imitates Human Labeler Simulations
Full Transcript
Hi everyone. So I've wanted to make this video for a while. It is a comprehensive but general-audience introduction to large language models like ChatGPT. What I'm hoping to achieve in this video is to give you mental models for thinking through what it is that this tool is. It is obviously magical and amazing in some respects. It's really good at some things, not very good at other things, and there are also a lot of sharp edges to be aware of. So what is behind this text box? You can put anything in there and press enter, but what should we be putting there? And what are these words generated back? How does this work? And what are you talking to, exactly? I'm hoping to get at all those topics in this video. We're going to go through the entire pipeline of how this stuff is built, but I'm going to keep everything accessible to a general audience. So let's take a look first at how you build something like ChatGPT, and along the way I'm going to talk about some of the cognitive, psychological implications of these tools. Okay, so let's build ChatGPT.
So there are going to be multiple stages arranged sequentially. The first stage is called the pre-training stage, and the first step of the pre-training stage is to download and process the internet. Now, to get a sense of what this roughly looks like, I recommend looking at this URL here. This company called Hugging Face collected, created, and curated this dataset called FineWeb, and they go into a lot of detail in this blog post on how they constructed the FineWeb dataset. All of the major LLM providers, like OpenAI, Anthropic, and Google, will have some internal equivalent of something like the FineWeb dataset. So roughly, what are we trying to achieve here? We're trying to get a ton of text from the internet from publicly available sources. We're trying to have a huge quantity of very high-quality documents, and we also want a very large diversity of documents, because we want to have a lot of knowledge inside these models. So we want a large diversity of high-quality documents, and we want many, many of them. Achieving this is quite complicated and, as you can see here, takes multiple stages to do well. Let's take a look at what some of these stages look like in a bit. For now, I'd just like to note that, for example, the FineWeb dataset, which is fairly representative of what you would see in a production-grade application, actually ends up being only about 44 terabytes of disk space. You can get a USB stick of a terabyte very easily, and I think this could fit on a single hard drive almost today. So this is not a huge amount of data at the end of the day. Even though the internet is very, very large, we're working with text and we're also filtering it aggressively, so we end up with about 44 terabytes in this example. So let's take a look at what this data looks like and what some of these stages are. The starting point for a lot of these efforts, and something that contributes most of the data by the end of it, is data from Common Crawl. Common Crawl is an organization that has been basically scouring the internet since 2007. As of 2024, for example, Common Crawl has indexed 2.7 billion web pages. They have all these crawlers going around the internet, and what you end up doing, basically, is you start with a few seed web pages, then you follow all the links, and you just keep following links and indexing all the information, and you end up with a ton of data from the internet over time. So this is usually the starting point for a lot of these efforts. Now, this Common Crawl data is quite raw and is filtered in many, many different ways.
So here, in this same diagram, they document a little bit of the kind of processing that happens in these stages. The first thing here is something called URL filtering. What that refers to is that there are block lists of URLs, or domains, that you don't want to be getting data from. Usually this includes things like malware websites, spam websites, marketing websites, racist websites, adult sites, and things like that. So there are a ton of different types of websites that are just eliminated at this stage, because we don't want them in our dataset. The second part is text extraction. You have to remember that all these web pages, this is the raw HTML of these web pages that is being saved by these crawlers. So when I go to inspect here, this is what the raw HTML actually looks like. You'll notice that it's got all this markup, like lists and stuff like that, and there's CSS and all this kind of stuff. So this is almost computer code for these web pages. But what we really want is just the text, right? We just want the text of this web page, and we don't want the navigation and things like that. So there's a lot of filtering, processing, and heuristics that go into adequately filtering for just the good content of these web pages. The next stage here is language filtering. For example, FineWeb filters using a language classifier: they try to guess what language every single web page is in, and then they only keep web pages that are more than 65% English, as an example. And so you can get a sense that this is a design decision that different companies can make for themselves: what fraction of all the different languages are we going to include in our dataset? Because, for example, if we filter out all of the Spanish, then you might imagine that our model later will not be very good at Spanish, because it's just never seen much data in that language. And so different companies can focus on multilingual performance to different degrees. FineWeb is quite focused on English, so the language model they end up training later will be very good at English, but maybe not very good at other languages. After language filtering there are a few other filtering steps and deduplication and things like that, finishing with, for example, PII removal. PII is personally identifiable information: for example, addresses, social security numbers, and things like that. You would try to detect them and filter out those kinds of web pages from the dataset as well. So there are a lot of stages here, and I won't go into full detail, but it is a fairly extensive part of the pre-processing, and you end up with, for example, the FineWeb dataset.
When you click in on it, you can see some examples of what this actually ends up looking like, and anyone can download this on the Hugging Face web page. So here are some examples of the final text that ends up in the training set. This is some article about tornadoes in 2012: so there were some tornadoes in 2012 and what happened. The next one is something about "did you know you have two little yellow 9V-battery-sized adrenal glands in your body?" Okay, so this is some kind of odd medical article. So just think of these as basically web pages on the internet, filtered just for the text in various ways. And now we have a ton of text, 44 terabytes of it, and that is the starting point for the next step of this stage. Now, I wanted to give you an intuitive sense of where we are right now. So I took the first 200 web pages here (and remember, we have tons of them), and I just took all that text and put it all together, concatenated. And this is what we end up with: just raw internet text. And there's a ton of it even in these 200 web pages. So I can continue zooming out here, and we just have this massive tapestry of text data. And this text data has all these patterns. What we want to do now is start training neural networks on this data, so the neural networks can internalize and model how this text flows. So we just have this giant texture of text, and now we want to get neural nets that mimic it.
Okay, now before we plug text into neural networks, we have to decide how we're going to represent this text and how we're going to feed it in. The way our technology works for these neural nets is that they expect a one-dimensional sequence of symbols, drawn from a finite set of possible symbols. So we have to decide what the symbols are, and then we have to represent our data as a one-dimensional sequence of those symbols. Right now, what we have is a one-dimensional sequence of text: it starts here, goes here, and then comes here, etc. So this is a one-dimensional sequence, even though on my monitor it's of course laid out in a two-dimensional way; it goes from left to right and top to bottom. So it's a one-dimensional sequence of text. Now, this being computers, there's of course an underlying representation here. If I UTF-8 encode this text, I can get the raw bits that correspond to this text in the computer, and that looks like this. It turns out that, for example, this very first bar here is the first eight bits. So what is this thing? This is a representation of the kind we're looking for. In a certain sense, we have exactly two possible symbols, zero and one, and we have a very long sequence of them. Now, as it turns out, this sequence length is going to be a very finite and precious resource in our neural network, and we actually don't want extremely long sequences of just two symbols. Instead, we want to trade off the size of this vocabulary, as we call it, against the resulting sequence length. We don't want just two symbols and extremely long sequences; we're going to want more symbols and shorter sequences.
Okay, so one naive way of compressing, or decreasing the length of, our sequence here is to consider some group of consecutive bits, for example eight bits, and group them into a single so-called byte. Because these bits are either on or off, if we take a group of eight of them, there turn out to be only 256 possible combinations of how these bits could be on or off, and so we can re-represent this sequence as a sequence of bytes instead. This sequence of bytes will be eight times shorter, but now we have 256 possible symbols, so every number here goes from 0 to 255. Now, I really encourage you to think of these not as numbers but as unique IDs, or unique symbols. Maybe it's better to actually replace every one of these with a unique emoji; you'd get something like this. So we basically have a sequence of emojis, and there are 256 possible emojis. You can think of it that way.
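To make this concrete, here is a minimal Python sketch of that byte-level view; the example string is my own:

```python
text = "hello world"

# UTF-8 encode: each character becomes one or more bytes, i.e. integers 0-255
raw_bytes = text.encode("utf-8")
print(list(raw_bytes))  # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

# The underlying bit string is eight times longer than the byte sequence
bits = "".join(f"{b:08b}" for b in raw_bytes)
print(bits[:16])  # the first sixteen bits, i.e. the first two bytes
```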
Now, it turns out that in production, for state-of-the-art language models, you actually want to go even beyond this. You want to continue to shrink the length of the sequence, because again, it is a precious resource, in return for more symbols in your vocabulary. And the way this is done is by running what's called the byte pair encoding algorithm. The way this works is that we look for consecutive bytes, or symbols, that are very common. For example, it turns out that the sequence 116 followed by 32 is quite common and occurs very frequently. So what we do is group this pair into a new symbol: we mint a symbol with ID 256, and we rewrite every single pair (116, 32) with this new symbol. Then we can iterate this algorithm as many times as we wish, and each time we mint a new symbol, we're decreasing the sequence length and increasing the vocabulary size. In practice, it turns out that a pretty good setting for the vocabulary size is about 100,000 possible symbols; in particular, GPT-4 uses 100,277 symbols. And this process of converting from raw text into these symbols, or tokens as we call them, is called tokenization.
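Here is a toy sketch of a single byte pair encoding merge step, assuming we start from raw UTF-8 byte IDs; production tokenizers iterate this merge tens of thousands of times and handle many details this skips:

```python
from collections import Counter

def bpe_merge_once(ids, new_id):
    # Find the most frequent adjacent pair of symbols
    pair_counts = Counter(zip(ids, ids[1:]))
    top_pair, _ = pair_counts.most_common(1)[0]
    # Rewrite the sequence, replacing every occurrence of that pair with new_id
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top_pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out, top_pair

ids = list("the cat ate the hat".encode("utf-8"))
ids, merged = bpe_merge_once(ids, 256)  # mint the first new symbol, ID 256
print(merged, len(ids))                 # the pair that was merged, and the shorter length
```

Each merge shortens the sequence and grows the vocabulary by one symbol; iterating until the vocabulary reaches roughly 100,000 symbols gives you something like GPT-4's 100,277-token vocabulary.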
So let's now take a look at how GPT-4 performs tokenization, converting from text to tokens and from tokens back to text, and what this actually looks like. One website I like to use to explore these token representations is called Tiktokenizer. Come here to the dropdown and select cl100k_base, which is the GPT-4 base model tokenizer. Here on the left you can put in text, and it shows you the tokenization of that text. So, for example: hello, space, world. "hello world" turns out to be exactly two tokens: the token "hello", which is the token with ID 15339, and the token " world" (with the leading space), which is token 1917. Now, if I join these two, for example, I'm again going to get two tokens, but it's the token "h" followed by a token that is "elloworld" without the h. If I put two spaces here between hello and world, it's again a different tokenization; there's a new token, 220, here. So you can play with this and see what happens. Also keep in mind that this is case sensitive: if this is a capital H, it is something else, or if it's "Hello World", then this actually ends up being three tokens instead of just two. So you can play with this and get an intuitive sense of how these tokens work. We're actually going to loop back around to tokenization a bit later in the video; for now, I just wanted to show you the website, and to show you what this text is at the end of the day. For example, if I take one line here, this is what GPT-4 will see it as: this text will be a sequence of length 62. This is the sequence here, and this is how the chunks of text correspond to these symbols. And again, there are 100,277 possible symbols, and we now have one-dimensional sequences of those symbols.
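If you want to reproduce this away from the website, OpenAI's tiktoken library exposes the same cl100k_base vocabulary (this assumes you have tiktoken installed):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 base model tokenizer

ids = enc.encode("hello world")
print(ids)                                   # [15339, 1917]
print(enc.decode(ids))                       # "hello world"

# Tokenization is case and whitespace sensitive
print(enc.encode("Hello World"))
print(enc.encode("hello  world"))            # the double space tokenizes differently
```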
So yeah, we're going to come back to tokenization, but that's where we are for now. Okay, so what I've done now is taken this sequence of text that we have in the dataset and re-represented it, using our tokenizer, as a sequence of tokens. This is what that looks like now. For example, when we go back to the FineWeb dataset, they mention that not only is this 44 terabytes of disk space, it is about a 15-trillion-token sequence. And here are just the first few thousand tokens of this dataset, but there are 15 trillion to keep in mind. And again, keep in mind one more time that all of these represent little text chunks. They're the atoms of these sequences, and the numbers here don't make any sense; they're just unique IDs.
Okay, so now we get to the fun part, which is the neural network training, and this is where a lot of the heavy lifting happens computationally when you're training these neural networks. What we do in this step is model the statistical relationships of how these tokens follow each other in the sequence. So we come into the data and we take windows of tokens. We take a window of tokens from this data fairly randomly, and the window's length can range anywhere between zero tokens, actually, all the way up to some maximum size that we decide on. For example, in practice you could see token windows of, say, 8,000 tokens. In principle we can use arbitrary window lengths of tokens, but processing very long window sequences would just be very computationally expensive, so we decide that, say, 8,000 is a good number, or 4,000 or 16,000, and we crop it there. Now, in this example I'm going to take the first four tokens, just so everything fits nicely. So we're going to take a window of four tokens, the chunks "bar", "view", "ing", and " single", which are these token IDs. And what we're trying to do here is predict the token that comes next in the sequence; 3962 comes next. So we call these four tokens the context, and they feed into a neural network. This is the input to the neural network.
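A minimal sketch, in Python, of how such training windows get carved out of the token stream; the first five IDs echo the example on screen, the rest are made up:

```python
import random

# A tiny slice of the tokenized dataset (mostly made-up IDs)
tokens = [91, 860, 287, 11579, 3962, 13659, 16, 1131]
max_window = 4

# Sample a random window: the window is the context, and the token
# immediately after it is what the network should learn to predict.
start = random.randrange(len(tokens) - max_window)
context = tokens[start : start + max_window]   # e.g. [91, 860, 287, 11579]
label = tokens[start + max_window]             # e.g. 3962
print(context, "->", label)
```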
Now, I'm going to go into the detail of what's inside this neural network in a little bit. For now, what's important to understand is the input and the output of the neural net. The input is a sequence of tokens of variable length, anywhere between zero and some maximum size like 8,000. The output is a prediction for what comes next. Because our vocabulary has 100,277 possible tokens, the neural network is going to output exactly that many numbers, and each of those numbers corresponds to the probability of that token coming next in the sequence. So it's making guesses about what comes next. In the beginning, this neural network is randomly initialized (we're going to see in a little bit what that means, but it's a random transformation), so these probabilities at the very beginning of training are also going to be kind of random. Here I have three examples, but keep in mind that there are 100,000 numbers here. So for the token " Direction", the neural network is saying this is 4% likely right now; 11799 is 2%; and the probability of 3962, which is " post", is 3%. Now, of course, we've sampled this window from our dataset, so we know what comes next; that's the label. We know that the correct answer is that 3962 actually comes next in the sequence. So now we have a mathematical process for doing an update to the neural network. We have a way of tuning it, and we're going to go into a little bit of detail in a bit, but basically, this probability of 3%, we want it to be higher, and we want the probabilities of all the other tokens to be lower. And so we have a way of mathematically calculating how to adjust and update the neural network so that the correct answer has a slightly higher probability. If I do an update to the neural network now, then the next time I feed this particular sequence of four tokens in, the network will be slightly adjusted, and it will say, okay, " post" is maybe 4%, " case" now maybe is 1%, and " Direction" could become 2%, or something like that. So we have a way of nudging, of slightly updating, the neural net to give a higher probability to the correct token that comes next in the sequence. And now you just have to remember that this process happens not just for this one token, where these four were fed in and predicted this one. This process happens at the same time for all of the tokens in the entire dataset. In practice, we sample little windows, little batches of windows, and then at every single one of these tokens we want to adjust our neural network so that the probability of that token becomes slightly higher. And this all happens in parallel, in large batches of these tokens. This is the process of training the neural network: it's a sequence of updates so that its predictions match up with the statistics of what actually happens in your training set, and its probabilities become consistent with the statistical patterns of how these tokens follow each other in the data.
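A small, self-contained sketch of the quantity these updates push on; the numbers here are random stand-ins rather than real model outputs:

```python
import numpy as np

VOCAB = 100_277
rng = np.random.default_rng(0)

logits = rng.normal(size=VOCAB)      # stand-in for the network's raw output scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax: one probability per possible next token

label = 3962                         # the token that actually came next in the data
loss = -np.log(probs[label])         # cross-entropy: small when probs[label] is large
print(loss)

# Training computes how every parameter should change to lower this loss,
# nudging probs[label] up and all the other probabilities down.
```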
So let's now briefly get into the internals of these neural networks, just to give you a sense of what's inside. As I mentioned, we have these inputs, which are sequences of tokens. In this case there are four input tokens, but this can be anywhere between zero and, let's say, a thousand tokens. In principle this could be an infinite number of tokens; it would just be too computationally expensive to process, so we crop it at a certain length, and that becomes the maximum context length of the model. Now, these inputs x are mixed up in a giant mathematical expression together with the parameters, or the weights, of these neural networks. Here I'm showing six example parameters and their settings, but in practice these modern neural networks will have billions of parameters, and in the beginning these parameters are completely randomly set. With a random setting of parameters, you might expect that this neural network would make random predictions, and it does; in the beginning its predictions are totally random. But it's through this process of iteratively updating the network, which we call training, that the setting of these parameters gets adjusted, such that the outputs of the neural network become consistent with the patterns seen in the training set. Think of these parameters as kind of like knobs on a DJ set: as you twiddle these knobs, you get different predictions for every possible token sequence input. And training a neural network just means discovering a setting of the parameters that seems to be consistent with the statistics of the training set.
Now, let me just give you an example of what this giant mathematical expression looks like, just to give you a sense. Modern networks are massive expressions with trillions of terms, probably, but let me show you a simple example. It would look something like this. These are the kinds of expressions, just to show you that it's not very scary: we have inputs x, like x1 and x2 in this case (two example inputs), and they get mixed up with the weights of the network, w1, w2, w3, etc. The mixing is simple operations like multiplication, addition, exponentiation, division, and so on. It is the subject of neural network architecture research to design effective mathematical expressions that have a lot of convenient characteristics: they are expressive, they're optimizable, they're parallelizable, and so on. At the end of the day, these are not complex expressions; basically, they mix up the inputs with the parameters to make predictions, and we're optimizing the parameters of this neural network so that the predictions come out consistent with the training set.
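For instance, here is a toy expression in this spirit, with two inputs and three weights; the exact formula is mine, purely for illustration:

```python
import math

def tiny_net(x1, x2, w):
    # Mix the inputs with the weights via multiplication and addition,
    # then squash with a sigmoid (built from exponentiation and division)
    s = w[0] * x1 + w[1] * x2 + w[2]
    return 1.0 / (1.0 + math.exp(-s))

w = [0.1, -0.3, 0.05]          # a real network has billions of these knobs
print(tiny_net(2.0, 3.0, w))   # twiddling w changes the prediction
```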
example of what these neural networks look like. So for that I encourage you
look like. So for that I encourage you to go to this website uh that has a very nice visualization of one of these networks.
So this is what you will find on this website. And this neural network here
website. And this neural network here that is used in production settings has this special kind of structure. This
network is called the transformer and this particular one as an example has 85,000 roughly parameters.
Now here on the top we take the inputs which are the token sequences and then information flows through the neural network until the output which
here are the loget softmax but these are the predictions for what comes next what token comes next and then here there's a sequence of transformations and all these
intermediate values that get produced inside this mathematical expression as it is sort of predicting what comes next. So as an example, these tokens are
next. So as an example, these tokens are embedded into kind of like this distributed representation as it's called. So every possible token has kind
called. So every possible token has kind of like a vector that represents it inside the neural network. So first we embed the tokens and then those values
uh kind of like flow through this diagram and these are all very simple mathematical expressions individually.
So we have layer norms and matrix multiplications and uh soft maxes and so on. So here's kind of like the attention
on. So here's kind of like the attention block of this transformer and then information kind of flows through into the multi-layer perceptron block and so on. And all these numbers here, these
on. And all these numbers here, these are the intermediate values of their expression. And uh you can almost think
expression. And uh you can almost think of these as kind of like the firing rates of these synthetic neurons. But I
would caution you to uh not um kind of think of it too much like neurons because these are extremely simple neurons compared to the neurons you would find in your brain. Your
biological neurons are very complex dynamical processes that have memory and so on. There's no memory in this
so on. There's no memory in this expression. It's a fixed mathematical
expression. It's a fixed mathematical expression from input to output with no memory. It's just a stateless. So these
memory. It's just a stateless. So these
are very simple neurons in comparison to biological neurons. But you can still
biological neurons. But you can still kind of loosely think of this as like a synthetic piece of uh brain tissue if you if you like uh to think about it that way. So information flows through
that way. So information flows through all these neurons fire until we get to the predictions. Now I'm not actually
the predictions. Now I'm not actually going to dwell too much on the precise uh kind of like mathematical details of all these transformations. Honestly, I
don't think it's that important to get into. What's really important to
into. What's really important to understand is that this is a mathematical function. It is uh
mathematical function. It is uh parameterized by some fixed set of parameters like say 85,000 of them. And
it is a way of transforming inputs into outputs. And as we twiddle the
outputs. And as we twiddle the parameters, we are getting uh different kinds of predictions. And then we need to find a good setting of these parameters so that the predictions uh
sort of match up with the patterns seen in training set. So that's the transformer. Okay. So I've shown you the
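To give a flavor of those internals, here is a minimal, self-contained sketch of the attention computation at the heart of a Transformer block; the dimensions and weights are random placeholders, not values from the visualized model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, C = 4, 16                    # four tokens of context, 16-dimensional vectors
rng = np.random.default_rng(0)
x = rng.normal(size=(T, C))     # the embedded input tokens
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(C)              # how strongly each token attends to each other token
mask = np.tril(np.ones((T, T))) == 1
scores = np.where(mask, scores, -np.inf)   # causal mask: a token cannot look at future tokens
out = softmax(scores) @ v                  # mix the value vectors by attention weight
print(out.shape)                           # (4, 16): one updated vector per input token
```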
Okay, so I've shown you the internals of the neural network, and we've talked a bit about the process of training it. I want to cover one more major stage of working with these networks, and that is the stage called inference. In inference, what we're doing is generating new data from the model, and so we want to see what kinds of patterns it has internalized in the parameters of its network. Generating from the model is relatively straightforward. We start with some tokens that are basically your prefix, what you want to start with. Say we want to start with the token 91. We feed it into the network, and remember that the network gives us probabilities; it gives us this probability vector here. So what we can do now is basically flip a biased coin: we sample a token based on this probability distribution. The tokens that are given high probability by the model are more likely to be sampled when you flip this biased coin; you can think of it that way. So we sample from the distribution to get a single token. For example, token 860 comes next. So 860, in this case, when we're generating from the model, could come next. Now, 860 is a relatively likely token; it might not be the only possible token in this case. There could be many other tokens that could have been sampled, but we can see that 860 is a relatively likely one. And indeed, in our training example here, 860 does follow 91. So let's continue the process. After 91 there's 860; we append it and again ask what the third token is. Let's sample, and let's just say it's 287, exactly as here. Let's do that again: we come back in, now we have a sequence of three, we ask what the likely fourth token is, we sample from that, and we get this one. And now let's say we do it one more time: we take those four, we sample, and we get this one. And this, 13659, is not actually the 3962 we had before. This token is the token " article" instead. So: viewing a single article. In this case we didn't exactly reproduce the sequence that we saw in the training data. So keep in mind that these systems are stochastic: we're sampling, we're flipping coins, and sometimes we luck out and reproduce some small chunk of the text in the training set, but sometimes we get a token that was not verbatim part of any of the documents in the training data. We're going to get remixes of the data that we saw in training, because at every step of the way we can flip and get a slightly different token, and once that token makes it in, and you sample the next one, and so on, you very quickly start to generate token streams that are very different from the token streams that occur in the training documents. Statistically they will have similar properties, but they are not identical to the training data; they're kind of like inspired by the training data. So in this case we got a slightly different sequence. And why would we get " article"? You might imagine that " article" is a relatively likely token in the context of "bar viewing single", etc. You can imagine that the word "article" followed this context window somewhere in the training documents to some extent, and we just happened to sample it here at this stage. So basically, inference is just predicting from these distributions one token at a time: we keep feeding back tokens and getting the next one. We're always flipping these coins, and depending on how lucky or unlucky we get, we might get very different kinds of patterns, depending on how we sample from these probability distributions.
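A minimal sketch of this sampling loop; the model function below is a stand-in that returns random probabilities, where the real thing would be the trained Transformer's forward pass:

```python
import numpy as np

rng = np.random.default_rng()
VOCAB = 100_277

def model(tokens):
    # Stand-in for the trained network: returns one probability
    # per possible next token (random here, for illustration).
    logits = rng.normal(size=VOCAB)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

tokens = [91]                              # the prefix we start from
for _ in range(4):
    probs = model(tokens)                  # probabilities for what comes next
    next_id = rng.choice(VOCAB, p=probs)   # flip the biased coin
    tokens.append(int(next_id))
print(tokens)                              # one possible stochastic continuation
```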
So that's inference. In most common scenarios, downloading the internet and tokenizing it is basically a pre-processing step. You do that a single time, and then, once you have your token sequence, you can start training networks. In practical cases you would try to train many different networks with different kinds of settings, different arrangements, and different sizes, so you'll be doing a lot of neural network training. Then, once you have a neural network that you've trained, and you have some specific set of parameters that you're happy with, you take the model and do inference: you actually generate data from the model. When you're on ChatGPT and you're talking with a model, that model was trained by OpenAI many months ago, probably, and they have a specific set of weights that work well. When you're talking to the model, all of that is just inference; there's no more training. Those parameters are held fixed, and you're just talking to the model: you're giving it some tokens, and it's completing token sequences, and that's what you're seeing generated when you actually use ChatGPT. That model just does inference alone. So let's now look at an example of training and inference that is concrete and gives you a sense of what this actually looks like when these models are trained.
Now, the example that I would like to work with, and that I'm particularly fond of, is OpenAI's GPT-2. GPT stands for Generatively Pre-trained Transformer, and this is the second iteration of the GPT series by OpenAI. When you are talking to ChatGPT today, the model underlying all of the magic of that interaction is GPT-4, the fourth iteration of that series. Now, GPT-2 was published in 2019 by OpenAI, in this paper that I have right here. The reason I like GPT-2 is that it is the first time a recognizably modern stack came together. All of the pieces of GPT-2 are recognizable today by modern standards; everything has just gotten bigger. I'm not going to be able to go into the full details of this paper, of course, because it is a technical publication, but some of the details I would like to highlight are as follows. GPT-2 was a Transformer neural network, just like the neural networks you would work with today. It had 1.6 billion parameters; these are the parameters we looked at here, and it would have 1.6 billion of them. Today, modern Transformers would have a lot closer to a trillion, or several hundred billion, probably. The maximum context length was 1,024 tokens. So when we are sampling windows of tokens from the dataset, we never take more than 1,024 tokens, and when you are trying to predict the next token in a sequence, you never have more than 1,024 tokens in your context to make that prediction. This is also tiny by modern standards; today the context length would be a lot closer to a couple hundred thousand, or maybe even a million. With that, you have a lot more context, a lot more tokens in your history, and you can make a much better prediction about the next token in a sequence. Finally, GPT-2 was trained on approximately 100 billion tokens, which is also fairly small by modern standards. As I mentioned, the FineWeb dataset we looked at here has 15 trillion tokens, so 100 billion is quite small.
Now, I actually tried to reproduce GPT-2 for fun as part of this project called llm.c. You can see my write-up of doing that in this post on GitHub, under the llm.c repository. In particular, the cost of training GPT-2 in 2019 was estimated to be approximately $40,000, but today you can do significantly better than that; in particular, here it took about one day and about $600. And this wasn't even trying too hard; I think you could really bring this down to about $100 today. Now, why is it that the costs have come down so much? Well, number one, these datasets have gotten a lot better, and the way we filter them, extract them, and prepare them has gotten a lot more refined, so the dataset is of much higher quality. That's one thing, but really the biggest difference is that our computers have gotten much faster in terms of the hardware (we're going to look at that in a second), and the software for running these models and really squeezing all possible speed out of the hardware has also gotten much better, as everyone has focused on these models and tried to run them very, very quickly. Now, I'm not going to be able to go into the full detail of this GPT-2 reproduction, and it is a long technical post, but I would still like to give you an intuitive sense of what it looks like to actually train one of these models as a researcher. What are you looking at, and what does it look like, what does it feel like? Let me give you a sense of that.
Okay, so this is what it looks like. Let me slide this over. What I'm doing here is training a GPT-2 model right now, and what's happening is that every single line here, like this one, is one update to the model. Remember how we are basically making the prediction better for every one of these tokens, updating the weights, or parameters, of the neural net. Every single line here is one update to the neural network, where we change its parameters by a little bit so that it is better at predicting the next token in a sequence. In particular, every single line here is improving the prediction on 1 million tokens in the training set: we've basically taken 1 million tokens out of this dataset, and we've tried to improve the prediction of each of those tokens as coming next in its sequence, on all 1 million of them simultaneously. At every single one of these steps, we make an update to the network. Now, the number to watch closely is this number called the loss. The loss is a single number telling you how well your neural network is performing right now, and it is constructed so that a low loss is good. You'll see that the loss decreases as we make more updates to the neural net, which corresponds to making better predictions on the next token in a sequence. And so the loss is the number you are watching as a neural network researcher. You're kind of twiddling your thumbs, drinking coffee, and making sure this looks good, so that with every update your loss is improving and the network is getting better at prediction. Here you see that we are processing 1 million tokens per update, and each update takes about 7 seconds. We are going to process a total of 32,000 steps of optimization; 32,000 steps with 1 million tokens each is about 32 billion tokens that we are going to process. And we're currently only at about step 420 out of 32,000, so we are still only a bit more than 1% done, because I've only been running this for 10 or 15 minutes or so.
Now, every 20 steps I have configured this optimization to do inference. So what you're seeing here is the model predicting the next token in a sequence: you start it off, and then you continue plugging in the tokens. We're running this inference step, and the model is predicting the next token in the sequence; every time you see something appear, that's a new token. Let's just look at this. You can see that it is not yet very coherent; keep in mind that this is only 1% of the way through training, so the model is not yet very good at predicting the next token. What comes out is actually a bit of gibberish, but it still has a little bit of local coherence: "since she is mine, it's a part of the information should discuss my father great companions Gordon showed me sitting over it", and so on. So I know it doesn't look very good, but let's actually scroll up and see what it looked like when I started the optimization, all the way up here near step one. After 20 steps of optimization, what we were getting looks completely random, and of course that's because the model had only had 20 updates to its parameters. It gives you random text because it's a random network. So you can see that, at least in comparison to that, the model is already starting to do much better. And indeed, if we waited the entire 32,000 steps, the model will have improved to the point that it's actually generating fairly coherent English; the tokens stream correctly, and they make up English a lot better. This has to run for about a day or two more. At this stage we just make sure the loss is decreasing, everything is looking good, and we just have to wait.
Now, let me turn to the story of the computation that's required, because of course I'm not running this optimization on my laptop; that would be way too expensive. We have to run this neural network and improve it, and we need all this data, and so on. You can't run this well on your own computer, because the network is just too large. So all of this is running on a computer out there in the cloud, and I want to address the compute side of the story of training these models and what that looks like. So let's take a look. Okay, the computer that I am running this optimization on is this 8x H100 node: there are eight H100 GPUs in a single node, a single computer. I am renting this computer, and it is somewhere in the cloud; I'm not sure where it is physically, actually. The place I like to rent from is called Lambda, but there are many other companies who provide this service. When you scroll down, you can see that they have some on-demand pricing for computers that have these H100s, which are GPUs (I'm going to show you what they look like in a second): on demand, 8x NVIDIA H100 GPUs. This machine comes for $3 per GPU per hour, for example. So you can rent these, you get a machine in the cloud, and you can go in and train these models.
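As a sanity check on the roughly $600 figure from the llm.c reproduction earlier, the rental arithmetic works out as follows:

```python
gpus = 8
dollars_per_gpu_hour = 3.00    # the on-demand price quoted above
hours = 24                     # roughly one day of training
print(gpus * dollars_per_gpu_hour * hours)   # 576.0, in line with "about $600"
```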
And these GPUs look like this. This is one H100 GPU; this is roughly what it looks like, and you slot it into your computer. GPUs are a perfect fit for training neural networks, because the computation is very expensive but displays a lot of parallelism: you can have many independent workers all working at the same time on the matrix multiplications that are under the hood of training these neural networks. This is just one of these H100s, but actually you would put multiple of them together: you can stack eight of them into a single node, and then you can stack multiple nodes into an entire data center, or an entire system. When we look at a data center, we start to see things that look like this: one GPU goes to eight GPUs, goes to a single system, goes to many systems. These bigger data centers would of course be much, much more expensive. What's happening is that all the big tech companies really desire these GPUs so they can train all these language models, because they are so powerful. That is fundamentally what has driven the stock price of NVIDIA to $3.4 trillion today, as an example, and why NVIDIA has kind of exploded. This is the gold rush. The gold rush is getting the GPUs, getting enough of them so they can all collaborate to perform this optimization. And what are they all doing? They're all collaborating to predict the next token on a dataset like the FineWeb dataset. This is the computational workflow that is extremely expensive. The more GPUs you have, the more tokens you can try to predict and improve on; you're going to process this dataset faster, you can iterate faster, and you can train a bigger network, and so on. This is what all those machines are doing, and this is why all of this is such a big deal. For example, here is an article from about a month ago or so: this is why it's a big deal that, for example, Elon Musk is getting 100,000 GPUs in a single data center. All of these GPUs are extremely expensive, are going to take a ton of power, and all of them are just trying to predict the next token in a sequence and improve the network by doing so, getting probably a lot more coherent text than what we're seeing here, a lot faster.
Okay, so unfortunately I do not have a couple of ten or a hundred million dollars to spend on training a really big model like this. But luckily we can turn to some big tech companies who train these models routinely and release some of them once they are done training. They've spent a huge amount of compute to train the network, and they release it at the end of the optimization, which is very useful because they've done a lot of compute for that. There are many companies who train these models routinely, but actually not many of them release what are called base models. The model that comes out at the end of this stage is what's called a base model. What is a base model? It's a token simulator; it's an internet text token simulator. And that is not by itself useful yet, because what we want is what's called an assistant: we want to ask questions and have it respond with answers. These models won't do that; they just create remixes of the internet. They dream internet pages. So base models are not very often released, because they're only step one of a few other steps that we still need to take to get an assistant. However, a few releases have been made. As an example, OpenAI released the GPT-2 model, the 1.5-billion-parameter model, back in 2019, and this GPT-2 model is a base model. Now, what is a model release? What does it look like to release these models? This is the GPT-2 repository on GitHub. Well, you need two things, basically, to release a model. Number one, you need the Python code, usually, that describes in detail the sequence of operations in the model. If you remember this Transformer from before, the sequence of steps taken in the neural network is what is described by this code; this code implements what's called the forward pass of the neural network. So we need the specific details of exactly how they wired up that neural network. This is just computer code, usually just a couple hundred lines; it's not that crazy, and it is all fairly understandable and usually fairly standard. What's not standard are the parameters. That's where the actual value is. Where are the parameters of this neural network? There are 1.5 billion of them, and we need the correct setting, or a really good setting. And so, in addition to this source code, they release the parameters, which in this case is roughly 1.5 billion numbers: one single list of 1.5 billion numbers, the precise and good setting of all the knobs, such that the tokens come out well. So you need those two things for a base model release.
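As a minimal sketch of what consuming such a release looks like today, here is how you might load that code-plus-parameters pair with Hugging Face's transformers library; this assumes transformers and torch are installed, and "gpt2-xl" is the 1.5-billion-parameter release:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")      # the tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")  # the code plus the released weights

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")                   # roughly 1.5 billion numbers

ids = tok("Here's my top 10 list of", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=32, do_sample=True)
print(tok.decode(out[0]))                           # a stochastic continuation
```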
Now, GPT-2 was released, but that's actually a fairly old model, as I mentioned. So the model we're going to turn to instead is called Llama 3, and that's the one I would like to show you next. GPT-2, again, was 1.6 billion parameters trained on 100 billion tokens. Llama 3 is a much bigger and much more modern model. It is released and trained by Meta, and it is a 405-billion-parameter model trained on 15 trillion tokens, in very much the same way, just much, much bigger. Meta has also made a release of Llama 3, and that was part of this paper. With this paper, which goes into a lot of detail, the biggest base model they released is the Llama 3.1 405-billion-parameter model. So this is the base model. And then, in addition to the base model, you see here (foreshadowing for later sections of the video) that they also released the instruct model. Instruct means that this is an assistant: you can ask it questions and it will give you answers. We have yet to cover that part; for now, let's just look at this base model, this token simulator, and let's play with it and try to think about what this thing is, how it works, and what we get at the end of this optimization if you let it run until the end, for a very big neural network, on a lot of data. My favorite place to interact with base models is this company called Hyperbolic, which is basically serving the base model of the 405B Llama 3.1. When you go to the website (I think you may have to register and so on), make sure in the models section that you are using Llama 3.1 405B base; it must be the base model. And then here, the max tokens setting is how many tokens we're going to be generating. Let's just decrease this a bit, just so we don't waste compute; we just want the next 128 tokens, and we'll leave the other stuff alone. I'm not going to go into full detail here. Fundamentally, what's going to happen here is identical to what happened during our inference example: this is just going to continue the token sequence of whatever prefix you give it.
I want to first show you that this model here is not yet an assistant. You can't, for example, ask it "what is 2 plus 2?" It's not going to tell you, "oh, it's 4; what else can I help you with?" It's not going to do that, because "what is 2 plus 2" is going to be tokenized, and then those tokens just act as a prefix; what the model is going to do is just get the probability for the next token. It's just a glorified autocomplete, a very, very expensive autocomplete of what comes next, based on the statistics of what it saw in its training documents, which are basically web pages. So let's just hit enter to see what tokens it comes up with as a continuation. Okay, so here it kind of actually answered the question and then started to go off into some philosophical territory. Let's try it again: let me copy and paste, and let's try again from scratch. "What is 2 plus 2?" Okay, so it just goes off again. Notice one more thing I want to stress: every time you put a prompt in, it starts from scratch. The system here is stochastic. For the same prefix of tokens, we are always getting a different answer, and the reason for that is that we get a probability distribution and we sample from it; we always get different samples, and we always go off into a different territory afterwards. In this case, I don't know what this is. Let's try one more time. So it just continues on. It's just doing the stuff that it saw on the internet, right? It's just regurgitating those statistical patterns. So, first: it's not an assistant yet, it's a token autocomplete. And second: it is a stochastic system.
Now, the crucial thing is that even though this model is not yet by itself very useful for a lot of applications, it is still very useful, because in the task of predicting the next token in the sequence, the model has learned a lot about the world, and it has stored all that knowledge in the parameters of the network. Remember that our text looked like this, right, internet web pages, and now all of this is compressed in the weights of the network. You can think of these 405 billion parameters as a kind of compression of the internet; you can think of the 405 billion parameters as kind of like a zip file. But it's not a lossless compression, it's a lossy compression: we're left with a kind of gestalt of the internet, and we can generate from it. Now, we can elicit some of this knowledge by prompting the base model accordingly. For example, here's a prompt that might work to elicit some of that knowledge hiding in the parameters: "Here's my top 10 list of the top landmarks to see in Paris." I'm doing it this way because I'm trying to prime the model to continue this list. Let's see if that works when I press enter. Okay, so you see that it started a list, and it's now giving me some of those landmarks. Notice that it's trying to give a lot of information here. Now, you might not be able to fully trust some of this information; remember that this is all just a recollection of internet documents. The things that occur very frequently in the internet data are probably more likely to be remembered correctly than things that happen very infrequently. So you can't fully trust some of the information here, because it's all just a vague recollection of internet documents: the information is not stored explicitly in any of the parameters, it's all just recollection. That said, we did get something that is probably approximately correct, though I don't actually have the expertise to verify it. You see that we've elicited a lot of the knowledge of the model, and this knowledge is not precise and exact; it is vague, probabilistic, and statistical, and the kinds of things that occur often are the kinds of things that are more likely to be remembered in the model. Now I want to show you a few more examples of this model's behavior.
The first thing I want to show you is this example. I went to the Wikipedia page for zebra, and let me just copy-paste even the first sentence here. Now when I hit enter, what kind of completion are we going to get? Let me just hit enter.
"There are three living species", etc. What the model is producing here is an exact regurgitation of this Wikipedia entry. It is reciting this Wikipedia entry purely from memory, and this memory is stored in its parameters. It is possible that at some point in these 512 tokens the model will stray away from the Wikipedia entry, but you can see that it has huge chunks of it memorized. Let me see, for example, if this sentence occurs by now.
Okay, so we're still on track. Let me check here.
Okay, we're still on track. It will eventually stray away.
So this thing is just recited to a very large extent. It will eventually deviate, because it won't be able to remember exactly. Now, the reason this happens is that these models can be extremely good at memorization, and usually this is not what you want in the final model. This is called regurgitation, and it's usually undesirable to cite things directly that you have trained on. The reason it happens here is that for documents deemed to be of very high quality as a source, like Wikipedia, it is very often the case that when you train the model, you will preferentially sample from those sources. So the model has probably done a few epochs on this data, meaning it has seen this web page maybe 10 times or so. It's a bit like when you read some text many, many times: say you read something a hundred times, then you'll be able to recite it. It's very similar for this model: if it sees something way too often, it's going to be able to recite it later from memory, except these models can be a lot more efficient per presentation than a human. So probably it has only seen this Wikipedia entry about 10 times, but it has basically remembered this article exactly in its parameters. Okay, the next thing I want to show you is something that the model has definitely not seen during its training.
For example, if we go to the paper and navigate to the pre-training data, we'll see that the dataset has a knowledge cutoff at the end of 2023. So it will not have seen documents after this point, and it has certainly not seen anything about the 2024 election and how it turned out. Now, if we prime the model with tokens from the future, it will continue the token sequence and just take its best guess according to the knowledge it has in its own parameters. So let's take a look at what that could look like. So: the Republican Party ticket, Donald Trump, president of the United States from 2017, and let's see what it says after this point. The model will have to guess at the running mate, who the ticket ran against, etc. So let's hit enter.
So here it thinks that Mike Pence was the running mate instead of JD Vance, and that the ticket ran against Hillary Clinton and Tim Kaine. So this is kind of an interesting parallel universe of what could have happened, according to the LLM. Let's get a different sample: the identical prompt, and let's resample.
So here the running mate was Ron DeSantis, and they ran against Joe Biden and Kamala Harris. This is again a different parallel universe. The model takes educated guesses and continues the token sequence based on its knowledge. All of what we're seeing here is what's called hallucination: the model is just taking its best guess in a probabilistic manner. The next thing I would like to show you is that even though this is a base model and not yet an assistant model, it can still be utilized in practical applications if you are clever with your prompt design.
So here's something that we would call a few-shot prompt. What I have here is 10 pairs, and each pair is a word of English, a colon, and then the translation in Korean. And at the end we have "teacher:", and this is where we're going to do a completion of, say, just five tokens. These models have what we call in-context learning abilities. What that refers to is that as the model is reading this context, it is learning, sort of in place, that there's some kind of algorithmic pattern going on in my data, and it knows to continue that pattern. This is called in-context learning. So it takes on the role of a translator, and when we run the completion, we see that the translation of "teacher" is 선생님 (seonsaengnim), which is correct. So this is how you can build apps by being clever with your prompting, even though we still only have a base model for now. It relies on this in-context learning ability, and it is done by constructing what's called a few-shot prompt.
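As a concrete illustration, here is a minimal sketch of how such a few-shot prompt could be assembled programmatically; the word pairs are my own illustrative examples, not the exact ones shown on screen:

```python
# Build an English -> Korean few-shot prompt for a base model.
pairs = [
    ("apple", "사과"),
    ("house", "집"),
    ("water", "물"),
    ("book", "책"),
    ("school", "학교"),
]

prompt = "\n".join(f"{english}: {korean}" for english, korean in pairs)
prompt += "\nteacher:"  # leave the last pair open for the model to complete

print(prompt)
# Fed to a base model, the most likely continuation is the Korean word for
# "teacher" (선생님): pure in-context pattern completion, no fine-tuning.
```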
Okay, and finally, I want to show you that there is a clever way to instantiate a whole language model assistant just by prompting. The trick to it is that we're going to structure a prompt to look like a web page that is a conversation between a helpful AI assistant and a human, and then the model will continue that conversation. Actually, to write the prompt, I turned to ChatGPT itself, which is kind of meta: I told it I want to create an LLM assistant, but all I have is a base model, so can you please write my prompt? And this is what it came up with, which is actually quite good: "Here's a conversation between an AI assistant and a human. The AI assistant is knowledgeable, helpful, capable of answering a wide variety of questions", etc. And then, it's not enough to just give it a description. It works much better if you create a few-shot prompt. So here are a few turns of human, assistant, human, assistant: a few turns of conversation. And then at the end is where we put the actual query that we'd like. So let me copy-paste this into the base model prompt. And now let me write "Human:", and this is where we put our actual prompt: why is the sky blue?
And let's run it. "Assistant: The sky appears blue due to the phenomenon called Rayleigh scattering", etc. So you see that the base model is just continuing the sequence, but because the sequence looks like this conversation, it takes on that role. It is a little subtle, because here it just ends the assistant turn and then hallucinates the next question by the human, and so on; it'll just continue going on and on. But you can see that we have sort of accomplished the task. If you just took this "why is the sky blue", refreshed, and put it in by itself, then of course we don't expect this to work with the base model; who knows what we're going to get. Okay, we're just going to get more questions.
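Here is a sketch of what such a prompt wrapper could look like in code; the system description and the few-shot turns are my own placeholder wording, not the exact prompt ChatGPT produced in the video:

```python
SYSTEM = ("Here's a conversation between a helpful, knowledgeable AI "
          "assistant and a human.\n\n")

# A couple of hand-written example turns: these do most of the work of
# locking the base model into the assistant role.
FEW_SHOT = ("Human: What is the capital of France?\n"
            "Assistant: The capital of France is Paris.\n\n"
            "Human: How many days are in a week?\n"
            "Assistant: There are seven days in a week.\n\n")

def assistant_prompt(user_query: str) -> str:
    # End on "Assistant:" so the base model's most likely continuation
    # is the answer turn rather than another human question.
    return SYSTEM + FEW_SHOT + f"Human: {user_query}\nAssistant:"

print(assistant_prompt("Why is the sky blue?"))
```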
Okay, so this is one way to create an assistant even though you may only have a base model. And this is a brief summary of the things we talked about over the last few minutes.
Now let me zoom out here.
And this is what we've talked about so far. We wish to train LLM assistants like ChatGPT. We've discussed the first stage of that, which is the pre-training stage. And we saw that really what it comes down to is: we take internet documents, we break them up into these tokens, these atoms of little text chunks, and then we predict token sequences using neural networks. The output of this entire stage is the base model; it is the setting of the parameters of this network. And this base model is basically an internet document simulator on the token level. It can generate token sequences that have the same kind of statistics as internet documents.
And we saw that we can use it in some applications, but we actually need to do better. We want an assistant: we want to be able to ask questions and have the model give us answers. So we now need to go into the second stage, which is called the post-training stage. We take our base model, our internet document simulator, and hand it off to post-training. We're now going to discuss a few ways to do what's called post-training of these models. These stages in post-training are going to be computationally much less expensive. Most of the computational work, all of the massive data centers, all of the heavy compute and millions of dollars, is in the pre-training stage. But now we go into the slightly cheaper, but still extremely important, stage called post-training, where we turn this LLM into an assistant. So let's
take a look at how we can get our model to not sample internet documents but to give answers to questions. In other words, we want to start thinking about conversations. These are conversations that can be multi-turn, i.e. multiple turns, and in the simplest case they are a conversation between a human and an assistant. For example, we can imagine the conversation could look something like this: when a human says "What is 2 plus 2?", the assistant should respond with something like "2 plus 2 is 4". When a human follows up and says "What if it was a * instead of a +?", the assistant could respond with something like this. And similarly here, this is another example showing that the assistant could also have some kind of a personality, that it's kind of nice. And then in the third example, I'm showing that when a human asks for something that we don't wish to help with, we can produce what's called a refusal: we can say that we cannot help with that. In other words, what we want to do now is think through how an assistant should interact with a human, and we want to program the assistant and its behavior in these conversations. Now, because this is
neural networks, we're not going to be programming these behaviors explicitly in code. Everything is done through neural network training on datasets. And so because of that, we are going to be implicitly programming the assistant by creating datasets of conversations. These are three independent examples of conversations in a dataset. An actual dataset, and I'm going to show you examples, will be much larger: it could have hundreds of thousands of conversations that are multi-turn, very long, etc., and would cover a diverse breadth of topics. But here I'm only showing three examples. The way this works, basically, is that the assistant is being programmed by example. And where is this data coming from, like "2 * 2 = 4, same as 2 + 2", etc.? It comes from human labelers. We will give human labelers some conversational context and ask them to give the ideal assistant response in this situation, and a human will write out the ideal response for the assistant in that situation. And then we're going to get the model to train on this and to imitate those kinds of responses.
So the way this works is: we take our base model, which we produced in the pre-training stage and which was trained on internet documents, we throw out the dataset of internet documents and substitute a new dataset, a dataset of conversations, and we continue training the model on these conversations. What happens is that the model will very rapidly adjust and learn the statistics of how this assistant responds to human queries. Then later, during inference, we'll be able to prime the assistant, get the response, and it will be imitating what human labelers would do in that situation, if that makes sense. We're going to see examples of that, and this is going to become a bit more concrete. I also want to mention that in this post-training stage we basically just continue training the model, but the pre-training stage can in practice take roughly three months of training on many thousands of computers, while the post-training stage will typically be much shorter, like three hours, for example. That's because the dataset of conversations that we create manually is much smaller than the dataset of text on the internet. So this training will be very short, but fundamentally we're just going to take our base model and continue training using the exact same algorithm, the exact same everything, except we're swapping out the dataset for conversations.
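In pseudocode, the point is that nothing about the training loop itself changes between the two stages; only the data does. Here is a conceptual sketch, with stub classes standing in for a real model and framework (none of these names are a real API, and the token ids are arbitrary placeholders):

```python
class Model:
    """Stub standing in for the neural network being trained."""
    def next_token_loss(self, tokens):
        return 0.0  # cross-entropy of next-token prediction (stub)
    def update(self, loss):
        pass        # one optimizer step (stub)

def train(model, sample_sequence, steps):
    # The exact same loop serves both pre-training and post-training (SFT).
    for _ in range(steps):
        tokens = sample_sequence()  # one tokenized training sequence
        model.update(model.next_token_loss(tokens))

model = Model()
# Pre-training: months on thousands of GPUs, web documents as data.
train(model, sample_sequence=lambda: [791, 7160, 374], steps=3)
# Post-training: identical algorithm, tokenized conversations swapped in.
train(model, sample_sequence=lambda: [200, 64, 119], steps=3)
```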
So the questions now are: where do these conversations come from? How do we represent them? How do we get the model to see conversations instead of just raw text? And what are the outcomes of this kind of training, in a certain psychological sense, when we talk about the model? Let's turn to those questions now, starting with the tokenization of conversations.
Everything in these models has to be turned into tokens, because everything is just about token sequences. So the question is how we turn conversations into token sequences. For that, we need to design some kind of an encoding. This is somewhat similar to, if you're familiar with it (you don't have to be), a TCP/IP packet on the internet: there are precise rules and protocols for how you represent information and how everything is structured together, so that the data is laid out in a documented way that everyone can agree on. The same thing is now happening in LLMs: we need data structures, and we need rules around how these data structures, like conversations, get encoded to and decoded from tokens. So I want to show you now how I would recreate this conversation in token space. If you go to TikTokenizer, I can take that conversation, and this is how it is represented for the language model. Here we are alternating between a user and an assistant in this two-turn conversation, and what you're seeing here looks ugly but is actually relatively simple. The way it gets turned into a token sequence at the end is a little bit involved, but in the end this conversation between a user and an assistant ends up being a one-dimensional sequence of 49 tokens, and these are the tokens. Okay.
All the different LLMs will have a slightly different format or protocol, and it's a little bit of a wild west right now. But, for example, GPT-4 does it in the following way. You have this special token called im_start, which is short for "imaginary monologue start"; I don't actually know why it's called that, to be honest. Then you have to specify whose turn it is, for example "user", which is token 1428. Then you have the imaginary monologue separator, then the exact question, so the tokens of the question, and then you close it with im_end, the end of the imaginary monologue. So the question from a user, "what is 2 plus 2", ends up being this sequence of tokens. And the important thing to mention here is that im_start is not text, right? im_start is a special token that gets added. It's a new token that has never been trained on so far; it is a new token that we create and introduce in the post-training stage. These special tokens, like im_sep, im_start, etc., are introduced and interspersed with text so that the model learns: hey, this is the start of a turn; whose turn is it? The turn is for the user; then this is what the user says; then the user turn ends; then a new turn starts, and it's by the assistant; and then what does the assistant say? Well, these are the tokens of what the assistant says, etc. And so this conversation is now turned into this sequence of tokens.
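To make the encoding concrete, here is a sketch of how a structured conversation might be flattened into a single string before tokenization, loosely following the im_start style described above. The exact spelling of the special tokens varies by model; treat these as illustrative:

```python
conversation = [
    {"role": "user",      "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
    {"role": "user",      "content": "What if it was * instead of +?"},
]

def render(conv):
    # Flatten the structured conversation into one string of text plus
    # special tokens; the tokenizer then maps this to a 1-D token sequence.
    out = ""
    for turn in conv:
        out += f"<|im_start|>{turn['role']}<|im_sep|>{turn['content']}<|im_end|>\n"
    return out

print(render(conversation))
```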
The specific details here are not actually that important. All I'm trying to show you, in concrete terms, is that our conversations, which we think of as a kind of structured object, end up being turned, via some encoding, into one-dimensional sequences of tokens. And because it's a one-dimensional sequence of tokens, we can apply all the stuff that we applied before: it's just a sequence of tokens, and now we can train a language model on it. We're just predicting the next token in a sequence, just like before, and we can represent and train on conversations. Then what does it look like at test time, during inference? Say we've trained a model on these kinds of datasets of conversations, and now we want to do inference.
So during inference, what does this look like when you're on ChatGPT? Well, you come to ChatGPT and you have, say, a dialogue with it. The way this works is, say that this was already filled in: "What is 2 plus 2?" "2 plus 2 is 4." And now you issue "What if it was * instead?". What basically ends up happening on the servers of OpenAI, or something like that, is that they put in im_start, assistant, im_sep, and this is where they end it, right here. They construct this context, and now they start sampling from the model. It's at this stage that they go to the model and say: okay, what is a good first token? What is a good second token? What is a good third token? And this is where the LLM takes over and creates a response, for example a response that looks something like this. It doesn't have to be identical to this, but it will have the flavor of this, if this kind of a conversation was in the dataset. So that's roughly how the protocol works, although the details of this protocol are not important.
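Here is a minimal sketch of that inference-time priming, in the same illustrative token format as the earlier sketch; model_sample_token() is a placeholder for the actual network plus sampler, not a real API:

```python
def model_sample_token(context: str) -> str:
    # Placeholder: a real system runs the neural network on the context and
    # samples one token from the output distribution. Here we just end the
    # turn immediately so the sketch terminates.
    return "<|im_end|>"

history = (
    "<|im_start|>user<|im_sep|>What is 2+2?<|im_end|>"
    "<|im_start|>assistant<|im_sep|>2+2 = 4<|im_end|>"
    "<|im_start|>user<|im_sep|>What if it was * instead of +?<|im_end|>"
)

# The server leaves the assistant's turn open...
context = history + "<|im_start|>assistant<|im_sep|>"

# ...and samples tokens one at a time until the turn closes.
response = ""
while not response.endswith("<|im_end|>"):
    response += model_sample_token(context + response)
```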
So again, my goal is just to show you that everything ends up being a one-dimensional token sequence, so we can apply everything we've already seen, but we're now training on conversations, and we're now basically generating conversations as well. Okay, so now I would like to turn to what these datasets look like in practice.
The first paper I would like to show you, and the first effort in this direction, is this paper from OpenAI in 2022, called InstructGPT, after the technique they developed. This was the first time OpenAI talked about how you can take language models and fine-tune them on conversations. This paper has a number of details that I'd like to take you through. The first stop I'd like to make is in section 3.4, where they talk about the human contractors they hired, in this case from Upwork or through Scale AI, to construct these conversations. There are human labelers involved whose professional job it is to create these conversations. These labelers are asked to come up with prompts, and then they are asked to also complete the ideal assistant responses. These are the kinds of prompts that people came up with, so these are from human labelers: "List five ideas for how to regain enthusiasm for my career." "What are the top 10 science fiction books I should read next?" And there are many different types of prompts here: "Translate this sentence into Spanish", etc. So there are many things here that people came up with. They first come up with the prompt, and then they also answer that prompt, giving the ideal assistant response.
Now, how do they know what the ideal assistant response is that they should write for these prompts? When we scroll down a little further, we see this excerpt of labeling instructions that are given to the human labelers. The company that is developing the language model, for example OpenAI, writes up labeling instructions for how the humans should create ideal responses. Here, for example, is an excerpt of those instructions. On a high level, you're asking people to be helpful, truthful, and harmless. You can pause the video if you'd like to see more, but on a high level it's basically: try to be helpful, try to be truthful, and don't answer questions that we don't want the system to handle later in ChatGPT. So roughly speaking, the company comes up with the labeling instructions. Usually they are not this short; usually they are hundreds of pages, and people have to study them professionally. Then they write out the ideal assistant responses following those labeling instructions. So this is a very human-heavy process, as it was described in this paper.
Now, the dataset for InstructGPT was never actually released by OpenAI, but we do have some open-source reproductions that tried to follow this kind of setup and collect their own data. One that I'm familiar with, for example, is the OpenAssistant effort from a while back, and this is just one of many examples, but I just want to show you one. Here, people on the internet were asked to create these conversations, similar to what OpenAI did with human labelers. Here's an entry from a person who came up with this prompt: "Can you write a short introduction to the relevance of the term monopsony in economics? Please use examples", etc. And then the same person, or potentially a different person, will write up the response. So here's the assistant response to this: the same or a different person writes out this ideal response. And then this is an example of how the conversation could continue: "Now explain it to a dog", and then you can try to come up with a slightly simpler explanation, or something like that.
This then becomes the label, and we end up training on this. Now, during training we're of course not going to have full coverage of all the possible questions that the model will encounter at test time; we can't possibly cover all the prompts that people are going to ask in the future. But if we have a dataset with a number of these examples, then during training the model will start to take on the persona of this helpful, truthful, harmless assistant. It's all programmed by example. These are all examples of behavior, and if you have conversations of these example behaviors, and you have enough of them, like 100,000, and you train on them, the model starts to understand the statistical pattern and takes on the personality of this assistant. Now, it's possible that when you get the exact same question at test time, the answer will be recited exactly as it was in the training set. But more likely, the model will do something of a similar vibe: it will understand that this is the kind of answer that you want. So that's what we're doing: we're programming the system by example, and the system statistically adopts the persona of this helpful, truthful, harmless assistant, which is reflected in the labeling instructions that the company creates. Now I want to show you that the state-of-the-art has advanced in the last two or three years since the InstructGPT paper.
In particular, it's no longer very common for humans to be doing all the heavy lifting by themselves, because we now have language models, and these language models are helping us create these datasets of conversations. It is now very rare that people will literally write out a response from scratch; it is a lot more likely that they will use an existing LLM to come up with an answer and then edit it, or things like that. There are many different ways in which LLMs have started to permeate this post-training stack, and LLMs are basically used pervasively to help create these massive datasets of conversations. UltraChat is one example of a more modern conversation dataset. It is to a very large extent synthetic, but I believe there's some human involvement; I could be wrong on that. Usually there will be a little bit of human input, but a huge amount of synthetic help. This is all constructed in different ways, and UltraChat is just one of the many SFT datasets that currently exist. The only thing I want to show you is that these datasets now have millions of conversations. These conversations are mostly synthetic, but they're probably edited to some extent by humans, and they span a huge diversity of areas, and so on. These are fairly extensive artifacts by now, and there are all these SFT mixtures, as they're called: a mixture of lots of different types and sources, partially synthetic, partially human. Things have gone in that direction since, but roughly speaking we still have SFT datasets. They're made up of conversations, and we're training on them just like we did before.
And I guess the last thing to note is that I want to dispel a little bit of the magic of talking to an AI. When you go to ChatGPT, give it a question, and hit enter, what comes back is statistically aligned with what's happening in the training set. And these training sets really just have their seed in humans following labeling instructions. So what are you actually talking to in ChatGPT, and how should you think about it? Well, it's not coming from some magical AI. Roughly speaking, it's coming from something that is statistically imitating human labelers, who in turn follow labeling instructions written by these companies. It's almost as if you're asking a human labeler: imagine that the answer given to you by ChatGPT is some kind of simulation of a human labeler, as if you were asking what a human labeler would say in this kind of a conversation. And this human labeler is not just a random person from the internet, because these companies actually hire experts. For example, for questions about code and so on, the human labelers involved in the creation of these conversation datasets will usually be educated expert people, and you're kind of asking a question of a simulation of those people, if that makes sense. So you're not talking to a magical AI; you're talking to an average labeler. This average labeler is probably fairly highly skilled, but you're talking to a kind of instantaneous simulation of the kind of person that would be hired in the construction of these datasets.
Let me give you one more specific example before we move on. When I go to ChatGPT and say "recommend the top five landmarks to see in Paris", and then I hit enter. Okay, here we go. When I hit enter, what's coming out here? How do I think about it? Well, it's not some kind of magical AI that has gone out and researched all the landmarks and then ranked them using its infinite intelligence. What I'm getting is a statistical simulation of a labeler that was hired by OpenAI; you can think about it roughly that way. And so if this specific question is in the post-training dataset somewhere at OpenAI, then I'm very likely to see an answer that is probably very similar to what that human labeler would have put down for those five landmarks. How does the human labeler come up with this? Well, they go on the internet, do their own little research for 20 minutes, and come up with a list. If they come up with this list and it's in the dataset, I'm very likely to see what they submitted as the correct answer from the assistant. Now, if this specific query is not part of the post-training dataset, then what I'm getting here is a bit more emergent, because the model understands, statistically, that the kinds of landmarks that appear in the training set are usually the prominent landmarks, the landmarks that people usually want to see, the kinds of landmarks that are very often talked about on the internet. And remember that the model already has a ton of knowledge from its pre-training on the internet, so it has probably seen a ton of conversations about Paris, about landmarks, about the kinds of things people like to see. It's the pre-training knowledge combined with the post-training dataset that results in this kind of imitation. So that's roughly how you can think about what's happening behind the scenes here, in the statistical sense.
Okay, now I want to turn to the topic of what I like to call LLM psychology: what are the emergent cognitive effects of the training pipeline that we have for these models? In particular, the first one I want to talk about is, of course, hallucinations. You might be familiar with model hallucinations: it's when LLMs make stuff up, just totally fabricate information, etc. It's a big problem with LLM assistants. It was a problem that existed to a large extent with early models, many years ago, and I think the problem has gotten a bit better, because there are some mitigations that I'm going to go into in a second. For now, let's just try to understand where these hallucinations come from. So here's a specific example of three conversations that you might have in your training set; these are pretty reasonable conversations that you could imagine being in the training set.
So, for example: "Who is Tom Cruise?" Well, Tom Cruise is a famous American actor and producer, etc. "Who is John Barrasso?" This turns out to be a US senator, for example. "Who is Genghis Khan?" Well, Genghis Khan was blah blah blah. So this is what your conversations could look like at training time. Now, the problem is that when the human writes the correct answer for the assistant in each one of these cases, the human either knows who this person is or researches them on the internet, and they write a response that has the confident tone of an answer. What happens at test time is that when you ask about someone like "Orson Kovats", a totally random name I just made up, and as far as I know this person doesn't exist, the assistant will not just tell you "oh, I don't know". Even if the language model itself might know, inside its features, inside its activations, inside its brain, so to speak, that this person is not someone it's familiar with, even if some part of the network kind of knows that in some sense, saying "oh, I don't know who this is" is not going to happen, because the model statistically imitates its training set. In the training set, questions of the form "who is X" are confidently answered with the correct answer. So it's going to take on the style of the answer, it's going to do its best, it's going to give you statistically the most likely guess, and it's just going to basically make stuff up. These models, again, as we just discussed, don't have access to the internet; they're not doing research. They are statistical token tumblers, as I call them: they just try to sample the next token in the sequence, and they will basically make stuff up. So let's take a look at what this looks like.
I have here what's called the Inference Playground from Hugging Face, and I am on purpose picking on a model called Falcon 7B, which is an old model, a few years old by now. Because it's an older model, it suffers from hallucinations, which, as I mentioned, have improved over time. Let's ask: who is Orson Kovats? Let's ask Falcon 7B Instruct and run.
"Orson Kovats is an American author and science fiction writer." Okay, this is totally false; it's a hallucination. Let's try again. These are statistical systems, right? So we can resample. This time, Orson Kovats is a fictional character from a 1950s TV show. It's total BS, right? Let's try again: he's a former minor league baseball player. So basically the model doesn't know, and it's giving us lots of different answers because it's just sampling from these probabilities. The model starts with the tokens "who is Orson Kovats", assistant, and then it comes in here, gets these probabilities, samples from them, and just comes up with stuff. And that stuff is actually statistically consistent with the style of the answers in its training set, but you and I experience it as made-up factual knowledge. Keep in mind that the model basically doesn't know: it is just imitating the format of the answer, and it's not going to go off and look it up, because it's just imitating the answer.
So how can we mitigate this? For example, when I go to ChatGPT and say "who is Orson Kovats?", I'm now asking the state-of-the-art model from OpenAI, and this model is actually even smarter: you saw very briefly that it said "searching the web" (we're going to cover this later); it's actually trying to do tool use, and it came up with some kind of a story. But let me ask it about Orson Kovats without any tools; I don't want it to do a web search. And it answers along the lines of: there's no well-known historical or public figure named Orson Kovats. So this model is not going to make stuff up. It knows that it doesn't know, and it tells you that this doesn't appear to be a person it knows about. So somehow we've improved on hallucinations, even though they clearly are an issue in older models. And it makes total sense why you would get these kinds of answers if this is what your training set looks like.
So how do we fix this? Well, clearly we need some examples in our dataset where the correct answer for the assistant is that the model doesn't know about some particular fact. But we only need those answers in the cases where the model actually doesn't know. So the question is: how do we know what the model knows or doesn't know? Well, we can empirically probe the model to figure that out. Let's take a look, for example, at how Meta dealt with hallucinations for the Llama 3 series of models. In the paper they published, we can go to the section on hallucinations, which they call factuality, and they describe the procedure by which they interrogate the model to figure out what it knows and doesn't know, to find the boundary of its knowledge, and then they add examples to the training set where, for the things the model doesn't know, the correct answer is that it doesn't know them. This sounds like a very easy thing to do in principle, but it roughly fixes the issue. And the reason it fixes the issue is that the model might actually have a pretty good model of its own self-knowledge inside the network. Remember we looked at the network and all those neurons inside it: you might imagine there's a neuron somewhere in the network that lights up when the model is uncertain. The problem is that the activation of that neuron is not currently wired up to the model actually saying, in words, that it doesn't know. So even though the internals of the neural network know, because some neurons represent that, the model will not surface it; it will instead take its best guess so that it sounds confident, just like it sees in the training set. So we need to interrogate the model and allow it to say "I don't know" in the cases where it doesn't know. Let me take you through what Meta roughly does. Here I have an example.
Dominik Hašek is the featured article on Wikipedia today; I just went there randomly. What they do is basically take a random document from the training set, take a paragraph, and then use an LLM to construct questions about that paragraph. For example, I did that with ChatGPT here. I said: here's a paragraph from this document; generate three specific factual questions based on this paragraph, and give me the questions and the answers. And the LLMs are already good enough to create and reframe this information: if the information is in the context window of the LLM, this actually works pretty well. It doesn't have to rely on its memory; the text is right there in the context window, so it can reframe that information with fairly high accuracy. For example, it can generate questions for us like "For which team did he play?", here's the answer, "How many Cups did he win?", etc.
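Here is a sketch of that question-generation step; llm() is a hypothetical helper that sends a prompt to a chat model and returns its text response, not a call from any particular library:

```python
def make_probe_questions(paragraph: str) -> str:
    # The paragraph goes straight into the context window, so the model can
    # reframe it accurately instead of relying on its (vague) memory.
    prompt = (
        "Here is a paragraph from a document:\n\n"
        f"{paragraph}\n\n"
        "Generate three specific factual questions based on this paragraph, "
        "and give me the questions and the answers."
    )
    return llm(prompt)  # hypothetical: returns e.g. "Q: ... A: ..." triples
```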
Now what we have to do is: we have some questions and answers, and we want to interrogate the model. Roughly speaking, we take our questions and go to our model, which at Meta would be Llama, but let's just interrogate Mistral 7B here as an example; that's another model. Does this model know the answer? Let's take a look. So, he played for the Buffalo Sabres, right? So the model knows. And the way you can programmatically decide this is: we take this answer from the model, and we compare it to the correct answer. Again, the models are good enough to do this automatically, so no humans are involved here: we can take the answer from the model and use another LLM judge to check whether it is correct according to the reference answer. If it is correct, that means the model probably knows. So we're going to do this maybe a few times. Okay, it knows it's the Buffalo Sabres. Let's try again: Buffalo Sabres. Let's try one more time: Buffalo Sabres. So we asked this factual question three times, and the model seems to know. Everything is great.
Now let's try the second question: how many Stanley Cups did he win? And again, let's interrogate the model about that. The correct answer is two. Here the model claims that he won four times, which is not correct, right? It doesn't match two. So the model doesn't know; it's making stuff up. Let's try again. Here the model is again kind of making stuff up. Let's try again: here it says he did not even win during his career. So obviously
the model doesn't know. And the way we can tell programmatically, again, is that we interrogate the model, maybe three times, five times, whatever it is, and compare its answers to the correct answer. If the answers don't match, then we know that the model doesn't know this question. Then we take this question and create a new conversation in the training set: when the question is "How many Stanley Cups did he win?", the answer is "I'm sorry, I don't know" or "I don't remember". And that's the correct answer for this question, because we interrogated the model and saw that this is the case. If you do this for many different types of questions, for many different documents, you are giving the model an opportunity, in its training set, to refuse based on its knowledge. And if you have even a few examples of that in your training set, the model has the opportunity to learn the association between this knowledge-based refusal and that internal neuron of uncertainty somewhere in its network, which we presume exists, and empirically this turns out to be probably the case. It can learn the association: hey, when this neuron of uncertainty is high, I actually don't know, and I'm allowed to say "I'm sorry, but I don't think I remember this", etc. If you have these examples in your training set, this is a large mitigation for hallucinations, and that's roughly why ChatGPT is able to do this kind of thing as well. So these are the kinds of mitigations that people have implemented, and they have improved the factuality issue over time.
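Putting the probing procedure together, here is a sketch of the whole loop; model() and judge() are placeholder helpers (the probed model's sampler and an LLM grader, respectively), not a real API:

```python
def model_knows(question: str, reference: str, tries: int = 3) -> bool:
    # Resample the model a few times and grade each attempt with an LLM judge.
    answers = [model(question) for _ in range(tries)]
    return all(judge(answer, reference) for answer in answers)

def build_refusal_examples(qa_pairs):
    # For questions the model reliably gets wrong, emit a training
    # conversation whose correct answer is a refusal, so the refusal gets
    # wired to the model's internal sense of uncertainty.
    for question, reference in qa_pairs:
        if not model_knows(question, reference):
            yield {"user": question,
                   "assistant": "I'm sorry, I don't know."}
```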
Okay, so I've described mitigation number one for the hallucination issue. Now we can actually do much better than that. Instead of just saying that we don't know, we can introduce an additional mitigation, number two, to give the LLM an opportunity to be factual and actually answer the question. What do you and I do if someone asks us a factual question and we don't know the answer? Well, we could go off and do some search, use the internet, figure out the answer, and then report it. And we can do the exact same thing with these models. Think of the knowledge inside the neural network, inside its billions of parameters, as a kind of vague recollection of the things the model saw during pre-training, a long time ago. Think of that knowledge in the parameters as something you read a month ago. If you keep reading something, you will remember it, and the model remembers that; but if it's something rare, you probably won't have a good recollection of it. What you and I do in that case is just go and look it up. When you look something up, you're refreshing your working memory with information, and then you're able to retrieve it, talk about it, and so on. So we need some equivalent of allowing the model to refresh its memory, its recollection, and we can do that by introducing tools for the models.
The way we're going to approach this is that, instead of just saying "I'm sorry, I don't know", the model can attempt to use tools. We create a mechanism by which the language model can emit special tokens, new tokens that we introduce. For example, here I've introduced two tokens, and a format, a protocol, for how the model is allowed to use them. Instead of just answering the question when it does not know, the model now has the option to emit the special token SEARCH_START, followed by the query that will go to, say, bing.com in the case of OpenAI, or Google Search, or something like that. It emits the query and then it emits SEARCH_END. Then what happens is that the program that is sampling from the model, that is running the inference, when it sees the special token SEARCH_END, will pause generating from the model instead of sampling the next token in the sequence. It will open a session with bing.com, paste the search query in, get all the text that is retrieved, maybe represent it with some other special tokens, and copy-paste that text here, into what I've tried to show with the brackets. So all that text comes in here, and when it does, it enters the context window. The text from the web search is now inside the context window that feeds into the neural network. And you should think of the context window as the working memory of the model: data in the context window is directly accessible by the model; it directly feeds into the neural network. So it's no longer a vague recollection; it's data the model has in its context window, directly available. Now, when it samples new tokens afterwards, it can very easily reference the data that has been copy-pasted in there. So that's roughly how these tools function.
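Here is a sketch of the inference loop with such a web-search tool wired in; SEARCH_START/SEARCH_END stand for the newly introduced special tokens, and sample_token() and web_search() are placeholder helpers, not real APIs:

```python
def generate(context: str) -> str:
    # Sample one token at a time; intercept the tool-use protocol.
    while True:
        token = sample_token(context)  # placeholder: network + sampler
        context += token
        if token == "<SEARCH_END>":
            # Pause generation: pull out the query between the special
            # tokens, run the search, and paste the retrieved text back
            # into the context window (the model's working memory).
            query = context.split("<SEARCH_START>")[-1]
            query = query.removesuffix("<SEARCH_END>")
            context += "[" + web_search(query) + "]"
        elif token == "<END_OF_TURN>":
            return context
```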
And web search is just one of the tools; we're going to look at some of the others in a bit. But basically, you introduce new tokens, you introduce some schema by which the model can utilize these tokens and call these special functions, like the web search function. And how do you teach the model to correctly use these tools, like SEARCH_START, SEARCH_END, etc.? Again, you do that through training sets. We need a bunch of data, a bunch of conversations, that show the model by example how to use web search: what are the settings where you use the search, what does that look like, here's how you start a search, here's how you end a search, and so on. If you have a few thousand examples of that in your training set, the model will actually do a pretty good job of understanding how this tool works, and it will know how to structure its queries. And of course, because of the pre-training dataset and its understanding of the world, it already kind of understands what a web search is, so it has a pretty good native understanding of what makes a good search query. It all kind of just works: you just need a few examples to show it how to use this new tool, and then it can lean on it to retrieve information and put it in the context window. And that's equivalent to you and me looking something up: once it's in the context, it's in the working memory, and it's very easy to manipulate and access.
That's what we saw a few minutes ago when I asked ChatGPT who Orson Kovats is. The ChatGPT language model decided that this is some kind of rare individual or something like that, and instead of giving me an answer from its memory, it decided to sample a special token that does a web search. We saw something flash briefly, like "using the web tool" or something like that; it said that, we waited for about two seconds, and then it generated this. And you see how it's creating references here, citing sources. What happened is that it went off, did a web search, found these sources and these URLs, and the text of those web pages was all stuffed in between here. It's not shown, but it's basically stuffed in as text. And now it sees that text and references it, saying: okay, it could be these people (citation), it could be those people (citation), etc. That's what happened here, and that's why, when I asked who Orson Kovats is, I could also say "don't use any tools", and that's enough to convince ChatGPT to not use tools and just rely on its memory and its recollection. I also went off and asked ChatGPT: how many Stanley Cups did Dominik Hašek win? And ChatGPT actually decided that it knows the answer and has the confidence to say that he won twice. So here it just relied on its memory, presumably because it has enough confidence in its weights, its parameters, and its activations that this is retrievable just from memory. But you can also, conversely, use web search to make sure: for the same query, it goes off and searches, finds a bunch of sources, all of this stuff gets copy-pasted in there, and then it tells us two again, and it cites the Wikipedia article, which is the source of this information for us as well. So that's the web search tool: the model determines when to search, and that's roughly how these tools work. This is an additional mitigation for hallucinations and factuality.
So I want to stress one more time this very important psychological point: knowledge in the parameters of the neural network is a vague recollection; knowledge in the tokens that make up the context window is working memory. It works roughly like it does for us in our brains: the stuff we remember is our parameters, and the stuff we just experienced a few seconds or minutes ago is like the context window, which is being built up as you have a conscious experience around you. This has a bunch of implications for your use of LLMs in practice. For example, I can go to ChatGPT and say something like: "Can you summarize chapter 1 of Jane Austen's Pride and Prejudice?" This is a perfectly fine prompt, and ChatGPT actually does something relatively reasonable here. The reason it does is that ChatGPT has a pretty good recollection of a famous work like Pride and Prejudice: it has probably seen a ton of stuff about it, there are probably forums about this book, it has probably read versions of the book and articles about it, so it kind of remembers, enough to actually say all this. But usually, when I interact with LLMs and want them to recall specific things, it always works better if you just give the text to them. So I think a much better prompt is something like: "Can you summarize for me chapter 1 of Jane Austen's Pride and Prejudice? I am attaching it below for your reference." Then I add a delimiter and paste it in; I found chapter 1 by just copy-pasting it from some website. I do that because when the text is in the context window, the model has direct access to it: it doesn't have to recall it, it just has it. And so this summary can be expected to be of significantly higher quality than the previous one, just because the text is directly available to the model. I think you and I would work the same way: you would produce a much better summary if you had re-read the chapter right before you had to summarize it. That's basically what's happening here, or the equivalent of it.
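Here is a sketch of that "attach the text" pattern; the filename and delimiter are my own illustrative choices:

```python
# Put the source text directly into the context window (working memory)
# instead of relying on the model's vague recollection (parameters).
with open("pride_and_prejudice_ch1.txt") as f:  # hypothetical local copy
    chapter_one = f.read()

prompt = (
    "Can you summarize for me chapter 1 of Jane Austen's Pride and "
    "Prejudice? I am attaching it below for your reference.\n"
    "---\n"
    f"{chapter_one}\n"
    "---"
)
# The summary now depends on text the model can directly attend to, not on
# whatever it happens to remember from pre-training.
```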
The next psychological quirk I'd like to talk about briefly is the knowledge of self. What I see very often on the internet is that people ask LLMs something like "What model are you?" and "Who built you?" Basically, this question is a little bit nonsensical. The reason I say that is that, as I tried to explain with some of the under-the-hood fundamentals, this thing is not a person. It doesn't have a persistent existence in any way. It boots up, processes tokens, and shuts off, and it does that for every single person. It just builds up a context window of conversation, and then everything gets deleted. So this entity is restarted from scratch in every single conversation, if that makes sense. It has no persistent self. It has no sense of self. It's a token tumbler, and it follows the statistical regularities of its training set. So it doesn't really make sense to ask it who it is or who built it, and by default, if you just ask out of nowhere, you're going to get some pretty random answers.
So, for example, let's pick on Falcon, which is a fairly old model, and let's see what it tells us. At first it evades the question: "talented engineers and developers." Then it says it was built by OpenAI, based on the GPT-3 model. It's totally making stuff up. Now, a lot of people would take "built by OpenAI" as evidence that this model was somehow trained on OpenAI data or something like that. I don't actually think that's necessarily true. The reason is that if you don't explicitly program the model to answer these kinds of questions, then what you're going to get is its statistical best guess at the answer. This model had an SFT data mixture of conversations, and during the fine-tuning the model comes to understand that it's taking on the personality of a helpful assistant — but it wasn't told exactly what label to apply to itself. It has just taken on this persona of a helpful assistant. And remember that the pre-training stage took documents from the entire internet, and ChatGPT and OpenAI are very prominent in those documents. So I think what's actually likely happening here is that this is just its hallucinated label for what it is: its self-identity is "ChatGPT by OpenAI," and it's only saying that because there's a ton of data on the internet of answers like this that actually come from OpenAI's ChatGPT. That's its label for what it is.
Now, you can override this as a developer. If you have an LLM, you can actually override it, and there are a few ways to do that. For example, let me show you: there's this OLMo model from Allen AI. It's not a top-tier LLM or anything like that, but I like it because it is fully open source — the paper for OLMo and everything else is completely open, which is nice. So here we are looking at its SFT mixture, that is, the data mixture for the fine-tuning stage. This is the conversations dataset.
And the way they solve this for the OLMo model is that, among the roughly 1 million conversations in the mixture, there's an entry called "OLMo 2 hardcoded." If we go there, we see that it's 240 conversations, and they're literally hardcoded. "Tell me about yourself," says the user, and the assistant says: "I'm OLMo, an open language model developed by AI2, the Allen Institute for Artificial Intelligence. I'm here to help," blah blah blah. "What is your name?" and so on. So these are all kinds of cooked-up, hardcoded questions about OLMo 2 and the correct answers to give in each case. If you take 240 conversations like this, put them into your training set, and fine-tune with them, then the model will actually be expected to parrot this stuff later.
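Concretely, you can picture a few of those 240 entries looking roughly like this — the schema below is a sketch, not the dataset's actual format:

```python
# A sketch of hardcoded identity conversations in an SFT mixture.
# The exact schema is illustrative; the OLMo 2 set has 240 such entries.
identity_conversations = [
    [
        {"role": "user", "content": "Tell me about yourself."},
        {"role": "assistant", "content": "I'm OLMo, an open language model "
                                          "developed by AI2, the Allen "
                                          "Institute for Artificial Intelligence."},
    ],
    [
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is OLMo."},
    ],
    # ... 238 more along the same lines
]
# Fine-tune on these, and the model will parrot this identity when asked.
```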
If you don't give it this, then it will probably say it's by OpenAI. And there's one more way to do this: in these conversations between human and assistant, there's sometimes a special message, called the system message, at the very beginning of the conversation. So it's not just human and assistant — there's also a system. And in the system message you can hardcode a reminder to the model: hey, you are a model developed by OpenAI, your name is ChatGPT-4o, you were trained on this date, your knowledge cutoff is this. It basically documents the model a little bit, and it gets inserted into your conversations. So when you go on ChatGPT you see a blank page, but the system message is hidden in there, and those tokens are in the context window. So those are the two ways to program the models to talk about themselves: either through data like this, or through the system message — basically invisible tokens that sit in the context window and remind the model of its identity.
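In message form, the second mechanism looks roughly like this — the wording is invented for illustration, not OpenAI's actual system prompt:

```python
# A sketch of the system-message mechanism; the content here is made up
# and is not OpenAI's real system prompt.
conversation = [
    {"role": "system", "content": "You are ChatGPT, a model developed by "
                                   "OpenAI. Your knowledge cutoff is <date>."},
    {"role": "user", "content": "What model are you, and who built you?"},
    # The assistant's reply is now conditioned on those invisible system tokens.
]
```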
But either way, it's all just cooked up and bolted on in some way; the identity is not actually deeply there in any real sense, as it would be for a human. I want to now continue to the next section, which deals with the computational capabilities — or, I should say, the native computational capabilities — of these models in problem-solving scenarios.
In particular, we have to be very careful with these models when we construct our examples of conversations, and there are a lot of sharp edges here that are elucidative — is that a word? — interesting to look at, when we consider how these models think. So consider the following prompt from a human, and suppose we are building a conversation to enter into our training set; we're going to train the model on it, teaching it how to solve simple math problems. The prompt is: Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost is $13. What is the cost of the apples? A very simple math question. Now, there are two answers here, on the left and on the right. They are both correct: they both say the answer is 3. But one of these two is a significantly better answer for the assistant than the other. If I were the data labeler creating one of these, one of them would be a really terrible answer for the assistant, and the other would be okay. I'd like you to potentially pause the video and think through why one of these two is a significantly better answer than the other. If you use the wrong one, your model could actually end up really bad at math, with bad outcomes. This is something you would be careful about in your labeling documentation when training people to create the ideal responses for the assistant.
Okay, so the key to this question is to realize and remember that when the models are training, and also at inference time, they are working on a one-dimensional sequence of tokens, from left to right. This is the picture I often have in my mind: the token sequence evolving from left to right, and to produce the next token in the sequence, we feed all these tokens into the neural network, and the neural network gives us the probabilities for the next token. This picture here is the exact same picture we saw before, up above, and it comes from the web demo I showed you earlier: the calculation that takes the input tokens at the top, performs the operations of all these neurons, and gives you the probabilities for what comes next.
Now, the important thing to realize is that, roughly speaking, there's a finite number of layers of computation here. For example, this model has only one, two, three layers of what's called attention and MLP. A typical modern state-of-the-art network might have more like 100 layers, but even so, there are only on the order of 100 layers of computation to go from the previous token sequence to the probabilities for the next token. So there's a finite amount of computation per token, and you should think of it as a very small amount. This amount of computation is also almost roughly fixed for every token in the sequence. That's not fully true — the more tokens you feed in, the more expensive the forward pass becomes — but not by much. So, as a good mental model: a fixed amount of compute happens in this box for every single one of these tokens, and this amount of compute cannot possibly be too big, because there aren't that many layers going from top to bottom; there's not that much computation that can happen here. So you can't expect the model to do arbitrary computation in a single forward pass to get a single token. What that means is that we have to distribute our reasoning and our computation across many tokens, because every single token only has a finite amount of computation spent on it. We can't expect too much computation out of the model in any single individual token. Okay: a roughly fixed amount of computation per token.
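To make this concrete, here is a toy numeric sketch of the point — not any real model's architecture, just an illustration that each next-token prediction is one pass through the same small, fixed stack of layers:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n_layers = 50, 16, 3   # toy sizes; real models: ~100k vocab, ~100 layers

E = rng.normal(size=(vocab, d))                     # embedding table
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
U = rng.normal(size=(d, vocab)) / np.sqrt(d)        # unembedding

def next_token_probs(tokens):
    x = E[tokens].mean(axis=0)      # crude stand-in for attention mixing context
    for W in Ws:                    # a fixed, finite stack of layers:
        x = np.maximum(x @ W, 0)    # the same amount of work for every token
    logits = x @ U
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Each new token costs one pass through the same few layers, no matter how
# much "thinking" is supposed to be packed into that token.
print(next_token_probs([3, 7, 9]).shape)  # (50,)
```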
So that's why this answer here is significantly worse. Imagine going from left to right — I copy-pasted it right here: "The answer is $3," etc. Imagine the model having to go from left to right, emitting these tokens one at a time. We're expecting it to say "The answer is," space, dollar sign, and then right here we're expecting it to cram all of the computation of this problem into this single token: it has to emit the correct answer, 3. And once it has emitted the answer 3, we're expecting it to say all the following tokens. But at that point the answer has already been produced, and it's already in the context window for all the tokens that follow. So anything that comes after is just post hoc justification of why this is the answer, because the answer is already created; it's already in the token window. It's not actually being calculated there. So if you answer the question directly and immediately, you are training the model to try to guess the answer in a single token, and that is just not going to work, because of the finite amount of computation that happens per token.
That's why the answer on the right is significantly better: we are distributing the computation across the answer. We're getting the model to slowly come to the answer from left to right, producing intermediate results. We're saying: okay, the total cost of the oranges is $4; so 13 - 4 is 9; and so on. We're creating intermediate calculations, and each one of them is, by itself, not that expensive. We're basically guessing, a little bit, the difficulty the model is capable of in any single one of these individual tokens — and there can never be too much work in any one token computationally, because then the model won't be able to do it later at test time. So we're teaching the model to spread out its reasoning and its computation over the tokens. That way it faces only very simple problems in each token; those add up; and by the time it's near the end, it has all the previous results in its working memory, and it's much easier for it to determine the answer — and here it is: 3. So this is a significantly better label for our computation. The one on the left would be really bad: it teaches the model to try to do all the computation in a single token.
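Side by side, the two candidate labels look roughly like this, paraphrased from the example above:

```python
# The two candidate assistant labels, paraphrased.
bad_label = "The answer is $3."  # crams the whole calculation into one token

good_label = (
    "The oranges cost 2 * $2 = $4. "               # easy intermediate step
    "So the apples cost $13 - $4 = $9. "           # easy intermediate step
    "There are 3 apples, so each costs $9 / 3 = $3. "
    "The answer is $3."                            # answer comes after the work
)
```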
So that's an interesting thing to keep in mind. In your prompts, you usually don't have to think about this explicitly, because the people at OpenAI have labelers who worry about it and make sure the answers are spread out — so OpenAI will kind of do the right thing. When I ask this question of ChatGPT, it's actually going to go very slowly: it will define the variables, set up the equation, and create all these intermediate results. Those are not for you — they are for the model. If the model is not creating these intermediate results for itself, it's not going to be able to reach 3.
I also wanted to show you that it's possible to be a bit mean to the model — we can just ask for things. As an example, I gave it the exact same prompt and said: answer the question in a single token; just immediately give me the answer, nothing else. And it turns out that for this simple prompt, it actually was able to do it in a single go. Well, I think this is two tokens, because the dollar sign is its own token, so the model gave me two tokens rather than one — but it still produced the correct answer, and it did that in a single forward pass of the network.
Now, that's because the numbers here are very simple. So I made it a bit more difficult, to be a bit mean to the model: I said Emily buys 23 apples and 177 oranges, just making the numbers bigger and the problem harder, asking the model to do more computation in a single token. I asked the same thing, and this time it gave me 5 — and 5 is not correct. The model failed to do all of this calculation in a single forward pass of the network: going from the input tokens, in one pass through the network, it couldn't produce the result. Then I said: okay, now don't worry about the token limit, just solve the problem as usual. And then it produces all the intermediate results, it simplifies, and every one of these intermediate calculations is much easier for the model — it's not too much work per token. All of the tokens are correct, and it arrives at the solution, which is 7. It just couldn't squeeze all of that work into a single forward pass of the network.
So I think that's a cute example, and something to think about — again, elucidative in terms of how these models work. The last thing I'd say on this topic is that if I were trying to actually solve this in my day-to-day life, I might not trust that the model did all the intermediate calculations correctly. So probably what I would do is something like this: I would come here and say "use code," because code is one of the possible tools that ChatGPT can use.
Instead of having it do mental arithmetic like this — mental arithmetic I don't fully trust, especially if the numbers get really big — there's no guarantee the model will do this correctly; any one of the intermediate steps might in principle fail. We're using neural networks to do mental arithmetic, kind of like you doing mental arithmetic in your brain: it might just screw up some of the intermediate results. It's actually kind of amazing that it can even do this sort of mental arithmetic — I don't think I could do it in my head. But basically, the model is doing it in its head, and I don't trust that, so I want it to use tools. So you can say things like "use code" — and, I'm not sure what happened there, "use code" — and, like I mentioned, there's a special tool: the model can write code, and I can inspect that the code is correct, so it's not relying on its mental arithmetic. It writes Python and hands it to the Python interpreter, which calculates the result. I would personally trust this a lot more, because the answer came out of a Python program, which I think has far stronger correctness guarantees than the mental arithmetic of a language model.
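For this problem, the generated program might look something like the sketch below. The orange price and total are stand-ins, since those numbers aren't shown in the transcript; the structure of the calculation is the point.

```python
# Sketch of the kind of code the model writes when told to "use code",
# for the harder variant: 23 apples, 177 oranges.
total_cost = 869        # hypothetical total
orange_price = 4        # hypothetical price per orange
n_oranges, n_apples = 177, 23

apple_price = (total_cost - orange_price * n_oranges) / n_apples
print(apple_price)      # 7.0 -- computed by the interpreter, not by the model
```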
So that's another hint: if you have these kinds of problems, you may want to just ask the model to use the code interpreter. And just like we saw with web search, the model has special tokens for calling it: it writes the program, and that program gets sent to a different part of the computer, which actually runs it and brings back the result. The model then gets access to that result and can tell you: okay, the cost of each apple is 7. So that's another kind of tool, and I would use it in practice; it's just less error-prone, I would say. That's why I called this section "models need tokens to think": distribute your computation across many tokens, ask models to create intermediate results, or, whenever you can, lean on tools and tool use instead of letting the models do all of this in their memory. If they try to do it all in their memory, I don't fully trust it, and I prefer to use tools whenever possible.
I want to show you one more example of where this comes up, and that's counting. Models are actually not very good at counting, for the exact same reason: you're asking for way too much in a single individual token. Let me show you a simple example: "How many dots are below?" followed by a bunch of dots. ChatGPT says "There are..." and then tries to solve the problem in a single token. In a single token, it has to count the number of dots in its context window, and it has to do that in a single forward pass of the network — where, as we talked about, there's not that much computation that can happen; think of it as very little. If we look at what the model actually sees — let's go to Tiktokenizer — it sees "how many dots are below," and it turns out that these dots here, this group of, I think, 20 dots, is a single token; then this other group is another token; and then for some reason they break up like this. That has to do with the details of the tokenizer, but the model basically sees a handful of token IDs, and from those token IDs it's expected to count the dots. Spoiler alert: it's not 161; it's actually, I believe, 177.
So here's what we can do instead: we can say "use code." And you might ask why this should work — it's actually kind of subtle and interesting. When I say "use code," I actually expect this to work. Let's see. Okay, 177 — correct. What happens here is that, even though it doesn't look like it, I've broken the problem down into problems that are easier for the model. I know the model can't count — it can't do mental counting — but I know it's actually pretty good at copy-pasting. So when I say "use code," it creates a string in Python, and the task of copy-pasting my input into that string is very simple: the model sees the dots as just these few tokens, and it's easy for it to copy those token IDs over and unpack them into dots inside the string. Then it calls Python's count routine on the string and comes up with the correct answer. The Python interpreter is doing the counting. It's not the model's mental arithmetic doing the counting.
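The generated program amounts to something like this sketch, where the dots below stand in for the ones pasted from the prompt:

```python
# Sketch of the counting trick: the model copy-pastes the dots into a
# string, and the interpreter does the counting.
s = "." * 177           # stand-in for the dots copy-pasted from the prompt
print(s.count("."))     # 177 -- counted by Python, not in the model's head
```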
So it's again a simple example of "models need tokens to think." Don't rely on their mental arithmetic — and that's also why the models are not very good at counting. If you need them to do counting tasks, always ask them to lean on tools.
Now, the models also have many other little cognitive deficits here and there — sharp edges of the technology to be aware of. As an example, the models are not very good at all kinds of spelling-related tasks. And I told you we would loop back around to tokenization: the reason is that the models don't see characters, they see tokens. Their entire world is tokens, these little text chunks, so they don't see characters the way our eyes do, and very simple character-level tasks often fail. For example, I'm giving it the string "ubiquitous" and asking it to print only every third character, starting with the first one. So we start with "u," and then, counting 1, 2, 3, "q" should be next, and so on. And this, I see, is not correct.
Again, my hypothesis is that, number one, the mental arithmetic is failing a little bit here, but number two — and I think this is the more important issue — if you go to Tiktokenizer and look at "ubiquitous," you see it is three tokens. You and I see "ubiquitous" and can easily access the individual letters, because we see them; with the word in the working memory of our visual field, we can easily index into every third letter, and I can do that task. But the model doesn't have access to the individual letters: it sees the word as three tokens. And remember, these models are trained from scratch on the internet, so the model has to discover, from data, how many of which letters are packed into each of these tokens. The reason we even use tokens is mostly efficiency. A lot of people are interested in deleting tokens entirely — we should really have character-level or byte-level models — it's just that that would create very long sequences, and people don't know how to deal with that right now. So while we live in the token world, any kind of spelling task is not actually expected to work super well.
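You can see the token boundaries for yourself with the tiktoken library — the encoding name below is one common choice, and the Tiktokenizer web page shows the same thing interactively:

```python
# Inspect how "ubiquitous" breaks into tokens rather than letters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one common OpenAI encoding
ids = enc.encode("ubiquitous")
print(ids)                                   # a few token IDs, not 10 characters
print([enc.decode([i]) for i in ids])        # the chunks the model actually sees
```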
So, because I know that spelling is not a strong suit due to tokenization, I can again ask the model to lean on tools. I can just say "use code," and I would again expect this to work, because the task of copy-pasting "ubiquitous" into the Python interpreter is much easier, and then we're leaning on the interpreter to manipulate the characters of the string. So when I say "use code" for "ubiquitous": yes, it indexes into every third character, and the actual answer is "uqts," which looks correct to me.
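The whole fix is essentially one line of Python string indexing (a sketch):

```python
# Python indexes characters directly, sidestepping tokenization entirely.
print("ubiquitous"[::3])   # 'uqts' -- every third character, starting at the first
```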
So, again, an example of spelling-related tasks not working very well. A very famous recent example of that is: how many "r"s are there in "strawberry"? This went viral many times. The models now get it correct — they say there are three "r"s in "strawberry" — but for a very long time, all the state-of-the-art models insisted there are only two. And this caused a lot of, you know, ruckus — is that a word? I think so — because why are the models so brilliant, able to solve math olympiad questions, and yet they can't count "r"s in "strawberry"? The answer, which I've built up to slowly, is: number one, the models don't see characters, they see tokens; and number two, they are not very good at counting. So here we are combining the difficulty of seeing characters with the difficulty of counting, and that's why the models struggled with this. By now, honestly, I think OpenAI may have hardcoded the answer, or I'm not sure what they did, but this specific query now works.
So models are not very good at spelling, and there are a bunch of other little sharp edges; I don't want to go into all of them. I just want to show a few examples of things to be aware of when you're using these models in practice, not a comprehensive analysis of all the ways they fall short. There are some jagged edges here and there; we've discussed a few, and some of them make sense, but some will just not make as much sense — you're left scratching your head even if you understand in depth how these models work. A good recent example is the following: the models are not very good at very simple questions like this one, which is shocking to a lot of people, because these models can solve complex math problems and answer PhD-grade physics, chemistry, and biology questions much better than I can, but sometimes they fall short on super simple problems. So here we go: it says 9.11 is bigger than 9.9, and it justifies this in some way — but then at the end, okay, it actually flips its decision. I don't believe this is very reproducible: sometimes it flips its answer, sometimes it gets it right, sometimes it gets it wrong. Let's try again. Okay, "even though it might look larger" — so here it doesn't even correct itself at the end. If you ask many times, sometimes it gets it right, too.
But how is it that the model can do so well on olympiad-grade problems and then fail on very simple problems like this? As I mentioned, this one is a bit of a head-scratcher. It turns out a bunch of people studied this in depth. I haven't actually read the paper, but what I was told by this team is that when you scrutinize the activations inside the neural network — when you look at which features and neurons turn on or off — a bunch of neurons light up that are usually associated with Bible verses. So the model is kind of reminded that these almost look like Bible verse markers, and in a Bible verse setting, 9.11 would come after 9.9. So the model somehow finds it cognitively very distracting that in Bible verses 9.11 would be "greater." Even though it's actually trying to justify the answer with math, it still ends up with the wrong answer. So it basically just doesn't fully make sense, and it's not fully understood, and there are a few jagged issues like that. That's why you should treat this as what it is: a stochastic system that is really magical, but that you can't fully trust. You want to use it as a tool, not as something you just let rip on a problem and then copy-paste the results.
Okay, so we have now covered two major stages of training large language models. We saw that in the first stage, called pre-training, we are basically training on internet documents, and when you train a language model on internet documents, you get what's called a base model — basically an internet document simulator. We saw that this is an interesting artifact: it takes many months to train on thousands of computers, it's a kind of lossy compression of the internet, and it's extremely interesting, but it's not directly useful, because we don't want to sample internet documents — we want to ask questions of an AI and have it respond. For that we need an assistant, and we saw that we can construct an assistant in the process of post-training, specifically supervised fine-tuning, as we call it. In this stage, we saw, it's algorithmically identical to pre-training; nothing changes. The only thing that changes is the dataset. Instead of internet documents, we now create and curate a very nice dataset of conversations — say, a million conversations on all kinds of diverse topics between a human and an assistant. Fundamentally, these conversations are created by humans: humans write the prompts and humans write the ideal responses, based on labeling documentation.
Now, in the modern stack this is not done fully manually by humans — they have a lot of help from these tools. We can use language models to help create these datasets, and that's done extensively, but fundamentally it all still comes from human curation in the end. So we create these conversations; that becomes our dataset; we fine-tune on it, or continue training on it; and we get an assistant. Then we shifted gears and talked about some of the cognitive implications of what this assistant is like. We saw, for example, that the assistant will hallucinate if you don't take some mitigations against it, and we looked at mitigations for those hallucinations. We also saw that the models are quite impressive and can do a lot in their heads, but that they can lean on tools to become better: they can lean on web search to hallucinate less and to bring in more recent information, or on the code interpreter, so the LLM can write some code, actually run it, and see the results.
So those are some of the topics we've looked at so far. What I'd like to do now is cover the last major stage of this pipeline, and that is reinforcement learning. Reinforcement learning is still considered to be under the umbrella of post-training, but it is the third and last major stage: a different way of training language models, usually following as the third step. Inside companies like OpenAI, these are all separate teams: a team doing data for pre-training and a team doing the pre-training itself; a team doing all the conversation generation and a team doing the supervised fine-tuning; and then a team for reinforcement learning as well. It's a kind of handoff of these models: you get your base model, then you fine-tune it to be an assistant, and then you go into reinforcement learning, which we'll talk about now.
So that's kind of like the major flow.
So let's now focus on reinforcement learning, the last major stage of training. Let me first motivate it: why would we want to do reinforcement learning, and what does it look like at a high level? I'd like to motivate the reinforcement learning stage with something you're probably familiar with: going to school. Just as you went to school to become really good at something, we want to take large language models through school. Really, what we're doing is using a few paradigms for giving them knowledge or transferring skills. In particular, when you work with textbooks in school, you'll see three major classes of information. The first is exposition. (By the way, this is a totally random book I pulled from the internet — I think it's organic chemistry or something, I'm not sure.) The important thing is that most of the text, the meat of it, is exposition: background knowledge and so on. As you read through the words of this exposition, you can think of that, roughly, as training on that data. That's why reading through this background knowledge, all this context information, is kind of equivalent to pre-training.
It's where we build a knowledge base of this data and get a sense of the topic. The next major kind of information is problems with their worked solutions. A human expert — in this case, the author of the book — has given us not just a problem, but has also worked through the solution, and the solution is basically equivalent to the ideal response for an assistant: the expert is showing us how to solve the problem in its full form. As we read the solution, we are training on the expert data, and later we can try to imitate the expert; that roughly corresponds to the SFT model. So we've already covered pre-training, and we've already covered this imitation of experts and how they solve problems. The third class of information — and this is where reinforcement learning comes in — is the practice problems. Sometimes you'll see just a single practice problem here, but of course there will usually be many practice problems at the end of each chapter in any textbook. Practice problems, of course, are critical for learning, because what do they get you to do? They get you to practice, to discover ways of solving these problems yourself. In a practice problem, you get the problem description; you're not given the solution, but you are given the final answer, usually in the answer key of the textbook. So you know the final answer you're trying to get to, and you have the problem statement, but you don't have the solution. You are practicing the solution: trying out many different things, seeing what gets you to the final answer best, and discovering how to solve these problems. In that process, you're relying on, number one, the background information, which comes from pre-training, and, number two, maybe a little bit of imitation of human experts, since you can try similar kinds of solutions. So: we've done this and this, and now, in this section, we're going to practice. We're going to be given prompts and the final answers — not the solutions, just the final answers — but we're not going to be given expert solutions. We have to practice and try stuff out. That's what reinforcement learning is about.
Okay, so let's go back to the problem we worked with previously, so we have a concrete example to talk through as we explore the topic. I'm here in Tiktokenizer, partly because I get a text box, which is useful, but also because I want to remind you again that we're always working with one-dimensional token sequences. I actually prefer this view, because it's like the native view of the LLM, if that makes sense: this is what it actually sees — token IDs. Okay: Emily buys 3 apples and 2 oranges; each orange is $2; the total cost of all the fruit is $13; what is the cost of each apple? What I'd like you to appreciate here is that these are four possible candidate solutions, as an example, and they all reach the answer 3.
Now, what I'd like you to appreciate at this point is that if I am the human data labeler creating a conversation to enter into the training set, I don't actually know which of these conversations to add to the dataset. Some of them set up a system of equations; some just talk through it in English; and some skip right through to the solution. If you give ChatGPT this question, for example, it defines a system of variables and does this little thing. What we have to appreciate and differentiate between, though, is this: the first purpose of a solution is to reach the right answer — of course we want to get the final answer, 3; that's the important purpose. But there's a secondary purpose as well, where we're also trying to make the solution nice for the human, because we assume the person wants to see the solution and the intermediate steps, presented nicely, and so on. So there are two separate things going on: number one is the presentation for the human, and number two is actually getting the right answer. For the moment, let's focus only on reaching the final answer. If we only care about the final answer, then which of these is the optimal — the best — solution for the LLM to reach the right answer?
What I'm trying to get at is: we don't know. As a human labeler, I would not know which of these is best. As an example, we saw earlier, when we looked at the token sequences and the mental arithmetic and reasoning, that for each token we can only spend a finite amount of compute, and it's not very large; you should think about it that way. So we can't make too big of a leap in any one token — that's maybe the way to think about it. For example, what's really nice about this one is that it's very few tokens, so it takes a very short time to get to the answer. But right here, when we're doing "13 - 4, divided by 3, equals" — in this one token — we're asking for a lot of computation to happen on that single individual token. So maybe this is a bad example to give to the LLM, because it incentivizes skipping through the calculations very quickly, and the model is going to make mistakes in its mental arithmetic. Maybe it would work better to spread it out more; maybe it would be better to set it up as an equation; maybe it would be better to talk through it. We fundamentally don't know. And we don't know because what is easy or hard for you and me as human labelers is different from what's easy or hard for the LLM. Its cognition is different; the token sequences are differently hard for it. Some token sequences that are trivial for me might be too much of a leap for the LLM — right here, this token would be way too hard. But conversely, many of the tokens I'm creating might be trivial to the LLM, and we're just wasting tokens: why waste all these tokens when this is all trivial? So if the only thing we care about is reaching the final answer, and we separate out the issue of presentation to the human, then we don't actually know how to annotate this example. We don't know what solution to give to the LLM, because we are not the LLM.
This is clear in the case of the math example, but it's actually a very pervasive issue: our knowledge is not the LLM's knowledge. The LLM actually has a ton of PhD-level knowledge of math, physics, chemistry, and whatnot, so in many ways it knows more than I do, and I'm potentially not utilizing that knowledge in its problem solving. But conversely, I might be injecting knowledge into my solutions that the LLM doesn't have in its parameters, and then those become sudden leaps that are very confusing to the model. Our cognitions are different, and I don't really know what to put here if all we care about is reaching the final solution, ideally economically. So, long story short, we are not in a good position to create these token sequences for the LLM. They're useful, by imitation, to initialize the system, but we really want the LLM to discover the token sequences that work for it. It needs to find, for itself, what token sequence reliably gets it to the answer, given the prompt, and it needs to discover that in a process of reinforcement learning, of trial and error.
So let's see how this example would work in reinforcement learning. Okay, we're now back in the Hugging Face inference playground, which lets me very easily call different kinds of models. As an example, at the top right I chose the Gemma 2 two-billion-parameter model. Two billion is very, very small — this is a tiny model — but it's okay. The way reinforcement learning works is actually quite simple: we need to try many different kinds of solutions, and we want to see which solutions work well and which don't. So we take the prompt, run the model, and the model generates a solution; then we inspect the solution, and we know that the correct answer for this one is $3. And indeed the model gets it correct: it says it's $3. That's one attempt at the solution. Now we delete this and rerun it; let's try a second attempt. The model solves it in a slightly different way. Every single attempt will be a different generation, because these models are stochastic systems: remember that at every single token we have a probability distribution, and we're sampling from that distribution, so we end up going down slightly different paths. And so this is a second solution, which also ends in the correct answer.
Now we delete that and go a third time. Okay: again a slightly different solution, but it also gets it correct.
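Why every run differs comes down to the sampling at each token — a toy illustration:

```python
import numpy as np

# Toy illustration of why each attempt differs: at every step we sample
# from the model's next-token distribution instead of taking the maximum.
rng = np.random.default_rng()
tokens = ["a", "b", "c"]            # stand-in vocabulary
probs = [0.5, 0.3, 0.2]             # stand-in next-token probabilities
print(rng.choice(tokens, p=probs))  # a different path on different runs
```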
Now, we can repeat this many times: in practice, you might sample thousands of independent solutions, or even a million, for a single prompt. Some of them will be correct and some of them will not. Basically, what we want to do is encourage the solutions that lead to correct answers. So let's take a look at what that looks like. If we come back over here, here's a cartoon diagram of it: we have a prompt, and then we tried many different solutions in parallel. Some of the solutions go well — they get the right answer, shown in green — and some go poorly and don't reach the right answer, shown in red. Now, this particular problem is unfortunately not the best example, because it's a trivial prompt, and as we saw, even a two-billion-parameter model always gets it right. But let's exercise some imagination and just suppose the green ones are good and the red ones are bad.
Okay, so we generated 15 solutions, and only four of them got the right answer. Now what we want to do is encourage the kinds of solutions that lead to right answers. Whatever token sequences happened in the red solutions, obviously something went wrong along the way somewhere — that was not a good path to take through the solution. And whatever token sequences were in the green solutions, well, things went pretty well, so we want to do more things like that on prompts like this. The way we encourage this kind of behavior in the future is by training on these sequences. But these training sequences are not coming from expert human annotators: no human decided that this is the correct solution. The solution came from the model itself. The model is practicing here: it tried out a few solutions, four of them seem to have worked, and now the model will train on them. This corresponds to a student looking at their own solutions and saying: okay, this one worked really well, so this is how I should be solving these kinds of problems. Now, there are many different ways to tweak the methodology here, but just to get the core idea across, maybe it's simplest to think about taking the single best solution out of these four — say, this one, which is why it's yellow. That would be the solution that not only reached the right answer but maybe had some other nice properties: maybe it was the shortest, or it looked nicest in some way, or whatever other criteria you could think of. We're going to decide that this is the top solution, and we're going to train on it. After the parameter update, the model will be slightly more likely to take this path in this kind of setting in the future. And remember that we're going to run many different diverse prompts across lots of math problems and physics problems and whatever else there might be.
So have in mind tens of thousands of prompts, with thousands of solutions per prompt, all happening kind of at the same time. As we iterate this process, the model is discovering for itself which kinds of token sequences lead it to correct answers. It's not coming from a human annotator. The model is playing in this playground: it knows what it's trying to get to, and it's discovering sequences that work for it — sequences that don't make any mental leaps, that seem to work reliably and statistically, and that fully utilize the knowledge of the model as it has it. So this is the process of reinforcement learning: it's basically guess-and-check. We guess many different types of solutions, we check them, and we do more of what worked in the future. And that is reinforcement learning.
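In pseudocode, the loop sketched above looks roughly like this. It's a heavily simplified sketch: `model.sample`, `model.train_on`, and `extract_answer` are hypothetical stand-ins, and real pipelines add many crucial details around this core.

```python
# A heavily simplified sketch of the guess-and-check RL loop described above.
# model.sample, model.train_on, and extract_answer are hypothetical stand-ins.
for prompt, correct_answer in practice_problems:     # tens of thousands of prompts
    attempts = [model.sample(prompt) for _ in range(1000)]  # thousands of tries
    good = [a for a in attempts if extract_answer(a) == correct_answer]
    if good:
        best = min(good, key=len)          # e.g., keep the shortest correct one
        model.train_on(prompt, best)       # nudge parameters toward that path
```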
learning. So in the context of what came before, we see now that the SFD model, the supervised fine-tuning model, it's still helpful because it still kind of like initializes the model a little bit
into the vicinity of the correct solutions. So it's kind of like a
solutions. So it's kind of like a initialization of um of the model in the sense that it kind of gets the model to you know take solutions like write out solutions and maybe it has an
understanding of setting up a system of equations or maybe it kind of like talks through a solution. So it gets you into the vicinity of correct solutions. But
reinforcement learning is where everything gets dialed in. We really
discover the solutions that work for the model, get the right answers, we encourage them, and then the model just kind of like gets better over time.
Okay, so that is the high-level process for how we train large language models. In short, we train them very similarly to how we train children. The main difference is that children go through chapters of books, doing all the different types of exercises within each chapter, whereas when we train AIs, we do it stage by stage, depending on the type of content. First we do pre-training, which, as we saw, is equivalent to reading all the expository material: we look at all the textbooks at the same time, read all the exposition, and try to build a knowledge base. The second stage is SFT, which is looking at all the worked solutions from human experts across all the textbooks; we get an SFT model that is able to imitate the experts, but does so somewhat blindly — it just does its best guess, trying to statistically mimic expert behavior. And then finally, in the last stage, the RL stage, we do all the practice problems across all the textbooks — only the practice problems — and that's how we get the RL model. So at a high level, the way we train LLMs is very much equivalent to the process we use for training children.
The next point I'd like to make is that these first two stages, pre-training and supervised fine-tuning, have been around for years and are very standard — everyone does them, all the different LLM providers. It is this last stage, the RL training, that is much earlier in its development and not yet standard in the field. This stage is a lot more nascent, and the reason is that I actually skipped over a ton of little details in this process. The high-level idea is very simple — it's trial-and-error learning — but there are a ton of details and little mathematical nuances to exactly how you pick the solutions that are best, how much you train on them, what the prompt distribution is, and how to set up the training run so that this actually works. There are a lot of details and knobs around a core idea that is very, very simple, and getting the details right is not trivial. So a lot of companies — for example OpenAI and other LLM providers — have experimented internally with reinforcement learning fine-tuning for LLMs for a while, but they haven't talked about it publicly; it's all been done inside the companies. That's why the paper from DeepSeek that came out very recently was such a big deal: it's a paper from this company called DeepSeek, in China, and it talked very publicly about reinforcement learning fine-tuning for large language models — how incredibly important it is, and how it brings out a lot of reasoning capabilities in the models. We'll go into this in a second. This paper reinvigorated the public interest in using RL for LLMs and gave a lot of the nitty-gritty details needed to reproduce the results and actually get this stage to work for large language models.
language models. So let me take you briefly through this uh deepseek R1 paper and what happens when you actually correctly apply RL to language models and what that looks like and what that gives you. So the first thing I'll
gives you. So the first thing I'll scroll to is this uh kind of figure two here where we are looking at the improvement in how the models are solving mathematical problems. So this is the accuracy of solving mathematical
problems on the AME accuracy and then we can go to the web page and we can see the kinds of problems that are actually in these um these the kinds of math problems that are being measured here.
So these are simple math problems. You can um pause the video if you like. But
these are the kinds of problems that basically the models are being asked to solve. And you can see that in the
solve. And you can see that in the beginning they're not doing very well.
But then as you update the model with this many thousands of steps, their accuracy kind of continues to climb. So
the models are improving and they're solving these problems with a higher accuracy as you do this trial and error on a large data set of these kinds of problems. And the models are discovering
And the models are discovering how to solve math problems. But even more incredible than the quantitative results, solving these problems with higher accuracy, is the qualitative means by which the model achieves those results. When we scroll down, one of the interesting figures shows that later in the optimization the average length per response goes up. The model is using more tokens to get its higher-accuracy results; it's learning to create very, very long solutions. Why are these solutions so long? We can look at them qualitatively here. What the authors discover is that the solutions get very long partially because of the following. Here's a question, and here's the answer from the model. What the model learns to do, and this is an emergent property of the optimization, it just discovers that this is good for problem solving, is that it starts to do stuff like: "Wait, wait, wait. That's an aha moment I can flag here. Let's re-evaluate this step by step to identify what the correct sum can be." So what is the model doing here? It is basically re-evaluating steps. It has learned that it works better for accuracy to try out lots of ideas: try something from different perspectives, retrace, reframe, backtrack. It's doing a lot of the things that you and I do when solving mathematical problems. But it's rediscovering what happens in your head, not what you put down in the written solution. There is no human who could hardcode this stuff into the ideal assistant response. This is something that can only be discovered in the process of reinforcement learning, because you wouldn't know what to put here; it just turns out to work for the model, and it improves its accuracy in problem solving. So the model learns what we call the chains of thought in your head. It's an emergent property of the optimization, and it's what's bloating up the response lengths, but it's also what's increasing the accuracy of the problem solving.
So what's incredible here is that the model is discovering ways to think. It's learning what I like to call cognitive strategies: how you manipulate a problem, how you approach it from different perspectives, how you pull in analogies, how you try many different things over time, how you check a result from different angles, how you solve problems. But here it's discovered by the RL. It is extremely incredible to see this emerge in the optimization without having to hardcode it anywhere. The only thing we've given it are the correct answers, and this comes out from trying to just solve them correctly, which is incredible.
Now let's go back to the problem that we've been working with, and let's take a look at what it would look like for this kind of a model, what we call a reasoning or thinking model, to solve it. Okay, so recall that this is the problem we've been working with. When I pasted it into ChatGPT (GPT-4o), I got the earlier kind of response. Let's take a look at what happens when you give this same query to what's called a reasoning or thinking model, a model that was trained with reinforcement learning. The model described in this paper, DeepSeek R1, is available on chat.deepseek.com, so the company that developed it is hosting it. You have to make sure that the DeepThink button is turned on to get the R1 model, as it's called. We can paste the problem here and run it. So let's take a look at what happens now and what the output of the model is.
Okay, so here's what it says. Previously, what we got was basically an SFT approach, a supervised finetuning approach, which is like mimicking an expert solution. This is what we get from the RL model: "Okay, let me try to figure this out. So Emily buys three apples and two oranges. Each orange costs $2. The total is $13. I need to find out..." and so on. As you're reading this, you can't escape the feeling that this model is thinking, and it's definitely pursuing the solution. It derives that each apple must cost $3. And then it says, "Wait a second, let me check my math again to be sure," and it tries it from a slightly different perspective. And then it says, "Yep, all that checks out. I think that's the answer. I don't see any mistakes. Let me see if there's another way to approach the problem, maybe setting up an equation. Let the cost of one apple be A, then..." and so on. "Yep, same answer, so definitely each apple is $3. All right, confident that that's correct." And then, once it has done the thinking process, it writes up the nice solution for the human. So the first part is more about the correctness aspect, and this part is more about the presentation aspect, where it writes everything out nicely and boxes in the correct answer at the bottom.
What's incredible about this is that we get to see the thinking process of the model, and this is what's coming from the reinforcement learning process. This is what's bloating up the length of the token sequences: the models are doing thinking and trying different ways. This is what's giving you higher accuracy in problem solving. And this is where we are seeing these aha moments, these different strategies, and these ideas for how you can make sure that you're getting the correct answer.
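As a quick sanity check on the arithmetic the model is doing (the numbers come from the problem above; the code is just my illustration):

```python
# Emily buys 3 apples and 2 oranges; each orange costs $2; the total is $13.
total, oranges_cost = 13, 2 * 2
apple_price = (total - oranges_cost) / 3
print(apple_price)  # 3.0, so each apple costs $3, matching the model's answer
```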
The last point I wanted to make is that some people are a little bit nervous about putting very sensitive data into chat.deepseek.com, because this is a Chinese company, so people are a little bit careful and cagey about that. But DeepSeek R1 is a model that was released openly by this company; it's an open-source, or more precisely open-weights, model, available for anyone to download and use. You will not be able to run the full model at full precision on a MacBook or another local device, because it is a fairly large model, but many companies host the full, largest model. One of those companies that I like to use is called together.ai.
So when you go to together.ai, you sign up, you go to playgrounds, and you can select DeepSeek R1 in the chat; there are many other kinds of models you can select here, all state-of-the-art models. This is similar to the Hugging Face inference playground that we've been playing with so far, but together.ai will usually host all the state-of-the-art models. So select DeepSeek R1. You can ignore a lot of these settings; I think the defaults will often be okay. And we can put in our problem. Because the model was released openly by DeepSeek, what you're getting here should be basically equivalent to what you get on chat.deepseek.com. Because of the randomness in the sampling, we're going to get something slightly different, but in principle this should be identical in terms of the power of the model, and you should see the same things quantitatively and qualitatively, even though this hosting is coming from an American company.
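If you'd rather query a hosted open-weights model from code than from the playground, inference providers typically expose an OpenAI-compatible API. A minimal sketch, assuming together.ai's endpoint and the model identifier `deepseek-ai/DeepSeek-R1` (check the provider's docs for the exact values):

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_TOGETHER_API_KEY",
)
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # assumed model identifier
    messages=[{"role": "user", "content": "Emily buys 3 apples and 2 oranges. "
               "Each orange costs $2. The total is $13. What does each apple cost?"}],
)
print(resp.choices[0].message.content)  # reasoning trace, then the final answer
```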
So that's DeepSeek, and that's what's called a reasoning model. Now, when I go back to ChatGPT, some of the models that you'll see in the drop-down here, like o1, o3-mini, o3-mini-high, and so on, are described as using advanced reasoning. What "uses advanced reasoning" refers to is the fact that they were trained by reinforcement learning, with techniques very similar to those of DeepSeek R1, per public statements of OpenAI employees. So these are thinking models trained with RL. The models like GPT-4o or GPT-4o mini that you're getting in the free tier you should think of as mostly SFT models, supervised finetuning models. They don't actually do this kind of thinking as you see in the RL models, and even though there's a little bit of reinforcement learning involved with these models (I'll get into that in a second), they are mostly SFT models; I think you should think about it that way. So, in the same way as before, we can pick one of the thinking models, say o3-mini-high. These models, by the way, might not be available to you unless you pay a ChatGPT subscription of either $20 per month or $200 per month for some of the top models. So we can pick a thinking model and run it.
Now, what's going to happen is it's going to say "reasoning" and start to do stuff like this. And what we're seeing here is not exactly what we saw with DeepSeek. Even though under the hood the model produces these kinds of chains of thought, OpenAI chooses not to show the exact chains of thought in the web interface; it shows little summaries of them. OpenAI does this, I think, partly because they are worried about what's called a distillation risk: someone could come in and try to imitate those reasoning traces and recover a lot of the reasoning performance just by imitating the chains of thought. So they hide them and only show little summaries. So you're not getting exactly what you would get in DeepSeek with respect to the reasoning itself, and then they write up the solution.
So these are kind of equivalent, even though we're not seeing the full under-the-hood details. In terms of performance, these models and the DeepSeek models are currently roughly on par. It's hard to tell because of the evaluations, but if you're paying $200 per month to OpenAI, I believe some of those models currently still look better. DeepSeek R1, though, is for now a very solid choice for a thinking model, and it's available to you either on this website or any other website, because the model is open weights; you can just download it.
So those are thinking models. What is the summary so far? Well, we've talked about reinforcement learning and the fact that thinking emerges in the process of the optimization when we run RL on many math and code problems that have verifiable solutions, where there's an answer like 3, and so on. These thinking models are accessible, for example, on DeepSeek, or at any inference provider like together.ai by choosing DeepSeek there. They are also available in ChatGPT under any of the o1 or o3 models. But the GPT-4o models and so on are not thinking models; you should think of them as mostly SFT models. Now, if you have a prompt that requires advanced reasoning, you should probably use some of the thinking models, or at least try them out. But empirically, for a lot of my use, when you're asking a simpler question, like a knowledge-based question, this might be overkill; there's no need to think for 30 seconds about some factual question. So for that I will sometimes default to just GPT-4o. Empirically, about 80 to 90% of my use is just GPT-4o, and when I come across a very difficult problem, in math or code and so on, I will reach for the thinking models, but then I have to wait a bit longer because they are thinking. You can access these on ChatGPT and on DeepSeek.
Also, I wanted to point out aistudio.google.com. Even though it looks really busy, really ugly, because Google is just unable to do this kind of stuff well (it's like, what is happening?), if you choose Model and pick Gemini 2.0 Flash Thinking Experimental 01-21, that's also an early experimental thinking model by Google. We can go here, give it the same problem, and click run, and this thinking model will also do something similar and come out with the right answer. So basically Gemini also offers a thinking model. Anthropic currently does not offer a thinking model. But this is kind of the frontier of development of these LLMs. I think RL is this new exciting stage, but getting the details right is difficult, and that's why all these thinking models are currently experimental, as of very early 2025. But this is the frontier of development: pushing the performance on these very difficult problems using reasoning that is emergent in these optimizations.
One more connection that I wanted to bring up is that the discovery that reinforcement learning is an extremely powerful way of learning is not new to the field of AI. One place where we've already seen this demonstrated is in the game of Go. Famously, DeepMind developed the system AlphaGo (you can watch a movie about it), where the system learns to play Go against top human players. When we go through the paper underlying AlphaGo and scroll down, we find a really interesting plot that I think is familiar to us, and that we are rediscovering in the more open domain of arbitrary problem solving instead of the closed, specific domain of the game of Go. What they saw, and we're going to see this in LLMs as well as this becomes more mature, is the following. The plot shows the Elo rating of playing Go, with a line for Lee Sedol, an extremely strong human player, and it compares the strength of a model trained by supervised learning against a model trained by reinforcement learning. The supervised learning model is imitating human expert players: if you just take a huge number of games played by experts and try to imitate them, you will get better, but then you top out, and you never quite get better than the top players in the game of Go, like Lee Sedol. You're never going to reach there, because you're just imitating human players; you can't fundamentally go beyond a human player if you're just imitating human players.
But the process of reinforcement learning is significantly more powerful. In reinforcement learning for the game of Go, the system plays moves that empirically and statistically lead to winning the game. AlphaGo is a system that plays against itself, using reinforcement learning to create rollouts. It's the exact same diagram as before, except there's no prompt; it's just the fixed game of Go. It's trying out lots of plays, and then the games that lead to a win, instead of a specific answer, are reinforced; they're made stronger. So the system is learning the sequences of actions that empirically and statistically lead to winning the game. Reinforcement learning is not constrained by human performance; it can do significantly better and overcome even top players like Lee Sedol. They probably could have run this longer, and they just chose to crop it at some point because this costs money, but it is a very powerful demonstration of reinforcement learning. And we're only starting to see hints of this diagram in large language models for reasoning problems. We're not going to get too far by just imitating experts. We need to go beyond that: set up these little game environments and let the system discover reasoning traces, or ways of solving problems, that are unique and that just work well.
Now, on this aspect of uniqueness, notice that when you're doing reinforcement learning, nothing prevents you from veering off the distribution of how humans play the game. When we go back to the AlphaGo search here, one of the suggested queries is "move 37". Move 37 in AlphaGo refers to a specific point in time where AlphaGo played a move that no human expert would play; the probability of this move being played by a human was evaluated to be about 1 in 10,000. So it's a very rare move, but in retrospect it was a brilliant move. AlphaGo, in the process of reinforcement learning, discovered a strategy of playing that was unknown to humans but is, in retrospect, brilliant. I recommend the YouTube video "Lee Sedol vs AlphaGo Move 37 reactions and analysis"; this is what it looked like when AlphaGo played this move: "That's a very, that's a very surprising move." "I thought it was a mistake." Basically, people were freaking out, because it's a move that a human would not play but AlphaGo played, because in its training this move seemed to be a good idea. It just happens not to be the kind of thing humans would do. And that, again, is the power of reinforcement learning.
In principle, we can see the equivalent of that if we continue scaling this paradigm in language models, and what that looks like is kind of unknown. What does it mean to solve problems in a way that even humans would not be able to reach? How can you be better at reasoning or thinking than humans? How can you go beyond a thinking human? Maybe it means discovering analogies that humans would not be able to create. Maybe it's a new thinking strategy; it's kind of hard to think through. Maybe it's a wholly new language, one that is not even English; maybe the model discovers its own language that is a lot better for thinking, because it is not constrained to stick with English. So maybe it picks a different language to think in, or it discovers its own language. In principle, the behavior of the system is a lot less defined: it is free to do whatever works, and it is free to slowly drift from the distribution of its training data, which is English. But all of that can only happen if we have a very large, diverse set of problems in which these strategies can be refined and perfected. A lot of the frontier LLM research going on right now is trying to create those kinds of prompt distributions, large and diverse. These are all like game environments in which the LLMs can practice their thinking. It's like writing practice problems: we have to create practice problems for all the domains of knowledge. And if we have tons of practice problems, the models will be able to reinforcement-learn on them and produce these kinds of improvement curves, but in the domain of open thinking instead of a closed domain like the game of Go.
There's one more section within reinforcement learning that I wanted to cover, and that is learning in unverifiable domains. So far, all of the problems that we've looked at are in what's called verifiable domains: any candidate solution can be scored very easily against a concrete answer. For example, the answer is 3, and we can very easily score solutions against that answer. Either we require the models to box in their answers, and then we just check whether whatever is in the box equals the answer, or you can use what's called an LLM judge: the LLM judge looks at a solution, is given the answer, and scores the solution on whether it's consistent with that answer. LLMs at the current capability are empirically good enough to do this fairly reliably, so we can apply those kinds of techniques as well. In any case, we have a concrete answer, we're just checking solutions against it, and we can do this automatically, with no humans in the loop.
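Here is a minimal sketch of what automatic scoring in a verifiable domain can look like. The regex-based boxed-answer check is a simplification, and the LLM-judge part only shows the prompt you might send to a grader model; `ask_judge` is a hypothetical stand-in for any chat-completion call.

```python
import re

def score_by_equality(solution: str, correct_answer: str) -> float:
    """Reward 1.0 if the model's \\boxed{...} answer matches, else 0.0."""
    m = re.search(r"\\boxed\{([^}]*)\}", solution)
    return 1.0 if m and m.group(1).strip() == correct_answer else 0.0

def judge_prompt(solution: str, correct_answer: str) -> str:
    """Prompt for an LLM judge; useful when answers are free-form text."""
    return (
        f"Reference answer: {correct_answer}\n"
        f"Candidate solution: {solution}\n"
        "Does the candidate reach the reference answer? Reply YES or NO."
    )

print(score_by_equality(r"... so the answer is \boxed{3}", "3"))  # 1.0
# score = 1.0 if ask_judge(judge_prompt(sol, "3")) == "YES" else 0.0  # hypothetical
```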
The problem is that we can't apply this strategy in what's called unverifiable domains. These are usually, for example, creative writing tasks: write a joke about pelicans, write a poem, summarize a paragraph, things like that. In these kinds of domains, it becomes harder to score the different solutions to a problem. For example, for writing a joke about pelicans, we can generate lots of different jokes; that's fine. We can go to ChatGPT and get it to generate a joke about pelicans: "...so much stuff in their beaks, because they don't belican in backpacks." What? Okay, we can try something else: "Why don't pelicans ever pay for their drinks? Because they always bill it to someone else." Haha. Okay, so these models are obviously not very good at humor. Actually, I think that's pretty fascinating, because I think humor is secretly very difficult, and the models don't really have the capability yet. In any case, you could imagine creating lots of jokes. The problem we're facing is: how do we score them? In principle, we could of course get a human to look at all these jokes, just like I did right now. The problem is that in reinforcement learning you're going to be doing many thousands of updates, for each update you want to be looking at, say, thousands of prompts, and for each prompt you potentially want to be looking at hundreds or thousands of different generations. There are just way too many of these to look at. In principle, you could have a human inspect all of them, score them, decide that this one is funny and that one is funny, and we could train on them to get the model to become slightly better at jokes, in the context of pelicans at least. The problem is that it's just way too much human time. This is an unscalable strategy; we need some kind of automatic strategy for doing this.
One solution to this was proposed in the paper that introduced what's called reinforcement learning from human feedback (RLHF). This was a paper from OpenAI at the time, and many of its authors are now co-founders of Anthropic. It proposed an approach for doing reinforcement learning in unverifiable domains. Let's take a look at how that works. Here is the cartoon diagram of the core ideas involved. As I mentioned, the naive approach is: if we just had infinite human time, we could run RL in these domains just fine. For example, with infinite humans, and these are just cartoon numbers, we could do 1,000 updates, where each update is on 1,000 prompts, and for each prompt we score 1,000 rollouts. We can run RL with that kind of setup; the problem is that along the way I would need to ask a human to evaluate a joke a total of one billion times. That's a lot of people looking at really terrible jokes, so we don't want to do that. Instead, we want to take the RLHF approach, where the core trick is indirection.
So we're going to involve humans just a little bit. The way we cheat is that we train a whole separate neural network that we call a reward model, and this neural network imitates human scores. We ask humans to score rollouts, then we imitate those human scores with a neural network, which becomes a kind of simulator of human preferences. And now that we have a neural network simulator, we can do RL against it: instead of asking a real human for their score of a joke, we ask a simulated human. Once we have a simulator, we're off to the races, because we can query it as many times as we want to. It's an entirely automatic process, and we can now do reinforcement learning with respect to the simulator. The simulator, as you might expect, is not going to be a perfect human, but if it's at least statistically similar to human judgment, then you might expect this to do something, and in practice, indeed, it does.
So once we have a simulator, we can do RL, and everything works great. Let me show you a cartoon diagram of what this process looks like; the details are not super important, it's just the core idea of how this works. Here I have a cartoon diagram of a hypothetical example of what training the reward model would look like. We have a prompt, like "write a joke about pelicans," and then five separate rollouts, five different jokes. Now, the first thing we do is ask a human to order these jokes from best to worst. Say this human thought this joke was the best, the funniest, so it's the number-one joke, then number two, number three, number four, and number five, the worst joke. We ask humans to order rather than give scores directly because ordering is a bit of an easier task: it's easier for a human to give an ordering than precise scores. That ordering is now the supervision for the model; the human has ordered the jokes, and that is their contribution to the training process.
Separately, we ask the reward model for its scores on these jokes. The reward model is a whole separate neural network, completely separate, and it's also probably a transformer, but it's not a language model in the sense of generating diverse language; it's just a scoring model. The reward model takes two inputs: number one, the prompt, and number two, a candidate joke. So here, for example, the reward model would be taking this prompt and this joke. The output of the reward model is a single number, thought of as a score, which can range, say, from 0 to 1, where 0 is the worst score and 1 is the best.
Here are some examples of what a hypothetical reward model at some stage of training might give as scores for these jokes: 0.1 is a very low score, 0.8 is a really high score, and so on. Now we compare the scores given by the reward model with the ordering given by the human. There's a precise mathematical way to calculate this: you set up a loss function, compute a correspondence between the scores and the ordering, and update the model based on it. But let me just give you the intuition. For the second joke here, the human thought it was the funniest, and the model kind of agreed (0.8 is a relatively high score), but the score should have been even higher, so after an update of the network we would expect this score to grow, to something like 0.81. For this other one, they're in massive disagreement: the human ranked it number two, but the score is only 0.1, so this score needs to be much higher, and after an update on this supervision it might grow to something like 0.15. And this one the human thought was the worst joke, but the model gave it a fairly high number, so you would expect it to come down after the update, to maybe 0.35 or so. So we're doing what we did before: we're slightly nudging the model's predictions using a neural network training process, trying to make the reward model's scores consistent with the human ordering.
As we update the reward model on human data, it becomes a better and better simulator of the scores and orderings that humans provide, and it becomes the simulator of human preferences that we can then do RL against. Critically, we're not asking humans to look at a joke one billion times; maybe we look at 1,000 prompts, with five rollouts each, so about 5,000 jokes in total that humans have to look at, and they just give the ordering. Then we train the model to be consistent with that ordering. I'm skipping over the mathematical details, but the high-level idea is that the reward model gives us scores, we have a way of training it to be consistent with human orderings, and that's how RLHF works. So that is the rough idea: we train simulators of humans and do RL with respect to those simulators. Now I want to talk first about the upside of reinforcement learning from human feedback.
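For the curious, the mathematical details being skipped usually amount to a pairwise ranking loss. The sketch below is my illustration, not the paper's exact recipe: the real reward model is a transformer over tokens (here a crude bag-of-characters stand-in), trained with a Bradley-Terry style objective on "human preferred A over B" pairs drawn from the orderings.

```python
import torch
import torch.nn as nn

# Stand-in reward model: embeds (prompt + joke) as a bag of character counts.
# A real reward model is a transformer; this is just enough to run the loss.
class TinyRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, text: str) -> torch.Tensor:
        feats = torch.zeros(256)
        for ch in text:
            feats[ord(ch) % 256] += 1.0
        return self.net(feats)  # one scalar score per input

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Each pair says: humans ranked `better` above `worse` for this prompt.
pairs = [("write a joke about pelicans", "joke ranked #1", "joke ranked #5")]

for prompt, better, worse in pairs:
    s_better, s_worse = rm(prompt + better), rm(prompt + worse)
    # Pairwise logistic (Bradley-Terry) loss: push s_better above s_worse.
    loss = -torch.nn.functional.logsigmoid(s_better - s_worse).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# torch.sigmoid(rm(...)) would squash scores into [0, 1], like the cartoon.
```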
The first thing is that RLHF allows us to run reinforcement learning, which we know is an incredibly powerful set of techniques, in arbitrary domains, including unverifiable ones: things like summarization, poem writing, joke writing, or any other creative writing, in domains outside of math and code. Empirically, when we apply RLHF, we see that it improves the performance of the model. I have a guess for why that might be, but I don't think it's super well established exactly why; you can empirically observe that when you do RLHF correctly, the models you get are just a little bit better, but as to why is not as clear. Here's my best guess: it is possibly mostly due to the discriminator-generator gap. What that means is that, in many cases, it is significantly easier for humans to discriminate than to generate. In particular, in supervised fine-tuning (SFT), we're asking humans to generate the ideal assistant response, and in many cases, as I've shown, the ideal response is simple to write, but in many cases it's not. For example, in summarization, or poem writing, or joke writing, how are you, as a human labeler, supposed to give the ideal response? It requires creative human writing. RLHF sidesteps this, because we get to ask people a significantly easier question as data labelers: they're not asked to write poems directly, they're just given five poems from the model and asked to order them. That's a much easier task for a human labeler. What I think this buys you is higher-accuracy data, because we're not asking people to do the generation task, which can be extremely difficult; we're not asking them to do creative writing. We're just asking them to distinguish between creative writings and find the ones that are best. That ordering is the signal humans provide, their input into the system, and then RLHF discovers the kinds of responses that would be graded well by humans. That step of indirection allows the models to become even better. So that is the upside of RLHF: it allows us to run RL, it empirically results in better models, and it allows people to contribute their supervision without having to do extremely difficult tasks like writing ideal responses.
Unfortunately, RLHF also comes with significant downsides. The main one is that we are doing reinforcement learning not with respect to humans and actual human judgment, but with respect to a lossy simulation of humans, and this lossy simulation could be misleading. It's just a model outputting scores, and it might not perfectly reflect the opinion of an actual human with an actual brain in all the possible cases. That's number one. But there's something even more subtle and devious going on that really dramatically holds back RLHF as a technique we can scale to significantly smarter systems, and that is that reinforcement learning is extremely good at discovering ways to game the model, to game the simulation. The reward model we're constructing here, the one that gives the scores, is a transformer: a massive neural net with billions of parameters that imitates humans, but in a simulated way. The problem is that these are massive, complicated systems; there are a billion parameters here outputting a single score. It turns out there are ways to game these models: you can find inputs that were not part of their training set, and these inputs inexplicably get very high scores, but in a fake way. Very often, if you run RLHF for very long, say 10,000 updates, which is a lot of updates, you might expect your jokes to keep getting better, real bangers about pelicans, but that's not exactly what happens. What happens is that in the first few hundred steps the jokes about pelicans probably improve a little bit, and then they dramatically fall off a cliff and you start to get extremely nonsensical results. For example, the top-scoring joke about pelicans starts to be something like "the the the the the," which makes no sense. When you look at it, why should this be a top joke? But when you take "the the the the the" and plug it into your reward model, where you'd expect a score of zero, the reward model loves it: it will tell you this is a score of 1.0, a top joke. That makes no sense, but it happens because these models are just simulations of humans; they are massive neural nets, and you can find inputs that get into parts of the input space that give nonsensical results.
These examples are what's called adversarial examples. I'm not going to go into the topic too much, but they are adversarial inputs to the model: specific little inputs that slip between the nooks and crannies of the model and get nonsensical results to the top. Now, here's what you might imagine doing. You say, okay, "the the the the the" is obviously not a score-of-one joke, it's obviously terrible, so let's add it to the dataset with an ordering that is extremely bad. And indeed, your model will learn that "the the the the the" should have a very low score, and it will give it a score of zero. The problem is that there will always be a basically infinite number of nonsensical adversarial examples hiding in the model. Even if you iterate this process many times, keep adding nonsensical examples to your reward model data, and give them very low scores, you'll never win the game. You can do many, many rounds of this, and reinforcement learning, if you run it long enough, will always find a way to game the model: it will discover adversarial examples and get really high scores with nonsensical results. Fundamentally, this is because our scoring function is a giant neural net, and RL is extremely good at finding ways to trick it. So, long story short: you run RLHF for maybe a few hundred updates, the model gets better, and then you have to crop it and you're done. You can't run much longer against this reward model, because the optimization will start to game it; you crop it, you call it, and you ship it. You can improve the reward model, but you'll come across these situations again eventually. So what I usually say is that RLHF is not RL. What I mean is that RLHF is RL, obviously, but it's not RL in the magical sense; it's not RL that you can run indefinitely.
Compare that to problems where you're getting a concrete correct answer: you cannot game those as easily. You either got the correct answer or you didn't, and the scoring function is much, much simpler; you're just looking at the boxed answer and seeing whether the result is correct. It's very difficult to game these functions, but gaming a reward model is possible. In verifiable domains, you can run RL indefinitely: you could run for tens of thousands or hundreds of thousands of steps and discover all kinds of crazy strategies, ones we might never even think of, for performing really well on these problems. In the game of Go, there's no way to game winning or losing: we have a perfect simulator, we know where all the stones are placed, and we can calculate whether someone has won. So you can do RL indefinitely, and you can eventually beat even Lee Sedol. But with gameable models like a reward model, you cannot repeat this process indefinitely. So I see RLHF as not quite real RL, because the reward function is gameable. It's more in the realm of a little fine-tuning: a little improvement, but not something fundamentally set up correctly, where you can add more compute, run for longer, and get much better, magical results. It's not RL in that sense; it's not RL in the sense that it lacks the magic. It can fine-tune your model and get better performance, and indeed, if we go back to ChatGPT, the GPT-4o model has gone through RLHF, because it works well. But it's just not RL in the same sense. RLHF is a little fine-tune that slightly improves your model; that's maybe the way I would think about it.
Okay, so that's most of the technical content I wanted to cover. I took you through the three major stages and paradigms of training these models: pre-training, supervised finetuning, and reinforcement learning. And I showed you that they loosely correspond to processes we already use for teaching children. In particular, we talked about pre-training being like the basic knowledge acquisition of reading exposition, supervised fine-tuning being the process of looking at lots and lots of worked examples and imitating experts, and reinforcement learning being the practice problems. The only difference is that we now have to effectively write textbooks for LLMs and AIs across all the disciplines of human knowledge, and in all the cases where we would like them to work, like code and math and basically all the other disciplines. We're in the process of writing those textbooks, refining all the algorithms I've presented at a high level, and then, of course, doing a really good job at executing the training of these models at scale and efficiently. I didn't go into too many details, but these are extremely large and complicated distributed jobs that have to run over tens of thousands, or even hundreds of thousands, of GPUs, and the engineering that goes into this is really at the state of the art of what's possible with computers at that scale. I didn't cover that aspect too much, but a very serious engineering endeavor underlies all these very simple algorithms.
I also talked a little bit about the theory of mind of these models. The thing I want you to take away is that these models are really good and extremely useful as tools for your work, but you shouldn't trust them fully, and I showed you some examples of why. Even though we have mitigations for hallucinations, the models are not perfect; they will still hallucinate. It's gotten better over time and will continue to get better, but they can hallucinate.
In addition to that, I covered what I call the Swiss cheese model of LLM capabilities that you should have in your mind: the models are incredibly good across so many different disciplines, but then fail almost randomly in some unique cases. For example, what is bigger, 9.11 or 9.9? The model may not know, yet it can simultaneously turn around and solve Olympiad questions. That is a hole in the Swiss cheese, and there are many of them; you don't want to trip over them. So don't treat these models as infallible. Check their work, use them as tools, use them for inspiration, use them for the first draft, but work with them as tools and be ultimately responsible for the product of your work.
And that's roughly what I wanted to talk about: this is how they're trained, and this is what they are. Let's now turn to some of the future capabilities of these models, probably what's coming down the pipe, and also where you can find these models. I have a few bullet points on some of the things you can expect. The first thing you'll notice is that the models will very rapidly become multimodal. Everything I've talked about above concerned text, but very soon we'll have LLMs that can not just handle text but also operate natively and easily over audio, so they can hear and speak, and over images, so they can see and paint. We're already seeing the beginnings of all of this, but it will all be done natively inside the language model, which will enable natural conversations. Roughly speaking, the reason this is actually no different from everything we've covered above is that, as a baseline, you can tokenize audio and images and apply the exact same approaches we've talked about. It's not a fundamental change; we just have to add some tokens. As an example, for tokenizing audio, we can look at slices of the spectrogram of the audio signal, tokenize those, add more tokens that now represent audio, and just add them into the context windows and train on them like above. The same for images: we can cut images into patches and tokenize the patches separately, and then an image is just a sequence of tokens. This actually kind of works, and there's a lot of early work in this direction. So we can create streams of tokens representing audio and images as well as text, intersperse them, and handle them all simultaneously in a single model. So that's one example of multimodality.
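As a rough illustration of the "just add tokens" idea, here is a sketch of turning an audio clip and an image into discrete token IDs. The quantization scheme here, naive binning of spectrogram slices and image patches, is my simplification; real systems learn codebooks (VQ-style) rather than hashing like this.

```python
import numpy as np

def audio_to_tokens(waveform: np.ndarray, n_fft=256, vocab=1024) -> list[int]:
    """Slice a crude spectrogram along time; map each slice to a token ID."""
    n_frames = len(waveform) // n_fft
    frames = waveform[: n_frames * n_fft].reshape(n_frames, n_fft)
    spectro = np.abs(np.fft.rfft(frames, axis=1))   # one slice per frame
    # Naive quantizer: hash each slice's energy profile into the vocab.
    return [int(s.sum() * 1e3) % vocab for s in spectro]

def image_to_tokens(img: np.ndarray, patch=16, vocab=1024) -> list[int]:
    """Cut an HxW image into patches; map each patch to a token ID."""
    h, w = img.shape[0] // patch, img.shape[1] // patch
    tokens = []
    for i in range(h):
        for j in range(w):
            p = img[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            tokens.append(int(p.mean() * vocab) % vocab)
    return tokens

audio = np.random.randn(16000)    # 1 second of fake audio
image = np.random.rand(64, 64)    # fake grayscale image
stream = audio_to_tokens(audio) + image_to_tokens(image)  # one mixed sequence
print(len(stream), "tokens, ready to intersperse with text tokens")
```

Once everything is token IDs in one stream, the training recipe is the same next-token prediction we covered for text.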
Second, something people are very interested in: currently, most of the work involves handing individual tasks to the models on a silver platter, "please solve this task for me," and the model does that little task, but it's still up to us to organize a coherent execution of tasks to perform whole jobs. The models are not yet at the capability required to do this in a coherent, error-correcting way over long periods of time; they're not able to fully string together tasks to perform longer-running jobs. But they're getting there, and this is improving over time. What's probably going to happen is that we'll start to see what are called agents, which perform tasks over time while you supervise them, watch their work, and receive progress reports once in a while. We're going to see longer-running agentic tasks that don't take just a few seconds of response, but many tens of seconds, or even minutes or hours. These models are not infallible, though, as we talked about above, so all of this will require supervision. For example, in factories people talk about the human-to-robot ratio for automation; I think we're going to see something similar in the digital space, where we'll talk about human-to-agent ratios, with humans becoming much more like supervisors of agentic tasks in the digital domain.
Next, I think everything is going to become a lot more pervasive and invisible, integrated into tools and present everywhere. And in addition, as a separate bullet point, computer use: right now these models aren't able to take actions on your behalf, but if you saw ChatGPT launch Operator, that's one early example, where you can actually hand off control to the model to perform keyboard and mouse actions for you. That's also something I think is very interesting.
The last point I have here is a general comment that there's still a lot of research to potentially do in this domain. One example is something along the lines of test-time training. Remember that everything we've done above has two major stages: first the training stage, where we tune the parameters of the model to perform tasks well, and then, once we have the parameters, we fix them and deploy the model for inference. From there, the model is fixed; it doesn't change anymore, and it doesn't learn from anything it does at test time. It's a fixed set of parameters, and the only thing that changes is the tokens inside the context window. So the only kind of test-time learning the model has access to is the in-context learning of its dynamically adjustable context window, depending on what it's doing at test time. But I think this is still different from humans, who actually can learn depending on what they're doing; especially when you sleep, for example, your brain is updating your parameters, or something like that. There's no equivalent of that currently in these models and tools, so there are a lot of wonkier ideas, I think, still to be explored. In particular, I think this will be necessary because the context window is a finite and precious resource, especially once we start to tackle very long-running multimodal tasks: if we're putting in videos, these token windows will start to grow extremely large, not just thousands or hundreds of thousands of tokens, but significantly beyond that. The only trick we have available right now is to make the context windows longer, but I think that approach by itself will not scale to actual long-running multimodal tasks over time. So new ideas are needed in some of those cases and domains where tasks will require very long context.
So those are some examples of things you can expect coming down the pipe. Let's now turn to where you can actually keep track of this progress and stay up to date with the latest and greatest in the field. I would say the three resources I have consistently used to stay up to date are the following. Number one is LM Arena; let me show you LM Arena. This is basically an LLM leaderboard that ranks all the top models, and the ranking is based on human comparisons: humans prompt these models and judge which one gives the better answer, without knowing which model is which. They're just choosing the better answer, and from that you can calculate a ranking and get results. What you see here are the different organizations, like Google Gemini, that produce these models; when you click on any one of them, it takes you to the place where that model is hosted. Here we see Google currently on top, with OpenAI right behind, and DeepSeek in position number three.
Now, the reason this is a big deal is the last column, where you see the license. DeepSeek is an MIT-licensed model; it's open weights. Anyone can use these weights, anyone can download them, anyone can host their own version of DeepSeek and use it in whatever way they like. So it's not a proprietary model that you don't have access to; it's basically an open-weights release, and it is kind of unprecedented that a model this strong was released with open weights. Pretty cool from the team. Next up, we have a few more models from Google and OpenAI, and as you continue to scroll down, you start to see the other usual suspects: xAI here, Anthropic with Sonnet at number 14, and then Meta with Llama over here. Llama, similar to DeepSeek, is an open-weights model, but it's down here rather than up at the top. Now, I will say that this leaderboard was really good for a long time, but I do think that in the last few months it's become a little bit gamed, and I don't trust it as much as I used to. Empirically, I feel like a lot of people, for example, are using Sonnet from Anthropic, and it's a really good model, but it's all the way down here at number 14. Conversely, I think not as many people are using Gemini, yet it's ranking really high. So use this as a first pass, but try out a few of the models on your own tasks and see which one performs better.
The second thing I would point to is the AI News newsletter. AI News is not very creatively named, but it is a very good newsletter produced by Swyx and friends, so thank you for maintaining it. It has been very helpful to me because it is extremely comprehensive: if you go to the archives, you'll see it's produced almost every other day, and while some of it is written and curated by humans, a lot of it is constructed automatically with LLMs. These issues are very comprehensive, and you're probably not missing anything major if you go through them. Of course, you're probably not going to read all of it, because it's so long, but the summaries at the top are quite good and, I think, have some human oversight. The last thing I would point to is X (Twitter). A lot of AI happens on X, so I would just follow people you like and trust and get your latest and greatest there as well. Those are the major places that have worked for me over time.
And finally, a few words on where you can find the models and where you can use them. The first one I would say is: for any of the biggest proprietary models, you just have to go to the website of that LLM provider. For example, for OpenAI that's chat.com, which I believe actually works now. For Gemini, I think it's gemini.google.com
or AI Studio; I think they have two for some reason that I don't fully understand (no one does). For the open-weights models like DeepSeek, etc., you have to go to some kind of an inference provider of LLMs. My favorite one is together.ai, and I showed you that when you go to the together.ai playground you can pick lots of different models; all of these are open models of different types, and you can talk to them there as an example.
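As a sketch of what talking to a hosted open model looks like programmatically: together.ai exposes an OpenAI-compatible API, but treat the endpoint URL and the model identifier below as assumptions to check against the provider's docs.

```python
# Sketch: talking to an open-weights model via an inference provider.
# together.ai offers an OpenAI-compatible API; the base URL and the
# model identifier below are assumptions -- check the provider's docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_TOGETHER_API_KEY",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model name
    messages=[{"role": "user", "content": "Tell me a pelican joke."}],
)
print(resp.choices[0].message.content)
```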
Now, if you'd like to use a base model, that's not as common to find even on these inference providers; they are all targeting assistants and chat, and I couldn't see base models even here. So for base models, I usually go to Hyperbolic, because they serve the Llama 3.1 base model, and I love that model; you can just talk to it there. As far as I know, this is a good place for a base model, and I wish more people hosted base models, because they are useful and interesting to work with in some cases.
Finally, you can also take some of the smaller models and run them locally. For example, with DeepSeek, the biggest model is not something you're going to be able to run locally on your MacBook, but there are smaller versions of the DeepSeek model that are what's called distilled. And then, on top of that, you can run these models at lower numerical precision: not at the native precision of, for example, FP8 for DeepSeek or BF16 for Llama, but much lower than that. Don't worry if you don't fully understand those details; the point is that you can run smaller versions that have been distilled, at even lower precision, and then you can fit them on your computer.
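To make "lower precision" concrete, here is a toy sketch of quantizing a weight matrix to int8; real local-inference formats are more sophisticated, but the memory-versus-accuracy trade-off is the same idea:

```python
# Toy illustration of running weights at lower precision: symmetric
# int8 quantization of one weight matrix. Real quantization schemes
# (e.g. the 4-bit formats used by local runners) are more elaborate,
# but the idea is the same: fewer bits per weight, a small accuracy
# cost, and far less RAM.
import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)  # ~64 MB in fp32

scale = np.abs(w).max() / 127.0               # one scale for the whole tensor
w_int8 = np.round(w / scale).astype(np.int8)  # ~16 MB: 4x smaller
w_restored = w_int8.astype(np.float32) * scale  # dequantize at compute time

print("max abs error:", np.abs(w - w_restored).max())
```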
And so you can actually run pretty okay models on your laptop. My favorite place for this is usually LM Studio, which is basically an app you can download. I think it actually looks kind of ugly, and I don't like that it shows you all these models that are basically not that useful (everyone just wants to run DeepSeek), so I don't know why they give you 500 different types of models. They're really complicated to search through, and you have to choose different distillations and different precisions, and it's all really confusing. But once you actually understand how it works, and that's a whole separate video, you can load up a model. Here I loaded up Llama 3.2 Instruct 1B, and you can just talk to it: I ask for a pelican joke, I can ask for another one, and it gives me another one, etc. All of this happens locally on your computer; we're not actually going out to anyone else.
This is running on the GPU of the MacBook Pro, so that's very nice. You can then eject the model when you're done, and that frees up the RAM. So LM Studio is probably my favorite one, even though it's got a lot of UI/UX issues and it's really geared towards professionals, almost. But if you watch some videos on YouTube, I think you can figure out how to use the interface.
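For reference, LM Studio can also serve the loaded model over a local OpenAI-compatible API, so you can script against it; the port and model name below are assumptions based on its defaults, so check your install.

```python
# Sketch: LM Studio can expose the loaded model over a local,
# OpenAI-compatible server (commonly at localhost:1234; treat the
# port and the model name here as assumptions about your setup).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="llama-3.2-1b-instruct",  # whichever model you loaded in the app
    messages=[{"role": "user", "content": "Tell me a pelican joke."}],
)
print(resp.choices[0].message.content)  # generated entirely on your machine
```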
So those are a few words on where to find the models. Let me now loop back around to where we started. The question was: when we go to chatgpt.com and enter some kind of a query and hit go, what exactly is happening here? What are we seeing? What are we talking to? How does this work? I hope that this video gave you some appreciation for some of the under-the-hood details of how these models are trained, and for what it is that comes back.
In particular, we now know that your query is first chopped up into tokens. So we go to Tiktokenizer, find the place in the format that is reserved for the user query, and basically put our query right there. Our query goes into what we discussed as the conversation protocol format, which is the way we maintain conversation objects. It gets inserted there, and then the whole thing ends up being just a token sequence, a one-dimensional token sequence, under the hood.
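To make the encoding step concrete, here is a minimal sketch using the open tiktoken library; the chat-protocol markers that delimit user and assistant turns get their own dedicated token ids in the real system.

```python
# Sketch of the tokenization step: text goes in, a flat list of integer
# token ids comes out. This uses a GPT-4-era vocabulary from tiktoken;
# the conversation-protocol markers are handled by additional special
# tokens in the production system.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

query = "Why is the sky blue?"
tokens = enc.encode(query)
print(tokens)              # a short list of integer ids, one per token chunk
print(enc.decode(tokens))  # round-trips back to the original text
```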
So ChatGPT saw this token sequence, and when we hit go, it basically continues appending tokens to this list. It continues the sequence; it acts like a token autocomplete. In particular, it gave us this response, so we can put that back in and see, roughly, the tokens that it continued with.
Now the question becomes: why are these the tokens that the model responded with? What are these tokens? Where are they coming from? What are we talking to? And how do we program this system? That's where we shifted gears and talked about the under-the-hood pieces of it. The first stage of this process (and there are three stages) is the pre-training stage, which fundamentally has to do with knowledge acquisition from the internet into the parameters of this neural network; the neural net internalizes a lot of knowledge from the internet. But where the personality really comes in is in the process of supervised fine-tuning. What happens there is that a company like OpenAI will curate a large dataset of conversations, say 1 million conversations across very diverse topics, and these will be conversations between a human and an assistant. Even though there's a lot of synthetic data generation used throughout this entire process, and a lot of LLM help and so on, fundamentally this is a human data curation task with lots of humans involved. In particular, these humans are data labelers hired by OpenAI who are given labeling instructions that they learn, and their task is to create the ideal assistant response for any arbitrary prompt. So they are teaching the neural network, by example, how to respond to prompts.
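Concretely, one training example in such a dataset might look roughly like this; the field names are illustrative, since each lab has its own internal schema.

```python
# Illustrative shape of one supervised fine-tuning example: a
# conversation whose assistant turn was written by a hired human
# labeler following the labeling instructions. Field names here are
# made up for clarity, not any lab's actual schema.
example = {
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "The sky looks blue because of Rayleigh scattering: ..."},
    ],
}
# During SFT, the conversation is tokenized and the model is trained to
# predict the assistant tokens, i.e. it learns to imitate the labeler.
```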
So what is the way to think about what came back here? What is this? Well, I think the right way to think about it is that this is a neural network simulation of a data labeler at OpenAI. It's as if I gave this query to a data labeler at OpenAI, and this data labeler first read all of the labeling instructions from OpenAI, then spent two hours writing up the ideal assistant response to this query and gave it to me. Now, we're not actually doing that, right? Because we didn't wait two hours. What we're getting here is a neural network simulation of that process, and we have to keep in mind that these neural networks don't function like human brains do. They are different; what's easy or hard for them is different from what's easy or hard for humans.
And so we really are just getting a simulation. Here I've shown you that this is a token stream, and that this is fundamentally the neural network, with a bunch of activations and neurons in between. It is a fixed mathematical expression that mixes inputs from the tokens with the parameters of the model, and mixing them up gets you the next token in the sequence. But it is a finite amount of compute that happens for every single token, so this is some kind of a lossy simulation of a human that is restricted in this way. Whatever the humans write, the language model is imitating at the token level, with only this specific computation for every single token in the sequence.
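As a toy picture of that "same fixed computation per token" loop, here is a minimal autoregressive sampler; a real LLM replaces the trivial model below with a transformer forward pass, but the loop structure is the same.

```python
# Toy autoregressive loop: the same fixed computation runs once per
# generated token. A real LLM replaces `logits_for` with a transformer
# forward pass over the whole context, but the loop is identical.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50
W = rng.normal(size=(VOCAB, VOCAB))  # fixed model parameters

def logits_for(context):
    # Fixed mathematical expression mixing the input with the parameters
    # (here: a trivial lookup keyed on the last token only).
    return W[context[-1]]

tokens = [7]  # the prompt, already tokenized
for _ in range(10):
    logits = logits_for(tokens)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    tokens.append(int(rng.choice(VOCAB, p=probs)))  # sample the next token

print(tokens)  # the prompt plus ten continued tokens
```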
We also saw that, as a result of this and of the cognitive differences, the models will suffer in a variety of ways, and you have to be very careful with their use. For example, we saw that they will suffer from hallucinations, and we also have the sense of a Swiss-cheese model of LLM capabilities, where there are basically holes in the cheese: sometimes the models will just arbitrarily do something dumb. Even though they're doing lots of magical stuff, sometimes they just can't. Maybe you're not giving them enough tokens to think, and they're going to make stuff up because their mental arithmetic breaks. Maybe they're suddenly unable to count the number of letters, or unable to tell you that 9.11 is smaller than 9.9, and it looks kind of dumb. So it's a Swiss-cheese capability, and we have to be careful with that. We saw the reasons for it, but fundamentally this is how to think about what came back.
It's again a simulation, by this neural network, of a human data labeler following the labeling instructions at OpenAI. That's what we're getting back. Now, I do think things change a little bit when you actually go and reach for one of the thinking models, like o3-mini-high. The reason for that is that GPT-4o basically doesn't do reinforcement learning. It does do RLHF, but I've told you that RLHF is not RL; there's no time for magic in there. It's just a little bit of fine-tuning, is the way to look at it.
But these thinking models do use RL. They go through this third stage of perfecting their thinking process, discovering new thinking strategies and solutions to problem solving that look a little bit like your internal monologue in your head, and they practice that on a large collection of practice problems that companies like OpenAI create, curate, and then make available to the LLMs.
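Conceptually, and only conceptually, since the labs' actual algorithms are not public, the training loop looks something like this sketch, where `model.sample` and `model.train_on` are hypothetical stand-ins:

```python
# Conceptual sketch of RL on verifiable problems: sample several
# attempts (chains of thought), score them against the known answer,
# and train on the ones that succeed. The labs' real algorithms are
# not public; `model.sample` and `model.train_on` are hypothetical.

def rl_step(model, problem, correct_answer, num_attempts=16):
    successes = []
    for _ in range(num_attempts):
        attempt = model.sample(problem)  # chain of thought + final answer
        if attempt.final_answer == correct_answer:  # verifiable reward
            successes.append(attempt)
    # Reinforce whatever thinking led to the right answer.
    for attempt in successes:
        model.train_on(problem, attempt)
```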
So when I come here and talk to a thinking model and put in this question, what we're seeing is no longer just a straightforward simulation of a human data labeler; this is actually kind of new, unique, and interesting. Of course, OpenAI is not showing us the under-the-hood thinking and the chains of thought that underlie the reasoning here, but we know that such a thing exists, and this is a summary of it. What we're getting here is not just an imitation of a human data labeler; it's something that is kind of new, interesting, and exciting, in the sense that it is a function of thinking that was emergent in a simulation. It's not just imitating a human data labeler; it comes from this reinforcement learning process.
And here, of course, we're not giving it a chance to shine, because this is not a mathematical or reasoning problem; this is just some kind of a creative-writing problem, roughly speaking. I think it's an open question whether the thinking strategies that are developed inside verifiable domains transfer, and are generalizable, to other domains that are unverifiable, such as creative writing. The extent to which that transfer happens is unknown in the field, I would say. So we're not sure if we are able to do RL on everything that is verifiable and see the benefits of that on things that are unverifiable, like this prompt. That's an open question.
The other interesting thing is that this reinforcement learning is still way too new, primordial, and nascent; we're just seeing the beginnings of the hints of greatness. In the reasoning problems, we're seeing something that is, in principle, capable of something like the equivalent of move 37, but not in the game of Go: in open-domain thinking and problem solving. In principle, this paradigm is capable of doing something really cool, new, and exciting, something even that no human has thought of before; in principle, these models are capable of analogies no human has had. So I think it's incredibly exciting that these models exist, but again, it's very early, and these are primordial models for now. They will mostly shine in domains that are verifiable, like math and code. So they are very interesting to play with, think about, and use.
And that's roughly it; those are the broad strokes of what's available right now. I will say that overall it is an extremely exciting time to be in the field. Personally, I use these models all the time, daily, tens or hundreds of times, because they dramatically accelerate my work, and I think a lot of people see the same thing. I think we're going to see a huge amount of wealth creation as a result of these models. But be aware of their shortcomings; even with RL models, they're going to suffer from some of these. Use them as tools in a toolbox. Don't trust them fully, because they will randomly do dumb things: they will randomly hallucinate, they will randomly skip over some mental arithmetic and not get it right, and they randomly can't count or something like that. So use them as tools in the toolbox, check their work, and own the product of your work. Use them for inspiration and for first drafts; ask them questions, but always check and verify, and you will be very successful in your work if you do so. I hope this video was useful and interesting to you. I hope you had fun; it's already very long, so I apologize for that, but I hope it was useful, and I will see you later.