Zero to Hero LLMs Course (8+ Hours)
By Dev G
Summary
## Key takeaways

- **Understand Transformers to Stand Out**: The key to standing out is to actually understand how these models work, down to the level of transformers. Anyone can import an API and build a simple project, but how many people actually understand transformers, the model behind ChatGPT? [00:24], [00:32]
- **Sentiment Analysis for Trading Bots**: Build a model that takes in a tweet and outputs a number between zero and one, zero representing negative emotion and one representing positive emotion. This has real-world applications like detecting emotion in tweets and news headlines so a crypto trading bot can decide whether to buy or sell a stock. [01:11], [01:28]
- **Convert Strings to Numbers First**: The first step in any NLP project is to convert strings into numbers, since models only work with numbers. Convert all the words in tweets into numbers so the model can process them, using a consistent mapping from words to integers. [02:07], [02:32]
- **Embeddings Encode Word Meaning**: The first step in any NLP model is to get embedding vectors for every word, where related words are closer together and unrelated words are farther apart. Use an embedding layer of size 16 so each word has a vector with 16 entries encoding its meaning. [10:19], [10:46]
- **Training Updates Model Parameters**: Training is finding the right values for model parameters like W1 through W3 and B, starting from random numbers and iteratively updating them until the model's predictions are accurate enough. Each W factors in how important each input is for the prediction. [18:20], [19:49]
- **Gradient Descent Minimizes Error**: Gradient descent iteratively minimizes the error function by updating parameters with the equation involving the learning rate alpha and the derivative. It's an approximation algorithm, essential for complex functions in ML models that can't be solved analytically. [34:33], [37:28]
Topics Covered
- Understand Transformers to Stand Out
- Convert Strings to Numbers for NLP
- Embeddings Encode Word Meanings
- Training Updates Model Parameters
- Gradient Descent Minimizes Error
Full Transcript
Let's go from zero to hero, from an empty resume to a portfolio full of projects, so you can land your dream offer. Everyone talks about escaping tutorial hell, getting ahead of the 99%, an unfair advantage. How do you actually do that? Well, in the last few years, I've gone from not knowing how to write a single line of code to getting offers from Amazon and Google. And in my opinion, the key to standing out is to actually understand how these models work, down to the level of transformers. Anyone can import an API and build a simple project. But how many people actually understand transformers, the model behind ChatGPT? Well, I've created hundreds of videos covering the fundamentals of AI, so this video contains all my best videos that will actually help you understand transformers. And along the way, you will build several projects for your portfolio. These are my best videos; I've cherry-picked them with very specific intentions in mind. So, I hope you enjoy, and I look forward to seeing you in the next clip. And as always, if you'd like to learn directly from me until you land your dream offer, click the link in the description.
All right, let's build a real-world AI project. Our goal is to build a model that takes in a tweet and outputs a number between zero and one. Zero represents negative emotion and one represents positive emotion. This project is called sentiment analysis, and it has real-world applications. Let's say you want to build a crypto trading bot, and you want your bot to take in tweets and news headlines. Well, we need to actually detect the emotion in those tweets and news headlines so the bot can decide whether to buy or sell a stock. And the unique part about this video is that you can actually follow along and run your code in your browser using this link. It's in the pinned comment. You don't have to install any dependencies; there's no setup. And the website will check to see if you've built your model correctly just by running your code. This video is divided into two parts. In the first part, we're going to build our dataset of Elon Musk tweets, and in the second part, we're going to actually create the model. If you don't have time to get through the whole video right now, try to at least get through the dataset creation. Let's get started with part one.

All right. So, the first step in any NLP, or natural language processing, project is to convert our strings into numbers. Models only work with numbers, so we've got to convert all the words in these tweets into numbers so the model can actually process them. Let's take a look at a quick example for this problem. You can see that the two inputs are positive and negative. Each of those is a list of strings, one with positive emotion and one with negative emotion. And we can take a look at example one here. Each of those two lists has just one string, so we have a simple example, and we have something that we might find on Twitter from Elon Musk: "Dogecoin to the moon" and "I will short Tesla today." Obviously one's positive and one's negative, but we can take a look at the output. For the first tweet we have some list of numbers, and for the second tweet we also have another list of numbers. But where do these numbers actually come from? We can take a look at the explanation. It says "lexicographically," which is just a fancy word for alphabetically. "Dogecoin" becomes one, just alphabetically. The word "I" becomes two, "Tesla" becomes three, and so on. And this doesn't actually matter, because the model is just going to interpret these as numbers. The important thing is that we are consistent. We have to have a consistent mapping, or encoding, from words to integers and integers to words. We can also see that the first sentence is four words and the second sentence is five words. So, just so that we don't have a jagged tensor or a jagged array, we're going to pad that first sentence with another zero at the end. And zero doesn't correspond to any word. It's just a dummy number for padding.
All right, so I'm going to jump into the code now. I recommend pausing the video here and trying to think about how you might do this, but I'm going to get started with the code. We want to develop a mapping from words to integers. I want to have some sort of data structure where I can just look up that "Dogecoin" is one, "I" is two, and so on. I want to encode this in some sort of mapping. So, the first step is going to be to get all of our unique words using a set. We'll start by combining both of our lists, so we're just dealing with one list, and you can simply add two lists in Python to concatenate them. So now we have one list. Next, I want to go through every word in every sentence. If I said something like `for sentence in combined`, that would give me an iterable over each sentence. But what we actually want is each word, and the way we can get the words from a sentence is by splitting on the space character. I'll have some sort of set called `words`, and let's go ahead and add every word to that set. So we actually need to define that set now. Okay, so we're almost there. We have all the unique words in a set, but I need to sort them now so that I can say the alphabetically first word is one, the alphabetically second word is two, and so on. We don't normally sort sets, so let's convert it to a list and then sort that. Now let's actually build that mapping we talked about earlier. It's going to be a dictionary, and I'm going to call it `word_to_int`; the keys are going to be words and the values are going to be ints. All we have to do now is go through our sorted list. We'll use the `enumerate` function from Python, so I'll say `enumerate` on this sorted list. In case you're not familiar with it, `i` is going to be the index and `c` is going to be the actual value of the list at index `i`. It's just an easier way to condense the code. All we want to do is store this in `word_to_int`. So we'll say `word_to_int` of, well, what's the actual word? That's `c`, the value at index `i` in this sorted list. And then what should the actual integer encoding be? It should just be `i + 1`, because `i` is obviously going to start at zero, but we said we want to leave zero for padding. So we're going to have everything start at one. You can see over here that "Dogecoin" goes to one, "I" goes to two, and so on. So we simply say `i + 1`, and our dictionary is complete.

Okay, we're almost done now. I just created this list called `unpadded`, and it's going to be very similar to the output list you can see on the left, except we won't have that zero padding. We'll take care of the padding last. But `unpadded` is going to be a list of lists, just like that output we can see on the left. All we have to do now is actually encode every word in every sentence. We're going to fetch something from the dictionary: `word_to_int` of that word is going to give us some number, and we want to append this to a growing list for every sentence. So we'll say the encoded version is some sort of list, and we'll append each number to `encoded` for every word in that sentence. Then once we're finished with that sentence, we can go ahead and append it to `unpadded`, so `unpadded.append(encoded)`. We can see that `unpadded` is a list of lists. All right, it might be tempting to just return `unpadded`, but we still need to actually do the padding.
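The walkthrough above can be sketched end to end. The two one-tweet lists follow the video's Dogecoin/Tesla example, and the variable names are assumptions (the site's template may differ):

```python
# Sketch of the encoding step described above. The input lists mirror
# the video's example; variable names are illustrative assumptions.
positive = ["Dogecoin to the moon"]
negative = ["I will short Tesla today"]

combined = positive + negative  # one list of all sentences

# Collect all unique words with a set, then sort alphabetically.
words = set()
for sentence in combined:
    for word in sentence.split(" "):
        words.add(word)
sorted_words = sorted(words)

# Map each word to i + 1 so that 0 stays reserved for padding.
word_to_int = {c: i + 1 for i, c in enumerate(sorted_words)}

# Encode every sentence as a list of integers.
unpadded = []
for sentence in combined:
    encoded = [word_to_int[word] for word in sentence.split(" ")]
    unpadded.append(encoded)

print(unpadded)  # [[1, 7, 6, 4], [2, 9, 5, 3, 8]]
```

Note that Python's default sort puts capitalized words first, which is why "Dogecoin", "I", and "Tesla" land on 1, 2, and 3, matching the example.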
Fortunately, we don't have to do that ourselves. There's a simple function from the PyTorch library that I'm going to import, and it's going to take care of that for us. You can see that I've added a couple of import statements at the top; that's just to import our `pad_sequence` function. And on line 25, we're actually going to call that function. We also have to pass in `batch_first=True`, because we want our output, at least for the example on the left, to be a 2-by-T tensor, where T is the number of words in the longest sentence. If you say `batch_first=False`, which is the default, it's actually going to be inverted: a tall matrix instead of this long, horizontal matrix. It's going to be transposed, and that would be T-by-2, which isn't what we want. So we just say `batch_first=True`. Okay, but there's one final issue, and then we'll actually be ready to run our code. If you try to run the code now, it's not going to work, because this PyTorch function expects everything passed in to also be a tensor. We can't have a list of Python lists; we want to pass in a list of tensors. So, on line 23, I've gone ahead and made that small change. We're just going to cast each Python list to a tensor before we append it to our giant list. And that should take care of everything. All right, let's go ahead and run the code. And we've passed the test cases.
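A minimal sketch of the padding step, using the encoded sentences from the example above:

```python
# Padding with PyTorch's pad_sequence, as described above.
import torch
from torch.nn.utils.rnn import pad_sequence

unpadded = [[1, 7, 6, 4], [2, 9, 5, 3, 8]]

# pad_sequence expects a list of tensors, not a list of Python lists.
tensors = [torch.tensor(seq) for seq in unpadded]

# batch_first=True gives a 2-by-T tensor (T = longest sentence length);
# shorter sentences are padded on the right with 0 by default.
padded = pad_sequence(tensors, batch_first=True)
print(padded)
# tensor([[1, 7, 6, 4, 0],
#         [2, 9, 5, 3, 8]])
```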
All right, let's move on to the actual model. There are going to be two parts to our code. We're going to fill in this class on the right, and we have one function called the init function, also known as the constructor. That's where we're going to define the actual model. If you look at a neural network diagram like this (and don't worry, not much background on neural networks is required for this), what defines this network is how many layers there are and how many nodes are within each layer. That's what we're going to define in this first function, the init function. There's one parameter passed in called vocabulary size, which is the number of different words the model should be able to recognize. We can take a look at the example there and see 170,000; that's about the number of words in the English language. So there's going to be that parameter there, and we'll explain how it ties into the model in a bit. And just to make sure we understand the inputs and outputs here, let's take a look at those two sentences passed in. We can see that those sentences are passed in in number format, so this is kind of a continuation of the first problem. We talked about how we're not going to pass strings into models; we're going to pass in sequences of numbers, where each word is represented as a number. The first string passed in, where we want to detect the emotion, is "The movie was okay." We can see that it's padded with a bunch of zeros, because the second sentence is way longer. The second sentence was "I don't think anyone should ever waste their money on this movie." We can see in the output that the model produced 0.5 for the first sentence, essentially saying it's a mix between positive and negative. "The movie was okay" is a very neutral statement. For the second, the model produced the number 0.1, which is much closer to zero than one, and that's obviously a very negative statement, to say that no one should waste their money on this movie. If we take a look at the description on the left, where it talks about the model architecture we're supposed to use, it says to use an embedding layer of size 16, compute the average of the embeddings to remove the time dimension, and finish with a single-neuron linear layer followed by a sigmoid. Kind of confusing, right? We're going to explain what all of that means, so don't worry about reading that for now. Let's get into it.
All right. So, the first step in any NLP model is to actually get the embedding vectors for every word. Here's an example of embeddings. We can see that words that are more related in terms of their meaning are closer together, and words that are not so related are farther apart. So, we want to have an embedding layer, meaning we simply want the model to have vectors that encode the meaning of every word. In this case, we only have two-dimensional vectors, but we can see in the problem that it says to use an embedding layer of size 16, so we want a vector with 16 entries. The larger this vector is, the more information the model can encode for each word. All right, it's actually pretty straightforward to define an embedding layer in PyTorch. We're simply going to say `self.embeddings = nn.Embedding(...)`. We're making an instance of this existing class in PyTorch called `Embedding`, which comes from the `nn`, or neural network, module, which has a ton of useful classes that we'll use later in this problem as well. The two things that we have to pass in are the vocab size (the number of words in the English language, which is the number of different words we want to store an embedding for) and the size of each embedding vector, which the problem description says is 16. The way we can think of this is to imagine a table where the number of rows is equal to the number of words in our embedding layer (vocab size, the number of words in the English language) and the number of columns is 16, because we have a vector of size 16 at every single row. Each row, or each vector, is essentially storing the embedding for that word. Then in the forward step (we haven't gotten to the forward step yet, so don't worry too much about it right now), when we're getting the model prediction for some sentence that's passed in, some sequence of words, we're going to fetch, or pluck out, the relevant rows for each word in our input sentence from that embedding table we talked about earlier. Okay, the next step is to define the linear layer for this model.
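As a quick sanity check on the row-lookup picture described above, here is a toy embedding layer. The vocab size of 10 and the word indices are made up for illustration:

```python
# A small embedding lookup: a 10-row table with 16 columns per row,
# matching the embedding size from the problem description.
import torch
import torch.nn as nn

embeddings = nn.Embedding(10, 16)  # (vocab size, embedding size)

# A batch of two "sentences" of five word indices each (2 x 5).
x = torch.tensor([[1, 7, 6, 4, 0],
                  [2, 9, 5, 3, 8]])

word_embeddings = embeddings(x)  # plucks out one 16-entry row per word
print(word_embeddings.shape)     # torch.Size([2, 5, 16])
```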
You might be familiar with linear regression. Here's an example of that equation when we have three input attributes, X1, X2, and X3. But this equation can also extend to any number of input attributes, like 16 different input attributes. We can have X1 through X16 passed into this equation, and even in that case we have one output number. The reason we're going to need this is that ultimately this model should output one number: the emotion found in the text. But at this point we have vectors of size 16 for every single word. So, we're going to need some sort of linear layer to project our dimension back down to one single number. And of course, we're going to need to define an instance of the sigmoid function pictured here. It's simply a nonlinear function whose outputs are between 0 and 1, which is exactly what we want for sentiment analysis. So, let's go ahead and define those two. First, we can define our linear layer. We'll simply say `self.linear = nn.Linear(...)`, and you have to pass in the number of input attributes, which is 16, and of course we only want one single number output there. We also want to define our sigmoid function; that's just `nn.Sigmoid()`. That's it for the constructor, or init function.
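Before moving on, here is a quick demonstration of the sigmoid's squashing behavior; the input values are arbitrary:

```python
# The sigmoid maps any real number into (0, 1), which is why it works
# as the final activation for a 0-to-1 sentiment score.
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()
s = sigmoid(torch.tensor([-10.0, 0.0, 10.0]))
print(s)  # roughly 0, exactly 0.5, and roughly 1
```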
All right, so now let's write the forward function, which simply passes our input data x through the neural network. The first thing we're going to do is pass x into `self.embeddings`, and that's going to give us the embeddings for every single word. Let's think about what the shape of `word_embeddings` is. Imagine for a second that x is 2 by 5, so let's add a comment here: x is 2 by 5, meaning we have two different sentences passed in and we want to get the model prediction for both. That's roughly in line with our input example on the left; we had two sentences there. Let's say five is the length of each sentence. I know the sentences have different lengths in that example on the left, but let's not worry about that for now. Let's say x is 2 by 5, so we have five words in each sentence. And remember, we said the dimension of each embedding vector is 16. So what is going to be the shape of `word_embeddings`? Well, assuming x is 2 by 5, `word_embeddings` is going to be 2 by 5 by 16, because we still have all the words from earlier, but for every single word, we no longer have just a single number like we did in the input on the left; we now have a vector of size 16. That's why it's going to be 2 by 5 by 16. But we're going to have to shrink this giant three-dimensional matrix down eventually to a single number for each input sentence; that's the goal of this model. So, we're going to have to make some simplifications. What we're going to do is assume that every word in the sentence matters equally. That's obviously a bad assumption, since some words are more important than others, but it gives us a simple model for now. Let's assume that every word's embedding vector matters equally. More advanced models, like the transformer, actually weigh each word differently using a concept called attention, but let's not worry about attention for this project. Okay, so to shrink and simplify this giant three-dimensional matrix, let's go down to a two-dimensional matrix by weighing each word equally and simply averaging all the embeddings.
All right, so I've gone ahead and written the code for that step. We're using the `torch.mean` function for averaging, passing in `word_embeddings` and specifically saying `dim=1`. The dim, or dimension, argument follows zero-indexing: the two here would be dim 0, the five here is dim 1, and the 16 here is dim 2. So what is going to be the shape of this tensor? What we did is get rid of that time dimension, the second dimension, which tells us how many words we have. In this dummy example, I said we had five words. And that's because we wanted to weigh each word equally and just average what's going on for every word into one straightforward vector. So we're going to have a 2 by 16 matrix now. And just in case that's not clear, what we did for each sentence is average those embedding vectors, each of size 16, across all five words in the sentence. So now for the first sentence we have one vector of size 16, and for the second sentence we have another vector of size 16. You can think of that vector of size 16 as the model's summary of all the important information in that sentence, encoded into one vector. But we still need to shrink this down even more. We want a single number for every single sentence. That's where the linear regression equation from earlier comes into the picture: we want to apply it to each of those 16-dimensional vectors. We'll call this our pre-sigmoid output, since technically it won't be the final output, because we haven't passed it into the sigmoid function yet. We simply say `self.linear`, the linear instance that we created earlier in the init function. And what do we want to pass into it? We want to pass in `average`. And what is going to be the shape of the result? It'll be 2 by 1 after the linear regression equation is applied. All right, the final step: we just want to pass that pre-sigmoid output into the sigmoid function, so that we get two numbers, one for each sentence, each between zero and one, and that's what we can return. Just to clarify, the shape is still going to be 2 by 1. All right, so we're ready to run our code against the test cases. We just need to round our answer to four decimal places to make the answer checking easier.
Let's go ahead and run the code, and we can see that it works. Okay, in this video we're going to go over three core AI concepts that you've got to know. It's going to be different from a lot of other intro-to-AI videos, as it doesn't require any machine learning experience. We're also not going to stray away from the technical details, so be sure to stick around. AI is becoming more and more important, even for software engineers, so make sure to master these three concepts.

Okay, the first concept is called training. People talk a lot about models training and learning, but what does that actually mean? Well, first we need to define what a model actually is. The absolute simplest kind of model is this equation right here. Let's say we're trying to predict how good someone is at beer pong. You might have played beer pong before. Let's say our model is going to predict someone's win accuracy at the game, and the accuracy is going to be predicted based on three inputs: X1, X2, and X3. Let's say X1 is your alcohol tolerance, the number of beers you can take before you start slurring your words. Let's say X2 is your general accuracy, a number on a scale of 1 to 10 that represents how accurate you are with each throw. And let's say X3 is your trash-talking effectiveness, another number on a scale of 1 to 10 that represents how good you are at trash talking, since getting in your opponent's head will obviously affect how good you are at the game, your chance of winning. That's what the model is predicting; that's what the value Y represents. But in the equation, we also have these other numbers: W1, W2, W3, and B. Those numbers are called the parameters, and that actually is the model. Those are the numbers that the model uses to make its prediction Y whenever we pass in X1, X2, and X3 for any arbitrary person.
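That equation can be written out as a tiny function. The weights and inputs below are made-up illustrative numbers, not learned values:

```python
# The beer pong model described above: y = w1*x1 + w2*x2 + w3*x3 + b.
# All numbers here are invented for illustration.
def predict_win_rate(x1, x2, x3, w1, w2, w3, b):
    return w1 * x1 + w2 * x2 + w3 * x3 + b

# x1: alcohol tolerance, x2: throw accuracy (1-10), x3: trash talk (1-10)
y = predict_win_rate(x1=4, x2=8, x3=6, w1=0.01, w2=0.08, w3=0.02, b=0.05)
print(round(y, 2))  # 0.85
```

Notice that accuracy (x2) carries the largest weight in this made-up setting, matching the intuition below that W2 should end up large.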
So then, what training is, is just finding the right values of those parameters, W1 through W3 plus B. It's about finding and updating those values until we're satisfied with the model's prediction Y, until we feel like it's accurate enough. When we initialize the model, we'll have totally random numbers for W1 through W3 as well as B. But over the course of training, this iterative process, we'll be adjusting those values, adjusting the values of the parameters, until we actually have an accurate enough model. Okay, but what do those W's actually represent? We have each W multiplied by an X. Let's look at W2, for example; it's multiplied by X2. Each W is just factoring in how important each input variable is for determining Y. And if that didn't make sense, let's go over an example. We have W2 multiplied by X2, so the higher X2 is and the higher W2 is, the higher Y is going to be. We know that X2 was someone's accuracy with each throw in beer pong, and obviously that's pretty important for determining Y: the higher the accuracy of their throws, the higher their probability of winning, their win rate, is going to be. So we're going to have the model learn, over the course of training, the right value for W2, because W2 is just factoring in how important X2 is. Let's say we initialize W2 to something like -2, just a random initial value. Well, the model is going to end up learning some much larger positive number for W2, because the higher X2 is, the higher Y should be. The model is going to have W2 measure, or encapsulate, how important X2 is for predicting Y. That's why we're multiplying them together. So that's all training is. Training is just the process of updating the model parameters, which are usually just numbers like W's and B. Those numbers start off totally random, but over the course of training, or learning, their values are updated until they actually make sense, so that no matter what X1, X2, and X3 are, the model can make a pretty solidly accurate prediction for what Y should be. So that's all for concept number one, training. When we get to concept number three, we're actually going to talk about gradient descent. It's this equation; it'll pop up on the screen soon. That's the actual equation used to update the values of the W's and B. But we don't need to worry about that for now. We'll get to that eventually.
that eventually. All right, on to concept number two. If you made it through concept number one, that's awesome. try to hang in for the rest of
awesome. try to hang in for the rest of the video cuz it's going to be a lot easier from here on out. Number two is linear regression. And the good news is
linear regression. And the good news is we already talked about it in concept number one. This equation that we were
number one. This equation that we were talking about, that's linear regression.
It's a simple unsexy model from statistics, but it's actually the foundation of AI. So, you've got to understand linear regression. The idea
is pretty simple. We're going to have some number of inputs, our X's, and we'll have these W's that we multiply against the X's. We add some number B, and that's how we get the model's prediction Y. And this can work for an
prediction Y. And this can work for an example where we have any number of inputs. In our case with beer pong, we
inputs. In our case with beer pong, we had three input attributes, but you could have two input attributes, five input attributes, and the equation would change accordingly. We would have more
change accordingly. We would have more W's. We would have five W's, W1 through
W's. We would have five W's, W1 through W5, if we had X1 through X5. So the
equation is flexible and it can change accordingly. So linear regression sounds
accordingly. So linear regression sounds awesome, right? But there is one issue
awesome, right? But there is one issue and it's that most data in the world isn't strictly linear. looking at this equation, it's not going to capture any nonlinear relationships like this one right here. Right? So that's where
right here. Right? So that's where neural networks come in. And that brings us to concept number three. Okay? So I
know I said we were going to finally talk about the equation for gradient descent, for training, in concept number three. But first, let's actually go over this diagram: neural networks. And I think this is going to be one of the simplest explanations you've ever seen.

Okay, so to make this neural network explanation much simpler to understand, we're going to explain it in terms of linear regression. Now, obviously, neural networks are far more powerful than linear regression. The relationships in the data that they can model are far different from what simple linear regression models can do. So I'm going to explain neural networks in terms of linear regression, but just keep in mind that they're not actually identical to linear regression, due to nonlinear functions like this one, which we'll talk about shortly. And just so anyone doesn't get triggered in the comments: I'm not oversimplifying things or dumbing anything down. We're just going to start off by explaining neural networks in terms of linear regression. Also, if you're actually still watching this video, you're not an NPC. I like you. Let's get into it.

So the premise is the same as before. We have X1, we have X2, and we have X3. Those are going to be in that first leftmost column, which we call the input layer, or the input column. So there are going to be those same three input attributes. And on the right side of the diagram, we're ultimately going to predict this one number O, or Y, whatever you want to call it. That's going to be someone's predicted chance of actually winning a game of beer pong. So it's the same x1, x2, x3 situation as before. What makes neural networks different is that we actually have this column of nodes in the middle, right? We
call that the hidden layer. And what we're going to do is simply more linear regression. We're going to let each of those nodes do linear regression and use the equation we talked about earlier, right? This equation where we have W1 through W3 plus B. We're going to have each of those nodes in the hidden layer use that equation. So then we're going to get four numbers, y1 through y4, all calculated based on the same x1, x2, and x3 from the input layer. The difference is that each of those four nodes is learning and updating its own set of parameters, its own set of w1, w2, w3, and b. So overall the model is doing a lot more calculation, which is going to make neural networks far more powerful than linear regression. Of course, there are nonlinearities, which we still have to talk about, but this is also a big part of neural networks: just the fact that we have more nodes doing more calculations and more computation as a whole. But we have four numbers right now, y1 through y4, and we need to get one final output number O. So this final node is also going to do some more calculation, but it's going to take in four numbers as input. It's going to use this equation right here. It's going to take in Y1 through Y4, and it's going to learn W1 through W4, as well as, again, another constant term B, to predict that final output number O.
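The hidden-layer arithmetic just described can be sketched in a few lines of Python. The weights and inputs below are made-up illustrative values, not trained parameters or numbers from the video:

```python
# Sketch: each hidden node runs its own linear regression on the same
# inputs x1, x2, x3; the output node then runs one more linear regression
# on y1..y4. All weights here are made-up values for illustration.

def linear(inputs, weights, b):
    # w1*x1 + w2*x2 + ... + b
    return sum(w * x for w, x in zip(weights, inputs)) + b

x = [0.5, 1.0, 2.0]  # the three input attributes

# four hidden nodes, each with its own w1..w3 and b
hidden_params = [
    ([0.1, 0.2, 0.3], 0.0),
    ([0.4, -0.1, 0.2], 0.1),
    ([-0.3, 0.5, 0.1], -0.2),
    ([0.2, 0.2, 0.2], 0.05),
]
ys = [linear(x, w, b) for w, b in hidden_params]  # y1..y4

# the output node takes y1..y4 and does linear regression one more time
o = linear(ys, [0.25, 0.25, 0.25, 0.25], 0.0)
print(o)
```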
Okay, so that's the gist of neural networks. We just need to talk about nonlinearities. So here is an example of a nonlinearity called the sigmoid function. And just bear with me, we're almost done. Everything's going to make sense. With the sigmoid function right here, we can see that the input could be anything, right? Any number. But the output on the y-axis will always be between 0 and 1. So this function is transforming any input to be between 0 and 1 in a nonlinear way.
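A minimal sketch of the sigmoid function:

```python
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
```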
And we want to use functions like the sigmoid function in our neural network just so the model can actually capture and learn nonlinear relationships, because most data in the real world isn't a simple linear relationship. So how exactly do we want to incorporate this sigmoid function into the neural network? Well, we want to incorporate it between layers. So after the input and hidden layers, we've done some linear regression, right? We've calculated those four numbers, Y1 through Y4. But before we pass those numbers into that final output layer, that final output node, we should actually pass Y1 through Y4 each into the sigmoid function to get four different outputs, all of them between 0 and 1, just because that's what the sigmoid function does. And those four numbers, the ones between 0 and 1, the outputs from the sigmoid function, that's what we're going to send into that final output node. And this might seem like a small difference, right? Like, okay, what's the big deal? All we've done is incorporate this one weird-looking curve into the function. But this is going to drastically change the power of the model. The model is now going to be able to pick up on way more complex relationships and nuances in the data. And if that didn't make sense, just leave a comment below.
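Putting the pieces together, here is a minimal sketch of the full forward pass, with the sigmoid applied between the hidden and output layers. All weights are again made-up illustrative values:

```python
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

def linear(inputs, weights, b):
    return sum(w * x for w, x in zip(weights, inputs)) + b

x = [0.5, 1.0, 2.0]  # the three input attributes

# four hidden nodes, each with its own made-up w1..w3 and b
hidden_params = [
    ([0.1, 0.2, 0.3], 0.0),
    ([0.4, -0.1, 0.2], 0.1),
    ([-0.3, 0.5, 0.1], -0.2),
    ([0.2, 0.2, 0.2], 0.05),
]
ys = [linear(x, w, b) for w, b in hidden_params]

# pass y1..y4 through the sigmoid BEFORE the final output node
activations = [sigmoid(y) for y in ys]  # each between 0 and 1

o = linear(activations, [0.25, 0.25, 0.25, 0.25], 0.0)
print(o)
```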
We'll talk about it more in another video.

Okay, the video is definitely getting a bit long now. So, just to wrap it up, let's come back to gradient descent. It's this equation right here, and it's the actual equation used for training, right? To actually update those W's and B at every iteration. The equation depends on the old value of W or B, as well as this other variable alpha, which we'll explain, as well as the derivative. So it might be kind of weird that derivatives from calculus are coming into the picture now, and I promise it won't be anything crazy complicated. And for gradient descent, this crucial ML algorithm, I actually have a separate video. It should pop up now. It'll be in the description. It'll be in the pinned comment. I put a ton of time into that video, animating exactly how gradient descent works. It's just around 3 or 4 minutes long, so definitely check it out. Now, I honestly would have just included the gradient descent explanation in this video, but to keep the video from getting too long, so it's actually digestible and not too overwhelming, there's a separate video linked in the description and comment. It should be popping up now, too. Go ahead and check it out.

Hey everyone,
it's Dev again. This video is a quick and interactive review of the math you need for ML. I can confidently say that this is the most concise review of the math you need for ML. But don't take it from me. Thousands of students in the GPT Learning Hub community have used this exact guide to kickstart their ML journey. Without further ado, let's get started.

It's actually a misconception that you need a crazy amount of math for ML. You can get started with a basic understanding of matrix multiplications and derivatives from calculus. That means you should understand how to find the derivative of functions like this.
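For instance, for f(x) = x², the power rule gives f'(x) = 2x, which you can sanity-check numerically (a sketch):

```python
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# the power rule says f'(3) = 2 * 3 = 6
print(numerical_derivative(f, 3))  # approximately 6
```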
But don't worry about functions like these. Also, you may have heard before that complex ML models just boil down to matrix multiplication. While this is a bit of an oversimplification, there's also plenty of truth to it. Consider a large language model. Here's how they generate text. The prompt is passed in, and the output is the word that's most likely to come next in the sequence. This word is concatenated to the input sequence and the next word is obtained.
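That generate-append-repeat loop can be sketched in a few lines. Here `predict_next` is a toy stand-in lookup table for illustration, not a real language model:

```python
# Toy sketch of the autocomplete loop. A real LLM would be a neural
# network; predict_next here is just a hypothetical lookup table.
bigram = {"the": "cat", "cat": "sat", "sat": "down", "down": "."}

def predict_next(sequence):
    # stand-in for the model: return the word most likely to come next
    return bigram[sequence[-1]]

sequence = ["the"]
for _ in range(3):
    next_word = predict_next(sequence)  # get the next word
    sequence = sequence + [next_word]   # concatenate it and repeat

print(" ".join(sequence))  # the cat sat down
```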
We can concatenate again, and the process repeats. LLMs are just large autocomplete models that we can iteratively invoke. But to actually understand the internals of this black box, we have to dive into matrix multiplications and some basic derivatives. Let's go over a few quiz questions that cover the essential math for ML. And then after you finish the quiz, I recommend checking out my step-by-step machine learning road map, which consists of over 25 video modules and practice problems. It's completely free, and you can grab it from the link in the description. Let's get started with the quiz.

Question one: given two matrices, what is the shape of the product? Understanding the dimensions of matrix multiplication is actually important for implementing models, since libraries like PyTorch require you to specify these dimensions when instantiating components of a model. We have a 4x6 matrix on the left and a 6x4 matrix on the right. The first rule to check is whether this matrix multiplication is possible. The number of columns in the first matrix needs to match the number of rows in the second matrix. If these dimensions are unequal, then the product is undefined. When we get to the question on how to actually multiply two matrices, this rule will make sense, but for now, let's just accept it. Okay, so the matrix multiplication is defined. The shape of the product is another simple rule.
Let's take the number of rows in the first matrix and the number of columns in the second matrix, and those will be the output dimensions. For the actual multiplication, we'll pair up every row in the first matrix with every column in the second matrix. The answer then is 4x4. This might be unsatisfying, so let's move on to the next question, which asks us to actually perform the multiplication. We have three columns in the first matrix and three rows in the second matrix. So the output is defined and will also be a 3x3 matrix. The only phrase you need to remember for matrix multiplication is "row, column." To get the top left entry, we multiply the first row with the first column. This means multiplying the corresponding numbers and adding them up. To get the other entries in the output, we keep following the row-column rule. First row, second column, and we get the second entry.
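The whole row-column procedure can be sketched in plain Python; the matrices below are made-up examples, not the ones on screen:

```python
def matmul(A, B):
    # columns of A must equal rows of B, or the product is undefined
    assert len(A[0]) == len(B), "inner dimensions must match"
    rows, cols, inner = len(A), len(B[0]), len(B)
    # entry (i, j) = row i of A paired with column j of B,
    # multiplying corresponding numbers and adding them up
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

A = [[1, 2, 3],
     [4, 5, 6]]    # 2x3
B = [[7, 8],
     [9, 10],
     [11, 12]]     # 3x2

print(matmul(A, B))  # a 2x2 result: [[58, 64], [139, 154]]
```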
First row, third column, and we get the third entry. Second row, first column, and we get this entry, and so on. This also explains why the number of columns in the first matrix needs to match the number of rows in the second matrix: the number of columns in a matrix is just the number of entries per row, and the number of rows is just the number of entries per column. This might seem abstract right now, but remembering the phrase "row, column" will be very helpful in understanding self-attention, the crux of transformers and large language models.

Question three: at what value of x does this function have a positive slope? Understanding basic calculus is important for gradient descent, the algorithm behind training. We can look at a line tangent to the function at three points: x < 0, x = 0, and x > 0. Only when x is greater than 0 does this line have a positive slope. So the answer must be x = 3. This might seem simple or obvious, but it's actually critical to visually understanding gradient descent, the first ML algorithm everyone should learn. That video is linked in the description if you're interested in watching it next. Lastly, question four:
How many derivatives does this function have? This is actually important when minimizing a function during training. To find the minimum of a function, we adjust the values of the inputs, in this case x, y, and z. In gradient descent, the derivative of the function with respect to each input is used to minimize the function. That's one derivative for each input, for a total of three derivatives. And that wraps up our short quiz.

Finally, I have a huge announcement: the beginner's blueprint is finally available. This is the exact study plan I wish I had when I was first getting started with machine learning.
Everyone told me to read papers, but I had no idea which papers to read. Once I figured that out, I had no idea how to understand them, dissect them, much less implement them and code up the main concepts. And lastly, the workday was a nightmare. I had no idea how to present these projects on my resume and actually land more interviews. The beginner's blueprint will solve all of these problems for you so that you can make progress faster than I did. Take it from one of our members from the IIT Madras class of 2025. I personally provided him with a road map, helping him get started with the implementation ASAP. Or another member, an NLP expert whose personal favorite resources are our ML programming questions, which are accessible for free. Lastly, someone like Chang. He's an ex-Yahoo AI/ML engineer, and he knows his stuff. It's been a blast making videos on this channel for the last year, and I'm excited to help even more of you with premium personalized instruction. Our launch sale is now active, and you can secure the entire blueprint for 50% off. Head to the link in the description to learn more, and I'll see you on the other side.

People talk a lot about training ML models, but what does this actually mean? There's often a training data set, and for simplicity, let's say our model predicts how tall someone will be based on their weight and height at age 10.
The data set could have thousands of people's weight and height at age 10, as well as their final height once they stop growing. During training, the model learns a relationship in the training data, and there's an algorithm called gradient descent used for this. Let's quickly go over it. Now that the NPCs have clicked off the video, let's get into it.

Gradient descent is used to minimize a function. Here's a simple function, y = x^2. In calculus, and don't worry if you don't remember this, you might have taken its derivative, 2x, set it equal to zero, and found that x = 0 is the minimum of the function. But for some functions, like the ones in machine learning models, it's just way too complicated to take the derivatives by hand. In some cases, it's even impossible. We need a different way to find the minimum. But why is minimizing a function even important for training models? It's because we want to minimize the error function: the error between the model's prediction for the final height and the true answer, for all the people in our data set.

Let's build some intuition for how we might approximate the minimum of a function with efficient iterations. Here we have y = x^2 again, and let's say our initial guess for the minimum is x = 3. We know the minimum is at x = 0. What if we look at the slope of the function at x = 3? This is the same as the derivative, or the gradient. It's a positive number, meaning the function is increasing at this point. But if we want to get to the minimum of the function, we want to step in the opposite direction. Let's say our new guess is the old guess minus the slope times some step size, which I'll call alpha. For simplicity, let's say alpha is 0.1, but in practice, we would use smaller values. Our new guess is then 3 - 0.6, which gives us 2.4. We got closer to the answer of x = 0. If we repeated this procedure for enough iterations, we would converge to the answer. What if our starting guess was -2? Here's the slope. The derivative is -4, and using the same formula, we get -2 + 0.4, for a new guess of -1.6. Again, we got closer to our answer, and we would repeat this process. Here's the algorithm in pseudo code. So now we have some intuition for why this process works. You can try implementing it here and running your code against the test cases at the link in the description.
For those who want a deeper understanding, there's actually a theorem that says if you're sitting at a point on a function and there's a bunch of directions you could travel in, the gradient gives the direction of greatest increase. So the negative of the gradient, which is exactly what we use, since we subtract out alpha times the gradient, gives us the direction of greatest decrease. Stepping in the direction of greatest decrease enough times gets us very close to the minimum of a function. So that's all it means to train a model: we're just iteratively minimizing the error function. What this error function actually is depends on the model we're training. And to learn more, I recommend going through this list, starting with implementing gradient descent. Leave a comment for future video suggestions, and I'll see you soon.
Hey, in this problem we're going to solve gradient descent; we're going to implement this super important algorithm. If you've ever heard of machine learning or neural networks, or just a lot of AI algorithms in general, they're trained with this algorithm called gradient descent. It can often be super annoying when people talk about, oh, the model is learning a relationship, oh, we trained it on this data set. At the lowest level, what that's actually talking about, just to make it super clear, is this algorithm called gradient descent.

In this problem, we're asked to minimize the function f(x) = x^2. So gradient descent, it's an algorithm for minimizing a function. It can be used to maximize a function too, but most of the time we use it to minimize a function. So what is the x that minimizes the function y = x^2? Clearly the answer is zero. But we have to implement an iterative approximation algorithm. Just taking a look at the graph right here, you can see that this function's lowest y-value is zero, and the x value that achieves that is x = 0, as you can see at the origin.

Why don't we take a look at some of the test cases for this problem? As input, we're going to be given the number of iterations for the algorithm to perform. So it seems like that might affect the answer. Then there's the learning rate; don't worry about that for now, we'll talk about it later. And then we have init, which is like an initial guess for which x minimizes the function. So let's take a look at some of the test cases.
In example one, we can see that if we were to run this algorithm for zero iterations with an init of five (don't worry about the learning rate for now), the output is five. That makes sense. It's kind of just a base case, if you want to think of it that way. In example two, we have the same learning rate again (we'll explain what that is later) and the same initial guess. The only thing we do is increase iterations to 10. And it looks like we get something that's less than five, a number that's closer to zero, around four. So maybe we're developing this intuition that as we perform more iterations, we should get a better and better answer. But what do we actually mean by a better answer? Well, we're about to go into that in more detail in a bit. But essentially, what it means is we're getting a better approximation. Gradient descent is really an algorithm that won't find the exact answer. The exact answer would be x = 0. But our goal with this algorithm is to approximate the answer. And obviously that might seem really silly for a simple function like x^2. But when we get to super complicated functions, like the functions inside self-driving cars, large language models like ChatGPT and Llama, and deepfakes, we're definitely going to need approximation algorithms. So maybe as we perform more iterations, we'll get closer to the answer. If we take a look at the graph over here, let's say we started off at x = 5, and over the course of 10 iterations, we know that we did make some progress in this direction, towards x = 0. We got a little closer to our answer. So now let's talk about why we actually care about minimization algorithms, right? Because we want to use gradient descent for minimization. So let's go over that.
Okay, so we're using gradient descent for minimization. And there are two main things to understand. The first is: why do we care about minimization? And the second is: why are we using gradient descent? Maybe I'll go over the second one first. In a calculus class — and it's not a huge deal if you haven't taken calculus, you don't really need to remember a ton of details — the main thing you need to know to understand these videos and this list of problems is just very basic stuff from calc 1. Like, you should know that the derivative of this function right here is 2x. We just bring the two down and the exponent changes to one. So we have to know some basic calculus for this kind of sequence of videos, but nothing crazy complicated. We're going to focus more on the actual concepts you need to know. So if you remember from calculus, one way you can find the x that minimizes this function is you take the derivative, so that's 2x, and you set it equal to zero. Clearly this gives us x = 0, right? So it might seem kind of silly that we're using this iterative algorithm that's only going to get us an approximation, not even the exact answer. But it turns out the functions can get crazy, crazy complicated. They could take in more than one thing, right? X, Y, Z. There could be a bunch of parameters or inputs to the function. They don't necessarily even need to output a single number. The output itself could be some sequence of numbers, some vector. And they can get even more complicated than this. And we can already tell that doing that math by hand — they call this analytically solving it; that term isn't super important — is sometimes going to be basically impossible. So we need an approximation algorithm. And so,
why do we want to minimize functions? Going back to the first thing: why do we even want to minimize functions? Why is this apparently so crucial for AI and machine learning and neural networks? We're going to go into more detail on what neural networks are later; let's just keep this black box of AI and ML in our heads right now. If you saw my previous video on what AI/ML is, it's essentially just about approximating functions, right? It's about making predictions. Given X, we want to predict Y, right? ChatGPT takes in some sequence of words and needs to predict the next word. For self-driving, we take in the surroundings, and the model inside the car needs to figure out if we should either brake or keep driving, right? Then your medical AI algorithms take in an image, like a scan of someone's brain, and need to figure out if they have a certain disease, right? It's all about taking in some input and making a prediction.

So why is minimizing a function even important for AI then? It turns out that when we're training these models, what that actually means is we're minimizing something: we're minimizing the error. So let's say we had a data set, right? Some set of data points. We want the model to pick up the relationship between these data points, and then we want to be able to feed in a new data point it's never seen before and get the right answer, or at least close to it. And to do that, at the simplest level, we need the model to minimize the error on the training data set. People will call this the loss or the cost, and those are, in my opinion, just overly fancy terms. It's just the error, essentially. So you might have some error function, and we're just going to keep this super high level for now. The error function might basically be the model's prediction minus the true answer that we have in our data set, and maybe we should take the absolute value of that. That's not super important for now, but essentially we often want to minimize the error function, right? So that the model is doing a better job of learning the relationship in our training data points. And then for this prediction part in this equation right here, you might plug in whatever function the model is currently using to get the prediction. And we want to make this function better and better over time. But essentially, we need to minimize this error. This error itself is a function, and we want to minimize it. And that's why gradient descent matters.
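That prediction-minus-true-answer idea can be sketched with a tiny made-up data set and a one-parameter model (everything below is illustrative, not from the video):

```python
# Sketch of an error function: the average of |prediction - true answer|
# over a data set. The data and the model's form are made up.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (input, true answer) pairs

def predict(x, w):
    # the model's current function; here a simple one-parameter model
    return w * x

def error(w):
    # average absolute error over the data set
    return sum(abs(predict(x, w) - y) for x, y in data) / len(data)

# training means finding the w that makes this error small
print(error(1.0))  # a poor choice of w gives a larger error
print(error(2.0))  # a better choice of w gives a smaller error
```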
Now let's jump into how it works, so then we can actually implement it.

All right. So everything we've said so far is cool and all, right? But we've just been treating gradient descent as a black box. It takes in a number of iterations to perform, this thing called the learning rate, and our initial guess. And somehow it outputs a better guess depending on how many iterations we do. So let's actually explain what's going on inside that black box. It turns out this whole training thing, this whole process of the model learning the relationship — gradient descent — is basically just a big for loop. So I'll write it out in pseudo code, and then we'll explain what it means conceptually and visually.

It might be something like: for number of iterations. This is essentially what we're going to do at every iteration. All we're going to do is take our current guess, right? In the first iteration that's just the initial guess, init, and it improves over time as we perform more iterations. And what we're going to do is essentially calculate the derivative. We're going to minimize the loss function, right? Or minimize some function. So what you want to do is take whatever function we're trying to minimize — why don't we just call it f for now? — and calculate f prime evaluated at the current guess. So send the current guess into f prime; this is going to be changing over the course of the algorithm. And this prime thing, that little prime symbol that I wrote above f, just means the derivative. You want to calculate the value of the derivative at whatever the current guess is, right? And then essentially, you want to update the current guess at every iteration of the algorithm. So you'll say — to make this super clear — current guess should be set equal to current guess minus alpha times that thing we calculated over here. Why don't we store it in a variable called d for now, d for derivative or something. You would do alpha * d, and alpha is actually that learning rate that we talked about earlier, and now we're going to explain how that works.
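The pseudo code maps to Python roughly like this. This is a sketch that assumes the problem's inputs are iterations, learning rate, and init, and that the function being minimized is f(x) = x², so f'(x) = 2x; the learning rate of 0.01 in the examples is an assumed value:

```python
def gradient_descent(iterations, learning_rate, init):
    current_guess = init
    for _ in range(iterations):
        d = 2 * current_guess  # f'(x) = 2x for f(x) = x^2
        current_guess = current_guess - learning_rate * d
    return current_guess

# matches the test cases described earlier (learning rate 0.01 assumed):
print(gradient_descent(0, 0.01, 5))   # 5 -- zero iterations, the base case
print(gradient_descent(10, 0.01, 5))  # around 4, a bit closer to zero
```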
This is essentially the pseudo code for gradient descent. Now let's explain
gradient descent. Now let's explain exactly how it works. So it's super intuitive. Okay. So I think we'll start
intuitive. Okay. So I think we'll start off visually and then kind of go back to focus more on the formula. So let's say we have our initial guess, right? Let's
say our initial guess is x= 5, right?
Well, the first thing in every loop, the algorithm tells us to calculate the value of the derivative. There's this fact from calculus — it's all right if you don't remember it — that one way to get the derivative at some point is to draw a line that's tangent to the function at that point, meaning it touches the function in exactly one place. My drawing is kind of terrible, but we just pretend it's straight. The slope of that line is essentially the value of the derivative at that point.

Then if we look at the next line of code after we calculate d, what it's doing is updating our guess. To update our guess, we take a negative step in the direction of that slope: whatever the slope is, we take the negative of it and step by some fraction, alpha. Alpha is going to be something between 0 and 1 — it might be 0.01, it might be 0.1. We're taking a fraction of d — that's why we multiply d by alpha — a fraction of the negative slope at that point, and we just move in that direction.

So let's look at what that would actually be doing visually for this example. The slope of the tangent line at x = 5 is positive, so stepping in the negative direction pushes our x back this way, toward zero. And that's exactly what we want: we want to get our x value closer to zero. So let's erase that guess for now.
Let's say our current guess was instead somewhere over here, like x = -2, and we take a look at the value of the derivative there, which is again just the slope of the tangent line. Well, the negative of a negative number is a positive number, so stepping by some fraction of that is still a positive step: we would actually be increasing our x value, moving closer to zero. Obviously one step isn't going to get us all the way there. So let's imagine some numbers. Say the slope of this line — we use m for slope, as in y = mx + b — is, just to keep the values super simple, -1; we know it has a negative slope. And let's say our learning rate is 0.1 for now. Typically people use 0.01 or 0.001, but things will still make sense with 0.1.

Now is probably a good time to specify more precisely what I mean by learning rate. The learning rate is what we call a hyperparameter for this algorithm: it's always specified up front, as an overall input to gradient descent. It already has the word rate in it — it controls how fast we change the current guess, how fast we take our steps toward the answer. Since it's between 0 and 1 and it's being multiplied by d, it scales d down: multiplying d by something less than one but still greater than zero leaves you with a fraction of it. The higher alpha is, the greater the fraction of d we're left with, and that's what we subtract from the current guess — so with a high alpha, a high learning rate, you change the current guess a lot every iteration. That's why we call it a rate: it's how fast we're getting to our answer. If you use a value of alpha that's too small, you need way too many iterations, and the runtime gets worse — and we want to minimize runtime. But if you use an alpha that's too big, you can tell from the formula that you might change the current guess by way too much in a single iteration. To give you an intuition of what might happen there: say we're currently at -2, and we accidentally overshoot all the way past our answer of x = 0, ending up way over on the other side for our next value. We increased x by too much and went all the way past zero. That's why using a value of alpha that's too high won't yield good results in practice.

Anyway, now that we have a better intuition for alpha, let's go back to the example. Our new guess should be the old guess, -2, minus 0.1 * d, and d was -1. Doing the math, that's -2 + 0.1 = -1.9. And we can see that -1.9 is greater than -2: we did take a small step in the right direction and got closer to our answer.
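That single update step, with the numbers from the example, can be checked in a couple of lines:

```python
alpha = 0.1          # learning rate from the example
current_guess = -2   # the old guess
d = -1               # slope of the tangent line, simplified to -1 in the example

current_guess = current_guess - alpha * d
print(current_guess)  # -1.9, a small step toward the answer at x = 0
```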
Okay, don't be worried about this picture — it might bring back some bad memories if you've taken multivariable calculus before. We're not going to do anything excessively mathy here. I just really want to explain why, at every iteration of the algorithm, our way of updating the current guess is to use the derivative. Why are we calculating this derivative and then using it as our update? It turns out there's a mathematical fact — not super important, but I feel I should mention it at least once — that says the gradient, or the derivative, points in the direction of greatest increase. We're not going to focus excessively on mathematical theorems in these videos; I want to focus on how to solve the problem. But I just want to make sure gradient descent is super clear, so we're not left wondering why this formula works the way it does. Why are we using the derivative? Why are we subtracting the learning rate times the derivative? The way you can think about it is that we're trying to minimize some overall function, and we need an approximation algorithm because we don't exactly know how to get there directly.
Imagine you're standing on top of a hill and you need to get to the bottom, but you can't see where the bottom is — you're blindfolded. All you can do is pick a direction and go downhill in that direction. You turn a little bit to the right, a little bit clockwise, and go down that way; turn a little more clockwise, go down that way. You have a lot of different choices of direction at every, quote unquote, iteration of this algorithm. But what if you just base it off of whatever is locally the steepest? What if you're greedy — you may have heard the term greedy a lot if you've been on the LeetCode grind — and just locally choose the best option: which direction goes steepest downhill? That should maybe get you there the fastest. Obviously, the direction that looks best from where you are right now might not actually be globally the best option — what if it starts going uphill later on? But locally, it seems like the best option. And let's say that every time you take a step, you reassess: you recalculate the derivative, check how steep things are now in each direction, and decide which direction to move in next. The idea behind gradient descent is that if you keep doing this for enough iterations — and obviously it might not be the most efficient path, because it might be super steep for a bit and then suddenly start going up again — then, on this imaginary hill or valley, it's mathematically proven for certain functions that you will end up really close to what you wanted: the overall minimum, the bottom of the hill or valley you were trying to get to.
So that's basically why we're using the gradient, the derivative: if it points in the direction of greatest increase, then the negative of the gradient must point in the direction of greatest decrease — and we were trying to minimize, not maximize, this function. This is all the information we have: we know how steep things are locally, around where we are. So let's just be greedy, go in the direction where the elevation is descending the fastest, and keep doing that — at every iteration, after every step, we re-evaluate and recalculate the value of the derivative. That'll get us there eventually. And just to visualize this for the more complicated loss functions, or error functions, that we're going to minimize: I might be standing right here, and the goal is to get to the global minimum, but I don't know what direction to go in. Do I go here? Up here? Down here? That's why we calculate the gradient. And don't worry, we won't be doing any super complicated derivatives for this function.
Thankfully, we have libraries like PyTorch that take care of that for us. But essentially, we calculate the gradient in step one of the algorithm; the direction of the gradient is the direction of greatest increase, and the negative gradient is the direction of greatest decrease. We keep stepping in that direction, scaled by alpha, over and over again, and we get to our answer.

Okay, so now let's jump into the code. We're just going to follow the pseudo code we talked about earlier — nothing complicated. Note that if we just tried to return zero, we wouldn't get the right answer, because we have to implement an iterative approximation algorithm: the test cases are not checking whether you return exactly zero. We can start with for i in range of the number of iterations — that's how many times we're going to update our guess. Then we calculate the current derivative: that's 2x if the function is y = x^2, so we just do 2 * x, where x is our current guess, which starts off as init. Then we update our current guess, which is stored in init: all we have to do is subtract out alpha times d, where alpha is just our learning rate. That's essentially the algorithm — we can go ahead and return init, and we're done. After rounding our answer to five decimal places, we can see that the code works.
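Putting those steps together, a sketch of the solution might look like the following. The function name and signature here are my assumptions about the problem's template, not necessarily the exact ones it uses:

```python
def get_minimizer(iterations: int, learning_rate: float, init: int) -> float:
    # Minimize y = x^2 with gradient descent; the derivative of x^2 is 2x.
    for _ in range(iterations):
        d = 2 * init                     # derivative at the current guess
        init = init - learning_rate * d  # gradient descent update step
    return round(init, 5)                # round the answer to five decimal places
```

With enough iterations and a reasonable learning rate, the returned value gets very close to 0, the true minimizer of x^2.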
So this is gradient descent. It's a great introduction and a great place to start if you're getting into machine learning and deep learning — it's actually the foundation of how neural networks are trained, since we are simply trying to minimize their loss. So stay tuned for the rest of the problems as we go ahead and build a language model.

In this video, we're going to explain linear regression. You might have heard of it before in a stats class or an intro AI class, and maybe the explanation didn't make a ton of sense. Here, we'll explain exactly what you actually need to know, skip over the unnecessary math proofs, and even cover what you'd need to know to code it up from scratch in Python.
Linear regression is actually really important: it's the foundation of neural networks (it's okay if you don't know what those are yet — we're going to get to that soon), and therefore the foundation of all the latest AI in deep learning, like ChatGPT, self-driving, and deepfakes. So it's definitely worth understanding. Let's break down what the term linear regression means, starting with the regression part.

Regression is just the opposite of something called classification. Say we wanted to build a model that predicts whether or not someone gets diabetes. There are only two possible outputs for this model: they either develop diabetes or they don't. Any time you have a fixed number of classes your input could belong to — say an image classification model that predicts whether the input image is a dog, a human, or a piece of food; that's still a fixed number of classes — we call that a classification model. All regression means is that there is not a fixed number of classes: the output is a number of some sort, it lives on a number line, and it could be literally any number, not one of a fixed set. So if you wanted to build an AI model that predicts how tall someone's going to be based on, say, their current weight, their current height, and how tall their parents are — whatever features we think are relevant — that output exists on a continuous scale. Sure, it's unrealistic for someone to be past some certain height, but in general the output is some number with some number of decimal places; it doesn't belong to a fixed set of two or three or five classes. That's all the regression part means.

Now let's get into what the linear part means, and this is probably the more important part. When we're building our model, our AI model is going to make some kind of prediction — that's what AI, and linear regression, is for. The linear part says something about the relationship between our data points and the actual answer. Going back to the height example, we had three pieces of information that might matter, and the corresponding output — the true answer for how tall someone ends up being. When we say linear, we're saying the function looks something like this: it takes in three things, x, y, and z — the three numbers we talked about earlier — and we just multiply each one by a weight: w1 * x + w2 * y + w3 * z. And we might add a constant called a bias, which you can think of as the b in y = mx + b. Linear regression is basically doing y = mx + b, but for as many input attributes as we want. So all the linear part really means is that we can't square anything, and we can't send x into a cosine or a logarithm — the only thing we're allowed to do is multiply our inputs by the w's and add constants. Over the course of something called training, which uses an algorithm called gradient descent and is when the model actually improves over some number of iterations, all we're doing at each iteration is adjusting what w1, w2, w3, and b are. Hopefully, at the end of some number of iterations, the model is a pretty decent way to predict how tall someone's going to be. For a new person that comes along, you can send in their x, their y, and their z — their weight, their current height, and how tall their parents ended up being — and hopefully the model gives a decent prediction for how tall that person ends up being.
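In code, the linear model itself is just a few multiplications and an addition. A minimal sketch — the weight values below are made up purely for illustration; training would learn the real ones:

```python
def predict_height(x, y, z, w1, w2, w3, b):
    # Linear regression: each input is multiplied by its weight, plus a bias.
    return w1 * x + w2 * y + w3 * z + b

# Made-up weights purely for illustration (weight=150, height=160, parent height=175).
print(predict_height(x=150, y=160, z=175, w1=0.5, w2=0.25, w3=0.5, b=10.0))  # 212.5
```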
You can think of w1, w2, and w3 as weights that factor in how important each input is: how important is x for actually predicting how tall someone's going to be? If w1 ends up being a bigger number at the end of training, the model is basically saying, hey, x is actually pretty important for figuring out how tall someone's going to be. Same for w2 with y, and for w3 with z. And b is an extra additive factor the model might sometimes need. For example, even after you factor in w1, w2, and w3, for every person there's some base height — no one is ever going to have a height of zero — so there has to be some base: b would be some number greater than zero that we always add in. The model learns the value of b over training as well. And this is all linear regression is.
Now let's go over the pseudo code for how this would actually work. Say we had some function for training a linear regression model and actually learning what w1, w2, w3, and b are. This is what it would look like. We loop for some number of iterations, decided beforehand — generally, the more iterations you do, the better your model gets, with some exceptions we can talk about later. At every iteration, based on your current w1, w2, w3, and b, you call something called get model prediction, which I'll leave as a black box for now. What this subroutine should do is: for every example in your data set — say you have n training examples, n people for whom you have those three pieces of information and the corresponding label, what their ultimate height ended up being — figure out what the model's current prediction is, based on your current weights and bias. Hopefully this prediction gets better as we do more iterations.

Then we do something called get error. If we have the model's prediction and the actual right answer, we should be able to get an estimate of the model's current error — and the hope is that the model has as low an error, or loss as it's sometimes called, as possible. The next step is something called get derivatives, and here's where it's really important to be familiar with gradient descent, which I have a video about on this channel — that's how we actually optimize our models to minimize that error over some number of iterations. This function is going to do a little bit of ugly math — don't worry, you rarely ever have to do it by hand as a machine learning engineer or data scientist, or even for side projects — some calculus, calculating some derivatives. And the final step is to update our weights: based on those derivatives we calculated, and the error of course, we change w1, w2, w3, and b to hopefully get a little better. As we do this over some number of iterations, the hope is that the error gets very close to zero. So that's the rough idea of linear regression — just one small thing left to explain.
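The loop described above can be sketched as a runnable toy. To be clear, this is my simplified sketch, not the course's actual code: the four subroutine names follow the video, but the signatures are made up, there's no bias term, and get_derivatives uses the standard mean-squared-error gradient:

```python
import numpy as np

def get_model_prediction(X, w):
    # X is N x 3; w holds [w1, w2, w3]. One prediction per row of X.
    return X @ w

def get_error(prediction, ground_truth):
    # Mean squared error over the N examples.
    return np.mean((prediction - ground_truth) ** 2)

def get_derivatives(X, prediction, ground_truth):
    # Gradient of the mean squared error with respect to the weights.
    n = len(ground_truth)
    return (2.0 / n) * X.T @ (prediction - ground_truth)

def train(X, ground_truth, iterations=1000, alpha=0.01):
    w = np.zeros(X.shape[1])       # start from some initial weights
    for _ in range(iterations):
        prediction = get_model_prediction(X, w)
        d = get_derivatives(X, prediction, ground_truth)
        w = w - alpha * d          # gradient descent update
    return w
```

On a toy data set generated from known weights, this loop recovers values close to them as the error approaches zero.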
When I said we'd be calling some function called get error at every iteration, I wanted to clarify what error we're using. We use something called mean squared error. The formula looks kind of ugly, but here's exactly what it's doing: for every one of our n training examples, we take the difference between the model's prediction and the true answer from our data set, square it, add them all up across every example, and divide by n — so we're just taking the average.

There are two things to understand here. First, why add them up and divide by n? If we take the average error, our error isn't dependent on the number of examples in the data set; it's a rough gauge of how poorly our model is doing at the moment — or, as it goes down, how well it's doing. Second, why are we squaring the error for every example? Why not do something simpler, like taking the absolute value of prediction i minus truth i for every example, and dividing by n? The absolute value of that difference would also give us a gauge of the model's error. The reason we don't is that, as you might remember from calculus, the absolute value function has some issues with its derivative. It's not super important to get deep into that for understanding machine learning, but that's essentially why we square instead: squaring gives us the same idea — whether the error was negative or positive, whether a prediction was less than or greater than the truth, all we really care about is the difference — so squaring gets rid of the negative, and it turns out to work better in practice than absolute value.
One final thing before you basically have all the information you need to code this up. When we actually implement this in code, we use vectors and matrices to do the multiplications and additions we talked about earlier. Say we have our weights in one vector, and the data point for one person — their x, y, and z — in another. If you take a dot product, which just means computing x * w1 + y * w2 + z * w3, that gives us exactly the model's prediction.

Now say we had a data set of a whole bunch of people, n people. That's an N by 3 matrix: the first row holds x1, y1, z1 — x, y, and z for person number one; the next row holds the second person's x2, y2, and z2; and it keeps going for all n people, all the way to the nth row (using one-indexing), xn, yn, zn. If you multiply that by a 3 by 1 — three rows, one column: w1, w2, w3 — what's the result? It's an N by 1: n rows, one column, holding the model's prediction for every single one of our n people. That's an efficient way to do it for n people without a loop — just one matrix multiplication. And it turns out a bunch of libraries have really efficient matrix multiplication algorithms, which is why we like to vectorize and put things into matrices whenever possible in machine learning. If you remember how to do matrix multiplication by hand — it's not super important — it's this row dot producted with this column, giving the first entry of the output vector; then the next person's row against the same column, giving the next entry; and so on. Doing this for all the people is exactly like calling get model prediction for each and every person, except a lot more efficient. And that's basically all you need to know for at least the forward pass of linear regression. So feel free to try out the code.
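That equivalence — one matrix multiplication versus a loop of per-person dot products — is easy to check. The numbers below are arbitrary:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],    # person 1: x1, y1, z1
              [4.0, 5.0, 6.0]])   # person 2: x2, y2, z2
w = np.array([0.5, 1.0, 2.0])     # w1, w2, w3 (arbitrary values)

# One matrix multiplication: (N x 3) @ (3,) -> N predictions at once.
vectorized = X @ w

# The same thing as a loop of dot products, one person at a time.
looped = np.array([np.dot(row, w) for row in X])

print(vectorized)                 # predictions: 8.5 for person 1, 19.0 for person 2
assert np.allclose(vectorized, looped)
```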
Now, in this problem, we're going to solve the forward pass of linear regression, which is made up of two different functions, or subroutines. In the next problem, we'll actually do the backward pass and train our linear regression model. I'd highly recommend checking out the video with the complete explanation of linear regression — it's linked on the problem, and it covers all the background you need; no prior knowledge is required to follow that video except gradient descent. But I'll give some background in this video too.

In this problem we have to implement two functions: get model prediction and get error. These are two core subroutines that compose the process of training a linear regression model. And when I say training a linear regression model, what that really means in this case is figuring out what w1, w2, and w3 should be so that our model does pretty well on our training data — and so that when we pass in a new data point the model has never seen before, its prediction is hopefully pretty close to the actual answer. So these are the two functions we have to implement, and they're a core part of the training loop. Specifically, in the training loop, for every iteration we would call get model prediction; then we'd want to calculate the error, or loss, so we'd call get error — and maybe every 100 iterations or so we'd actually print that error, though that's not super important for this problem. Then we'd call some subroutine like get derivatives, which calculates the derivatives we need to update our weights and perform gradient descent. Don't worry, you don't have to do those derivatives by hand as a machine learning engineer or data scientist, but it's good to know that get derivatives is being called under the hood by whatever library you're using. And lastly, you'd update your weights — your w's — and hopefully, with the next iteration, your guess is a bit better. So let's go back to the two functions we have to focus on for this problem: get model prediction and get error.
So, let's look at the two inputs to get model prediction. The first input is X. This is just the data set that the model will use to predict the output. We can think of this model as predicting, say, the price of Uber rides, and say there are three things that affect the price of a ride. The first might be the time of day. The next might be the total distance of the ride the driver has to take you, say in miles, because that affects how much gas they're using. And then maybe the duration of the trip, in minutes or hours — if there's more traffic, the duration doesn't necessarily match up with the distance. So maybe these are our three input attributes, and they're stored in our data set X. We can see that len(X) is N, and that every row has three columns: X is an N by 3 array where every row holds the three attributes for that data point.

The other input is simply another array, of size three, holding our initial (or current) weights for the model: w1 is how much we're weighing time of day in the model's prediction, w2 is how much we're weighing the distance, and w3 is how much we're weighing the duration. If we remember what the linear regression formula looks like, it's something like the following: calling the attributes x, y, and z, price as a function of (x, y, z) is w1 * x + w2 * y + w3 * z. Our goal is to figure out what w1, w2, and w3 should be — and here, we're implementing the forward pass of this model.

The other function we have to code up is get error. It takes in the model's prediction for each training example — if we have n training examples, the size of that array should be n: n different numbers the model spit out. And we have the corresponding labeled answers from our data set, stored in another array of the same size, called ground truth. So why don't we get into how we would actually code up get model prediction.
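Before the walkthrough, here is a compact NumPy sketch of the two subroutines. The signatures are my assumptions — the actual problem template may differ in names, rounding, or dtypes:

```python
import numpy as np

def get_model_prediction(X, weights):
    # X is an N x 3 array; weights holds [w1, w2, w3].
    # One matrix multiplication gives all N predictions at once; squeeze
    # flattens the result in case weights is passed as a 3 x 1 column.
    return np.squeeze(np.matmul(X, weights))

def get_error(model_prediction, ground_truth):
    # Mean squared error: average of squared differences over the N examples.
    return np.mean(np.square(model_prediction - ground_truth))
```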
Okay. So for get_model_prediction, we are given the data set X, an N by 3 array, as well as weights, which is just of size three. You can think of that as 3 by 1 if you want. And we know from linear regression that the function we're following is price(x, y, z) = W1 * x + W2 * y + W3 * z, with our three input parameters. We could actually think of this as a dot product. If you don't remember what dot products are, it's not a huge deal; it's essentially just two vectors multiplied against each other. So let's say in the first vector we had x, y, z, the three input attributes for some example, and then we wrote W1, W2, W3. The dot product from linear algebra is just a way of multiplying these two vectors: multiply x by W1, y by W2, z by W3, and add them up, so that we end up with exactly that formula.

So there's actually a way that we could do get_model_prediction for all n people in one giant matrix multiplication, one giant dot product, instead of having to iterate over all n people in a for loop and do this dot product one at a time. All we have to do is notice the format in which X is given, because we have the input information for each person. In the first row we have x1, y1, z1: the x, y, and z for the first person. For the second person, we have their information x2, y2, and z2. And if we go all the way down to the last row, we have xn, yn, and zn. Let's say we just multiply this by a 3 by 1, because X is an n by 3 matrix, and over here our weights W1, W2, W3 form a 3 by 1. And if you remember the rules of matrix multiplication (again, it's okay if you don't; this is probably one of the only times we're going to get into the nitty-gritty details of the matrices), the threes match up and cancel out, and the output is an n by 1. And an n by 1 gives us a number for each of our n people, which is the model prediction we wanted.
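To make those shapes concrete, here is a minimal sketch with made-up numbers (the data values are illustrative, not from the problem):

```python
import numpy as np

# Each row is one data point: (time_of_day, distance, duration) -- made-up values.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 1.0],
])                                   # shape (n, 3)
weights = np.array([1.0, 1.0, 1.0])  # W1, W2, W3 -- shape (3,)

# One matrix multiplication computes W1*x + W2*y + W3*z for every row at once.
predictions = np.matmul(X, weights)  # shape (n,)
print(predictions)  # [6. 3.]
```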
And if we look into how this matrix multiplication is done, it's every row of the first thing times every column of the second thing. So if we multiply this row by this column, then for the first entry in the output vector, that gives us exactly what we want. Then if we do the second row times the weight column vector, again, that also gives us exactly what we want. So for get_model_prediction, all we have to do is call numpy's matmul, and this will take in our two relevant matrices.

For get_error, we have two inputs to this function, prediction and truth, and both of these are essentially vectors of size n. If we look at the prediction for each of our n data points, it's some vector with n numbers in it, and truth is the same size, with n numbers in it as well. And the indices match up: the prediction for our zeroth training example is here, and the true value for our zeroth training example is here. We're going to use the mean squared error function. If we just write that out, mean squared error is a sum over all of our data points: (1/n) * sum over all i of (prediction_i - truth_i)^2. So we consider all n data points, take the prediction for the i-th one, subtract out the truth for the i-th one to get the difference between them, but don't just take the normal difference. Instead, we take the squared difference, and then divide the whole thing by n. So it's like you're averaging the square of the error for every example. And in the complete explanation of linear regression video, we go over why we use this error function specifically for linear regression. Maybe you want to take the absolute value; it turns out that doesn't work as well. Why are we squaring it? Why not raise it to the third power, or the fourth power? We'll explain why we use the mean squared error for linear regression in that video. So now that we know which error function to use, how are we going to actually implement this?
So it turns out that when we have our data stored in numpy arrays, which is what we have for this problem, we get to take advantage of a lot of functions in the numpy library. These functions are a lot faster than normal Python functions, because they call highly optimized C code, which is generally much faster than Python. We also get to take advantage of parallel computation whenever possible, even without a GPU; this can be done on a CPU as well in some cases. Overall, you want to use the numpy function for whatever you're trying to do instead of manually implementing it yourself. An example of that: if we just wanted to get the difference between every value in both of these arrays, instead of iterating over each index and subtracting, you can just do the first variable minus the second variable, and this will return the vector which has all the differences we want. This is actually going to be much faster if you were to check the runtime using the system clock. We also really want to take advantage of np.mean, which will get us the mean of a vector by summing the elements and dividing by n, as well as np.square, which will be useful for taking in a vector and returning a vector of the same size back, but with every element squared. So the main takeaway here is that mean squared error, once you understand what it's doing, is a pretty simple equation. But moving forward, you always want to use these data science or machine learning libraries, whether that's NumPy or PyTorch. You want to use their versions of these very simple operations because it will make a huge difference in the runtime of your code.
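As a quick sketch of those vectorized operations (the numbers are made up):

```python
import numpy as np

prediction = np.array([6.5, 2.0, 3.0])  # made-up model outputs
truth = np.array([6.0, 3.0, 3.0])       # made-up ground-truth labels

diff = prediction - truth   # elementwise difference, no Python for loop
squared = np.square(diff)   # same-size vector with every element squared
mse = np.mean(squared)      # sum of the elements divided by n
print(mse)  # ~0.41667
```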
Now, let's jump into the code. As we talked about earlier, get_model_prediction is really easy. All we're doing is calling the matrix multiplication function from the numpy library. So all we really need to do is say something like res = np.matmul. There are other functions you could use that do the same thing, but I think matmul is the most accurately named. You would just pass in X and then weights. And we actually want to round our answer to five decimal places; specifically, every element in this output vector should be rounded. So we can do np.round of res and then five, and this is going to be better than, say, iterating over the entire array and rounding each index.

Then for get_error, we know that we want to get the difference, and we don't want to iterate. So, just to make this super explicit, we might say that diff is model_prediction minus ground_truth. Then we know we can use np.square instead of iterating over every index, so squared should just be np.square of diff. The size of the array hasn't changed, but now we have the squared value at every index. And now all we want is the mean, and this is going to be faster than doing a for loop, summing up every element in squared, and then dividing by n; avoiding that for loop is where numpy operations confer a runtime advantage. So you can just say np.mean of squared, and this is essentially just the average. That's a single number, so we can just use the normal Python round function: return round of average to five. And we can see that it works. Two test cases for each of these is enough to verify that you've done the right thing, since this is a bit mathematically involved and it's really unlikely you would get the answer right with an incorrect implementation.
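Putting the two functions together, here is a minimal sketch of the solution described above (the class and method names follow the transcript and may differ slightly from the site's template):

```python
import numpy as np

class Solution:
    def get_model_prediction(self, X, weights):
        # (n x 3) @ (3,) -> one prediction per data point, rounded elementwise
        res = np.matmul(X, weights)
        return np.round(res, 5)

    def get_error(self, model_prediction, ground_truth):
        diff = model_prediction - ground_truth  # vectorized difference
        squared = np.square(diff)               # square every element
        average = np.mean(squared)              # mean squared error
        return round(average, 5)

s = Solution()
print(s.get_model_prediction(np.array([[1.0, 2.0, 3.0]]), np.array([1.0, 1.0, 1.0])))  # [6.]
print(s.get_error(np.array([1.0, 3.0]), np.array([1.0, 1.0])))  # 2.0
```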
That was the forward pass of linear regression. Next, we are going to train the model. And once we're done with that, we'll finally be ready to explain and code up neural networks. You'll actually see that these are just linear regression models stacked up together; that's why it was so important to understand linear regression. So definitely check out the next problem on training our model.

In this problem, we're going to solve linear regression training. Our job is to implement the train_model function, and we're given two functions, as it says in the problem description. We're given the get_model_prediction function for this model, which is supposed to be called at every iteration of the training loop. We are also given get_derivative, and the reason we're given get_derivative is to perform an algorithm called gradient descent. If you're not familiar with gradient descent, I would recommend checking out the easy problem in this problem list; there is a corresponding video solution that requires zero background knowledge about machine learning at all. The reason we need the get_derivative function is to update the weights of our model using the learning rate at every iteration of training.

Let's go ahead and check out the inputs we're given. The first thing is X. X is our data set for training the model. If we check out the shape, it will have length n, where n is the number of data points in our data set, and each data point has length three. We can see that over here. Checking out the example, we can see that we have two data points, and each of those two data points has three attributes: we have length three for this sublist and length three for this sublist. So that makes sense given the problem description and the constraints. Then we can check out y. y is supposed to be the correct answers for our data set, and we should have a label for every single one of our data points, so it makes sense that y's length is also n. Six is the answer for the first data point, and three is the answer for the second data point. We are then given the number of iterations to train for; essentially, this is the number of iterations to run gradient descent for.
We can already tell that whatever the solution is for this problem, it's going to involve some sort of significant for loop; that's the main part of the solution. Lastly, we have the initial weights for the model. Before any training is done, you can still get predictions from a model based on its current weights, and these are typically chosen completely at random in machine learning. We have the initial W1, W2, and W3 given here as 0.2, 0.1, and 0.6 respectively. And since they're given as input to this problem, we can start to develop the intuition that the initial weights do affect your final weights. Speaking of the final weights, that's actually what we have to return: the final weights after training, in the form of a numpy array with dimension three.

One last thing, to make some sense of this test case: if we check out the sample data points in X, we can actually guess what the weights should be. We have 1, 2, 3 corresponding to six, and then we have 1, 1, 1 corresponding to three. So we can guess that W1 = 1, W2 = 1, and W3 = 1; this is just adding up the three input attributes. 1 + 2 + 3 gets us six, and 1 + 1 + 1 would get us three. So it makes sense that after 10 iterations, our initial weights ended up much closer to the final answer: 0.5 is closer to one than 0.2, 0.59 is closer to one than 0.1, and 1.27 is closer to one than 0.6. This also continues to reinforce the intuition that gradient descent is just an approximation algorithm. We saw that for the first two weights, we didn't overshoot; we increased, but we were still less than one at 0.5 and 0.59. But for W3, we actually completely overshot and went all the way up to 1.27, even though we know W3 should probably be one. So we actually have to perform this for more iterations. The number of iterations you have to run gradient descent for totally depends on the use case: in some cases a thousand might be enough, in some cases you might need 100,000 iterations. But we keep developing this intuition that gradient descent is just an approximation algorithm, and the number of iterations you run it for will affect the performance of your model. So now let's jump into the explanation.
Now let's talk about the update rule. Gradient descent updates the weights of our model at every iteration, and the update rule, just as a reminder, is that the new W should equal the old W minus the derivative for that specific weight times alpha, our learning rate, which will always be given. So for W1: new W1 = old W1 - alpha * (derivative for W1). For an explanation of why we're using this equation, I would highly recommend checking out the gradient descent solution video. But if this is our update rule, then it's pretty clear that since we're given alpha and we're given the initial W's (that's actually an input to this problem), our main task at every iteration of this loop, the heavy lifting, is just to get the derivatives for each W. And because we have three weights for this problem, W1, W2, and W3, we're going to need to call the get_derivative function three times at every iteration of the algorithm so that we can get the individual derivative for each weight.

And if we check out the get_derivative function and what it takes in, two of the things it takes in are the current model prediction, which needs to be passed in, and our desired W, which might be W1, W2, or W3. We are also given a function called get_model_prediction. So if we need the model prediction to call get_derivative, then it makes sense that at every iteration, the first step should be to call get_model_prediction and store that somewhere. Then we're going to pass that into the get_derivative function, calling it three different times depending on our desired W. Lastly, we use the update rule to update our weights, and we repeat this for the number of iterations specified by the problem; then our training is complete. Okay, so now let's jump into the code. We know that this whole algorithm of training
is one giant loop. So we'll say for i in range(num_iterations), and we know that to update our weights and get the derivatives, we need our model prediction at every iteration based on our current weights, which I'll keep updating in the variable initial_weights. So we can do self.get_model_prediction and pass in X as well as whatever our current weights are. Then we're going to need to grab our three different derivatives. So we'll say that d1, the derivative for W1, is self.get_derivative, and let's pass in the model prediction. Let's pass in the ground truth, which is just the answers or labels for our data set, given in y. Let's pass in the length of X, as that is n, the number of data points. Let's pass in X. And then for our desired weight, we want the derivative for W1, but we can see that get_derivative is using zero indexing, so we'll actually need to pass in zero here. That is how we calculate d1, and we can do the same for d2 and d3.

Now we can actually update our weights based on the learning rate. So we can say initial_weights at index zero: we subtract out d1 times our learning rate, so that's self.learning_rate. And if you're wondering why the learning rate is not given as an input to this function but instead as a class-level hyperparameter, that is just a convention when training machine learning models, especially as we move into PyTorch. Then we can say initial_weights at index one, subtract out d2 times self.learning_rate, and finally the same for W3: at the second index, subtract d3 * self.learning_rate. And that's it. We just have to return the rounded version of our answer: np.round of initial_weights to five decimal places, and we're done. We can see that the code works, and one test case is enough to verify that your code didn't get the answer right by a fluke, given that there is some mathematical complexity going on here.
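The whole training loop described above can be sketched like this. On the site, get_derivative is given to you; the body shown here is my own assumption of the mean-squared-error derivative, and the learning rate value is illustrative:

```python
import numpy as np

class Solution:
    learning_rate = 0.01  # illustrative value; by convention a class-level hyperparameter

    def get_model_prediction(self, X, weights):
        return np.matmul(X, weights)

    # Assumed implementation: derivative of mean squared error w.r.t. one weight.
    def get_derivative(self, model_prediction, ground_truth, N, X, desired_weight):
        return 2 * np.dot(model_prediction - ground_truth, X[:, desired_weight]) / N

    def train_model(self, X, Y, num_iterations, initial_weights):
        for _ in range(num_iterations):
            pred = self.get_model_prediction(X, initial_weights)
            # One derivative per weight; get_derivative is zero-indexed.
            d1 = self.get_derivative(pred, Y, len(X), X, 0)
            d2 = self.get_derivative(pred, Y, len(X), X, 1)
            d3 = self.get_derivative(pred, Y, len(X), X, 2)
            # Gradient descent update rule: new_w = old_w - derivative * learning_rate
            initial_weights[0] -= d1 * self.learning_rate
            initial_weights[1] -= d2 * self.learning_rate
            initial_weights[2] -= d3 * self.learning_rate
        return np.round(initial_weights, 5)

s = Solution()
X = np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])
Y = np.array([6.0, 3.0])
print(s.train_model(X, Y, 10, np.array([0.2, 0.1, 0.6])))
```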
So now that we've got the basics down, the next problems will focus more on PyTorch and on training and explaining neural networks.

Hey, so in this video we're going to finally explain neural networks. This is definitely a buzzword that gets tossed around a lot in AI and ML, but they are really important.
They're probably the most powerful form of machine learning to date. They are what ChatGPT uses. They are used in self-driving, deepfakes, and image generators like DALL-E. Neural networks are a really important concept to understand. And I think a lot of resources online often really overcomplicate neural networks, because you've probably seen a drawing like this before: three nodes in the first column, then a bunch more over here, and then maybe a bunch more over here, and the idea is that they're all connected to each other like this. That actually took a while to draw out, but the idea is that all the nodes in every column are connected to all the nodes in the previous column, or layer, as they call it. But this is honestly an overly complicated diagram to understand. So let's start from a simpler example that we can still understand neural networks from. In the first layer, I'll draw three nodes, and we'll just do two over here. All the concepts are still going to apply even though this is a simpler neural network. So we'll just finish drawing this out. And essentially, what neural networks are is basically multiple instances of
linear regression. We've talked about linear regression before, and the main important concept behind linear regression is that we have some sort of output number, some sort of function. In one of our previous examples, we talked about predicting the price of an Uber based on the time of day, the distance in miles, and the duration. So maybe the price as a function of x, y, z is equal to W1 * x + W2 * y + W3 * z. And then we also might learn an additional bias or constant term, which we call B. When we talk about training this linear regression model, it's just about learning the W1, W2, W3, and B that make this function work pretty well for new data points it's never seen before.

So this is actually all a neural network does too, except we have way more than just W1, W2, and W3. The first layer of a neural network, the leftmost layer, is called the input layer, and each node in that layer will have a number which corresponds to an input attribute. So let's say the distance goes here, the trip duration goes here, and the time-of-day number goes here. Each of these neurons will have a number associated with it: there will be a number in this neuron, a number in this neuron, and a number in this neuron. And what does each node in the following layer do, this node here and this node here? Since each of those nodes is fully connected to each of the input attributes, each node is actually just doing its own linear regression. So this node right here will have its own W1, W2, W3, and B that has to be learned through training, which is just gradient descent: taking derivatives, and then using the derivatives and the learning rate to update the weights.
This node over here, completely independently of that neuron, will do its own linear regression. So this node will also have its own W1, W2, W3, and bias. Well, the point of the neural network was to use it to make a prediction; that's what all these AI models are about. But then you'll notice that we would have an output number in this neuron and an output number in this neuron. So how would I actually use this neural network to get a meaningful answer, if I wanted to use it as a way to predict the price of Ubers? Well, we might do something like average these two numbers: average the number in this neuron and the number in this neuron. So that might look something like this. That's all a neural network is. We take those two numbers in this output layer, a number in this node and a number in this node, and we just send them into some average function, and then the number that gets output by this function is our final prediction by the neural network. Then, based on that number and whatever the correct answer was for this data point, which consisted of three features or attributes, we calculate our loss or error. And then we can do something called backpropagation. You may have heard this term before; it's okay if you haven't. Backpropagation is just that process of calculating the derivatives, and then we're going to use those derivatives to update our weights through gradient descent. So this is actually our first neural network. So
then how does that tie into this ugly-looking diagram? Well, the same thing goes for the first input layer of the neural network: for whatever data set this neural network is being used on, there must be three input attributes, because we have three nodes in that first layer. Then each of these four neurons will learn a W1, a W2, a W3, and a B. The B, or bias, the constant term, is optional, but each neuron will learn what those parameters are through gradient descent. Then we'll notice that if each of these four nodes is doing its own linear regression, each of them is outputting a number. But each of these nodes in our final layer, the output layer, all five of them, is connected to each of those nodes in the second layer. What that means is that all the nodes in the final layer are doing their own linear regression based on how many features or attributes are in the previous layer, and we have four nodes in that second layer. So that means this node over here is going to have to learn a W1, a W2, a W3, a W4, and an optional constant term or bias. And this one will have its own four weights and a bias, and this one, and this one, and this one. And maybe for whatever data set this neural network is for, we're actually going to predict 1, 2, 3, 4, five output numbers. I don't know what data set this neural network is for, but we can certainly imagine a case where some model has to predict five different things. But if we only wanted to predict one thing, we would then maybe send this into some other function, like the average function, to get some sort of estimate of whatever number we're looking for. And that's all a neural network is: it's just linear regression stacked up vertically in each
layer. Okay. So, the only part really left to explain is how this would work in code. How would we implement this? For our input layer, we could represent our three attributes as a vector: the attributes x, y, and z for a single data point could be represented in a vector like this. And we know that each node in that second layer has three weights, because each of those nodes is fully connected to each of our input attributes. So if we have three weights for each of those three neurons, we need a 3 by 3 matrix here to encapsulate all those weights in a compact way. So let's draw that 3 by 3 matrix. The last thing we need to remember is just a fact from linear algebra; if you need a quick refresher, the way we do matrix multiplication is the row of the first thing times the column of the second thing. And we know that this is ultimately going to give us three numbers, because what we have here on the left has one row and three columns, so it's a 1 by 3. Multiplying that by something that's a 3 by 3, the threes cancel out and we're left with a 1 by 3. And that makes sense: we have three numbers, which is essentially the output number for each of those neurons. People sometimes call that the activation of the neuron, though that's not a super important term. So in the first column of this 3 by 3 matrix, we would have W1 for the first neuron; we'll call that W1,1. Then here we would have W2, still for the first neuron, and W3, still for the first neuron. Here we would have W1 for the second neuron, the middle neuron in the second layer, then W2 for the second neuron, and finally W3 for the second neuron. So in our code, we're going to have to maintain the state of this matrix. We're going to need to remember what all these weights
are. So we're going to need to maintain that matrix, because when we do get_model_prediction and we're doing all these matrix multiplications, we're going to need the state of this matrix. And since we're going to calculate a bunch of derivatives with respect to each of these weights in order to update them, we're also going to need the matrix for that. There's one class in a library called PyTorch, which we're definitely going to have a whole separate video on, and that class is called nn.Linear. It will actually keep track of the matrix under the hood for us, as well as all the derivatives. The only things we have to pass into the constructor of this class when we're making an instance of it are essentially the dimensions of this matrix. This class takes in something called in_features and something called out_features. The in_features for a given linear layer is just the number of features in the previous layer; we can see that in the previous layer we had three nodes, three features, so that's what we would specify there. The out_features is just the number of individual instances of linear regression: how many nodes or neurons you have in the layer that this object we're making is for. And here we can see we also have three neurons there, so out_features would be three. So to do this in PyTorch, all we would do is make an instance of this class nn.Linear and pass in three for in_features and three for out_features, and that would be our simple two-layer neural network that we have right here. One
final concept we have to talk about before our crash course in neural networks is finished and that is something that is completely orthogonal or different than linear regression and
it's it's literally the opposite. It's
something called a nonlinearity a nonlinearity and the most popular one is something called the sigmoid function which has this symbol and if you were to graph that function it essentially looks
something like this. So we have our x and y axis here and for every x input right like negative infinity is here positive infinity is here. The outputs
of this function are always between zero and one. So let's see on the y- axis we
and one. So let's see on the y- axis we have one over here and it looks something like this. It's kind of like that J-shaped curve. It's kind of asmtoically approaching one here and
asmtoically approaching zero over here.
This is kind of a terrible drawing over here but pretend that it's going down.
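As a quick sanity check of that drawing, here's a minimal sketch in PyTorch (torch.sigmoid is one of a few equivalent ways to compute it; the input values are just illustrative):

```python
import torch

# Sigmoid squashes any real number into the range (0, 1).
x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])
y = torch.sigmoid(x)

print(y)
# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of exactly 0 maps to 0.5.
```

Printing y shows the outputs climbing from near zero toward one as x grows, exactly the S-shaped curve described above.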
So the outputs of this function are always between zero and one, and it has this nonlinear nature to it. And it can actually be mathematically proven that when we have those neural networks we talked about previously, where all we have is these linear matrix multiplications, these linear connections over here, there's only so complex a relationship that they can learn. There comes a point where the neural network isn't powerful enough anymore to capture and learn really complex relationships.
For example, for a neural network to understand negation in speech, if you were inputting sentences into it. So we have to add these nonlinearities to the neural network to make it more expressive and to let us apply neural networks to more and more problems, and one way to do that is something called a sigmoid layer. There is a class called nn.Sigmoid which you can make an instance of in PyTorch; that's how we would do it in code.
And just to be clear about how we would do that in a diagram, it's literally right here. You have your neural network, right? There are two layers: the input layer, and one hidden layer, as they call it, with these three neurons right here: this neuron, this neuron, and this neuron. So what we would actually do is draw the sigmoid symbol over here and put it in its own little circle. Then we connect each of the previous layer's nodes to this sigmoid, because each of those three neurons in that second layer has a number, an output (people call it an activation) associated with it, right? And for each of those, we pass it into the sigmoid function to get something that's between zero and one; the higher the number, the closer it'll be to one.
So, scrolling back down here, the main thing we really need to talk about now is why you would apply the sigmoid function to a neural network. Not only does it allow the network to learn a more complex relationship, it also allows you to apply ML and neural networks to more kinds of problems. Previously, we've just been talking about regression problems, where the output of the neural network is some number, like the price of an Uber or how tall someone's going to be, a number that exists on a spectrum with no fixed set of values. But what if we are doing something called classification? Let's say we need to build a neural network to predict whether or not someone will develop diabetes. You either develop diabetes or you don't; there are two classes. And maybe our neural network actually needs to output a probability that the patient inputted to the network will develop diabetes. If our network is going to output a probability, it needs to output something between zero and one. Well, the sigmoid function is perfect for that, right? Because we can stick a sigmoid function at the end of our neural network right here, and let's say that before we pass into the sigmoid function, we average this number, this number, and this number.
Then we pass that number into the sigmoid function. Well, our neural network is now outputting one single number, right? And after we've done enough iterations of gradient descent and training for these weights to actually make sense, for them to not just be random numbers, we can interpret that output number as the probability that the input person is going to develop diabetes. So now we're actually able to do classification problems, right? We can classify an input into, say, two classes, or three classes once we talk about more complex functions than just the sigmoid. The sigmoid function would just be used for binary classification, because the output is just between zero and one, but we might be doing ternary classification if we wanted to build a neural network that can classify an image as, say, a dog, a bird, or a human. But this is just one example of how we can extend neural networks by adding this nonlinearity over here: using it for binary classification, for, say, diabetes prediction. And the three features of the patient, like three markers of their health, three facts about their blood work, might be one of them over here, one of them over here, and one of them over here. And of course, one last clarification: we can have as many features as we want in the input layer over here. We can have way more than three nodes over there if we need to; we just have three for this case.
So that's an introduction to neural networks. Before you start coding neural networks up, it's important to be familiar with the basics, so let's go over a few multiple choice quiz questions. Then at the end, there are even more practice problems that you can try. And if you need a quick refresher on the basics of neural networks before we go through the quiz, check out the second link in the description. Let's get started.
Question one. You may have heard that GPT-3 has over a hundred billion parameters across its many sub-networks. How many parameters does this simple model have?
Remember, a parameter is either a weight or a bias. If you want to try calculating the answer on your own, I recommend pausing here. Okay, here's the explanation. In this equation, there are three weights and one bias, or four parameters. And let's go layer by layer. The first layer is the input layer, which doesn't contain any parameters; each node simply contains one of the input attributes X1, X2, and X3. The next layer is where the parameters begin. Since each hidden node is connected to each of the three input nodes, each hidden node uses this equation to calculate a number Y. That's four parameters per equation, and we have four nodes, so that gives us 16 parameters in this layer. Finally, the output layer. Each output node is connected to each of the four hidden nodes, so each output node uses this equation to calculate a number O. There are five parameters in this equation, and we have two nodes, so that gives us 10 more parameters. This neural network has 26 parameters in total.
Question two. Let's say we want to create a neural network that could predict the probability that the next word in a live stream should be censored. Also, let's say that the network should factor in the previous three words. Here's option one. Here's option two. Here's option three. And here's option four. The answer is option four. The internals of the network actually don't matter for this question. We want a network with three input nodes, one for each word, and a single output node, since the network outputs one final number in the form of a probability.
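Here's a sketch of what such a censoring network could look like in PyTorch. The hidden size of 8 is an arbitrary choice of mine; the question only fixes three inputs and one output:

```python
import torch
import torch.nn as nn

# Three inputs (one number per previous word), one output probability.
model = nn.Sequential(
    nn.Linear(3, 8),   # hidden layer: arbitrary size, here 8
    nn.Sigmoid(),
    nn.Linear(8, 1),
    nn.Sigmoid(),      # final sigmoid so the output lands in (0, 1)
)

words = torch.randn(1, 3)   # a batch of one data point
prob = model(words)
print(prob.shape)           # torch.Size([1, 1])
```

The internals could be anything; the fixed parts are the three input features and the single sigmoid-squashed output.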
Question three. Let's say we want to store the weights of a layer inside a matrix, and let's consider this equation. The input data point can be represented by this vector, and the weight matrix can be represented by this matrix. The product of the weights and the input vector, plus the bias (which could simply be stored in a separate variable), gives us the output we want. So how many matrices, and of what shapes, would be needed to store all the weights for this model? And let's not worry about biases. Option one: a 4x3 matrix and a 2x4 matrix. Option two: a 4x2 matrix and a 4x4 matrix. Option three: a 3x2 matrix and a 2x4 matrix. Option four: a 3x3 matrix and a 4x4 matrix. If you want to try calculating the answer on your own, I recommend pausing here.
Okay, we'll go layer by layer. Again, the input layer doesn't contain any weights, so we can move on to the hidden layer. The hidden layer has four nodes, and each node stores a W1, W2, and W3; that's three weights. So our first matrix should be a 4x3 matrix. Assuming the input vector is a 3x1 vector, this allows the matrix multiplication to work out, and we would end up with a 4x1 vector, which is exactly what we want. Each entry in that vector corresponds to a y-value in a hidden node. In terms of the actual multiplication, we would multiply this row by this column to get this value, then this row by this column to get this value, and so on. Finally, the output layer. We have two output nodes, and each stores four weights, W1 through W4, so we would need a 2x4 matrix. That means that option one is the correct answer.
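We can double-check both those shapes and the 26-parameter count from question one with nn.Linear, which stores its weight matrix with shape (out_features, in_features):

```python
import torch.nn as nn

hidden = nn.Linear(3, 4)   # 3 inputs -> 4 hidden nodes
output = nn.Linear(4, 2)   # 4 hidden nodes -> 2 outputs

print(hidden.weight.shape)  # torch.Size([4, 3])
print(output.weight.shape)  # torch.Size([2, 4])

# Weights plus biases: (4*3 + 4) + (2*4 + 2) = 26 parameters.
total = sum(p.numel()
            for p in list(hidden.parameters()) + list(output.parameters()))
print(total)  # 26
```

So the 4x3 and 2x4 matrices from option one fall straight out of the layer definitions, and counting biases too recovers the 26 parameters from question one.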
If you have any questions about this concept or are interested in a short video covering the math you need for ML, be sure to leave a comment since it's definitely important to understand.
Question four. Between layers, we often plug values from the neural network into nonlinearities like the sigmoid. Its outputs are always between 0 and 1, and the greater the input, the closer the output is to one. There are multiple benefits to using the sigmoid and other nonlinearities like the tanh function. But let's say we had a neural network for predicting whether someone develops diabetes, and we added the sigmoid after the final layer. What would be the purpose of this? Option one: perform regression instead of classification. Option two: ensure the model outputs a probability. Option three: simplify derivatives for gradient descent. And option four: improve runtime, since sigmoid calculations are actually easier on GPUs. The answer is option two. Since the output is between 0 and 1, we can call it a probability. Option one is incorrect: we want to classify every input person as diabetic or non-diabetic, not perform regression. An example of regression would be a model that predicts how tall someone will be after they're finished growing. Option three is incorrect as well; this doesn't simplify derivatives for gradient descent, it actually makes them more complex. And option four is just irrelevant here.
And that concludes our neural networks quiz. If you've made it to the end of the video, then I think you're the right fit for our ML community. But first, I would recommend checking out some more practice problems, which you can grab from the link in the description. I've also created a full course on LLMs with over 25 concise modules. It will always be free, and you can secure it using the link in the description. I hope you found this video useful, and I'll see you soon.
Okay, this video is going to be an introduction to PyTorch. It will assume a little bit of background knowledge on machine learning, things like what a neural network is, but you can check out some other videos on this channel for that. If you don't have any PyTorch experience, we're going to start completely from scratch in this video. I won't lie and say that this video covers everything you need to know about PyTorch, but it will give you a pretty solid starting point. So let's just get into it. First thing I want to say is that these two import statements right here are incredibly powerful. As a data scientist or machine learning engineer, or even just for side projects, PyTorch is almost the only library you'll need. Sometimes you might use pandas for loading in certain data sets, but you can pretty much get away with exclusively using PyTorch, so learning this library is extremely high ROI. And the fundamental concept behind this library is the idea of a tensor.
You might have heard of something called TensorFlow. PyTorch is kind of like the industry standard now, especially in research, and it was actually used for ChatGPT, so TensorFlow isn't used that much anymore. But the fundamental data type in PyTorch is still something called a tensor. And a tensor is kind of just like a matrix or an array. We can have one-dimensional tensors, two-dimensional tensors, three-dimensional tensors; we can have, say, a 3x10x20 tensor, which you can think of as three different 10x20 matrices. And these store any kind of data we wish, usually integers or floating-point numbers. But tensors are actually more than just a matrix or an array: they carry derivatives under the hood. So thankfully, you never have to worry about doing derivatives by hand with PyTorch. That's the amazing thing: you don't have to be bogged down doing derivatives or linear algebra or matrix multiplications by hand, because this library, this goated library, and these two import statements take care of the ugly math for you. As a result, this main data type in PyTorch carries various derivative attributes with it that you'll rarely, if ever, need to access or look at directly. But we should know that this tensor data type carries other information and other properties that are used to calculate derivatives, inside a directed acyclic graph that PyTorch maintains under the hood. So we should know that tensors are more than just matrices or arrays, even though we're going to abstract them away as essentially such.
So why don't we create our first tensor? We can just say that a equals torch dot, followed by one of PyTorch's tensor-creation functions. These can take in a variable number of arguments, depending on what we want the dimensions of our tensor to be. So if we wanted a 5x5 tensor, we would pass in two arguments. And if we run this, we can actually go ahead and see that tensor.
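The clip doesn't name the exact constructor used, so here's one way to reproduce this, assuming torch.ones (which matches the row sums of five that come up a bit later):

```python
import torch

# A 5x5 tensor filled with ones; the two arguments are the dimensions.
a = torch.ones(5, 5)
print(a)

# Summing along axis 1 collapses the columns, giving one sum per row.
row_sums = torch.sum(a, axis=1)
print(row_sums)  # each row of five ones sums to 5
```

Any of the creation functions (torch.ones, torch.zeros, torch.rand, ...) follow the same pattern: pass the dimensions you want as arguments.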
Just go ahead and print it, and we can see that tensor. And here's our 5x5 tensor. We could also access other properties, like the gradients or the derivatives for this tensor, although since we haven't done anything yet, there wouldn't actually be any information in them. But we should know that that information is there, stored by PyTorch under the hood. So now that we've created our first tensor, why don't we jump into some of the most important functions in PyTorch that involve tensors.
Okay, so two of the first most important functions in PyTorch that we'll deal with are the sum function and the mean function. We might say something like sum equals torch.sum, and it will have two arguments: the tensor, and the axis or dimension along which we want to sum. So we might say something like a, and then an axis, because we have a two-dimensional tensor here; the axis will either be axis equals 0 or axis equals 1. This essentially specifies whether we want to sum along the rows or the columns. However, here's one important and maybe unintuitive thing. Say we wanted to get the sum of every row; we have three rows in this tensor right here. You might think that axis equals 0 corresponds to rows and axis equals 1 corresponds to columns, so that to get the sum of every row, you would say axis equals 0. It's actually the opposite in PyTorch. I'm not sure why this was done by the creators of PyTorch, but it is an important thing to know, as you'll be specifying axes all the time when working with PyTorch. To get the sum of every row, what we actually want to do is go across the columns, and that's why we say axis equals 1. And if we go ahead and print sum, we should actually get the sum of every row, and we can see the sum of every row is five. So this is definitely an important concept in PyTorch, the concept of axes.
Another important concept in PyTorch is the idea of squeeze and unsqueeze. These are two functions that are used all the time, and they come up when we have kind of unnecessary dimensions. So we can just erase this for now and say the tensor we care about right now is a 5x1 tensor: five rows and one column. And if you were to print a.shape, you would get a tuple which has five in the zeroth index and one in the first index. But that one is kind of unnecessary, right? The whole point is that if we print a, it's just of size five, of length five. Saying five by one versus saying, oh, it's just of size five, is kind of carrying the same information, but when we're using other functions later on in PyTorch, some of those functions are really particular about whether we're talking about a 5x1 or just something that is of size five. Sometimes you might see this notation in PyTorch, the trailing comma, which indicates that it's not 5x1, it's just of size five.
So if you ever want to get rid of the one, there is something called squeeze in PyTorch; it squeezes out any unnecessary dimensions. So if we were to say print a.shape, and then say squeezed equals torch.squeeze of a, and then print squeezed.shape, we will actually see a difference: the one has now disappeared. And although this might seem like an extremely small change, it will actually make a difference for various functions that we use later on. Just to make this really clear, let's print a when it's 5x1, and let's print squeezed.
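Here's that comparison as a runnable sketch, extended with the unsqueeze round trip discussed next (dim=1 is where the extra dimension gets re-inserted):

```python
import torch

a = torch.ones(5, 1)           # five rows, one column
print(a.shape)                 # torch.Size([5, 1])

squeezed = torch.squeeze(a)    # drop the size-1 dimension
print(squeezed.shape)          # torch.Size([5])

unsqueezed = torch.unsqueeze(squeezed, dim=1)  # put it back at index 1
print(unsqueezed.shape)        # torch.Size([5, 1])
```

Squeeze and unsqueeze are exact inverses here: the round trip takes us from 5x1 to size five and back.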
We can actually see that there is a difference right here. The first tensor has five rows and one column, while the second one you can simply think of as a vector of size five; we kind of erased a little bit of information about its dimensionality. So squeezing is important. And if squeezing is a thing, unsqueezing should also be a thing, right? Because sometimes we're passing into a function later on that expects two tensors, and we want things to be consistent between those two input tensors; we can't have one tensor passed into this hypothetical function that has this extra one, and one that doesn't. So unsqueezing is something that is done all the time, and it's a good function to be familiar with. We might say something like unsqueezed equals torch.unsqueeze, and the first thing you have to pass in is what you actually want to unsqueeze, so that would be squeezed. But then you also have to specify dim, and dim is just like axis; in this case, it is essentially where you want to insert that extra dimension. So I can say, well, it's currently just of size five, and to make it five by one again, I would have to say dim equals 1. Why don't we print the shape? That's probably a good practice. So let's print squeezed.shape; that's how you get the shape of any tensor in PyTorch. It's just going to be essentially just a five. But we want to make it five by one, so we want there to be a comma after this five and then a one; that's the one index in this iterable, this data structure that is returned by shape. If we just get the zeroth index of the shape, we see that it's five; if we tried index one, it would be out of bounds. So to unsqueeze, we simply want to add this one, to make it 5x1, at the one index of the shape. And now, if we print unsqueezed.shape, we can see that we have it as a 5x1, so it's very important to be familiar with squeezing and unsqueezing. How do
we actually define neural network models in PyTorch? Because that's what this library is all about: defining models that are, specifically, neural networks. So we have this rough idea of a model, right? It should be some sort of class, let's just call it MyModel for now, and maybe there is a constructor where we initialize the various objects that are going to compose this model; these might be the layers of a neural network, as we previously talked about. And there's one core method that is really important for models, and it's always called the forward method. It's really similar to something called get_model_prediction (if you're not familiar with my previous videos, that's okay): get_model_prediction, just going by the name, is essentially a method where you send in an example data point, and the model uses whatever weights and biases it currently has, after some number of iterations of training, and returns the model's prediction. So for a model object, the main ideas are a constructor and a forward method. And to make this whole process of defining neural network models much easier, PyTorch actually has a base class, which we can view over here. This is an incredibly important concept within PyTorch: the idea of a module. A module is basically the same thing as a model, and all neural network models that you define in PyTorch are going to inherit from, or subclass, a class called nn.Module.
So if you want to define your own model class, you just specify that it subclasses, or inherits from, nn.Module. Then, in the constructor, we define the layers of the neural network. This model right here is actually using convolutional layers; it's a convolutional neural network. We haven't talked about those yet, so don't worry about it, but a convolutional layer is essentially just a kind of layer within a neural network. And then, as we talked about earlier, here's the main method that's important for neural network models: every subclass of nn.Module needs to override the forward method from the base class. It's going to take in an example data point, or a batch of data points, X, and, using the layers of the model and maybe some other functions (as we can see in this case, which we'll talk about later), it actually returns the ultimate model prediction. That is what the forward method does. So this is nn.Module, and why don't we learn by example and create our own model.
Okay. So now we need to talk about an extremely important existing module in the PyTorch library that we'll use as a layer in our neural networks all the time, and that is called nn.Linear. If you're familiar with neural networks, you know that each layer of a traditional, vanilla neural network is actually just a bunch of nodes that are each doing linear regression based on the previous layer's input attributes. And you know that the only things you really need to specify are the dimensions of your matrix, which just depend on the current (output) layer's number of nodes and the previous layer's number of nodes. Those are the only required arguments to pass in to nn.Linear; the rest of these are actually optional, and we don't need to worry about them for now. So to create a layer of a neural network, specifically a traditional fully connected neural network, all we really need to specify is in_features and out_features. And in_features is essentially the number of nodes in the layer previous to this layer, the layer that's coming in to this one, and out_features is the number of nodes in this layer. So why don't we take a look at an example?
If you just Google neural networks, this is actually one of the diagrams that comes up. So I think a good way to make this diagram less scary is to implement this exact diagram in code. Just to make that super clear: this is our input layer to the neural network. Whatever data set we're dealing with here clearly has one, two, three, four attributes associated with it. And then our subsequent layers, this layer, this layer, and this layer, are actually doing computation, right? They're figuring out the weights, the W's, necessary to actually make the model give the right prediction. So we're going to have one, two, three instances of nn.Linear in our model, and this is just going to be the input to the model. So let's go ahead and say class, let's just call it MyModel, and we know it subclasses nn.Module.
And the first thing we need is our constructor, so we'll just go and create it. We'll say that there aren't really any other attributes the user can specify when creating this neural network; we'll just hardcode them to be whatever's in this diagram. First thing we'll need is our super call, which is kind of boilerplate code. And now we need to start defining our neural network's layers, right? So, in features and out features. We can do self.first_layer equals nn.Linear, and in_features is going to be one, two, three, four; and we can see that out_features, the number of nodes in this layer (remember, each node is doing linear regression), is one, two, three, four, five, six. So that's all we would do there. Then for the second layer, it's pretty self-explanatory: the out features from the previous layer should be the in features for this layer, so that would be six. And if we count how many are here, that's obviously six as well. But whatever model this neural network was for, for some reason it's predicting two numbers, right? So we can go ahead and do something like self.final_layer equals nn.Linear of six, and then kind of down-project to a dimensionality of two. Again, we don't know what use case or data set this model was created for; we just grabbed the diagram from Google Images, and we're just trying to clarify how to code up neural networks. We don't really care about the use case here, but for some reason this neural network predicts two numbers, so we would specify out_features as two over here, and that would be it. Now all we have to do is override the forward method; every subclass of nn.Module does need to do that. So we define forward. It is an instance method, so we have self, and we'll just say it takes an x, which is some series of data points, and we know that each data point needs to be of size four, it needs to have four attributes; that's what nn.Linear expects over here.
Now, since each nn.Linear instance is itself a subclass of nn.Module, it too has a forward method, already written for us by PyTorch. So we're going to call the forward method of each of these linear instances. We might write something like first_layer_output = self.first_layer.forward(x), calling the forward method of that instance of the Linear class. However, to make the syntax more concise, PyTorch also lets the following work: you can just pass in x directly, like self.first_layer(x), and PyTorch infers that you are calling the forward method of that layer. What we essentially want to return, then, is the result of passing x through the chain: the output of the first layer goes to the second layer, the output of the second layer goes to the final layer, which does some final matrix multiplications to get our output. So we can just return self.final_layer(self.second_layer(self.first_layer(x))), calling all the forward methods consecutively in a sequence, and this is our first neural network model. And one thing we should clarify is all the W's, the weights and biases: we haven't done any training yet at all.
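Putting the description so far into code, the model class might look like this. It's a sketch: the attribute names and the hidden-layer sizes are assumptions for illustration; only the four input features and two outputs come from the transcript.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Hidden sizes (8 and 6) are illustrative; the input has
        # four features and the network predicts two numbers.
        self.first_layer = nn.Linear(in_features=4, out_features=8)
        self.second_layer = nn.Linear(in_features=8, out_features=6)
        self.final_layer = nn.Linear(in_features=6, out_features=2)

    def forward(self, x):
        # Calling a layer directly invokes its forward method.
        return self.final_layer(self.second_layer(self.first_layer(x)))

model = MyModel()
example_datapoint = torch.randn(1, 4)  # batch of one, four attributes
prediction = model(example_datapoint)  # untrained, so values are meaningless
```

Until the model is trained, the prediction only demonstrates the shapes flowing through the network, not anything interpretable.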
If you check the documentation, it will say that the values for the weights of this model, and likewise for the biases, are initialized from some kind of probability distribution; the details aren't super important for this video. The point is that these initial weights are hidden, abstracted away from us, but the weights in the matrices this model is currently using to make predictions are initially just randomly chosen. So if I make an instance of this model, model = MyModel(), I now have an instance and could technically use it to get predictions for data points. Let's say our first example data point is a 1x4 tensor. The reason we make it 1x4 is that we want a batch size of one, a single data point, and every data point needs to have four attributes. We'll just send in completely random numbers for the data point; torch.randn is the PyTorch function to do this, and we can pass in (1, 4). If we send this into the model by calling model.forward(example_datapoint), or, since for all subclasses of nn.Module PyTorch will infer which method we're using, simply model(example_datapoint), the model will pass this data point from left to right through the network, doing all the matrix multiplications with the current weights. But the weights right now are completely meaningless; they were just randomly initialized from that probability distribution. So there's no point in printing the result; it would be completely uninterpretable, because we haven't actually trained the model yet. While we won't cover it in this video, the next step after making an instance of your model is to train it for some number of iterations; then we can actually use the model and get predictions. The training step, which we'll leave as a black box for now, involves calling various gradient descent functions from PyTorch, and that's absolutely something that we're going to cover in a later video.
But after we train this model, we can do all we want: we can keep using the model to see how good its predictions are and send in various example data points. So this is an intro to PyTorch. We covered the concepts behind a tensor and the basic functions associated with tensors, and we actually wrote our first neural network model, defining its architecture based on a predefined architecture we found on Google Images. Be sure to check out the practice problems, where you will write your own model classes to do things like diabetes prediction and predicting the next words in a sequence like ChatGPT. We will also have practice problems on training the model and using it to get predictions. Hey, so if you want to learn PyTorch, this is the video to
start with. We're going to learn the basics through a short coding problem where you can read the description on the left, write your code on the right, and there's a button to submit your code and run it against test cases. If you want to try it out, it's completely free; the link is in the description. I've actually created a whole list of these coding problems that take you from the basics of ML all the way to implementing your own GPT, and there's a link in the description for the full list as well. After I created all these problems, solutions, and test cases, my colleague and pretty famous YouTuber Node hosted them on this website, which he created, and this is the interface you're seeing. Okay, so let's take a little bit of time to read the description. There is a background video you can watch before solving the problem, but I'm going to go over everything right now as well, so you don't really need to watch that video.
So we're going to use built-in PyTorch functions to manipulate tensors. Tensors are the fundamental data type of PyTorch; it's where we store all the data and parameters for our ML models. You don't need any ML experience or experience with neural nets for this video, though the applications might be a bit clearer if you do. Tensors are basically just multi-dimensional arrays or matrices: not just two-dimensional, but potentially three-dimensional or four-dimensional, and they're how we store the data for our ML programs. So we have our tasks, which we can see here; these are the functions we're going to write. We have our inputs
for the functions on the right. And let's just quickly go over the examples. So here we have a 3x4 tensor: the input is M by N, so M is three and N is four, and our task is to reshape it into a tensor that has only two columns. If you only have two columns, that changes the number of rows, so we can see the result now has six rows and two columns. All the data has stayed the same, but we're reshaping it, re-viewing it in a way, so the data is represented or stored in a slightly different manner. That's very important when writing ML programs, since sometimes you need your data to be in a particular shape to pass into some other downstream function. So we're going to use a PyTorch function to do this. Next is
averaging. So we're given some sort of tensor, and the description says to find the average of every column. We can see this entry over here and this entry over here form the first column, and this number is the average of those numbers. We go to the next column: a data point here and a data point there, and we want the average of that column as well. So we have three numbers in the output, since we had three columns in the input. But we don't want to do this manually with a for loop; it's going to be way faster in terms of runtime to call a PyTorch function. The reason is that the PyTorch function will call really fast and efficient C or C++ code that also takes advantage of parallel processing whenever possible. That's a general rule for writing ML programs: avoid traditional for loops and opt for these really optimized functions whenever possible. And we'll go
over that in a second. Then we want to concatenate two tensors. We're just going to concatenate them left to right: over here they want to combine an M x N tensor and an M x M tensor into an M x (N + M) tensor. We can clearly see the number of rows is staying the same; we still have M rows, but the number of columns has increased, so we're concatenating left to right. In the example we have a 2x3 tensor and a 2x2 tensor, and when we stick them together the 2x3 is still intact on the left, with the 2x2 stuck to the right. That's something we'll also want to do when writing ML programs, so let's get familiar with how to use the concatenate function in PyTorch. And lastly, we want to get the loss. Something we're going to have to do when later writing ML programs is to get the error of our model at every iteration or every
100 iterations. So given the model's prediction and the actual true answer we wanted the model to predict, let's write a function that gets the error, or the loss as it's called. We can read over here that it says to use the mean squared error loss; I'll flash an equation for that on the screen soon. Basically, it means to go over every data point. Here we have the model's prediction for a data point: the model predicted zero and the true answer was one. Then the model predicted one for this data point and the true answer was one. So you go through each data point and take the difference between the model's prediction and the true answer, the actual target the model was supposed to predict. Then you square all the differences and average them together: add them all up and divide by the number of data points. We also don't want to do this with a for loop; we're going to use a very optimized function that we just call from PyTorch, and we'll explain later why exactly we want to use that function, but ultimately we do just want to return the
output. Okay, so now let's jump into the code, and if you want to see more of these types of ML coding problems, definitely leave a like on the video. So
the first function we have to write is reshape, and the comment tells us that torch.reshape is going to be really useful. Let's take a look at the documentation for that. It takes in some sort of tensor that we want to reshape, and the second thing we have to pass into the function is a tuple that represents the new shape. We can see here that we started off with torch.arange(4). If you're not familiar with that function, it's basically just going to give us back a tensor with the numbers from 0 to 3, so it's exclusive of four, and that's a one-dimensional tensor. But let's say you want to reshape that into a 2x2 tensor. The output still has our 0, 1, 2, and 3, but now it's in 2x2 format. So we just have to pass in the tensor we want to reshape, a in this case, and a tuple with the new shape. In code, that's just going to look like torch.reshape.
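As a sketch, the finished helper could look like the following (the parameter name is an assumption; the -1 trick and the rounding are explained next):

```python
import torch

def reshape(to_reshape: torch.Tensor) -> torch.Tensor:
    # -1 lets PyTorch infer the number of rows from the fixed
    # number of columns (2) and the total element count.
    reshaped = torch.reshape(to_reshape, (-1, 2))
    # The problem's comments also ask us to round to 4 decimals.
    return torch.round(reshaped, decimals=4)

# A 3x4 input (12 elements) becomes 6 rows of 2 columns.
reshape(torch.arange(12.0).reshape(3, 4))
```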
And the first thing we're going to pass in is the tensor to reshape. Now we need to write our tuple. Here's what I'm actually going to do: we can just say (-1, 2), because we know from the problem description that we wanted the result to have two columns; that's the two here. If we specify the number of columns, and the number of data points stays the same, then that automatically determines the number of rows. Think back to the example where we had a 3x4 input: that's 12 elements, and if I say the output has to have two columns, that forces it to have six rows. So we can just say -1, and PyTorch will infer what number should be there for that dimension, which should be six if we had 12 total elements. So this is our input to reshape; we pass in the tuple as well, and that's all we need to do. Actually, the comments also tell us to round, so we'll just say torch.round and pass in decimals=4, and then we can return this. That's all for that function. Okay,
next is the average function. Instead of iterating over each column to get its average, we're just going to use torch.mean, since it'll take advantage of parallel processing and be a lot more efficient. Let's take a quick look at the documentation. All we really need to do is pass in the tensor we want to find the average of, plus a parameter called dim. This is just an integer: the dimension along which we want to take the mean. Do we want the mean of each row, or the mean of each column? And there's something a bit tricky here that we have to be careful about. We're going to say torch.mean(to_average, ...), but dim is either going to be zero or one, because we're dealing with a two-dimensional M x N tensor. Zero corresponds to the first dimension (zero indexing, the M in M x N), and one corresponds to the second dimension, the N in M x N. We want the average, or mean, of each column, so it might be tempting to just say dim=1, but that's not the convention for PyTorch. We would actually say dim=0, since we want the average of each column.
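The difference is easiest to see on a small example (illustrative values):

```python
import torch

t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# dim=0 collapses the rows, giving one mean per column.
col_means = torch.mean(t, dim=0)   # tensor([2.5, 3.5, 4.5])

# dim=1 collapses the columns, giving one mean per row.
row_means = torch.mean(t, dim=1)   # tensor([2., 5.])
```

So dim names the dimension that gets reduced away, not the one you keep.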
We're going across the rows for each column, so we would say dim=0. And I know that might be a little confusing at first, so feel free to use the link in the description to play around with it. Run your code; this code sandbox supports print statements, so use some print statements and try to see the difference between dim=0 and dim=1. Next is concatenate, and we want to use torch.cat. Take a quick look at the
documentation. The two things we need to pass in are the tensors we want to concatenate, as any Python sequence of tensors (it could be a list of our two tensors, a tuple, anything like that), and the dimension along which we want to concatenate them. Are we trying to concatenate our tensors left to right, like in the problem description, or stack them, one on top and one on the bottom? Obviously we want left to right. So we can go ahead and say return torch.round of torch.cat, and we'll just use a tuple: I'll say cat_one and cat_two, so I've put both those tensors in a tuple. And we know we want to concatenate them left to right; we want the number of columns to increase, so we will say dim=1.
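A sketch of the helper (parameter names taken from the transcript's "cat one" and "cat two"; the rounding mirrors what the transcript dictates):

```python
import torch

def concatenate(cat_one: torch.Tensor, cat_two: torch.Tensor) -> torch.Tensor:
    # dim=1 concatenates left to right, growing the number of columns;
    # dim=0 would instead stack the tensors vertically.
    return torch.round(torch.cat((cat_one, cat_two), dim=1))

# A 2x3 and a 2x2 combine into a 2x5 tensor.
concatenate(torch.ones(2, 3), torch.zeros(2, 2))
```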
And that's actually it for this function. Our last function is get_loss, and this one's pretty straightforward.
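A minimal sketch of it, with the rounding the problem asks for (the function name matches the transcript; the exact signature is an assumption):

```python
import torch
import torch.nn.functional as F

def get_loss(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Mean squared error: average of (prediction - target) ** 2
    # over every element, computed by one optimized PyTorch call.
    return torch.round(F.mse_loss(prediction, target), decimals=4)
```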
All we really need to do is call torch.nn.functional.mse_loss, for mean squared error loss, and pass in the prediction and the target. That's it. Then we can just make sure to round our answer to four decimal places. Again, the reason we want to use this function instead of manually going over each data point, taking the difference, squaring it, and finding the average, is that this function will take advantage of parallel processing and can operate on multiple columns in our input simultaneously. When we
press submit, we can see that our code works. If you want more practice problems, definitely check out the full list of problems in the playlist linked in the description, and you can jump into whichever problem is right for you. Definitely leave a comment as well if you found this helpful, and I'll see you soon. Okay, let's talk about dropout. This is a really important concept in deep learning and in training neural networks. Dropout solves the problem of overfitting.
So overfitting is just when your training performance, your training accuracy, is greater than your testing accuracy. Let's say you're training your model. Error is going down with every iteration; you think your training accuracy is great. Then you test the model, getting its predictions on data it's never seen before, and the predictions are horrible. That's what overfitting is. Overfitting is caused by the model essentially memorizing irrelevant details in the training data, essentially just noise, and when that noise doesn't appear in the testing data, its predictions aren't too great. So what causes the model to memorize this irrelevant noise in the training data? It's caused by the model being too complex. This could mean the model has too many layers, or that each layer has way too many nodes. The point is, the model is just one big mathematical formula, and right now that formula is way too complex: the model is memorizing all these irrelevant intricacies in the training data, which causes its performance on the testing data, which is what we actually care about, to go down. Dropout is one of the techniques created by deep learning researchers to solve this problem of overfitting. So let's explain dropout. Let's say we have a neural network with one hidden layer: three nodes in our input layer, say, and two nodes in our output layer. We draw our connections, and this is going to be a fully connected neural network, as we've been doing so far.
Say we add a dropout layer after this linear layer, something like nn.Dropout; the only thing you have to specify for a dropout layer is a probability p, maybe 0.2 or 0.4. If you apply dropout to this linear layer, then at every iteration, every node independently gets turned off with probability p: with probability p this node gets turned off, with probability p this node could get turned off. And when I say turned off, I mean its output or activation is set to zero. So this node over here, with inputs x, y, z: its output or activation is based on its weights, W1*x + W2*y + W3*z, plus maybe an optional bias. What we mean by turning this node off is setting that output, its activation, to zero. That would be like severing its three connections, temporarily, just for that one training iteration. So what dropout does, just to be super clear, is that if you apply a dropout layer to a linear layer like this one, at every iteration of training there will be a probability p that a given node is turned off and its activation set to zero, which essentially means its connections with the nodes in the previous layer are severed, and that is done independently for each node. So for that node I just drew, there is a chance it gets turned off as well. And what dropout does is essentially reduce the complexity of our model a little bit.
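To see this concretely, here's a minimal sketch (the probability 0.4 and the tensor size are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.4)  # each activation is zeroed with probability 0.4
drop.train()              # dropout is only active in training mode

activations = torch.ones(1, 8)
print(drop(activations))  # some entries become 0; survivors are scaled up

drop.eval()               # in eval mode, dropout is a no-op
print(drop(activations))  # all ones again
```

One detail worth knowing: during training, PyTorch rescales the surviving activations by 1/(1-p) so that the expected output matches what the layer produces at evaluation time, when nothing is dropped.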
We're essentially, with some randomness, going to delete some of the nodes in a given layer. This is going to make our model less complex, and actually it's going to make it a bit stupider. Think of dropout as taking our big giant neural network, dropping it on the ground, and knocking some of the screws loose. Most of it's still intact, as long as the probability value is not too high, but we made our network a bit stupider; we decreased its ability to learn really intricate noise in the training data. So now the model is focusing on the big picture instead of memorizing specific noise in the training data. And by making it a bit stupider, it's been found time and time again that our testing accuracy, which is what we really care about, will go up. So dropout does increase the performance of our neural networks, especially as they get deeper and deeper, meaning more and more layers, and it's definitely something we want to include in our neural networks. So you can jump into the code now. Okay, let's solve digit classifier. We're going to build a neural network that can recognize handwritten digits. They'll be
passed in as images, and they're going to be black and white. Our job is to predict what digit is in the picture; the model needs to interpret this. This is a simple but still powerful application of neural networks, and I'd definitely recommend checking out this 10-minute clip, at the timestamp the link starts at, for a background on neural networks. So, we're given a model architecture in the description and this blurb right here, and most of our video will be explaining how to reason through that and code it up. But let's first take a look at our input. The two things we have to write are the architecture, meaning the constructor, and the forward method that every neural network class in PyTorch has. The input here is the input to the forward method, and it's essentially one or more, so it could be a whole batch, of 28x28 black and white images. We're guaranteed that every single batch element will be of size 28 * 28, that's 784, so you can think of the image as flattened out into a horizontal vector. It says not to write the training loop or gradient descent to actually train the model and minimize the error; that's going to be in the next video. So let's take a look at the example. We have our input image here, and since it's of size 784 we've actually omitted many of the indices, but this is essentially a vector where every number is between 0 and 255, 255 being completely white and zero being completely dark. Our output here is a vector of size 10, and every entry in this vector is between 0 and 1, so we can interpret it as a probability. If we look at the index corresponding to seven, we can see that the model is essentially saying there's a 90% chance that this input image is a seven; that's the model's confidence. And you might think that sevens kind of look like twos, depending on how some people draw them, so the model thinks there's some slight chance that this input image is a two. But this output gives us an idea of what the forward method needs to return: it needs to return a list of probabilities. And just a note that your exact model prediction once you run your code isn't going to match this, because we're not going to train the network; the weights of the model will just be whatever they're randomly initialized to, and your prediction won't be too great. But this is just to understand the format of the output, and in the next problem we'll actually train the model and see it achieve something like 98% accuracy. Let's jump into the architecture explanation.
Okay. So here I've drawn the architecture described in the problem description. We're going to do two things: explain exactly why we're supposed to use this architecture, and give code snippets so you can implement it without jumping all the way to the solution at the end. The first layer on the left is the input layer, and I've gone ahead and drawn 784 neurons. The reason is that we should treat each image in our input independently; let's just focus on passing one image at a time from left to right through this neural network. For each image, there are 784 corresponding numbers or features, specifically the grayscale activation at each location in the 28x28 image. So we have 784 numbers in the input layer; you can think of a number being stored in each of these nodes or neurons. To the right of it, we have the first linear layer of this neural network, which has 512 neurons (there's a small typo in the drawing, which says 584; let's just go ahead and say 512). These are not input neurons anymore; rather, they're internal to the network. They're not the input, they're not the output, so we would call them hidden neurons. Each of these neurons, though I haven't drawn the connections here, all 512 of them, is fully connected to the input layer. So this neuron is connected to all 784 input neurons, and there would be many, many connections here, as for every other neuron in this hidden layer. Because each of those neurons is fully connected to every node in the input layer, there are 784 weights or W's, W1, W2, W3, all the way out to W784, stored inside this neuron, and likewise 784 different weights stored in that neuron. So we have 784 weights per neuron, yet we have 512 neurons. That's why all the weights for this linear layer are stored in a matrix, to make all the computations a lot more efficient: all of those weights, plus some optional biases. 784 * 512, that's a lot of weights. And
this layer does a lot of the heavy lifting in this model; the model is able to learn so much just from this one layer. The model is essentially figuring out how important every single pixel is. For each of the 784 pixels, we have 512 weights in this layer, because one way to think of it is that each of the 512 neurons in this layer has 784 weights associated with it, and each of those weights is for a single pixel, since each neuron in this layer has one connection to each of the 784 neurons in the preceding layer. So this layer helps the model learn how important each pixel in the image is, depending on what number is in that pixel in terms of the grayscale value. So each pixel has 512 weights associated with it, one from each neuron. After this layer we have a nonlinear activation, a ReLU activation. And if we
take a look at the ReLU function, which is listed right here, we can see that it's essentially like an on-off function: before x = 0 the function is off, and here the function is essentially on. This function helps the model learn when a feature is important enough to pay attention to.
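The on-off behavior is just taking the maximum with zero; here's a tiny sketch of it by hand:

```python
import torch

def relu(x: torch.Tensor) -> torch.Tensor:
    # "Off" (zero) for negative inputs, identity ("on") for positive ones.
    return torch.maximum(x, torch.zeros_like(x))

relu(torch.tensor([-2.0, -0.5, 0.0, 1.5]))  # negatives zeroed, 1.5 passes through
```

In practice you'd use the built-in torch.relu or nn.ReLU rather than writing it yourself.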
Let's say the model needs to detect that some feature in the input is past some kind of threshold, and then we have to start paying attention to it; it's not just zero or completely irrelevant anymore. That's what the ReLU activation can help the model learn. It helps the model learn a far more complex relationship than those from linear regression, like W1 * x + W2 * y. If this were just our output, with no nonlinearities, there would be a limit to how complex a relationship the model can learn. To give an example of a cutoff feature in this model, let's say the model is trying to differentiate between sevens and twos. I've drawn this two intentionally, because even though this isn't a great two, we'd still want the model to pick up that it's a two for it to be a good digit classifier. And let's say the model learns that the cutoff between a seven and a two is how many non-black pixels we have in this region of the image. Well, if we had this as a seven, it looks like a seven, but if we just had a little more over here, should the model interpret it as a two? This is a silly example, but it gives you the idea that the model needs some way of having thresholds and cutoffs, and the ReLU activation is one way the model achieves that. One
additional thing I wanted to touch on before we get to dropout is why 512 neurons here. We have an explanation for why 784, and we'll have an explanation for why 10 over here, but why 512? The truth is, it's somewhat arbitrary. I said this model is able to achieve something like 98% accuracy once we train it, but it would probably achieve similar accuracy if we used, I don't know, 550, or 500, or even went a bit lower and said 490. It's a little arbitrary, and a range of numbers would probably work for this layer. But the general idea is that the larger this number, the more complex a relationship the model is able to learn, so we would not want a number that is too small; that would probably not yield great results. Additionally, depending on how large our dataset is and how much there is for the model to actually learn, if we use a number that is too high, then the model has too many parameters and can start overfitting. I have a separate video on overfitting, but overfitting is essentially when the model starts memorizing irrelevant noise in its training data, and this actually makes performance worse if we have way too many neurons, way too many parameters. So there is a sweet spot, and 512 works well. Definitely check out the following video for some proof that this model does achieve great accuracies. So dropout, why are we applying a dropout layer over here? I
have a separate video explaining dropout. But essentially what dropout
dropout. But essentially what dropout does is it kills some of the nodes in the prior layer. So at every iteration of training we have some probability each node has a chance of just getting
wiped out which actually lowers it slightly lowers the complexity of the model during training. And this actually helps to prevent overfitting because if
the model doesn't have as many neurons if it doesn't have as many weights even temporarily then the model is essentially able to avoid memorizing irrelevant noise in the training data.
The last and most conceptually significant layer is the final output layer, which has 10 neurons. Although I haven't drawn the previous layers' neurons over here, the total number of neurons has not changed yet: we still have 512 post-ReLU and 512 post-dropout. Those layers just change the numbers in each neuron, not the total number of neurons. But in this final layer, where we have 10 neurons, each neuron is fully connected to each of the 512 neurons in the preceding layer. The reason we're choosing 10 neurons in this final layer is that we want our model to output 10 numbers, one for each of digit zero through digit nine, which we can then interpret as probabilities. And of course, each of these 10 neurons has 512 weights associated with it; because there are 512 neurons in the previous layer, there are 512 numerical features this final layer can pay attention to. Lastly, we are going to apply the sigmoid function. By now, we've seen that it causes our model's output to be between zero and one, and the fact that the numbers are between zero and one means we can more easily interpret them as probabilities. So
the last thing I want to mention before we jump into the code is nn.Linear. nn.Linear is how we set up this layer and this layer in code, and it takes in_features and out_features; those are the two things we have to pass into its constructor. For the first linear layer we have 784 input features and 512 output features. For the second linear layer, which is over here, we have 512 input features and 10 output features. In between, we will use nn.ReLU, to which we don't have to pass anything; that function is already predefined, as we saw in the previous image. We'll also use nn.Dropout, and the only thing you have to pass to nn.Dropout is the probability, so if we use a probability of 0.2, as mentioned in the problem description, that's what we pass in. Lastly, we will have nn.Sigmoid.
So all these nn instances will live in the constructor of our handwritten digit recognizer, and then in the forward method we will string together the forward calls for all of these layers, starting with the first linear layer all the way to the final sigmoid. That will be our model. Here in the constructor is where we define the architecture. We have our first linear layer, self.first_linear, which is an instance of nn.Linear; we talked about how it has 784 input features and 512 output features. After this layer, we are really interested in applying some nonlinearity, so we go ahead and create our nn.ReLU. There's nothing we have to pass in there; that function is already defined based on the graph we saw earlier. Then we would like to apply our dropout layer, so we create nn.Dropout with a probability of 0.2; we obviously don't want to use too high of a value there and just totally destroy our neural network. Lastly, we have the final linear layer. We can actually call this a projection, because we want to project the dimension down to output 10 neurons, 10 different probabilities. This will be another instance of nn.Linear with 512 input features and 10 output features. And finally, we have our sigmoid instance to make all the outputs between zero and one.
Then we are asked to return the model's prediction to four decimal places. This is going to require calling the forward methods of all of those other nn instances. Our solution itself is an nn module, so let's write forward for this module. One quick thing: instead of explicitly calling forward, so instead of doing something like self.first_linear.forward(images), we can just use the syntax self.first_linear(images) and get the same behavior. We want to pass that into the ReLU, so self.relu of whatever that returns. Then we do the same for dropout: self.dropout of whatever relu returns. Then we do our final projection, self.projection of whatever dropout returns, and lastly self.sigmoid of whatever the projection returns. And since we are interested in rounding this to four decimal places, we can store it in a variable called out and simply return torch.round of out with four decimal places, and we can see that it works.
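Putting the constructor and forward method together, here's a minimal sketch of the module described above. The attribute names like `first_linear` and `projection` are my guesses at the names used in the lesson, and `decimals=` in `torch.round` requires a reasonably recent PyTorch:

```python
import torch
import torch.nn as nn

class DigitRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        # 784 pixel features in, 512 hidden features out
        self.first_linear = nn.Linear(784, 512)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        # Project the 512 hidden features down to 10 class scores
        self.projection = nn.Linear(512, 10)
        self.sigmoid = nn.Sigmoid()

    def forward(self, images):
        out = self.first_linear(images)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.projection(out)
        out = self.sigmoid(out)
        return torch.round(out, decimals=4)

model = DigitRecognizer()
preds = model(torch.rand(3, 784))  # a fake batch of 3 flattened images
print(preds.shape)  # torch.Size([3, 10]), every entry between 0 and 1
```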
Now that we've written our first neural network, we can jump into training loops in PyTorch. Once we have a PyTorch model defined, how do we train it, and how do we actually use it to get predictions on data points we've never seen before? That's what we're going to cover in this video. So let's say you have written a class; in this case, it is a model that recognizes images of handwritten digits. You've written your constructor, which is where you define your model architecture, and you have your forward method, which is where we get the model prediction. In this video, we're going to go over how to actually train a model in PyTorch: how do you call the right functions to do gradient descent? You won't have to take any derivatives by hand; PyTorch will do all of that for you. But how do you actually, over some number of iterations, update this model and make it better and better based on the training data? We're going to explain the training loop in this video, and it is fundamental to all future neural networks that you train. The code below is going to be the same regardless of the neural network, and while it might seem like boilerplate, there are actually a lot of important concepts embedded within it. The first step is to make an instance of your model, the class that we defined earlier. Next,
you'll need to define your loss function. For our previous linear
function. For our previous linear regression problems, we used the mean squared error, and it was kind of an ugly looking formula with a summation and a square, but we did kind of make
sense of it. And ultimately, we realized that if we minimize that error, then we increase the chance that the model will do well on data points it's never seen before. Because this problem is a bit
before. Because this problem is a bit different. It's not a regression
different. It's not a regression problem. It's a classification problem.
problem. It's a classification problem.
Right? Our model is taking in an image and the model needs to predict which digit it is and there's only 10 possible digits. So the model will actually
digits. So the model will actually output probabilities that the given input image belongs to a certain class.
Given, say, a picture of a number seven, this model will hopefully output something like 95% for the class "seven." The model is outputting probabilities here, and that's why we're not going to use the mean squared error for these probability-based models. In that case, we use something called cross-entropy loss. It sounds like an ugly term, and if you look it up, there's a somewhat confusing math formula which isn't essential to understand right now; we'll go over it in a different video. The fundamental idea is that our model is no longer outputting continuous numbers, as it would if we were predicting the price of an Uber ride or predicting how tall someone is going to be. Instead, our model is doing classification: it's trying to put the input into one of some fixed number of buckets, and it's outputting probabilities, which is why we need a different error function, a different loss function. If we skip down here, the cross-entropy loss takes two things, just like the other error functions: the model's prediction and the actual answers, the ground truth labels from our data set. So let's say the model's prediction is some probability, and the probability it gave for some input image was 0.8; the model thinks there's an 80% chance that that image of a number seven is actually a seven. The labels, meanwhile, are always essentially ones or zeros. The label in that case would just be one: the true answer is that there's a 100% probability that that given image, a number seven, belongs to the class of images of sevens. So, treating this loss function as a black box, it takes in the probabilities of the model's predictions and the true answers, and outputs our error.
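Treated as that black box, the loss boils down to penalizing low probability on the correct class: for a single example, cross-entropy is just the negative log of the probability assigned to the true class. Here's a hedged pure-Python sketch of that idea (not the exact formula PyTorch uses internally, which operates on whole batches of raw scores):

```python
import math

def cross_entropy_single(predicted_prob_of_true_class: float) -> float:
    # Higher probability on the correct class -> lower loss.
    return -math.log(predicted_prob_of_true_class)

print(cross_entropy_single(0.8))   # ~0.223  (fairly confident and correct)
print(cross_entropy_single(0.99))  # ~0.010  (very confident and correct)
print(cross_entropy_single(0.1))   # ~2.303  (confident in the wrong answer)
```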
The next really important line of code, which will appear in every single training loop, is something called the optimizer. This is essentially an object in PyTorch that does gradient descent for us. When we create the object, we need to pass in the parameters of the model via model.parameters(), which tells the optimizer all the weights, all the W's inside this neural network, that we need to optimize and update over some number of iterations of training. You might be wondering what Adam is. Adam is kind of like gradient descent on steroids. It's still doing gradient descent: it's taking those derivatives and using the learning rate. In fact, it's using a default learning rate here, since we didn't pass one in; it'll assume a default of 10^-3, or 0.001. But Adam is gradient descent with some optimization tricks to dynamically change the learning rate over the course of the algorithm. Once we're getting closer to the minimum of a function, we don't want to accidentally overshoot, so we'd want to decrease the learning rate at that point. That's an example of the tricks and optimizations that Adam uses, but torch.optim.Adam is still gradient descent.
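A minimal sketch of those two setup lines, using a stand-in one-layer model in place of the digit recognizer:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # stand-in for the digit model; any nn.Module works

loss_function = nn.CrossEntropyLoss()
# model.parameters() hands every weight tensor to the optimizer;
# lr defaults to 1e-3 when not passed explicitly.
optimizer = torch.optim.Adam(model.parameters())
print(optimizer.defaults["lr"])  # 0.001
```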
Next, we always have to define the number of epochs. An epoch is defined as the model being trained on the entire training data set once, so five epochs would be five passes over the training data set. You can guess that as you do more epochs, the model "memorizes" the training data better and better. Too many epochs might not be good, though, since we don't want our model to pay attention to small, unnecessary details in the training data; that's called overfitting, and we want the model to perform well on data it's never seen before rather than solely memorizing the training data. Then we iterate over the number of epochs, and within each epoch we have something called a train data loader. That's just a couple of lines of code to define, and I've already done that for us earlier in the code; we can delve into the lines needed to create the train data loader in a different video. Essentially, this is an iterator that gives us tuples: a batch of images, where each image is 28 x 28 (or you can think of it as 784), along with the corresponding labels, which tell us which digit is in each of those images.
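For reference, here's a hedged sketch of what building such a data loader can look like, with random stand-in tensors in place of the real handwritten-digit data used in the lesson:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fake stand-ins: 64 "images" of shape 28x28 and 64 integer digit labels.
images = torch.rand(64, 28, 28)
labels = torch.randint(0, 10, (64,))

train_data_loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

for batch_images, batch_labels in train_data_loader:
    print(batch_images.shape, batch_labels.shape)  # torch.Size([16, 28, 28]) torch.Size([16])
    break
```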
So images and labels are the same size, and we're pairing them up in a tuple. The first thing we have to do with the batch of images, instead of having each one be 28 x 28, comes from referring back to the model architecture: the first linear layer is expecting a vector of size 784. So we're essentially flattening each image out. We're viewing, or reshaping (you could use torch.reshape here as well), that image into a flat vector of size 784, but it's still encoding the same information for every pixel in the image. The next part is the training body. These several lines of code are incredibly important to understand and are the main focus of this video. The first thing we do is call the forward method of our model. This syntax over here calls the forward method from the model class we defined and gets our model prediction.
The next step is a bit of a frustrating line that is required in PyTorch but will become second nature to you: optimizer.zero_grad() cancels out all the derivatives that were calculated in the previous iteration of gradient descent. We know that at every iteration we want to recalculate the derivatives so we can update our weights. But PyTorch by default will store the previous derivatives and add them to the derivatives we calculate in this iteration of gradient descent, unless we call zero_grad, which clears the gradients from the previous iteration. Next is a line of code that will always appear in a training loop: calculating the loss, or the error, based on our current model prediction, which is one argument passed into the loss function, along with the ground truth labels. The next line of code is probably the most important line of a training loop in PyTorch: loss.backward(). This calculates every single derivative necessary to perform gradient descent. Depending on how big the neural network is, this model may have tons of W's, tons of weights, and tons of derivatives that need to be calculated; specifically, the derivative of our error with respect to each of those weights, so that we can then update those weights based on the learning rate. loss.backward() is probably the most computationally intensive step of this entire program: it calculates all the necessary derivatives and stores them in such a way that we can actually use them. The next line of code, optimizer.step(), is the line that updates all of our weights. You can think of optimizer.step() as doing: new w equals old w minus the derivative times the learning rate. That is exactly what optimizer.step() is doing: taking a step in the direction, hopefully, of the minimum of our loss function. The whole point of gradient descent is minimization, and step uses all the derivatives that we calculated in the previous line of code.
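Stringing those lines together, here's a minimal, self-contained sketch of the training body described above; the data and model are random stand-ins so the loop actually runs end to end (the lesson's real code uses the MNIST data loader and the digit model instead):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)                 # stand-in for the digit recognizer
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Fake batch: 16 flattened "images" with integer digit labels.
images = torch.rand(16, 784)
labels = torch.randint(0, 10, (16,))

for epoch in range(5):
    predictions = model(images)            # forward pass
    optimizer.zero_grad()                  # clear gradients from the previous iteration
    loss = loss_function(predictions, labels)
    loss.backward()                        # compute d(loss)/d(weight) for every weight
    optimizer.step()                       # w_new = w_old - lr * derivative (plus Adam's tricks)
    print(epoch, loss.item())
```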
And then we can see how the model is doing once training is done. I've gone ahead and run this cell, so we have a trained model now; it has the right weights in this model object, updated over the course of the algorithm. If you're wondering how the weights actually got updated, a reminder: for the optimizer, which does the step, we told it at construction time which model parameters to update. So after this code is done, we have the right model parameters. We can put the model in something called evaluation mode, which tells PyTorch that, because we're just trying to get a bunch of model predictions, it shouldn't worry about calculating the derivatives needed for training. Then we can iterate over our test data loader, reshape our images as we did before, and pass them into the model. Think about what the model's output should be: something like batch size by 10. We are feeding a batch of images into the model in this line of code, and for every image we are predicting 10 different probabilities: the probability that the image belongs to the class of handwritten zeros, or handwritten ones, all the way through handwritten nines.
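Here's a minimal sketch of that evaluation pass, with a stand-in "trained" model and fake test images; taking the max over dimension 1, discussed next, turns each row of 10 numbers into a single predicted digit:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)          # stand-in for the trained digit model
model.eval()                        # evaluation mode: no training-time behavior like dropout

test_images = torch.rand(4, 28, 28)
with torch.no_grad():               # skip derivative bookkeeping during inference
    outputs = model(test_images.view(-1, 784))   # shape: batch size by 10
    predictions = outputs.max(dim=1).indices     # index of the largest of the 10 numbers

print(outputs.shape)      # torch.Size([4, 10])
print(predictions.shape)  # torch.Size([4]), one predicted digit per image
```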
So you have 10 numbers for every single input image, but the way we can say what the model's prediction is, based on those probabilities, is just by taking the max: whichever probability was highest is what we'll take as the model's actual prediction. If this tensor is batch size by 10, we want the max index from every single row, which means taking the max across dim equals 1. Then let's iterate over the images that we fed in for testing and print them; we have to reshape them back into 28 x 28. And let's print the model's prediction and see how the model did. So after we run this line of code: this was an image of a 7 passed in, and it looks like the model, after we printed the predicted index, actually predicted that this was a seven. So the model recognized this image. We can see the model did a great job here again, predicting two as the highest probability; a one here, a zero here, a four here, and the model did well for these five images. If we print some more, though, we can probably find images where the model didn't do so great, so why don't we take a look at one of those. Here we have an image where the model predicted that this clear two was a three; that was the highest probability it assigned among all the possible digits for this input. So
this neural network is not perfect; it doesn't have 100% accuracy. But it's still pretty simple and achieves overall pretty good results without a convolutional neural network (don't worry if you don't know what those are yet). With this simple neural network architecture, you can achieve around 98 or 99% accuracy on this handwritten digit data set. That just shows that even with these simple neural networks, we're able to learn pretty powerful relationships; we're teaching computers to see. So the main takeaway from this video is that our neural networks are going to get more and more complicated, but that training block of code, where we get the model prediction, zero out the previous gradients, calculate the derivatives, and do optimizer.step() (and of course, first define our loss function and our optimizer) will hold standard for almost every neural network we train. So be sure to check out the quiz to make sure you understand that. In this problem, we're going to solve the PyTorch training quiz, and I would definitely recommend really understanding these concepts, as it's the same code you would use to train any neural network, including the handwritten digit model, NLP
models, and even ChatGPT in future problems. So the first question: what would happen if you called zero_grad after calling backward in a training loop? zero_grad clears out all the derivatives that may have been calculated in previous iterations, and we know backward calculates derivatives. Option A says training would be sped up due to parallelization; that's completely irrelevant here, since we're not doing anything with the GPU directly in these lines of code. Option B, no weights would change after calling step, is actually the correct answer. We know that the weights are updated based on the values of the derivatives; that's the formula for gradient descent. But if we cleared all the derivatives, setting them to zero, then when we subtract the derivative times the learning rate, none of the weights would change. Option C says training would take longer but would still minimize the error; actually, no, we wouldn't make any progress in minimizing the error. And option D says we would get a runtime error; you wouldn't get a runtime error, you would just be confused as to why your model isn't working. The next question asks what happens when backward is called, and we just talked about this: the derivatives for each and every weight.
So the derivative of the loss (the error function) with respect to each weight is calculated, which is what allows gradient descent to minimize the loss function; that's the function we're minimizing. The weights are not actually updated here; that would be optimizer.step(). The loss is not calculated either; we would have to call the loss function to do that. And for D, we are not adjusting the learning rate in this step; that is handled automatically by the optimizer.
The next question asks what happens when step is called. We know this is the line of code that actually updates all our weights by subtracting the learning rate times the derivative, so we can see over here that the weights are updated based on the update rule. The data set is not randomized; that's just not handled by the optimizer. As for the learning rate, it may or may not decrease or increase: the optimizer adjusts it as needed. It's pretty customary to start with a small learning rate and slowly increase it, based on how the model is adjusting to the current value, and then towards the end of training, when we're getting close to minimizing the loss function, decrease the learning rate. So just by knowing that we're calling step, we can't tell from this line of code alone whether the learning rate is changing; that's handled internally by the optimizer algorithm we use. But we do know that it means we are updating our weights based on the current learning rate. So here's a quick
reminder on the cross-entropy loss function: this is an error function used for models that output probabilities. Say your goal is to categorize an input image as one of 10 digits, 0 through 9. Your model outputs a probability that the input image belongs to each of our categories, and we also know, for every image, the true class label, the true digit that it is. So this cross-entropy loss function takes in the model's probabilities for every training example we're processing in a given batch, as well as the correct answer for each image: the correct class the model should have assigned. Choice A says something about a regression model, so we automatically know that's out; we'd probably use mean squared error for regression. B, a language model that predicts the next word in a sentence among a fixed vocabulary: yes, that is definitely an example where we output a list of probabilities, one for each possible next word in the sentence. This is actually how transformers like ChatGPT work. And C, a classification model that predicts whether an email is spam or legitimate: this is also a case where we're doing classification, binary classification specifically, and outputting a probability that a given email is spam. So the answers are B and C. Then, in this line of code, we're instantiating the optimizer object and trying to figure out what algorithm is actually running inside it. That's going to be gradient descent; this is the object that actually updates our weights based on the calculated derivatives. I hope this was helpful, and definitely leave a comment if there's anything you would like me to explain in more detail. I highly recommend understanding the concepts behind PyTorch training loops, and after you understand this, you're definitely ready to jump into the next problem, which is our introduction to NLP, or natural language processing. NLP is the field of ML focused on teaching AI to read and write like humans. With the development of the transformer, AI-generated text is almost indistinguishable from human writing. But
before diving into transformers, there are some NLP fundamentals that are essential to understand. This video will go over three quiz questions that cover these fundamentals. We'll cover tokenization, the process of breaking a string into a series of characters, words, or subwords, also known as tokens. We'll also cover word embeddings, which are vector representations of each token. After you go through this quiz, you'll be well prepared to start my machine learning roadmap, which is completely free and available at the top link in the description. Let's get started with question one. Here is an example of a model vocabulary. The vocabulary is just the set of all the unique words the model encounters within its training body of text. Side note: the training body of text for ChatGPT is essentially the entire internet, while the training body of text for a sentiment analysis model, which is an emotion-predictor model, might be a series of movie reviews and a corresponding label for each sentence.
We can see that for every token in the vocabulary, there is a corresponding integer. When passing text into the
integer. When passing text into the model, each token is encoded with its assigned integer. The model is actually
assigned integer. The model is actually generating numbers and we decode the integers back into the corresponding tokens. But how are the integer
tokens. But how are the integer assignments for each token learned? A.
No learning necessary. They're
arbitrarily assigned at the start of training and kept constant throughout.
B. They're randomly initialized but changed during training as the model learns which integers best encode each token. C. They're initialized based on
token. C. They're initialized based on the initial embedding representations and kept constant forever.
D. No learning necessary. Each time the model is called during training or testing, we randomly reassign the integers. The answer is A. No learning
integers. The answer is A. No learning
necessary. They're arbitrarily assigned and kept constant. We need the integer assignments to be consistent so that we can decode the model's output back into coherent strings. But the way we assign
coherent strings. But the way we assign them can be completely arbitrary as long as we keep the mapping constant. But
that means that the integer encodings carry no information about the actual meaning of each word. And it's important for the model to understand the meaning of each word when predicting the emotion
in a sentence or generating a response to our prompt. That's where embedding vectors come into the picture, which we'll talk about in question three.
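A minimal sketch of that answer in code: the word-to-integer mapping below is arbitrary but fixed, which is all the model needs to decode its outputs. The vocabulary here is made up for illustration.

```python
# Arbitrary but consistent token-integer assignments: any fixed assignment
# works; sorting just makes this one deterministic. Vocabulary is made up.
vocab = ["the", "movie", "was", "great", "terrible"]

token_to_int = {tok: i for i, tok in enumerate(sorted(vocab))}
int_to_token = {i: tok for tok, i in token_to_int.items()}

def encode(text):
    return [token_to_int[w] for w in text.split()]

def decode(ids):
    return " ".join(int_to_token[i] for i in ids)

# The round trip works precisely because the mapping never changes.
print(decode(encode("the movie was great")))  # the movie was great
```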
Okay, question two. Here's an example of character-level tokenization, and here's word-level tokenization. Which of the following is an effect of using character level over word level?

The answer is B. The model vocabulary will be smaller, but diversity will be greater. If we use word-level tokenization, the vocabulary might be the set of all words in a language. But if we use character-level tokenization, the vocabulary would just be the set of all characters in the alphabet plus the special characters, which is still far fewer than the number of words there could be. So the vocabulary would be smaller. And for a model that generates one character at a time, there are many combinations or paths the model could take every time the next character is chosen. In contrast, with a word-level model, the entire next word is chosen each time the model is called, so the responses tend to be less diverse.
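A quick sketch, with a made-up sentence, of the two tokenization levels compared in question two. A character-level vocabulary is bounded by the alphabet plus punctuation (a few dozen tokens), while a word-level vocabulary can grow into the tens of thousands, but the character model makes many more choices per sentence.

```python
# Word-level vs character-level tokenization of the same (made-up) sentence.
sentence = "to the moon"

word_tokens = sentence.split()   # word level: one token per word
char_tokens = list(sentence)     # character level: one token per character

print(word_tokens)  # ['to', 'the', 'moon']
print(char_tokens)  # ['t', 'o', ' ', 't', 'h', 'e', ' ', 'm', 'o', 'o', 'n']
```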
Okay, question three. How are these vectors learned? Here's a quick summary of embeddings. They are vector representations for words that also encode meaning. Similar words will have representations that are closer to each other when visualized, and unrelated words will be farther apart. Also, the dimension of each vector here is only two, but in practice it could be in the thousands. When passing a sequence of text into a model, the first step in the calculations for the output is to fetch the embedding vectors for each word. You can think of this like a lookup table from the token integers to vectors. Once the embedding vectors are retrieved, a series of multiplications, additions, and nonlinear functions are used to form the final output.

Back to the question: how are these vectors learned? The answer is that they're randomly initialized, and over many iterations of training, the gradient descent rule is used to update the embedding vectors. This means that initially, words that are entirely unrelated may be very close to each other, and words that are related may be far apart. But over training, these vectors are updated. Gradient descent is the training algorithm, and it makes use of some very basic calculus. My three-minute video breaking down the equation is linked in the description for those looking to understand it deeply.

That wraps up our brief review of tokenization and embeddings. If you're interested in how to implement basic tokenization in Python, I'll also link that video in the description.
Okay, if you've made it to this part of the video, then I think you'll enjoy our ML community. I offer weekly lectures, one-on-one mentorship, and a few special bonuses. If you're interested, just head to the link in the description to learn more. I hope you found this review useful, and I'll see you soon.

Let's solve intro to NLP, or natural language processing. This is actually our first problem on NLP, and this is going to be really exciting because now we can finally get into the application of neural networks. Our examples and problems have been maybe a bit abstract so far, but we're finally ready to jump into NLP and build models that can do interesting things like generate text and detect emotion. We're going to build a sentiment analysis model in a later problem, and we're finally going to explore NLP in more detail. So in this problem, we're going to start from a raw body of text, so just strings, and set up a training data set. You may have heard that ChatGPT uses almost the entire internet for training, but that's just a giant string, right? You could actually represent that in one massive string. How do we convert these strings into numbers? How do we convert them into integers that the model can actually understand? These models obviously work with matrices and all these matrix multiplications, so the model needs to take in numbers, not strings. In this problem, we're going to do exactly that.
We're going to do something called tokenization. Tokenization is just a fancy term for encoding whatever your input to the model is. In this case, it would be strings. How will you encode the input? And specifically, how will you break it up? Would you break up your sentence into words or into individual characters? And how will you encode each of those tokens into integers once you've broken them up? That is the process of tokenization.

We're actually given two lists: one list of positive strings and one list of negative strings. So we can imagine that the data processing we're doing in this problem will be used for a sentiment analysis model, which is just an AI model that can detect emotion within text. That's actually going to be the next problem we solve, and I highly recommend you solve that problem after this one. One thing I wanted to clarify is that in this set of problems, the goal is not parsing and processing data. Instead, we're going to focus on how these models actually work. But we do still need to do one or two problems on how to set up the data sets so we can feed them into the models for training and ultimately get the outputs we want. The problem tells us that the lexicographically smallest word should be represented as one, the second smallest as two, and so on. That's the rule we're going to use to encode each word as an integer. And in the final tensor that we return, we should list the positive encodings before the negative encodings, just because the positive input comes first. So we'll process them in that order.
Let's just make sure we understand the example. The first sentence is "Dogecoin to the moon." We can imagine we might be feeding these examples into some sort of sentiment analysis model that detects emotion in tweets. That's just one application of this kind of model. Maybe this fuels some sort of stock trading algorithm after our model has detected the emotion in, say, Elon Musk's tweets, and based on that we might buy or sell a certain stock. We can see that "Dogecoin" is encoded as one, "to" is encoded as seven, "the" is encoded as six, and "moon" is encoded as four. And there is actually a padding token: the zero will be used for padding, and it will be added to the end to ensure that the first row in this tensor is the same length as the row for the second sentence, which has one more word.

We don't actually have to implement this padding ourselves. There's a function in PyTorch that we can call. All it needs to take in is our list of variable-length tensors. So it takes a normal Python list of PyTorch tensors that each have variable length, and it will automatically pad them all to match the longest tensor in that list, so we end up with a rectangular, non-jagged tensor. The default padding value is zero. And we should set batch_first to true when calling this function, so that we end up with a tensor that is 2n by T, where T is the length of the longest sentence and n is the number of positive examples as well as the number of negative examples, so we have 2n in total. If we don't set batch_first to true, the returned tensor will actually be T by 2n, which is a bit weird. So we just need to set batch_first to true.
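A short sketch of that padding call. The integer encodings here are made up; the second sentence is one token longer, so the first gets a trailing zero.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two variable-length encoded sentences (integers are made up).
tensors = [torch.tensor([1, 7, 6, 4]),
           torch.tensor([2, 9, 5, 3, 8])]

# batch_first=True gives a 2n-by-T tensor; the default padding value is 0.
padded = pad_sequence(tensors, batch_first=True)
print(padded.shape)  # torch.Size([2, 5])
print(padded[0])     # tensor([1, 7, 6, 4, 0])
```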
So our goal here should be to construct a mapping: a dictionary that maps every single unique word in our input data set to an integer. How would we build this dictionary, given that we have to sort the words in lexicographic order? Well, if I had a sorted list of every unique word in our input data set, this would be really easy, because I could just iterate through it to construct the dictionary. At each index we would add one, since we want to reserve zero for padding: the lexicographically smallest word gets one, the second gets two, and so on. The values in this sorted list would be the keys in the dictionary, and the index plus one (since we would obviously be using zero indexing) would be the values. Once I have this mapping, to generate the final return data set, I would iterate through the two lists that are given, and for every word in each sentence, I would query this dictionary to get the encoded integer for that string and append it to the appropriate list for the corresponding sentence. Then we call the pad function at the end to make sure we end up with a 2n by T tensor.

So how do we actually build this dictionary? Earlier I said that we need the list of all the unique words in our data set, with all the repeated words removed, because a given word could appear more than once in some arbitrary list of strings that's given to us. If I had a list of all the unique words and I sorted that list, then we could construct the dictionary as I talked about earlier. So it's just going to come down to collecting all the unique words, eliminating repetitions in the input data set. A data structure that helps with that is a hash set, or just a set in Python. If I go through the input list of strings and add every single word to a set, then that set will contain all the words we need. We can then convert that set into a list, sort it, build the dictionary, query the dictionary for every word in the input, and return our final desired result.
So let's add all the words in our input strings to a set. We'll just call this our vocabulary. We're going to have to split up each sentence, which is a string, into words, and we can use .split() for that to get words separated by spaces. So we can say: for each sentence in positive, and then for each word in sentence.split(), which returns a list of all the words in that string, we call vocabulary.add(word). We do the exact same for the list of negative-emotion sentences: for each word in sentence.split(), vocabulary.add(word).

Then we can convert this set to a sorted list by calling sorted(), so that we get the sorted list we talked about earlier. Now we build our word_to_int dictionary by iterating through the sorted list: for i in range(len(sorted_list)), we map every word to an integer. The key is sorted_list[i], whatever word is at that index, and the corresponding value is i + 1, since we want the smallest word to have value one, the second smallest to have value two, and so on.

Now let's encode every sentence as a tensor of integers. This list of tensors will have size 2n, where n is the length of positive as well as the length of negative, and every element in this list will be a tensor. Then we'll call our padding function to ensure all the tensors have the same length. The way we'll code this up is to first convert every sentence to a list of integers and then call torch.tensor to convert each list into a tensor, since tensors aren't something you can dynamically append values to the way you can with lists in Python. So for each sentence in positive, and for each word in sentence.split(), we create a new list, cur_list, and append word_to_int[word] to it, getting the integer conversion for that word. Once we're done with that, we append torch.tensor(cur_list) to our tensors list; that's the tensor we want for that sentence. Then we do the exact same for negative.

Now we just need to pad our tensors, and that's exactly what we'll return: nn.utils.rnn.pad_sequence. All we have to do is pass in our list of tensors, where those tensors don't necessarily have the same length, so that's tensors, and set batch_first=True. This function uses zero as the padding value by default. And we're done, and we can see that the code works.
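The steps just described can be consolidated into one sketch. "Dogecoin to the moon" is the positive example from above; the negative sentence here is invented for illustration, so the exact integers it receives are an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Collect unique words in a set, sort them, map each word to index + 1
# (0 is reserved for padding), encode every sentence, and pad.
def tokenize(positive, negative):
    vocabulary = set()
    for sentence in positive + negative:
        for word in sentence.split():
            vocabulary.add(word)
    sorted_list = sorted(vocabulary)
    word_to_int = {sorted_list[i]: i + 1 for i in range(len(sorted_list))}
    tensors = []
    for sentence in positive + negative:
        tensors.append(torch.tensor([word_to_int[w] for w in sentence.split()]))
    return pad_sequence(tensors, batch_first=True)  # 2n x T, padded with 0

# Negative sentence is a made-up five-word example.
out = tokenize(["Dogecoin to the moon"], ["I will short Tesla today"])
print(out)
```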
If this was helpful, feel free to let me know. Next, I would highly recommend jumping into the sentiment analysis problem. With all the problems you've done so far in this series, you're finally ready to code up an AI model that can detect emotion.
Okay, sentiment analysis. This is actually a large area of study in the field of NLP, or natural language processing. Given some text that might be as long as a sentence, a paragraph, or maybe even an entire page, we would like to feed it into some kind of model. This is usually going to be a neural network; they tend to perform best on sentiment analysis. The model should ideally output whether the input text was positive or negative. If the text was something like "that movie was okay," that would probably lean more towards negative. If the text was something like "that movie changed my life," it would probably be positive. Sentiment analysis actually has a lot of applications. One example: we often want to apply sentiment analysis models to scraped tweets, because we know tweets are really influential in affecting the stock market. If we can set up some kind of pipeline that scrapes tweets and feeds them into our sentiment analysis model, then maybe we can get a gauge on where certain stocks are going to go. One of the most important concepts in this kind of neural network, in terms of how it actually works at the low level, is something called embeddings. Embeddings are used all the time in NLP; they're actually the first step in ChatGPT. So let's go and understand embeddings.
First, I'll give a high-level explanation of what embeddings are, and then we'll explain how they work at the lowest level. An embedding is a vector representation, learned through training, of every word or token in the total set of words the model could recognize. Say our sentence is "I loved that movie." As you may have seen in the NLP intro problem, before we even feed a sentence into a model, we have to associate each word with some kind of integer. This is arbitrary, but it does need to be consistent, so the model encodes strings as numbers. Let's just say "I" ends up becoming zero in our vocabulary, which might be, say, 500 words; "loved" becomes two; "that" becomes one; and "movie" becomes four. There could be hundreds and hundreds of words, but these were the mappings for those words. Say we feed this into the model: the vector 0, 2, 1, 4.

The first step in the model should actually be the model understanding the meaning of each of these tokens independently. The model needs to generate some sort of actual meaningful representation of each word. For each token, we need a vector that encapsulates its information, because the encoding 0, 2, 1, 4 is completely arbitrary: there's no actual meaning encoded in it, and it's not helping the model learn any kind of relationship. So for every single token, we want to associate some kind of vector with it, and this is going to be learned through training. These are weights that are learned through training.

You'll always choose your embedding dimension, which is the size of the vector we learn for each and every token. The higher this number is, the more complex a relationship our model can pick up on. Let's just say we chose an embedding dimension of two, just for a simple example; we'll use much higher numbers in actual models. For the token "I," let's say we learned that the vector that represents it should be something like this. And for the token "that," which ended up having integer value one, we learned something like -1 and 2.6. These are weights that will be learned and updated through training. They start off as completely random numbers, but over the course of training they get updated as we minimize the loss over some number of training iterations, and these embeddings actually start to make sense. After training, if you ever plot your trained embeddings (say the embedding dimension was two, with axes W1 and W2), what you'll find is that words that are similar in the language end up very close to each other. If you were to plot the embedding for the word "man" after training, and then plot the embedding for the word "woman," you might find that they're very close to each other. This shows that the model has actually learned some sort of relationship, some sort of meaning, for every single token.
So how does the embedding layer actually fit into our neural network? Our input is going to be something of size B by T. B is our batch size, or how many examples we're independently processing in parallel. T would essentially be the length of the longest sentence if we pad everything into one rectangular tensor. Say our first sentence is "I loved that movie" and our second sentence is "I hated that movie." Of course, we would not be passing in strings; we would be passing in the integer representations of these strings. But this might be our input: here B is 2 and T is 4, so this is our B by T tensor.

Then we have the embedding layer, nn.Embedding, declared in the constructor for the neural network. We'll explain how this actually works at the lowest level next, but for now, recognize that it has to output a B by T by embedding-dimension tensor, because for every single token, at every single time step in the sequence, we generate a vector of size embedding_dim. That will be a learned vector, trained through gradient descent, and it should encapsulate the meaning of the word; like we said earlier, if we plot it, it would even make sense. To clarify what that looks like in the code: if you check out the documentation, when you instantiate your nn.Embedding layer, there are two things you need to pass in. The first is your vocabulary size: how many different tokens or words in total does this layer need to learn representations for? The second is the size of that representation: how complex should the representation be that we learn for each token? That would essentially be the embedding dim, and that would be the second input.

If you look at the documentation for nn.Embedding, you'll see it referred to as a lookup table. What this layer is essentially doing, for every single token in our input, is looking it up in a table. You can think of the table as having vocab_size rows and embedding-dimension columns. The layer goes into the lookup table, finds the corresponding row for every token, and plucks out its feature vector, its embedding, to pass downstream into the later part of the neural network. So now let's explain what we actually mean by lookup table.
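A short sketch of that lookup-table behavior. The two constructor arguments are the ones just described; the sizes and token integers here are made up.

```python
import torch
import torch.nn as nn

# nn.Embedding(vocab_size, embedding_dim): a vocab_size x embedding_dim table.
vocab_size, embedding_dim = 6, 2
emb = nn.Embedding(vocab_size, embedding_dim)

# A B-by-T batch of token integers (B = 2 sentences, T = 4 tokens each).
tokens = torch.tensor([[2, 0, 1, 4],
                       [2, 5, 1, 4]])
vectors = emb(tokens)
print(vectors.shape)  # torch.Size([2, 4, 2]): B x T x embedding_dim

# Each output vector is literally a row of the table.
print(torch.equal(vectors[0, 0], emb.weight[2]))  # True
```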
We know at a high level what nn.Embedding takes in and what it outputs, but let's explain how it actually works at the lowest level. Let's pretend these are some of the words in our vocabulary, with a vocab size of six, so our integers range from 0 through 5. Say these are the encodings in our dictionary for this vocabulary: "I" maps to two, "loved" maps to zero, and so on. One way we could represent this input and feed it into a neural network is one-hot encoding. This one-hot input encodes the same information, and the size of this tensor is T by vocab_size. This isn't how we'll actually feed it in, but it is one way we could. Here we have the representation for "I": at the second index there is a one, and there is a zero everywhere else, indicating that this row represents "I." The next row has a one at the zeroth position and a zero everywhere else, so that is "loved." Following that pattern for the other rows, we see the same thing. So this is a one-hot encoding of the input.

Now let's say the lookup table is over here. Its size will be vocab_size by n_embed. Pretend the zeroth row of this tensor contains the feature vector for whatever is token zero in our vocabulary, which is "loved." The first row contains the learned, trained feature representation for the token with index one in our vocabulary, and so on all the way down. If you actually do the matrix multiplication and think about what it's doing, we're doing row times column: we take this row times this column, then the same row times the next column, all the way through the rest of the columns. The only nonzero entry in a one-hot row is a single one, so when we multiply that row with every column, we're only plucking out the entries of the corresponding row of the lookup table; everything else gets ignored by the zeros. So if the tokens correspond to row zero, row two, and row three of the table, those are exactly the rows we pluck out. The result of this matrix multiplication, T by vocab_size times vocab_size by n_embed, is T by n_embed, or T by embedding_dim. That means that for every token at every time step, we have plucked out, or generated, its feature vector. So this is just one way of plucking out the appropriate rows from the table, which is what the embedding layer needs to output.

If we look at the neural network I've drawn, the input layer has vocab_size neurons, and in every neuron we have either a one or a zero depending on whether that token is in our input; that's the same as the one-hot encodings. And here we have embedding_dim neurons, which is essentially out_features for a linear layer. This tells us that the nn.Embedding class, which we're going to treat with a little bit of abstraction and use as the first layer in our neural networks to generate the embeddings, behaves exactly like a linear layer applied to one-hot inputs: specifically, a linear layer without a bias, where in_features is vocab_size and out_features is the embedding dim.
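That equivalence can be checked directly. This sketch, with made-up sizes and token integers, compares the one-hot matrix multiplication against the embedding lookup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The embedding lookup equals a one-hot matrix times the weight table
# (i.e., a bias-free linear layer on one-hot inputs).
vocab_size, n_embed = 6, 2
emb = nn.Embedding(vocab_size, n_embed)

tokens = torch.tensor([2, 0, 1, 4])              # T token indices
one_hot = F.one_hot(tokens, vocab_size).float()  # T x vocab_size

via_matmul = one_hot @ emb.weight                # (T x vocab) @ (vocab x n_embed)
via_lookup = emb(tokens)                         # T x n_embed

print(torch.allclose(via_matmul, via_lookup))  # True
```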
So that should make sense; leave a comment if it didn't, and I can definitely explain it in a different way. nn.Embedding is effectively nn.Linear applied to one-hot inputs. All we actually have to feed in are the tokens. Say we had "I loved that movie," and "I" mapped to zero, "loved" mapped to two, "that" mapped to one, and "movie" mapped to four. All we pass in is that, and obviously we could have more than one example, so there could be multiple rows; it is B by T. Then we pass this into nn.Embedding. We treat it as a black box, our embedding layer, but on the inside, what this layer effectively does is convert the input into a one-hot encoding and do the matrix multiplication with the weight table we talked about earlier. So I just wanted to explain that: the embedding layer behaves like a linear layer on one-hot inputs, with weights that are learned, but we can treat it as just a lookup table that fetches the feature vector for every token.
Okay, let's go over the architecture one last time. Our embedding layer outputs a tensor of size B by T by C, where C is the embedding dimension. Then we do an averaging. This won't really be a layer defined in the constructor; it will be a function, torch.mean, called in the forward method, and it outputs a B by C tensor, getting rid of the time dimension by averaging along it. You can think of this as initially having a vector for every single time step in a sequence, but then we average them all together, so that for every single example along the batch dimension we have one vector of size embedding dimension summarizing, or encapsulating, the meaning of that entire sentence. Then we apply a linear layer, which will simply have a single neuron: out_features will be one, so we get a tensor of size B by 1. For every single element, we want a single number that can be interpreted as a gauge of how positive or negative that sentence is. And lastly, we apply a sigmoid layer, which gives us a number between 0 and 1 for every single input example in our training batch. This allows us to say that one is completely positive and zero is completely negative. And that's the architecture for sentiment analysis. I'd recommend trying to code it up now.
Okay. In this problem, we're going to solve sentiment analysis. Our task is to code up a neural network that can recognize positive or negative emotion in an input sentence. I'll assume that you're familiar with the idea of a neural network. We want to be able to feed in a sentence like "the movie was okay". So the input to this model is going to be some kind of sentence. It could be just one sentence, or it could be multiple sentences, but we know that it will be a string: "the movie was okay". And we actually want the model's prediction to be a number between zero and one. So for something like "the movie was okay", maybe it would output something like 0.5, or maybe 0.4 because it's slightly more on the negative side. We want to build a model that can actually detect and assess the emotion within an input sentence.
The problem says that this is actually an application of word embeddings. By this point we're familiar with the idea of neural networks: we have nodes, and these nodes are connected in a way that we have a bunch of numbers being multiplied with each other. Each of these can be thought of as a layer, this one being the input layer and the second one being, say, a hidden layer. It turns out that in ChatGPT's neural network, which we're going to work up to coding, the first layer is actually called an embedding layer. Embeddings are the core concept within this problem, and one of its main benefits is that it teaches us how to actually use word embeddings within a neural network. So, let's get into it. I would highly recommend checking out the detailed background video on word embeddings, but if you need a refresher, we're also going to explain it in this video. The problem tells us the model architecture to use; that's over here, and that's what is going to be defined in the constructor. Then we have to code up the forward method, which will return the model's prediction for some sort of input sentence. We'll come back to the model architecture later, but it essentially explains the layers of the neural network that we're going to use. We are told that we have to code up both the constructor and the forward method within this class. We do not want to actually train the model, and we do not want to code up the gradient descent loop.
what does the model actually take in as input? We're going to take in vocabulary
input? We're going to take in vocabulary size. Vocabulary size is going to be an
size. Vocabulary size is going to be an integer that represents the number of different words that the model should be able to recognize. And of course, we're going to explain embeddings in a bit more detail in a bit. But if you're
familiar with word embeddings, we know that embeddings are actually just a lookup table. For every possible word or
lookup table. For every possible word or token in our vocabulary, we want to be able to fetch or query the embedding the embedding vector that is actually a
learned and trained embedding vector. It
consists of weights that is learned through gradient descent. But regardless
of how we look at it, there are some fixed set of words the model should be able to recognize and that is essentially the number of rows in this lookup table. And of course, we are also
lookup table. And of course, we are also given a list of strings X each with negative emotion. So let's go ahead and
negative emotion. So let's go ahead and take a look at those actual examples.
The vocab size and X, the list of strings. If we take a look at these inputs, we can see that vocabulary size here is just a number: 170,000. This might seem a bit weird, but remember that vocabulary size is not the number of unique words in your actual input sentences; we'll describe what these numbers are in a second. Vocabulary size is just the number of words that your model should be able to recognize, so this is roughly the number of words in the English language. But we might actually have a very small subset of those words in the list of strings that we want the model's emotion prediction for.
It is highly recommended that you have solved the problem NLP Intro before solving this problem, but if you haven't, it's okay. Essentially, what that problem teaches us is what happens after we decide on our tokenizer. Our tokenizer in NLP is just how we're going to split up the words in a sentence. Let's say for simplicity that we're going to split up the sentence based on spaces, so we break it up into individual words and feed a list of words into the model. But after we've done our tokenization, we actually have to convert each word, each string, to some sort of number. These neural networks, these models, only understand numbers. To actually do all the matrix multiplications that make up a neural network, as well as calculate all the derivatives needed to optimize the model, the model has to be dealing with numbers as input, specifically just vectors and matrices of numbers. So for every single token, for every single word ("the", then "movie", then "was", then "okay"), we actually need to assign a consistent mapping between numbers and these strings. And I've already done that as I created the test cases for this problem.
So let's assume that this first sentence that is passed in is "the movie was okay" — a slightly negative, slightly neutral sentence. That's actually what's represented here as the first list within X: "the movie was okay". Because we don't want this to be a jagged tensor — we want it to be rectangular, with the number of columns in the first row matching the number of columns in the second row (we'll get to that second sentence in a bit) — you can see that after the word "okay", which is represented over here, we have just padded the rest of this row with zeros. The model will learn to just ignore the zeros, as we'll explain later. Then for the second sentence, we have "I don't think anyone should ever waste their money on this movie". This one is again split up word by word, with each string encoded as a number, and that is the second row within X represented over here. If you check the mappings between the strings and the numbers, you'll find that it is consistent.
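To make that concrete, here is a toy sketch of the tokenization and zero-padding just described; the word-to-number mapping is made up for illustration and does not match the actual test-case vocabulary:

```python
# Hypothetical mapping; the real test cases use a fixed ~170,000-word vocabulary
word_to_id = {"the": 1, "movie": 2, "was": 3, "okay": 4}

def encode(sentence, row_length):
    # Tokenize by splitting on spaces, map each word to its id,
    # then pad with zeros so every row has the same number of columns
    ids = [word_to_id[w] for w in sentence.split()]
    return ids + [0] * (row_length - len(ids))

row = encode("the movie was okay", 6)
assert row == [1, 2, 3, 4, 0, 0]
```

Every row in X ends up the same length, so the batch stacks into a rectangular B by T tensor.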
And we can see that for the first string within X, the first example, the model is supposed to output something like 0.5, because that's kind of a neutral sentence. We could say it's a bit negative, so maybe 0.4, but 0.5 does encapsulate how neutral that sentence is. This second sentence, though — the second row in X — is a strongly negative sentence. And we said that zero was the most negative a sentence could be; something like "I hated that movie. It was utter garbage." might be very close to zero. So we say that the model should output 0.1, something really close to zero and definitely far away from one, for that second example. So this is essentially the output for the forward method. And vocabulary size is something that the constructor receives; it's just an attribute of the neural network. Also, I just want to clarify that the outputs of 0.5 and 0.1 are there to help you understand the shape of the inputs and outputs. If you print your model's prediction after solving this problem and submitting the code, it won't exactly match these numbers, because we won't be training the model. The model's prediction before you do any training — when you have a random initialization of the weights and numbers inside the neural network — won't give you these nice predictions that actually detect the emotion. To do that, you have to run a training loop, and there is a separate problem for that in this playlist where you actually write the training loop. But don't worry: there's a Google Colab — just a notebook of Python code where you can click run on each cell and see the output — linked in the description for this problem, and it will use the exact code that you write here. There's no more code you have to write after you get your accepted solution on this platform. Once you solve the problem, you'll be able to see the model being trained on Colab. I have comments and text blocks explaining what each cell is doing, and you'll be able to see the model actually achieve this performance. We even have some more interesting examples to see that the model can learn to detect emotion, and we're going to use your solution code exactly in this Colab notebook. So definitely don't check out the Colab notebook until you solve the problem.
Now let's explain the model architecture. If the input into our model is B by T, where T is the number of words in each sentence, then the first step should be, for every single token, for every single word at each time step, to get the embedding vector for that word or token. A quick crash course on what embedding vectors are: it's essentially the model learning to represent the meaning of a word or a character (in this case, words) in numbers. Once we're actually done training the model, these embeddings will make sense. Let's say our embedding dimension was two: for every single word, we would learn a vector of size two, with two entries, and these numerical vectors are supposed to encapsulate the meaning of the word. What we would find after we're done training, once we have learned the right representations for each word, is that similar words end up close together when you graph them. Obviously, in the problem we're going to use dimension 16, which is not graphable, but let's assume it was two-dimensional, a dimension we can graph. Then you'll find that similar words are plotted next to each other once the training process is done. You might have "man" over here and "woman" over here, because these words are related in the English language and have some similarities in their meaning, but there is obviously a slight difference between them; the vectors are not entirely on top of each other.
So when we say we want to use an embedding layer of size 16, it means that in the constructor for this model, we need to define that lookup table that's eventually going to be trained. The number of rows in that table should just be the vocabulary size: for every single token in our vocabulary, we need to learn a row of that embedding table. So the table should be vocabulary size by whatever our embedding dimension is, and that's just a parameter that's up to us, as the designers of the neural network, to choose. In this case, you can see that two is the embedding dimension, but as we choose higher and higher numbers — 16, 32, 64, 128, and so on — the model learns a more and more complex representation and can understand language at a deeper level. So as we instantiate our embedding layer, we're just going to use nn.Embedding in our constructor and pass in the vocabulary size as well as the embedding dimension, which is 16. In the forward method for our model, we'll just call the forward method of this nn.Embedding instance, and that will get us the embeddings for every token. The output of that would be B by T by E, where E is the embedding dimension, because for every single token we would have an embedding vector. So that would actually be the first line of code in the constructor — declaring this nn.Embedding instance, where V is the vocabulary size — as well as the first line in the forward method, where we call the forward method of the embedding layer and get our B by T by E tensor.
However, for this series of vectors that we have for each sentence of length T, we have a bunch of vectors, each of size E. We need some way to combine or aggregate this information into a single vector, so the model takes into account the information from every single word in a sentence. The way we're going to do that is we're just going to average the embeddings across all the time steps, for each example across the batch dimension. To make that super clear, we want to take our B by T by E tensor and end up with a B by E tensor. You can think of that as: for every single batch element, we have a T by E tensor, T rows and E columns. What you want to do is compute the average of all the rows, because every single row is a time step with E columns, and we want to end up with one single vector of size E. So you take this row, this row, this row — all the rows — and average them together until you end up with one horizontal vector of size E. Then, for every single batch element, for each independent sentence or string that's passed into this model, we will have one vector of size E encapsulating that sentence and its meaning. That's then going to be used as input to the final linear layer of the model, which we'll get to in a bit, and which would then get us a single number for every single element across the batch dimension. In code, how we would do that in the forward method is take embedded, which is B by T by E, pass it into torch.mean, and say dim=1, because that's the dimension we want to collapse out, ending up with something that is B by E. This averaging is actually called the bag-of-words model in NLP. The reason it's called that is you can think of taking all these vectors, for every single word in a sentence, and just jumbling them all up together, mixing them, averaging them in a single bag, so to speak. You're not worrying about the fact that one word comes after another, not worrying about the sequential order of the words in the sentence; you're just taking all of them and averaging their embedding vectors into a single bag. And that's why we call it the bag of words.
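The averaging step is a single torch.mean call; this small sketch (with made-up shapes) confirms that the B by T by E tensor collapses to B by E:

```python
import torch

B, T, E = 2, 4, 16
embedded = torch.randn(B, T, E)            # one E-dim vector per token
averaged = torch.mean(embedded, dim=1)     # collapse the time dimension

assert averaged.shape == (B, E)
# Each output row is the average of that example's T token vectors
assert torch.allclose(averaged[0], embedded[0].mean(dim=0))
```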
So the next part in the neural network is to take these 16 neurons. Once we have our B by E tensor — B by 16 — we know that for every batch element (remember, we look at those independently, in parallel) there are 16 numbers: a feature vector of size 16. We want to collapse that into a single number, one number that encapsulates the model's prediction for that example. So this is one neuron, just one number. And we know that this neuron, if you're familiar with linear regression, should have weights W1 through W16 and, of course, that optional constant term or bias B. This linear layer is then going to learn the values of W1 through W16 as well as the bias, such that we can minimize the error and the model actually has a valid prediction for this one number. However, this number is not guaranteed to be between zero and one as we desired. That number is just some number: it could be negative, could be positive, could be greater than one, could be like negative 5,000 — it doesn't have any restrictions. And we want to squash the model's predictions into the range of 0 to 1. That's where the sigmoid function, which we used in previous problems like the handwritten digit classification problem, becomes very useful.
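As a quick illustration (the input numbers are chosen arbitrarily), sigmoid squashes any unbounded linear output strictly into the interval (0, 1):

```python
import torch

raw = torch.tensor([-20.0, -1.0, 0.0, 2.5])  # unrestricted linear-layer outputs
squashed = torch.sigmoid(raw)                # every value now strictly in (0, 1)

assert torch.all((squashed > 0) & (squashed < 1))
assert torch.isclose(squashed[2], torch.tensor(0.5))  # sigmoid(0) = 0.5
```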
So we will toss in an nn.Sigmoid call. We'll define it in the constructor and then call the forward method of nn.Sigmoid at the end of the forward method for this problem, such that we end up with — sorry, not a B by E tensor — a B by one tensor, where all those numbers are between zero and one. Okay, now let's jump into the code.
We know the first thing that we need to declare in the constructor is the embedding layer; that's the first layer of the neural network. So we can go ahead and say self.embedding_layer = nn.Embedding. The first thing that nn.Embedding requires us to pass in is the number of rows of this trainable table, which is going to be vocabulary size. And the size of each vector will be 16; that's the size of each trainable feature vector for each word. The next layer that we need — well, first we're going to do the averaging, but that occurs in the forward method — the next trainable layer is the linear layer. So we'll say self.linear = nn.Linear, and we know the input number of neurons is 16 and the output number of neurons is just one. And of course we need our sigmoid layer, so we can say nn.Sigmoid, and that's it for the constructor. Next we can move on to writing the forward method, also known as get_model_prediction. We know we need the embedded version of the input, so that would be self.embedding_layer(x); that's just calling the forward method of this embedding layer module, which is itself a subclass of nn.Module, meaning that it is a neural network model, or at least we can think of it as a trainable layer which would make up a neural network model. Then we can go ahead and average, because we want to get the B by embed_dim tensor, just as the comment says over here. So that would be averaged = torch.mean(embedded), and again, it's B by T by embed_dim and we want to squeeze out the T, so we say dim equals not zero, not two, but one. We also want to pass this into the linear layer, so we would say something like projected — and the reason this name makes sense is that you're kind of projecting this vector of size 16 down into a single number — so you can say self.linear of averaged. And lastly, of course, we need to get these between zero and one to interpret them as emotion, zero being completely negative and one being completely positive. So that would be predictions = self.sigmoid of projected, and this is what we actually want to return. We just need to round our answer to four decimal places, so let's return torch.round of predictions with decimals equals 4. And we're done, and we can see that the code works.
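Putting the dictated pieces together, the full solution looks roughly like this (layer and variable names are my own choices; as noted above, the untrained output is essentially random):

```python
import torch
import torch.nn as nn

class SentimentModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, 16)  # vocab_size x 16 table
        self.linear = nn.Linear(16, 1)                       # 16 features -> 1 number
        self.sigmoid = nn.Sigmoid()                          # squash into (0, 1)

    def forward(self, x):
        embedded = self.embedding_layer(x)      # B x T x 16
        averaged = torch.mean(embedded, dim=1)  # B x 16 (bag of words)
        projected = self.linear(averaged)       # B x 1
        predictions = self.sigmoid(projected)   # B x 1, between 0 and 1
        return torch.round(predictions, decimals=4)

model = SentimentModel(vocab_size=170_000)
out = model(torch.tensor([[1, 2, 3, 4, 0, 0]]))  # one zero-padded example
assert out.shape == (1, 1)
assert 0.0 <= out.item() <= 1.0
```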
Embeddings are a super important concept in natural language processing, and they're actually the first layer in the neural network for ChatGPT. I definitely recommend understanding the code for this problem, as well as the concepts behind it, super well. If you need to, you can refer back to the background video linked in the description; that's a whiteboarding video where I explain embeddings in more detail. Now that we've solved this problem, the next problems to jump into — part two in this series of problems — are all going to be building us up towards coding ChatGPT: building the data set to train a ChatGPT replica, as well as coding up the neural network layers themselves. So hopefully I'll see you soon.
In this video, we're going to solve this machine learning programming question. It's part of a collaborative project between my colleague Nab and me.
We'll go over the description and a basic test case on the left and solve the problem on the right. The problem is titled GPT Dataset. For large language models like ChatGPT, a very special kind of data set is used for training, and this programming problem asks us to build and return that data set. Let's go over the concepts behind this problem before jumping into the code. You've probably heard that LLMs are trained on the entire internet, and this is true. We can simply feed massive chunks of text into the model, and there's no need to label each sentence or paragraph. This is different from training, say, a sentiment analysis model that classifies each input sentence as positive or negative; the data set to train a sentiment analysis model would consist of sentences where every data point is labeled as positive or negative. So that's the good news: to train an LLM, we don't need to label any data. We can effectively just feed the raw text into the model, and during training, if we passed in this giant block of text, the model learns to predict the next word in the sequence for every possible context. So the model learns that after the context "cricket", the word "is" can come next. The model learns that after the context "cricket is", the word "a" can come next. The model learns that after the context "cricket is a", the phrase "bat and ball" can come next, and so on.
The way LLMs learn to read and write in a language is by memorizing all the likely sequences of words that can come up. They're effectively just large autocomplete models that can predict what word comes next in a sequence extremely well. When we actually prompt the LLM, it generates the response one word at a time, based on what's most likely to come next. Formally, we would say that the neural network is learning a probability distribution over the entire language. But don't worry: this video doesn't require a background in neural networks or probability. Okay, so why does this programming question even exist? Why can't we just feed the entire data set as one long sequence into the model for training? It's because of a limitation in the transformer architecture known as context length. To summarize, there's a maximum number of words that an LLM can remember or process at once; the LLM forgets, or more accurately cannot factor in, words that are outside of this window. The context length is a hyperparameter for training an LLM, meaning that we decide its value before training the model. GPT-4 has a context length of 32,000 words, and the code that we'll write at the end of this video will depend on the context length passed in. So at every iteration of training, we want to select a random sequence from the entire, very large training data set and have the model memorize the different autocomplete sequences within it. One more clarification: to make training more efficient, we don't just pass in one random sequence at every iteration; we actually pass in multiple of them. The batch size hyperparameter tells us how many sequences the model will learn from at each iteration. Those batch-size different sequences have nothing to do with each other; they're completely independent, and the model learns from them in parallel.
Okay, now we're ready to jump into the code. Small side note: if you're enjoying the video, it'd be great if you hit like so that YouTube can recommend more of these videos to you. Back to the code. This function would be called at every iteration of training to generate the batch. Let's take a quick look at the example test case. The raw data set, context length, and batch size are provided to us. We have to return X and Y, where both X and Y have length equal to the batch size of two. The first entry in X corresponds to the first entry in Y, and the second entry in X corresponds to the second entry in Y. The first random sequence chosen is "darkness my old". We know that "my" follows "darkness", "old" follows "my", and "friend" follows "old". The same logic applies to the second random sequence chosen. Okay, the first step is to split the input string into a list of words, since we're operating on a word level. Next, we want to randomly generate batch-size different starting indices. That way, we can simply grab the context-length words that follow from each starting index.
We'll use PyTorch's randint function. If you're looking to learn PyTorch, I have a couple of intro-to-PyTorch videos already uploaded, and I'll be releasing a more visual, animated tutorial in a few days. But the good news is that making this function call won't require significant PyTorch knowledge. We just need to specify the highest random number that could be chosen as a starting index. That should be the number of words minus the context length, and this range is exclusive of that number, so that we don't go out of bounds of the data set. So let's specify low and high, and also specify how many random numbers we want to select. Lastly, let's convert the returned tensor to a normal Python list. Next, we simply need to grab the sequences X and Y. We'll set up the lists, iterate over all the starting indices, index out a sequence to store in X, and index out a sequence to store in Y, which is effectively the same sequence, just shifted one unit over, since Y contains the words that need to be predicted via autocomplete. Append those to X and Y, and we're finished. Our code passes the test cases as well.
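Assembled from the steps above, a batch_loader sketch might look like this (the sample text is for illustration, and the real problem may expect token-id tensors rather than word lists):

```python
import torch

def batch_loader(raw_dataset, context_length, batch_size):
    words = raw_dataset.split()
    # Random start indices; high is exclusive, so sequences never run past the end
    starts = torch.randint(low=0,
                           high=len(words) - context_length,
                           size=(batch_size,)).tolist()
    X, Y = [], []
    for s in starts:
        X.append(words[s : s + context_length])
        Y.append(words[s + 1 : s + 1 + context_length])  # same sequence, shifted by one
    return X, Y

text = "Hello darkness my old friend I have come to talk with you again"
X, Y = batch_loader(text, context_length=3, batch_size=2)
assert len(X) == len(Y) == 2
# Y is X shifted one word over: the words the model must predict
assert all(x[1:] == y[:-1] for x, y in zip(X, Y))
```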
And that's the explanation and solution to GPT Dataset. If you made it to the end of the video, you might enjoy the ML community; just drop your email at the link in the description to learn more about it and receive more free ML resources. I hope you found this programming question useful, and I'll see you soon.
Okay, let's solve GPT Dataset. Before we start building the actual GPT class and talking about how we can generate text from transformers, we should do one problem where we prep the data set that's actually used to train GPTs. You may have heard that ChatGPT was trained on the entire internet, where the entire internet can be thought of as some body of text. In this problem, we're going to break down what that means in code: how we can take some giant body of text and, from it, create examples for the model to keep learning to predict the next token based on different contexts. These chatbots, these large language models — the way they work is they keep predicting the next token, which might be the next character, the next word, or the next subword. They keep doing this over and over again until they have some big paragraph that they've given back to you. What they're really good at is predicting the next token in a sequence: given some words, some incomplete sequence, they can complete it. It's like autocomplete on steroids. But the way they actually learn to do that — or rather, the data set that helps them do that — is what we're going to learn in this problem. We're going to write a function called batch_loader which is going to set up a batch of training examples, and that tensor needs to be of size batch size by context length. We also need the appropriate labels for this data set, so that during training we can calculate the loss, or the error, between the model's predictions and the correct labels, the true answers. We will explain what context length and batch size are soon.
And there's just an implementation tip on what function we're supposed to use.
So we're given some string, which is just the raw dataset, before we do any processing on it to actually create the examples. We're also given something called context length: how many tokens back the model can factor into its response to you. How far back is it taking into account, how many tokens back can it quote-unquote read? This is also going to be the length of each training example. Each training example we create, which will be a substring from this giant body of text, is going to be of length capital T, which represents the context size. And then how many sequences, or independent examples, do we want to generate? That's just the batch size. And we just need to return X and Y.

So let's look at an example now. The way we actually get our data points X is we pick batch size, or I'll just refer to that as capital B, this many random different starting indices for our substrings of length capital T. All those starting indices just need to be valid starting indices inside our raw dataset. And when I say valid starting indices, they need to be far enough left in the dataset such that you can actually extend capital T tokens to the right. So we're going to develop those starting indices first. And that's
going to then explain our example over here. So we have "hello darkness my old friend" as a string, capital T = 3, batch size = 2. Let's say the first random index we choose is one. That corresponds to "darkness" being the starting token, and with a context length of three, we would take "darkness my old". So that ends up being our first example in X. The way Y works is that Y contains the tokens that the model is supposed to predict, the right answers. When we have a sequence, it turns out there's a bunch of training examples within it. If we have "darkness my old", well, the model can learn to predict what comes next given a context of just "darkness", and we know "my" comes next. The model can also learn that if you have "darkness my", then "old" comes next. And the model also needs to learn that given this entire sequence "darkness my old", "friend" should come next. That's exactly what you'll find in the corresponding index for Y. The first label, the first right token that the model needs to learn to predict, is "my"; that's given just "darkness". Given "darkness my", we need to predict "old". And given "darkness my old", we need to predict "friend".
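To make the shift concrete, here's a tiny sketch of that example in plain Python (the variable names are just for illustration):

```python
# The example above: T = 3, and the first random starting index is 1.
words = "hello darkness my old friend".split()

T = 3          # context length
start = 1      # starting index chosen for this example

x = words[start : start + T]           # input sequence
y = words[start + 1 : start + 1 + T]   # labels: the same sequence shifted by one

print(x)  # ['darkness', 'my', 'old']
print(y)  # ['my', 'old', 'friend']
```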
The second random index that's chosen is actually index zero, so that would be "hello" as our starting token. And if you follow that same reasoning, then this example should also make sense. So our goal here is just to prep this dataset and specifically return these two tensors X and Y, and both of them essentially contain strings. It's highly recommended that you solve the NLP intro problem before this one, because that one actually explains the encoding step: before feeding this into the model for training, you would encode each word as an integer. You wouldn't directly pass strings into the model. But in this problem, it's okay to just return strings.

So in the code, we're going to start off by splitting up our raw dataset into words, since we know we're going to be taking substrings, or sublists, to get the words in our output tensors. The main thing we need to explain is how to actually use torch.randint. The lowest index is going to be zero. For size, we can pass in a tuple, which is just (batch_size,), since we essentially just want a vector, a tensor of length B, because we want B different random numbers. And we need to figure out what high should be. This upper bound is actually exclusive: it will not include the number passed for high; that won't be a valid random number. That's the case for a lot of Python functions. What we actually want to use is just the number of words minus capital T. Imagine capital T is one; then this is the number of words minus one, essentially the final index. But since high is exclusive, we would not include the final index as a valid starting index, because we want to still have capital T remaining tokens left after the starting index so that we can actually have the complete Y, right? Y contains our labeled answers. So if the starting index was the last index in our list of words, that wouldn't make sense, because there's no word after that for the model to learn to predict.
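As a quick sketch of that bounds argument (assuming torch is available; the variable names are illustrative):

```python
import torch

words = "hello darkness my old friend".split()
T = 3            # context length (capital T)
batch_size = 2   # capital B

# high is exclusive, so sampled indices fall in 0 .. len(words) - T - 1,
# leaving room for the label sequence Y that is shifted one token right.
indices = torch.randint(low=0, high=len(words) - T, size=(batch_size,))

print(indices.shape)  # torch.Size([2])
```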
Once we have our list of words, it just comes down to taking the right sublists. Whatever our starting index is (and of course this is different for X versus Y: Y has a starting index that is one greater, since we start predicting the next token), we just need to go all the way to the start index plus T.

So let's jump into the code. Let's start off by generating our random indices like the starter code tells us to. We can say something like indices. We're going to use torch.randint. We can say that indices is torch.randint, and we just want to set low equal to zero. We want to set high equal to however many words we have, right? So maybe we're going to do some sort of split and call that the words, minus the context length. And then we know the size can simply be a tuple, which is just of size batch size. So that's indices, but we need to know how many words we have; we need to actually get that list of words. So we can say words is raw_dataset.split(). By default, we split on whitespace, which breaks that string up into a list of words.

And now we can actually get X and Y. We can say X is going to be some sort of list, then Y is going to be some sort of list. And then we can say: for each index in indices, we generate each of our batch size different examples. The length of x will be batch size and the length of y will be batch size. So x.append: we can just take a sublist of words from index to index plus context length, and that's it for that entry in x. And then similarly in y, we want to start predicting the next token, so we'd say index + 1 all the way to index + 1 plus, again, context length. And that's it. We simply return a tuple of x and y, and we're done. And we can see that it works. So
this problem teaches us that training a transformer, training a language model, is just about generating a dataset where the model can keep learning to predict the next token in the sequence. So now that we know what the inputs and outputs are, let's start jumping into the neural network architecture for transformers. The next problem is on self-attention: a very complicated problem at first, although we'll definitely break it down and make it fairly intuitive, but it's definitely the crux of how transformers work.
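Before moving on, the batch loader we just walked through can be sketched end to end. This is a minimal version assuming word-level tokens and that returning lists of strings is acceptable, as the problem allows; the real starter code's signature may differ:

```python
import torch

def batch_loader(raw_dataset: str, context_length: int, batch_size: int):
    """Return batch_size input sequences X and their shifted-by-one labels Y."""
    words = raw_dataset.split()
    # high is exclusive, so every sampled index leaves room for the labels
    indices = torch.randint(low=0, high=len(words) - context_length,
                            size=(batch_size,))
    X, Y = [], []
    for idx in indices:
        i = int(idx)
        X.append(words[i : i + context_length])          # input sequence
        Y.append(words[i + 1 : i + 1 + context_length])  # next-token labels
    return X, Y

X, Y = batch_loader("hello darkness my old friend", context_length=3, batch_size=2)
```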
Chatbots in recent years have been all right, but now we have ChatGPT. So, what makes these newer models so effective at reading and writing like humans? It's a concept called self-attention. A Google search of this topic gives some confusing diagrams and equations, so let's break that down. I'll take the first few minutes of the video to give a high-level overview, and then after that, I'll make a second pass through the explanation, but the second time I'll add more of the math and dive even deeper.

So, let's consider these LLMs or GPTs as a black box. We know they take in some sequence of words, like an instruction or a question, and they output the response. But the response is actually generated word by word: the next word is generated, concatenated, the model is then called again, and so on. And inside this black box, the model is doing a ton of math to ultimately make its prediction for the next word in the sequence.

To get some inspiration for teaching computers to read, let's think about how humans read. Do we read each word completely independently, not considering its relation to the words that came before it? No. We read word by word, and each word has some relationship with the words that came before it. When we get to a noun, we realize it's the subject being described by the adjective that came before it. Or when we get to a question mark, we realize that a question was being asked based on a previous word like how, why, or what. We subconsciously consider the relationships between the different words in a sentence in order to totally understand it, rather than looking at words independently.

But how does a model do this? For an input sequence of length T, so T different words, the model calculates a T-by-T matrix of attention scores, where each entry signifies how strongly associated two words are with each other. For example, the word "how" can be used in the sentence "how are you" or the sentence "this is how you write". The word "how" has a slightly different meaning depending on the context, and we see a relatively high score for the connection between "how" and the question mark, indicating that "how" is used a particular way in this sentence. Once the model has this matrix, it knows how important each word is to the previous words, and it can do some calculations to predict the next word.

But here's the crazy part. I know it can be really annoying when people treat training as a black box, but this video would simply be too long if we also went into how training works. So for a 5-minute intro to training, check out the video in the top right. For now, we can just treat training, or learning, as some iterative process where the model gets better at its task of predicting the next word. But the crazy part is that during training, the model develops some complex math formula to calculate the entries of this matrix for the future sentences that will be passed in. The formula is learned from the data it's trained on, which is typically a massive body of text like Wikipedia fed into the model during training. So that's the high-level overview of attention.
The model learns to calculate a number signifying the affinity of every pair of words, which helps to accurately predict what comes next. Now let's dive deeper and make a second pass through this explanation. This time we won't leave out the math involved in calculating this really important attention matrix.

The main idea is that we want the words in a sentence to talk to each other and figure out which ones they should associate with. For example, we want the adjectives and the nouns they describe to seek each other out and associate, ultimately resulting in a high attention score. The way we do this is by having every word emit two vectors: a key and a query. I've only shown it for two words here, but every word emits its own key and query. A word's query vector represents what it's searching for, or querying for. And a word's key vector represents what it has to offer, the information it actually stores or encodes. If a noun is searching for the adjective that describes it, then its query might align with the adjective's key. Then we just take the dot product between every word's query and every other word's key, and those values populate the attention matrix. Remember that the dot product between two vectors is a measure of similarity: the higher the output, the closer two vectors are to each other. But how do you actually calculate the query and key vectors for every word? Linear regression. A simple single-layer neural network is used.
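A minimal sketch of that idea in PyTorch (the sizes here are made up for illustration): two linear layers emit a query and a key for every word, and the query-key dot products fill the T-by-T matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, C = 4, 8                  # 4 words, embedding dim 8 (illustrative)
emb = torch.randn(T, C)      # one embedding vector per word

query_net = nn.Linear(C, C, bias=False)  # emits what each word searches for
key_net = nn.Linear(C, C, bias=False)    # emits what each word offers

Q = query_net(emb)           # (T, C) queries
K = key_net(emb)             # (T, C) keys

scores = Q @ K.T             # every query dotted with every key: (T, T)
print(scores.shape)          # torch.Size([4, 4])
```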
Before a sequence of words is passed into the attention layer, we have every word represented as an embedding vector. So we have one linear network that takes in the embeddings for each word and outputs the query vectors, and another linear network that also takes in the embeddings but outputs the key vectors. Remember that each of the nodes in a neural network is just performing a linear regression based on this equation. Each node operates independently of the others and learns its weights through training. For a 10-minute refresher on neural networks, just check out this video.

So, now we've calculated the attention matrix. This T-by-T matrix is actually the crux of how LLMs read and write. There is one part still left to discuss, though. In addition to key and query vectors, another linear network calculates a value vector that is emitted by every word. We then multiply the attention matrix by the matrix of value vectors for every word, and this is the actual output of the attention layer. If the key is what information a word has or encodes, let's think of the value as what's actually relevant for the word to share, what it actually exposes. So let's see what happens in this matrix multiplication. This row is multiplied by this column to yield this value, this row is multiplied by this column to yield this value, and so on. We can see that we're taking a weighted average of the words from the past to end up with a new and transformed vector for this word. And we end up with a new and transformed vector for every word. So the ultimate output of the attention layer is actually the model's refined interpretation of the meaning of every word, far more nuanced than the crude embeddings that were passed in. The model has factored in the context of neighboring words to generate a new representation of each word, ultimately culminating in the model accurately predicting the next word. If you found this video helpful and are interested in more videos breaking down the transformer, leave a comment, and I'll see you soon.

Okay, let's explain attention, and specifically self-attention. So here we have a really complicated-looking diagram, which you may have seen before; it's called the transformer architecture. It's essentially the neural network architecture that was used for ChatGPT.
Specifically, we don't worry about the left part, which is called the encoder; we only use the right part of this architecture, which is called the decoder. And here we can see the different layers in this neural network. We can see that embeddings are used in this neural network; that's a really big part of ChatGPT. Then we also have linear layers, obviously, those traditional feed-forward neural networks. There's also another one over here. The distinction between this linear layer and the feed-forward is that the feed-forward includes nonlinearities like ReLU and sigmoid, right? But actually, one of the most important parts of this architecture, the part that makes ChatGPT and modern-day chatbots so effective, way more effective than chatbots in previous years, is the concept of attention. And attention is one of the most important topics in NLP right now. So, let's go ahead and explain exactly how it works.

So, what is the point of attention? Let's say we're working with some chatbot and we pass in something like "write me a poem". This is kind of a complicated instruction for a computer to parse, but we know that ChatGPT can actually respond really well to this. Well, the input will be something like B by T, right? Let's just say we're passing in only one sentence, one example, right now. So B is one, and T will be the number of tokens. If we're breaking things up on a word level, then T equals 4 here. The first step would actually be the embedding layer, right? This was actually the first step in, say, our sentiment analysis model. And as you may have seen in the transformer diagram previously, the first step is the embedding layer. So for every token, right, for every word, every time step, however you want to think about it, we are going to generate the embedding vector, and that will be of size C, or embedding dim; those might be the symbols used for that. So this is what I've represented over here.
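As a small sketch of that first step (with a made-up vocabulary size and made-up token ids):

```python
import torch
import torch.nn as nn

B, T, C = 1, 4, 2            # one sentence, 4 tokens, embedding dim C = 2
vocab_size = 10              # hypothetical vocabulary size

embedding = nn.Embedding(vocab_size, C)

# "write me a poem" already encoded as integer ids (a made-up mapping)
tokens = torch.tensor([[5, 1, 0, 7]])   # shape (B, T)

emb = embedding(tokens)
print(emb.shape)             # torch.Size([1, 4, 2]), i.e. B by T by C
```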
If B is one, right, then we don't have to worry about B by T by C; let's just say it's T by C. Then in the first row we have the embedding for the token "write", and that's some vector. In the second row we have the embedding for the token "me", and that's some vector. But how does the model actually combine or aggregate all this information to get a sense of what the instruction or the task is? How does the model understand the relationships between tokens? Because there are actually many pairs of relationships that are important within this task, this command that the chatbot is given. For example, what is the model supposed to write? It's supposed to write a poem, not a movie or a book. There are many pairs of relationships between tokens that the model needs to take into account. And the model somehow needs to aggregate or combine all these embeddings together, because right now these are all separate, independent rows. Each row in this tensor is a time step, right? But we need to somehow aggregate and combine them together.

Well, the simplest way to do that is just by averaging. Let's say we have a sequence of tokens: we have "write" at the zeroth time step, and then "me", and then "a", and then "poem". "Write" is the first token, so let's say it doesn't get averaged with anything. But then "me", well, "me" is the second token. So, to get a more complex representation that actually factors in the sequential nature of this sentence, or at least the spatial relationships that are going on, let's say we average "me" and "write" together. It's kind of like a running average. And then we have "a", so "a" would get averaged with "write", "me", and "a". And then finally "poem". Well, "poem" would be averaged with itself and everything that came before it, so the representation for "poem" could then be the representation for the whole sentence averaged together. This can work, and we can get decent results in neural networks with this, but not the greatest results. This might remind you of what we did in the sentiment analysis problem, where we just averaged our embeddings together. We just did a simple average; it's a simple aggregation. But we actually want to develop more and more complex aggregations. So before we can do that, let's first start to think of this simple aggregation we did as a matrix multiplication.

So here we have our T by C, right? You can think of each row in this tensor as being the embedding for a token. And I'm just following row-column notation here.
So this is row one, column one, right? Row one, column two. That's the subscript notation there. And this simple running weighted average that we just talked about, you can think of it as a matrix multiplication. So imagine that we left-multiply. On the left we have this tensor that is T by T. T is four in this case, right? If it's "write me a poem", that's four tokens. So we have a T by T tensor that is going to multiply against the T by C tensor, and this is going to help us end up with a T by C tensor again. And in every row, what we're going to have, at least at the entries that are nonzero, is just one divided by the number of tokens. Everything that comes after the first token is zero for the first row, everything that comes after the first two tokens is zero for the second row, and so on.

And once we actually do this matrix multiplication, remember, we're doing this row times this column to get this number, right? And then similarly, we would do the first row times this column to get this number, and similarly we would be doing this row times this column to get this number. Once we actually do that, let's see what we get. Initially, just for the first row, which deals with the first time step, the first token, it looks like it doesn't get averaged or added with anything else; it just gets scaled by a factor of 1/4 for both of the entries in the two embedding dimensions we have. Then for the second time step, we can see that we're essentially taking 1/4 times what came before it and also 1/4 times what we currently have. And we can see the same thing being done here: 1/4 of E12, which is essentially what came before it, and then we add in 1/4 times E22, which is exactly what was there. And this is essentially that running weighted average we were talking about earlier. Because what is an average, right? You're adding everything up and you're dividing by the number of examples. But if your number of examples is essentially constant, you can pull that out. The four that you would be dividing by, if you were adding everything up and then dividing by four, which is what a running average would be doing, we can essentially just distribute that 1/4 to each number that's being averaged.

So just to make this super clear: here we have 1/4 times E11. Here we have 1/4 times E11 plus 1/4 times E21, so that's kind of averaging "write" and "me". And then here we have 1/4 times E11 plus 1/4 times E21 plus 1/4 times E31, so this is kind of the average of "write me a". As we get more words and move further along into the sentence, more and more context gets taken into account. And this is being done for this column over here, and a similar thing is being done in this column over here, because remember, in this case C equals 2; the embedding dimension is two, but we are averaging along the time dimension. So this is essentially nothing more than what we already talked about over here. We talked about how we want some way for the model to know what's important to pay attention to and what the relationships are between the tokens. So every token gets averaged with itself and what came before it. "Write" gets averaged with itself and what came before it, which is nothing. "Me" gets averaged with itself and everything that came before it, which is just "write", and so on. And that can actually be represented as a matrix multiplication, where we just take the T by C embeddings and we multiply on the left with this T by T tensor.
But the fact that this tensor, which contains our weights for this weighted average or aggregation, is T by T actually means something. It means that for row i, column j inside this tensor, we can interpret that entry as the strength, or affinity, or score, as you may hear, between token i and token j. So we can think of the rows of this T-by-T matrix as corresponding to "write", then "me", then "a", then "poem", and similarly the columns correspond to "write", "me", "a", "poem". So if I were to look at this number right here, well, that's the row for "a" and the column for "me". So this 1/4 over here that I'm circling should actually be a number that symbolizes the strength of the connection, the relationship, between the words "me" and "a" in the input sentence. But that doesn't seem like a particularly important connection, right? What's probably more important is the connection between the tokens "write" and "poem", because what are we writing? A poem. The model needs to understand that there is a strong relationship between these two tokens. So ideally, we would actually want this number over here to be very high, because this row is for "poem" and this column is for "write".

So we have this T by T tensor, and we interpret row i, column j as being the strength of the connection between the token for row i and the token for column j. But we want to actually have a weighted average. This is just a simple average; everything is just 1/4. We want a weighted average so the model can actually pay attention to some tokens more than others. Because in a given sentence, "write me a poem", the word "a" isn't really that important to pay attention to. Maybe it's somewhat related to the word "poem", because the model needs to output only one poem instead of multiple poems, as opposed to if we said "write me poems". So the word "a" does have some significance in that command we're giving the chatbot, but it's not the most important thing to pay attention to. We don't want to weigh everything equally with this whole 1/4, or 1 over T, situation. Instead, what we want is a weighted average. But how does the model, for any arbitrary command that it's given, any arbitrary sequence of tokens, actually learn what the weights inside this T-by-T tensor should be? That's what the self-attention layer accomplishes. And there is a bit of a complex mathematical formula, but we'll break down exactly how it works. So
just to make it super clear what the whole point of this self-attention layer in a neural network is supposed to accomplish: it's supposed to come up with weights. Specifically, it's supposed to come up with a T-by-T tensor of attention weights, or attention scores, where we can figure out how important each token is to every other token. So I went ahead and crossed out the future tokens. So for "how": "how" is right here in the first row. "How" shouldn't be able to look at any of the future tokens. We only want to look at the current token and what came before it to figure out what comes next, right?
And I didn't exactly make this explicit earlier, but how these language models like chatbot work, chatbots work is continuously prediction. What's going to
continuously prediction. What's going to come next in the sequence? We'll talk
more about that later, but let's focus on trying to understand this T by Tensor for now. One number that we can see that
for now. One number that we can see that is particularly high is for the row corresponding to the question mark and the column corresponding to high. We can
see that we have a number 73 there which is relatively high. So why would that number be relatively high in the t byt tensor for this particular input. Well
the fact that how is associated with the question mark means that we're we're actually asking a question right. The
word how can be used in various contexts in English. We can say that's how that
in English. We can say that's how that works. that's just kind of declaring
works. that's just kind of declaring something. It's stating some kind of
something. It's stating some kind of information. But if how is associated
information. But if how is associated with the question mark, then the model can learn that the word how is being used to form a question here. So the
goal is to actually have a layer in our neural network that given our embeddings for every single token. We can then generate a T byT tensor that tells us for every pair of tokens how important
they are to each other. And we actually want that to be like trainable and learnable through gradient descent based on our training data. And then we can use that T byt tensor multiply it against the embeddings like we did
previously. Right? What we did is we
previously. Right? What we did is we took our T byt tensor and we matrix multiplied it against our T by C embeddings. That's what we want to do.
embeddings. That's what we want to do.
And then our output can then be kind of sent on and forwarded through later parts of the neural network. So now
let's try to understand how this T by T tensor is actually generated. What's the formula for this layer? There's kind of an ugly-looking formula for self-attention. If you just Google it, here it is: the T by T scores are softmax(Q K^T / sqrt(d_k)), and the layer output is those scores times something called V, which plays a similar role to the thing we multiplied the scores by previously. The main things we need to talk about here are: what in the world is Q, what in the world is K, why are we transposing K and multiplying them together, and what is V? d_k is also something we'll talk about; it's very similar to the embedding dimension. And the softmax is pretty straightforward. It's kind of like the sigmoid function, in the sense that it squeezes our values to be between 0 and 1. If we look back at the previous picture, those values weren't constrained to any range, and we want to normalize them into a fixed range where the highest value is one and the lowest is zero, while the relative scale, the ratios, are still kept. That's what softmax does. But the main things we really need to explain are what Q is and what K is, because those are the heart of self-attention. So let's break that down and make it really intuitive.
So let's explain the Q in that formula. Q stands for query. In our attention model, every single token in the sentence is going to be able to, quote unquote, talk to the other tokens and communicate with them until they pair up and we figure out the right T by T scores. The way we do that is by having every token emit a vector called a query. I've only drawn it for "write" and "poem" right now, but every token emits one. Just as the word "query" suggests, this vector contains information representing what that token is searching for. For example, if we were using a character-level language model and our token was the letter Q, then that token might emit a vector representing what it's searching for, and we know that in English, U almost always follows Q. So the query that the token Q emits might contain numbers similar to the information that U has, and Q would then match up with U, because U emits its own information as well. Every token emits this query vector, and it will be of size attention dim, which is a parameter specified to you. That can be represented as a linear layer: it has embedding dim as its input number of features, because before we even generate the queries, every token is represented by a vector of size embedding dim. The linear layer then changes the dimension to attention dim, so that every token in the sequence emits a query of size attention dim representing what that token is looking for.
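As a concrete sketch of that query layer (the sizes and variable names here are made up for illustration, not taken from the course's code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, embedding_dim, attention_dim = 4, 16, 8  # hypothetical sizes
x = torch.randn(T, embedding_dim)           # one embedding vector per token

# A linear layer maps each token's embedding to its query vector.
query_layer = nn.Linear(embedding_dim, attention_dim)
q = query_layer(x)                          # shape: (T, attention_dim)
print(q.shape)  # torch.Size([4, 8])
```

Every row of `q` is one token's query: what that token is searching for.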
So if every token emits a query vector representing what it's looking for, every token should also emit a key vector representing the information that it has. Then we can match up the queries with the keys, the tokens can pair up with the tokens that are actually relevant to them, and the T by T scores get learned. So every token is also going to emit a key vector of size attention dim. Again, I've only drawn this for "write" and "poem" right now, but it will be emitted for every single token, just as the queries were, and it's also generated using a linear layer, which of course has those trainable W's and biases. Then what we do is take the query matrix, which is T by A, because for every token at every time step we have a query vector of size A, the attention dim, and multiply it by K transpose. Instead of being T by A, the transposed key matrix is A by T, and it holds the key vector for every token. So here each row is a query, and here each column, because we transposed the matrix, meaning we flipped its rows and columns and essentially put the matrix on its side, is a key: the key for "write," the key for "me," the key for "a," the key for "poem." When we think about what this matrix multiplication is doing: the query for "write" gets dot-producted with the key for "write," then the query for "write" gets dot-producted with the key for "me," then with the key for "a," then with the key for "poem." And the same thing happens for every other query: all the queries and keys get dot-producted.
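Putting the query and key layers together, the Q times K-transpose product can be sketched like this (sizes are again hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, E, A = 4, 16, 8                      # tokens, embedding dim, attention dim
x = torch.randn(T, E)

query_layer = nn.Linear(E, A)
key_layer = nn.Linear(E, A)

q = query_layer(x)                      # (T, A): what each token is looking for
k = key_layer(x)                        # (T, A): what each token has

# (T, A) @ (A, T) -> (T, T): entry [i, j] is the dot product of
# token i's query with token j's key.
scores = q @ k.transpose(-2, -1)
print(scores.shape)  # torch.Size([4, 4])
```

The resulting `scores` tensor is exactly the T by T tensor of pairwise importances we've been after.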
And if you recall the dot product operation from linear algebra, it's a measure of how similar two vectors are to each other. I'll have a separate explanation for that right after this clip, which you can jump to. But remember, every token is emitting a query and every token is emitting a key. If we dot-product them, and the dot product is a gauge of how similar two vectors are, then the tokens whose queries and keys match up, maybe the query for "write" matches up with the key for "poem," or, using the character example from earlier, the query for Q matches up with the key for U, are important to each other, and we'd see a higher number coming out of the dot product. And if we check the shape of the tensor after this multiplication, T by A multiplied by A by T gives us exactly the T by T tensor we were looking for. So Q times K transpose, where Q is the output of the trained query linear layer and K is the output of the trained key linear layer, gives us the T by T attention scores we were looking for all along. So let's give a quick explanation of why the dot product represents how similar two vectors are. Remember, when we do that matrix multiplication, we're computing dot products: this row times this column. That ends up being this 1 times this 0, plus this 0 times this 1. For these two vectors, the dot product ends up being zero, and if you were to plot them, you'd see they're completely perpendicular to each other, not pointing in the same direction at all; we'd say they're completely orthogonal. That's different from these two vectors over here. Let's imagine this is the query for one token and this is the key for another, and we're trying to figure out how similar they are. These are at least partially pointing in the same direction; they're not completely orthogonal, not pointing away from each other like the previous pair. So if we take, say, (3, 2) dot-producted with (2, 3), we get 12, which is much farther from zero, indicating that these two vectors do share some kind of similarity. So that's why we compute the query times K transpose, and that's why it gives us the T by T scores.
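The two little dot products from this explanation, checked in code:

```python
import torch

# Perpendicular vectors: no shared direction, dot product is zero.
a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.0, 1.0])
print(torch.dot(a, b).item())  # 0.0

# Vectors pointing in a similar direction: large positive dot product.
c = torch.tensor([3.0, 2.0])
d = torch.tensor([2.0, 3.0])
print(torch.dot(c, d).item())  # 12.0
```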
Okay, so we just have one or two things left to explain in our T by T tensor. The next step, if you remember the attention formula I showed at the start, is to apply softmax to it. You can think of softmax as a multi-dimensional sigmoid: it squashes everything to be between 0 and 1, and we can see that being done over here for each of these values. The way the formula works, it exponentiates every value, raising e to the power of each entry, which makes everything positive, and then divides by the sum of the entire vector. So not only does it squash everything to be between 0 and 1, it also makes everything sum to one. On our T by T tensor of attention scores, when we apply softmax, every row ends up between zero and one, and every row sums to one. So for a given row, which corresponds to a given token, with the columns going left to right, if those values sum to one, we can think of them as scores, or even probabilities, that each token might be relevant to the token corresponding to that row. Again, the mathematical details of softmax aren't super important here, but it is nice that it normalizes our values and squashes everything to be between 0 and 1, instead of leaving arbitrary numbers like the 73 and 96 we saw in the previous tensor, with a totally arbitrary range. So once we have those normalized T by T scores, to actually get the output of this attention layer, why do we multiply by something called V, instead of just multiplying by the input like we originally did in the example with all the 1/4s in the matrix? Well, in addition to emitting a query and a key, every token is also going to emit a value: another vector of size attention dim, learned and trained with a linear layer. The reason we do this is to add another level of expressiveness to the model. If the query is what the token is searching for, and the key is what the token actually has, then the value is what the token is actually willing to share. There are various pieces of information associated with every token, and those live in the key; but the value, not the Q, not the K, but the V, says: what information is actually relevant, what do I actually want to emit and share? We don't necessarily want to share the entire raw input, as we would if we multiplied the T by T scores by the input directly. Instead we multiply by V, where V is learned, so the model can learn, for every token, what information is actually relevant to share with the other tokens. So: what the token is looking for is the query; what information the token has, which should match up with the queries of other tokens, is the key; and what information it's actually willing to share is the value.
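Here's a quick sketch of softmax applied row-wise to some made-up scores, showing that each row lands between 0 and 1 and sums to one:

```python
import torch
import torch.nn.functional as F

# Unnormalized attention scores for T = 3 tokens (made-up numbers,
# echoing the arbitrary 73s and 96s from the earlier tensor).
scores = torch.tensor([[73.0, 10.0, 5.0],
                       [20.0, 96.0, 30.0],
                       [15.0, 40.0, 88.0]])

# Softmax along each row: every entry lands in (0, 1) and each row sums to 1.
weights = F.softmax(scores, dim=-1)
print(weights.sum(dim=-1))  # each row sums to 1.0
```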
So the last thing we need to explain is why we divide by the square root of d_k before applying softmax, where d_k is the attention dim. That's what d_k actually is. This is something you'll find in neural networks and deep learning all the time: researchers experiment with different scale factors. We're just dividing by the square root of d_k, which is a single number here, no matrix multiplication or matrix division or anything like that. Neural networks can suffer from something called an exploding gradient or a vanishing gradient, where the values of the derivatives during training get either way too big or way too small, and training becomes unfeasible. So it's often better to scale our values down or up by some scale factor, which is exactly what we're doing here. The researchers who created self-attention, at Google in 2017, found that this achieved far better results, and it's become standard when coding up self-attention. And the @ operator that I've been using here, by the way, is matrix multiplication in PyTorch. So now you're ready to code up self-attention.
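Putting all the pieces together, here is a minimal sketch of the full formula, softmax(Q K^T / sqrt(d_k)) V, for a single unbatched sequence. The masking of future tokens is left out here, and the layer names and sizes are illustrative assumptions, not the course's exact code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, E, A = 4, 16, 8                       # hypothetical sizes

x = torch.randn(T, E)                    # token embeddings

# Q, K, V each come from their own trainable linear layer.
query = nn.Linear(E, A)
key = nn.Linear(E, A)
value = nn.Linear(E, A)

q, k, v = query(x), key(x), value(x)     # each (T, A)

# softmax(Q @ K^T / sqrt(d_k)) @ V
scores = q @ k.transpose(-2, -1) / math.sqrt(A)   # (T, T), scaled down
weights = F.softmax(scores, dim=-1)               # each row sums to 1
out = weights @ v                                 # (T, A)
print(out.shape)  # torch.Size([4, 8])
```

Dividing by sqrt(A) before the softmax is the scale factor discussed above.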
Let's solve self-attention. We're finally ready to code up this super important component of transformers. I highly recommend you check out this video for an explanation of the concepts, though I'll also give an overview of them in the solution video. Similarly to how we first learned linear regression by implementing it from scratch and then unlocked the nn.Linear class, which we simply used as a building block in neural networks, used extensively in handwritten digit recognition, in sentiment analysis, in many neural networks, now that we're on our journey to coding up the transformer architecture that makes up GPT, there's a whole bunch of new layers we need to talk about. nn.Linear is a component of GPTs, but there are other layers in this complicated-looking neural network architecture that we're going to break down and make simple. One of the most important components of GPT, and of transformers in general, is self-attention. We can see it appears here and here, and it's one of the most powerful components of a transformer. It's actually the thing that makes transformers unique compared to other neural network architectures.
By this point, I'll assume you've solved the GPT dataset problem and are generally familiar with the input we pass into these GPTs during training. It's just a sequence of tokens or words, and there are actually many training examples embedded within even one sentence. During training, this model is just learning to predict the next token over and over again, given a bunch of contexts. So if we pass this sentence in during training as one of the sentences in a batch, the model can learn that given a context of "write," "me" comes next; given "write me," "a" comes next; given "write me a," "poem" comes next. As information flows through this neural network, it all culminates in the model's prediction for the next token, which we'll talk about later, but it ends up being a bunch of probabilities for all the possible next tokens, and then we may do something like take the highest probability, or something more complicated depending on the scenario. There are many neural network layers that factor into the model's prediction of the next token in the sequence, and one of the most important layers helping the model achieve that is the attention layer. Before we get into the attention layer, let me talk a bit about what happens before it. We can see there's some sort of embedding layer over here, and it's exactly the same embedding layer we used in the sentiment analysis model. Given a sequence of tokens of length T, here with "write me a poem" capital T equals 4, for every token we get an embedding, or feature vector, which should encapsulate the meaning of that word or token, and this is learned through training, through gradient descent. Let's say the embedding dimension we choose is capital E. Our input of size capital T has now become T by E, since for every token we have a vector of size E, and this T by E tensor is what gets fed into the attention layer over here. What the attention layer outputs is something of size T by A, where A is not the embedding dim but the attention dim. So for every token at every single time step, here's one time step in the sequence, here's the next, we now have a different vector. This vector may or may not be the same size, depending on whether A equals E, but that's not the important thing. The important thing is that this transformed vector for every token now contains a slightly different piece of information: it's a transformed version of the vector that encapsulates what the model needs to attend to, or focus on. What's actually important for the model to pay attention to? That's where the term self-attention comes from. And the way the self-attention layer generates this vector of what the model should pay attention to, for every token, is by aggregating information from the other tokens. Say the model now has a new representation for the word "me": the model has factored in everything that came before "me," all those other tokens, and aggregated that information together into a new, transformed representation of the token, one that represents what the model actually needs to pay attention to. So now let's dive into how that actually works and make this a bit less abstract.
Before we dive into exactly how attention works, let's make sure we're clear on everything in the problem description. The forward method will return a batch size by context length by attention dim tensor; context length is the capital T I was just talking about. And we can see that the input dimensionality is the embedding dim, so the input is B by T by capital E. We're also given the attention dim, which is like the capital A I was just talking about. People also call this the head size. We don't need to worry about the word "head" right now; that will be explained more in the next problem, where we talk about multi-headed attention. The third input we receive is the actual B by T by embedding dim tensor, which is the input to the forward method for this self-attention layer. Again, we are coding up the self-attention class, which will simply be used as a layer later on in the GPT class, just like nn.Linear. In the constructor for the self-attention class we'll define the relevant instance variables and the layers that make up self-attention, and the forward method will be just like getting the model's prediction for that layer: we pass the B by T by E input into the layer, and it returns the B by T by attention dim tensor. Just to make that super clear: for this input tensor, we have two different 2 by 2 tensors, so this is 2 by 2 by 2. That's B by T by E, so capital T, the context length, equals two in this case, the batch size equals two, and capital E equals 2. Then the output should be 2 by 2 by 3, since 3 is the attention dim while the batch size and capital T, the number of tokens, remain the same. And we can see that that is the shape of this tensor over here. The numbers might feel a bit meaningless right now; this example is mostly to help you understand the shapes of the inputs and outputs. They'll make sense soon, once we explain how self-attention works at the lowest level.
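A minimal sketch of what such a SelfAttention class might look like, assuming the shapes from the problem description. This is my own illustration, not the course's solution; it uses the Q, K, V layers and the masking of future tokens described earlier:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: (B, T, E) -> (B, T, A)."""
    def __init__(self, embedding_dim: int, attention_dim: int):
        super().__init__()
        self.query = nn.Linear(embedding_dim, attention_dim)
        self.key = nn.Linear(embedding_dim, attention_dim)
        self.value = nn.Linear(embedding_dim, attention_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, E = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)  # each (B, T, A)
        # Scaled dot-product scores: (B, T, T).
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])
        # Mask out future tokens so each position only attends to itself
        # and to what came before it.
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        return weights @ v                                   # (B, T, A)

torch.manual_seed(0)
layer = SelfAttention(embedding_dim=2, attention_dim=3)
out = layer(torch.randn(2, 2, 2))   # B=2, T=2, E=2, as in the example above
print(out.shape)  # torch.Size([2, 2, 3])
```

The 2 by 2 by 2 input produces a 2 by 2 by 3 output, matching the shapes just discussed.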
lowest level. Let's build some more intuition for how this works with a simple example. Let's say we had a
simple example. Let's say we had a sentence like dog is cute that we were feeding into GPT. And again, we don't really need to worry about the inner workings of GPT for now and that complex transformer diagram. Let's just build
transformer diagram. Let's just build some intuition for how this concept of self attention works. And let's just say B equals 1. So that we are only dealing with one example at a time. We're not
processing like other sentences in parallel in this batch. So T equals 3 if we're doing a word level breakdown of the sentence. And let's say E equals 2.
the sentence. And let's say E equals 2.
So every token would be represented with a vector of size 2. And this is our input X that is kind of that T by E tensor that is fed into the attentions forward method. And we can see that the
forward method. And we can see that the first row is for dog, the second row is for the word is, and the third row is for the word cute. So let's say we had
these embeddings learned by the model.
And let's just take them to be like what they are at face value. Let's not worry about how the where these numbers came from. Just that this is the model's
from. Just that this is the model's learned representation of these words.
And this will make sense soon. So if we wanted to plot them, we can see that dog is over here is is over here. And then
cute is over here.
So how does self-attention generate this new tensor of size T by A, where every token gets a transformed vector holding the important information for the model to look at? The way this is generated is that the model first produces a T by T tensor. You can think of this as a tensor of scores, or weights. If we have T tokens and we look at a T by T tensor, that's T squared entries, right? So we're considering every single pair of tokens, every possible pair of words, and matching them up. If you index this tensor at row i, column j, where row i corresponds to the i-th token and column j corresponds to the j-th token, the way we interpret that entry, and by the way, we're going to make these numbers be between zero and one, is as a score for how important those two words are to each other in the sentence. So in a sentence like "dog is cute," you might imagine that the row for "dog" and the column for "cute" holds a really high number, because "cute" is important for describing the word "dog" in that sentence. It would be a score between 0 and 1, with one meaning very important and zero meaning not important to each other at all. This T squared sized tensor contains the scores for every single pair of tokens: how strong of an association do they have with each other, and how important is that pair for actually understanding and processing the meaning of the input? For now, let's not worry about the exact formula the model uses to generate this tensor; we'll explain that in a bit. Let's assume the model did generate this T by T tensor for the input "dog is cute," and just understand what's in it. We can see that for row one, column one, "dog" and "dog" have some relation to each other, right? It's the same word. For "is" with "is" and "cute" with "cute" I put the same number, 0.7, because obviously a word is important to itself; these diagonal entries are kind of meaningless, but the model will still learn some number for them. The more important entries are the ones off the diagonal. We can see that for the row for "dog," the future tokens that come after "dog," like "is" and "cute," are completely zeroed out, and this is intentional. You may have heard of something called masking in self-attention; it's okay if you haven't heard that term before. Essentially, if the whole goal of these models is to predict the next token in the sequence, then we shouldn't let the model look at those future tokens during training. If the goal is to predict that "is cute" comes after "dog," why would we let the model look at those future tokens? We completely zero out those attention scores. Instead, for every single token at every single time step, we only let the model see the current time step and everything that came before it. So if we look at the second row, which is for "is," we see that the word that came before it, "dog," gets 0.5: it has some importance to the word "is," which is kind of a helping verb in this sentence, so maybe not crazy important. Then obviously "is" is important to itself, and we don't let the model look at the future token "cute" that comes after "is." A really important entry in the matrix, though: "dog" comes before "cute," so for the row for "cute," we consider all the tokens that came before it, up to and including "cute" itself, and we see a really high number here. This makes sense, right? "Cute" is the adjective describing "dog" in the sentence. For the model to truly understand the language and meaning of the sentence, it needs to realize that this pair of tokens, "dog" and "cute," is very strongly associated; hence the value of 0.9. And maybe 0.5 for "is," which is just a helping verb linking "dog" and "cute," so we just have a 0.5 there.
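The zeroing-out of future tokens in that T by T matrix can be reproduced with torch.tril; the numbers below are hypothetical unmasked scores chosen so the masked result matches the "dog is cute" example:

```python
import torch

# Hypothetical unmasked scores for "dog is cute" (T = 3).
scores = torch.tensor([[0.7, 0.4, 0.9],
                       [0.5, 0.7, 0.6],
                       [0.9, 0.5, 0.7]])

# Zero out everything above the diagonal: token i may only look at
# tokens 0..i, never at future tokens.
masked = torch.tril(scores)
print(masked)
# tensor([[0.7000, 0.0000, 0.0000],
#         [0.5000, 0.7000, 0.0000],
#         [0.9000, 0.5000, 0.7000]])
```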
We'll talk about how the model generates this T by T tensor, one of the biggest parts of coding up the self-attention class, in a bit. But first we want to understand the whole value of having this T by T tensor. So now let's explain how the model uses it to actually get the output. On the right is going to be the output of the self-attention forward method. In the left matrix we have that T by T tensor, and here we have the actual input: the T by E tensor of embedded vectors for each token, where each row represents a time step. What if we matrix multiply them together? What would that actually give us?
this is as deep as kind of like getting into the actual like matrix multiplications we're going to go is.
This is just kind of the one time we're going to really look at it is it really does help with the understanding. Won't
do like any crazy unnecessary math. So
we have this row here 0.7 0 0 and we multiply that with this column over here which ends up getting us this entry over here. So we have the dotproduct of those
here. So we have the dotproduct of those two vectors and that gets us 0.7 over here. And again let's look at this
here. And again let's look at this matrix over here. It's technically we said the output was t by a. Let's just
say a equals e in this scenario. So the
output is still t by e or t by a. And so
each row here is also a time step and this is dog is cute. So the vector here, the vector here should be like the transformed version of the word or the entry for dog. And this this transformed
vector should actually contain the information that we need to pay attention to. So let's try and
attention to. So let's try and understand what this multiplication was actually doing. We have the scores here
actually doing. We have the scores here for how dog is relating to all the other tokens in the sequence. And this vector over here is essentially the first
column of every single row in our feature vectors or embedding vectors.
Right? So this this entry over here is like kind of half the representation of the word dog. Of course, we're not including this column over here. This is
part of the representation of the word is. This is part of the representation
is. This is part of the representation of the word cute. And the score for dog dog is being multiplied with this thing for dog. The score for dog is is being
for dog. The score for dog is is being multiplied with, you know, the number for is. The score for dog cute is being
for is. The score for dog cute is being multiplied with the the number for cute in this column over here. And remember,
our goal is that we want this new vector for every token/ row to actually aggregate or factor in the information from the other tokens, right? Because
for the model to again truly understand text, the model needs to understand the relationships between the different tokens in a sentence, right? We know
that we can't just be looking at these words independently. Rather, the model
words independently. Rather, the model needs to actually understand the different relationships between the tokens in a sentence. However, when we say aggregate or factor in information
from the other tokens, we really only mean the information from the tokens that came before and like up to including the current token, right? We
don't want to let the model look at the future tokens. So let's see what's been going on here for this first row, the new transformed version of dog. If we actually do the matrix multiplications, we have the same row again over here times this column, and we actually just end up with zero over here. So it doesn't seem like a drastic change for dog. We just have 0.70.
But what this actually represents is that, well, dog is the first token in the sentence, right? So that's why we have these zeros over here. Dog wasn't even allowed to look at the future tokens, which is why it doesn't seem like we really factored in much from the future tokens.
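To make that arithmetic concrete, here's a toy sketch (values loosely based on the ones mentioned in the video, so treat them as illustrative) of a masked score matrix aggregating one embedding column:

```python
import torch

# Toy causal scores for "dog is cute": row i holds token i's weights
# over tokens 0..i; future tokens are zeroed out by the mask.
scores = torch.tensor([[1.0,  0.0,  0.0],
                       [0.4,  0.6,  0.0],
                       [0.9, -0.5, -0.7]])  # hypothetical values

embeddings = torch.tensor([[0.70], [0.20], [0.10]])  # one feature column

# Each row of the result is a weighted sum of the embedding entries.
transformed = scores @ embeddings
print(transformed[0])  # row 0 only sees dog itself: 1.0 * 0.70 = 0.70
```

The zeros in the upper triangle are exactly why dog's transformed value is unchanged: it was only allowed to look at itself.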
Let's not worry too much about this second row over here, since it's probably not going to give us a ton of important information. As we saw earlier when we were looking at the scores, is was not a super critical word in the sentence. However, the relationship between dog and cute, and the effect of having this 0.9 in this entry of the matrix over there, is about to become apparent. So let's say we are trying to generate this third row over here. So we
know that is done by taking this row over here and multiplying it with this column over here. So we get 0.9 minus 0.5
minus 0.7.
So, if we take a look at the effect of that, this 0.9 is much larger than 0.5 and 0.7, and that 0.9 gets multiplied with this entry for dog, whereas these
smaller weights over here are getting multiplied with the entries for these other tokens over here. So in this sum that actually becomes this entry over here, the effect of dog and cute, this score or association of 0.9, is weighted far more heavily and is actually contributing far more to the overall sum than these other terms, which are actually bringing this number down. And if you actually finish out the matrix multiplications and plot those new vectors for each token, this is what we end up with. And it might seem like a pretty small change. We'll notice that dog has shrunk a bit, is has actually shifted a bit, and cute has shifted a bit. And this
might seem small to us, but this is actually helping the model learn the relationship between all the tokens and ultimately make its prediction for the next token in the sequence with the final layer of the model. So how is that tensor of T by T scores actually generated? If you look up the formula for self-attention, you're going to see something like this, which seems a little ugly, but it actually has an intuitive explanation. So, we said
we wanted to consider every single pair of tokens and how important they are to each other. Well, the way the model achieves that is by actually breaking down these invisible barriers between the tokens, and now, instead of processing the tokens independently, it starts letting the tokens talk to each other. Attention is actually a communication mechanism between words.
And that sounds a little crazy at first, but what I actually mean by that is that for every single token in the sentence, we're going to emit two vectors. The
model is going to generate and learn two vectors for every single token. And one
of those vectors is called Q for query, and one of those vectors is called K for key. And the query actually represents what the token is searching for or querying for. So the token is talking to other tokens. It's generating a query vector and the token is saying, "Okay, I am a token. Here is what I am looking for. If you meet my criteria, hopefully we can meet up and then get a high score together in this tensor." So what I actually mean by that is a noun might be searching for an adjective to describe itself. So the query for dog, this vector, might have similar information to what is actually embedded in the vector for cute, because we know that is the adjective that describes dog. So we're actually going to look at all these queries and keys and see which ones match up. So if the query is what information the token is searching for, and that is emitted by every single token, then the key vector emitted by every single token represents what information that token actually has. What information does that token actually contain? So the query for dog might match up with, say, the key for cute, since we know this is a noun, we know this is an adjective, maybe they're looking for each other. So every single token is going to generate a key and a query.
So for every single token we have a vector, and that vector will be of size attention dim. Then the query tensor or the query matrix is T by A. Same thing for the key matrix or the keys: that would be T by A. So then if we actually multiplied Q by K transpose, where transpose just means to swap the rows and columns in a matrix, then that's a T by A (this is Q) multiplied by an A by T (this is K transpose). So then the A's are actually going to cancel out. This makes the matrix multiplication dimensions work out, and we end up with something that is actually T by T, which is exactly what we wanted. And then if you kind of take a look at an
example and see the row for a query being multiplied with a column in K transpose, you'll see that that's literally aligning the queries and keys, because we want to see which queries and keys actually match up, right? We want the query for, say, dog to actually align with the vector that represents the key for cute. And then that dot product of those two vectors is some sort of number that's stored at entry (i, j) in this T by T tensor. So maybe i is the row for dog and j is the column for cute. Or actually, as we talked about earlier, that entry would be masked out to be zero, because we wouldn't let dog look at future tokens. So it might be more accurate to say it's the row for cute: maybe i corresponds to cute and j corresponds to dog. It would actually be the row for cute and the column for dog. And then that dot product of that query and that key is a number in this T by T tensor which represents the score or association between dog and cute.
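As a shape-level sketch of that Q times K transpose step (assuming T = 3 tokens and attention dim A = 4; the untrained linear layers here are stand-ins for whatever the model has learned):

```python
import torch
import torch.nn as nn

T, E, A = 3, 8, 4  # tokens, embedding dim, attention dim (head size)
embedded = torch.randn(T, E)  # stand-in for the embedded sentence

get_queries = nn.Linear(E, A, bias=False)
get_keys = nn.Linear(E, A, bias=False)

Q = get_queries(embedded)  # T by A: what each token is searching for
K = get_keys(embedded)     # T by A: what each token contains

scores = Q @ K.T           # (T, A) @ (A, T) -> T by T
print(scores.shape)        # torch.Size([3, 3]): one score per token pair
```

Entry (i, j) of `scores` is exactly the dot product of token i's query with token j's key.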
So then what is this little scale factor that's square rooted in the denominator over here? This is just one of the contributions from the 2017 paper "Attention Is All You Need," where they actually introduce this idea of scaled attention. So d_k, or the dimensionality of K, is actually just this attention dim, the capital A or head size number that we talked about earlier. And the researchers found that when you take this Q times K transpose, we know what that represents, that's our T by T tensor, and scale it by dividing by the square root of d_k, you actually end up with a much smoother training process, and you won't run into any extremely small or large gradients, which makes the training process a lot easier. So this part of the formula isn't as important for our conceptual understanding, but this part, Q times K transpose, is definitely really important to understand. One additional
thing we're going to do to actually ensure that all our entries in that T by T tensor are between 0 and 1 is we're going to apply the softmax function. And the softmax function is kind of like a multi-dimensional sigmoid. We talked about the sigmoid function earlier in these ML problems as a function that makes anything that's inputted into it be between zero and one. The higher the input is, the closer the output is to one; the lower the input to sigmoid is, the closer it is to zero. So softmax is pretty similar, except not only will it make every entry in your vector between zero and one, but it will also make them sum to one. So we're actually going to apply the softmax function to each row of this T by T tensor. So then every single entry in this tensor will be between 0 and 1, and each row will also sum to one. So if I looked at the row for dog, right, the row for dog, and the columns are dog, is, cute, then this actually is going to force all of these numbers to sum to one. And we like that, because then we can kind of think of this as a probability. We know all the probabilities have to sum to one. So we can actually think of entry (i, j) as the probability that the i-th token and the j-th token are important to each other.
To clarify exactly what we mean by softmax, let's say we had some vector here: 1, 2, 3. To apply softmax to this vector means to raise e to the power of every term in the vector. So e to the power of 1, e to the power of 2, e to the power of 3. And then all of these entries in the resulting vector are divided by the same thing: the sum of all of them. That's what makes everything be between zero and one and actually sum all the way up to one when you add up all of these numbers over here.
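As a quick numeric sketch of that (assuming PyTorch is available):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0])

# Exponentiate every entry, then divide each by the common sum.
manual = torch.exp(x) / torch.exp(x).sum()
builtin = F.softmax(x, dim=0)

print(manual)        # approximately [0.0900, 0.2447, 0.6652]
print(manual.sum())  # 1.0: entries are between 0 and 1 and sum to 1
```

Note that the largest input (3) gets the largest share of the probability mass, which is the "soft" version of taking the max.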
So, represented as a formula, the softmax at entry i is just e to the x_i, where x_i is the entry at the i-th position in our vector, which we can call x, and then we divide by the sum of all the e to the x_j's. That's kind of an ugly-looking formula, but here's what it would
look like as an example. And in the code, it's actually a B by T by T tensor that we're applying softmax to. The whole time we've been talking about a T by T tensor, but that's just for one example; if we're processing examples independently in a batch, then for every example there's a T by T tensor. So just to be really clear, we want to apply the softmax to every row of the T by T tensor, but for every batch. The way you would do this in code is nn.functional.softmax, which is the function we use to call softmax, and then what we're actually going to do is pass in the tensor. So let's just call this scores. These are our scores. You'll go ahead and pass in scores, and you'll pass in the dim. And the dim, since we have a three-dimensional tensor, is either 0, 1, or 2. But since we actually want to make each row of this T by T tensor for every batch sum to one, you're simply going to pass in dim=2. One final thing before
we're ready to talk about the coding. So the forward method is actually going to return the softmax of our scores, that's this first factor that I've just underlined here, times V, which is our value. Earlier, when we were doing that first dummy example, we multiplied our 2 by 2 T by T scores times our input, the T by E input. I essentially said that we were doing our scores times what we can call X, the input to the forward method. But that's not actually what's done in practice. And that might have been why the numbers looked a little weird.
So what's actually done in practice is, for every single token in a sentence, let's say we have a sentence like "dog is cute," for every single token, for every single word, not only do we emit a query, not only do we emit a key, but for every single token (I'm only drawing it for dog, but it's also the case for every other token), we're also going to emit something called a value, a value vector. So if the query is what the token is searching for, and the key is what information the token actually has, the value will be another learned vector for every token, and this is what information the token wants to share. Because there might be a whole bunch of information that a token has, but maybe not all of it is actually relevant to share with the other tokens. So when we actually do our T by T scores multiplied by our input to get our aggregated output, we're actually going to multiply by V, where V is T by A, just like Q and K. Every token is going to emit a vector of size A called the value, which represents what information is actually relevant to share. Now we're ready to talk about how to actually code this up and the implementation details. So nn.Linear is going to be our friend again. In order
to generate our keys, queries, and values, we're going to have, in our constructor for the self-attention class, three instances of nn.Linear: one of them for generating the keys, one of them for generating the queries, and one of them for generating the values. And
you can imagine that in_features would be the embedding dim, or E, and out_features would be the attention dim, or A. Next, when we actually matrix multiply Q and K in the self-attention formula, the @ operator in PyTorch actually does matrix multiplication, or you can use torch.matmul.
You can actually check the torch documentation for that. You're going to want to do Q matrix-multiplied, so the @ operator, with torch.transpose of K, where you transpose the first and second dimensions. So that will actually give us the B by T by A tensor, which is Q, matrix-multiplied by the transposed version of K, which is B by A by T. The reason we're transposing the first and second dimensions is because originally it's B by T by A for K, the keys, but we want to switch the A and the T. Remember, this is dimension zero, this is dimension one, this is dimension two. So if we swap those, that gets us the matrix multiplication that we actually want. It'll preserve this batch dimension; it's kind of like a parallel-processing, independent leftmost dimension. And then for every single pair in the batch, we're doing the T by A times the A by T. And we know a T by A times an A by T gives us the T by T that we want. Another thing is how do we actually do the masking that we talked about earlier? The best way to implement this is to use torch.ones and pass in capital T, capital T, which will give you a tensor of all ones of shape T by T. And then you can pass that into another function called torch.tril, which is short for lower triangular; it's a linear algebra term. And then this is going to give us this tensor over here, which actually has the future tokens masked out. And then we can maybe store that in a variable
called premask. Because what we're actually then going to want to do is, let's say we have our scores, right? Scores is what we have after we've applied Q times K transpose and we've done our division by the square root of d_k. Let's say that is the state of scores at this point, but it's before we've applied softmax. We're going to want to do the masked fill on scores before we apply softmax. We want to get rid of those future tokens for every single time step before we apply softmax.
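Here's a minimal sketch of that masked fill on a toy T by T score matrix (random values, just to show the shape of the operation):

```python
import torch
import torch.nn.functional as F

T = 3
scores = torch.randn(T, T)  # pretend these are scaled Q @ K-transpose scores

# Lower-triangular premask: 1 where a token may look, 0 at future tokens.
premask = torch.tril(torch.ones(T, T))

# Fill future positions with -inf BEFORE softmax so they contribute zero.
scores = scores.masked_fill(premask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

print(weights)  # upper triangle is zero; each row still sums to 1
```

Because e to the negative infinity is zero, the masked entries drop out of both the numerator and the denominator of the softmax.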
Otherwise, say you were to zero out the scores for those future indices after applying softmax. Well, then, for the ones that are not zeroed out in every single row, because of how the softmax formula works, where it's doing e to whatever's in the first entry, e to whatever's in the second entry, even if you later mask those out, the denominator was still factoring those in. So you want to erase those future tokens' contribution before you apply softmax. And the way you'll do that is, with scores, you will simply call the masked_fill method from PyTorch, and then in the first entry you want to pass in a mask, which is just a tensor of trues and falses. So you would say premask, which is just this tensor over here, equals equals zero, which would get us the trues and falses, the trues where we actually want to do the masking. And then at those entries you want to pass in negative infinity, and the reason you want to pass in negative infinity is because then, after you apply softmax, you would get zero. And let's
explain why that's the case. So just a quick math review on why this is the case. Again, this is probably as deep as the math is going to go. What happens when you pass that negative infinity, or whatever the lowest value in Python is, into the e function in the softmax formula? Well, e to the negative infinity is 1 over e to the infinity. That's just an algebra rule. And then e to the infinity: if we think about what the e function looks like, it's an exponentially growing function. So, as the input approaches infinity, this function is approaching infinity. So I can replace 1 over e to the infinity with 1 over infinity, and this is just zero, which is exactly what we wanted. We didn't want those future tokens to have any contribution; we wanted to zero them out. And the way to zero them out is to set them to negative infinity before you apply softmax. So why is 1 over infinity zero? It's just a quick review from maybe calculus one. 1 over 2 is 0.5, 1 over 5 is 0.2, 1 over 10 is 0.1, 1 over 100 is 0.01. As this denominator is getting bigger and bigger, the numbers are getting smaller and smaller. So 1 over infinity is essentially zero. Okay, now
we're ready to code it up. So we know we're going to need three linear layers: one each for generating the keys, queries, and values. So we can say self.get_keys is nn.Linear, where the in features are of size embedding dim, since for every token we originally have a vector of size embedding dim, and we want to get, for every token, a vector of size attention dim. Similarly, we might have get_queries as nn.Linear of the embedding dim and then the attention dim, and then again the same thing for the values: nn.Linear, embedding dim, attention dim. And those are actually the only instance variables we need here. Just a small adjustment: remember we don't use biases in the linear layers for getting the keys, queries, and values. In linear regression, when we're doing w1 * x1 + w2 * x2 + w3 * x3, at the end there is an optional constant term b, called a constant term since it's not actually being multiplied by any of our input features, right? We're just adding it as a constant. And it can optionally also be learned through training, through gradient descent. But if we pass in bias=False, this will not be included; we'll simply have our W's and X's. And in the case of self-attention, it's actually been found that we can get slightly better results without the bias. So there's no need for a bias here.
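Putting the pieces we've described together, here's a sketch of where this is headed: the full class, as I reconstruct it from this walkthrough, with assumed names like get_keys (the forward method is walked through step by step next; the rounding at the end just makes the output easier to read):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embedding_dim, attention_dim):
        super().__init__()
        # One linear layer each for keys, queries, and values; no biases.
        self.get_keys = nn.Linear(embedding_dim, attention_dim, bias=False)
        self.get_queries = nn.Linear(embedding_dim, attention_dim, bias=False)
        self.get_values = nn.Linear(embedding_dim, attention_dim, bias=False)

    def forward(self, embedded):
        # embedded is B by T by E; K, Q, V are each B by T by A.
        K = self.get_keys(embedded)
        Q = self.get_queries(embedded)
        V = self.get_values(embedded)

        B, T, A = K.shape
        # (B, T, A) @ (B, A, T) -> B by T by T, scaled by sqrt(A).
        scores = Q @ torch.transpose(K, 1, 2) / (A ** 0.5)

        # Mask out future tokens, then softmax each row so it sums to 1.
        premask = torch.tril(torch.ones(T, T))
        scores = scores.masked_fill(premask == 0, float("-inf"))
        scores = nn.functional.softmax(scores, dim=2)

        transformed = scores @ V  # B by T by A
        return torch.round(transformed, decimals=4)
```

For example, SelfAttention(embedding_dim=4, attention_dim=8) applied to a (2, 3, 4) input returns a (2, 3, 8) tensor.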
So now let's actually use those layers to get our queries, keys, and values. So we can say K is self.get_keys of the embedded version of our input. So if embedded is B by T by E, this is B by T by A. Similarly, Q should be self.get_queries of, again, embedded, and has the same dimensions, and then the values: self.get_values of embedded, and again the same dimensions. So now we actually want to do the matrix multiplication, right? We have to do Q matrix-multiplied with K transpose. That's how we actually align the queries and keys and see which ones match up. So I might say scores is Q matrix-multiplied with torch.transpose of K. And again, I want to swap the T and the A, so I would say 1, 2 (it'll also work if you pass in 2, 1). And then we also are going to need the value of the attention dim, right? As we know, we
actually are going to want to take the square root and divide by it, right? So we can get the attention dim from a variety of places. The easiest place to get it is just K, since we know it's B by T by A. So we can just say B, T, A = K.shape; we're unpacking that whole .shape tuple. And then here is when we can do our scaling: you can say that scores is equal to scores divided by the square root of the attention dim. So now we have scaled the scores. Now we want our lower triangular tensor, so we'll do something like torch.tril, just like we talked about earlier, on torch.ones of T, T. And then maybe this is our premask that I talked about earlier; that's before we
actually get the tensor of trues and falses that would be the true definition of a mask. And then we might say that our true mask is simply premask, but we want to look at the entries where this thing is equal to zero. Then we can say something like scores = scores.masked_fill, go ahead and pass in the mask, and then Python's version of negative infinity, or whatever the smallest system value is: that will just be float("-inf"). And now we can actually apply softmax. So we can say something like scores = nn.functional.softmax, pass in the scores, and again here is where we want to say dim=2, because scores is B by T by T, but we actually want to do this for every T by T. And then, given a T by T, if you wanted to set every row to sum to one, you're trying to apply softmax to every row, going across the columns. So it's actually that final dimension over here, dim=2. And now that we have the scores, we can use them to get our transformed output. So
we would say something like: our transformed output is simply scores matrix-multiplied with our values, and then we can simply round this. So return torch.round of transformed with decimals=4. And we're done, and we can see that it works. I hope this was helpful. And
definitely leave a comment if there's anything that you would like me to explain in more detail. There's definitely a ton of concepts embedded within coding up this class, and we're a few problems away from having a working GPT from scratch. Okay, in this video we're going to explain multi-headed attention. And as you can see from the transformer architecture, which you may have seen before, it's one of the most important parts of the model. It actually occurs over here. We won't include this other attention block right here, since that has to do with the encoder, which I've actually cropped out of the diagram; large language models like GPTs don't use encoders. But attention, specifically this masked multi-head attention (and we're going to really focus on explaining the multi-head aspect of attention in this video), is one of the aspects of the GPT that makes it so powerful. It was introduced in 2017 by researchers at Google, and we're going to break down exactly how it works. But before we break down multi-headed attention, let's do a quick breakdown of how single-headed, or just normal, attention works. So when ChatGPT or any language model receives an
instruction like "write me a poem," we know this is something that ChatGPT does pretty well on. Well, the model actually needs to know what parts of this input sentence to focus on, specifically to pay attention to. Not all of these tokens are important, right? Obviously, the more important ones in this sentence are "write" and "poem." Imagine if you just said "write poem" to ChatGPT. It would still understand what you're requesting. That shows us that these are the two most important tokens, and the model needs to learn to pay attention to these when it's given a real sentence like "write me a poem." And this is actually what attention solves. So this is what the attention layer solves. It's actually a layer in a neural network, and here's what it will take in: it will take in the embedded version of a sentence. So the first step in any kind of language model is to actually tokenize the sentence. So we'll break up the sentence into tokens. And
for now, let's just assume we're using a word-level tokenizer. So we have some sort of matrix, or a tensor in PyTorch, that is actually going to represent this sentence. And we're going to say this is T by E. The reason this has T rows is because we have T tokens in every sentence. If we break this down on the word level, we'll have four tokens here, so we would actually have four rows here. And then E is something called the embedding dimension. Let's just assume an embedding dimension of two. Then we know our matrix or tensor looks something like this. And the embedding for every single token, so for write, for me, for a, for poem, this is going to essentially be a vector that encapsulates the meaning of the word. So you can think of it as a feature vector that summarizes the information in that word in a way the model can actually understand, and that will actually be learned through the process of training a neural network.
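As a sketch of that input (the vocabulary here is hypothetical, and in a real model the embedding table is learned during training rather than random):

```python
import torch
import torch.nn as nn

# Hypothetical word-level vocabulary for the example sentence.
vocab = {"write": 0, "me": 1, "a": 2, "poem": 3}
T, E = 4, 2  # T tokens, embedding dimension of 2

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=E)
token_ids = torch.tensor([vocab[w] for w in ["write", "me", "a", "poem"]])

embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([4, 2]): the T by E tensor attention takes in
```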
But what does the attention layer output? That was just what the attention layer takes in. The attention layer is actually going to output a tensor that, instead of being T by embedding dimension, is going to be T by attention dimension. So T by A; we can call this A for the attention dimension, also sometimes called the head size. And this is also technically going to be a vector for every token, but it's going to be a transformed vector that encapsulates the relevant parts of the token that the model actually needs to pay attention to. So if I had to summarize attention in just a couple of sentences, it's actually a communication mechanism. Attention says we don't want to look at each token in a sentence completely independently. We want to kind of break these walls. We want to let the tokens talk to each other, and we want to let them actually have some time, specifically through the attention layer in a neural network. And
we want to let them match up with each other until the tokens that are actually relevant to each other have paired up, right? Because every token has some kind of relationship, right? The relationship between write and poem is that poem is what needs to be written. The relationship between me and, let's say, write: well, if you remember your grammar (again, not super important), me is kind of like the indirect object here; the poem is being written for me. So clearly, for a model to really understand a language, it needs to understand the relationships between the tokens in a sentence. So to summarize, attention is essentially a communication mechanism that lets tokens talk to each other. Okay, so now let's explain what we mean by multi-headed. Let's say attention dim
equals 8 and embedding dim equals 4. So then the input to our attention layer, which I'm just calling A in this black box, is a 4 by 4 tensor. And I'm just assuming T equals 4, using the same sentence, "write me a poem," and then the output would have to be 4 by 8. But what if we wanted to do something called two heads of attention, so multi-headed attention? Well, we would have two separate instances of A, where A is doing self-attention. And these heads, or these two separate instances, are going to operate in parallel, completely separately. They'll have their own parameters and weights that are trained.
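As a sketch of that idea (the Head module here is a minimal unmasked single head, just to make the example self-contained; the names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of self-attention: (T, E) in, (T, head_size) out."""
    def __init__(self, embedding_dim, head_size):
        super().__init__()
        self.get_keys = nn.Linear(embedding_dim, head_size, bias=False)
        self.get_queries = nn.Linear(embedding_dim, head_size, bias=False)
        self.get_values = nn.Linear(embedding_dim, head_size, bias=False)

    def forward(self, embedded):
        K = self.get_keys(embedded)
        Q = self.get_queries(embedded)
        V = self.get_values(embedded)
        scores = Q @ K.transpose(0, 1) / (K.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    def __init__(self, embedding_dim, attention_dim, num_heads):
        super().__init__()
        # Each head shrinks to attention_dim // num_heads so that the
        # concatenated output is still attention_dim wide.
        head_size = attention_dim // num_heads
        self.heads = nn.ModuleList(
            Head(embedding_dim, head_size) for _ in range(num_heads)
        )

    def forward(self, embedded):
        # Heads run independently, each with its own parameters;
        # concatenate their outputs along the last dimension.
        return torch.cat([h(embedded) for h in self.heads], dim=1)

out = MultiHeadAttention(embedding_dim=4, attention_dim=8, num_heads=2)(torch.randn(4, 4))
print(out.shape)  # torch.Size([4, 8]): two (4, 4) head outputs concatenated
```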
Let's keep the output dim equal to eight. So if we want to do that, then we're actually going to have to change the attention dim for each instance of A, each head as they call it, to four. That way we can still have 4 + 4 equal to 8. But if we have two 4 by 4 outputs, how do we get the output dim equal to 8? Well, we can use concatenation, right? We know there's a function in PyTorch; don't worry if you're not familiar with this, but we can actually concatenate two tensors and tell PyTorch along which dimension to concatenate. And we want to concatenate not along the zeroth dimension, but rather the first dimension. And that will actually give us a 4 by 8 tensor. And let's say we had an attention dim of 16 and we wanted to have four heads of A. Then each head size would actually be four, since 4 * 4 is 16. So we call this number the head size, and we can call this the overall attention dimension. So what multi-headed attention is, is just doing this normal attention but having a bunch of instances of it operating in parallel, each of them with their own trainable weights, biases, and parameters that are learned through gradient descent. And then we're going to take all their outputs and concatenate them, and that's what the output of multi-headed attention's forward method would be. So
why does multi-headed attention actually yield better results in neural networks than just having one single head of attention that has the same attention dim as the output dimension we get when we concatenate all our single heads together? Why does it give better results to have these two separate instances of self-attention, or sometimes four or six or eight? Well, the number of parameters is staying the same, right? The number of weights, the number of trainable or learnable parameters in the model, is staying exactly the same. We know that the queries, keys, and values (if you've seen the video on the lower-level details of attention) are linear layers that occur in each single head of attention, for the model to actually figure out the relationships and scores between pairs of tokens. But if we're shrinking the attention dim for each head, then the number of parameters is staying the same. So why are we getting better results when we do two different heads of attention, one here and one here, and then go ahead and concatenate the results? Well,
Well, if we think about it, each head of attention gets to operate on the input separately. The embedded input, which is T by embedding dim, is passed separately, in parallel, to each head of attention. So each head gets to operate and do a bunch of math on that input tensor. And since each head of attention has its own parameters, each head can actually learn something different. In fact, what we find after language models are trained is that each head of attention specializes in learning some component of the language. When researchers did an analysis of BERT (BERT is a very popular transformer that Google uses for all sorts of things, such as Google Translate), they found that one head of attention was specialized at looking at direct objects of verbs, and another head was specialized at looking at indirect objects of verbs. So intuitively, we can see these large language models, these models with billions of parameters, as having different heads and different components of the neural network that specialize in learning different parts of the language and its grammar. That's the main benefit of multi-headed attention: we have different heads that can each specialize in learning something different, even though the number of parameters stays the same whether we use one head or many.
So that's multi-headed attention. It's actually not a terribly complicated concept, but it is, most people would agree, the crux of a GPT. It is the most powerful layer of a GPT, and it is what allows these modern-day language models to learn English, and really any language, so well. So I'd highly recommend trying to code it up now. After you do that, you'll be ready to code up the transformer block, this giant rectangular block you can see in the picture, and then you'll almost have a working GPT. In this problem, we're going to solve multi-headed attention. And again, the first thing I want to say is that I highly recommend you solve single-headed attention before this problem, because if you understand single-headed attention, this problem is not that hard; it's actually kind of easy. Single-headed attention, on the other hand, which this problem builds on, is a lot harder, just because so many new concepts are introduced in that problem. So we're going to assume you've solved that problem, and then this one isn't going to be that bad. We'll still generally explain the idea of attention, but we won't go into as much detail as we did in the single-headed attention video.
So, the problem statement says that this layer is what makes LLMs so good at talking like real people. Single-headed attention is pretty good, but this whole multi-headed thing, which we're going to explain in this problem, is what made the latest transformer neural networks so much better at being the chatbots you might have seen in recent years. I've also made an explanation video of the concepts if you'd like to watch that as well. So we have to code up the multi-headed self-attention class, and the single-headed attention class is given: its solution code is in the starter code for this problem, and we're going to make instances of that class to solve the multi-headed attention problem. We're told that the forward method needs to return a B by T by attention dim tensor, so we return a tensor of the same shape as in the single-headed attention problem.
We're given embedding dim; that's our first input, and it's something we were also given in the previous problem. We're also given attention dim. We're given embedded, which is B by T by E; this was also given in the previous problem. However, our new input, the one we really need to pay attention to, is num heads. This is the number of times we're going to make an instance of the given self-attention class. And we have the constraint that attention dim mod num heads is zero, so attention dim is a multiple of num heads. There's an example here. The numbers aren't going to make a ton of sense, but they do help us understand the shapes. Our input is supposed to be B by T by E; that's the shape of embedded. We can infer that the batch size is two, and T, the context length, is also two; you can infer that from the example over here. Then E is two as well, which makes sense given that we have two different 2x2s in our input embedded. And if attention dim is three, and B and T are supposed to remain the same, then we have two different 2x3s, so the output is 2 by 2 by 3, and that is exactly what we have over here. So the shapes do make sense.
So what actually is multi-headed attention? Previously, we would take an instance of single-headed attention, pass in our B by T by E input, and whatever single-headed attention returned was the output. Now, depending on what the num heads parameter is set to, we'll make that many instances of self-attention and have them operate in parallel. Let's say num heads was three; then we have three heads over here. If our input is that B by T by E tensor embedded (I'm just going to simplify it to X for now), we pass X into all of our heads of self-attention. If I had a fourth one over here, X would also be passed in there. Then we concatenate all of these outputs together, the outputs of each head. Each head of self-attention, each of these boxes over here, was able to operate independently, process X, and obtain a different output, because each of these layers, being separate instances, has different weights. They're learning different keys, queries, and values, the linear layers that make up single-headed self-attention. Ultimately, they're going to generate something different since, again, they have different numbers, different weights; they're trained separately even if they're given the same input. So we concatenate all their outputs together. We can call this the concatenated version, and that is what we return from the multi-headed attention forward method.
And that's it. Then, in the transformer neural network architecture (and remember, we're getting pretty close to coding up the full GPT class), we're going to use the multi-headed attention class as a layer. It becomes one of our building blocks for this giant block over here, which is called the transformer block. So why does this actually work better than single-headed attention? Remember that attention dim, which we can call A, mod num heads is zero. For each head, when we instantiate that single-headed attention object, we make each head have a head size of A floor-divided by num heads, whatever that integer is. So our total number of learnable parameters in the neural network isn't changing. If one option is a single head with an attention dim of A = 64, and the other option is eight heads with A = 8 for each of them, then the total number of learnable parameters hasn't really changed, even though we have multiple heads each operating in parallel. Each of them gets to process the input, that input called embedded, and learn parameters, albeit each head learns fewer parameters than if we just had one giant head. So why does this yield better results? Because the general trend is: more parameters, bigger model, the model can learn a more complex relationship, and we get better results.
Well, even though each head has fewer parameters than if we had one massive head, it turns out that each head gets to learn something different, because our input X is independently passed into each head. So each head independently gets to learn something different. And when we look at how the heads and their weights get activated on different inputs, researchers have found that you'll actually have one head that might specialize in attending to the next token, one head that really specializes in attending to the direct objects of verbs (this is all about which parts of the input we pay attention to), and one head that might specialize in the indirect objects of verbs. We know that for a model to truly understand language, it needs to understand grammar, right? It needs to understand the structure of the language and how all the words are put together, or how all the tokens come together. So we find that each head specializes in maybe a different grammar rule, and this seems to be why multi-headed attention boosts the performance of the neural network, rather than having one head that's responsible for learning all the rules of the language.

Okay, now let's jump into the code. The first thing we need to think about is that we're going to have a bunch of instances of this given single-headed attention class, so we're probably going to need to store those in some kind of collection in our constructor over here. But it turns out that in any class that is a subclass of nn.Module, and this multi-headed self-attention class is a subclass of nn.Module, instance variables that hold layers also need to be registered as neural network parameters. So if I did something like self.heads and made it a normal Python list, this isn't going to work correctly with PyTorch, because of the restriction I just mentioned. There's actually a class for this called nn.ModuleList. So I can say self.heads = nn.ModuleList(), and it works just like a normal Python list: you can append to it, but it is restricted to only storing other modules, that is, neural network layers.
So then we can say something like: for i in range(num_heads), append those instances. So self.heads.append, and we instantiate here; for these inner classes, we have to refer to the class through self, so self.SingleHeadAttention, and we pass in the embedding dim. But the attention dim for each head is just the size of that head, and we know that should be attention_dim floor-divided by num_heads. So now we have all our heads defined; we just need to write the forward method.
We need to actually get all those outputs and concatenate them together. We can make a list for all our outputs, so we say outputs is just an empty list. Then we want to call the forward method of every head of single-headed attention, so we can say: for head in self.heads, append to the list with outputs.append, and that would be head.forward(embedded), or simply the default call, head(embedded). Now it's just about concatenating those together. We can say concatenated = torch.cat(...). torch.cat expects two things: the first is a collection of the things to concatenate, so our collection here will be the list outputs, and the second is a dimension, something like 0, 1, 2, and so on. If we think about the size of each element in outputs, each one is B by T by head size, since that's the output dimension from each head of attention over here. We want to concatenate them along that last dimension, so that when we concatenate them all together, it becomes B by T by the overall attention dim that was given as a parameter in the constructor for multi-headed attention. So I would simply say dim=2, or you can say dim=-1 for the final dimension.
Either would work. Then we simply round our output, so return torch.round of the concatenated tensor with decimals=4, and we're done. We can see that it works. If you found this helpful, definitely leave a comment, or leave a comment if there's anything else you'd like me to go into more in depth on. Now that we have this multi-headed self-attention class done, we can use it as a layer when we code up the transformer block in the next problem. So definitely check that one out next, as it's our next step in getting closer to having a working GPT all the way from scratch.
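Putting the pieces described above together, a minimal sketch might look like the following. The course's real SingleHeadAttention class is provided in its starter code, so a toy stand-in (just a linear projection, to make the shapes work) is used here; the multi-head wrapper is the part the walkthrough builds:

```python
import torch
import torch.nn as nn

# Toy stand-in for the SingleHeadAttention class from the starter
# code; the real class computes keys, queries, and values. This one
# only reproduces the output shape: B x T x head_size.
class SingleHeadAttention(nn.Module):
    def __init__(self, embedding_dim, head_size):
        super().__init__()
        self.proj = nn.Linear(embedding_dim, head_size)

    def forward(self, embedded):
        return self.proj(embedded)

class MultiHeadedSelfAttention(nn.Module):
    def __init__(self, embedding_dim, attention_dim, num_heads):
        super().__init__()
        # nn.ModuleList registers each head's parameters with PyTorch;
        # a plain Python list would not.
        self.heads = nn.ModuleList()
        for _ in range(num_heads):
            # Each head's size is attention_dim // num_heads.
            self.heads.append(
                SingleHeadAttention(embedding_dim, attention_dim // num_heads)
            )

    def forward(self, embedded):
        # Run every head on the same input, then concatenate the
        # B x T x head_size outputs along the last dimension.
        outputs = [head(embedded) for head in self.heads]
        return torch.cat(outputs, dim=-1)  # B x T x attention_dim

x = torch.randn(2, 2, 4)                # B=2, T=2, E=4
mha = MultiHeadedSelfAttention(4, 8, 2)  # attention dim 8, two heads
print(mha(x).shape)                      # torch.Size([2, 2, 8])
```

The rounding step (torch.round with decimals=4) from the grader is omitted here since it only matters for matching the expected output.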
Okay. In this video, we're going to completely break down the transformer block and explain how every single one of its components works. We'll touch a little bit on the embeddings and on the final two layers, linear and softmax, but we're primarily going to focus on explaining the transformer block: the attention, the add-and-norm, and the feed forward. We're going to break these down one by one, starting with the add.

The add actually refers to a concept called skip connections, or residual connections, and here's how it looks visually. We have some arbitrary layer in a neural network; this might be a linear layer, this might be an attention layer. It takes in some input X, which we can see over here. But instead of just taking the output of the layer as the output, we'll also let some portion of X completely bypass the layer and get added to the output of the layer over here. In code, if we were writing the forward method for some kind of layer and wanted to incorporate skip or residual connections, we would return layer(x) + x: call the forward method of the layer, passing in X, but also add X. To get some intuition for what this means: the model will learn the right weights and biases for this layer so that the right proportion of X is either sent through the layer or allowed to bypass it. Maybe we don't actually want to transform X with this layer; maybe we want to retain X's original identity and pass most of it through, or at least incorporate it into our output for this layer. So the model can, essentially through training, through minimizing the loss on the training data, figure out how much of X should bypass the layer and how much should be sent through it. That's some of the intuition for what a skip connection is doing.
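The layer(x) + x pattern just described can be sketched as a small wrapper module. The Residual class name here is hypothetical, not from the video; it just packages the idea:

```python
import torch
import torch.nn as nn

# A hypothetical wrapper illustrating the residual pattern: the
# output is the layer's transformation of x plus x itself.
class Residual(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        return self.layer(x) + x  # the "add" in add-and-norm

# The wrapped layer must preserve the shape so the addition works.
block = Residual(nn.Linear(8, 8))
x = torch.randn(4, 8)
print(block(x).shape)  # torch.Size([4, 8])
```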
However, why do we actually need to do this addition? Because technically, shouldn't the model just be able to learn the right weights and biases in this layer so that it doesn't change X at all? Why do we have to add X to achieve our desired results? Well, there's actually another big benefit of skip connections, and it's the main reason they were originally added to deep neural networks, and that has to do with solving the exploding or vanishing gradient problem, also called the exploding or vanishing derivative problem. We know that calculating derivatives and gradients during training is very important for gradient descent, for minimizing our loss by updating our parameters. However, something that can occur with these super deep neural networks, networks with tons of layers left to right, is that gradients can become either super big or super small. Recall our update rule for gradient descent: for any weight or parameter in a neural network, the new weight is just equal to the old weight minus alpha, the learning rate, times the value of the derivative or gradient. If this derivative value is too small, if it's vanishing and going basically to zero, then this whole term goes basically to zero and the new weight is almost the same as the old weight, so we were essentially not able to update the parameter at all. Similarly, if the value of the derivative is way too large, then the new weight is going to be drastically different from the old weight, and that won't be great for the neural network's performance.
So we would like to mitigate this problem in some way or another, and it turns out that this simple addition does end up reducing it. Recall from calculus that if you have some function f(x) that is just the sum of two other functions, g(x) + h(x), then when you take the derivative, f'(x) = g'(x) + h'(x). So when the model calculates all the derivatives, when we call loss.backward() in PyTorch, the model will have an additive term when calculating the necessary gradients for this layer. That significantly helps reduce the exploding and vanishing gradient problem: the fact that we're doing an addition instead of a multiplication means the gradients don't drastically compound, drastically increasing or decreasing in value, as the network gets deeper and deeper. So the short answer is that incorporating some kind of addition seems to keep the gradient values from getting too big or too small. Researchers have found this time and time again with neural networks, and that is why skip and residual connections are included in the transformer.
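In symbols, the sum rule mentioned above applies directly to a skip connection. Writing the layer's transformation as F(x):

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x} = F'(x) + 1
```

so the gradient flowing backward through the layer always carries an additive identity term of 1, no matter how small F'(x) becomes; that is the additive term that keeps the gradient from vanishing as layers stack up.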
So the norm in the transformer block refers to something called layer normalization. This is a module of its own in PyTorch that you can instantiate by saying nn.LayerNorm and passing in, in our case, the embedding dimension. When you tell the LayerNorm that you're passing in the embedding dimension, that is the dimension along which it will normalize.

So what does it actually mean to normalize, in statistics? It essentially means to recenter our data so that it all revolves around the mean, so it will look kind of like a bell curve where the mean is the most probable data point. The way we normalize data is that for every single data point x, you subtract out the mean and divide by the standard deviation, and that recenters your data around the mean and gives it that shape. However, this would kind of defeat the whole point of a neural network if there were nothing learnable or trainable about it; it would just be a direct formula for this layer that restricts the output to have this shape. So, to still give the neural network some learning capacity, for every data point we take this (x minus mean) divided by standard deviation, multiply it by some other number gamma (the symbol is not super important), and add some other number beta. These two are adjustable and learnable across the iterations of gradient descent that we perform.

But the question should be: why does this actually improve the performance of neural networks? It has been found to increase the training speed of neural networks and make deep learning far more effective. Researchers are actually still not entirely sure why normalization tends to improve the training of neural networks, but their current hypothesis is this. We know that the neural network starts off with totally random parameters, totally random weights, and to some extent our data could be random as well. And if, during training, there are drastic shifts in the nature of the data, meaning in the two main attributes that characterize data, the mean and the standard deviation, if these change drastically in random ways, then neural network training tends to be really slow and just not as effective in terms of convergence. So by centering the data using the mean and the standard deviation, while still adding in some learnable parameters so this isn't too restrictive, researchers have found that the performance of the neural network does increase, and that's why layer normalization is used in a transformer.
One additional thing I just want to clarify is what data we're actually normalizing. The input to this layer would be something that's T by E, where T is the time step, the length of some kind of sequence like our sentence, say the number of words in a sentence, and E is the embedding dimension. You can think of E as the size of the feature vector for every token: for every word we have a vector that encapsulates the meaning of that word, our embedding vector of size E, or size embedding dim. When we say we're doing layer normalization, it means we normalize the actual features, not the time steps or the batch dimension. If we had multiple examples being fed in, so batch by T by E, we would still normalize along the embedding dimension: for every single time step we have a vector, and we normalize along that dimension.
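A quick sketch of normalizing along the embedding dimension, as described above; the sizes are illustrative:

```python
import torch
import torch.nn as nn

# LayerNorm over the embedding dimension: each token's E-sized
# feature vector is normalized independently, across every batch
# element and every time step.
B, T, E = 2, 3, 8            # batch, time steps, embedding dim
norm = nn.LayerNorm(E)       # normalize along the last (E) dimension
x = torch.randn(B, T, E)
y = norm(x)

# Each token's vector now has (roughly) zero mean and unit variance;
# gamma and beta are norm.weight and norm.bias, learned in training.
print(y.shape)               # torch.Size([2, 3, 8])
print(y.mean(dim=-1))        # close to zero for every (batch, step)
```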
Okay, masked multi-headed attention; let's explain how this works. During training, let's say the model is being fed an example, this sentence: "write me a poem". We'll quickly explain what the masked part means. Because this model is always predicting the next token in the sequence, the model does not get to look at future tokens. So when the model is fed a sentence like this, there are actually a few training examples within it. The first one is that given "write", "me" comes next. The next one is that given "write me", "a" should come next. And finally, the model sees that given "write me a", "poem" should come next. The masked part means that when the model is making its next-token prediction, given all the tokens before it, the model doesn't actually get to see what comes next. So if the model is tasked with predicting "poem", and the model has the context "write me a", then in our code, in our tensors or matrices, we would mask out the word "poem" so that the model cannot see the answer before it has actually made its prediction. That's what the masked part means. Now, let's explain the multi-headed attention part, which is definitely far more significant.
So, when the model receives a complicated instruction like this, the model needs to know what part of the input to pay the most attention to, because not all words in the input are equally important. Moreover, there are actually relationships between every pair of tokens. We know there's a strong relationship between this pair of tokens, "write" and "poem", because what are we writing? We're writing a poem. The model needs to pay attention to that. We're not writing a book, we're not writing a script, we're not writing a movie; the model needs to know what to write. So this is clearly an important pair of tokens. What does attention do? Attention is the model letting these tokens talk to each other, and every single pair of tokens will be considered until the model has figured out which pairs of tokens are important and which ones aren't. Then the model is able to generate some sort of new feature vector, instead of the embeddings that were initially passed in, and the model now knows the important parts of the input to focus on. So attention, if we had to summarize it, is a communication mechanism. Attention, at a high level, is the model letting the tokens talk to each other until the right relationships between them are learned.
And the multi-headed component refers to the fact that the model performs this attention process in multiple attention heads on the same input. The same input goes into this attention head as well as into further attention heads; we might have three, four, five, or six, and so on. The model will then concatenate the output of all the attention heads and pass that to the next layer of the neural network. Attention is a highly powerful mechanism for the model to gain a deeper understanding of the language, and because it's so powerful, we want many different heads that each learn their own weights and each operate on the input. This head of attention operates on the entire input and gets to learn its own weights and biases; the entire input is also passed into this other head of attention, which learns its own set of parameters. So by exploiting multi-headed attention, the model is able to learn and understand language at a far deeper level.
Okay, feed forward. It might actually surprise you that feed forward is extremely simple: it's just a traditional, or vanilla, neural network. That means there won't be any attention in this network, nothing complicated except linear layers: linear layers with an arbitrary number of nodes or neurons, plus some nonlinear activations like the ReLU activation, and we may have dropout as well. But it is still a traditional, fully connected neural network once we toss in some nonlinearities and some dropout. And it turns out that first having the communication in the attention layers, and then, after these tokens have paired up and the model has figured out which tokens are relevant to focus on, doing a bunch of computation, essentially a very complex mathematical formula with weights, biases, and a bunch of matrix multiplication (which, remember, is essentially just doing linear regression), is actually highly effective for the model's performance and its ability to ultimately predict the next token.
token. So the last part of the transformer architecture and specifically the decoder is actually the linear and softmax components. And if
you're coding up the transformer block, you won't have to worry about this yet, but you absolutely will once you actually code the GPT class. So the
whole point of that final linear layer in softmax is to actually get an interpretable prediction. So for that
interpretable prediction. So for that linear layer, it will actually look something like this. It would be nn.linear of the attention dim. So
nn.linear of the attention dim. So
attention dim as that's kind of the new dimension after we've done all of this attention after the embeddings. So it
would be the attention dim, but that's actually not the most important thing.
The output features or the number of neurons in this layer should actually be vocab size. And the reason for that is
vocab size. And the reason for that is for every single time step that this model receives in every single token in the input sequence, the model wants to predict what token comes next. And
that's going to be a bunch of probabilities, right? So we want to get
probabilities, right? So we want to get a vector of size vocab size where we can interpret each number in this vector, each entry in this vector as the
probability that the token or character or word corresponding to that index is going to come next in the sequence. And
then to squash those numbers between 0 and one and actually make them all add up to one so we can interpret them as a series of probabilities. That's what the softmax function is for. And that will
be the decoder transformer. So, if
you're following along with the code, you now have the information you need to code up the transformer block class, which will actually you'll be given the multi-headed attention and feed forward
classes. And then you can go ahead and code the GPT class, where you'll be given the block class. You train the model, and then we'll finally code up generate, where we will finally see our trained language model generate text that looks pretty good and pretty similar to English. So, I recommend jumping into the code next. It's finally time to code up the transformer class. This is actually the last class, and maybe the most important class, that we're going to write in defining the GPT model. The next problem is going to be writing the GPT class, and we're going to use this transformer block class there. And this transformer block class that we're about to write is going to make use of the multi-headed attention class that you've previously written.
So the transformer block is actually this gray rectangular block that is repeated Nx times in this diagram, and it is itself a giant neural network layer that we're going to use in the GPT class. There is also a really in-depth explanation of the transformer block over here that breaks it down layer by layer and explains what every piece of this block is doing; I recommend checking that out. But in this video, we're also going to go over how the transformer block works at a general level, and we're going to code it up as well. This transformer block class is going to subclass nn.Module, and it's going to have a forward method, because we're going to pass some sort of input into the transformer block and get some sort of output. And it needs to return a tensor that is B by T, where T is our context length, by model dim; let's just call that capital D. One thing I want to clarify is that in the previous problems when we were introducing attention, I made a distinction between embedding dim and attention dim, but they're actually the same thing. There's just one giant parameter called model dim, and they're the same number. Model dim mod num heads is zero, and the head size for multi-headed attention is model dim divided by num heads. So let's make some sense of this example. We don't really need to worry about the numbers; we just need to understand the shapes. The input, we call that embedded, since it's actually the output from the embedding layers of this neural network, right before the transformer block starts in the diagram that we just showed. So the input is B by T by D, where D is the model dimensionality, and the output is also B by T by D. We're told that the model dimensionality is four, as we can tell from these four columns over here in the input and the output, and then we can infer from this that T equals 2, the context length is two, and we can also infer that the batch size, the outermost dimension, is two. So
now let's jump into how the transformer block works. One small change from this diagram is that this middle multi-headed attention part, which is actually factoring in information from the left part of the transformer diagram, we're not going to incorporate at all. Transformers like ChatGPT only use this right part of the transformer, which is actually called a decoder, although that term is not super important for now. They only use the decoder and don't include the left part of the transformer at all, so we're just going to completely ignore that. The next thing we need to explain is that in the original transformer diagram, they described this as add and norm over here, and the same thing over here. But it would actually be more accurate to say norm and add: researchers have found that norming first and then doing our skip connection, or add, works better. We'll definitely explain what these actually mean in a sec, but norm first is actually what's done.
So let's explain the norm layer. This refers to something called layer norm, which you can actually use as a layer: make an instance variable inside the constructor for the transformer block class, and that's going to be nn.LayerNorm, and you simply pass in the model dimension. This tells the layer that whatever tensor is passed in, and it's B by T by D, go ahead and do the normalization along this last dimension. And what we mean by normalize is that when you have some sort of data, to normalize it just means to center it around some sort of fixed mean and standard deviation. So this is kind of an example of normalizing data. We can see that the data is all centered around the mean, the symbol mu. The mean is the center of the data, and the standard deviation reflects the width, how the data is stretched out from left to right. And researchers have found that this really does boost the performance of transformers.
And researchers are actually not entirely sure why layer normalization seems to improve the training of neural networks. But it seems to be something along the lines of: we start off with some random distribution of the weights in the neural network, and we don't want them to change so drastically that it makes our training process really unstable. So next let's explain the add, also known as skip connections. One additional thing that is mentioned in the problem is that the norm is actually going to come before the attention, instead of attention before the norm. So now we can actually explain this add in add and norm, also known as skip connections. We have our input to the transformer block, X, which is going to pass through layer norm, and then the output of layer norm goes into multi-headed self attention. But then we actually add the unchanged X all the way back to the output of multi-headed self attention, and this is the output of at least the first part of the transformer block over here. And how we would actually do this in the forward method is we would just say something like X plus multi-headed self attention of the norm. Remember, we're going to norm first and pass X into the norm. And again, this is covered in more detail in the transformer block background video, but the short explanation for why this skip connection over here, letting some portion of the input bypass these layers entirely and actually adding X to the output over here, is that having some sort of additive term, instead of having only multiplication in this neural network, seems to actually smooth out our gradients and slightly mitigate the exploding and vanishing gradient problem during training. So
then the output of this first part of the transformer block over here is then passed into the second part of the transformer block over here. And again, we're actually going to apply the norm before the feed forward, even though in the original diagram it says feed forward and then norm. So we're going to do norm, then feed forward, and then add this component, the one which came out of the first part of the transformer block, back in over here. And that's our second skip connection. So, what's the whole point of this feed forward component of the neural network? Well, this is also called a multi-layer perceptron: multi-layer because we're going to have multiple linear layers and then nonlinearities; that's maybe the perceptron part of this term, if you want to think of it that way. It's also
called a vanilla or just a standard neural network. Well, if the attention component of the neural network over here is our communication mechanism, where we let the tokens talk to each other (and again, I highly recommend understanding multi-headed attention through those coding problems before we move into this transformer block problem), then the feed forward part of the neural network is our computation mechanism. This is where we have a bunch of neurons in each linear layer of this feed forward neural network that are learning a bunch of weights and biases, w1 times x1 plus w2 times x2, doing a ton of linear regression. So after we've let the tokens talk to each other, after we've let the tokens do their communication, we want to let the tokens have some time, as they pass through these linear layers, and of course nonlinearities like ReLU as well, to actually do computation. And this is all going to factor into the model's prediction in the final layer or two, where the model predicts the next token. And just
looking ahead a bit, this is not going to be part of the code for this problem.
It's going to be part of the next problem, where we actually code the full GPT class. But just looking ahead a little bit: after the model has done all this communication and all this computation in this transformer block (and in the GPT class, by the way, the transformer block is going to be repeated some number of times, so we would take the output of the transformer block, pass it back in, and do this some number of times, maybe six times, maybe twelve times), after the model has done all this learning within the transformer block, let's take a look at what's to come. There's going to be a linear layer where in features is model dim. This should make sense given the dimensions used in the transformer block. And then we're actually going to project to a dimension of vocab size, so this linear layer has vocab size different neurons. We're going to have essentially vocab size different numbers for every single token at every single time step. And the reason we have that many different numbers being predicted is because each number is going to correspond to a probability, the chance that the corresponding token comes next. And we know we have exactly this many different tokens in our vocabulary, so any one of them could come next in the sequence as this model is learning to predict the next token. So we're going to need that many different numbers to get the full prediction. So then just to clarify, looking ahead a little bit as to how this model is going to be trained once we've fully written the GPT class: we know we have a B by T input, a
batch of sequences, each of length capital T. And if you've solved the GPT dataset problem (I highly recommend solving that one before doing any of these transformer problems), we know that we have B by T labels, because for every context, for all the sequences of tokens, the model is trying to predict the next token. So the correct answer, the labels, is just the tokens offset by one, and that becomes clear in the GPT dataset problem. And then our output is B by T by V: for every single batch, and for every single token, every single time step in the sequence, we have a vector of size V, the vocabulary size, which we're thinking of as a list of probabilities. The softmax layer over here is going to make all the numbers between 0 and 1, so they can actually be interpreted as probabilities. We have a bunch of probabilities for which token is going to come next in the sequence; that's the model's prediction. And then the loss, or error, is going to be calculated using two things: the model's prediction of the probability for what token comes next, which is in this vector of size V, and the token we actually know comes next, which is in our labels, extracted from the raw dataset. That is used to calculate the loss. We drive the loss down during training using an optimizer, following the gradient descent algorithm. And in the end, we have a model that can predict the next token really well.
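A minimal PyTorch sketch of that loss calculation, with made-up sizes; one thing worth noting is that PyTorch's cross entropy is fed the raw scores (logits) and applies the softmax internally:

```python
import torch
import torch.nn as nn

B, T, V = 2, 4, 10  # batch size, context length, vocab size (made-up values)
logits = torch.randn(B, T, V)          # model output: one score per vocab token
labels = torch.randint(0, V, (B, T))   # the correct next token at every time step

# cross_entropy expects (N, V) scores and (N,) labels, so flatten batch and time
loss = nn.functional.cross_entropy(logits.view(B * T, V), labels.view(B * T))
print(loss.item())  # a single number that the optimizer drives down
```

Every one of the B times T positions contributes its own next-token prediction to this single loss number.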
Before we jump into the code, I just wanted to clarify one thing. In the constructor for the transformer block class, you will definitely need to make two different instances of nn.LayerNorm, one for this layer norm and one for this layer norm, so that we can learn different parameters across training. So now let's code this up. We know the first thing we need is our instance of multi-headed self attention. So we can say self dot multi-headed self attention, using the class given below. The only things we really need to pass in to that constructor, well, why don't we check the constructor for multi-headed self attention? We just need to pass in model dim and num heads, as is visible over here. So scrolling up, we can pass in model dim and num heads. The next thing we're going to want to do is make our first layer norm. So self dot first layer norm is nn.LayerNorm, and this takes in model dim, the channel or dimension along which to normalize. Then we need our second layer norm, so second layer norm, and that's nn.LayerNorm, and again we pass in model dim. And we're also going to need a feed forward layer, right? The vanilla neural network. So we can say self dot feed forward equals the vanilla neural network class. This is actually given below if you want to check it out; it's just a couple of linear layers with a ReLU function thrown in between. And then we go ahead and again pass in model dim. And that's it for the constructor.
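Here's a sketch of where we're heading, the constructor plus the forward method we're about to write. The course provides its own MultiHeadedSelfAttention and VanillaNeuralNetwork classes; as stand-ins so this sketch runs, I use PyTorch's built-in nn.MultiheadAttention (which, unlike the course's masked version, is not causal here) and a small two-layer network with a hidden size of 4 times model dim, a common convention but my assumption, and I leave out the exercise's final rounding step:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, model_dim, num_heads):
        super().__init__()
        # stand-in for the course's MultiHeadedSelfAttention(model_dim, num_heads)
        self.mhsa = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.first_layer_norm = nn.LayerNorm(model_dim)
        self.second_layer_norm = nn.LayerNorm(model_dim)
        # stand-in for the course's VanillaNeuralNetwork(model_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(model_dim, 4 * model_dim),
            nn.ReLU(),
            nn.Linear(4 * model_dim, model_dim),
        )

    def forward(self, embedded):
        # first part: norm first, then attention, then the skip connection (the add)
        normed = self.first_layer_norm(embedded)
        attn_out, _ = self.mhsa(normed, normed, normed)
        first_part = embedded + attn_out
        # second part: norm, feed forward, and the second skip connection
        return first_part + self.feed_forward(self.second_layer_norm(first_part))

x = torch.randn(2, 3, 8)  # B=2, T=3, model_dim=8 (made-up values)
block = TransformerBlock(model_dim=8, num_heads=2)
print(block(x).shape)     # same B by T by D shape in and out
```

Note the two separate nn.LayerNorm instances, matching the clarification above.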
Now we can go ahead and write the forward method. We know that for the first skip connection, the first part of the transformer block, we just want to take embedded, which is like the X that we talked about earlier, and then do self dot multi-headed self attention, and inside that invoke our first layer norm, self dot first layer norm, passing in embedded. So this does reflect the diagram that we talked about earlier. And then we can save this in a variable called, like, first part. This is the first part of the transformer block, the part that doesn't have the feed forward layer. And then what we're going to do is take first part and add the result of the second part. So then we can say self dot feed forward, and here's where we'll invoke our second layer norm, self dot second layer norm, passing in the first part. So this also reflects the diagram that we talked about earlier. And this is exactly what we want to return; we just need to round it. So we can say that this is our result, like res, and then we just round it: we return torch.round of res with decimals equals 4. And we're done, and we can see that the code works. We've finally written the transformer block. And in the next problem, we're going to code up the GPT class. So that's definitely going to be an interesting problem, so definitely
check that one out. Next, let's give a high-level explanation of the transformer architecture. Specifically, we're going to focus on the decoder for this video, not the encoder, since that's what GPT uses. So why don't we just start with the embedding layers. We're actually going to have two embedding layers: one of them is called the token embedding layer, and the other is called the positional embedding layer. So we'll discuss this one over here. The goal of these embedding layers is to learn, or train, feature vectors. Specifically for the token embedding layer, the goal is to learn a feature vector for every single token in our vocabulary. So let's say we are doing a word-level language model. We can think of this embedding layer as a lookup table, and that lookup table will be of size vocab size by embedding dim, where vocab size is the number of different words this model can recognize, maybe the total number of unique words in the body of text that we're training the model on. Embedding dim is our choice, and the higher we make embedding dim, the more complex of a relationship the model can learn. Next
is the positional embedding. This is also another lookup table, except this one is going to be context length by embedding dim. You may have heard of context length discussed in the context of large language models; for example, GPT-4 Turbo has a context length of 128,000 tokens. And context length is essentially how many tokens back, when you're talking to these language models, the model can read in the sequence. Because at some point it has to cut off and say, okay, the model's not going to factor in anything that was this far back. But how far back can the model actually read? That is the context length. And similarly during training, when we are feeding in batches and batches of training examples, we will actually be feeding in B by T tensors, where B is the batch size and T is the length of each, say, sentence in this training batch, and T will be the context length during training. So the positional embedding is essentially another lookup table that has context length rows in it. The rows, or indices, of that table go from zero to context length minus one, and the number of columns, the size of each row, is also the embedding dimension, where we are essentially going to learn a vector of size embedding dim for every single possible position.
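The two lookup tables just described can be sketched in PyTorch like this, with made-up sizes; we'll explain the positional indexing next:

```python
import torch
import torch.nn as nn

vocab_size, context_length, embedding_dim = 50, 8, 16  # made-up values

token_embedding = nn.Embedding(vocab_size, embedding_dim)         # vocab_size rows
position_embedding = nn.Embedding(context_length, embedding_dim)  # context_length rows

tokens = torch.randint(0, vocab_size, (2, 8))  # a B=2 by T=8 batch of token ids
positions = torch.arange(8)                    # 0, 1, ..., T-1

# look up a vector for each token and for each position, then add them
embedded = token_embedding(tokens) + position_embedding(positions)
print(embedded.shape)  # B by T by embedding_dim
```

The position vectors broadcast across the batch dimension, so every sequence in the batch gets the same positional information added in.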
So what are we actually passing in to the positional embedding layer? Because we know that for the token embedding layer, we're literally just going to pass in the input tokens themselves, and the embedding layer here is going to look up the feature vector for each token, because we have vocab size number of rows, where each row corresponds to a token in our vocabulary. However, if the rows here are going from zero to context length minus one, then that's actually corresponding to positions, or indices, in a sequence. If we sent in "write me a poem", we would say that "write" is at position zero, "me" is at position one, "a" is at position two, and so on. So what are we going to pass into the forward method for the positional embedding layer? We're actually going to pass in torch.arange of T, where torch.arange generates a tensor of size T, sorted in order, with values going from zero to capital T minus one. And that's how we would want to index our embedding lookup table. And once we have the token embeddings generated and the positional embeddings generated, we simply add those two tensors together before we feed them into the transformer block. Next is the transformer block. Once we have added our two embeddings together, the token and position embeddings, we're going to go ahead and pass that in to the transformer block, this gray rectangular box over here. And this part actually does the heavy lifting within a transformer. This is what allows the model to predict the next token in the sequence so well. Now, we won't go super in-depth about this entire box over here (I actually have a separate video on that), but we will give a high-level explanation of what it's doing and how to code it up. The most important part is probably the masked multi-headed attention. That might sound like a mouthful, but this is essentially the communication part of a transformer. We know that we pass sequences of words into transformers. So let's say we passed in "write me a poem". At some point the model needs to realize which parts of this sequence of words to pay attention to, and specifically which pairs of tokens are more important. So in the attention layer the model will consider every single pair of tokens, this pair, this pair, this pair, and actually every single pair of tokens in any sequence, and the model will figure out which pairs of tokens are relevant to each other. We know that "write" and "poem" must be relevant to each other, because how else would the model know what to write? The model shouldn't write a book; the model shouldn't write a movie. Rather, the model needs to know that a poem needs to be written. So the model will let these tokens talk to each other, so to speak. The model will consider every single pair of tokens until the importance of every token pair relationship is determined, and then the model can focus on certain parts of the input and appropriately predict the next token in the sequence. The other most
important part of the transformer block is the feed forward component. The feed forward component is actually just a bunch of linear layers stacked together, as well as nonlinearities like the ReLU function. And if attention is communication, we can say that the feed forward part, or vanilla neural network part, is computation. The model is going to learn a ton of weights and biases, similar to a linear regression, in order to encapsulate the entire relationship between all the words in the input, so that the model can appropriately predict the next token. So that's the transformer block; it does the heavy lifting within the transformer. However, if you were coding up the GPT class, let's say we are coding up a class called GPT, you'll actually be given the transformer block class, and you can treat the transformer block class as a black box. However, when you are coding up the full transformer in the GPT class, you're going to notice this Nx in the diagram, and that actually indicates that we're going to have many blocks in sequence. So what actually occurs in a transformer is that the output of the transformer block, so the output over here, is actually passed back in to the transformer block N number of times, where N is essentially the number of transformer blocks, and this is predetermined beforehand. Obviously, as we increase the number of blocks, the model gets more and more complex, and the model can learn a more and more complex relationship with more parameters. But
to actually implement this into code, we're going to use something called nn.Sequential. So in your GPT class, in the constructor, you will define an nn.Sequential, and you can treat it just like a Python list. You can call .append on an nn.Sequential, with the only restriction being that the only thing this list can contain is other neural network layers. And then once you call the forward method of the nn.Sequential object, it will actually call the forward method of every single block that was passed in to this list, in order from left to right.
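Here's a sketch of that mechanism. In the real GPT class each element would be a transformer block; plain Linear layers stand in here just to show how nn.Sequential chains the forward calls:

```python
import torch
import torch.nn as nn

model_dim, num_blocks = 8, 4  # made-up values

# build the stack with .append, treating nn.Sequential like a Python list;
# in the GPT class each element would be a TransformerBlock(model_dim, num_heads)
blocks = nn.Sequential()
for _ in range(num_blocks):
    blocks.append(nn.Linear(model_dim, model_dim))

x = torch.randn(2, 3, model_dim)
out = blocks(x)  # calls each layer's forward in order, feeding outputs forward
print(out.shape)
```

One call to blocks(x) replaces the whole loop of passing the output back in N times.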
So that entire process of passing the output of one transformer block all the way back in would be handled by that line of code, and you will find that incredibly useful when coding up the GPT class. One additional
component that's actually not included in this original famous transformer architecture is that in GPTs, something that has been found to work really well is, after the last transformer block, so right here, an additional layer norm. So that's an additional nn.LayerNorm, and we know roughly, at a high level, the benefits of a layer norm: by making sure that our data is centered around the mean and has an appropriate standard deviation, we find that we don't have such a crazy, extreme range of values in our neural network, and the training process is actually much smoother. So we're going to have one additional layer norm before the linear layer and the softmax layer. And the linear layer is something we can think of as a vocabulary projection layer, because it will have vocab size neurons; out features for that linear layer would be vocab size. And that's because the model needs to figure out what the next token is, and it's going to output a bunch of probabilities, specifically vocab size number of probabilities, as we will have a probability for every single token in our vocabulary, since every single token in the vocabulary could technically come next. And of course, last, we will have the softmax layer over here, which will squash all our values to be between zero and one, so that we have a true probability distribution.
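A tiny demonstration of that squashing, with made-up scores standing in for one row of the vocabulary projection layer's output:

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.1, -1.0])  # raw scores from the linear layer
probs = torch.softmax(scores, dim=-1)

print(probs)        # every entry is now between 0 and 1
print(probs.sum())  # and they add up to 1: a true probability distribution
```

The largest raw score stays the largest probability, so the model's preferred next token is unchanged; softmax just makes the numbers interpretable.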
Next, I would highly recommend jumping into the code and coding up the GPT class. Your job will be to write the constructor, or init, as well as the forward method. And once this is done, you can actually train this GPT and then use it for text generation. So, I highly recommend jumping into the code. We're finally ready to code up the GPT class, and once we've written this class, in the next problem we'll actually get to see the model generate text. This neural network that we're about to code up follows the architecture that almost all LLMs use, including ChatGPT as well as many others. And we are going to be given a class called transformer block in the starter code, so that we don't have to code up this entire gray rectangular box that is repeated Nx times in the diagram. That class is in turn going to make use of other classes, like the multi-headed attention and feed forward neural network classes. So you might have previously coded up the transformer block in a different problem; now we're actually going to use the transformer block to code up the entire architecture from start to finish. And this diagram is from Google's 2017 paper, Attention Is All You Need, which is one of the most influential papers in machine learning to date. Usually I would say papers are not super important to actually read, but if you're interested, definitely check that one out. And the forward method should return a batch size by context length by vocab size tensor. We can see that over here, and we'll explain why that's the appropriate shape of the return tensor in a second.
And the output layer will have vocab size neurons. So that's specifically talking about this linear layer over here, the one in purple, and each neuron in that layer corresponds to the probability, or likelihood, of a particular token coming next. So just like in the handwritten digit problem, if you solved that one, we have 10 output neurons corresponding to digits 0 through 9, and the number in each output neuron represents the probability that the passed-in image is that digit. So each neuron here is like the probability that the token corresponding to that neuron comes next. Let's say we had V, for vocab size, total words or characters or tokens in our vocabulary. Then we can think of our neurons as being indexed from zero to V minus one, each holding the probability that that particular token comes next. So let's
take a look at the inputs. The model's constructor takes in vocab size. This is just the number of different tokens the model recognizes, and once we get to our embedding table, we'll see that it's actually the number of rows in that table. Context length is the number of tokens back the model can actually read. So when you're talking to a large language model, at some point in the conversation it's not factoring in stuff that's really far back. So what's that cutoff point? How many words or tokens are we talking about?
That's the context length. The model dim is the same thing as the embedding dim or the attention dim. In previous problems, if you saw a separate number for embedding dim and attention dim, that was just to help with understanding that those are different layers in the network. But in reality, when we have multi-headed attention, we just say that the embedding dim is equal to the overall attention dim, where the overall attention dim is the number of heads times each individual head size. I definitely recommend reviewing multi-headed attention if you're a little rusty on that; I can link a problem or a video for that in the description. But we just refer to the embedding dim or the attention dim as the overall model dimensionality. The larger this number is, the more complex the model is. Just for reference, the smallest GPT-2 model, the one before GPT-3, had a model dim of 768, and this obviously got even bigger with GPT-3 and then GPT-4. Num blocks is how many instances of the transformer block class we want to make. We know in the diagram that it's repeated N times, so what is N? That's num blocks. And then num heads: we need to know how many heads of self-attention we're going to do if we're doing multi-headed self-attention.
And then context: this is what the forward method takes in. In order to actually train the model, we need some sequence of text; we can see an example over here. Or, if we were using the model for generation, before we generate text the model needs something to start with. That's what we store in context, and that's what the forward method of the GPT class takes in.
Okay, let's explain the input example. Here we're just defining those constants that the constructor takes in. Those aren't super important to focus on. However, I do want to explain this input, its shape, and why it should make sense. We know the input to the forward method is supposed to be B by T because we train these neural networks in batches. We'll have many sequences passed in in parallel, and how many we're passing in in parallel is the batch size, capital B, which is just one here. There's only one sentence, one sequence, but based on the shape of this 2D list or 2D tensor, we can tell that there is still an additional batch dimension. And then T: during training, T is just the context length. We can see that five tokens were passed in, "with great power comes great," and the model is supposed to make predictions for the next token given all the training examples within this sequence of tokens. Given the context of "with," the model will predict what comes next. Given the context of "with great," the model will predict what comes next, and so on, all the way until the model has the entire input, "with great power comes great," and it predicts what comes next. We see that the context length is five. And it makes sense: why would capital T, how long the sequence is, ever be greater than five? The context length is the greatest number of tokens the model can even factor in to its next-token prediction. "With great power comes great" is already five tokens in a row, so the model couldn't factor in anything beyond five tokens. If we added another token here, the model would have to drop the first token to predict what comes next. So that's just an example of context length during training. So let's
take a look at this dictionary now. We're going to assume that the model is consistently using this internal mapping. We know models don't actually take in strings; each of these tokens needs to be converted to an integer. That's how we feed strings into models for natural language processing. And any model's mapping is pretty arbitrary; as long as it's consistent, that's all that matters. So we can say "with" maps to 0, "great" maps to 1, "power" maps to 2, and so on. Then in the final layer of the neural network, where we have vocab size or capital V neurons, we'll say that the zeroth neuron corresponds to "with," the first neuron corresponds to "great," and so on, all the way to the V minus 1 neuron.
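A quick runnable sketch of this encoding step (the dictionary is the example mapping described here; `word_to_id` and `context` are names I'm choosing for illustration):

```python
import torch

# The consistent (but arbitrary) word-to-integer mapping from this example.
word_to_id = {"with": 0, "great": 1, "power": 2, "comes": 3, "responsibility": 4}

# Encode the input sequence "with great power comes great" as token IDs.
tokens = "with great power comes great".split()
context = torch.tensor([[word_to_id[w] for w in tokens]])  # extra brackets = batch dim

print(context)        # tensor([[0, 1, 2, 3, 1]])
print(context.shape)  # torch.Size([1, 5]) -> (B=1, T=5)
```

Note that "great" appears twice in the sequence, so the ID 1 shows up twice; the mapping only has to be consistent, not unique per position.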
Okay, so for the returned output, first let's explain its shape, B by T by V. We know the batch size was one, as we saw in the input. So it makes sense that this whole 2D tensor over here is wrapped in another one; we can see that there's another pair of brackets, so the batch size of 1 is still maintained. For every element in the batch dimension, we get a capital T by V tensor. That's because for every single time step, for every single context or training example, the model generates a vector of size capital V. And the reason we have a vector of size capital V is because that's essentially a list of probabilities, a probability for every possible token that could come next. We'll actually see that the first row in this output tensor is the model's probability output, a vector of size V. We can see that there are V columns here, and V equals 5. There are five unique words in "with great power comes responsibility"; those are our five tokens. This dictionary has five keys in it, so our vocab size is clearly five. Those are the only five tokens that this dummy model is dealing with. So
back to the first row. The first row in this output tensor is the model's prediction, a vector of size V, for the first context. And the first context is just looking at "with." Then the second row, which is over here, is 0, then 0, and then we can see a 0.1. The second row is the model's prediction, another vector of size V, for the second training example in the context. The second training example is "with great," and the model is trying to predict what comes next. We know "power" comes next, but the model doesn't actually get to see that. The reason the model doesn't get to see the future tokens, the actual answer for each training example, is because in the self-attention implementation (and again, I highly recommend reviewing self-attention and multi-headed attention if you're a little rusty on that) we actually apply a mask in the code. If you look at the single-headed attention class, which the multi-headed attention class makes use of, we apply a mask there and mask out those future tokens so that the model can't look at the answer for what it's trying to predict. So let's
actually think about why this output would make sense. Let's say the model has actually been trained, so by this point it's pretty good at predicting the next token. It's not just an initial neural network, before training is done, whose weights and parameters are completely random. Let's assume the model's been trained. For the first example, we know the only token the model is looking at is "with." So the model, in a trained state, will hopefully be pretty good at predicting that the word "great" comes next. And we know "great" corresponds to index 1, and we can actually see that over here: the model is saying there's an 80% chance that "great" comes next if you're just looking at the word "with." There also seems to be a 10% chance that the word "power" comes next. Let's see if that makes sense: "with power." That could be the start of a logical sentence, so it seems reasonable for the model to still assign that some chance. And then the final column corresponds to "responsibility"; it's the fourth column using zero indexing, and there's a 10% chance that "responsibility" comes next. That would be "with responsibility," which seems reasonable too. But thankfully, the model is still saying there's an 80% chance that "great," essentially token number one, the 1 column, comes next
given the context of "with." Let's take a look at the size-V vector for the second row. The second training example is the context "with great," and hopefully the model does a pretty good job of predicting that "power" comes next. We see there's a 90% chance for the third column and a 10% chance for the last column. The third column corresponds to index 2 using zero indexing, and that corresponds to "power." So the model is saying that with a context of "with great," there's a 90% chance that "power" comes next. The final column corresponds to token ID 4, which is "responsibility." Does "with great responsibility" make sense? Yeah, that could be reasonable; it seems like a logical sentence in English. So the model is still saying there's a 10% chance that "responsibility" comes next, but after training has been done (let's say this sentence was actually in its training body of text), the model is fairly confident that "power" comes next given the context of "with great." So that's good. I definitely recommend making sense of this row and this row, but why don't we skip to the last row and see if the model's prediction makes sense there as well. We have this vector of size V, capital V equals 5, and this is supposed to be the model's prediction given the entire context, "with great power comes great." We can see there's a 90% chance corresponding to the fifth column; with zero indexing, that's token ID 4, which is "responsibility." The model is saying there's a 90% chance that "responsibility" comes next. So we can see that this B by T by V shape does make sense. Okay. So now let's run
through the transformer architecture, so we can see how we would actually code this up. The first layer is this pink-looking rectangle: that's the embedding layer. I actually crossed out the word "output" from this layer, because this original transformer diagram created by Google was used for translation; they were working on models that could translate between different languages. When we use the transformer for generating text, the architecture is slightly different. You may have noticed, if you've seen the original version of this diagram, that we cut off what was on the left. That's because large language models like ChatGPT don't use that at all, and I explain that in more detail in my encoder-decoder course. So this first layer of the neural network, this pink embedding layer over here, is how the model learns the meanings of the different tokens or words. We know that
we have some sort of mapping where we map every single token to an integer, and for a given sentence the model receives a list where each token is represented as an integer. However, those integers are completely arbitrary, and we need the model to learn some deeper representation that encapsulates the meaning of each word. Ideally, the model would learn a vector that represents the meaning of each word. We call those embedding vectors, and they're trainable and learnable through gradient descent. We can imagine this layer as a lookup table: given a token, we look up the corresponding row for that token, and all the columns associated with that row are the embedding vector, or feature vector, learned for that word. The number of rows in this table should be the vocabulary size, capital V, because that's how many tokens we want to learn a feature vector for. And the number of columns should be the model dimensionality, since that's the dimensionality of our embeddings and how many features we have to learn the meaning of each word. So in the constructor for the GPT class, the way we instantiate this embedding layer is by using nn.Embedding. We only have to pass in two things to nn.Embedding: the first is the number of rows of the table, and the second is the number of columns of the table.
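Here's a minimal sketch of that instantiation and a lookup through it (the sizes V = 5 and D = 16 are just example values, and the variable names are mine):

```python
import torch
import torch.nn as nn

vocab_size, model_dim = 5, 16  # V rows, D columns (example sizes)

# The lookup table: one learnable row of size model_dim per token.
token_embedding = nn.Embedding(vocab_size, model_dim)

# A batch of one sequence of 5 token IDs: "with great power comes great".
context = torch.tensor([[0, 1, 2, 3, 1]])  # shape (B=1, T=5)

token_embeds = token_embedding(context)
print(token_embeds.shape)  # torch.Size([1, 5, 16]) -> (B, T, D)
```

Indexing with a (B, T) tensor of IDs plucks out one row per token, which is exactly the lookup-table behavior described above.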
Next, let's talk about the other embedding layer, which generates our positional encodings. When we have a sentence in English, it might be one we pass into ChatGPT; let's say it's a command like "write me a poem." This might sound kind of obvious, but it's something we should clarify: when it comes to teaching computers to understand language, the order of the tokens is really important. Look at the positions of each token in this sentence: this token is at index 0, this token is at index 1, and so on. If we jumble the order into something like "poem me write a," the meaning is completely lost. So we need the model to learn embeddings for each token's position as well. That means we're going to have another lookup table, where the number of rows is capital T, the context length, so the model can learn a feature vector, again of size model dim, for each token position from 0 to capital T minus 1. So when we call this layer in the forward method, what do we pass in to our positional embedding layer? It's going to be a separate nn.Embedding instance from the one over here. What we pass in is simply a vector of size capital T that has all the numbers from 0 to capital T minus 1, since that plucks out the rows of the table for each position in our sequential input string. The way you can generate that tensor is using torch.arange: it produces the numbers from 0 up to, but not including, whatever the input to the function is. So you would pass torch.arange(T) into the positional embedding layer in the forward method of the GPT class. And of course, maybe I didn't clarify this before, but just to make it super clear: the raw B by T input is what you pass in to the actual token embedding layer.
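A small sketch of the positional lookup, again with illustrative sizes and names of my choosing:

```python
import torch
import torch.nn as nn

context_length, model_dim = 5, 16

# One learnable vector per position 0 .. T-1.
positional_embedding = nn.Embedding(context_length, model_dim)

T = 5
positions = torch.arange(T)          # tensor([0, 1, 2, 3, 4])
pos_embeds = positional_embedding(positions)
print(pos_embeds.shape)              # torch.Size([5, 16]) -> (T, D)

# Broadcasting lets us add this (T, D) tensor directly to a
# (B, T, D) token-embedding tensor in the forward method.
```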
Then in the forward method, what we're going to do is just add the positional and token embeddings together, and then pass those into a series of transformer blocks. Let's treat the transformer block as a black box for now; we have an entirely separate coding problem explaining the inner workings of this box: masked multi-headed attention, the add-and-norm, the feed-forward, all the components within the transformer block. That class is given to us, so we just need to know how to instantiate it in the constructor for the GPT class and how to call it in the forward method that we're going to write. Since we're going to be instantiating a bunch of these transformer blocks, based on the num blocks parameter that was given to us, we're going to make use of something called nn.Sequential. You can instantiate this in the constructor for the GPT class and then treat it like a normal Python list: you can call the append method, with the only restriction that anything you pass in needs to be a neural network class.
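As a sketch of that pattern, the loop of appends might look like this. `DummyBlock` is a stand-in I'm defining just for illustration; in the actual problem you would append the provided TransformerBlock class instead:

```python
import torch
import torch.nn as nn

# Stand-in for the course's TransformerBlock (whose real internals live in a
# separate class); any nn.Module subclass can go into nn.Sequential.
class DummyBlock(nn.Module):
    def __init__(self, model_dim: int, num_heads: int):
        super().__init__()
        self.linear = nn.Linear(model_dim, model_dim)

    def forward(self, x):
        return self.linear(x)

model_dim, num_heads, num_blocks = 16, 4, 3
blocks = nn.Sequential()
for _ in range(num_blocks):
    blocks.append(DummyBlock(model_dim, num_heads))  # append like a list

x = torch.randn(1, 5, model_dim)   # (B, T, D) total embeddings
out = blocks(x)                    # each block is called in order
print(out.shape)                   # torch.Size([1, 5, 16])
```

Note that `nn.Sequential.append` requires a reasonably recent PyTorch (1.13 or later); on older versions you'd build the module list first and pass it to the constructor.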
It needs to be something that also subclasses nn.Module. So each time you append to this list, maybe looping over the range of num blocks, you append an instance of the transformer block. Every time you instantiate the transformer block, the only things you have to pass in are the model dimensionality, which is an input given to us, and the number of heads, which is simply the number of heads of self-attention. So you would define blocks, your nn.Sequential, in the constructor for the GPT class. Then, just to be super clear, the way you would use this in the forward method for the GPT class is: you pass your total embeddings, your positional embeddings plus your token embeddings, into the default calling method, which is essentially the forward method of the blocks. The way nn.Sequential works is that, for the list of modules you've passed in, it just calls them in order. It passes total embeddings into whatever's at the zeroth index of the list, then passes the output of that into the next transformer block, and so on, for as many elements as we have in the nn.Sequential. So you'll simply get the output like this: you pass total embeddings into blocks, and this is the output of the transformer blocks. Okay. So what's left? The bulk
of the code is done. Even though the transformer architecture as originally developed, this diagram, does not include an additional layer norm over here, after the transformer blocks and before this linear layer, it is customary to include one: an additional instance of nn.LayerNorm defined in our constructor that we then call in the forward method. Researchers have found that these large language models get much better results if you do one additional normalization, an additional layer norm, before the final linear layer. And again, for a refresher on what layer norm does, I highly recommend checking out the transformer block video, where I explain every single inner working inside the transformer block.
So the final linear layer is over here, and this is what gets the output into the shape we want, so that we can extract predictions for the next token. We know the input to it is B by T by D, where D is the model dimensionality, and we want something that's B by T by V, where V is the vocabulary size. The way we'll do that is by having a linear layer defined in the constructor of the GPT class: the in features for this linear layer will be model dim, since that's capital D, and we pass in vocab size for out features. This gets us the predictions in the shape we want. And this does make sense given the example we talked
about at the start. And lastly, for the softmax layer: the whole point of the softmax layer is so that we actually get numbers between zero and one, because we want those final predictions in the B by T by V tensor to be between zero and one so we can think of them as probabilities. You can instantiate a softmax layer in the constructor of the GPT class; that would be nn.Softmax, and you pass in the dimension along which you want to normalize, the dimension along which everything should add up to one. That's just the final dimension, since the vector of size capital V is our list of probabilities.
So you could pass in dim=2, the dimensions being 0, 1, and 2, or you could say dim=-1 for the final dimension. Then you would simply call your softmax instance in the forward method of the GPT class. Or, since softmax is sometimes thought of as simply a function rather than a layer, you can just call nn.functional.softmax in the forward method of your GPT class, and you don't have to worry about instantiating anything in the constructor. That function takes in two things: your tensor as the first input, the thing you actually want to perform softmax on, as well as the dimension, which we would just pass in -1 for. And
that's it. We're finally done explaining this entire transformer architecture that we've been building up to across all these videos, trying to understand how this neural network works. Before we jump into the code, I just wanted to say thank you for making it this far into the videos. And then two other things. One is that we're actually not done: even after this coding problem, we have one more, and that's where we finally generate text from the model. We'll see this neural network architecture in action, working like a GPT. In the next problem, you're going to write the generate function, which is a bit different from the forward function for this class; we'll actually get to return a string, and that's how the test cases will work. We'll see if the string your GPT returns, based on some sort of prompt, actually matches what the string is supposed to be. So that's going to be a pretty cool problem, testing whether you can write the generation code, the logic that actually makes these transformers generate text. And the next thing I wanted to say is that I highly recommend reviewing, if you need to, any of the previous videos or problems in the series, all the way from the first problem we started with, gradient descent, through linear regression, sentiment analysis, multi-headed attention, and so on. These problems and videos will always be free, and if you need to review any of these concepts at any time, I highly recommend them. Okay, let's finally jump into the code for the GPT class. I recommend reviewing the other classes provided below just to refresh yourself on those concepts, but for the most part we'll treat the transformer block class like a black box. So the first layer in the neural
network is the word embedding, or token embedding, layer. That'll be something like self.token_embedding = nn.Embedding, where the number of rows is the vocab size and the number of columns, the size of each vector, is the model dimensionality. Then we're going to have our positional embeddings layer: positional embeddings is another nn.Embedding, and the number of rows here is the context length, since we go all the way from 0 to T minus 1; the number of columns is also the model dimensionality. Then we need our transformer blocks. We can say something like self.blocks = nn.Sequential, and then, for i in range(num_blocks), we append to it: self.blocks.append, passing in an instance of the transformer block, using that class that's below. We pass in the two things it takes; if you check its constructor, which is way below, it takes in the model dimensionality and the number of heads. Then we need our final layer norm, which comes before the final linear layer. We can say something like self.final_layer_norm = nn.LayerNorm. If you've solved the transformer block problem before, you know that the layer norm constructor just needs to take in the dimension along which to normalize. This layer takes in something that's B by T by D and returns something that's still B by T by D, but normalized along that model dimensionality, the third dimension. So we just pass in model dim. Then we want our final linear layer. We can call this the vocab projection, because we're projecting down to the vocabulary-size dimension. The vocab projection is nn.Linear; the input features is model dim, the output features is vocab size. And that's actually it for our GPT constructor. One small thing, though: since the transformer block is an inner class, we do need to refer to it through self in Python. And for the softmax, we'll just use the function and do that in the forward method below.
Okay, so now let's pass our input, also called context, all the way through the transformer architecture. The first thing we need is our embeddings. We can say something like token_embeds = self.token_embedding, and we pass in our context. This is going to be something that is B by T by D, since for every token we get a vector of size D. Then we need our positional embeddings, so we call the positional embeddings layer. What do we want to pass in? We want to pass in torch.arange(T). But how do we actually get T? We can get it from any of these tensors: we can say something like B, T, D = token_embeds.shape, which unpacks the shape tuple, and then we simply pass T in over here. Then we know that our total embeddings, the sum of embeddings, is token embeds plus positional embeds, and this is what we want to pass through the blocks. So we can say self.blocks, passing in total embeddings, and that gets passed through all N of our transformer blocks. Then this goes through the final layer norm: self.final_layer_norm, and we pass that in. The output of the layer norm goes into the vocab projection, the linear layer, so self.vocab_projection takes in that output. This is almost the final output; we can say it's unnormalized, right? It's not between zero and one yet; we have to apply softmax for that. So nn.functional.softmax, and we pass in the unnormalized tensor and dim=-1, as we talked about earlier. This will be our normalized probabilities; they're between 0 and 1, so I'll just call it probs. And then this is what we want to return. We just need to make sure to round our answer to four decimal places so that the test cases are consistent. So we can say return torch.round of probs, with decimals equal to 4. And we're done.
And we can see that it works. We've
finally written a working GPT class.
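Pulling the whole walkthrough together, here's a consolidated sketch of the class. The TransformerBlock internals are stubbed out with a single linear layer, since the real class is provided separately in the course; variable names are my choices, not necessarily the exact starter-code names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    # Stand-in for the provided TransformerBlock class (the real one holds
    # masked multi-headed attention, add & norm, and a feed-forward network).
    class TransformerBlock(nn.Module):
        def __init__(self, model_dim, num_heads):
            super().__init__()
            self.ff = nn.Linear(model_dim, model_dim)

        def forward(self, x):
            return self.ff(x)

    def __init__(self, vocab_size, context_length, model_dim, num_blocks, num_heads):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, model_dim)    # V x D table
        self.pos_embedding = nn.Embedding(context_length, model_dim)  # T x D table
        self.blocks = nn.Sequential()
        for _ in range(num_blocks):
            # Inner class, so we refer to it through self (as noted above).
            self.blocks.append(self.TransformerBlock(model_dim, num_heads))
        self.final_layer_norm = nn.LayerNorm(model_dim)
        self.vocab_projection = nn.Linear(model_dim, vocab_size)      # D -> V

    def forward(self, context):
        token_embeds = self.token_embedding(context)         # (B, T, D)
        B, T, D = token_embeds.shape
        pos_embeds = self.pos_embedding(torch.arange(T))     # (T, D), broadcast over B
        total = token_embeds + pos_embeds
        out = self.blocks(total)                             # through all the blocks
        out = self.final_layer_norm(out)
        logits = self.vocab_projection(out)                  # (B, T, V)
        probs = F.softmax(logits, dim=-1)
        return torch.round(probs, decimals=4)                # match the test precision

model = GPT(vocab_size=5, context_length=5, model_dim=16, num_blocks=2, num_heads=4)
probs = model(torch.tensor([[0, 1, 2, 3, 1]]))
print(probs.shape)  # torch.Size([1, 5, 5]) -> (B, T, V)
```

Because of the four-decimal rounding, each row sums to 1 only approximately, which is exactly the behavior the test-case discussion below relies on.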
There are so many concepts embedded within this, so definitely leave a comment if there's anything you'd like me to go more in depth on, and I'll make a video on that. If you're curious how I wrote the test case for this problem, how it's actually checking the correctness of your code: I simply created a random context, and the weights of your model are randomly initialized, which is done automatically by PyTorch. Then I check whether the probabilities, the output tensor returned by your forward method, match the correct solution code's output tensor to four decimal places. And by the use of seeds, I'm able to ensure that everything is reproducible and there's no randomness or inconsistency in the inputs and outputs. Next, I would highly recommend jumping into the make-GPT-talk-back problem. In that problem, you'll implement the logic to make GPT generate text, and it's going to be a really nice, satisfying ending to this sequence of problems.
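As a preview of that problem, here's a hedged sketch of a standard autoregressive generation loop. Greedy decoding is shown for simplicity; the actual exercise may sample from the probabilities instead, and the function signature here is my assumption:

```python
import torch

def generate(model, context, max_new_tokens, context_length):
    """Greedy autoregressive generation sketch.

    model: maps (B, T) token IDs -> (B, T, V) probabilities, as described above.
    context: (B, T) starting token IDs.
    """
    for _ in range(max_new_tokens):
        # Crop to the last context_length tokens; the model can't see further back.
        cropped = context[:, -context_length:]
        probs = model(cropped)                       # (B, T, V)
        next_probs = probs[:, -1, :]                 # distribution for the next token
        next_token = torch.argmax(next_probs, dim=-1, keepdim=True)  # greedy pick
        context = torch.cat([context, next_token], dim=1)
    return context
```

The key idea is the loop: each newly picked token is appended to the context and fed back in, which is exactly what makes the model "talk back."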
So, how do we actually generate text from language models? Let's treat these language models, these transformers like GPTs, as a black box. How do they actually generate text? After we watch this video, there will be a coding exercise to generate new Drake lyrics using a trained model, and I highly recommend doing that exercise and then playing around with the model. So how do we actually generate the lyrics? First, let's treat the model as a black box. This is our model, some sort of trained GPT, and we'll explain what it takes in and what it outputs. It actually takes in a tensor of size B by T. B is the batch size, and it represents how many independent examples or requests we're processing in parallel when generating text. Let's just say batch size equals 1, so B equals 1. The T is actually really important, though. T is equal to the length of the input sequence. So if we're using word-level tokenization and we said something like "write me a poem", that would actually be a capital T of four. What the model outputs is actually something of size B by T by V, where V is the vocabulary size. So it's the number of unique words or tokens that this model recognizes.
And this should make sense, because for every single time step, for every single position in the input sequence, maybe an input sequence of four words like "write me a poem", we're getting a vector of size V. And that vector of size V is actually a list of probabilities. We can think of index zero in that vector as corresponding to the probability that the token associated with index zero comes next, and the same for index one and index two and so on, all the way to index vocab size minus one.
So in short, we have vocab size, or capital V, entries for every time step, where every entry in that vector corresponds to the probability that the corresponding token comes next. So after we've passed in our input, something like "write me a poem", so that's B = 1 and T = 4, we get something that's B by T by V. So,
we'll explain how the next token is chosen in more detail later in the video. But given that list of probabilities, there's going to be some sort of sampling algorithm. You can think of this as reaching into a bag of marbles, where the marbles that occur more often are more likely to get picked out. It's essentially the same concept here: we have a bunch of probabilities for which token could come next, and using those probabilities, we're actually going to pick one token to come next in the sequence. So let's say the input was "write me a poem", a bunch of probabilities came out of the model, and the token that we ended up choosing to come next is "stanza". What we do to generate text continuously from the model is append, or concatenate, this word to the previous input. So now the input says "write me a poem stanza", and this is going to be passed into the model. We call the model again and we get another output. This time, after we choose from the probabilities, let's say we get the token "one". Then we would append that, get "stanza one", and pass "write me a poem stanza one" into the model again. The model will output another probability distribution. This process of continuously calling the model over and over again in a cyclic manner is how we keep getting the next token in the sequence and how we can generate long text.
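The append-and-call-again loop just described can be sketched in a few lines of plain Python. Here `fake_next_token` is a made-up stand-in for the model plus the sampling step; a real GPT would return probabilities that we sample from:

```python
def fake_next_token(context):
    """Stand-in for model + sampling: picks the next word deterministically
    from a tiny made-up vocabulary, just to show the loop structure."""
    vocab = ["write", "me", "a", "poem", "stanza", "one"]
    return vocab[len(context) % len(vocab)]

context = ["write", "me", "a", "poem"]
for _ in range(2):                       # generate two new tokens
    next_token = fake_next_token(context)
    context = context + [next_token]     # append the chosen token to the input
print(" ".join(context))                 # → write me a poem stanza one
```

The key point is the cycle: the model's own output gets concatenated onto the input before the next call.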
And just to make sure this is super clear: after we pass "write me a poem stanza one" into the model, the next token that comes out should hopefully be either a colon, so that we can start the poem, or the first token of the poem itself. Then we would keep calling the model repeatedly to generate the entire poem. So how is this actually going to work in our
coding exercise? Instead of passing in an instruction, all we're going to pass in is a start token. It's essentially a dummy token that tells the model to start generating text. This is actually how text is generated from language models when they're in their pre-trained state, before we fine-tune them on a Q&A data set so we can talk back and forth with the model. Even so, they can still generate text very well. Our start token, or initial context, is just going to be a zero in a tensor, and that will be given. After calling the model once, it's going to output something that is 1 by 1 by vocab size, assuming a batch size of one and T equals one. After that first call, we concatenate the model's output onto the initial starting context, that start token. So after calling the model twice, the output is going to be B by 2 by V, and of course, after calling the model three times, the output is going to be B by 3 by V. We actually only care about the final time step, right?
The next token that's going to be predicted in the sequence is based on everything that's come so far, all the previous tokens. So whenever we extract the output of the model's forward, we always need to focus on the last time step, and this is how we can index the output to do so. Once we get our probabilities, such as by applying softmax to the model's output so we have numbers between 0 and 1, we need to actually choose the next token, because we don't know which token is going to come next; all we have is a bunch of probabilities. One option is just to pick the token with the highest probability, but this actually leads to really boring results, and the models don't give humanlike text generation. So
we're going to call a function in PyTorch called torch.multinomial, which is going to simulate the process of sampling. The process of sampling can be likened to choosing marbles from a bag. The marbles that occur more often, the colors that occur more often, are going to be the ones that you get most of the time. But every once in a while you will get a marble that occurs less often, because there's still a chance that that marble could get chosen. The same holds for the output of the model, which is of size V, or vocab size. The tokens that have higher probabilities are the ones that are more likely to get chosen, but we still want the less likely tokens to get chosen sometimes, instead of only choosing the max by calling torch.argmax or something like that. So once we have our probabilities, we're going to call torch.multinomial, and torch.multinomial is going to sample, or pick, a token for us. Then once we have the returned tensor from torch.multinomial, which will essentially be a tensor with just one element in it, the next character or the next token in our sequence, we're going to concatenate it with our growing context. And at the next iteration of the loop, we're going to call the model again, call the forward method for the
model again. However, there is one limitation that we still need to take into account and it is actually something called the context length. The
context length of these language models. You might have heard that GPT-4 has a context length of 128K. The context length represents how many tokens back into the past of the sequence the model can read. This has to be cut off somewhere, as we don't have infinite compute. Obviously, higher numbers for the context length are going to yield better results, but the context length does have to be established somewhere. So in our loop for generating new text, where we keep continuously calling forward for the model, we know that this context is going to be growing. We saw earlier that we had "write me a poem", then "write me a poem stanza", then "write me a poem stanza one". As this context gets too long, if it ever exceeds the context length, it is an important implementation detail that we are going to have to truncate the tokens at the start, so that we're only looking at the previous context-length number of tokens. So
next, I would highly recommend jumping into the code. You will actually get to run your code against test cases and see if the string that you return is correct, the string that represents GPT's prediction, and then you will have a working model that can generate new Drake lyrics. Of course, you could just switch the data set to a different text file if you wanted to generate some other sort of text. So I'd highly recommend jumping into the code, and then you can check out the Colab linked on the problem, which will allow you to play around and generate new lyrics. Let's test our code and see if it works. We can go ahead and run this cell, and these actually look like real lyrics. In fact, we can run it again and see if we get real lyrics as well. And yeah, we get some different lyrics. And of course, we can increase the number of characters here and get even more lyrics.
Okay, this is our final problem in this list of problems, and we're finally going to generate text using all the code we've written so far. This problem is going to combine concepts from this entire series, and we're finally going to use the GPT class that you wrote. One of the inputs is actually an instance of that exact class, and we're going to use it to generate text. I took the model that you coded up in the previous problems, the GPT class, made an instance of it, and trained it on a raw data set of all of Drake's songs. So what's being passed to you is a model that specializes in generating Drake songs, since that is the body of text it learned to model really well during training. And we're finally going to get to write the code that actually generates text from these large language models. This idea of using a trained model and its weights to generate new things is called inference, and it's a bit more complicated than just calling the forward method from the GPT class over and over again. This is a valuable skill to really understand, because then you could download open-source weights from online, implement inference for them on your own, and actually generate text; that can be a really fun side project. The reason it's more complicated than just calling forward in a loop is because forward is not outputting a clear-cut answer for
what the next token in the sequence should be. If we pass "my name is" into this model, treating the model as a black box, and let's say we know the answer is "Bob", and the model has been trained really well and knows the next word is "Bob", it's still not going to give us a clear-cut answer like that. That's not what this model outputs. This model outputs a vector of size vocabulary size, or capital V, where V is the number of tokens the model was trained on. We have a bunch of numbers, each between zero and one, and if we go to the index where "Bob" is, that's going to be a really high number, maybe like 0.95. So the model is outputting a probability for each possible next token, whether that's a character (in this model, it's actually going to be a character) or a word; it's just some token. So given a bunch of probabilities, how do you actually choose the next token? It turns out just taking the highest-probability token doesn't necessarily yield the greatest results, and we'll
explain why in a bit. The model is also only allowed to read context-length number of tokens back into the past. As we're generating text, these generations can keep growing and growing, and we're going to need to truncate off tokens that are farther back than this threshold at every iteration. The way your code will be validated is we'll just check to see if the string that GPT is returning back to you, whatever text it generates, actually matches what it should be.
So let's try and understand this example. We're given an instance of the GPT class, and that's going to be the model variable. Then we're told how many new characters to generate, and we're given the starting context. So if the model is a black box, this is our model. We pass in some sequence of tokens of length capital T. We're going to start off by passing in just one token, specifically zero. Since we just have one token, we can say that capital T equals 1. If you look at the vocabulary dictionary, remember we encode our tokens as numbers when we're doing NLP, and we can see that zero corresponds to the newline token. That's going to be our start token, a token that tells the model, hey, start generating text. In this case, since we trained the model on lyrics, you can maybe think of the newline token as saying, okay, start generating a new stanza or a new verse, right? We're then going to get a probability distribution, pick a token, and once we decide that, we're going to concatenate it to our original input. So now the input will be of length T + 1, and then we'll call the model again inside some sort of loop.
We'll get the next token, and then this process just continues. So we need to be told how many new characters to actually generate, and that's this input over here. Of course, as this input grows, as we are generating new tokens and appending them back onto the context variable, at some point those tokens from the past, maybe those starting tokens, need to be cut off if the input is getting too long for the model to process. That's our context length. And then this is just a dictionary for mapping integers to characters. This is going to be a character-level model instead of a word-level model, but all the concepts are exactly the same. And although your output when you run the code for this problem is not going to match this exactly, just due to computational limits within the browser, I went ahead and ran this model. I used the solution code for this problem with the GPT class that you wrote in the previous problems, and this is the output I got for only 60 characters. You can count this; it's 60 characters. We can see that it's something that actually resembles Drake lyrics, so that's pretty cool. Now let's jump into how it works.
So this is going to be our general workflow for this generation loop. We're going to call the forward method of the given model at every iteration using our growing context, which, again, we have to remember to truncate if it gets too long. We are then going to get probabilities. However, the output of this model, as the starter code says, is from before softmax is applied, so the values are not between zero and one yet. To get those probabilities, we're going to use nn.functional.softmax, and of course we're going to use the last dimension, so we say something like dim equals -1 to make each row actually sum to one. However, the model's output before we apply softmax is of shape B by T by V. For every single token, at every single time step, we have this vector of size V, all the probabilities. But we've been calling this model over and over again inside the loop, and this T has been growing: it starts off with T = 1, then T = 2, then T = 3 as we generate the tokens. Each time, we only really care about the next token that's going to come in our sequence, so we only care about the last time step here. We don't care about the model's predictions for the previous time steps, because we've already chosen those tokens from the probabilities and done our whole sampling thing, which we're about to explain. So we just want to index the model output at the final time step. Let's say we had the model's prediction stored in some kind of variable. You would want to leave the batch dimension the same, take the last time step, and grab all the vocabulary entries, all the probabilities, so you would leave the vocabulary dimension the same. Then you can apply softmax, and from those probabilities we're going to do sampling. So let's talk about sampling.
So given a bunch of probabilities, one choice is just to keep choosing the highest-probability token: look at the index of the max probability and simply choose that token. However, we're instead going to do something called sampling. Sampling can be likened to drawing marbles from a bag. Let's say there's a bunch of marbles of different colors, and some colors appear more often than others. If you keep repeatedly drawing from the bag, the ones that appear most often are going to pop out more often, but occasionally you can still get the lower-count marbles. That's exactly what we're going to do here, and there's a function called torch.multinomial that is going to simulate this sampling process. All we have to do is pass in the probabilities (that's after we applied softmax), the number of drawings that we want to do from this bag of probabilities (that's just one), and then, for reproducibility, so that your test case output is not random every time, you need to pass in the generator. Why are we doing sampling instead of just taking the max-probability token? I go into this in more detail in the background video for this problem, but the short answer is that if we let the model occasionally choose the second or third most likely token, we can get a lot more diverse and interesting outputs, because the model isn't forced to take just one path. Instead, there are different paths the generation can go down. We know that at a given iteration, if we choose a character like A instead of a character like B, that is going to affect the future tokens that are generated, because they're all conditioned on, all dependent on, the previous tokens that came before. So we can get more interesting and fun outputs if we occasionally choose the slightly less likely token at a given iteration, because this allows the model to go down a different path and generate something different.
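The marble-drawing idea can be simulated in plain Python with `random.choices`, a rough stand-in for torch.multinomial; the probabilities below are made up:

```python
import random
from collections import Counter

random.seed(0)                 # like passing a generator, for reproducibility
probs = [0.7, 0.2, 0.1]        # hypothetical probabilities for tokens 0, 1, 2
draws = [random.choices([0, 1, 2], weights=probs, k=1)[0] for _ in range(1000)]
counts = Counter(draws)
# Token 0 wins most draws, but the less likely tokens 1 and 2 still show up,
# which is exactly why sampled text is more diverse than always taking the max.
print(counts)
```

Always taking the max would emit token 0 every single time; sampling keeps the other paths alive.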
Okay, we're finally ready to jump into the code. We're going to be returning a string that grows at every iteration depending on which characters we generate, so first I'm going to declare a variable called result, which is just going to hold the list of generated characters, and then we can convert this into a string at the end. Again, to maintain reproducibility here, we have this generator, which is going to keep track of all the random numbers and make sure everything's consistent. You actually have to set the generator to the initial state at every iteration to make sure everything's consistent. If that doesn't make a ton of sense, don't worry, it's not too important. The only thing we really need to do is have our code here, which is going to be some arbitrary number of lines, and then call torch.multinomial on this line over here, because we know there's going to be a call to that sampling function at some point. And like we said earlier, we're going to pass in the generator. Then, right after that line, we need to set the state again and finish out however many lines of code the body of the loop takes. And then it
starts over again. So the first thing we're going to do is check whether we even need to do any truncation. Has the context grown long enough that it's longer than the context length? The way I'm going to do this: since context is B by T, I'm going to check the length of the second dimension. An easy way to check the second dimension is to take the length of the transposed version, because len always returns the first dimension. If context is B by T, len of context is B; but if you transpose it, you flip it into T by B, and you could say len of context.T (T for transpose, not to be confused with the sequence-length T). So if the length of the transposed context is greater than the context length, then it's too long and we need to truncate. We could say context equals context, preserve the batch dimension, and then slice from negative context length onwards. Negative context length means going context-length tokens backwards from the end, and the colon goes all the way to the end, so we'll preserve everything from that point onwards.
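Using nested lists as a stand-in for a B-by-T tensor, the truncation step looks like this; a simpler alternative to the transpose trick, in PyTorch, is reading the second dimension from the shape directly:

```python
context_length = 4
context = [[5, 9, 2, 7, 1, 3]]          # shape (B=1, T=6) as a nested list
T = len(context[0])                      # in PyTorch: context.shape[1]
if T > context_length:
    # keep only the last context_length tokens of every batch row,
    # which is what context[:, -context_length:] does on a tensor
    context = [row[-context_length:] for row in context]
print(context)  # → [[2, 7, 1, 3]]
```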
Then, based on this context, we need to get the model prediction. So we can say prediction equals model.forward, or again just use the default call syntax, and pass in the truncated context. Then we want to focus on the last time step, because again we only care about getting the next token. So we take prediction and index it with colon, -1, colon, as we talked about previously. Instead of being B by T by V, this is just B by V, which is what we wanted. Then we can get our probabilities by calling softmax. So nn.functional.softmax: pass in that last time step, and we want to normalize along the vocabulary dimension, making every row in this tensor sum to one with all the entries between 0 and 1, so pass in dim equals -1. And now we can actually call torch.multinomial. So we can say next_char, or next character, is torch.multinomial, and we know we need to pass in the probabilities and the number of times we want to draw out of this bag. That's called num_samples, so you can say num_samples equals 1, or you can just directly pass in one. Then we do have to pass in the generator just to make sure the sampling is consistent. And then we want to set the state so everything's consistent, but we don't have to touch that at all. Here's where we're going to finish out the loop. We know we need to grow the context and actually append the new character we just got. We can use torch.cat for that concatenation: torch.cat, and we pass in a tuple. We just want to concatenate context with the new character, and then we can say dim=1 over here. That takes it from something that's B by T, the initial shape of context, to something that's B by T + 1. That's what passing in dim=1 does; we're focusing on this dimension over here.
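What torch.cat((context, next_char), dim=1) does can be mimicked row by row on nested lists; the token values here are made up:

```python
context = [[0, 4, 2]]     # shape (B=1, T=3)
next_char = [[7]]         # shape (B=1, 1), like the tensor torch.multinomial returns
# Concatenating along dim=1 glues the new column onto the end of each batch row
context = [row + new for row, new in zip(context, next_char)]
print(context)  # → [[0, 4, 2, 7]], shape is now (B, T + 1)
```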
And then we can actually append to result. So we can say int_to_char, that's the dictionary we had earlier, and we want to index it with whatever token we chose. So we can say next_char, and next_char is supposed to be an integer, because the model is thinking in terms of these token IDs, the integers that represent each string, not the literal strings themselves. However, this is actually still going to be a tensor, since it's something returned by torch.multinomial. So we can simply call item, and item extracts the number out of a tensor of size one. If you have a tensor wrapping a single constant and you call item on it, it gives you the actual number, the actual scalar. That would give us five over there. This is just what we want to append to result, so we can say result.append of whatever character that dictionary gave us. And now we're ready to return the string. However, we need to return a string, not a list. So why don't we just join all the entries in the list with the empty string: we can say return empty string dot join of result, which is going to join all the elements of the list with the empty string, and we can see that the code works. Next, I highly, highly recommend checking out the linked Google Colab notebook in the description. You don't have to write any more code. In fact, you're going to see your code, this exact generate function that you wrote here, being used in that notebook. All you have to do is click run on each cell, one by one, and every time you run it, you're going to get new Drake lyrics. So that'll be really interesting. "Bumping Justin Bieber, but her favor ain't left. She know what she need. All her need, all she bless. Giving you my best. Yeah, I got my heart."
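Putting the whole loop together, here is a toy, pure-Python version of the generate function. Everything in it is a made-up stand-in: `fake_model` replaces the trained GPT (it returns next-token probabilities directly, whereas a real model returns B-by-T-by-V scores that need the last-time-step indexing and softmax), and the three-character vocabulary is invented:

```python
import random

int_to_char = {0: "\n", 1: "a", 2: "b"}   # hypothetical tiny character vocabulary
context_length = 8

def fake_model(context):
    """Stand-in for model.forward: looks at the last token and returns
    made-up next-token probabilities over the 3-character vocabulary."""
    table = {0: [0.1, 0.6, 0.3], 1: [0.2, 0.2, 0.6], 2: [0.5, 0.3, 0.2]}
    return table[context[0][-1]]

def generate(new_chars, context, seed=0):
    rng = random.Random(seed)            # like passing a generator, for reproducibility
    result = []
    for _ in range(new_chars):
        if len(context[0]) > context_length:              # truncate if too long
            context = [row[-context_length:] for row in context]
        probs = fake_model(context)                       # get probabilities
        next_tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]  # sample
        context = [row + [next_tok] for row in context]   # grow the context
        result.append(int_to_char[next_tok])              # decode token id to char
    return "".join(result)               # join the character list into a string

print(repr(generate(20, [[0]])))
```

Same seed, same output: the `seed` here plays the role the generator plays in the real code.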
Welcome to GPT Learning Hub, where we simplify complex ML concepts. In this
video, we'll be going over the two types of machine learning models, or more accurately, the two ways to train models, supervised and unsupervised.
Supervised models learn from a labeled data set. Imagine a model that predicts how many votes a candidate will receive based on a few input attributes. The data set might have X1, the age of a candidate, X2, the number of laws the candidate has passed, and X3, the number of years they've been in politics. Each row represents a prior candidate, and for every row we also have a label, which is the recorded number of votes received. The model learns to predict how many votes a future candidate will receive based on this data set. Learning from labeled data is the essence of supervised learning. What's actually inside this black box? The model could be any mathematical formula, but the simplest model would use the equation Y = W1*X1 + W2*X2 + W3*X3 + B to make its prediction for Y given X1, X2, and X3 as input. During the learning phase, the values for W1, W2, W3, and B are updated until we're satisfied with the model's accuracy. W1 factors in how important X1 is for calculating Y, W2 factors in how important X2 is for calculating Y, and so on. The B tells us the value of Y if X1, X2, and X3 were all zero. On to unsupervised models.
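As a quick aside, that supervised prediction can be sketched in plain Python. The parameter values below are invented for illustration; in practice, training would find them:

```python
# Hypothetical learned parameters -- training, not us, would find these values
w1, w2, w3, b = 1200.0, 350.0, 80.0, 5000.0

def predict_votes(x1, x2, x3):
    """y = w1*x1 + w2*x2 + w3*x3 + b; each w weighs how important its input is."""
    return w1 * x1 + w2 * x2 + w3 * x3 + b

# A made-up candidate: age 50, 12 laws passed, 20 years in politics
print(predict_votes(50, 12, 20))  # → 70800.0
print(predict_votes(0, 0, 0))     # → 5000.0 (with all inputs zero, y is just b)
```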
all zero. On to unsupervised models.
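First, though, here's that supervised prediction equation as a quick sketch in Python. This is just an illustration of the formula y = w1*x1 + w2*x2 + w3*x3 + b; the weight values below are made up for the example, not learned:

```python
# Sketch of the supervised voting model: y = w1*x1 + w2*x2 + w3*x3 + b.
# The weights are invented for illustration; training would find the real values.

def predict_votes(x1_age, x2_laws_passed, x3_years_in_politics):
    w1, w2, w3 = 120.0, 800.0, 350.0  # how important each attribute is
    b = 5000.0                        # the prediction when all inputs are zero
    return w1 * x1_age + w2 * x2_laws_passed + w3 * x3_years_in_politics + b

print(predict_votes(50, 12, 20))  # one hypothetical candidate: prints 27600.0
```

Training would adjust those four numbers until predictions on the labeled rows are accurate enough.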
They learn from unlabeled data. GPT and other LLMs are trained through unsupervised learning. We simply pass in giant chunks of text, like all of Wikipedia, during training, and there's no need for each sentence or paragraph to be labeled with any additional information. In our voting example, each row was a training data point. But when feeding text into an LLM, there are actually tons of training data points inside a single text snippet. Consider a phrase like "Who are you? I am a helpful AI." LLM creators add phrases like these to the training data set, which is just a giant body of text. There are actually nine different training data points in this phrase. Let's see why. Given the sequence "who", the model learns that the word "are" can follow. Given the sequence "who are", the model learns that the word "you" can follow. Given the sequence "who are you", the model learns that a question mark can follow, and so on. Once training is complete, the model has learned to respond to the question, "Who are you?" with the response, "I am a helpful AI." We don't have to label any of the data points; we already have many sequences where the model can learn to predict the next word. That's the main benefit of unsupervised learning. Most Gen AI models are trained with this technique since training on large amounts of data is much easier. If
you've made it to the end of the video, congratulations. You're on your way to mastering this topic. If you like my teaching style, you'll love our 6-day ML challenge. Each day, I'll explain a new concept in your inbox in an easy-to-digest format. You can sign up for free by dropping your email at the link in the description. See you soon. You may
have seen this weird-looking diagram before. It's actually an RNN, or recurrent neural network. So, let's explain exactly how it works. As usual, we're going to start pretty high level and then get down to the actual equations that are used for RNNs. There are also specific types of RNNs, like LSTMs and GRUs, but we'll go over those in a later video since they're just extensions of the vanilla RNN we're about to cover. In 2017, Google came out with the transformer, and since then, RNNs haven't been as popular, but they are still used today in Google Translate. Google Translate uses a hybrid model: it uses both a transformer and an RNN. So, RNNs are definitely still important to understand. Definitely leave a comment below if you'd be interested in a video on Google Translate. So, how do these RNNs actually work? We're told that they take in some input at time t, get an output at time t, and we also get a hidden state at time t, which is fed back into the model. So now the model has two inputs: the actual input at time t, displayed over here, and the hidden state from the previous time step. So what does all of this mean, and why do we have time steps? The thing about RNNs is that they take in sequential data, and this can come in many different forms. We can have literal time series data, like prices from the stock market. At each time step, our x, our input data, may hold a different price. This may be our starting day t minus one, then we have the price on the next day of whatever stock we're modeling, then the price on the day after that, and the ultimate downstream goal may be to predict the price of the stock on a future day. The model takes in x, the actual price, which we can see over here, and we also take in v. V is interchangeable with h; it's the previous hidden state. We might initialize that hidden state to be some random vector, and then at every time step some calculations are done inside this blue box, or black box, and we get a model output, which is some sort of vector, as well as another hidden state to be passed in to the next time step.
RNNs can also be used to model text-based data. We may know by now that we can think of text as a sequence of words, right? So we can think of every word, or subword, or character as a time step. We can convert each word to some sort of number, maybe an embedding vector, and that will be the x that we pass into these models. Maybe our ultimate goal is to do autocomplete and get the next word in the sequence, or maybe it's to gauge the sentiment or emotion in our text. Regardless of our NLP application, we can pass text into these RNNs simply by modeling each word as a time step. So
let's continue with the stock price example. Let's say we have a couple of days of stock prices, and the goal is to predict the price on day three. We have our day one price fed into the RNN over here, and then there is some computation done inside the RNN. We'll get into the equations for how that computation is done soon. We then have the day two price over here fed into the next unit, as it's called, of the RNN, and in this box over here there is some very similar math being done. Don't forget that we have this initial hidden vector: since we don't have any day zero data, we just randomly initialize a hidden vector to pass into the RNN on the left. Then we get a day one output, which can be thought of as the model's prediction for the day two price based on the day one price. However, we know the actual true day two price, and our ultimate goal is to predict the day three price using the computation and equations going on inside this box over here. We then generate a day one hidden vector, which is passed into the next unit. The RNN factors in both the day two price and the hidden information, as it's called, from day one to ultimately get the day two output. So we have the day two output over here as well as a day two hidden vector. And depending on the situation, we might say that our model's prediction for the day three price is either the day two output or the day two hidden vector. This can keep going for as many days of data as we have. And maybe ultimately we only care about the final day: we want to predict the future price of a stock. But what equations are actually being used inside these RNN units to get these outputs and hidden vectors? Here are the main equations. All we really do is use the current hidden state that's passed in, as well as x, to calculate the next hidden state. Then this next hidden state is used to calculate the output for that time step.
And this set of equations, this pair of equations, is used in every single time step. However, what we do need to talk about is: what is W, what is U, what is B? Note that this W is actually different from this W, and this B, or bias, is different from this one. Those are the parameters of the model that need to be learned through training. To understand what W, U, and B, those matrices and biases, are actually doing, all we really need to do is revisit standard, or vanilla, neural networks. We have some sort of input here, which in code would be represented by a vector. So maybe we have an input with some number of attributes. We might call this one X1, this one X2, and this one X3. Although we were talking about feeding plain numbers into our model, the stock prices in our previous example, we might imagine embedding each stock price as some sort of vector. As a result of the calculations going on inside these complicated edges, we end up with a four-dimensional vector. We can say that this is output number one, or O1, this is output number two, or O2, then output number three, and lastly output number four. So this linear layer took a three-dimensional vector and transformed it into a four-dimensional vector. What we need to remember is that each of the four hidden nodes performs linear regression.
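Concretely, that three-to-four linear layer is just a matrix multiply plus a bias vector. A minimal sketch (the numbers are arbitrary, and whether the matrix is stored as 4x3 or 3x4 is only a convention):

```python
import numpy as np

# A linear layer from 3 inputs to 4 outputs: a 4x3 weight matrix plus 4 biases.
# Each row of W holds the w1, w2, w3 of one hidden node's linear regression.
W = np.arange(12, dtype=float).reshape(4, 3)  # 12 weights, arbitrary values
b = np.ones(4)                                # one bias per hidden node

x = np.array([1.0, 2.0, 3.0])  # input vector: x1, x2, x3
o = W @ x + b                  # output vector: o1..o4, one entry per hidden node
```

Counting parameters: 12 weights plus 4 biases gives the 16 total parameters discussed below.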
So that means each of those four hidden nodes is storing a W1, a W2, and a W3, the weights of a linear regression model, one for each of our three input attributes, as well as a bias. That gives us four parameters per hidden node. And given that we have four nodes, that leaves us with 16 total parameters, which could be represented as a 4x4 matrix. Of course, in practice we might do something like a 3x4 weight matrix, covering the four sets of weights over the three input attributes, and store our biases in a separate variable instead of in that original matrix. The idea is clear: each weight matrix is simply the parameters for a linear layer. So let's return to these equations. W over here represents a linear layer with learned parameters. The same goes for U and for B, as well as for this B and for this W.
Each W, B, and U learns its own parameters. However, we saw that in the original diagram we had many time steps going left to right. Each of those RNN boxes uses the same set of W, U, and B, as well as this W and this B: the parameters are shared across time steps. So, the main component we still need to talk about is g. What are these g functions? The g refers to some sort of nonlinear activation function. Here we have the sigmoid function, which always outputs numbers between zero and one. Here we have the tanh function, which always outputs a number between negative one and one. And here we have ReLU, the function which takes the max of the given input and zero. So if the given input is negative, it gets squashed down to zero, as we can see over here, and if the input is positive, then the output is just the same as the input; there it's the identity function. These activation functions can really impact the performance of a neural network. They can drastically improve performance in some cases, and in other cases they can worsen it if the wrong activation is chosen. And when I say wrong activation, I mean a function where, when we calculate our derivatives and gradients, we end up with really poor numbers that might be really close to zero and make it hard to train the network. Ultimately, it comes down to trial and error. Researchers will try different activations for different use cases and see which one results in the best neural network performance. So I
want to end this video by talking about some of the limitations of RNNs. One significant limitation is that we cannot take advantage of parallel processing. We have modern-day GPUs that are very specialized for parallel processing. However, for RNNs, the input for the next time step depends on the output from the previous time step. Because of this, we're limited by the length of the sequence: the longer the sequence, the longer the processing will take. So we cannot exploit parallel processing as much as we would like to. Another significant issue with RNNs is called the vanishing gradient problem. When we calculate the derivatives that are ultimately used to update the weights through gradient descent, which is how we train and optimize any kind of neural network, a lot of them end up going to zero as the sequences grow longer. So as sequences grow longer, RNNs face a significant limitation in training. The reason why longer sequences result in a vanishing gradient requires a little bit of calculus, and I can definitely make a future video on this if there's interest, so leave a comment if that's something you're interested in. I hope that was helpful. And if you're looking to get practice, I have coding problems and quizzes on my website in the description. For each of these practice problems, I have a background and solution video, and for the coding problems, you can run your code against the test cases. There's also a full playlist on my channel that goes in order through each of these problems, and that should pop up on the screen soon, hopefully. I'll see you soon. Okay. LSTM networks. This
stands for long short-term memory network. They're a special kind of RNN, which stands for recurrent neural network. But don't worry if you're not familiar with these, since this video doesn't require background on them. This video is going to start super high level, and then we're going to get more detailed and go into more depth on this diagram as the video goes on. LSTMs are used to make predictions on sequence-based data. This might be a sentence, where we want to predict the next word, kind of like an autocomplete model. Or we might pass in a bunch of stock prices from different days and predict a future price. The transformer neural network, which I have a whole playlist on, and it should pop up somewhere in the top right in a second, is used more commonly these days. However, LSTMs, these neural networks over here, are definitely still worth learning for a few reasons. One, they're still used in products like Google Translate, so they're not entirely obsolete. Two, the head researcher at OpenAI, Ilya, actually said that he thinks RNNs could make a comeback. And last, understanding the issues with LSTMs, since they're not perfect, helps us understand why the transformer neural network over here was originally developed. By the way, if you're already familiar with the general idea of an RNN, you can skip to the timestamp in the description. So, let's
start pretty high level. Let's say we're building a model that is kind of like autocomplete: we want to predict the next word in some sort of sequence, aka a sentence. So let's say we have "I grew up in France. I speak fluent..." and then the model is supposed to predict what word comes next. We know that the word "French" should come next, but we want the model to be able to do this. So let's start over here. We're passing some sort of input x into the model, and x is simply going to be our sequence of words so far, this entire sequence over there. Then we get some sort of output, and we also pass something else, which we can see over here, back into the model, and then our sequence continues again. When we unfold this, what we actually see is the initial time step. We'll think of each token, or each word in our sentence, as a time step. So maybe the first token or word is t equals zero, the next one is t equals one, and so on. Let's say we pass in the word "I" over here, and then the word "grew" over here, and so on. The ultimate goal is to grab the next word in the sentence. So when we pass in the word "I" over here, the model outputs some information in the form of a vector that will be used in the next time step. Then, when we pass in "grew" over here, the model factors in both "I" and "grew" to ultimately make its prediction for the next word in the sentence, which we know is "up". So the fundamental idea here is that the model can factor in previous information to make a future prediction. The model can factor in words from far back in the past, rather than only the previous word that came in the sentence.
For now, let's not really worry about all these symbols like V and H, and let's go back to the general idea of an LSTM. Let's say that between those two sentences we have a bunch of irrelevant information that doesn't help the model predict which word should come over here. We would say that the model needs to remember some information from far in the past and factor that into its response over here. That's where LSTMs come in. Now that we have a high-level understanding of LSTMs, let's dive into these weird-looking symbols and break down how they work. The fundamental difference between a normal RNN and an LSTM is just what's going on inside this box over here. The general idea of passing in information from our previous time steps, getting some sort of output, which we can see over here and over here, passing those into the later time steps, where we also factor in whatever word is at our current time step, and simply continuing this for further time steps until we get the output we're interested in: this fundamental idea remains the same. One of the core concepts behind LSTMs is this black line over here, called the cell state. It runs through all the time steps. And the core idea inside this LSTM is that all these operations we see going on over here, as well as to the right, are involved with updating or modifying that cell state.
So let's look at the first gate that we have in the network, right over here. If we scroll down, we can see that what's actually going on is we concatenate xt, the word at our current time step, and the vector ht minus one, the vector that came from the previous time step. The concatenated vector is then passed into a single linear layer. So that's just a standard neural network. If you're not familiar with neural networks, that's all right. As a quick review, we're essentially passing in some sort of vector as the input, and we can think of a number being stored in each of these nodes over here. So this is x1, this is x2, and this is x3: a vector with three entries. Then what's going on in each of these hidden nodes is linear regression. Y1 through y4 is calculated for each of those nodes based on this equation. For each of those nodes, the x1, x2, and x3 from the input layer are used, and each node learns, over the process of training the LSTM, a separate w1, w2, w3, and b, or bias. Then each of those entries y1 through y4 is passed into this sigmoid function over here, which always outputs a number between zero and one. So ultimately, after this layer, or this gate, symbolized by the sigmoid, or Greek sigma letter, we have a vector of entries where each entry is between zero and one. Conceptually,
whatever vector is calculated and output over here factors in the current word as well as whatever vector came in from the previous time step. And we're going to use that to update the cell state. The way we update the cell state, symbolized by this X and essentially represented by this equation, is to take the element-wise multiplication of the previous cell state vector and our sigmoid output. Since the entries in this vector over here are between zero and one, we can think of that as taking a fraction of each entry in the previous cell state vector: we're multiplying it by something between zero and one and reflecting that in the updated cell vector. Just to make that super clear, we can see that if we multiply two vectors element-wise, where one of the vectors has entries between zero and one, this is equivalent to keeping, or preserving, some fraction of the information. That means there's an intuitive explanation for what that sigmoid gate is doing. It's referred to as the forget gate. We know that some of the information in a body of text is irrelevant for predicting the next word in the sentence. Going back to our example, where we wanted to predict the word "French": we had a bunch of irrelevant information in between in that body of text, and the main thing we wanted to remember from the past was that the speaker grew up in France. But some of the information in our current time step, and maybe even in the previous time step, needs to be forgotten by the neural network. That's why we refer to this as the forget gate. If we're multiplying the previous information in the cell state by some number between zero and one, then we are essentially forgetting some of that information and also preserving some of it. The closer the sigmoid output is to zero, the more information we're forgetting. The closer that output is to one, the more information we're remembering, since if we multiply something by one, it doesn't change at all; we're preserving that previous information.
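The forget-gate arithmetic can be sketched in a few lines of NumPy. The concrete numbers below are made up; the point is the element-wise multiply that keeps a fraction of each cell-state entry:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

c_prev = np.array([2.0, -4.0, 1.0])  # previous cell state
z = np.array([10.0, 0.0, -10.0])     # linear-layer output from [x_t, h_{t-1}]
f = sigmoid(z)                       # forget gate: every entry lands in (0, 1)

c_forgotten = f * c_prev             # element-wise: keep a fraction of each entry
# f is roughly [1.0, 0.5, 0.0]: remember the first entry fully,
# keep half of the second, and forget the third almost entirely.
```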
And through the process of training the LSTM, the right parameters are learned for each of these nodes, this top node as well as the three below it, so the gate learns to forget and preserve the right information from xt and ht minus one. The other gates, to the right of the first gate, work in a very similar way. We also need to figure out what information to add to the cell state after we've figured out what information to forget and what to preserve. And our final gate, which can be seen over here, simply helps us determine what information we should output. In the interest of keeping this video concise, let's quickly explain how this next gate works. There are two steps to adding new information into our cell state. The first step is figuring out what information we want to add: that's what this tanh gate does. And the next step is figuring out how much of that new information we ultimately want to add. The sigmoid gate over here works exactly the same as the previous sigmoid gate. If we scroll down, we see that we just concatenate xt and ht minus one and pass them through a linear layer, although we should note that the linear layer used in this gate, the new sigmoid gate we're talking about, has a separate set of weights learned for each of the nodes y1 through y4. And we follow that up with a sigmoid function, which we can see over here.
The tanh gate over here works extremely similarly. We have a linear layer, with nodes in the hidden layer learning weights based on the xt and ht minus one that are passed in. However, we follow up the output of the linear layer with a tanh activation, which outputs values between negative one and one instead of between zero and one as in the sigmoid function. This should make sense, since the tanh gate is supposed to calculate what new information we want to add into our cell state, what new information is relevant for the model to remember, while this sigmoid gate figures out how much of that information is ultimately relevant to add. So before we add the information into the cell state, we simply multiply those two outputs together, since we know that multiplying vectors element-wise, where one vector has entries between zero and one, decides how much of the second vector should be preserved. All that's left to figure out is how we calculate ht, the output of this time step, which is also used in the next time step: figuring out what needs to be output for this time step. We should probably draw that information from our updated cell state, after the forget gate has done its calculations and the addition gate has done its calculations. The updated value of the cell state at this point factors in our previous information and our current word at this time step. And not only has it factored that information in, but over here it's discarded what's irrelevant, and here it's added in what is relevant.
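Putting the pieces together, here's a sketch of one full LSTM time step in the standard formulation, with the forget, input, and output gates. The sizes and random parameters are illustrative only; this isn't the video's code:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h = 3, 4  # input and hidden sizes, chosen arbitrarily

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Each gate has its own learned weights over the concatenated [h_{t-1}, x_t].
Wf, Wi, Wc, Wo = (rng.normal(size=(n_h, n_h + n_in)) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_h)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])  # concatenate h_{t-1} and x_t
    f = sigmoid(Wf @ z + bf)           # forget gate: what to keep from c_{t-1}
    i = sigmoid(Wi @ z + bi)           # input gate: how much new info to add
    c_tilde = np.tanh(Wc @ z + bc)     # candidate new information, in (-1, 1)
    c_t = f * c_prev + i * c_tilde     # updated cell state
    o = sigmoid(Wo @ z + bo)           # output gate
    h_t = o * np.tanh(c_t)             # hidden state / output for this time step
    return h_t, c_t

h, c = np.zeros(n_h), np.zeros(n_h)
for x in [np.ones(n_in), np.zeros(n_in)]:  # a toy two-step sequence
    h, c = lstm_step(x, h, c)
```

The last two lines of `lstm_step` compute the output, which is exactly what the transcript describes next.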
That's exactly what the LSTM does. It passes the cell state into a tanh gate over here, and then we element-wise multiply that with the output of yet another sigmoid gate. This tanh layer factors in the current value of the cell state, which we might think of as ct, and this sigmoid layer over here factors in xt and ht minus one. As usual, the element-wise multiplication decides how much of the information that came from the cell state is relevant to preserve and ultimately pass on as our output, over here and over here. The fact that the sigmoid output is between zero and one is what makes that work. So that's the overview of LSTMs, but there are still some clarifications we need to make. LSTMs aren't perfect, and they have issues. The primary issue is that we cannot take advantage of parallel processing as much as we would like to with modern GPUs. We need the outputs from previous time steps during training to calculate future time steps. That means we're limited by the sequence length, and the longer the sequence, the longer training will take. Transformers, on the other hand, process all tokens, which are essentially words at different time steps, in parallel. Next, it may be unclear, based on the concept overview we just gave, how to actually implement an LSTM and train it. We know there's going to be a lot of matrix multiplications, a lot of sigmoid functions, and a lot of tanh functions. If you're interested in a video on implementing an LSTM, definitely leave a comment. Lastly, learning requires practice. That's why I've created a bunch of coding problems and quizzes, which should be popping up on the screen soon. They're all free on my website. Every practice problem has a video associated with it in the playlist that's about to pop up. So definitely check it out, and hopefully I'll see you soon. Large language models are capable
soon. Large language models are capable of many different tasks. You can
roleplay with them, ask them to generate code, and even write poems. But in some cases, we want to customize them or have them specialize in a particular task, like generating code in C or speaking a
really niche language that wasn't in the original training data. This is where fine-tuning is useful. This means
further training of the model, actually updating the weights and biases using gradient descent, except this time we're using a new data set, typically much smaller than the original data set, and
we start off with the pre-trained version of the model. Although some
models like GPT are closed source, many models like the llama family from Meta are open source. You can download the model weights, chat with the model locally, and even fine-tune it on a personal data set. But these models have
billions of parameters. So, how can you fine-tune them without the expensive GPUs used to pre-train them? The answer
is Laura. Now that the NPCs have clicked off the video, let's get into it. If
you're still here, you're the kind of person I would want in my ML community, which I'll talk about at the end. Back
to Laura. There are some other tricks like quantization, but lowering adaptation is almost always used for fine-tuning. There's two main questions
fine-tuning. There's two main questions to ask. First, of the billions of
to ask. First, of the billions of parameters in the model and all the layers, or all of them necessary to update? The answer is a resounding no.
update? The answer is a resounding no.
We can achieve almost the same performance by freezing most of the weight matrices and only updating the ones used in the attention layer. one of
the core components of an LLM. The
second question to ask is: for the matrices we do update, do we actually need to directly update each entry through gradient descent? For an N by N matrix, that's N squared total parameters, and N can be in the thousands for LLMs. So, we can make a second simplification that achieves almost the same performance. Freeze the original weight matrix W naught. Instead, let's calculate delta W and add it to W naught. Delta W is the product of two smaller matrices B and A. B is N by R and A is R
by N. The entries in B and A are the ones updated through gradient descent, and this is only 2NR parameters. R is typically much, much smaller than N. If N is in the thousands, R might only be, say, 16. Dropping the constant factor, this is on the order of N entries to update, which is far less than N squared. By the
way, the original LoRA paper recommends initializing B to zero and A to a normal distribution for optimal results. Side
note, if you're interested in an end-to-end walkthrough with every line of code explained, I created a full course on fine-tuning exclusively for the students in our ML community. In
addition to requiring far less compute, there's another benefit to LoRA. Let's
look at a tree where we have different versions of an LLM, each serving a use in production. The most intensive step here is loading the root model into memory. But that only needs to be done once. To switch between customized states of the model, we simply add and subtract delta W, which is just the product BA for each node in this tree. Adding and subtracting these smaller matrices is far less intensive than reloading the full model, which is billions of parameters large. LoRA is
incredibly easy to use on free GPUs like the one provided in Google Colab and will likely be used for years to come.
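The parameter savings described above can be sketched in plain Python. The sizes and the `matmul` helper are illustrative assumptions; a real LLM would have N in the thousands and use a tensor library.

```python
import random

def matmul(B, A):
    """Multiply an N x R matrix B by an R x N matrix A to form the N x N update delta W."""
    n, r = len(B), len(B[0])
    return [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(n)] for i in range(n)]

# Illustrative sizes: N would be in the thousands for a real LLM; R stays small.
N, R = 8, 2

# Per the LoRA paper's recommendation: B starts at zero, A starts from a normal distribution.
B = [[0.0] * R for _ in range(N)]
A = [[random.gauss(0.0, 0.02) for _ in range(N)] for _ in range(R)]

# Only the entries of B and A are trained: 2*N*R parameters
# instead of the N*N entries of the full weight matrix.
trainable = 2 * N * R   # 32 here; 2NR dwarfed by N^2 when N is in the thousands
full = N * N            # 64 here

# The frozen weight W naught is used as W naught + delta W at inference time.
delta_W = matmul(B, A)  # all zeros before any training, so fine-tuning starts from W naught
print(trainable, full)  # 32 64
```

Since B starts at zero, delta W is the zero matrix at the start of fine-tuning, so the adapted model initially behaves exactly like the pre-trained one.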
If you found this video helpful and are interested in more fine-tuning videos, leave a comment and I'll see you soon.
You may have asked ChatGPT a question before and it says, "Since my last knowledge update, there's no way to answer your question." How do we keep LLMs up to date with new information or
let them access information that wasn't in the training data? One option is to continue training the model or fine-tune it on new data, but this is expensive
and time-consuming. The alternative is RAG, or retrieval-augmented generation. This
approach originated from a 2020 paper written by researchers at Facebook's AI research division and has quickly become an extremely popular tool. On to how RAG actually works. First, let's start with some bank of information that we want an LLM to have access to so that the LLM's responses to our questions can actually
factor that bank of information in. This
might be a new version of a textbook that just wasn't in the training data.
Or it might consist of internal company documents that obviously weren't in the training data, which is usually just the internet, but are still necessary for a company's internal LLM to have access
to. That is the general idea. Here is the RAG workflow that helps us achieve this. And if any parts seem a bit confusing at first, don't worry. We'll
go over all of them. We have a user that sends a query or question to an LLM. But
let's not send the query directly to the LLM. We'll send the query to an embedding model which will output a vector representation of the query. This
hits a retriever model which will search a vector database for knowledge that might help answer the question. This
database outputs the top K documents that were most similar to the question being asked where K is a parameter we get to choose. The higher K is, the more information we're giving to the LLM to
answer the original question. But keep
in mind that K can't be too large, since we might end up including irrelevant information, introducing additional latency into our workflow, or simply exceeding the LLM's context length, which
is the maximum number of tokens or words it can take in at once. When actually
prompting the LLM, we can say something like, "Here are some documents that might help answer the following question." And then we concatenate the question and the top K documents. The
output from the LLM is then given back to the user. That's a high-level overview, but let's go over the embedding generator and the retriever. Regarding
the LLM, it's just a standard transformer like ChatGPT. Let's treat
that as a black box for this video. We
pass in a prompt and we get a response back. Okay, let's go back to the
back. Okay, let's go back to the knowledge bank from earlier. We break it into chunks. This might be the chapters
into chunks. This might be the chapters of a textbook, the subsections within a long company document, or even something arbitrary like each paragraph or a total
number of words for each chunk. How we
choose to chunk the knowledge bank will definitely affect the results and depends on our use case. Then we'll
generate a high-dimensional vector representation for each chunk, which is essentially an embedding. We're going from raw strings to high-dimensional vectors that capture the meaning of the text. We'll then store those embeddings in a vector database. But what neural network is actually used to generate the embedding of each chunk? This is
typically just another transformer which is trained to generate embeddings for the text passed in. If you're interested in how transformers work, check the second link in the description. This
neural network is typically just the left side of the transformer, known as the encoder. We're encoding the input into a meaningful high-dimensional vector. To summarize, before anything in this workflow happens, we break up the knowledge bank into manageable chunks which are sent through the embedding model and then stored in the vector database. Then during the actual RAG workflow, each question we want to ask the LLM is also sent through the embedding model and then the retriever searches the vector database for the
similar documents that might help answer the question. But let's talk a bit more about the retriever. A class of algorithms called maximum inner product search, or MIPS, is used here. For our case, inner product essentially means dot product, which we know is a measure of similarity between two vectors. So a maximum inner product search means we're looking for the top K chunks in the vector database that have the highest dot product, or similarity, with the embedded question. And that's a high-level overview of RAG. If you're an aspiring data scientist or machine learning engineer, this is a must-know topic. In
this video, we're going over the vision transformer, and it won't require understanding all the details of this diagram. The transformer is a special kind of neural network that powers large language models like GPT. It was developed in 2017 and was presented in the paper "Attention Is All You Need" from Google. And in 2021, Google developed the vision transformer for processing images instead of text. What's crazy is that vision transformers can outperform CNNs, or convolutional neural networks, the go-to model for image classification. Attention is the main mechanism in a transformer that makes the model so effective at processing sequences. The model learns which parts of the sequence are important to, well, pay attention to. In processing text, this means that the model might associate adjectives and nouns. And with the vision transformer, the model pays attention to the most important parts of the image. Before we
go over how this actually works, this is your reminder to sign up for GPT Insiders so that you don't miss our next edition. GPT Insiders is my free emailing list where each day I share a different insight or resource for your ML journey. No spam, just valuable tips and resources. You can sign up at the link in the description. Let's get into it. We're going to treat the transformer as a black box. Let's focus on the inputs and the outputs. The input is always a sequence. In the case of GPT, a sequence of words or tokens where each
is represented as a number. How can we break an image down into a sequence? One
option is to pass in every single pixel as an integer, but for an image that's 100 by 100, which is still very low resolution, the sequence would have 10,000 elements. This is infeasible since the computational cost of attention is n squared, where n is the length of the input sequence. The solution is to break the image into patches. So our input sequence's length is now equal to the number of patches.
This would just be the size of the image divided by the size of a single P by P patch. At each entry in the input sequence, we have a vector which is just the patch flattened into a 1D list. The
patch size is critical. If we make the patches too small, the sequence will be too long and compute demands will be too high. If we make the patches too large, the sequence will be shorter and compute demands will be less, but we risk oversimplifying the input and obscuring important information from the model.
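The sequence-length tradeoff above can be sketched directly; the 224 and 16 below are just example sizes (the ones the original ViT paper used), not a recommendation.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of patches when a square image is split into non-overlapping P x P patches."""
    assert image_size % patch_size == 0, "patch size must divide the image size"
    per_side = image_size // patch_size
    return per_side * per_side

# A 224x224 image with 16x16 patches gives a sequence of length 196,
# versus 50,176 if every pixel were its own sequence element.
print(num_patches(224, 16))  # 196

# Halving the patch size quadruples the sequence length, and attention cost
# grows with the square of that length.
print(num_patches(224, 8))   # 784
```

Each patch is then flattened into a vector (patch_size * patch_size values, times 3 for an RGB image) before entering the transformer.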
The optimal patch size can be found experimentally. In summary, the transformer always takes in a sequence of length n with a vector for each entry in the sequence. But what about the output? After the model performs its calculations, the transformer outputs another sequence of n different vectors.
But what if our goal is to use the model for image classification? We might want to classify the input image as a dog, cat, or bird. There are ways that we can force the model to output vectors of
size three where each entry corresponds to a probability. But we have n different vectors. Which vector do we consider to be the model's prediction?
The solution is to prepend the input sequence with another dummy vector. Then
we can look at the model's output vector for that corresponding index. This is
the model prediction, which will be improved over many iterations of training. And that's the gist of vision transformers. There is one catch, though. Vision transformers must be trained on extremely large amounts of data to outperform CNNs. On the other hand, they require substantially less compute to train. The bottom line is that while vision transformers are extremely powerful and promising, CNNs aren't going away anytime soon. If you're
interested in a deeper discussion of the vision transformer, I have a huge announcement for you. Beginner's
blueprint is finally available. This is
the exact study plan I wish I had when I was first getting started with machine learning. Everyone told me to read papers, but I had no idea which papers to read. Once I figured that out, I had no idea how to understand them, dissect them, much less implement them, and code up the main concepts. And lastly, Workday was a nightmare. I had no idea how to present these projects on my resume and actually land more interviews. The beginner's blueprint will solve all of these problems for you so that you can make progress faster than I did. Take it from someone from the IIT Madras class of 2025. I
personally provided him with a roadmap helping him get started with the implementation ASAP. Or Shear. He's an
NLP expert and his personal favorite resources are our ML programming questions which are accessible for free.
Lastly, someone like Chang. He's an ex-Yahoo AI/ML engineer and he knows his stuff. It's been a blast making videos on this channel for the last year and I'm excited to help even more of you with premium personalized instruction.
Our launch sale is now active and you can secure the entire blueprint for 50% off. Head to the link in the description
off. Head to the link in the description to learn more and I'll see you on the other side. CNNs, or convolutional neural networks, are one of the most powerful models for giving computers the gift of vision. They're used in self-driving, image captioning networks, and even GANs, or generative adversarial networks. The central idea is to have the model learn which features and patterns in an image are the most important. Here is my promise to you. I will provide the clearest explanation of CNNs you've ever seen in return for 5 minutes of your time. Let's get
started. First, let's understand the input. We pass images into CNNs, which are typically represented as arrays of pixel values. For a 4x4 image, we would have a 4x4 array of numbers. The main
calculation inside a CNN is called a convolution, and each filter performs a convolution. Each filter is just another array of numbers that slides over the input image left to right and top to bottom, performing multiplications and additions. Now is a good time to clarify what convolution really is. Given a 4x4 image, let's slide this 2x2 filter over the image and see what the output is. We
overlay the filter on every 2x2 subgrid within the image. Multiply the corresponding numbers and add them up. This is the calculation for the top left subgrid.
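The sliding-window calculation just described can be sketched in a few lines of Python. The pixel and filter values below are hypothetical, since the on-screen numbers aren't in the transcript.

```python
def convolve(image, kernel):
    """Slide the kernel over the image (stride 1), multiplying overlapping
    entries and summing them, to produce the convolved output."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(k) for b in range(k))
            row.append(total)
        out.append(row)
    return out

# Hypothetical 4x4 image and 2x2 filter.
image = [[4, 4, 1, 0],
         [0, 4, 0, 2],
         [1, 0, 1, 3],
         [0, 2, 0, 1]]
kernel = [[1, 2],
          [0, 2]]

result = convolve(image, kernel)
print(len(result), len(result[0]))  # 3 3 -- a 4x4 image and 2x2 filter give a 3x3 output
print(result[0][0])                 # top left entry: 4*1 + 4*2 + 0*0 + 4*2 = 20
```

The same function works for any square image and filter sizes, which is why the output shrinks to (n - k + 1) on each side.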
Slide one unit right. This process
repeats for the entire image. Also,
CNNs have many filters. So, if we had 10 filters, we would have 10 outputs. Since each output is 3x3, at this stage we have a 3x3x10 array of numbers. But
let's say the ultimate task of this model is to classify the input image as a dog, cat, or bird. We would want to output a vector of size three where each entry corresponds to the probability
that the input image is a dog, cat, or bird. So given this 3x3x10 array, the model would use a series of reshape operations, nonlinear functions, matrix multiplications, and even more convolutions to arrive at the final
output vector of size three. But there's
one crucial detail we still haven't discussed: training. When we initialize a CNN, the entries in all the filters are random numbers. So the initial model predictions will be inaccurate. Over the
course of training, the entries of the filters are updated until we're satisfied with the model's predictions.
Through training, each filter learns to detect a unique pattern in the image.
For example, one filter might detect a dog's tail, meaning when we slide that filter over the image, the output would be nonzero. But if we slide that filter over an image with no dog tail, the output would be mostly zero. But how are
the filter entries actually updated during training? That brings us to gradient descent, the first ML algorithm I think everyone should learn. I have a three-minute video breaking down gradient descent which should pop up in the top right. If you enjoyed the visuals in this video, you'll love the visuals in the gradient descent video. I can't recommend it enough. See you soon. CNNs, or convolutional neural networks, are a kind of model that specializes in detecting patterns in images. They're
used in self-driving, image captioning networks, and even GANs or generative adversarial networks, which can be used to create deep fakes. The core idea is to have many weight matrices called
filters that slide over an input image, multiplying and adding up numbers. The
model learns the right numbers for each filter so that special characteristics in the image can be detected. Let's answer a few quiz questions about CNNs.
Question one: what algorithm is used to update the filter weights during training? A, linear regression; B, gradient descent; C, dynamic programming; D, self-attention. The answer is gradient descent, a minimization algorithm. This equation is used to update each entry in the filter matrix at each iteration of training. If
you're not familiar with gradient descent, I have a 3-minute video breaking this equation down. It's the second link in the description. Question two: if we have a 4x4 image and a 2x2 filter, what shape would the output, called the convolved image, have? A, 4x4; B, 3x3; C, 4x3; D, 3x4.
Understanding these dimensions is important since libraries like PyTorch and TensorFlow require you to specify them when instantiating CNNs. If you
want to try the calculation yourself, pause the video here. Okay, let's assume a stride of one, which means that the filter slides one unit at a time. This
filter can slide from left to right a total of three times and downwards a total of three times. And all the convolutions look like this.
And we end up with a 3x3 output.
Question four of this quiz will cover how the calculation is actually done.
Question three: which of the following about CNNs is true? A, gradient descent works better on normal neural networks than CNNs; B, the many filters of a convolutional layer are applied in sequence; C, each filter is independently applied in parallel; D, CNNs typically don't contain other layers like linear or sigmoid layers. The answer is C: each filter is independent and applied in parallel. Since each filter operates on the input image independently, each can learn a unique pattern in the image, like a vertical edge or horizontal edge. If you're wondering about the other choices, gradient descent works just fine on CNNs, and CNNs do use other layers in addition to convolutions to form the final prediction. Question
four, given this 4x4 image and the numbers at each pixel and this 2x2 filter, what would the top left entry in the output be? If you want to try the
calculation yourself, pause the video here. Okay. The top left entry is found by overlaying the filter on the top left corner of the image. We multiply the corresponding numbers and add them up.
That's all a convolution is. That's 4 * 1 + 4 * 2 + 0 * 0 + 4 * 2, which comes out to 20. Keep in mind that the output
matrix from this convolution is passed on to other layers of the model which ultimately give us a single vector of the model's predictions. Then at each iteration of training, the numbers in
the filter are adjusted to make the final model prediction more accurate.
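The arithmetic in question four can be double-checked in a couple of lines, using the overlapping values from the worked example:

```python
# Top left 2x2 subgrid of the image (values from the worked example) and the 2x2 filter.
subgrid = [[4, 4],
           [0, 4]]
filt = [[1, 2],
        [0, 2]]

# Overlay, multiply corresponding entries, and add them up.
top_left = sum(subgrid[a][b] * filt[a][b] for a in range(2) for b in range(2))
print(top_left)  # 4*1 + 4*2 + 0*0 + 4*2 = 20
```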
That wraps up our quiz on CNNs. If you watched through the full video, then I think you'll benefit from my ML community. I offer one-on-one AI/ML mentorship and help you build projects for your resume. You can read more about it at the link in the description. I
hope you found this video quiz useful and I'll see you soon. This video is going to be different. We're going to learn about something other than neural networks for once. By the way, if you need a refresher on neural networks,
just check the second link in the description. So, that means we're not going to talk about deep learning today, which gets all the attention these days.
Deep learning is just the field of AI that's focused specifically on neural networks, a type of model that's far more powerful than simple statistical models like linear regression. By the
way, it's a small technicality, but technically deep learning is a subfield of ML and ML is a subfield of AI. So today we're going to talk about the K-means algorithm, which is a classical ML method that does not fall under deep learning. Sometimes neural networks are overkill. For simple ML problems, neural networks are not necessary, and that means we can build a high-accuracy model without neural networks. We prefer this when possible since training neural networks is expensive and time-consuming.
Let's say we have a thousand data points of information about dogs and each data point contains five numbers describing a particular dog, their weight, height,
tail length, etc. But for each dog, we have no idea which breed they belong to.
In other words, our data is unlabeled.
Can we create an algorithm that groups the similar dogs together? Beforehand,
we must decide how many groups or breeds there might be. We'll call this variable K. And let's represent each dog from the data set as a vector with five entries. We want the similar dogs to end up in the same group or breed. Let's start off by randomly assigning each dog to one of our K different groups or clusters. And over
some number of iterations, let's update which group each dog belongs to. And at
the end, the similar dogs will be assigned to the same group or breed.
Here's what the final output of the algorithm after it finishes running might look like for the case where K equals 3 and we only have two attributes
for each dog. Attribute one on the x-axis and attribute 2 on the y-axis.
But we've been treating this algorithm like a black box. At each iteration, how do we actually update which group or breed each dog belongs to? Step one,
calculate the average vector of each group. Step two, store the average as the centroid. Step three, assign each dog to the group whose centroid is nearest. Step four, repeat. Let's talk about step
one. By the way, if you're still watching the video, then I know you're not just mindlessly scrolling through YouTube and that you're actually interested in ML. You're the kind of person I want to join my ML community,
which you can read about at the second link in the description. Back to step one. Before the algorithm, we randomly assigned each dog to a group, where the groups are indexed from one to K. To
calculate the mean of each cluster, we simply average all the vectors assigned to a given group together. Specifically,
this average is done element by element for all the vectors, and the resulting vector is called the centroid of that group. In the image from earlier, the black dots are the centroids for each cluster after the algorithm finishes. But while the algorithm is running, each dog isn't necessarily assigned to the nearest centroid. The photo below should clarify what that means. We start off with totally random clusters and centroids. After each iteration, the centroid approximations get better. At each iteration, we assign each data point to the group whose centroid is nearest, which shifts the centroids, and this process repeats. Let's focus on
the final blue cluster. Initially, that cluster centroid was here. After reassigning data points based on the closest centroid, that cluster centroid shifted over here. We performed one more iteration and nothing changed, so we knew the algorithm concluded. Let's talk briefly about step three. We want to assign each dog to the nearest centroid, or to the nearest cluster. That means we need to calculate the distance between every data point and each of the K centroids so that we can find which cluster is closest to each data point.
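The four steps above can be sketched in plain Python. The data points and k value below are illustrative assumptions, and real implementations add refinements like smarter initialization.

```python
import math
import random

def distance(p, q):
    """Euclidean distance; works for 2-dimensional or 5-dimensional vectors alike."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, iterations=100):
    # Start off by randomly assigning each point to one of the k clusters.
    assignments = [random.randrange(k) for _ in points]
    for _ in range(iterations):
        # Steps one and two: the centroid of each cluster is the element-wise mean.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if not members:              # keep an empty cluster from breaking the mean
                members = [random.choice(points)]
            dim = len(members[0])
            centroids.append([sum(p[i] for p in members) / len(members) for i in range(dim)])
        # Step three: reassign each point to the group whose centroid is nearest.
        new_assignments = [min(range(k), key=lambda c: distance(p, centroids[c]))
                           for p in points]
        if new_assignments == assignments:   # nothing changed, so the algorithm concluded
            break
        assignments = new_assignments        # step four: repeat
    return assignments, centroids

# Two tight groups of 2D points; with k=2 they should end up in separate clusters.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (9.0, 9.0), (9.1, 8.9), (8.9, 9.2)]
labels, centers = k_means(points, k=2)
print(labels)
```

The same code handles five-dimensional dog vectors unchanged, since `distance` and the element-wise mean both work for any number of entries.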
We can simply use the distance formula, which looks like this for two-dimensional data and looks like this for five-dimensional data. After enough iterations, which could be hundreds or thousands, the K-means algorithm is complete. There are other clustering algorithms, but K-means is the simplest one. Let's say you're working with a new data set and there are no labels. You might want to gain an initial understanding about your data or group the data points into buckets. K-means might be the best place to start. There are also many efficient implementations of K-means, and leave a comment if you'd be interested in a video breaking down the code. I hope you found the break from deep learning interesting and I'll see you soon. Four must-know ML concepts.
Whether you're a student, a data scientist, or engineer, this video will serve as either a refresher of the basic fundamentals or simply a way to bridge the key concepts together. Instead of
talking about each concept independently, I'll bridge them together like a story. Let's get started. The
first is training. It can be really annoying when people talk about models learning or training without being clear as to what that means. So let's start off with a high-level explanation, and later
in the video we'll talk about the actual equation used for training. Over some
number of iterations we adjust the model and improve its performance. Let's say
our model learns to predict how tall someone will be from a data set where we have various information about people as well as their final height once they're finished growing. How do we actually increase the model's accuracy? Well, we need to remind ourselves of what a model is. It's just a mathematical formula that outputs a prediction. There are two
components to the formula. First, the
inputs that might affect someone's final height. This might be their current weight, their current height, and the average of their parents' heights. Second, the parameters. They're also called weights. Let's look at the simplest kind of ML model, represented by this equation. The input numbers are X1, X2, and X3, while W1, W2, W3, and B are the parameters of the model. The parameters are just random numbers at first, so the output number H won't be a very accurate prediction of how tall someone will be. But over training, we update the values of the parameters, improving the model's accuracy. And
that's all training is. We're just
adjusting the parameters of the formula.
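That formula can be written out directly. The input values and parameter values below are hypothetical, just to show the shape of the calculation.

```python
# The simplest model: a weighted sum of the inputs plus a constant term B.
def predict_height(x1, x2, x3, w1, w2, w3, b):
    """h = w1*x1 + w2*x2 + w3*x3 + b"""
    return w1 * x1 + w2 * x2 + w3 * x3 + b

# Hypothetical inputs: current weight (kg), current height (cm),
# and the average of the parents' heights (cm).
x1, x2, x3 = 60.0, 165.0, 172.0

# The parameters start as arbitrary numbers; training would adjust them.
h = predict_height(x1, x2, x3, w1=0.1, w2=0.5, w3=0.4, b=5.0)
print(h)  # roughly 162.3
```

Training changes only the w's and b, never the formula itself, which is exactly the point being made here.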
LLMs, for example, are also just large math formulas. They are formulas that predict the exact sequence of words that best answers your prompt. The formula is a lot more complex since we're dealing with words. And before we pass words into a math formula, we have to represent each word as a number. And then when numbers are output by the LLM, we have to convert them back into words so that we can actually read the response. But at the end of the day, a large language model is also still a math formula. In concept number four, the last concept of this video, we'll see the actual equation used to update the parameters at each iteration of training. But that concludes our overview of training for now. Concept
number two is linear regression. This
one will take less time to explain. The
super simple model we talked about earlier actually has a name: linear regression. Here's the equation again.
This model factors in three pieces of information about the input person, X1, X2, and X3. But if we wanted to factor in more characteristics about someone,
we could add more terms. There is one issue with linear regression, though. We can't square or raise any of the inputs to any power. And we also can't use any nonlinear functions like the sigmoid function. For many cases, linear regression is effective. But sometimes the relationship between input and output is more complex, and in these cases, linear regression yields poor accuracies. But what if we just ran the training algorithm for more iterations?
In those cases, even if we update the parameters repeatedly, the model's accuracy will remain low. Fortunately,
neural networks solve this issue, which brings us to concept number three.
Neural networks are another kind of model and they're much more powerful than linear regression. Here is a standard neural network. They use linear regression as well as nonlinear
functions like the sigmoid. The three
input nodes store x1, x2, and x3 respectively. Each of the four nodes in the next layer actually uses this equation. Each node in that layer stores its own set of parameters, which are W1 through W3 plus the constant term B. Those parameters are updated during training. Just to be clear, that's actually four parameters per node in this middle layer, for a total of 16 parameters in that column of four nodes. But here's where neural networks differ from simply having each node do a linear regression in parallel. At this point, we have four y values, y1 through y4.
And before passing those values on to the next and final layer, let's pass each y value through the sigmoid function. Here it is again. Its outputs are always between zero and one. The larger the input, the closer the output is to one. So what do the two nodes in the output layer, which we'll symbolize with the letter O, actually do? They use this equation. But the x's, or the inputs for this layer, are actually the outputs of the sigmoid function, which transforms the previously calculated y values. Just to be clear, here's the relationship between the y's and these x's. Each of those two output nodes learns and stores its own set of parameters w1 through w4, plus a constant term b as well. That gives us two output values from the model. Let's simply take the average of them and report that number as the final height prediction.
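The full forward pass just described, four hidden linear units, a sigmoid on each y value, then two output units whose results are averaged, can be sketched as follows. All parameter values are made up for illustration.

```python
import math

def sigmoid(z):
    """Outputs are always between 0 and 1; larger inputs push the output toward 1."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(inputs, weights, b):
    return sum(w * x for w, x in zip(weights, inputs)) + b

def forward(x, hidden_params, output_params):
    ys = [linear(x, w, b) for (w, b) in hidden_params]   # four y values, y1..y4
    activations = [sigmoid(y) for y in ys]               # squashed to (0, 1)
    outputs = [linear(activations, w, b) for (w, b) in output_params]
    return sum(outputs) / len(outputs)                   # average the two output values

# Hypothetical parameters: 4 hidden nodes with 3 weights + a bias each
# (16 parameters in that column), and 2 output nodes with 4 weights + a bias each.
hidden = [([0.1, -0.2, 0.05], 0.0) for _ in range(4)]
output = [([1.0, 1.0, 1.0, 1.0], 0.0), ([0.5, 0.5, 0.5, 0.5], 0.0)]

print(forward([60.0, 165.0, 172.0], hidden, output))
```

Deleting the `sigmoid` line would collapse the whole network back into one big linear regression, which is exactly why the nonlinearity matters.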
And here's the bottom line. Adding
nonlinear functions in between the linear regression layers is powerful.
The model is now capable of learning a far more complex relationship. That
brings us to concept number four. The
actual equation for learning. The
equation or algorithm is called gradient descent. Here are the formulas used at every iteration of training. Don't
worry, no crazy calculus needed here.
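The on-screen formula isn't captured in the transcript, but in standard notation the update rule being described, for any parameter w (or constant term b) and error function E, is:

```latex
w \;\leftarrow\; w \;-\; \alpha \cdot \frac{\partial E}{\partial w}
```

Here alpha is the learning rate and the derivative term is the gradient, matching the discussion that follows.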
Also, the same formula is used to update the W's and the constant terms B, so we can just write one equation. Alpha is called the learning rate. It's typically a small value like 0.01, but it can vary. And the derivative of the error is called the gradient. We subtract out the product of alpha and the gradient at each iteration. And with enough iterations, the parameters will eventually stop changing. At this stage, the model's error function is minimized. That's actually the main idea. This equation, or algorithm, gradient descent, can be used to minimize any function. In this case, we use it to minimize the error function, specifically the error between the model's predictions and the true answers. The higher alpha is, the higher the product is, so we are subtracting out a greater number at each iteration. Higher alpha will cause parameters to change quickly from iteration to iteration, and vice versa.
That's why we call it the learning rate.
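The update loop described above can be shown on a toy error function. This is a minimal sketch under an assumed example: E(w) = (w - 3)^2, whose minimum is at w = 3 and whose derivative is 2 * (w - 3).

```python
# Gradient descent on a toy error function E(w) = (w - 3)^2.
alpha = 0.01          # the learning rate
w = 0.0               # start from an arbitrary initial parameter value

for _ in range(2000):
    gradient = 2 * (w - 3)      # derivative of the error with respect to w
    w = w - alpha * gradient    # subtract out alpha times the gradient

print(w)  # after enough iterations, w settles near 3, where E is minimized
```

With a much larger alpha the steps are bigger and w moves faster, but too large a value can overshoot the minimum and diverge, which is why finding a sweet spot matters.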
We want to find a sweet spot for alpha, and it depends on the model we're training. But why exactly does the derivative of the error function come into this equation? I actually have a separate video to visualize this concept. It's only 3 minutes long, and I promise it's worth it. It should pop up in the top right, and I really recommend checking it out right now. It'll also pop up at the end in case you don't click it now. Okay,
that wraps up our four must-know concepts. If you're still watching this video, I think you're the kind of person that would love our ML community. Check out the link in the description to learn more about it and to grab the free LLMs course I created. It's over 10 hours long with 25 different modules and practice problems. Let me know if you have any questions, and I'll see you