Zero to Hero LLMs Course (8+ Hours)
By Dev G
Summary
## Key takeaways

- **Understand Transformers to Stand Out**: The key to standing out is to actually understand how these models work, down to the level of transformers. Anyone can import an API and build a simple project, but how many people actually understand transformers, the model behind ChatGPT? [00:24], [00:32]
- **Sentiment Analysis for Trading Bots**: Build a model that takes in a tweet and outputs a number between zero and one, zero representing negative emotion and one representing positive emotion. This has real-world applications like detecting emotion in tweets and news headlines so a crypto trading bot can decide whether to buy or sell a stock. [01:11], [01:28]
- **Convert Strings to Numbers First**: The first step in any NLP project is to convert strings into numbers, since models only work with numbers. Convert all the words in tweets into numbers so the model can process them, using a consistent mapping from words to integers. [02:07], [02:32]
- **Embeddings Encode Word Meaning**: The first step in any NLP model is to get embedding vectors for every word, where related words are closer together and unrelated words are farther apart. Use an embedding layer of size 16 so each word has a vector with 16 entries encoding its meaning. [10:19], [10:46]
- **Training Updates Model Parameters**: Training is finding the right values for model parameters like W1 through W3 and B, starting from random numbers and iteratively updating them until the model's predictions are accurate enough. Each W factors in how important each input is for the prediction. [18:20], [19:49]
- **Gradient Descent Minimizes Error**: Gradient descent iteratively minimizes the error function by updating parameters with the equation involving the learning rate alpha and the derivative. It's an approximation algorithm, essential for complex functions in ML models that can't be solved analytically. [34:33], [37:28]
Topics Covered
- Understand Transformers to Stand Out
- Convert Strings to Numbers for NLP
- Embeddings Encode Word Meanings
- Training Updates Model Parameters
- Gradient Descent Minimizes Error
Full Transcript
Let's go from zero to hero, from an empty resume to a portfolio full of projects, so you can land your dream offer. Everyone talks about escaping tutorial hell, getting ahead of the 99%, an unfair advantage. How do you actually do that? Well, in the last few years, I've gone from not knowing how to write a single line of code to getting offers from Amazon and Google. And in my opinion, the key to standing out is to actually understand how these models work, down to the level of transformers. Anyone can import an API and build a simple project. But how many people actually understand transformers, the model behind ChatGPT? Well, I've created hundreds of videos covering the fundamentals of AI, so this video contains all my best videos that will actually help you understand transformers. And along the way, you will build several projects for your portfolio. These are my best videos; I've cherry-picked them with very specific intentions in mind. So, I hope you enjoy, and I look forward to seeing you in the next clip. And as always, if you'd like to learn directly from me until you land your dream offer, click the link in the description.
All right, let's build a real-world AI project. Our goal is to build a model that takes in a tweet and outputs a number between zero and one. Zero represents negative emotion and one represents positive emotion. This project is called sentiment analysis, and it has real-world applications. Let's say you want to build a crypto trading bot, and you want your bot to take in tweets and news headlines. Well, we need to actually detect the emotion in those tweets and news headlines so the bot can decide whether to buy or sell a stock. And the unique part about this video is that you can actually follow along and run your code in your browser using this link. It's in the pinned comment. You don't have to install any dependencies; there's no setup. And the website will check to see if you've built your model correctly just by running your code. This video is divided into two parts. In the first part, we're going to build our dataset of Elon Musk tweets, and in the second part, we're going to actually create the model. If you don't have time to get through the whole video right now, try to at least get through the dataset creation. Let's get started with part one.

All right. So, the first step in any NLP, or natural language processing, project is to convert our strings into numbers. Models only work with numbers, so we've got to convert all the words in these tweets into numbers so the model can actually process them. Let's take a look at a quick example for this problem. You can see that the two inputs are positive and negative. Each of those is a list of strings, one with positive emotion and one with negative emotion. And we can take a look at example one here. Each of those two lists has just one string, so we have a simple example, and we have something that we might find on Twitter from Elon Musk: "Dogecoin to the moon" and "I will short Tesla today." Obviously one's positive and one's negative, but we can take a look at the output. For the first tweet we have some list of numbers, and for the second tweet we also have another list of numbers. But where do these numbers actually come from? We can take a look at the explanation. It says "lexicographically," which is just a fancy word for alphabetically. "Dogecoin" becomes one, just alphabetically. The word "I" becomes two, "Tesla" becomes three, and so on. And this doesn't actually matter, because the model is just going to interpret these as numbers. The important thing is that we are consistent. We have to have a consistent mapping, or encoding, from words to integers and integers to words. We can also see that the first sentence is four words and the second sentence is five words. So, just so that we don't have a jagged tensor or a jagged array, we're going to pad that first sentence with another zero at the end. And zero doesn't correspond to any word. It's just a dummy number for padding.
All right, so I'm going to jump into the code now. I recommend pausing the video here and trying to think about how you might do this, but I'm going to get started with the code. We want to develop a mapping from words to integers. I want to have some sort of data structure where I can just look up that "Dogecoin" is one, "I" is two, and so on. I want to encode this in some sort of mapping. So, the first step is going to be to get all of our unique words using a set. We'll start by combining both of our lists, so we're just dealing with one list, and you can simply add two lists in Python to concatenate them. So now we have one list. Next, I want to go through every word in every sentence. If I said something like `for sentence in combined`, that would give me an iterable over each sentence. But what we actually want is each word, and the way we can get the words from a sentence is by splitting on the space character. I'll have some sort of set called `words`, and let's go ahead and add every word to that set. So we actually need to define that set now. Okay, so we're almost there. We have all the unique words in a set, but I need to sort them now so that I can say the alphabetically first word is one, the alphabetically second word is two, and so on. We don't normally sort sets, so let's convert it to a list and then sort that. Now let's actually build that mapping we talked about earlier. It's going to be a dictionary, and I'm going to call it `word_to_int`; the keys are going to be words and the values are going to be ints. All we have to do now is go through our sorted list. We'll use the `enumerate` function from Python, so I'll say `enumerate` on this sorted list. In case you're not familiar with it, `i` is going to be the index and `c` is going to be the actual value of the list at index `i`. It's just an easier way to condense the code. All we want to do is store this in `word_to_int`. So we'll say `word_to_int` of, well, what's the actual word? That's `c`, the value at index `i` in this sorted list. And then what should the actual integer encoding be? It should just be `i + 1`, because `i` is obviously going to start at zero, but we said we want to leave zero for padding. So we're going to have everything start at one. You can see over here that "Dogecoin" goes to one, "I" goes to two, and so on. So we simply say `i + 1`, and our dictionary is complete.

Okay, we're almost done now. I just created this list called `unpadded`, and it's going to be very similar to the output list you can see on the left, except we won't have that zero padding. We'll take care of the padding last. But `unpadded` is going to be a list of lists, just like that output we can see on the left. All we have to do now is actually encode every word in every sentence. We're going to fetch something from the dictionary: `word_to_int` of that word is going to give us some number, and we want to append this to a growing list for every sentence. So we'll say the encoded version is some sort of list, and we'll append each number to `encoded` for every word in that sentence. Then once we're finished with that sentence, we can go ahead and append it to `unpadded`, so `unpadded.append(encoded)`. We can see that `unpadded` is a list of lists. All right, it might be tempting to just return `unpadded`, but we still need to actually do the padding.
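The walkthrough above can be sketched end to end. The two one-tweet lists follow the video's Dogecoin/Tesla example, and the variable names are assumptions (the site's template may differ):

```python
# Sketch of the encoding step described above. The input lists mirror
# the video's example; variable names are illustrative assumptions.
positive = ["Dogecoin to the moon"]
negative = ["I will short Tesla today"]

combined = positive + negative  # one list of all sentences

# Collect all unique words with a set, then sort alphabetically.
words = set()
for sentence in combined:
    for word in sentence.split(" "):
        words.add(word)
sorted_words = sorted(words)

# Map each word to i + 1 so that 0 stays reserved for padding.
word_to_int = {c: i + 1 for i, c in enumerate(sorted_words)}

# Encode every sentence as a list of integers.
unpadded = []
for sentence in combined:
    encoded = [word_to_int[word] for word in sentence.split(" ")]
    unpadded.append(encoded)

print(unpadded)  # [[1, 7, 6, 4], [2, 9, 5, 3, 8]]
```

Note that Python's default sort puts capitalized words first, which is why "Dogecoin", "I", and "Tesla" land on 1, 2, and 3, matching the example.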
Fortunately, we don't have to do that ourselves. There's a simple function from the PyTorch library that I'm going to import, and it's going to take care of that for us. You can see that I've added a couple of import statements at the top; that's just to import our `pad_sequence` function. And on line 25, we're actually going to call that function. We also have to pass in `batch_first=True`, because we want our output, at least for the example on the left, to be a 2-by-T tensor, where T is the number of words in the longest sentence. If you say `batch_first=False`, which is the default, it's actually going to be inverted: a tall matrix instead of this long, horizontal matrix. It's going to be transposed, and that would be T-by-2, which isn't what we want. So we just say `batch_first=True`. Okay, but there's one final issue, and then we'll actually be ready to run our code. If you try to run the code now, it's not going to work, because this PyTorch function expects everything passed in to also be a tensor. We can't have a list of Python lists; we want to pass in a list of tensors. So, on line 23, I've gone ahead and made that small change. We're just going to cast each Python list to a tensor before we append it to our giant list. And that should take care of everything. All right, let's go ahead and run the code. And we've passed the test cases.
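A minimal sketch of the padding step, using the encoded sentences from the example above:

```python
# Padding with PyTorch's pad_sequence, as described above.
import torch
from torch.nn.utils.rnn import pad_sequence

unpadded = [[1, 7, 6, 4], [2, 9, 5, 3, 8]]

# pad_sequence expects a list of tensors, not a list of Python lists.
tensors = [torch.tensor(seq) for seq in unpadded]

# batch_first=True gives a 2-by-T tensor (T = longest sentence length);
# shorter sentences are padded on the right with 0 by default.
padded = pad_sequence(tensors, batch_first=True)
print(padded)
# tensor([[1, 7, 6, 4, 0],
#         [2, 9, 5, 3, 8]])
```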
All right, let's move on to the actual model. There are going to be two parts to our code. We're going to fill in this class on the right, and we have one function called the init function, also known as the constructor. That's where we're going to define the actual model. If you look at a neural network diagram like this (and don't worry, not much background on neural networks is required for this), what defines this network is how many layers there are and how many nodes are within each layer. That's what we're going to define in this first function, the init function. There's one parameter passed in called vocabulary size, which is the number of different words the model should be able to recognize. We can take a look at the example there and see 170,000; that's about the number of words in the English language. So there's going to be that parameter there, and we'll explain how it ties into the model in a bit. And just to make sure we understand the inputs and outputs here, let's take a look at those two sentences passed in. We can see that those sentences are passed in in number format, so this is kind of a continuation of the first problem. We talked about how we're not going to pass strings into models; we're going to pass in sequences of numbers, where each word is represented as a number. The first string passed in, where we want to detect the emotion, is "The movie was okay." We can see that it's padded with a bunch of zeros, because the second sentence is way longer. The second sentence was "I don't think anyone should ever waste their money on this movie." We can see in the output that the model produced 0.5 for the first sentence, essentially saying it's a mix between positive and negative. "The movie was okay" is a very neutral statement. For the second, the model produced the number 0.1, which is much closer to zero than one, and that's obviously a very negative statement, to say that no one should waste their money on this movie. If we take a look at the description on the left, where it talks about the model architecture we're supposed to use, it says to use an embedding layer of size 16, compute the average of the embeddings to remove the time dimension, and finish with a single-neuron linear layer followed by a sigmoid. Kind of confusing, right? We're going to explain what all of that means, so don't worry about reading that for now. Let's get into it.
All right. So, the first step in any NLP model is to actually get the embedding vectors for every word. Here's an example of embeddings. We can see that words that are more related in terms of their meaning are closer together, and words that are not so related are farther apart. So, we want to have an embedding layer, meaning we simply want the model to have vectors that encode the meaning of every word. In this case, we only have two-dimensional vectors, but we can see in the problem that it says to use an embedding layer of size 16, so we want a vector with 16 entries. The larger this vector is, the more information the model can encode for each word. All right, it's actually pretty straightforward to define an embedding layer in PyTorch. We're simply going to say `self.embeddings = nn.Embedding(...)`. We're making an instance of this existing class in PyTorch called `Embedding`, which comes from the `nn`, or neural network, module, which has a ton of useful classes that we'll use later in this problem as well. The two things that we have to pass in are the vocab size (the number of words in the English language, which is the number of different words we want to store an embedding for) and the size of each embedding vector, which the problem description says is 16. The way we can think of this is to imagine a table where the number of rows is equal to the number of words in our embedding layer (vocab size, the number of words in the English language) and the number of columns is 16, because we have a vector of size 16 at every single row. Each row, or each vector, is essentially storing the embedding for that word. Then in the forward step (we haven't gotten to the forward step yet, so don't worry too much about it right now), when we're getting the model prediction for some sentence that's passed in, some sequence of words, we're going to fetch, or pluck out, the relevant rows for each word in our input sentence from that embedding table we talked about earlier. Okay, the next step is to define the linear layer for this model.
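As a quick sanity check on the row-lookup picture described above, here is a toy embedding layer. The vocab size of 10 and the word indices are made up for illustration:

```python
# A small embedding lookup: a 10-row table with 16 columns per row,
# matching the embedding size from the problem description.
import torch
import torch.nn as nn

embeddings = nn.Embedding(10, 16)  # (vocab size, embedding size)

# A batch of two "sentences" of five word indices each (2 x 5).
x = torch.tensor([[1, 7, 6, 4, 0],
                  [2, 9, 5, 3, 8]])

word_embeddings = embeddings(x)  # plucks out one 16-entry row per word
print(word_embeddings.shape)     # torch.Size([2, 5, 16])
```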
You might be familiar with linear regression. Here's an example of that equation when we have three input attributes, X1, X2, and X3. But this equation can also extend to any number of input attributes, like 16 different input attributes. We can have X1 through X16 passed into this equation, and even in that case we have one output number. The reason we're going to need this is that ultimately this model should output one number: the emotion found in the text. But at this point we have vectors of size 16 for every single word. So, we're going to need some sort of linear layer to project our dimension back down to one single number. And of course, we're going to need to define an instance of the sigmoid function pictured here. It's simply a nonlinear function whose outputs are between 0 and 1, which is exactly what we want for sentiment analysis. So, let's go ahead and define those two. First, we can define our linear layer. We'll simply say `self.linear = nn.Linear(...)`, and you have to pass in the number of input attributes, which is 16, and of course we only want one single number output there. We also want to define our sigmoid function; that's just `nn.Sigmoid()`. That's it for the constructor, or init function.
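Before moving on, here is a quick demonstration of the sigmoid's squashing behavior; the input values are arbitrary:

```python
# The sigmoid maps any real number into (0, 1), which is why it works
# as the final activation for a 0-to-1 sentiment score.
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()
s = sigmoid(torch.tensor([-10.0, 0.0, 10.0]))
print(s)  # roughly 0, exactly 0.5, and roughly 1
```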
All right, so now let's write the forward function, which simply passes our input data x through the neural network. The first thing we're going to do is pass x into `self.embeddings`, and that's going to give us the embeddings for every single word. Let's think about what the shape of `word_embeddings` is. Imagine for a second that x is 2 by 5, so let's add a comment here: x is 2 by 5, meaning we have two different sentences passed in and we want to get the model prediction for both. That's roughly in line with our input example on the left; we had two sentences there. Let's say five is the length of each sentence. I know the sentences have different lengths in that example on the left, but let's not worry about that for now. Let's say x is 2 by 5, so we have five words in each sentence. And remember, we said the dimension of each embedding vector is 16. So what is going to be the shape of `word_embeddings`? Well, assuming x is 2 by 5, `word_embeddings` is going to be 2 by 5 by 16, because we still have all the words from earlier, but for every single word, we no longer have just a single number like we did in the input on the left; we now have a vector of size 16. That's why it's going to be 2 by 5 by 16. But we're going to have to shrink this giant three-dimensional matrix down eventually to a single number for each input sentence; that's the goal of this model. So, we're going to have to make some simplifications. What we're going to do is assume that every word in the sentence matters equally. That's obviously a bad assumption, since some words are more important than others, but it gives us a simple model for now. Let's assume that every word's embedding vector matters equally. More advanced models, like the transformer, actually weigh each word differently using a concept called attention, but let's not worry about attention for this project. Okay, so to shrink and simplify this giant three-dimensional matrix, let's go down to a two-dimensional matrix by weighing each word equally and simply averaging all the embeddings.
All right, so I've gone ahead and written the code for that step. We're using the `torch.mean` function for averaging, passing in `word_embeddings` and specifically saying `dim=1`. The dim, or dimension, argument follows zero-indexing: the two here would be dim 0, the five here is dim 1, and the 16 here is dim 2. So what is going to be the shape of this tensor? What we did is get rid of that time dimension, the second dimension, which tells us how many words we have. In this dummy example, I said we had five words. And that's because we wanted to weigh each word equally and just average what's going on for every word into one straightforward vector. So we're going to have a 2 by 16 matrix now. And just in case that's not clear, what we did for each sentence is average those embedding vectors, each of size 16, across all five words in the sentence. So now for the first sentence we have one vector of size 16, and for the second sentence we have another vector of size 16. You can think of that vector of size 16 as the model's summary of all the important information in that sentence, encoded into one vector. But we still need to shrink this down even more. We want a single number for every single sentence. That's where the linear regression equation from earlier comes into the picture: we want to apply it to each of those 16-dimensional vectors. We'll call this our pre-sigmoid output, since technically it won't be the final output, because we haven't passed it into the sigmoid function yet. We simply say `self.linear`, the linear instance that we created earlier in the init function. And what do we want to pass into it? We want to pass in `average`. And what is going to be the shape of the result? It'll be 2 by 1 after the linear regression equation is applied. All right, the final step: we just want to pass that pre-sigmoid output into the sigmoid function, so that we get two numbers, one for each sentence, each between zero and one, and that's what we can return. Just to clarify, the shape is still going to be 2 by 1. All right, so we're ready to run our code against the test cases. We just need to round our answer to four decimal places to make the answer checking easier.
Let's go ahead and run the code, and we can see that it works. Okay, in this video we're going to go over three core AI concepts that you've got to know. It's going to be different from a lot of other intro-to-AI videos, as it doesn't require any machine learning experience. We're also not going to stray away from the technical details, so be sure to stick around. AI is becoming more and more important, even for software engineers, so make sure to master these three concepts.

Okay, the first concept is called training. People talk a lot about models training and learning, but what does that actually mean? Well, first we need to define what a model actually is. The absolute simplest kind of model is this equation right here. Let's say we're trying to predict how good someone is at beer pong. You might have played beer pong before. Let's say our model is going to predict someone's win accuracy at the game, and the accuracy is going to be predicted based on three inputs: X1, X2, and X3. Let's say X1 is your alcohol tolerance, the number of beers you can take before you start slurring your words. Let's say X2 is your general accuracy, a number on a scale of 1 to 10 that represents how accurate you are with each throw. And let's say X3 is your trash-talking effectiveness, another number on a scale of 1 to 10 that represents how good you are at trash talking, since getting in your opponent's head will obviously affect how good you are at the game, your chance of winning. That's what the model is predicting; that's what the value Y represents. But in the equation, we also have these other numbers: W1, W2, W3, and B. Those numbers are called the parameters, and that actually is the model. Those are the numbers that the model uses to make its prediction Y whenever we pass in X1, X2, and X3 for any arbitrary person.
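That equation can be written out as a tiny function. The weights and inputs below are made-up illustrative numbers, not learned values:

```python
# The beer pong model described above: y = w1*x1 + w2*x2 + w3*x3 + b.
# All numbers here are invented for illustration.
def predict_win_rate(x1, x2, x3, w1, w2, w3, b):
    return w1 * x1 + w2 * x2 + w3 * x3 + b

# x1: alcohol tolerance, x2: throw accuracy (1-10), x3: trash talk (1-10)
y = predict_win_rate(x1=4, x2=8, x3=6, w1=0.01, w2=0.08, w3=0.02, b=0.05)
print(round(y, 2))  # 0.85
```

Notice that accuracy (x2) carries the largest weight in this made-up setting, matching the intuition below that W2 should end up large.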
So then, what training is, is just finding the right values of those parameters, W1 through W3 plus B. It's about finding and updating those values until we're satisfied with the model's prediction Y, until we feel like it's accurate enough. When we initialize the model, we'll have totally random numbers for W1 through W3 as well as B. But over the course of training, this iterative process, we'll be adjusting those values, adjusting the values of the parameters, until we actually have an accurate enough model. Okay, but what do those W's actually represent? We have each W multiplied by an X. Let's look at W2, for example; it's multiplied by X2. Each W is just factoring in how important each input variable is for determining Y. And if that didn't make sense, let's go over an example. We have W2 multiplied by X2, so the higher X2 is and the higher W2 is, the higher Y is going to be. We know that X2 was someone's accuracy with each throw in beer pong, and obviously that's pretty important for determining Y: the higher the accuracy of their throws, the higher their probability of winning, their win rate, is going to be. So we're going to have the model learn, over the course of training, the right value for W2, because W2 is just factoring in how important X2 is. Let's say we initialize W2 to something like -2, just a random initial value. Well, the model is going to end up learning some much larger positive number for W2, because the higher X2 is, the higher Y should be. The model is going to have W2 measure, or encapsulate, how important X2 is for predicting Y. That's why we're multiplying them together. So that's all training is. Training is just the process of updating the model parameters, which are usually just numbers like W's and B. Those numbers start off totally random, but over the course of training, or learning, their values are updated until they actually make sense, so that no matter what X1, X2, and X3 are, the model can make a pretty solidly accurate prediction for what Y should be. So that's all for concept number one, training. When we get to concept number three, we're actually going to talk about gradient descent. It's this equation; it'll pop up on the screen soon. That's the actual equation used to update the values of the W's and B. But we don't need to worry about that for now. We'll get to that eventually.
that eventually. All right, on to concept number two. If you made it through concept number one, that's awesome. try to hang in for the rest of
awesome. try to hang in for the rest of the video cuz it's going to be a lot easier from here on out. Number two is linear regression. And the good news is
linear regression. And the good news is we already talked about it in concept number one. This equation that we were
number one. This equation that we were talking about, that's linear regression.
It's a simple unsexy model from statistics, but it's actually the foundation of AI. So, you've got to understand linear regression. The idea
is pretty simple. We're going to have some number of inputs, our X's, and we'll have these W's that we multiply against the X's. We add some number B, and that's how we get the model's prediction Y. And this can work for an
prediction Y. And this can work for an example where we have any number of inputs. In our case with beer pong, we
inputs. In our case with beer pong, we had three input attributes, but you could have two input attributes, five input attributes, and the equation would change accordingly. We would have more
change accordingly. We would have more W's. We would have five W's, W1 through
W's. We would have five W's, W1 through W5, if we had X1 through X5. So the
equation is flexible and it can change accordingly. So linear regression sounds
accordingly. So linear regression sounds awesome, right? But there is one issue
awesome, right? But there is one issue and it's that most data in the world isn't strictly linear. looking at this equation, it's not going to capture any nonlinear relationships like this one right here. Right? So that's where
right here. Right? So that's where neural networks come in. And that brings us to concept number three. Okay? So I
know I said we were going to finally talk about the equation for gradient descent, for training, in concept number three. But first, let's actually go over this diagram: neural networks. And I think this is going to be one of the simplest explanations you've ever seen.

Okay, so to make this neural network explanation much simpler to understand, we're going to explain it in terms of linear regression. Now, obviously, neural networks are far more powerful than linear regression. The relationships in the data that they can model are far different from what simple linear regression models can do. So I'm going to explain neural networks in terms of linear regression, but just keep in mind that they're not actually identical to linear regression, due to nonlinear functions like this one, which we'll talk about shortly. And just so anyone doesn't get triggered in the comments: I'm not oversimplifying things or dumbing anything down. We're just going to start off by explaining neural networks in terms of linear regression. Also, if you're actually still watching this video, you're not an NPC. I like you. Let's get into it.

So the premise is the same as before. We have X1, we have X2, and we have X3. Those are going to be in that first leftmost column, which we call the input layer, or the input column. So there are going to be those same three input attributes. And on the right side of the diagram, we're ultimately going to predict this one number O, or Y, whatever you want to call it. That's going to be someone's predicted chance of actually winning a game of beer pong. So it's the same x1, x2, x3 situation as before. What makes neural networks different is that we actually have this column of nodes in the middle, right? We
call that the hidden layer. And what we're going to do is simply more linear regression. We're going to let each of those nodes do linear regression and use the equation we talked about earlier, right? This equation where we have W1 through W3 plus B. We're going to have each of those nodes in the hidden layer use that equation. So then we're going to get four numbers, y1 through y4, all calculated based on the same x1, x2, and x3 from the input layer. The difference is that each of those four nodes is learning and updating its own set of parameters, its own set of w1, w2, w3, and b. So overall the model is doing a lot more calculation, which is going to make neural networks far more powerful than linear regression. Of course, there are nonlinearities, which we still have to talk about, but this is also a big part of neural networks: just the fact that we have more nodes doing more calculations and more computation as a whole. But we have four numbers right now, y1 through y4, and we need to get one final output number O. So this final node is also going to do some more calculation, but it's going to take in four numbers as input. It's going to use this equation right here. It's going to take in Y1 through Y4, and it's going to learn W1 through W4, as well as, again, another constant term B, to predict that final output number O.
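The hidden-layer arithmetic just described can be sketched in a few lines of Python. The weights and inputs below are made-up illustrative values, not trained parameters or numbers from the video:

```python
# Sketch: each hidden node runs its own linear regression on the same
# inputs x1, x2, x3; the output node then runs one more linear regression
# on y1..y4. All weights here are made-up values for illustration.

def linear(inputs, weights, b):
    # w1*x1 + w2*x2 + ... + b
    return sum(w * x for w, x in zip(weights, inputs)) + b

x = [0.5, 1.0, 2.0]  # the three input attributes

# four hidden nodes, each with its own w1..w3 and b
hidden_params = [
    ([0.1, 0.2, 0.3], 0.0),
    ([0.4, -0.1, 0.2], 0.1),
    ([-0.3, 0.5, 0.1], -0.2),
    ([0.2, 0.2, 0.2], 0.05),
]
ys = [linear(x, w, b) for w, b in hidden_params]  # y1..y4

# the output node takes y1..y4 and does linear regression one more time
o = linear(ys, [0.25, 0.25, 0.25, 0.25], 0.0)
print(o)
```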
Okay, so that's the gist of neural networks. We just need to talk about nonlinearities. So here is an example of a nonlinearity called the sigmoid function. And just bear with me, we're almost done. Everything's going to make sense. With the sigmoid function right here, we can see that the input could be anything, right? Any number. But the output on the y-axis will always be between 0 and 1. So this function is transforming any input to be between 0 and 1 in a nonlinear way.
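A minimal sketch of the sigmoid function:

```python
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
```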
And we want to use functions like the sigmoid function in our neural network just so the model can actually capture and learn nonlinear relationships, because most data in the real world isn't a simple linear relationship. So how exactly do we want to incorporate this sigmoid function into the neural network? Well, we want to incorporate it between layers. So after the input and hidden layers, we've done some linear regression, right? We've calculated those four numbers, Y1 through Y4. But before we pass those numbers into that final output layer, that final output node, we should actually pass Y1 through Y4 each into the sigmoid function to get four different outputs, all of them between 0 and 1, just because that's what the sigmoid function does. And those four numbers, the ones between 0 and 1, the outputs from the sigmoid function, that's what we're going to send into that final output node. And this might seem like a small difference, right? Like, okay, what's the big deal? All we've done is incorporate this one weird-looking curve into the function. But this is going to drastically change the power of the model. The model is now going to be able to pick up on way more complex relationships and nuances in the data. And if that didn't make sense, just leave a comment below.
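Putting the pieces together, here is a minimal sketch of the full forward pass, with the sigmoid applied between the hidden and output layers. All weights are again made-up illustrative values:

```python
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

def linear(inputs, weights, b):
    return sum(w * x for w, x in zip(weights, inputs)) + b

x = [0.5, 1.0, 2.0]  # the three input attributes

# four hidden nodes, each with its own made-up w1..w3 and b
hidden_params = [
    ([0.1, 0.2, 0.3], 0.0),
    ([0.4, -0.1, 0.2], 0.1),
    ([-0.3, 0.5, 0.1], -0.2),
    ([0.2, 0.2, 0.2], 0.05),
]
ys = [linear(x, w, b) for w, b in hidden_params]

# pass y1..y4 through the sigmoid BEFORE the final output node
activations = [sigmoid(y) for y in ys]  # each between 0 and 1

o = linear(activations, [0.25, 0.25, 0.25, 0.25], 0.0)
print(o)
```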
We'll talk about it more in another video.

Okay, the video is definitely getting a bit long now. So, just to wrap it up, let's come back to gradient descent. It's this equation right here, and it's the actual equation used for training, right? To actually update those W's and B at every iteration. The equation depends on the old value of W or B, as well as this other variable alpha, which we'll explain, as well as the derivative. So it might be kind of weird that derivatives from calculus are coming into the picture now, and I promise it won't be anything crazy complicated. And for gradient descent, this crucial ML algorithm, I actually have a separate video. It should pop up now. It'll be in the description. It'll be in the pinned comment. I put a ton of time into that video, animating exactly how gradient descent works. It's just around 3 or 4 minutes long, so definitely check it out. Now, I honestly would have just included the gradient descent explanation in this video, but to keep the video from getting too long, so it's actually digestible and not too overwhelming, there's a separate video linked in the description and comment. It should be popping up now, too. Go ahead and check it out.

Hey everyone,
it's Dev again. This video is a quick and interactive review of the math you need for ML. I can confidently say that this is the most concise review of the math you need for ML. But don't take it from me. Thousands of students in the GPT Learning Hub community have used this exact guide to kickstart their ML journey. Without further ado, let's get started.

It's actually a misconception that you need a crazy amount of math for ML. You can get started with a basic understanding of matrix multiplications and derivatives from calculus. That means you should understand how to find the derivative of functions like this.
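For instance, for f(x) = x², the power rule gives f'(x) = 2x, which you can sanity-check numerically (a sketch):

```python
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# the power rule says f'(3) = 2 * 3 = 6
print(numerical_derivative(f, 3))  # approximately 6
```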
But don't worry about functions like these. Also, you may have heard before that complex ML models just boil down to matrix multiplication. While this is a bit of an oversimplification, there's also plenty of truth to it. Consider a large language model. Here's how they generate text. The prompt is passed in, and the output is the word that's most likely to come next in the sequence. This word is concatenated to the input sequence and the next word is obtained.
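That generate-append-repeat loop can be sketched in a few lines. Here `predict_next` is a toy stand-in lookup table for illustration, not a real language model:

```python
# Toy sketch of the autocomplete loop. A real LLM would be a neural
# network; predict_next here is just a hypothetical lookup table.
bigram = {"the": "cat", "cat": "sat", "sat": "down", "down": "."}

def predict_next(sequence):
    # stand-in for the model: return the word most likely to come next
    return bigram[sequence[-1]]

sequence = ["the"]
for _ in range(3):
    next_word = predict_next(sequence)  # get the next word
    sequence = sequence + [next_word]   # concatenate it and repeat

print(" ".join(sequence))  # the cat sat down
```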
We can concatenate again, and the process repeats. LLMs are just large autocomplete models that we can iteratively invoke. But to actually understand the internals of this black box, we have to dive into matrix multiplications and some basic derivatives. Let's go over a few quiz questions that cover the essential math for ML. And then after you finish the quiz, I recommend checking out my step-by-step machine learning road map, which consists of over 25 video modules and practice problems. It's completely free, and you can grab it from the link in the description. Let's get started with the quiz.

Question one: given two matrices, what is the shape of the product? Understanding the dimensions of matrix multiplication is actually important for implementing models, since libraries like PyTorch require you to specify these dimensions when instantiating components of a model. We have a 4x6 matrix on the left and a 6x4 matrix on the right. The first rule to check is whether this matrix multiplication is possible. The number of columns in the first matrix needs to match the number of rows in the second matrix. If these dimensions are unequal, then the product is undefined. When we get to the question on how to actually multiply two matrices, this rule will make sense, but for now, let's just accept it. Okay, so the matrix multiplication is defined. The shape of the product is another simple rule.
Let's take the number of rows in the first matrix and the number of columns in the second matrix, and those will be the output dimensions. For the actual multiplication, we'll pair up every row in the first matrix with every column in the second matrix. The answer then is 4x4. This might be unsatisfying, so let's move on to the next question, which asks us to actually perform the multiplication. We have three columns in the first matrix and three rows in the second matrix. So the output is defined and will also be a 3x3 matrix. The only phrase you need to remember for matrix multiplication is "row, column." To get the top left entry, we multiply the first row with the first column. This means multiplying the corresponding numbers and adding them up. To get the other entries in the output, we keep following the row-column rule. First row, second column, and we get the second entry.
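The whole row-column procedure can be sketched in plain Python; the matrices below are made-up examples, not the ones on screen:

```python
def matmul(A, B):
    # columns of A must equal rows of B, or the product is undefined
    assert len(A[0]) == len(B), "inner dimensions must match"
    rows, cols, inner = len(A), len(B[0]), len(B)
    # entry (i, j) = row i of A paired with column j of B,
    # multiplying corresponding numbers and adding them up
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

A = [[1, 2, 3],
     [4, 5, 6]]    # 2x3
B = [[7, 8],
     [9, 10],
     [11, 12]]     # 3x2

print(matmul(A, B))  # a 2x2 result: [[58, 64], [139, 154]]
```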
First row, third column, and we get the third entry. Second row, first column, and we get this entry, and so on. This also explains why the number of columns in the first matrix needs to match the number of rows in the second matrix: the number of columns in a matrix is just the number of entries per row, and the number of rows is just the number of entries per column. This might seem abstract right now, but remembering the phrase "row, column" will be very helpful in understanding self-attention, the crux of transformers and large language models.

Question three: at what value of x does this function have a positive slope? Understanding basic calculus is important for gradient descent, the algorithm behind training. We can look at a line tangent to the function at three points: x < 0, x = 0, and x > 0. Only when x is greater than 0 does this line have a positive slope. So the answer must be x = 3. This might seem simple or obvious, but it's actually critical to visually understanding gradient descent, the first ML algorithm everyone should learn. That video is linked in the description if you're interested in watching it next. Lastly, question four:
How many derivatives does this function have? This is actually important when minimizing a function during training. To find the minimum of a function, we adjust the values of the inputs, in this case x, y, and z. In gradient descent, the derivative of the function with respect to each input is used to minimize the function. That's one derivative for each input, for a total of three derivatives. And that wraps up our short quiz.

Finally, I have a huge announcement: the beginner's blueprint is finally available. This is the exact study plan I wish I had when I was first getting started with machine learning.
Everyone told me to read papers, but I had no idea which papers to read. Once I figured that out, I had no idea how to understand them, dissect them, much less implement them and code up the main concepts. And lastly, the workday was a nightmare. I had no idea how to present these projects on my resume and actually land more interviews. The beginner's blueprint will solve all of these problems for you so that you can make progress faster than I did. Take it from one of our members from the IIT Madras class of 2025. I personally provided him with a road map, helping him get started with the implementation ASAP. Or another member, an NLP expert whose personal favorite resources are our ML programming questions, which are accessible for free. Lastly, someone like Chang. He's an ex-Yahoo AI/ML engineer, and he knows his stuff. It's been a blast making videos on this channel for the last year, and I'm excited to help even more of you with premium personalized instruction. Our launch sale is now active, and you can secure the entire blueprint for 50% off. Head to the link in the description to learn more, and I'll see you on the other side.

People talk a lot about training ML models, but what does this actually mean? There's often a training data set, and for simplicity, let's say our model predicts how tall someone will be based on their weight and height at age 10.
The data set could have thousands of people's weight and height at age 10, as well as their final height once they stop growing. During training, the model learns a relationship in the training data, and there's an algorithm called gradient descent used for this. Let's quickly go over it. Now that the NPCs have clicked off the video, let's get into it.

Gradient descent is used to minimize a function. Here's a simple function, y = x^2. In calculus, and don't worry if you don't remember this, you might have taken its derivative, 2x, set it equal to zero, and found that x = 0 is the minimum of the function. But for some functions, like the ones in machine learning models, it's just way too complicated to take the derivatives by hand. In some cases, it's even impossible. We need a different way to find the minimum. But why is minimizing a function even important for training models? It's because we want to minimize the error function: the error between the model's prediction for the final height and the true answer, for all the people in our data set.

Let's build some intuition for how we might approximate the minimum of a function with efficient iterations. Here we have y = x^2 again, and let's say our initial guess for the minimum is x = 3. We know the minimum is at x = 0. What if we look at the slope of the function at x = 3? This is the same as the derivative, or the gradient. It's a positive number, meaning the function is increasing at this point. But if we want to get to the minimum of the function, we want to step in the opposite direction. Let's say our new guess is the old guess minus the slope times some step size, which I'll call alpha. For simplicity, let's say alpha is 0.1, but in practice, we would use smaller values. Our new guess is then 3 - 0.6, which gives us 2.4. We got closer to the answer of x = 0. If we repeated this procedure for enough iterations, we would converge to the answer. What if our starting guess was -2? Here's the slope. The derivative is -4, and using the same formula, we get -2 + 0.4, for a new guess of -1.6. Again, we got closer to our answer, and we would repeat this process. Here's the algorithm in pseudo code. So now we have some intuition for why this process works. You can try implementing it here and running your code against the test cases at the link in the description.
For those who want a deeper understanding, there's actually a theorem that says if you're sitting at a point on a function and there's a bunch of directions you could travel in, the gradient gives the direction of greatest increase. So the negative of the gradient, which is exactly what we use, since we subtract out alpha times the gradient, gives us the direction of greatest decrease. Stepping in the direction of greatest decrease enough times gets us very close to the minimum of a function. So that's all it means to train a model: we're just iteratively minimizing the error function. What this error function actually is depends on the model we're training. And to learn more, I recommend going through this list, starting with implementing gradient descent. Leave a comment for future video suggestions, and I'll see you soon.
Hey, in this problem we're going to solve gradient descent; we're going to implement this super important algorithm. If you've ever heard of machine learning or neural networks, or just a lot of AI algorithms in general, they're trained with this algorithm called gradient descent. It can often be super annoying when people talk about, oh, the model is learning a relationship, oh, we trained it on this data set. At the lowest level, what that's actually talking about, just to make it super clear, is this algorithm called gradient descent.

In this problem, we're asked to minimize the function f(x) = x^2. So gradient descent, it's an algorithm for minimizing a function. It can be used to maximize a function too, but most of the time we use it to minimize a function. So what is the x that minimizes the function y = x^2? Clearly the answer is zero. But we have to implement an iterative approximation algorithm. Just taking a look at the graph right here, you can see that this function's lowest y-value is zero, and the x value that achieves that is x = 0, as you can see at the origin.

Why don't we take a look at some of the test cases for this problem? As input, we're going to be given the number of iterations for the algorithm to perform. So it seems like that might affect the answer. Then there's the learning rate; don't worry about that for now, we'll talk about it later. And then we have init, which is like an initial guess for which x minimizes the function. So let's take a look at some of the test cases.
In example one, we can see that if we were to run this algorithm for zero iterations with an init of five (don't worry about the learning rate for now), the output is five. That makes sense. It's kind of just a base case, if you want to think of it that way. In example two, we have the same learning rate again (we'll explain what that is later) and the same initial guess. The only thing we do is increase iterations to 10. And it looks like we get something that's less than five, a number that's closer to zero, around four. So maybe we're developing this intuition that as we perform more iterations, we should get a better and better answer. But what do we actually mean by a better answer? Well, we're about to go into that in more detail in a bit. But essentially, what it means is we're getting a better approximation. Gradient descent is really an algorithm that won't find the exact answer. The exact answer would be x = 0. But our goal with this algorithm is to approximate the answer. And obviously that might seem really silly for a simple function like x^2. But when we get to super complicated functions, like the functions inside self-driving cars, large language models like ChatGPT and Llama, and deepfakes, we're definitely going to need approximation algorithms. So maybe as we perform more iterations, we'll get closer to the answer. If we take a look at the graph over here, let's say we started off at x = 5, and over the course of 10 iterations, we know that we did make some progress in this direction, towards x = 0. We got a little closer to our answer. So now let's talk about why we actually care about minimization algorithms, right? Because we want to use gradient descent for minimization. So let's go over that.
Okay, so we're using gradient descent for minimization. And there are two main things to understand. The first is: why do we care about minimization? And the second is: why are we using gradient descent? Maybe I'll go over the second one first. In a calculus class — and it's not a huge deal if you haven't taken calculus, you don't really need to remember a ton of details — the main thing you need to know to understand these videos and this list of problems is just very basic stuff from calc 1. Like, you should know that the derivative of this function right here is 2x. We just bring the two down and the exponent changes to one. So we have to know some basic calculus for this kind of sequence of videos, but nothing crazy complicated. We're going to focus more on the actual concepts you need to know. So if you remember from calculus, one way you can find the x that minimizes this function is you take the derivative, so that's 2x, and you set it equal to zero. Clearly this gives us x = 0, right? So it might seem kind of silly that we're using this iterative algorithm that's only going to get us an approximation, not even the exact answer. But it turns out the functions can get crazy, crazy complicated. They could take in more than one thing, right? X, Y, Z. There could be a bunch of parameters or inputs to the function. They don't necessarily even need to output a single number. The output itself could be some sequence of numbers, some vector. And they can get even more complicated than this. And we can already tell that doing that math by hand — they call this analytically solving it; that term isn't super important — is sometimes going to be basically impossible. So we need an approximation algorithm. And so,
why do we want to minimize functions? Going back to the first thing: why do we even want to minimize functions? Why is this apparently so crucial for AI and machine learning and neural networks? We're going to go into more detail on what neural networks are later; let's just keep this black box of AI and ML in our heads right now. If you saw my previous video on what AI/ML is, it's essentially just about approximating functions, right? It's about making predictions. Given X, we want to predict Y, right? ChatGPT takes in some sequence of words and needs to predict the next word. For self-driving, we take in the surroundings, and the model inside the car needs to figure out if we should either brake or keep driving, right? Then your medical AI algorithms take in an image, like a scan of someone's brain, and need to figure out if they have a certain disease, right? It's all about taking in some input and making a prediction.

So why is minimizing a function even important for AI then? It turns out that when we're training these models, what that actually means is we're minimizing something: we're minimizing the error. So let's say we had a data set, right? Some set of data points. We want the model to pick up the relationship between these data points, and then we want to be able to feed in a new data point it's never seen before and get the right answer, or at least close to it. And to do that, at the simplest level, we need the model to minimize the error on the training data set. People will call this the loss or the cost, and those are, in my opinion, just overly fancy terms. It's just the error, essentially. So you might have some error function, and we're just going to keep this super high level for now. The error function might basically be the model's prediction minus the true answer that we have in our data set, and maybe we should take the absolute value of that. That's not super important for now, but essentially we often want to minimize the error function, right? So that the model is doing a better job of learning the relationship in our training data points. And then for this prediction part in this equation right here, you might plug in whatever function the model is currently using to get the prediction. And we want to make this function better and better over time. But essentially, we need to minimize this error. This error itself is a function, and we want to minimize it. And that's why gradient descent matters.
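That prediction-minus-true-answer idea can be sketched with a tiny made-up data set and a one-parameter model (everything below is illustrative, not from the video):

```python
# Sketch of an error function: the average of |prediction - true answer|
# over a data set. The data and the model's form are made up.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (input, true answer) pairs

def predict(x, w):
    # the model's current function; here a simple one-parameter model
    return w * x

def error(w):
    # average absolute error over the data set
    return sum(abs(predict(x, w) - y) for x, y in data) / len(data)

# training means finding the w that makes this error small
print(error(1.0))  # a poor choice of w gives a larger error
print(error(2.0))  # a better choice of w gives a smaller error
```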
Now let's jump into how it works, so then we can actually implement it.

All right. So everything we've said so far is cool and all, right? But we've just been treating gradient descent as a black box. It takes in a number of iterations to perform, this thing called the learning rate, and our initial guess. And somehow it outputs a better guess depending on how many iterations we do. So let's actually explain what's going on inside that black box. It turns out this whole training thing, this whole process of the model learning the relationship — gradient descent — is basically just a big for loop. So I'll write it out in pseudo code, and then we'll explain what it means conceptually and visually.

It might be something like: for number of iterations. This is essentially what we're going to do at every iteration. All we're going to do is take our current guess, right? In the first iteration that's just the initial guess, init, and it improves over time as we perform more iterations. And what we're going to do is essentially calculate the derivative. We're going to minimize the loss function, right? Or minimize some function. So what you want to do is take whatever function we're trying to minimize — why don't we just call it f for now? — and calculate f prime evaluated at the current guess. So send the current guess into f prime; this is going to be changing over the course of the algorithm. And this prime thing, that little prime symbol that I wrote above f, just means the derivative. You want to calculate the value of the derivative at whatever the current guess is, right? And then essentially, you want to update the current guess at every iteration of the algorithm. So you'll say — to make this super clear — current guess should be set equal to current guess minus alpha times that thing we calculated over here. Why don't we store it in a variable called d for now, d for derivative or something. You would do alpha * d, and alpha is actually that learning rate that we talked about earlier, and now we're going to explain how that works.
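The pseudo code maps to Python roughly like this. This is a sketch that assumes the problem's inputs are iterations, learning rate, and init, and that the function being minimized is f(x) = x², so f'(x) = 2x; the learning rate of 0.01 in the examples is an assumed value:

```python
def gradient_descent(iterations, learning_rate, init):
    current_guess = init
    for _ in range(iterations):
        d = 2 * current_guess  # f'(x) = 2x for f(x) = x^2
        current_guess = current_guess - learning_rate * d
    return current_guess

# matches the test cases described earlier (learning rate 0.01 assumed):
print(gradient_descent(0, 0.01, 5))   # 5 -- zero iterations, the base case
print(gradient_descent(10, 0.01, 5))  # around 4, a bit closer to zero
```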
This is essentially the pseudo code for gradient descent. Now let's explain
gradient descent. Now let's explain exactly how it works. So it's super intuitive. Okay. So I think we'll start
intuitive. Okay. So I think we'll start off visually and then kind of go back to focus more on the formula. So let's say we have our initial guess, right? Let's
say our initial guess is x= 5, right?
Well, the first thing in every loop, the algorithm tells us to calculate the value of the derivative. There's this fact from calculus — it's all right if you don't remember it — that one way to get the derivative at some point is to draw a line that's tangent to the function at that point, meaning it touches the function in exactly one place. My drawing is kind of terrible, but we just pretend it's straight. The slope of that line is essentially the value of the derivative at that point.

Then if we look at the next line of code after we calculate d, what it's doing is updating our guess. To update our guess, we take a negative step in the direction of that slope: whatever the slope is, we take the negative of it and step by some fraction, alpha. Alpha is going to be something between 0 and 1 — it might be 0.01, it might be 0.1. We're taking a fraction of d — that's why we multiply d by alpha — a fraction of the negative slope at that point, and we just move in that direction.

So let's look at what that would actually be doing visually for this example. The slope of the tangent line at x = 5 is positive, so stepping in the negative direction pushes our x back this way, toward zero. And that's exactly what we want: we want to get our x value closer to zero. So let's erase that guess for now.
Let's say our current guess was instead somewhere over here, like x = -2, and we take a look at the value of the derivative there, which is again just the slope of the tangent line. Well, the negative of a negative number is a positive number, so stepping by some fraction of that is still a positive step: we would actually be increasing our x value, moving closer to zero. Obviously one step isn't going to get us all the way there. So let's imagine some numbers. Say the slope of this line — we use m for slope, as in y = mx + b — is, just to keep the values super simple, -1; we know it has a negative slope. And let's say our learning rate is 0.1 for now. Typically people use 0.01 or 0.001, but things will still make sense with 0.1.

Now is probably a good time to specify more precisely what I mean by learning rate. The learning rate is what we call a hyperparameter for this algorithm: it's always specified up front, as an overall input to gradient descent. It already has the word rate in it — it controls how fast we change the current guess, how fast we take our steps toward the answer. Since it's between 0 and 1 and it's being multiplied by d, it scales d down: multiplying d by something less than one but still greater than zero leaves you with a fraction of it. The higher alpha is, the greater the fraction of d we're left with, and that's what we subtract from the current guess — so with a high alpha, a high learning rate, you change the current guess a lot every iteration. That's why we call it a rate: it's how fast we're getting to our answer. If you use a value of alpha that's too small, you need way too many iterations, and the runtime gets worse — and we want to minimize runtime. But if you use an alpha that's too big, you can tell from the formula that you might change the current guess by way too much in a single iteration. To give you an intuition of what might happen there: say we're currently at -2, and we accidentally overshoot all the way past our answer of x = 0, ending up way over on the other side for our next value. We increased x by too much and went all the way past zero. That's why using a value of alpha that's too high won't yield good results in practice.

Anyway, now that we have a better intuition for alpha, let's go back to the example. Our new guess should be the old guess, -2, minus 0.1 * d, and d was -1. Doing the math, that's -2 + 0.1 = -1.9. And we can see that -1.9 is greater than -2: we did take a small step in the right direction and got closer to our answer.
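That single update step, with the numbers from the example, can be checked in a couple of lines:

```python
alpha = 0.1          # learning rate from the example
current_guess = -2   # the old guess
d = -1               # slope of the tangent line, simplified to -1 in the example

current_guess = current_guess - alpha * d
print(current_guess)  # -1.9, a small step toward the answer at x = 0
```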
Okay, don't be worried about this picture — it might bring back some bad memories if you've taken multivariable calculus before. We're not going to do anything excessively mathy here. I just really want to explain why, at every iteration of the algorithm, our way of updating the current guess is to use the derivative. Why are we calculating this derivative and then using it as our update? It turns out there's a mathematical fact — not super important, but I feel I should mention it at least once — that says the gradient, or the derivative, points in the direction of greatest increase. We're not going to focus excessively on mathematical theorems in these videos; I want to focus on how to solve the problem. But I just want to make sure gradient descent is super clear, so we're not left wondering why this formula works the way it does. Why are we using the derivative? Why are we subtracting the learning rate times the derivative? The way you can think about it is that we're trying to minimize some overall function, and we need an approximation algorithm because we don't exactly know how to get there directly.
Imagine you're standing on top of a hill and you need to get to the bottom, but you can't see where the bottom is — you're blindfolded. All you can do is pick a direction and go downhill in that direction. You turn a little bit to the right, a little bit clockwise, and go down that way; turn a little more clockwise, go down that way. You have a lot of different choices of direction at every, quote unquote, iteration of this algorithm. But what if you just base it off of whatever is locally the steepest? What if you're greedy — you may have heard the term greedy a lot if you've been on the LeetCode grind — and just locally choose the best option: which direction goes steepest downhill? That should maybe get you there the fastest. Obviously, the direction that looks best from where you are right now might not actually be globally the best option — what if it starts going uphill later on? But locally, it seems like the best option. And let's say that every time you take a step, you reassess: you recalculate the derivative, check how steep things are now in each direction, and decide which direction to move in next. The idea behind gradient descent is that if you keep doing this for enough iterations — and obviously it might not be the most efficient path, because it might be super steep for a bit and then suddenly start going up again — then, on this imaginary hill or valley, it's mathematically proven for certain functions that you will end up really close to what you wanted: the overall minimum, the bottom of the hill or valley you were trying to get to.
So that's basically why we're using the gradient, the derivative: if it points in the direction of greatest increase, then the negative of the gradient must point in the direction of greatest decrease — and we were trying to minimize, not maximize, this function. This is all the information we have: we know how steep things are locally, around where we are. So let's just be greedy, go in the direction where the elevation is descending the fastest, and keep doing that — at every iteration, after every step, we re-evaluate and recalculate the value of the derivative. That'll get us there eventually. And just to visualize this for the more complicated loss functions, or error functions, that we're going to minimize: I might be standing right here, and the goal is to get to the global minimum, but I don't know what direction to go in. Do I go here? Up here? Down here? That's why we calculate the gradient. And don't worry, we won't be doing any super complicated derivatives for this function.
Thankfully, we have libraries like PyTorch that take care of that for us. But essentially, we calculate the gradient in step one of the algorithm; the direction of the gradient is the direction of greatest increase, and the negative gradient is the direction of greatest decrease. We keep stepping in that direction, scaled by alpha, over and over again, and we get to our answer.

Okay, so now let's jump into the code. We're just going to follow the pseudo code we talked about earlier — nothing complicated. Note that if we just tried to return zero, we wouldn't get the right answer, because we have to implement an iterative approximation algorithm: the test cases are not checking whether you return exactly zero. We can start with for i in range of the number of iterations — that's how many times we're going to update our guess. Then we calculate the current derivative: that's 2x if the function is y = x^2, so we just do 2 * x, where x is our current guess, which starts off as init. Then we update our current guess, which is stored in init: all we have to do is subtract out alpha times d, where alpha is just our learning rate. That's essentially the algorithm — we can go ahead and return init, and we're done. After rounding our answer to five decimal places, we can see that the code works.
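Putting those steps together, a sketch of the solution might look like the following. The function name and signature here are my assumptions about the problem's template, not necessarily the exact ones it uses:

```python
def get_minimizer(iterations: int, learning_rate: float, init: int) -> float:
    # Minimize y = x^2 with gradient descent; the derivative of x^2 is 2x.
    for _ in range(iterations):
        d = 2 * init                     # derivative at the current guess
        init = init - learning_rate * d  # gradient descent update step
    return round(init, 5)                # round the answer to five decimal places
```

With enough iterations and a reasonable learning rate, the returned value gets very close to 0, the true minimizer of x^2.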
So this is gradient descent. It's a great introduction and a great place to start if you're getting into machine learning and deep learning — it's actually the foundation of how neural networks are trained, since we are simply trying to minimize their loss. So stay tuned for the rest of the problems as we go ahead and build a language model.

In this video, we're going to explain linear regression. You might have heard of it before in a stats class or an intro AI class, and maybe the explanation didn't make a ton of sense. Here, we'll explain exactly what you actually need to know, skip over the unnecessary math proofs, and even cover what you'd need to know to code it up from scratch in Python.
Linear regression is actually really important: it's the foundation of neural networks (it's okay if you don't know what those are yet — we're going to get to that soon), and therefore the foundation of all the latest AI in deep learning, like ChatGPT, self-driving, and deepfakes. So it's definitely worth understanding. Let's break down what the term linear regression means, starting with the regression part.

Regression is just the opposite of something called classification. Say we wanted to build a model that predicts whether or not someone gets diabetes. There are only two possible outputs for this model: they either develop diabetes or they don't. Any time you have a fixed number of classes your input could belong to — say an image classification model that predicts whether the input image is a dog, a human, or a piece of food; that's still a fixed number of classes — we call that a classification model. All regression means is that there is not a fixed number of classes: the output is a number of some sort, it lives on a number line, and it could be literally any number, not one of a fixed set. So if you wanted to build an AI model that predicts how tall someone's going to be based on, say, their current weight, their current height, and how tall their parents are — whatever features we think are relevant — that output exists on a continuous scale. Sure, it's unrealistic for someone to be past some certain height, but in general the output is some number with some number of decimal places; it doesn't belong to a fixed set of two or three or five classes. That's all the regression part means.

Now let's get into what the linear part means, and this is probably the more important part. When we're building our model, our AI model is going to make some kind of prediction — that's what AI, and linear regression, is for. The linear part says something about the relationship between our data points and the actual answer. Going back to the height example, we had three pieces of information that might matter, and the corresponding output — the true answer for how tall someone ends up being. When we say linear, we're saying the function looks something like this: it takes in three things, x, y, and z — the three numbers we talked about earlier — and we just multiply each one by a weight: w1 * x + w2 * y + w3 * z. And we might add a constant called a bias, which you can think of as the b in y = mx + b. Linear regression is basically doing y = mx + b, but for as many input attributes as we want. So all the linear part really means is that we can't square anything, and we can't send x into a cosine or a logarithm — the only thing we're allowed to do is multiply our inputs by the w's and add constants. Over the course of something called training, which uses an algorithm called gradient descent and is when the model actually improves over some number of iterations, all we're doing at each iteration is adjusting what w1, w2, w3, and b are. Hopefully, at the end of some number of iterations, the model is a pretty decent way to predict how tall someone's going to be. For a new person that comes along, you can send in their x, their y, and their z — their weight, their current height, and how tall their parents ended up being — and hopefully the model gives a decent prediction for how tall that person ends up being.
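In code, the linear model itself is just a few multiplications and an addition. A minimal sketch — the weight values below are made up purely for illustration; training would learn the real ones:

```python
def predict_height(x, y, z, w1, w2, w3, b):
    # Linear regression: each input is multiplied by its weight, plus a bias.
    return w1 * x + w2 * y + w3 * z + b

# Made-up weights purely for illustration (weight=150, height=160, parent height=175).
print(predict_height(x=150, y=160, z=175, w1=0.5, w2=0.25, w3=0.5, b=10.0))  # 212.5
```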
You can think of w1, w2, and w3 as weights that factor in how important each input is: how important is x for actually predicting how tall someone's going to be? If w1 ends up being a bigger number at the end of training, the model is basically saying, hey, x is actually pretty important for figuring out how tall someone's going to be. Same for w2 with y, and for w3 with z. And b is an extra additive factor the model might sometimes need. For example, even after you factor in w1, w2, and w3, for every person there's some base height — no one is ever going to have a height of zero — so there has to be some base: b would be some number greater than zero that we always add in. The model learns the value of b over training as well. And this is all linear regression is.
Now let's go over the pseudo code for how this would actually work. Say we had some function for training a linear regression model and actually learning what w1, w2, w3, and b are. This is what it would look like. We loop for some number of iterations, decided beforehand — generally, the more iterations you do, the better your model gets, with some exceptions we can talk about later. At every iteration, based on your current w1, w2, w3, and b, you call something called get model prediction, which I'll leave as a black box for now. What this subroutine should do is: for every example in your data set — say you have n training examples, n people for whom you have those three pieces of information and the corresponding label, what their ultimate height ended up being — figure out what the model's current prediction is, based on your current weights and bias. Hopefully this prediction gets better as we do more iterations.

Then we do something called get error. If we have the model's prediction and the actual right answer, we should be able to get an estimate of the model's current error — and the hope is that the model has as low an error, or loss as it's sometimes called, as possible. The next step is something called get derivatives, and here's where it's really important to be familiar with gradient descent, which I have a video about on this channel — that's how we actually optimize our models to minimize that error over some number of iterations. This function is going to do a little bit of ugly math — don't worry, you rarely ever have to do it by hand as a machine learning engineer or data scientist, or even for side projects — some calculus, calculating some derivatives. And the final step is to update our weights: based on those derivatives we calculated, and the error of course, we change w1, w2, w3, and b to hopefully get a little better. As we do this over some number of iterations, the hope is that the error gets very close to zero. So that's the rough idea of linear regression — just one small thing left to explain.
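The loop described above can be sketched as a runnable toy. To be clear, this is my simplified sketch, not the course's actual code: the four subroutine names follow the video, but the signatures are made up, there's no bias term, and get_derivatives uses the standard mean-squared-error gradient:

```python
import numpy as np

def get_model_prediction(X, w):
    # X is N x 3; w holds [w1, w2, w3]. One prediction per row of X.
    return X @ w

def get_error(prediction, ground_truth):
    # Mean squared error over the N examples.
    return np.mean((prediction - ground_truth) ** 2)

def get_derivatives(X, prediction, ground_truth):
    # Gradient of the mean squared error with respect to the weights.
    n = len(ground_truth)
    return (2.0 / n) * X.T @ (prediction - ground_truth)

def train(X, ground_truth, iterations=1000, alpha=0.01):
    w = np.zeros(X.shape[1])       # start from some initial weights
    for _ in range(iterations):
        prediction = get_model_prediction(X, w)
        d = get_derivatives(X, prediction, ground_truth)
        w = w - alpha * d          # gradient descent update
    return w
```

On a toy data set generated from known weights, this loop recovers values close to them as the error approaches zero.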
When I said we'd be calling some function called get error at every iteration, I wanted to clarify what error we're using. We use something called mean squared error. The formula looks kind of ugly, but here's exactly what it's doing: for every one of our n training examples, we take the difference between the model's prediction and the true answer from our data set, square it, add them all up across every example, and divide by n — so we're just taking the average.

There are two things to understand here. First, why add them up and divide by n? If we take the average error, our error isn't dependent on the number of examples in the data set; it's a rough gauge of how poorly our model is doing at the moment — or, as it goes down, how well it's doing. Second, why are we squaring the error for every example? Why not do something simpler, like taking the absolute value of prediction i minus truth i for every example, and dividing by n? The absolute value of that difference would also give us a gauge of the model's error. The reason we don't is that, as you might remember from calculus, the absolute value function has some issues with its derivative. It's not super important to get deep into that for understanding machine learning, but that's essentially why we square instead: squaring gives us the same idea — whether the error was negative or positive, whether a prediction was less than or greater than the truth, all we really care about is the difference — so squaring gets rid of the negative, and it turns out to work better in practice than absolute value.
One final thing before you basically have all the information you need to code this up. When we actually implement this in code, we use vectors and matrices to do the multiplications and additions we talked about earlier. Say we have our weights in one vector, and the data point for one person — their x, y, and z — in another. If you take a dot product, which just means computing x * w1 + y * w2 + z * w3, that gives us exactly the model's prediction.

Now say we had a data set of a whole bunch of people, n people. That's an N by 3 matrix: the first row holds x1, y1, z1 — x, y, and z for person number one; the next row holds the second person's x2, y2, and z2; and it keeps going for all n people, all the way to the nth row (using one-indexing), xn, yn, zn. If you multiply that by a 3 by 1 — three rows, one column: w1, w2, w3 — what's the result? It's an N by 1: n rows, one column, holding the model's prediction for every single one of our n people. That's an efficient way to do it for n people without a loop — just one matrix multiplication. And it turns out a bunch of libraries have really efficient matrix multiplication algorithms, which is why we like to vectorize and put things into matrices whenever possible in machine learning. If you remember how to do matrix multiplication by hand — it's not super important — it's this row dot producted with this column, giving the first entry of the output vector; then the next person's row against the same column, giving the next entry; and so on. Doing this for all the people is exactly like calling get model prediction for each and every person, except a lot more efficient. And that's basically all you need to know for at least the forward pass of linear regression. So feel free to try out the code.
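That equivalence — one matrix multiplication versus a loop of per-person dot products — is easy to check. The numbers below are arbitrary:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],    # person 1: x1, y1, z1
              [4.0, 5.0, 6.0]])   # person 2: x2, y2, z2
w = np.array([0.5, 1.0, 2.0])     # w1, w2, w3 (arbitrary values)

# One matrix multiplication: (N x 3) @ (3,) -> N predictions at once.
vectorized = X @ w

# The same thing as a loop of dot products, one person at a time.
looped = np.array([np.dot(row, w) for row in X])

print(vectorized)                 # predictions: 8.5 for person 1, 19.0 for person 2
assert np.allclose(vectorized, looped)
```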
Now, in this problem, we're going to solve the forward pass of linear regression, which is made up of two different functions, or subroutines. In the next problem, we'll actually do the backward pass and train our linear regression model. I'd highly recommend checking out the video with the complete explanation of linear regression — it's linked on the problem, and it covers all the background you need; no prior knowledge is required to follow that video except gradient descent. But I'll give some background in this video too.

In this problem we have to implement two functions: get model prediction and get error. These are two core subroutines that compose the process of training a linear regression model. And when I say training a linear regression model, what that really means in this case is figuring out what w1, w2, and w3 should be so that our model does pretty well on our training data — and so that when we pass in a new data point the model has never seen before, its prediction is hopefully pretty close to the actual answer. So these are the two functions we have to implement, and they're a core part of the training loop. Specifically, in the training loop, for every iteration we would call get model prediction; then we'd want to calculate the error, or loss, so we'd call get error — and maybe every 100 iterations or so we'd actually print that error, though that's not super important for this problem. Then we'd call some subroutine like get derivatives, which calculates the derivatives we need to update our weights and perform gradient descent. Don't worry, you don't have to do those derivatives by hand as a machine learning engineer or data scientist, but it's good to know that get derivatives is being called under the hood by whatever library you're using. And lastly, you'd update your weights — your w's — and hopefully, with the next iteration, your guess is a bit better. So let's go back to the two functions we have to focus on for this problem: get model prediction and get error.
So, let's look at the two inputs to get model prediction. The first input is X. This is just the data set that the model will use to predict the output. We can think of this model as predicting, say, the price of Uber rides, and say there are three things that affect the price of a ride. The first might be the time of day. The next might be the total distance of the ride the driver has to take you, say in miles, because that affects how much gas they're using. And then maybe the duration of the trip, in minutes or hours — if there's more traffic, the duration doesn't necessarily match up with the distance. So maybe these are our three input attributes, and they're stored in our data set X. We can see that len(X) is N, and that every row has three columns: X is an N by 3 array where every row holds the three attributes for that data point.

The other input is simply another array, of size three, holding our initial (or current) weights for the model: w1 is how much we're weighing time of day in the model's prediction, w2 is how much we're weighing the distance, and w3 is how much we're weighing the duration. If we remember what the linear regression formula looks like, it's something like the following: calling the attributes x, y, and z, price as a function of (x, y, z) is w1 * x + w2 * y + w3 * z. Our goal is to figure out what w1, w2, and w3 should be — and here, we're implementing the forward pass of this model.

The other function we have to code up is get error. It takes in the model's prediction for each training example — if we have n training examples, the size of that array should be n: n different numbers the model spit out. And we have the corresponding labeled answers from our data set, stored in another array of the same size, called ground truth. So why don't we get into how we would actually code up get model prediction.
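Before the walkthrough, here is a compact NumPy sketch of the two subroutines. The signatures are my assumptions — the actual problem template may differ in names, rounding, or dtypes:

```python
import numpy as np

def get_model_prediction(X, weights):
    # X is an N x 3 array; weights holds [w1, w2, w3].
    # One matrix multiplication gives all N predictions at once; squeeze
    # flattens the result in case weights is passed as a 3 x 1 column.
    return np.squeeze(np.matmul(X, weights))

def get_error(model_prediction, ground_truth):
    # Mean squared error: average of squared differences over the N examples.
    return np.mean(np.square(model_prediction - ground_truth))
```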
Okay. So for get_model_prediction, we are given the data set X, an N by 3 array, as well as weights, which is just of size three. You can think of that as 3 by 1 if you want. And we know from linear regression that the function we're following is price(x, y, z) = W1 * x + W2 * y + W3 * z, with our three input parameters. We could actually think of this as a dot product. If you don't remember what dot products are, it's not a huge deal; it's essentially just two vectors multiplied against each other. So let's say in the first vector we had x, y, z, the three input attributes for some example, and then we wrote W1, W2, W3. The dot product from linear algebra is just a way of multiplying these two vectors: multiply x by W1, y by W2, z by W3, and add them up, so that we end up with exactly that formula.

So there's actually a way that we could do get_model_prediction for all n people in one giant matrix multiplication, one giant dot product, instead of having to iterate over all n people in a for loop and do this dot product one at a time. All we have to do is notice the format in which X is given, because we have the input information for each person. In the first row we have x1, y1, z1: the x, y, and z for the first person. For the second person, we have their information x2, y2, and z2. And if we go all the way down to the last row, we have xn, yn, and zn. Let's say we just multiply this by a 3 by 1, because X is an n by 3 matrix, and over here our weights W1, W2, W3 form a 3 by 1. And if you remember the rules of matrix multiplication (again, it's okay if you don't; this is probably one of the only times we're going to get into the nitty-gritty details of the matrices), the threes match up and cancel out, and the output is an n by 1. And an n by 1 gives us a number for each of our n people, which is the model prediction we wanted.
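To make those shapes concrete, here is a minimal sketch with made-up numbers (the data values are illustrative, not from the problem):

```python
import numpy as np

# Each row is one data point: (time_of_day, distance, duration) -- made-up values.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 1.0],
])                                   # shape (n, 3)
weights = np.array([1.0, 1.0, 1.0])  # W1, W2, W3 -- shape (3,)

# One matrix multiplication computes W1*x + W2*y + W3*z for every row at once.
predictions = np.matmul(X, weights)  # shape (n,)
print(predictions)  # [6. 3.]
```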
And if we look into how this matrix multiplication is done, it's every row of the first thing times every column of the second thing. So if we multiply this row by this column, then for the first entry in the output vector, that gives us exactly what we want. Then if we do the second row times the weight column vector, again, that also gives us exactly what we want. So for get_model_prediction, all we have to do is call numpy's matmul, and this will take in our two relevant matrices.

For get_error, we have two inputs to this function, prediction and truth, and both of these are essentially vectors of size n. If we look at the prediction for each of our n data points, it's some vector with n numbers in it, and truth is the same size, with n numbers in it as well. And the indices match up: the prediction for our zeroth training example is here, and the true value for our zeroth training example is here. We're going to use the mean squared error function. If we just write that out, mean squared error is a sum over all of our data points: (1/n) * sum over all i of (prediction_i - truth_i)^2. So we consider all n data points, take the prediction for the i-th one, subtract out the truth for the i-th one to get the difference between them, but don't just take the normal difference. Instead, we take the squared difference, and then divide the whole thing by n. So it's like you're averaging the square of the error for every example. And in the complete explanation of linear regression video, we go over why we use this error function specifically for linear regression. Maybe you want to take the absolute value; it turns out that doesn't work as well. Why are we squaring it? Why not raise it to the third power, or the fourth power? We'll explain why we use the mean squared error for linear regression in that video. So now that we know which error function to use, how are we going to actually implement this?
So it turns out that when we have our data stored in numpy arrays, which is what we have for this problem, we get to take advantage of a lot of functions in the numpy library. These functions are a lot faster than normal Python functions, because they call highly optimized C code, which is generally much faster than Python. We also get to take advantage of parallel computation whenever possible, even without a GPU; this can be done on a CPU as well in some cases. Overall, you want to use the numpy function for whatever you're trying to do instead of manually implementing it yourself. An example of that: if we just wanted to get the difference between every value in both of these arrays, instead of iterating over each index and subtracting, you can just do the first variable minus the second variable, and this will return the vector which has all the differences we want. This is actually going to be much faster if you were to check the runtime using the system clock. We also really want to take advantage of np.mean, which will get us the mean of a vector by summing the elements and dividing by n, as well as np.square, which will be useful for taking in a vector and returning a vector of the same size back, but with every element squared. So the main takeaway here is that mean squared error, once you understand what it's doing, is a pretty simple equation. But moving forward, you always want to use these data science or machine learning libraries, whether that's NumPy or PyTorch. You want to use their versions of these very simple operations because it will make a huge difference in the runtime of your code.
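As a quick sketch of those vectorized operations (the numbers are made up):

```python
import numpy as np

prediction = np.array([6.5, 2.0, 3.0])  # made-up model outputs
truth = np.array([6.0, 3.0, 3.0])       # made-up ground-truth labels

diff = prediction - truth   # elementwise difference, no Python for loop
squared = np.square(diff)   # same-size vector with every element squared
mse = np.mean(squared)      # sum of the elements divided by n
print(mse)  # ~0.41667
```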
Now, let's jump into the code. As we talked about earlier, get_model_prediction is really easy. All we're doing is calling the matrix multiplication function from the numpy library. So all we really need to do is say something like res = np.matmul. There are other functions you could use that do the same thing, but I think matmul is the most accurately named. You would just pass in X and then weights. And we actually want to round our answer to five decimal places; specifically, every element in this output vector should be rounded. So we can do np.round of res and then five, and this is going to be better than, say, iterating over the entire array and rounding each index.

Then for get_error, we know that we want to get the difference, and we don't want to iterate. So, just to make this super explicit, we might say that diff is model_prediction minus ground_truth. Then we know we can use np.square instead of iterating over every index, so squared should just be np.square of diff. The size of the array hasn't changed, but now we have the squared value at every index. And now all we want is the mean, and this is going to be faster than doing a for loop, summing up every element in squared, and then dividing by n; avoiding that for loop is where numpy operations confer a runtime advantage. So you can just say np.mean of squared, and this is essentially just the average. That's a single number, so we can just use the normal Python round function: return round of average to five. And we can see that it works. Two test cases for each of these is enough to verify that you've done the right thing, since this is a bit mathematically involved and it's really unlikely you would get the answer right with an incorrect implementation.
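Putting the two functions together, here is a minimal sketch of the solution described above (the class and method names follow the transcript and may differ slightly from the site's template):

```python
import numpy as np

class Solution:
    def get_model_prediction(self, X, weights):
        # (n x 3) @ (3,) -> one prediction per data point, rounded elementwise
        res = np.matmul(X, weights)
        return np.round(res, 5)

    def get_error(self, model_prediction, ground_truth):
        diff = model_prediction - ground_truth  # vectorized difference
        squared = np.square(diff)               # square every element
        average = np.mean(squared)              # mean squared error
        return round(average, 5)

s = Solution()
print(s.get_model_prediction(np.array([[1.0, 2.0, 3.0]]), np.array([1.0, 1.0, 1.0])))  # [6.]
print(s.get_error(np.array([1.0, 3.0]), np.array([1.0, 1.0])))  # 2.0
```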
That was the forward pass of linear regression. Next, we are going to train the model. And once we're done with that, we'll finally be ready to explain and code up neural networks. You'll actually see that these are just linear regression models stacked up together; that's why it was so important to understand linear regression. So definitely check out the next problem on training our model.

In this problem, we're going to solve linear regression training. Our job is to implement the train_model function, and we're given two functions, as it says in the problem description. We're given the get_model_prediction function for this model, which is supposed to be called at every iteration of the training loop. We are also given get_derivative, and the reason we're given get_derivative is to perform an algorithm called gradient descent. If you're not familiar with gradient descent, I would recommend checking out the easy problem in this problem list; there is a corresponding video solution that requires zero background knowledge about machine learning at all. The reason we need the get_derivative function is to update the weights of our model using the learning rate at every iteration of training.

Let's go ahead and check out the inputs we're given. The first thing is X. X is our data set for training the model. If we check out the shape, it will have length n, where n is the number of data points in our data set, and each data point has length three. We can see that over here. Checking out the example, we can see that we have two data points, and each of those two data points has three attributes: we have length three for this sublist and length three for this sublist. So that makes sense given the problem description and the constraints. Then we can check out y. y is supposed to be the correct answers for our data set, and we should have a label for every single one of our data points, so it makes sense that y's length is also n. Six is the answer for the first data point, and three is the answer for the second data point. We are then given the number of iterations to train for; essentially, this is the number of iterations to run gradient descent for.
We can already tell that whatever the solution is for this problem, it's going to involve some sort of significant for loop; that's the main part of the solution. Lastly, we have the initial weights for the model. Before any training is done, you can still get predictions from a model based on its current weights, and these are typically chosen completely at random in machine learning. We have the initial W1, W2, and W3 given here as 0.2, 0.1, and 0.6 respectively. And since they're given as input to this problem, we can start to develop the intuition that the initial weights do affect your final weights. Speaking of the final weights, that's actually what we have to return: the final weights after training, in the form of a numpy array with dimension three.

One last thing, to make some sense of this test case: if we check out the sample data points in X, we can actually guess what the weights should be. We have 1, 2, 3 corresponding to six, and then we have 1, 1, 1 corresponding to three. So we can guess that W1 = 1, W2 = 1, and W3 = 1; this is just adding up the three input attributes. 1 + 2 + 3 gets us six, and 1 + 1 + 1 would get us three. So it makes sense that after 10 iterations, our initial weights ended up much closer to the final answer: 0.5 is closer to one than 0.2, 0.59 is closer to one than 0.1, and 1.27 is closer to one than 0.6. This also continues to reinforce the intuition that gradient descent is just an approximation algorithm. We saw that for the first two weights, we didn't overshoot; we increased, but we were still less than one at 0.5 and 0.59. But for W3, we actually completely overshot and went all the way up to 1.27, even though we know W3 should probably be one. So we actually have to perform this for more iterations. The number of iterations you have to run gradient descent for totally depends on the use case: in some cases a thousand might be enough, in some cases you might need 100,000 iterations. But we keep developing this intuition that gradient descent is just an approximation algorithm, and the number of iterations you run it for will affect the performance of your model. So now let's jump into the explanation.
Now let's talk about the update rule. Gradient descent updates the weights of our model at every iteration, and the update rule, just as a reminder, is that the new W should equal the old W minus the derivative for that specific weight times alpha, our learning rate, which will always be given. So for W1: new W1 = old W1 - alpha * (derivative for W1). For an explanation of why we're using this equation, I would highly recommend checking out the gradient descent solution video. But if this is our update rule, then it's pretty clear that since we're given alpha and we're given the initial W's (that's actually an input to this problem), our main task at every iteration of this loop, the heavy lifting, is just to get the derivatives for each W. And because we have three weights for this problem, W1, W2, and W3, we're going to need to call the get_derivative function three times at every iteration of the algorithm so that we can get the individual derivative for each weight.

And if we check out the get_derivative function and what it takes in, two of the things it takes in are the current model prediction, which needs to be passed in, and our desired W, which might be W1, W2, or W3. We are also given a function called get_model_prediction. So if we need the model prediction to call get_derivative, then it makes sense that at every iteration, the first step should be to call get_model_prediction and store that somewhere. Then we're going to pass that into the get_derivative function, calling it three different times depending on our desired W. Lastly, we use the update rule to update our weights, and we repeat this for the number of iterations specified by the problem; then our training is complete. Okay, so now let's jump into the code. We know that this whole algorithm of training
is one giant loop. So we'll say for i in range(num_iterations), and we know that to update our weights and get the derivatives, we need our model prediction at every iteration based on our current weights, which I'll keep updating in the variable initial_weights. So we can do self.get_model_prediction and pass in X as well as whatever our current weights are. Then we're going to need to grab our three different derivatives. So we'll say that d1, the derivative for W1, is self.get_derivative, and let's pass in the model prediction. Let's pass in the ground truth, which is just the answers or labels for our data set, given in y. Let's pass in the length of X, as that is n, the number of data points. Let's pass in X. And then for our desired weight, we want the derivative for W1, but we can see that get_derivative is using zero indexing, so we'll actually need to pass in zero here. That is how we calculate d1, and we can do the same for d2 and d3.

Now we can actually update our weights based on the learning rate. So we can say initial_weights at index zero: we subtract out d1 times our learning rate, so that's self.learning_rate. And if you're wondering why the learning rate is not given as an input to this function but instead as a class-level hyperparameter, that is just a convention when training machine learning models, especially as we move into PyTorch. Then we can say initial_weights at index one, subtract out d2 times self.learning_rate, and finally the same for W3: at the second index, subtract d3 * self.learning_rate. And that's it. We just have to return the rounded version of our answer: np.round of initial_weights to five decimal places, and we're done. We can see that the code works, and one test case is enough to verify that your code didn't get the answer right by a fluke, given that there is some mathematical complexity going on here.
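The whole training loop described above can be sketched like this. On the site, get_derivative is given to you; the body shown here is my own assumption of the mean-squared-error derivative, and the learning rate value is illustrative:

```python
import numpy as np

class Solution:
    learning_rate = 0.01  # illustrative value; by convention a class-level hyperparameter

    def get_model_prediction(self, X, weights):
        return np.matmul(X, weights)

    # Assumed implementation: derivative of mean squared error w.r.t. one weight.
    def get_derivative(self, model_prediction, ground_truth, N, X, desired_weight):
        return 2 * np.dot(model_prediction - ground_truth, X[:, desired_weight]) / N

    def train_model(self, X, Y, num_iterations, initial_weights):
        for _ in range(num_iterations):
            pred = self.get_model_prediction(X, initial_weights)
            # One derivative per weight; get_derivative is zero-indexed.
            d1 = self.get_derivative(pred, Y, len(X), X, 0)
            d2 = self.get_derivative(pred, Y, len(X), X, 1)
            d3 = self.get_derivative(pred, Y, len(X), X, 2)
            # Gradient descent update rule: new_w = old_w - derivative * learning_rate
            initial_weights[0] -= d1 * self.learning_rate
            initial_weights[1] -= d2 * self.learning_rate
            initial_weights[2] -= d3 * self.learning_rate
        return np.round(initial_weights, 5)

s = Solution()
X = np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])
Y = np.array([6.0, 3.0])
print(s.train_model(X, Y, 10, np.array([0.2, 0.1, 0.6])))
```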
So now that we've got the basics down, the next problems will focus more on PyTorch and on training and explaining neural networks.

Hey, so in this video we're going to finally explain neural networks. This is definitely a buzzword that gets tossed around a lot in AI and ML, but they are really important.
They're probably the most powerful form of machine learning to date. They are what ChatGPT uses. They are used in self-driving, deepfakes, and image generators like DALL-E. Neural networks are a really important concept to understand. And I think a lot of resources online often really overcomplicate neural networks, because you've probably seen a drawing like this before: three nodes in the first column, then a bunch more over here, and then maybe a bunch more over here, and the idea is that they're all connected to each other like this. That actually took a while to draw out, but the idea is that all the nodes in every column are connected to all the nodes in the previous column, or layer, as they call it. But this is honestly an overly complicated diagram to understand. So let's start from a simpler example that we can still understand neural networks from. In the first layer, I'll draw three nodes, and we'll just do two over here. All the concepts are still going to apply even though this is a simpler neural network. So we'll just finish drawing this out. And essentially, what neural networks are is basically multiple instances of
linear regression. We've talked about linear regression before, and the main important concept behind linear regression is that we have some sort of output number, some sort of function. In one of our previous examples, we talked about predicting the price of an Uber based on the time of day, the distance in miles, and the duration. So maybe the price as a function of x, y, z is equal to W1 * x + W2 * y + W3 * z. And then we also might learn an additional bias or constant term, which we call B. When we talk about training this linear regression model, it's just about learning the W1, W2, W3, and B that make this function work pretty well for new data points it's never seen before.

So this is actually all a neural network does too, except we have way more than just W1, W2, and W3. The first layer of a neural network, the leftmost layer, is called the input layer, and each node in that layer will have a number which corresponds to an input attribute. So let's say the distance goes here, the trip duration goes here, and the time-of-day number goes here. Each of these neurons will have a number associated with it: there will be a number in this neuron, a number in this neuron, and a number in this neuron. And what does each node in the following layer do, this node here and this node here? Since each of those nodes is fully connected to each of the input attributes, each node is actually just doing its own linear regression. So this node right here will have its own W1, W2, W3, and B that has to be learned through training, which is just gradient descent: taking derivatives, and then using the derivatives and the learning rate to update the weights.
This node over here, completely independently of that neuron, will do its own linear regression. So this node will also have its own W1, W2, W3, and bias. Well, the point of the neural network was to use it to make a prediction; that's what all these AI models are about. But then you'll notice that we would have an output number in this neuron and an output number in this neuron. So how would I actually use this neural network to get a meaningful answer, if I wanted to use it as a way to predict the price of Ubers? Well, we might do something like average these two numbers: average the number in this neuron and the number in this neuron. So that might look something like this. That's all a neural network is. We take those two numbers in this output layer, a number in this node and a number in this node, and we just send them into some average function, and then the number that gets output by this function is our final prediction by the neural network. Then, based on that number and whatever the correct answer was for this data point, which consisted of three features or attributes, we calculate our loss or error. And then we can do something called backpropagation. You may have heard this term before; it's okay if you haven't. Backpropagation is just that process of calculating the derivatives, and then we're going to use those derivatives to update our weights through gradient descent. So this is actually our first neural network. So
then how does that tie into this ugly-looking diagram? Well, the same thing goes for the first input layer of the neural network: for whatever data set this neural network is being used on, there must be three input attributes, because we have three nodes in that first layer. Then each of these four neurons will learn a W1, a W2, a W3, and a B. The B, or bias, the constant term, is optional, but each neuron will learn what those parameters are through gradient descent. Then we'll notice that if each of these four nodes is doing its own linear regression, each of them is outputting a number. But each of these nodes in our final layer, the output layer, all five of them, is connected to each of those nodes in the second layer. What that means is that all the nodes in the final layer are doing their own linear regression based on how many features or attributes are in the previous layer, and we have four nodes in that second layer. So that means this node over here is going to have to learn a W1, a W2, a W3, a W4, and an optional constant term or bias. And this one will have its own four weights and a bias, and this one, and this one, and this one. And maybe for whatever data set this neural network is for, we're actually going to predict 1, 2, 3, 4, five output numbers. I don't know what data set this neural network is for, but we can certainly imagine a case where some model has to predict five different things. But if we only wanted to predict one thing, we would then maybe send this into some other function, like the average function, to get some sort of estimate of whatever number we're looking for. And that's all a neural network is: it's just linear regression stacked up vertically in each
layer. Okay. So, the only part really left to explain is how this would work in code. How would we implement this? For our input layer, we could represent our three attributes as a vector: the attributes x, y, and z for a single data point could be represented in a vector like this. And we know that each node in that second layer has three weights, because each of those nodes is fully connected to each of our input attributes. So if we have three weights for each of those three neurons, we need a 3 by 3 matrix here to encapsulate all those weights in a compact way. So let's draw that 3 by 3 matrix. The last thing we need to remember is just a fact from linear algebra; if you need a quick refresher, the way we do matrix multiplication is the row of the first thing times the column of the second thing. And we know that this is ultimately going to give us three numbers, because what we have here on the left has one row and three columns, so it's a 1 by 3. Multiplying that by something that's a 3 by 3, the threes cancel out and we're left with a 1 by 3. And that makes sense: we have three numbers, which is essentially the output number for each of those neurons. People sometimes call that the activation of the neuron, though that's not a super important term. So in the first column of this 3 by 3 matrix, we would have W1 for the first neuron; we'll call that W1,1. Then here we would have W2, still for the first neuron, and W3, still for the first neuron. Here we would have W1 for the second neuron, the middle neuron in the second layer, then W2 for the second neuron, and finally W3 for the second neuron. So in our code, we're going to have to maintain the state of this matrix. We're going to need to remember what all these weights
are. So we're going to need to maintain that matrix, because when we do get_model_prediction and we're doing all these matrix multiplications, we're going to need the state of this matrix. And since we're going to calculate a bunch of derivatives with respect to each of these weights in order to update them, we're also going to need the matrix for that. There's one class in a library called PyTorch, which we're definitely going to have a whole separate video on, and that class is called nn.Linear. It will actually keep track of the matrix under the hood for us, as well as all the derivatives. The only things we have to pass into the constructor of this class when we're making an instance of it are essentially the dimensions of this matrix. This class takes in something called in_features and something called out_features. The in_features for a given linear layer is just the number of features in the previous layer; we can see that in the previous layer we had three nodes, three features, so that's what we would specify there. The out_features is just the number of individual instances of linear regression: how many nodes or neurons you have in the layer that this object we're making is for. And here we can see we also have three neurons there, so out_features would be three. So to do this in PyTorch, all we would do is make an instance of this class nn.Linear and pass in three for in_features and three for out_features, and that would be our simple two-layer neural network that we have right here. One
final concept we have to talk about before our crash course in neural networks is finished and that is something that is completely orthogonal or different than linear regression and
it's it's literally the opposite. It's
something called a nonlinearity a nonlinearity and the most popular one is something called the sigmoid function which has this symbol and if you were to graph that function it essentially looks
something like this. So we have our x and y axis here and for every x input right like negative infinity is here positive infinity is here. The outputs
of this function are always between zero and one. So let's see on the y- axis we
and one. So let's see on the y- axis we have one over here and it looks something like this. It's kind of like that J-shaped curve. It's kind of asmtoically approaching one here and
asmtoically approaching zero over here.
This is kind of a terrible drawing over here but pretend that it's going down.
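As a quick sanity check of that drawing, here's a minimal sketch in PyTorch (torch.sigmoid is one of a few equivalent ways to compute it; the input values are just illustrative):

```python
import torch

# Sigmoid squashes any real number into the range (0, 1).
x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])
y = torch.sigmoid(x)

print(y)
# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of exactly 0 maps to 0.5.
```

Printing y shows the outputs climbing from near zero toward one as x grows, exactly the S-shaped curve described above.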
So the outputs of this function are always between zero and one, and it has this nonlinear nature to it. And it can actually be mathematically proven that when we have those neural networks we talked about previously, where all we have is these linear matrix multiplications, these linear connections over here, there's only so complex a relationship that they can learn. There comes a point where the neural network isn't powerful enough anymore to capture and learn really complex relationships.
For example, for a neural network to understand negation in speech, if you were inputting sentences into it. So we have to add these nonlinearities to the neural network to make it more expressive and to let us apply neural networks to more and more problems, and one way to do that is something called a sigmoid layer. There is a class called nn.Sigmoid which you can make an instance of in PyTorch; that's how we would do it in code.
And just to be clear about how we would do that in a diagram, it's literally right here. You have your neural network, right? There are two layers: the input layer, and one hidden layer, as they call it, with these three neurons right here: this neuron, this neuron, and this neuron. So what we would actually do is draw the sigmoid symbol over here and put it in its own little circle. Then we connect each of the previous layer's nodes to this sigmoid, because each of those three neurons in that second layer has a number, an output (people call it an activation) associated with it, right? And for each of those, we pass it into the sigmoid function to get something that's between zero and one; the higher the number, the closer it'll be to one.
So, scrolling back down here, the main thing we really need to talk about now is why you would apply the sigmoid function to a neural network. Not only does it allow the network to learn a more complex relationship, it also allows you to apply ML and neural networks to more kinds of problems. Previously, we've just been talking about regression problems, where the output of the neural network is some number, like the price of an Uber or how tall someone's going to be, a number that exists on a spectrum with no fixed set of values. But what if we are doing something called classification? Let's say we need to build a neural network to predict whether or not someone will develop diabetes. You either develop diabetes or you don't; there are two classes. And maybe our neural network actually needs to output a probability that the patient inputted to the network will develop diabetes. If our network is going to output a probability, it needs to output something between zero and one. Well, the sigmoid function is perfect for that, right? Because we can stick a sigmoid function at the end of our neural network right here, and let's say that before we pass into the sigmoid function, we average this number, this number, and this number.
Then we pass that number into the sigmoid function. Well, our neural network is now outputting one single number, right? And after we've done enough iterations of gradient descent and training for these weights to actually make sense, for them to not just be random numbers, we can interpret that output number as the probability that the input person is going to develop diabetes. So now we're actually able to do classification problems, right? We can classify an input into, say, two classes, or three classes once we talk about more complex functions than just the sigmoid. The sigmoid function would just be used for binary classification, because the output is just between zero and one, but we might be doing ternary classification if we wanted to build a neural network that can classify an image as, say, a dog, a bird, or a human. But this is just one example of how we can extend neural networks by adding this nonlinearity over here: using it for binary classification, for, say, diabetes prediction. And the three features of the patient, like three markers of their health, three facts about their blood work, might be one of them over here, one of them over here, and one of them over here. And of course, one last clarification: we can have as many features as we want in the input layer over here. We can have way more than three nodes over there if we need to; we just have three for this case.
So that's an introduction to neural networks. Before you start coding neural networks up, it's important to be familiar with the basics, so let's go over a few multiple choice quiz questions. Then at the end, there are even more practice problems that you can try. And if you need a quick refresher on the basics of neural networks before we go through the quiz, check out the second link in the description. Let's get started.
Question one. You may have heard that GPT-3 has over a hundred billion parameters across its many sub-networks. How many parameters does this simple model have?
Remember, a parameter is either a weight or a bias. If you want to try calculating the answer on your own, I recommend pausing here. Okay, here's the explanation. In this equation, there are three weights and one bias, or four parameters. And let's go layer by layer. The first layer is the input layer, which doesn't contain any parameters; each node simply contains one of the input attributes X1, X2, and X3. The next layer is where the parameters begin. Since each hidden node is connected to each of the three input nodes, each hidden node uses this equation to calculate a number Y. That's four parameters per equation, and we have four nodes, so that gives us 16 parameters in this layer. Finally, the output layer. Each output node is connected to each of the four hidden nodes, so each output node uses this equation to calculate a number O. There are five parameters in this equation, and we have two nodes, so that gives us 10 more parameters. This neural network has 26 parameters in total.
Question two. Let's say we want to create a neural network that could predict the probability that the next word in a live stream should be censored. Also, let's say that the network should factor in the previous three words. Here's option one. Here's option two. Here's option three. And here's option four. The answer is option four. The internals of the network actually don't matter for this question. We want a network with three input nodes, one for each word, and a single output node, since the network outputs one final number in the form of a probability.
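Here's a sketch of what such a censoring network could look like in PyTorch. The hidden size of 8 is an arbitrary choice of mine; the question only fixes three inputs and one output:

```python
import torch
import torch.nn as nn

# Three inputs (one number per previous word), one output probability.
model = nn.Sequential(
    nn.Linear(3, 8),   # hidden layer: arbitrary size, here 8
    nn.Sigmoid(),
    nn.Linear(8, 1),
    nn.Sigmoid(),      # final sigmoid so the output lands in (0, 1)
)

words = torch.randn(1, 3)   # a batch of one data point
prob = model(words)
print(prob.shape)           # torch.Size([1, 1])
```

The internals could be anything; the fixed parts are the three input features and the single sigmoid-squashed output.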
Question three. Let's say we want to store the weights of a layer inside a matrix, and let's consider this equation. The input data point can be represented by this vector, and the weight matrix can be represented by this matrix. The product of the weights and the input vector, plus the bias (which could simply be stored in a separate variable), gives us the output we want. So how many matrices, and of what shapes, would be needed to store all the weights for this model? And let's not worry about biases. Option one: a 4x3 matrix and a 2x4 matrix. Option two: a 4x2 matrix and a 4x4 matrix. Option three: a 3x2 matrix and a 2x4 matrix. Option four: a 3x3 matrix and a 4x4 matrix. If you want to try calculating the answer on your own, I recommend pausing here.
Okay, we'll go layer by layer. Again, the input layer doesn't contain any weights, so we can move on to the hidden layer. The hidden layer has four nodes, and each node stores a W1, W2, and W3; that's three weights. So our first matrix should be a 4x3 matrix. Assuming the input vector is a 3x1 vector, this allows the matrix multiplication to work out, and we would end up with a 4x1 vector, which is exactly what we want. Each entry in that vector corresponds to a y-value in a hidden node. In terms of the actual multiplication, we would multiply this row by this column to get this value, then this row by this column to get this value, and so on. Finally, the output layer. We have two output nodes, and each stores four weights, W1 through W4, so we would need a 2x4 matrix. That means that option one is the correct answer.
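We can double-check both those shapes and the 26-parameter count from question one with nn.Linear, which stores its weight matrix with shape (out_features, in_features):

```python
import torch.nn as nn

hidden = nn.Linear(3, 4)   # 3 inputs -> 4 hidden nodes
output = nn.Linear(4, 2)   # 4 hidden nodes -> 2 outputs

print(hidden.weight.shape)  # torch.Size([4, 3])
print(output.weight.shape)  # torch.Size([2, 4])

# Weights plus biases: (4*3 + 4) + (2*4 + 2) = 26 parameters.
total = sum(p.numel()
            for p in list(hidden.parameters()) + list(output.parameters()))
print(total)  # 26
```

So the 4x3 and 2x4 matrices from option one fall straight out of the layer definitions, and counting biases too recovers the 26 parameters from question one.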
If you have any questions about this concept or are interested in a short video covering the math you need for ML, be sure to leave a comment since it's definitely important to understand.
Question four. Between layers, we often plug values from the neural network into nonlinearities like the sigmoid. Its outputs are always between 0 and 1, and the greater the input, the closer the output is to one. There are multiple benefits to using the sigmoid and other nonlinearities like the tanh function. But let's say we had a neural network for predicting whether someone develops diabetes, and we added the sigmoid after the final layer. What would be the purpose of this? Option one: perform regression instead of classification. Option two: ensure the model outputs a probability. Option three: simplify derivatives for gradient descent. And option four: improve runtime, since sigmoid calculations are actually easier on GPUs. The answer is option two. Since the output is between 0 and 1, we can call it a probability. Option one is incorrect: we want to classify every input person as diabetic or non-diabetic, not perform regression. An example of regression would be a model that predicts how tall someone will be after they're finished growing. Option three is incorrect as well; this doesn't simplify derivatives for gradient descent, it actually makes them more complex. And option four is just irrelevant here.
And that concludes our neural networks quiz. If you've made it to the end of the video, then I think you're the right fit for our ML community. But first, I would recommend checking out some more practice problems, which you can grab from the link in the description. I've also created a full course on LLMs with over 25 concise modules. It will always be free, and you can secure it using the link in the description. I hope you found this video useful, and I'll see you soon.
Okay, this video is going to be an introduction to PyTorch. It will assume a little bit of background knowledge on machine learning, things like what a neural network is, but you can check out some other videos on this channel for that. If you don't have any PyTorch experience, we're going to start completely from scratch in this video. I won't lie and say that this video covers everything you need to know about PyTorch, but it will give you a pretty solid starting point. So let's just get into it. First thing I want to say is that these two import statements right here are incredibly powerful. As a data scientist or machine learning engineer, or even just for side projects, PyTorch is almost the only library you'll need. Sometimes you might use pandas for loading in certain data sets, but you can pretty much get away with exclusively using PyTorch, so learning this library is extremely high ROI. And the fundamental concept behind this library is the idea of a tensor.
You might have heard of something called TensorFlow. PyTorch is kind of like the industry standard now, especially in research, and it was actually used for ChatGPT, so TensorFlow isn't used that much anymore. But the fundamental data type in PyTorch is still something called a tensor. And a tensor is kind of just like a matrix or an array. We can have one-dimensional tensors, two-dimensional tensors, three-dimensional tensors; we can have, say, a 3x10x20 tensor, which you can think of as three different 10x20 matrices. And these store any kind of data we wish, usually integers or floating-point numbers. But tensors are actually more than just a matrix or an array: they carry derivatives under the hood. So thankfully, you never have to worry about doing derivatives by hand with PyTorch. That's the amazing thing: you don't have to be bogged down doing derivatives or linear algebra or matrix multiplications by hand, because this library, this goated library, and these two import statements take care of the ugly math for you. As a result, this main data type in PyTorch carries various derivative attributes with it that you'll rarely, if ever, need to access or look at directly. But we should know that this tensor data type carries other information and other properties that are used to calculate derivatives, inside a directed acyclic graph that PyTorch maintains under the hood. So we should know that tensors are more than just matrices or arrays, even though we're going to abstract them away as essentially such.
So why don't we create our first tensor? We can just say that a equals torch dot, followed by one of PyTorch's tensor-creation functions. These can take in a variable number of arguments, depending on what we want the dimensions of our tensor to be. So if we wanted a 5x5 tensor, we would pass in two arguments. And if we run this, we can actually go ahead and see that tensor.
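The clip doesn't name the exact constructor used, so here's one way to reproduce this, assuming torch.ones (which matches the row sums of five that come up a bit later):

```python
import torch

# A 5x5 tensor filled with ones; the two arguments are the dimensions.
a = torch.ones(5, 5)
print(a)

# Summing along axis 1 collapses the columns, giving one sum per row.
row_sums = torch.sum(a, axis=1)
print(row_sums)  # each row of five ones sums to 5
```

Any of the creation functions (torch.ones, torch.zeros, torch.rand, ...) follow the same pattern: pass the dimensions you want as arguments.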
Just go ahead and print it, and we can see that tensor. And here's our 5x5 tensor. We could also access other properties, like the gradients or the derivatives for this tensor, although since we haven't done anything yet, there wouldn't actually be any information in them. But we should know that that information is there, stored by PyTorch under the hood. So now that we've created our first tensor, why don't we jump into some of the most important functions in PyTorch that involve tensors.
Okay, so two of the first most important functions in PyTorch that we'll deal with are the sum function and the mean function. We might say something like sum equals torch.sum, and it will have two arguments: the tensor, and the axis or dimension along which we want to sum. So we might say something like a, and then an axis, because we have a two-dimensional tensor here; the axis will either be axis equals 0 or axis equals 1. This essentially specifies whether we want to sum along the rows or the columns. However, here's one important and maybe unintuitive thing. Say we wanted to get the sum of every row; we have three rows in this tensor right here. You might think that axis equals 0 corresponds to rows and axis equals 1 corresponds to columns, so that to get the sum of every row, you would say axis equals 0. It's actually the opposite in PyTorch. I'm not sure why this was done by the creators of PyTorch, but it is an important thing to know, as you'll be specifying axes all the time when working with PyTorch. To get the sum of every row, what we actually want to do is go across the columns, and that's why we say axis equals 1. And if we go ahead and print sum, we should actually get the sum of every row, and we can see the sum of every row is five. So this is definitely an important concept in PyTorch, the concept of axes.
Another important concept in PyTorch is the idea of squeeze and unsqueeze. These are two functions that are used all the time, and they come up when we have kind of unnecessary dimensions. So we can just erase this for now and say the tensor we care about right now is a 5x1 tensor: five rows and one column. And if you were to print a.shape, you would get a tuple which has five in the zeroth index and one in the first index. But that one is kind of unnecessary, right? The whole point is that if we print a, it's just of size five, of length five. Saying five by one versus saying, oh, it's just of size five, is kind of carrying the same information, but when we're using other functions later on in PyTorch, some of those functions are really particular about whether we're talking about a 5x1 or just something that is of size five. Sometimes you might see this notation in PyTorch, the trailing comma, which indicates that it's not 5x1, it's just of size five.
So if you ever want to get rid of the one, there is something called squeeze in PyTorch; it squeezes out any unnecessary dimensions. So if we were to say print a.shape, and then say squeezed equals torch.squeeze of a, and then print squeezed.shape, we will actually see a difference: the one has now disappeared. And although this might seem like an extremely small change, it will actually make a difference for various functions that we use later on. Just to make this really clear, let's print a when it's 5x1, and let's print squeezed.
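Here's that comparison as a runnable sketch, extended with the unsqueeze round trip discussed next (dim=1 is where the extra dimension gets re-inserted):

```python
import torch

a = torch.ones(5, 1)           # five rows, one column
print(a.shape)                 # torch.Size([5, 1])

squeezed = torch.squeeze(a)    # drop the size-1 dimension
print(squeezed.shape)          # torch.Size([5])

unsqueezed = torch.unsqueeze(squeezed, dim=1)  # put it back at index 1
print(unsqueezed.shape)        # torch.Size([5, 1])
```

Squeeze and unsqueeze are exact inverses here: the round trip takes us from 5x1 to size five and back.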
We can actually see that there is a difference right here. The first tensor has five rows and one column, while the second one you can simply think of as a vector of size five; we kind of erased a little bit of information about its dimensionality. So squeezing is important. And if squeezing is a thing, unsqueezing should also be a thing, right? Because sometimes we're passing into a function later on that expects two tensors, and we want things to be consistent between those two input tensors; we can't have one tensor passed into this hypothetical function that has this extra one, and one that doesn't. So unsqueezing is something that is done all the time, and it's a good function to be familiar with. We might say something like unsqueezed equals torch.unsqueeze, and the first thing you have to pass in is what you actually want to unsqueeze, so that would be squeezed. But then you also have to specify dim, and dim is just like axis; in this case, it is essentially where you want to insert that extra dimension. So I can say, well, it's currently just of size five, and to make it five by one again, I would have to say dim equals 1. Why don't we print the shape? That's probably a good practice. So let's print squeezed.shape; that's how you get the shape of any tensor in PyTorch. It's just going to be essentially just a five. But we want to make it five by one, so we want there to be a comma after this five and then a one; that's the one index in this iterable, this data structure that is returned by shape. If we just get the zeroth index of the shape, we see that it's five; if we tried index one, it would be out of bounds. So to unsqueeze, we simply want to add this one, to make it 5x1, at the one index of the shape. And now, if we print unsqueezed.shape, we can see that we have it as a 5x1, so it's very important to be familiar with squeezing and unsqueezing. How do
we actually define neural network models in PyTorch? Because that's what this library is all about: defining models that are, specifically, neural networks. So we have this rough idea of a model, right? It should be some sort of class, let's just call it MyModel for now, and maybe there is a constructor where we initialize the various objects that are going to compose this model; these might be the layers of a neural network, as we previously talked about. And there's one core method that is really important for models, and it's always called the forward method. It's really similar to something called get_model_prediction (if you're not familiar with my previous videos, that's okay): get_model_prediction, just going by the name, is essentially a method where you send in an example data point, and the model uses whatever weights and biases it currently has, after some number of iterations of training, and returns the model's prediction. So for a model object, the main ideas are a constructor and a forward method. And to make this whole process of defining neural network models much easier, PyTorch actually has a base class, which we can view over here. This is an incredibly important concept within PyTorch: the idea of a module. A module is basically the same thing as a model, and all neural network models that you define in PyTorch are going to inherit from, or subclass, a class called nn.Module.
So if you want to define your own model class, you just specify that it subclasses, or inherits from, nn.Module. Then, in the constructor, we define the layers of the neural network. This model right here is actually using convolutional layers; it's a convolutional neural network. We haven't talked about those yet, so don't worry about it, but a convolutional layer is essentially just a kind of layer within a neural network. And then, as we talked about earlier, here's the main method that's important for neural network models: every subclass of nn.Module needs to override the forward method from the base class. It's going to take in an example data point, or a batch of data points, X, and, using the layers of the model and maybe some other functions (as we can see in this case, which we'll talk about later), it actually returns the ultimate model prediction. That is what the forward method does. So this is nn.Module, and why don't we learn by example and create our own model.
Okay. So now we need to talk about an extremely important existing module in the PyTorch library that we'll use as a layer in our neural networks all the time, and that is called nn.Linear. If you're familiar with neural networks, you know that each layer of a traditional, vanilla neural network is actually just a bunch of nodes that are each doing linear regression based on the previous layer's input attributes. And you know that the only things you really need to specify are the dimensions of your matrix, which just depend on the current (output) layer's number of nodes and the previous layer's number of nodes. Those are the only required arguments to pass in to nn.Linear; the rest of these are actually optional, and we don't need to worry about them for now. So to create a layer of a neural network, specifically a traditional fully connected neural network, all we really need to specify is in_features and out_features. And in_features is essentially the number of nodes in the layer previous to this layer, the layer that's coming in to this one, and out_features is the number of nodes in this layer. So why don't we take a look at an example?
If you just Google neural networks, this is actually one of the diagrams that comes up. So I think a good way to make this diagram less scary is to implement this exact diagram in code. Just to make that super clear: this is our input layer to the neural network. Whatever data set we're dealing with here clearly has one, two, three, four attributes associated with it. And then our subsequent layers, this layer, this layer, and this layer, are actually doing computation, right? They're figuring out the weights, the W's, necessary to actually make the model give the right prediction. So we're going to have one, two, three instances of nn.Linear in our model, and this is just going to be the input to the model. So let's go ahead and say class, let's just call it MyModel, and we know it subclasses nn.Module.
And the first thing we need is our constructor, so we'll just go and create it. We'll say that there aren't really any other attributes the user can specify when creating this neural network; we'll just hardcode them to be whatever's in this diagram. First thing we'll need is our super call, which is kind of boilerplate code. And now we need to start defining our neural network's layers, right? So, in features and out features. We can do self.first_layer equals nn.Linear, and in_features is going to be one, two, three, four; and we can see that out_features, the number of nodes in this layer (remember, each node is doing linear regression), is one, two, three, four, five, six. So that's all we would do there. Then for the second layer, it's pretty self-explanatory: the out features from the previous layer should be the in features for this layer, so that would be six. And if we count how many are here, that's obviously six as well. But whatever model this neural network was for, for some reason it's predicting two numbers, right? So we can go ahead and do something like self.final_layer equals nn.Linear of six, and then kind of down-project to a dimensionality of two. Again, we don't know what use case or data set this model was created for; we just grabbed the diagram from Google Images, and we're just trying to clarify how to code up neural networks. We don't really care about the use case here, but for some reason this neural network predicts two numbers, so we would specify out_features as two over here, and that would be it. Now all we have to do is override the forward method; every subclass of nn.Module does need to do that. So we define forward. It is an instance method, so we have self, and we'll just say it takes an x, which is some series of data points, and we know that each data point needs to be of size four, it needs to have four attributes; that's what nn.Linear expects over here.
Now, since each nn.Linear instance is itself a subclass of nn.Module, it too has a forward method, already written for us by PyTorch. So we're going to call the forward method of each of these linear instances. We might write something like first_layer_output = self.first_layer.forward(x), calling the forward method of that instance of the Linear class. However, to make the syntax more concise, PyTorch also lets the following work: you can just pass in x directly, like self.first_layer(x), and PyTorch infers that you are calling the forward method of that layer. What we essentially want to return, then, is the result of passing x through the chain: the output of the first layer goes to the second layer, the output of the second layer goes to the final layer, which does some final matrix multiplications to get our output. So we can just return self.final_layer(self.second_layer(self.first_layer(x))), calling all the forward methods consecutively in a sequence, and this is our first neural network model. And one thing we should clarify is all the W's, the weights and biases: we haven't done any training yet at all.
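Putting the description so far into code, the model class might look like this. It's a sketch: the attribute names and the hidden-layer sizes are assumptions for illustration; only the four input features and two outputs come from the transcript.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Hidden sizes (8 and 6) are illustrative; the input has
        # four features and the network predicts two numbers.
        self.first_layer = nn.Linear(in_features=4, out_features=8)
        self.second_layer = nn.Linear(in_features=8, out_features=6)
        self.final_layer = nn.Linear(in_features=6, out_features=2)

    def forward(self, x):
        # Calling a layer directly invokes its forward method.
        return self.final_layer(self.second_layer(self.first_layer(x)))

model = MyModel()
example_datapoint = torch.randn(1, 4)  # batch of one, four attributes
prediction = model(example_datapoint)  # untrained, so values are meaningless
```

Until the model is trained, the prediction only demonstrates the shapes flowing through the network, not anything interpretable.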
If you check the documentation, it will say that the values for the weights of this model, and likewise for the biases, are initialized from some kind of probability distribution; the details aren't super important for this video. The point is that these initial weights are hidden, abstracted away from us, but the weights in the matrices this model is currently using to make predictions are initially just randomly chosen. So if I make an instance of this model, model = MyModel(), I now have an instance and could technically use it to get predictions for data points. Let's say our first example data point is a 1x4 tensor. The reason we make it 1x4 is that we want a batch size of one, a single data point, and every data point needs to have four attributes. We'll just send in completely random numbers for the data point; torch.randn is the PyTorch function to do this, and we can pass in (1, 4). If we send this into the model by calling model.forward(example_datapoint), or, since for all subclasses of nn.Module PyTorch will infer which method we're using, simply model(example_datapoint), the model will pass this data point from left to right through the network, doing all the matrix multiplications with the current weights. But the weights right now are completely meaningless; they were just randomly initialized from that probability distribution. So there's no point in printing the result; it would be completely uninterpretable, because we haven't actually trained the model yet. While we won't cover it in this video, the next step after making an instance of your model is to train it for some number of iterations; then we can actually use the model and get predictions. The training step, which we'll leave as a black box for now, involves calling various gradient descent functions from PyTorch, and that's absolutely something that we're going to cover in a later video.
But after we train this model, we can do all we want: we can keep using the model to see how good its predictions are and send in various example data points. So this is an intro to PyTorch. We covered the concepts behind a tensor and the basic functions associated with tensors, and we actually wrote our first neural network model, defining its architecture based on a predefined architecture we found on Google Images. Be sure to check out the practice problems, where you will write your own model classes to do things like diabetes prediction and predicting the next words in a sequence like ChatGPT. We will also have practice problems on training the model and using it to get predictions. Hey, so if you want to learn PyTorch, this is the video to
start with. We're going to learn the basics through a short coding problem where you can read the description on the left, write your code on the right, and there's a button to submit your code and run it against test cases. If you want to try it out, it's completely free; the link is in the description. I've actually created a whole list of these coding problems that take you from the basics of ML all the way to implementing your own GPT, and there's a link in the description for the full list as well. After I created all these problems, solutions, and test cases, my colleague and pretty famous YouTuber Node hosted them on this website, which he created, and this is the interface you're seeing. Okay, so let's take a little bit of time to read the description. There is a background video you can watch before solving the problem, but I'm going to go over everything right now as well, so you don't really need to watch that video.
So we're going to use built-in PyTorch functions to manipulate tensors. Tensors are the fundamental data type of PyTorch; it's where we store all the data and parameters for our ML models. You don't need any ML experience or experience with neural nets for this video, though the applications might be a bit clearer if you do. Tensors are basically just multi-dimensional arrays or matrices: not just two-dimensional, but potentially three-dimensional or four-dimensional, and they're how we store the data for our ML programs. So we have our tasks, which we can see here; these are the functions we're going to write. We have our inputs
for the functions on the right. And let's just quickly go over the examples. So here we have a 3x4 tensor: the input is M by N, so M is three and N is four, and our task is to reshape it into a tensor that has only two columns. If you only have two columns, that changes the number of rows, so we can see the result now has six rows and two columns. All the data has stayed the same, but we're reshaping it, re-viewing it in a way, so the data is represented or stored in a slightly different manner. That's very important when writing ML programs, since sometimes you need your data to be in a particular shape to pass into some other downstream function. So we're going to use a PyTorch function to do this. Next is
averaging. So we're given some sort of tensor, and the description says to find the average of every column. We can see this entry over here and this entry over here form the first column, and this number is the average of those numbers. We go to the next column: a data point here and a data point there, and we want the average of that column as well. So we have three numbers in the output, since we had three columns in the input. But we don't want to do this manually with a for loop; it's going to be way faster in terms of runtime to call a PyTorch function. The reason is that the PyTorch function will call really fast and efficient C or C++ code that also takes advantage of parallel processing whenever possible. That's a general rule for writing ML programs: avoid traditional for loops and opt for these really optimized functions whenever possible. And we'll go
over that in a second. Then we want to concatenate two tensors. We're just going to concatenate them left to right: over here they want to combine an M x N tensor and an M x M tensor into an M x (N + M) tensor. We can clearly see the number of rows is staying the same; we still have M rows, but the number of columns has increased, so we're concatenating left to right. In the example we have a 2x3 tensor and a 2x2 tensor, and when we stick them together the 2x3 is still intact on the left, with the 2x2 stuck to the right. That's something we'll also want to do when writing ML programs, so let's get familiar with how to use the concatenate function in PyTorch. And lastly, we want to get the loss. Something we're going to have to do when later writing ML programs is to get the error of our model at every iteration or every
100 iterations. So given the model's prediction and the actual true answer we wanted the model to predict, let's write a function that gets the error, or the loss as it's called. We can read over here that it says to use the mean squared error loss; I'll flash an equation for that on the screen soon. Basically, it means to go over every data point. Here we have the model's prediction for a data point: the model predicted zero and the true answer was one. Then the model predicted one for this data point and the true answer was one. So you go through each data point and take the difference between the model's prediction and the true answer, the actual target the model was supposed to predict. Then you square all the differences and average them together: add them all up and divide by the number of data points. We also don't want to do this with a for loop; we're going to use a very optimized function that we just call from PyTorch, and we'll explain later why exactly we want to use that function, but ultimately we do just want to return the
output. Okay, so now let's jump into the code, and if you want to see more of these types of ML coding problems, definitely leave a like on the video. So
the first function we have to write is reshape, and the comment tells us that torch.reshape is going to be really useful. Let's take a look at the documentation for that. It takes in some sort of tensor that we want to reshape, and the second thing we have to pass into the function is a tuple that represents the new shape. We can see here that we started off with torch.arange(4). If you're not familiar with that function, it's basically just going to give us back a tensor with the numbers from 0 to 3, so it's exclusive of four, and that's a one-dimensional tensor. But let's say you want to reshape that into a 2x2 tensor. The output still has our 0, 1, 2, and 3, but now it's in 2x2 format. So we just have to pass in the tensor we want to reshape, a in this case, and a tuple with the new shape. In code, that's just going to look like torch.reshape.
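As a sketch, the finished helper could look like the following (the parameter name is an assumption; the -1 trick and the rounding are explained next):

```python
import torch

def reshape(to_reshape: torch.Tensor) -> torch.Tensor:
    # -1 lets PyTorch infer the number of rows from the fixed
    # number of columns (2) and the total element count.
    reshaped = torch.reshape(to_reshape, (-1, 2))
    # The problem's comments also ask us to round to 4 decimals.
    return torch.round(reshaped, decimals=4)

# A 3x4 input (12 elements) becomes 6 rows of 2 columns.
reshape(torch.arange(12.0).reshape(3, 4))
```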
And the first thing we're going to pass in is the tensor to reshape. Now we need to write our tuple. Here's what I'm actually going to do: we can just say (-1, 2), because we know from the problem description that we wanted the result to have two columns; that's the two here. If we specify the number of columns, and the number of data points stays the same, then that automatically determines the number of rows. Think back to the example where we had a 3x4 input: that's 12 elements, and if I say the output has to have two columns, that forces it to have six rows. So we can just say -1, and PyTorch will infer what number should be there for that dimension, which should be six if we had 12 total elements. So this is our input to reshape; we pass in the tuple as well, and that's all we need to do. Actually, the comments also tell us to round, so we'll just say torch.round and pass in decimals=4, and then we can return this. That's all for that function. Okay,
next is the average function. Instead of iterating over each column to get its average, we're just going to use torch.mean, since it'll take advantage of parallel processing and be a lot more efficient. Let's take a quick look at the documentation. All we really need to do is pass in the tensor we want to find the average of, plus a parameter called dim. This is just an integer: the dimension along which we want to take the mean. Do we want the mean of each row, or the mean of each column? And there's something a bit tricky here that we have to be careful about. We're going to say torch.mean(to_average, ...), but dim is either going to be zero or one, because we're dealing with a two-dimensional M x N tensor. Zero corresponds to the first dimension (zero indexing, the M in M x N), and one corresponds to the second dimension, the N in M x N. We want the average, or mean, of each column, so it might be tempting to just say dim=1, but that's not the convention for PyTorch. We would actually say dim=0, since we want the average of each column.
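The difference is easiest to see on a small example (illustrative values):

```python
import torch

t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# dim=0 collapses the rows, giving one mean per column.
col_means = torch.mean(t, dim=0)   # tensor([2.5, 3.5, 4.5])

# dim=1 collapses the columns, giving one mean per row.
row_means = torch.mean(t, dim=1)   # tensor([2., 5.])
```

So dim names the dimension that gets reduced away, not the one you keep.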
We're going across the rows for each column, so we would say dim=0. And I know that might be a little confusing at first, so feel free to use the link in the description to play around with it. Run your code; this code sandbox supports print statements, so use some print statements and try to see the difference between dim=0 and dim=1. Next is concatenate, and we want to use torch.cat. Take a quick look at the
documentation. The two things we need to pass in are the tensors we want to concatenate, as any Python sequence of tensors (it could be a list of our two tensors, a tuple, anything like that), and the dimension along which we want to concatenate them. Are we trying to concatenate our tensors left to right, like in the problem description, or stack them, one on top and one on the bottom? Obviously we want left to right. So we can go ahead and say return torch.round of torch.cat, and we'll just use a tuple: I'll say cat_one and cat_two, so I've put both those tensors in a tuple. And we know we want to concatenate them left to right; we want the number of columns to increase, so we will say dim=1.
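A sketch of the helper (parameter names taken from the transcript's "cat one" and "cat two"; the rounding mirrors what the transcript dictates):

```python
import torch

def concatenate(cat_one: torch.Tensor, cat_two: torch.Tensor) -> torch.Tensor:
    # dim=1 concatenates left to right, growing the number of columns;
    # dim=0 would instead stack the tensors vertically.
    return torch.round(torch.cat((cat_one, cat_two), dim=1))

# A 2x3 and a 2x2 combine into a 2x5 tensor.
concatenate(torch.ones(2, 3), torch.zeros(2, 2))
```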
And that's actually it for this function. Our last function is get_loss, and this one's pretty straightforward.
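A minimal sketch of it, with the rounding the problem asks for (the function name matches the transcript; the exact signature is an assumption):

```python
import torch
import torch.nn.functional as F

def get_loss(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Mean squared error: average of (prediction - target) ** 2
    # over every element, computed by one optimized PyTorch call.
    return torch.round(F.mse_loss(prediction, target), decimals=4)
```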
All we really need to do is call torch.nn.functional.mse_loss, for mean squared error loss, and pass in the prediction and the target. That's it. Then we can just make sure to round our answer to four decimal places. Again, the reason we want to use this function instead of manually going over each data point, taking the difference, squaring it, and finding the average, is that this function will take advantage of parallel processing and can operate on multiple columns in our input simultaneously. When we
press submit, we can see that our code works. If you want more practice problems, definitely check out the full list of problems in the playlist linked in the description, and you can jump into whichever problem is right for you. Definitely leave a comment as well if you found this helpful, and I'll see you soon. Okay, let's talk about dropout. This is a really important concept in deep learning and in training neural networks. Dropout solves the problem of overfitting.
So overfitting is just when your training performance, your training accuracy, is greater than your testing accuracy. Let's say you're training your model. Error is going down with every iteration; you think your training accuracy is great. Then you test the model, getting its predictions on data it's never seen before, and the predictions are horrible. That's what overfitting is. Overfitting is caused by the model essentially memorizing irrelevant details in the training data, essentially just noise, and when that noise doesn't appear in the testing data, its predictions aren't too great. So what causes the model to memorize this irrelevant noise in the training data? It's caused by the model being too complex. This could mean the model has too many layers, or that each layer has way too many nodes. The point is, the model is just one big mathematical formula, and right now that formula is way too complex: the model is memorizing all these irrelevant intricacies in the training data, which causes its performance on the testing data, which is what we actually care about, to go down. Dropout is one of the techniques created by deep learning researchers to solve this problem of overfitting. So let's explain dropout. Let's say we have a neural network with one hidden layer: three nodes in our input layer, say, and two nodes in our output layer. We draw our connections, and this is going to be a fully connected neural network, as we've been doing so far.
Say we add a dropout layer after this linear layer, something like nn.Dropout; the only thing you have to specify for a dropout layer is a probability p, maybe 0.2 or 0.4. If you apply dropout to this linear layer, then at every iteration, every node independently gets turned off with probability p: with probability p this node gets turned off, with probability p this node could get turned off. And when I say turned off, I mean its output or activation is set to zero. So this node over here, with inputs x, y, z: its output or activation is based on its weights, W1*x + W2*y + W3*z, plus maybe an optional bias. What we mean by turning this node off is setting that output, its activation, to zero. That would be like severing its three connections, temporarily, just for that one training iteration. So what dropout does, just to be super clear, is that if you apply a dropout layer to a linear layer like this one, at every iteration of training there will be a probability p that a given node is turned off and its activation set to zero, which essentially means its connections with the nodes in the previous layer are severed, and that is done independently for each node. So for that node I just drew, there is a chance it gets turned off as well. And what dropout does is essentially reduce the complexity of our model a little bit.
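To see this concretely, here's a minimal sketch (the probability 0.4 and the tensor size are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.4)  # each activation is zeroed with probability 0.4
drop.train()              # dropout is only active in training mode

activations = torch.ones(1, 8)
print(drop(activations))  # some entries become 0; survivors are scaled up

drop.eval()               # in eval mode, dropout is a no-op
print(drop(activations))  # all ones again
```

One detail worth knowing: during training, PyTorch rescales the surviving activations by 1/(1-p) so that the expected output matches what the layer produces at evaluation time, when nothing is dropped.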
We're essentially, with some randomness, going to delete some of the nodes in a given layer. This is going to make our model less complex, and actually it's going to make it a bit stupider. Think of dropout as taking our big giant neural network, dropping it on the ground, and knocking some of the screws loose. Most of it's still intact, as long as the probability value is not too high, but we made our network a bit stupider; we decreased its ability to learn really intricate noise in the training data. So now the model is focusing on the big picture instead of memorizing specific noise in the training data. And by making it a bit stupider, it's been found time and time again that our testing accuracy, which is what we really care about, will go up. So dropout does increase the performance of our neural networks, especially as they get deeper and deeper, meaning more and more layers, and it's definitely something we want to include in our neural networks. So you can jump into the code now. Okay, let's solve digit classifier. We're going to build a neural network that can recognize handwritten digits. They'll be
passed in as images, and they're going to be black and white. Our job is to predict what digit is in the picture; the model needs to interpret this. This is a simple but still powerful application of neural networks, and I'd definitely recommend checking out this 10-minute clip, at the timestamp the link starts at, for a background on neural networks. So, we're given a model architecture in the description and this blurb right here, and most of our video will be explaining how to reason through that and code it up. But let's first take a look at our input. The two things we have to write are the architecture, meaning the constructor, and the forward method that every neural network class in PyTorch has. The input here is the input to the forward method, and it's essentially one or more, so it could be a whole batch, of 28x28 black and white images. We're guaranteed that every single batch element will be of size 28 * 28, that's 784, so you can think of the image as flattened out into a horizontal vector. It says not to write the training loop or gradient descent to actually train the model and minimize the error; that's going to be in the next video. So let's take a look at the example. We have our input image here, and since it's of size 784 we've actually omitted many of the indices, but this is essentially a vector where every number is between 0 and 255, 255 being completely white and zero being completely dark. Our output here is a vector of size 10, and every entry in this vector is between 0 and 1, so we can interpret it as a probability. If we look at the index corresponding to seven, we can see that the model is essentially saying there's a 90% chance that this input image is a seven; that's the model's confidence. And you might think that sevens kind of look like twos, depending on how some people draw them, so the model thinks there's some slight chance that this input image is a two. But this output gives us an idea of what the forward method needs to return: it needs to return a list of probabilities. And just a note that your exact model prediction once you run your code isn't going to match this, because we're not going to train the network; the weights of the model will just be whatever they're randomly initialized to, and your prediction won't be too great. But this is just to understand the format of the output, and in the next problem we'll actually train the model and see it achieve something like 98% accuracy. Let's jump into the architecture explanation.
Okay. So here I've drawn the architecture described in the problem description. We're going to do two things: explain exactly why we're supposed to use this architecture, and give code snippets so you can implement it without jumping all the way to the solution at the end. The first layer on the left is the input layer, and I've gone ahead and drawn 784 neurons. The reason is that we should treat each image in our input independently; let's just focus on passing one image at a time from left to right through this neural network. For each image, there are 784 corresponding numbers or features, specifically the grayscale activation at each location in the 28x28 image. So we have 784 numbers in the input layer; you can think of a number being stored in each of these nodes or neurons. To the right of it, we have the first linear layer of this neural network, which has 512 neurons (there's a small typo in the drawing, which says 584; let's just go ahead and say 512). These are not input neurons anymore; rather, they're internal to the network. They're not the input, they're not the output, so we would call them hidden neurons. Each of these neurons, though I haven't drawn the connections here, all 512 of them, is fully connected to the input layer. So this neuron is connected to all 784 input neurons, and there would be many, many connections here, as for every other neuron in this hidden layer. Because each of those neurons is fully connected to every node in the input layer, there are 784 weights or W's, W1, W2, W3, all the way out to W784, stored inside this neuron, and likewise 784 different weights stored in that neuron. So we have 784 weights per neuron, yet we have 512 neurons. That's why all the weights for this linear layer are stored in a matrix, to make all the computations a lot more efficient: all of those weights, plus some optional biases. 784 * 512, that's a lot of weights. And
this layer does a lot of the heavy lifting in this model; the model is able to learn so much just from this one layer. The model is essentially figuring out how important every single pixel is. For each of the 784 pixels, we have 512 weights in this layer, because one way to think of it is that each of the 512 neurons in this layer has 784 weights associated with it, and each of those weights is for a single pixel, since each neuron in this layer has one connection to each of the 784 neurons in the preceding layer. So this layer helps the model learn how important each pixel in the image is, depending on what number is in that pixel in terms of the grayscale value. So each pixel has 512 weights associated with it, one from each neuron. After this layer we have a nonlinear activation, a ReLU activation. And if we
take a look at the ReLU function, which is listed right here, we can see that it's essentially like an on-off function: before x = 0 the function is off, and here the function is essentially on. This function helps the model learn when a feature is important enough to pay attention to.
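The on-off behavior is just taking the maximum with zero; here's a tiny sketch of it by hand:

```python
import torch

def relu(x: torch.Tensor) -> torch.Tensor:
    # "Off" (zero) for negative inputs, identity ("on") for positive ones.
    return torch.maximum(x, torch.zeros_like(x))

relu(torch.tensor([-2.0, -0.5, 0.0, 1.5]))  # negatives zeroed, 1.5 passes through
```

In practice you'd use the built-in torch.relu or nn.ReLU rather than writing it yourself.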
Let's say the model needs to detect that some feature in the input is past some kind of threshold, and then we have to start paying attention to it; it's not just zero or completely irrelevant anymore. That's what the ReLU activation can help the model learn. It helps the model learn a far more complex relationship than those from linear regression, like W1 * x + W2 * y. If this were just our output, with no nonlinearities, there would be a limit to how complex a relationship the model can learn. To give an example of a cutoff feature in this model, let's say the model is trying to differentiate between sevens and twos. I've drawn this two intentionally, because even though this isn't a great two, we'd still want the model to pick up that it's a two for it to be a good digit classifier. And let's say the model learns that the cutoff between a seven and a two is how many non-black pixels we have in this region of the image. Well, if we had this as a seven, it looks like a seven, but if we just had a little more over here, should the model interpret it as a two? This is a silly example, but it gives you the idea that the model needs some way of having thresholds and cutoffs, and the ReLU activation is one way the model achieves that. One
additional thing I wanted to touch on before we get to dropout is why 512 neurons here. We have an explanation for why 784, and we'll have an explanation for why 10 over here, but why 512? The truth is, it's somewhat arbitrary. I said this model is able to achieve something like 98% accuracy once we train it, but it would probably achieve similar accuracy if we used, I don't know, 550, or 500, or even went a bit lower and said 490. It's a little arbitrary, and a range of numbers would probably work for this layer. But the general idea is that the larger this number, the more complex a relationship the model is able to learn, so we would not want a number that is too small; that would probably not yield great results. Additionally, depending on how large our dataset is and how much there is for the model to actually learn, if we use a number that is too high, then the model has too many parameters and can start overfitting. I have a separate video on overfitting, but overfitting is essentially when the model starts memorizing irrelevant noise in its training data, and this actually makes performance worse if we have way too many neurons, way too many parameters. So there is a sweet spot, and 512 works well. Definitely check out the following video for some proof that this model does achieve great accuracies. So dropout, why are we applying a dropout layer over here? I
have a separate video explaining dropout. But essentially what dropout
dropout. But essentially what dropout does is it kills some of the nodes in the prior layer. So at every iteration of training we have some probability each node has a chance of just getting
wiped out which actually lowers it slightly lowers the complexity of the model during training. And this actually helps to prevent overfitting because if
the model doesn't have as many neurons if it doesn't have as many weights even temporarily then the model is essentially able to avoid memorizing irrelevant noise in the training data.
The last and most conceptually significant layer is the final output layer, which has 10 neurons. Although I haven't drawn the previous layers' neurons over here, the total number of neurons has not changed yet: we still have 512 post-ReLU and 512 post-dropout. Those layers just change the numbers in each neuron, not the total number of neurons. But in this final layer, where we have 10 neurons, each neuron is fully connected to each of the 512 neurons in the preceding layer. The reason we're choosing 10 neurons in this final layer is that we want our model to output 10 numbers, one for each of digit zero through digit nine, which we can then interpret as probabilities. And of course, each of these 10 neurons has 512 weights associated with it; because there are 512 neurons in the previous layer, there are 512 numerical features this final layer can pay attention to. Lastly, we are going to apply the sigmoid function. By now, we've seen that it causes our model's output to be between zero and one, and the fact that the numbers are between zero and one means we can more easily interpret them as probabilities. So
the last thing I want to mention before we jump into the code is nn.Linear. nn.Linear is how we set up this layer and this layer in code, and it takes in_features and out_features; those are the two things we have to pass into its constructor. For the first linear layer we have 784 input features and 512 output features. For the second linear layer, which is over here, we have 512 input features and 10 output features. In between, we will use nn.ReLU, to which we don't have to pass anything; that function is already predefined, as we saw in the previous image. We'll also use nn.Dropout, and the only thing you have to pass to nn.Dropout is the probability, so if we use a probability of 0.2, as mentioned in the problem description, that's what we pass in. Lastly, we will have nn.Sigmoid.
So all these nn instances will live in the constructor of our handwritten digit recognizer, and then in the forward method we will string together the forward calls for all of these layers, starting with the first linear layer all the way to the final sigmoid. That will be our model. Here in the constructor is where we define the architecture. We have our first linear layer, self.first_linear, which is an instance of nn.Linear; we talked about how it has 784 input features and 512 output features. After this layer, we are really interested in applying some nonlinearity, so we go ahead and create our nn.ReLU. There's nothing we have to pass in there; that function is already defined based on the graph we saw earlier. Then we would like to apply our dropout layer, so we create nn.Dropout with a probability of 0.2; we obviously don't want to use too high of a value there and just totally destroy our neural network. Lastly, we have the final linear layer. We can actually call this a projection, because we want to project the dimension down to output 10 neurons, 10 different probabilities. This will be another instance of nn.Linear with 512 input features and 10 output features. And finally, we have our sigmoid instance to make all the outputs between zero and one.
Then we are asked to return the model's prediction to four decimal places. This is going to require calling the forward methods of all of those other nn instances. Our solution itself is an nn module, so let's write forward for this module. One quick thing: instead of explicitly calling forward, so instead of doing something like self.first_linear.forward(images), we can just use the syntax self.first_linear(images) and get the same behavior. We want to pass that into the ReLU, so self.relu of whatever that returns. Then we do the same for dropout: self.dropout of whatever relu returns. Then we do our final projection, self.projection of whatever dropout returns, and lastly self.sigmoid of whatever the projection returns. And since we are interested in rounding this to four decimal places, we can store it in a variable called out and simply return torch.round of out with four decimal places, and we can see that it works.
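Putting the constructor and forward method together, here's a minimal sketch of the module described above. The attribute names like `first_linear` and `projection` are my guesses at the names used in the lesson, and `decimals=` in `torch.round` requires a reasonably recent PyTorch:

```python
import torch
import torch.nn as nn

class DigitRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        # 784 pixel features in, 512 hidden features out
        self.first_linear = nn.Linear(784, 512)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        # Project the 512 hidden features down to 10 class scores
        self.projection = nn.Linear(512, 10)
        self.sigmoid = nn.Sigmoid()

    def forward(self, images):
        out = self.first_linear(images)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.projection(out)
        out = self.sigmoid(out)
        return torch.round(out, decimals=4)

model = DigitRecognizer()
preds = model(torch.rand(3, 784))  # a fake batch of 3 flattened images
print(preds.shape)  # torch.Size([3, 10]), every entry between 0 and 1
```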
Now that we've written our first neural network, we can jump into training loops in PyTorch. Once we have a PyTorch model defined, how do we train it, and how do we actually use it to get predictions on data points we've never seen before? That's what we're going to cover in this video. So let's say you have written a class; in this case, it is a model that recognizes images of handwritten digits. You've written your constructor, which is where you define your model architecture, and you have your forward method, which is where we get the model prediction. In this video, we're going to go over how to actually train a model in PyTorch: how do you call the right functions to do gradient descent? You won't have to take any derivatives by hand; PyTorch will do all of that for you. But how do you actually, over some number of iterations, update this model and make it better and better based on the training data? We're going to explain the training loop in this video, and it is fundamental to all future neural networks that you train. The code below is going to be the same regardless of the neural network, and while it might seem like boilerplate, there are actually a lot of important concepts embedded within it. The first step is to make an instance of your model, the class that we defined earlier. Next,
you'll need to define your loss function. For our previous linear
function. For our previous linear regression problems, we used the mean squared error, and it was kind of an ugly looking formula with a summation and a square, but we did kind of make
sense of it. And ultimately, we realized that if we minimize that error, then we increase the chance that the model will do well on data points it's never seen before. Because this problem is a bit
before. Because this problem is a bit different. It's not a regression
different. It's not a regression problem. It's a classification problem.
problem. It's a classification problem.
Right? Our model is taking in an image and the model needs to predict which digit it is and there's only 10 possible digits. So the model will actually
digits. So the model will actually output probabilities that the given input image belongs to a certain class.
Given, say, a picture of a number seven, this model will hopefully output something like 95% for the class "seven." The model is outputting probabilities here, and that's why we're not going to use the mean squared error for these probability-based models. In that case, we use something called cross-entropy loss. It sounds like an ugly term, and if you look it up, there's a somewhat confusing math formula which isn't essential to understand right now; we'll go over it in a different video. The fundamental idea is that our model is no longer outputting continuous numbers, as it would if we were predicting the price of an Uber ride or predicting how tall someone is going to be. Instead, our model is doing classification: it's trying to put the input into one of some fixed number of buckets, and it's outputting probabilities, which is why we need a different error function, a different loss function. If we skip down here, the cross-entropy loss takes two things, just like the other error functions: the model's prediction and the actual answers, the ground truth labels from our data set. So let's say the model's prediction is some probability, and the probability it gave for some input image was 0.8; the model thinks there's an 80% chance that that image of a number seven is actually a seven. The labels, meanwhile, are always essentially ones or zeros. The label in that case would just be one: the true answer is that there's a 100% probability that that given image, a number seven, belongs to the class of images of sevens. So, treating this loss function as a black box, it takes in the probabilities of the model's predictions and the true answers, and outputs our error.
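Treated as that black box, the loss boils down to penalizing low probability on the correct class: for a single example, cross-entropy is just the negative log of the probability assigned to the true class. Here's a hedged pure-Python sketch of that idea (not the exact formula PyTorch uses internally, which operates on whole batches of raw scores):

```python
import math

def cross_entropy_single(predicted_prob_of_true_class: float) -> float:
    # Higher probability on the correct class -> lower loss.
    return -math.log(predicted_prob_of_true_class)

print(cross_entropy_single(0.8))   # ~0.223  (fairly confident and correct)
print(cross_entropy_single(0.99))  # ~0.010  (very confident and correct)
print(cross_entropy_single(0.1))   # ~2.303  (confident in the wrong answer)
```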
The next really important line of code, which will appear in every single training loop, is something called the optimizer. This is essentially an object in PyTorch that does gradient descent for us. When we create the object, we need to pass in the parameters of the model via model.parameters(), which tells the optimizer all the weights, all the W's inside this neural network, that we need to optimize and update over some number of iterations of training. You might be wondering what Adam is. Adam is kind of like gradient descent on steroids. It's still doing gradient descent: it's taking those derivatives and using the learning rate. In fact, it's using a default learning rate here, since we didn't pass one in; it'll assume a default of 10^-3, or 0.001. But Adam is gradient descent with some optimization tricks to dynamically change the learning rate over the course of the algorithm. Once we're getting closer to the minimum of a function, we don't want to accidentally overshoot, so we'd want to decrease the learning rate at that point. That's an example of the tricks and optimizations that Adam uses, but torch.optim.Adam is still gradient descent.
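A minimal sketch of those two setup lines, using a stand-in one-layer model in place of the digit recognizer:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # stand-in for the digit model; any nn.Module works

loss_function = nn.CrossEntropyLoss()
# model.parameters() hands every weight tensor to the optimizer;
# lr defaults to 1e-3 when not passed explicitly.
optimizer = torch.optim.Adam(model.parameters())
print(optimizer.defaults["lr"])  # 0.001
```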
Next, we always have to define the number of epochs. An epoch is defined as the model being trained on the entire training data set once, so five epochs would be five passes over the training data set. You can guess that as you do more epochs, the model "memorizes" the training data better and better. Too many epochs might not be good, though, since we don't want our model to pay attention to small, unnecessary details in the training data; that's called overfitting, and we want the model to perform well on data it's never seen before rather than solely memorizing the training data. Then we iterate over the number of epochs, and within each epoch we have something called a train data loader. That's just a couple of lines of code to define, and I've already done that for us earlier in the code; we can delve into the lines needed to create the train data loader in a different video. Essentially, this is an iterator that gives us tuples: a batch of images, where each image is 28 x 28 (or you can think of it as 784), along with the corresponding labels, which tell us which digit is in each of those images.
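For reference, here's a hedged sketch of what building such a data loader can look like, with random stand-in tensors in place of the real handwritten-digit data used in the lesson:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fake stand-ins: 64 "images" of shape 28x28 and 64 integer digit labels.
images = torch.rand(64, 28, 28)
labels = torch.randint(0, 10, (64,))

train_data_loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

for batch_images, batch_labels in train_data_loader:
    print(batch_images.shape, batch_labels.shape)  # torch.Size([16, 28, 28]) torch.Size([16])
    break
```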
So images and labels are the same size, and we're pairing them up in a tuple. The first thing we have to do with the batch of images, instead of having each one be 28 x 28, comes from referring back to the model architecture: the first linear layer is expecting a vector of size 784. So we're essentially flattening each image out. We're viewing, or reshaping (you could use torch.reshape here as well), that image into a flat vector of size 784, but it's still encoding the same information for every pixel in the image. The next part is the training body. These several lines of code are incredibly important to understand and are the main focus of this video. The first thing we do is call the forward method of our model. This syntax over here calls the forward method from the model class we defined and gets our model prediction.
The next step is a bit of a frustrating line that is required in PyTorch but will become second nature to you: optimizer.zero_grad() cancels out all the derivatives that were calculated in the previous iteration of gradient descent. We know that at every iteration we want to recalculate the derivatives so we can update our weights. But PyTorch by default will store the previous derivatives and add them to the derivatives we calculate in this iteration of gradient descent, unless we call zero_grad, which clears the gradients from the previous iteration. Next is a line of code that will always appear in a training loop: calculating the loss, or the error, based on our current model prediction, which is one argument passed into the loss function, along with the ground truth labels. The next line of code is probably the most important line of a training loop in PyTorch: loss.backward(). This calculates every single derivative necessary to perform gradient descent. Depending on how big the neural network is, this model may have tons of W's, tons of weights, and tons of derivatives that need to be calculated; specifically, the derivative of our error with respect to each of those weights, so that we can then update those weights based on the learning rate. loss.backward() is probably the most computationally intensive step of this entire program: it calculates all the necessary derivatives and stores them in such a way that we can actually use them. The next line of code, optimizer.step(), is the line that updates all of our weights. You can think of optimizer.step() as doing: new w equals old w minus the derivative times the learning rate. That is exactly what optimizer.step() is doing: taking a step in the direction, hopefully, of the minimum of our loss function. The whole point of gradient descent is minimization, and step uses all the derivatives that we calculated in the previous line of code.
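Stringing those lines together, here's a minimal, self-contained sketch of the training body described above; the data and model are random stand-ins so the loop actually runs end to end (the lesson's real code uses the MNIST data loader and the digit model instead):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)                 # stand-in for the digit recognizer
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Fake batch: 16 flattened "images" with integer digit labels.
images = torch.rand(16, 784)
labels = torch.randint(0, 10, (16,))

for epoch in range(5):
    predictions = model(images)            # forward pass
    optimizer.zero_grad()                  # clear gradients from the previous iteration
    loss = loss_function(predictions, labels)
    loss.backward()                        # compute d(loss)/d(weight) for every weight
    optimizer.step()                       # w_new = w_old - lr * derivative (plus Adam's tricks)
    print(epoch, loss.item())
```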
And then we can see how the model is doing once training is done. I've gone ahead and run this cell, so we have a trained model now; it has the right weights in this model object, updated over the course of the algorithm. If you're wondering how the weights actually got updated, a reminder: for the optimizer, which does the step, we told it at construction time which model parameters to update. So after this code is done, we have the right model parameters. We can put the model in something called evaluation mode, which tells PyTorch that, because we're just trying to get a bunch of model predictions, it shouldn't worry about calculating the derivatives needed for training. Then we can iterate over our test data loader, reshape our images as we did before, and pass them into the model. Think about what the model's output should be: something like batch size by 10. We are feeding a batch of images into the model in this line of code, and for every image we are predicting 10 different probabilities: the probability that the image belongs to the class of handwritten zeros, or handwritten ones, all the way through handwritten nines.
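Here's a minimal sketch of that evaluation pass, with a stand-in "trained" model and fake test images; taking the max over dimension 1, discussed next, turns each row of 10 numbers into a single predicted digit:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)          # stand-in for the trained digit model
model.eval()                        # evaluation mode: no training-time behavior like dropout

test_images = torch.rand(4, 28, 28)
with torch.no_grad():               # skip derivative bookkeeping during inference
    outputs = model(test_images.view(-1, 784))   # shape: batch size by 10
    predictions = outputs.max(dim=1).indices     # index of the largest of the 10 numbers

print(outputs.shape)      # torch.Size([4, 10])
print(predictions.shape)  # torch.Size([4]), one predicted digit per image
```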
So you have 10 numbers for every single input image, but the way we can say what the model's prediction is, based on those probabilities, is just by taking the max: whichever probability was highest is what we'll take as the model's actual prediction. If this tensor is batch size by 10, we want the max index from every single row, which means taking the max across dim equals 1. Then let's iterate over the images that we fed in for testing and print them; we have to reshape them back into 28 x 28. And let's print the model's prediction and see how the model did. So after we run this line of code: this was an image of a 7 passed in, and it looks like the model, after we printed the predicted index, actually predicted that this was a seven. So the model recognized this image. We can see the model did a great job here again, predicting two as the highest probability; a one here, a zero here, a four here, and the model did well for these five images. If we print some more, though, we can probably find images where the model didn't do so great, so why don't we take a look at one of those. Here we have an image where the model predicted that this clear two was a three; that was the highest probability it assigned among all the possible digits for this input. So
this neural network is not perfect; it doesn't have 100% accuracy. But it's still pretty simple and achieves overall pretty good results without a convolutional neural network (don't worry if you don't know what those are yet). With this simple neural network architecture, you can achieve around 98 or 99% accuracy on this handwritten digit data set. That just shows that even with these simple neural networks, we're able to learn pretty powerful relationships; we're teaching computers to see. So the main takeaway from this video is that our neural networks are going to get more and more complicated, but that training block of code, where we get the model prediction, zero out the previous gradients, calculate the derivatives, and do optimizer.step() (and of course, first define our loss function and our optimizer) will hold standard for almost every neural network we train. So be sure to check out the quiz to make sure you understand that. In this problem, we're going to solve the PyTorch training quiz, and I would definitely recommend really understanding these concepts, as it's the same code you would use to train any neural network, including the handwritten digit model, NLP
models, and even ChatGPT in future problems. So the first question: what would happen if you called zero_grad after calling backward in a training loop? zero_grad clears out all the derivatives that may have been calculated in previous iterations, and we know backward calculates derivatives. Option A says training would be sped up due to parallelization; that's completely irrelevant here, since we're not doing anything with the GPU directly in these lines of code. Option B, no weights would change after calling step, is actually the correct answer. We know that the weights are updated based on the values of the derivatives; that's the formula for gradient descent. But if we cleared all the derivatives, setting them to zero, then when we subtract the derivative times the learning rate, none of the weights would change. Option C says training would take longer but would still minimize the error; actually, no, we wouldn't make any progress in minimizing the error. And option D says we would get a runtime error; you wouldn't get a runtime error, you would just be confused as to why your model isn't working. The next question asks what happens when backward is called, and we just talked about this: the derivatives for each and every weight.
So the derivative of the loss (the error function) with respect to each weight is calculated, which is what allows gradient descent to minimize the loss function; that's the function we're minimizing. The weights are not actually updated here; that would be optimizer.step(). The loss is not calculated either; we would have to call the loss function to do that. And for D, we are not adjusting the learning rate in this step; that is handled automatically by the optimizer.
The next question asks what happens when step is called. We know this is the line of code that actually updates all our weights by subtracting the learning rate times the derivative, so we can see over here that the weights are updated based on the update rule. The data set is not randomized; that's just not handled by the optimizer. As for the learning rate, it may or may not decrease or increase: the optimizer adjusts it as needed. It's pretty customary to start with a small learning rate and slowly increase it, based on how the model is adjusting to the current value, and then towards the end of training, when we're getting close to minimizing the loss function, decrease the learning rate. So just by knowing that we're calling step, we can't tell from this line of code alone whether the learning rate is changing; that's handled internally by the optimizer algorithm we use. But we do know that it means we are updating our weights based on the current learning rate. So here's a quick
reminder on the cross-entropy loss function: this is an error function used for models that output probabilities. Say your goal is to categorize an input image as one of 10 digits, 0 through 9. Your model outputs a probability that the input image belongs to each of our categories, and we also know, for every image, the true class label, the true digit that it is. So this cross-entropy loss function takes in the model's probabilities for every training example we're processing in a given batch, as well as the correct answer for each image: the correct class the model should have assigned. Choice A says something about a regression model, so we automatically know that's out; we'd probably use mean squared error for regression. B, a language model that predicts the next word in a sentence among a fixed vocabulary: yes, that is definitely an example where we output a list of probabilities, one for each possible next word in the sentence. This is actually how transformers like ChatGPT work. And C, a classification model that predicts whether an email is spam or legitimate: this is also a case where we're doing classification, binary classification specifically, and outputting a probability that a given email is spam. So the answers are B and C. Then, in this line of code, we're instantiating the optimizer object and trying to figure out what algorithm is actually running inside it. That's going to be gradient descent; this is the object that actually updates our weights based on the calculated derivatives. I hope this was helpful, and definitely leave a comment if there's anything you would like me to explain in more detail. I highly recommend understanding the concepts behind PyTorch training loops, and after you understand this, you're definitely ready to jump into the next problem, which is our introduction to NLP, or natural language processing. NLP is the field of ML focused on teaching AI to read and write like humans. With the development of the transformer, AI-generated text is almost indistinguishable from human writing. But
before diving into transformers, there are some NLP fundamentals that are essential to understand. This video will go over three quiz questions that cover these fundamentals. We'll cover tokenization, the process of breaking a string into a series of characters, words, or subwords, also known as tokens. We'll also cover word embeddings, which are vector representations of each token. After you go through this quiz, you'll be well prepared to start my machine learning roadmap, which is completely free and available at the top link in the description. Let's get started with question one. Here is an example of a model vocabulary. The vocabulary is just the set of all the unique words the model encounters within its training body of text. Side note: the training body of text for ChatGPT is essentially the entire internet, while the training body of text for a sentiment analysis model, which is an emotion-predictor model, might be a series of movie reviews and a corresponding label for each sentence.
We can see that for every token in the vocabulary, there is a corresponding integer. When passing text into the
integer. When passing text into the model, each token is encoded with its assigned integer. The model is actually
assigned integer. The model is actually generating numbers and we decode the integers back into the corresponding tokens. But how are the integer
tokens. But how are the integer assignments for each token learned? A.
No learning necessary. They're
arbitrarily assigned at the start of training and kept constant throughout.
B. They're randomly initialized but changed during training as the model learns which integers best encode each token. C. They're initialized based on
token. C. They're initialized based on the initial embedding representations and kept constant forever.
D. No learning necessary. Each time the model is called during training or testing, we randomly reassign the integers. The answer is A. No learning
integers. The answer is A. No learning
necessary. They're arbitrarily assigned and kept constant. We need the integer assignments to be consistent so that we can decode the model's output back into coherent strings. But the way we assign
coherent strings. But the way we assign them can be completely arbitrary as long as we keep the mapping constant. But
that means that the integer encodings carry no information about the actual meaning of each word. And it's important for the model to understand the meaning of each word when predicting the emotion
in a sentence or generating a response to our prompt. That's where embedding vectors come into the picture, which we'll talk about in question three.
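A minimal sketch of that answer in code: the word-to-integer mapping below is arbitrary but fixed, which is all the model needs to decode its outputs. The vocabulary here is made up for illustration.

```python
# Arbitrary but consistent token-integer assignments: any fixed assignment
# works; sorting just makes this one deterministic. Vocabulary is made up.
vocab = ["the", "movie", "was", "great", "terrible"]

token_to_int = {tok: i for i, tok in enumerate(sorted(vocab))}
int_to_token = {i: tok for tok, i in token_to_int.items()}

def encode(text):
    return [token_to_int[w] for w in text.split()]

def decode(ids):
    return " ".join(int_to_token[i] for i in ids)

# The round trip works precisely because the mapping never changes.
print(decode(encode("the movie was great")))  # the movie was great
```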
Okay, question two. Here's an example of character-level tokenization, and here's word-level tokenization. Which of the following is an effect of using character level over word level?

The answer is B. The model vocabulary will be smaller, but diversity will be greater. If we use word-level tokenization, the vocabulary might be the set of all words in a language. But if we use character-level tokenization, the vocabulary would just be the set of all characters in the alphabet plus the special characters, which is still far fewer than the number of words there could be. So the vocabulary would be smaller. And for a model that generates one character at a time, there are many combinations or paths the model could take every time the next character is chosen. In contrast, with a word-level model, the entire next word is chosen each time the model is called, so the responses tend to be less diverse.
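A quick sketch, with a made-up sentence, of the two tokenization levels compared in question two. A character-level vocabulary is bounded by the alphabet plus punctuation (a few dozen tokens), while a word-level vocabulary can grow into the tens of thousands, but the character model makes many more choices per sentence.

```python
# Word-level vs character-level tokenization of the same (made-up) sentence.
sentence = "to the moon"

word_tokens = sentence.split()   # word level: one token per word
char_tokens = list(sentence)     # character level: one token per character

print(word_tokens)  # ['to', 'the', 'moon']
print(char_tokens)  # ['t', 'o', ' ', 't', 'h', 'e', ' ', 'm', 'o', 'o', 'n']
```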
Okay, question three. How are these vectors learned? Here's a quick summary of embeddings. They are vector representations for words that also encode meaning. Similar words will have representations that are closer to each other when visualized, and unrelated words will be farther apart. Also, the dimension of each vector here is only two, but in practice it could be in the thousands. When passing a sequence of text into a model, the first step in the calculations for the output is to fetch the embedding vectors for each word. You can think of this like a lookup table from the token integers to vectors. Once the embedding vectors are retrieved, a series of multiplications, additions, and nonlinear functions are used to form the final output.

Back to the question: how are these vectors learned? The answer is that they're randomly initialized, and over many iterations of training, the gradient descent rule is used to update the embedding vectors. This means that initially, words that are entirely unrelated may be very close to each other, and words that are related may be far apart. But over training, these vectors are updated. Gradient descent is the training algorithm, and it makes use of some very basic calculus. My three-minute video breaking down the equation is linked in the description for those looking to understand it deeply.

That wraps up our brief review of tokenization and embeddings. If you're interested in how to implement basic tokenization in Python, I'll also link that video in the description.
Okay, if you've made it to this part of the video, then I think you'll enjoy our ML community. I offer weekly lectures, one-on-one mentorship, and a few special bonuses. If you're interested, just head to the link in the description to learn more. I hope you found this review useful, and I'll see you soon.

Let's solve intro to NLP, or natural language processing. This is actually our first problem on NLP, and this is going to be really exciting because now we can finally get into the application of neural networks. Our examples and problems have been maybe a bit abstract so far, but we're finally ready to jump into NLP and build models that can do interesting things like generate text and detect emotion. We're going to build a sentiment analysis model in a later problem, and we're finally going to explore NLP in more detail. So in this problem, we're going to start from a raw body of text, so just strings, and set up a training data set. You may have heard that ChatGPT uses almost the entire internet for training, but that's just a giant string, right? You could actually represent that in one massive string. How do we convert these strings into numbers? How do we convert them into integers that the model can actually understand? These models obviously work with matrices and all these matrix multiplications, so the model needs to take in numbers, not strings. In this problem, we're going to do exactly that.
We're going to do something called tokenization. Tokenization is just a fancy term for encoding whatever your input to the model is. In this case, it would be strings. How will you encode the input? And specifically, how will you break it up? Would you break up your sentence into words or into individual characters? And how will you encode each of those tokens into integers once you've broken them up? That is the process of tokenization.

We're actually given two lists: one list of positive strings and one list of negative strings. So we can imagine that the data processing we're doing in this problem will be used for a sentiment analysis model, which is just an AI model that can detect emotion within text. That's actually going to be the next problem we solve, and I highly recommend you solve that problem after this one. One thing I wanted to clarify is that in this set of problems, the goal is not parsing and processing data. Instead, we're going to focus on how these models actually work. But we do still need to do one or two problems on how to set up the data sets so we can feed them into the models for training and ultimately get the outputs we want. The problem tells us that the lexicographically smallest word should be represented as one, the second smallest as two, and so on. That's the rule we're going to use to encode each word as an integer. And in the final tensor that we return, we should list the positive encodings before the negative encodings, just because the positive input comes first. So we'll process them in that order.
Let's just make sure we understand the example. The first sentence is "Dogecoin to the moon." We can imagine we might be feeding these examples into some sort of sentiment analysis model that detects emotion in tweets. That's just one application of this kind of model. Maybe this fuels some sort of stock trading algorithm after our model has detected the emotion in, say, Elon Musk's tweets, and based on that we might buy or sell a certain stock. We can see that "Dogecoin" is encoded as one, "to" is encoded as seven, "the" is encoded as six, and "moon" is encoded as four. And there is actually a padding token: the zero will be used for padding, and it will be added to the end to ensure that the first row in this tensor is the same length as the row for the second sentence, which has one more word.

We don't actually have to implement this padding ourselves. There's a function in PyTorch that we can call. All it needs to take in is our list of variable-length tensors. So it takes a normal Python list of PyTorch tensors that each have variable length, and it will automatically pad them all to match the longest tensor in that list, so we end up with a rectangular, non-jagged tensor. The default padding value is zero. And we should set batch_first to true when calling this function, so that we end up with a tensor that is 2n by T, where T is the length of the longest sentence and n is the number of positive examples as well as the number of negative examples, so we have 2n in total. If we don't set batch_first to true, the returned tensor will actually be T by 2n, which is a bit weird. So we just need to set batch_first to true.
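A short sketch of that padding call. The integer encodings here are made up; the second sentence is one token longer, so the first gets a trailing zero.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two variable-length encoded sentences (integers are made up).
tensors = [torch.tensor([1, 7, 6, 4]),
           torch.tensor([2, 9, 5, 3, 8])]

# batch_first=True gives a 2n-by-T tensor; the default padding value is 0.
padded = pad_sequence(tensors, batch_first=True)
print(padded.shape)  # torch.Size([2, 5])
print(padded[0])     # tensor([1, 7, 6, 4, 0])
```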
So our goal here should be to construct a mapping: a dictionary that maps every single unique word in our input data set to an integer. How would we build this dictionary, given that we have to sort the words in lexicographic order? Well, if I had a sorted list of every unique word in our input data set, this would be really easy, because I could just iterate through it to construct the dictionary. At each index we would add one, since we want to reserve zero for padding: the lexicographically smallest word gets one, the second gets two, and so on. The values in this sorted list would be the keys in the dictionary, and the index plus one (since we would obviously be using zero indexing) would be the values. Once I have this mapping, to generate the final return data set, I would iterate through the two lists that are given, and for every word in each sentence, I would query this dictionary to get the encoded integer for that string and append it to the appropriate list for the corresponding sentence. Then we call the pad function at the end to make sure we end up with a 2n by T tensor.

So how do we actually build this dictionary? Earlier I said that we need the list of all the unique words in our data set, with all the repeated words removed, because a given word could appear more than once in some arbitrary list of strings that's given to us. If I had a list of all the unique words and I sorted that list, then we could construct the dictionary as I talked about earlier. So it's just going to come down to collecting all the unique words, eliminating repetitions in the input data set. A data structure that helps with that is a hash set, or just a set in Python. If I go through the input list of strings and add every single word to a set, then that set will contain all the words we need. We can then convert that set into a list, sort it, build the dictionary, query the dictionary for every word in the input, and return our final desired result.
So let's add all the words in our input strings to a set. We'll just call this our vocabulary. We're going to have to split up each sentence, which is a string, into words, and we can use .split() for that to get words separated by spaces. So we can say: for each sentence in positive, and then for each word in sentence.split(), which returns a list of all the words in that string, we call vocabulary.add(word). We do the exact same for the list of negative-emotion sentences: for each word in sentence.split(), vocabulary.add(word).

Then we can convert this set to a sorted list by calling sorted(), so that we get the sorted list we talked about earlier. Now we build our word_to_int dictionary by iterating through the sorted list: for i in range(len(sorted_list)), we map every word to an integer. The key is sorted_list[i], whatever word is at that index, and the corresponding value is i + 1, since we want the smallest word to have value one, the second smallest to have value two, and so on.

Now let's encode every sentence as a tensor of integers. This list of tensors will have size 2n, where n is the length of positive as well as the length of negative, and every element in this list will be a tensor. Then we'll call our padding function to ensure all the tensors have the same length. The way we'll code this up is to first convert every sentence to a list of integers and then call torch.tensor to convert each list into a tensor, since tensors aren't something you can dynamically append values to the way you can with lists in Python. So for each sentence in positive, and for each word in sentence.split(), we create a new list, cur_list, and append word_to_int[word] to it, getting the integer conversion for that word. Once we're done with that, we append torch.tensor(cur_list) to our tensors list; that's the tensor we want for that sentence. Then we do the exact same for negative.

Now we just need to pad our tensors, and that's exactly what we'll return: nn.utils.rnn.pad_sequence. All we have to do is pass in our list of tensors, where those tensors don't necessarily have the same length, so that's tensors, and set batch_first=True. This function uses zero as the padding value by default. And we're done, and we can see that the code works.
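The steps just described can be consolidated into one sketch. "Dogecoin to the moon" is the positive example from above; the negative sentence here is invented for illustration, so the exact integers it receives are an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Collect unique words in a set, sort them, map each word to index + 1
# (0 is reserved for padding), encode every sentence, and pad.
def tokenize(positive, negative):
    vocabulary = set()
    for sentence in positive + negative:
        for word in sentence.split():
            vocabulary.add(word)
    sorted_list = sorted(vocabulary)
    word_to_int = {sorted_list[i]: i + 1 for i in range(len(sorted_list))}
    tensors = []
    for sentence in positive + negative:
        tensors.append(torch.tensor([word_to_int[w] for w in sentence.split()]))
    return pad_sequence(tensors, batch_first=True)  # 2n x T, padded with 0

# Negative sentence is a made-up five-word example.
out = tokenize(["Dogecoin to the moon"], ["I will short Tesla today"])
print(out)
```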
If this was helpful, feel free to let me know. Next, I would highly recommend jumping into the sentiment analysis problem. With all the problems you've done so far in this series, you're finally ready to code up an AI model that can detect emotion.
Okay, sentiment analysis. This is actually a large area of study in the field of NLP, or natural language processing. Given some text that might be as long as a sentence, a paragraph, or maybe even an entire page, we would like to feed it into some kind of model. This is usually going to be a neural network; they tend to perform best on sentiment analysis. The model should ideally output whether the input text was positive or negative. If the text was something like "that movie was okay," that would probably lean more towards negative. If the text was something like "that movie changed my life," it would probably be positive. Sentiment analysis actually has a lot of applications. One example: we often want to apply sentiment analysis models to scraped tweets, because we know tweets are really influential in affecting the stock market. If we can set up some kind of pipeline that scrapes tweets and feeds them into our sentiment analysis model, then maybe we can get a gauge on where certain stocks are going to go. One of the most important concepts in this kind of neural network, in terms of how it actually works at the low level, is something called embeddings. Embeddings are used all the time in NLP; they're actually the first step in ChatGPT. So let's go and understand embeddings.
First, I'll give a high-level explanation of what embeddings are, and then we'll explain how they work at the lowest level. An embedding is a vector representation, learned through training, of every word or token in the total set of words the model could recognize. Say our sentence is "I loved that movie." As you may have seen in the NLP intro problem, before we even feed a sentence into a model, we have to associate each word with some kind of integer. This is arbitrary, but it does need to be consistent, so the model encodes strings as numbers. Let's just say "I" ends up becoming zero in our vocabulary, which might be, say, 500 words; "loved" becomes two; "that" becomes one; and "movie" becomes four. There could be hundreds and hundreds of words, but these were the mappings for those words. Say we feed this into the model: the vector 0, 2, 1, 4.

The first step in the model should actually be the model understanding the meaning of each of these tokens independently. The model needs to generate some sort of actual meaningful representation of each word. For each token, we need a vector that encapsulates its information, because the encoding 0, 2, 1, 4 is completely arbitrary: there's no actual meaning encoded in it, and it's not helping the model learn any kind of relationship. So for every single token, we want to associate some kind of vector with it, and this is going to be learned through training. These are weights that are learned through training.

You'll always choose your embedding dimension, which is the size of the vector we learn for each and every token. The higher this number is, the more complex a relationship our model can pick up on. Let's just say we chose an embedding dimension of two, just for a simple example; we'll use much higher numbers in actual models. For the token "I," let's say we learned that the vector that represents it should be something like this. And for the token "that," which ended up having integer value one, we learned something like -1 and 2.6. These are weights that will be learned and updated through training. They start off as completely random numbers, but over the course of training they get updated as we minimize the loss over some number of training iterations, and these embeddings actually start to make sense. After training, if you ever plot your trained embeddings (say the embedding dimension was two, with axes W1 and W2), what you'll find is that words that are similar in the language end up very close to each other. If you were to plot the embedding for the word "man" after training, and then plot the embedding for the word "woman," you might find that they're very close to each other. This shows that the model has actually learned some sort of relationship, some sort of meaning, for every single token.
So how does the embedding layer actually fit into our neural network? Our input is going to be something of size B by T. B is our batch size, or how many examples we're independently processing in parallel. T would essentially be the length of the longest sentence if we pad everything into one rectangular tensor. Say our first sentence is "I loved that movie" and our second sentence is "I hated that movie." Of course, we would not be passing in strings; we would be passing in the integer representations of these strings. But this might be our input: here B is 2 and T is 4, so this is our B by T tensor.

Then we have the embedding layer, nn.Embedding, declared in the constructor for the neural network. We'll explain how this actually works at the lowest level next, but for now, recognize that it has to output a B by T by embedding-dimension tensor, because for every single token, at every single time step in the sequence, we generate a vector of size embedding_dim. That will be a learned vector, trained through gradient descent, and it should encapsulate the meaning of the word; like we said earlier, if we plot it, it would even make sense. To clarify what that looks like in the code: if you check out the documentation, when you instantiate your nn.Embedding layer, there are two things you need to pass in. The first is your vocabulary size: how many different tokens or words in total does this layer need to learn representations for? The second is the size of that representation: how complex should the representation be that we learn for each token? That would essentially be the embedding dim, and that would be the second input.

If you look at the documentation for nn.Embedding, you'll see it referred to as a lookup table. What this layer is essentially doing, for every single token in our input, is looking it up in a table. You can think of the table as having vocab_size rows and embedding-dimension columns. The layer goes into the lookup table, finds the corresponding row for every token, and plucks out its feature vector, its embedding, to pass downstream into the later part of the neural network. So now let's explain what we actually mean by lookup table.
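A short sketch of that lookup-table behavior. The two constructor arguments are the ones just described; the sizes and token integers here are made up.

```python
import torch
import torch.nn as nn

# nn.Embedding(vocab_size, embedding_dim): a vocab_size x embedding_dim table.
vocab_size, embedding_dim = 6, 2
emb = nn.Embedding(vocab_size, embedding_dim)

# A B-by-T batch of token integers (B = 2 sentences, T = 4 tokens each).
tokens = torch.tensor([[2, 0, 1, 4],
                       [2, 5, 1, 4]])
vectors = emb(tokens)
print(vectors.shape)  # torch.Size([2, 4, 2]): B x T x embedding_dim

# Each output vector is literally a row of the table.
print(torch.equal(vectors[0, 0], emb.weight[2]))  # True
```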
We know at a high level what nn.Embedding takes in and what it outputs, but let's explain how it actually works at the lowest level. Let's pretend these are some of the words in our vocabulary, with a vocab size of six, so our integers range from 0 through 5. Say these are the encodings in our dictionary for this vocabulary: "I" maps to two, "loved" maps to zero, and so on. One way we could represent this input and feed it into a neural network is one-hot encoding. This one-hot input encodes the same information, and the size of this tensor is T by vocab_size. This isn't how we'll actually feed it in, but it is one way we could. Here we have the representation for "I": at the second index there is a one, and there is a zero everywhere else, indicating that this row represents "I." The next row has a one at the zeroth position and a zero everywhere else, so that is "loved." Following that pattern for the other rows, we see the same thing. So this is a one-hot encoding of the input.

Now let's say the lookup table is over here. Its size will be vocab_size by n_embed. Pretend the zeroth row of this tensor contains the feature vector for whatever is token zero in our vocabulary, which is "loved." The first row contains the learned, trained feature representation for the token with index one in our vocabulary, and so on all the way down. If you actually do the matrix multiplication and think about what it's doing, we're doing row times column: we take this row times this column, then the same row times the next column, all the way through the rest of the columns. The only nonzero entry in a one-hot row is a single one, so when we multiply that row with every column, we're only plucking out the entries of the corresponding row of the lookup table; everything else gets ignored by the zeros. So if the tokens correspond to row zero, row two, and row three of the table, those are exactly the rows we pluck out. The result of this matrix multiplication, T by vocab_size times vocab_size by n_embed, is T by n_embed, or T by embedding_dim. That means that for every token at every time step, we have plucked out, or generated, its feature vector. So this is just one way of plucking out the appropriate rows from the table, which is what the embedding layer needs to output.

If we look at the neural network I've drawn, the input layer has vocab_size neurons, and in every neuron we have either a one or a zero depending on whether that token is in our input; that's the same as the one-hot encodings. And here we have embedding_dim neurons, which is essentially out_features for a linear layer. This tells us that the nn.Embedding class, which we're going to treat with a little bit of abstraction and use as the first layer in our neural networks to generate the embeddings, behaves exactly like a linear layer applied to one-hot inputs: specifically, a linear layer without a bias, where in_features is vocab_size and out_features is the embedding dim.
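That equivalence can be checked directly. This sketch, with made-up sizes and token integers, compares the one-hot matrix multiplication against the embedding lookup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The embedding lookup equals a one-hot matrix times the weight table
# (i.e., a bias-free linear layer on one-hot inputs).
vocab_size, n_embed = 6, 2
emb = nn.Embedding(vocab_size, n_embed)

tokens = torch.tensor([2, 0, 1, 4])              # T token indices
one_hot = F.one_hot(tokens, vocab_size).float()  # T x vocab_size

via_matmul = one_hot @ emb.weight                # (T x vocab) @ (vocab x n_embed)
via_lookup = emb(tokens)                         # T x n_embed

print(torch.allclose(via_matmul, via_lookup))  # True
```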
So that should make sense; leave a comment if it didn't, and I can definitely explain it in a different way. nn.Embedding is effectively nn.Linear applied to one-hot inputs. All we actually have to feed in are the tokens. Say we had "I loved that movie," and "I" mapped to zero, "loved" mapped to two, "that" mapped to one, and "movie" mapped to four. All we pass in is that, and obviously we could have more than one example, so there could be multiple rows; it is B by T. Then we pass this into nn.Embedding. We treat it as a black box, our embedding layer, but on the inside, what this layer effectively does is convert the input into a one-hot encoding and do the matrix multiplication with the weight table we talked about earlier. So I just wanted to explain that: the embedding layer behaves like a linear layer on one-hot inputs, with weights that are learned, but we can treat it as just a lookup table that fetches the feature vector for every token.
Okay, let's go over the architecture one last time. Our embedding layer outputs a tensor of size B by T by C, where C is the embedding dimension. Then we do an averaging. This won't really be a layer defined in the constructor; it will be a function, torch.mean, called in the forward method, and it outputs a B by C tensor, getting rid of the time dimension by averaging along it. You can think of this as initially having a vector for every single time step in a sequence, but then we average them all together, so that for every single example along the batch dimension we have one vector of size embedding dimension summarizing, or encapsulating, the meaning of that entire sentence. Then we apply a linear layer, which will simply have a single neuron: out_features will be one, so we get a tensor of size B by 1. For every single element, we want a single number that can be interpreted as a gauge of how positive or negative that sentence is. And lastly, we apply a sigmoid layer, which gives us a number between 0 and 1 for every single input example in our training batch. This allows us to say that one is completely positive and zero is completely negative. And that's the architecture for sentiment analysis. I'd recommend trying to code it up now.
Okay. In this problem, we're going to solve sentiment analysis. Our task is to code up a neural network that can recognize positive or negative emotion in an input sentence. I'll assume that you're familiar with the idea of a neural network. We want to be able to feed in a sentence like "the movie was okay". So the input to this model is going to be some kind of sentence. It could be just one sentence, or it could be multiple sentences, but we know that it will be a string: "the movie was okay". And we actually want the model's prediction to be a number between zero and one. So for something like "the movie was okay", maybe it would output something like 0.5, or maybe 0.4 because it's slightly more on the negative side. We want to build a model that can actually detect and assess the emotion within an input sentence.
The problem says that this is actually an application of word embeddings. By this point we're familiar with the idea of neural networks: we have nodes, and these nodes are connected in a way that we have a bunch of numbers being multiplied with each other. Each of these can be thought of as a layer, this one being the input layer and the second one being, say, a hidden layer. It turns out that in ChatGPT's neural network, which we're going to work up to coding, the first layer is actually called an embedding layer. Embeddings are the core concept within this problem, and one of its main benefits is that it teaches us how to actually use word embeddings within a neural network. So, let's get into it. I would highly recommend checking out the detailed background video on word embeddings, but if you need a refresher, we're also going to explain it in this video. The problem tells us the model architecture to use; that's over here, and that's what is going to be defined in the constructor. Then we have to code up the forward method, which will return the model's prediction for some sort of input sentence. We'll come back to the model architecture later, but it essentially explains the layers of the neural network that we're going to use. We are told that we have to code up both the constructor and the forward method within this class. We do not want to actually train the model, and we do not want to code up the gradient descent loop.
what does the model actually take in as input? We're going to take in vocabulary
input? We're going to take in vocabulary size. Vocabulary size is going to be an
size. Vocabulary size is going to be an integer that represents the number of different words that the model should be able to recognize. And of course, we're going to explain embeddings in a bit more detail in a bit. But if you're
familiar with word embeddings, we know that embeddings are actually just a lookup table. For every possible word or
lookup table. For every possible word or token in our vocabulary, we want to be able to fetch or query the embedding the embedding vector that is actually a
learned and trained embedding vector. It
consists of weights that is learned through gradient descent. But regardless
of how we look at it, there are some fixed set of words the model should be able to recognize and that is essentially the number of rows in this lookup table. And of course, we are also
lookup table. And of course, we are also given a list of strings X each with negative emotion. So let's go ahead and
negative emotion. So let's go ahead and take a look at those actual examples.
The vocab size and X, the list of strings. If we take a look at these inputs, we can see that vocabulary size here is just a number: 170,000. This might seem a bit weird, but remember that vocabulary size is not the number of unique words in your actual input sentences; we'll describe what these numbers are in a second. Vocabulary size is just the number of words that your model should be able to recognize, so this is roughly the number of words in the English language. But we might actually have a very small subset of those words in the list of strings that we want the model's emotion prediction for.
It is highly recommended that you have solved the problem NLP Intro before solving this problem, but if you haven't, it's okay. Essentially, what that problem teaches us is what happens after we decide on our tokenizer. Our tokenizer in NLP is just how we're going to split up the words in a sentence. Let's say for simplicity that we're going to split up the sentence based on spaces, so we break it up into individual words and feed a list of words into the model. But after we've done our tokenization, we actually have to convert each word, each string, to some sort of number. These neural networks, these models, only understand numbers. To actually do all the matrix multiplications that make up a neural network, as well as calculate all the derivatives needed to optimize the model, the model has to be dealing with numbers as input, specifically just vectors and matrices of numbers. So for every single token, for every single word ("the", then "movie", then "was", then "okay"), we actually need to assign a consistent mapping between numbers and these strings. And I've already done that as I created the test cases for this problem.
So let's assume that this first sentence that is passed in is "the movie was okay" — a slightly negative, slightly neutral sentence. That's actually what's represented here as the first list within X: "the movie was okay". Because we don't want this to be a jagged tensor — we want it to be rectangular, with the number of columns in the first row matching the number of columns in the second row (we'll get to that second sentence in a bit) — you can see that after the word "okay", which is represented over here, we have just padded the rest of this row with zeros. The model will learn to just ignore the zeros, as we'll explain later. Then for the second sentence, we have "I don't think anyone should ever waste their money on this movie". This one is again split up word by word, with each string encoded as a number, and that is the second row within X represented over here. If you check the mappings between the strings and the numbers, you'll find that it is consistent.
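To make that concrete, here is a toy sketch of the tokenization and zero-padding just described; the word-to-number mapping is made up for illustration and does not match the actual test-case vocabulary:

```python
# Hypothetical mapping; the real test cases use a fixed ~170,000-word vocabulary
word_to_id = {"the": 1, "movie": 2, "was": 3, "okay": 4}

def encode(sentence, row_length):
    # Tokenize by splitting on spaces, map each word to its id,
    # then pad with zeros so every row has the same number of columns
    ids = [word_to_id[w] for w in sentence.split()]
    return ids + [0] * (row_length - len(ids))

row = encode("the movie was okay", 6)
assert row == [1, 2, 3, 4, 0, 0]
```

Every row in X ends up the same length, so the batch stacks into a rectangular B by T tensor.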
And we can see that for the first string within X, the first example, the model is supposed to output something like 0.5, because that's kind of a neutral sentence. We could say it's a bit negative, so maybe 0.4, but 0.5 does encapsulate how neutral that sentence is. This second sentence, though — the second row in X — is a strongly negative sentence. And we said that zero was the most negative a sentence could be; something like "I hated that movie. It was utter garbage." might be very close to zero. So we say that the model should output 0.1, something really close to zero and definitely far away from one, for that second example. So this is essentially the output for the forward method. And vocabulary size is something that the constructor receives; it's just an attribute of the neural network. Also, I just want to clarify that the outputs of 0.5 and 0.1 are there to help you understand the shape of the inputs and outputs. If you print your model's prediction after solving this problem and submitting the code, it won't exactly match these numbers, because we won't be training the model. The model's prediction before you do any training — when you have a random initialization of the weights and numbers inside the neural network — won't give you these nice predictions that actually detect the emotion. To do that, you have to run a training loop, and there is a separate problem for that in this playlist where you actually write the training loop. But don't worry: there's a Google Colab — just a notebook of Python code where you can click run on each cell and see the output — linked in the description for this problem, and it will use the exact code that you write here. There's no more code you have to write after you get your accepted solution on this platform. Once you solve the problem, you'll be able to see the model being trained on Colab. I have comments and text blocks explaining what each cell is doing, and you'll be able to see the model actually achieve this performance. We even have some more interesting examples to see that the model can learn to detect emotion, and we're going to use your solution code exactly in this Colab notebook. So definitely don't check out the Colab notebook until you solve the problem.
Now let's explain the model architecture. If the input into our model is B by T, where T is the number of words in each sentence, then the first step should be, for every single token, for every single word at each time step, to get the embedding vector for that word or token. A quick crash course on what embedding vectors are: it's essentially the model learning to represent the meaning of a word or a character (in this case, words) in numbers. Once we're actually done training the model, these embeddings will make sense. Let's say our embedding dimension was two: for every single word, we would learn a vector of size two, with two entries, and these numerical vectors are supposed to encapsulate the meaning of the word. What we would find after we're done training, once we have learned the right representations for each word, is that similar words end up close together when you graph them. Obviously, in the problem we're going to use dimension 16, which is not graphable, but let's assume it was two-dimensional, a dimension we can graph. Then you'll find that similar words are plotted next to each other once the training process is done. You might have "man" over here and "woman" over here, because these words are related in the English language and have some similarities in their meaning, but there is obviously a slight difference between them; the vectors are not entirely on top of each other.
So when we say we want to use an embedding layer of size 16, it means that in the constructor for this model, we need to define that lookup table that's eventually going to be trained. The number of rows in that table should just be the vocabulary size: for every single token in our vocabulary, we need to learn a row of that embedding table. So the table should be vocabulary size by whatever our embedding dimension is, and that's just a parameter that's up to us, as the designers of the neural network, to choose. In this case, you can see that two is the embedding dimension, but as we choose higher and higher numbers — 16, 32, 64, 128, and so on — the model learns a more and more complex representation and can understand language at a deeper level. So as we instantiate our embedding layer, we're just going to use nn.Embedding in our constructor and pass in the vocabulary size as well as the embedding dimension, which is 16. In the forward method for our model, we'll just call the forward method of this nn.Embedding instance, and that will get us the embeddings for every token. The output of that would be B by T by E, where E is the embedding dimension, because for every single token we would have an embedding vector. So that would actually be the first line of code in the constructor — declaring this nn.Embedding instance, where V is the vocabulary size — as well as the first line in the forward method, where we call the forward method of the embedding layer and get our B by T by E tensor.
However, for this series of vectors that we have for each sentence of length T, we have a bunch of vectors, each of size E. We need some way to combine or aggregate this information into a single vector, so the model takes into account the information from every single word in a sentence. The way we're going to do that is we're just going to average the embeddings across all the time steps, for each example across the batch dimension. To make that super clear, we want to take our B by T by E tensor and end up with a B by E tensor. You can think of that as: for every single batch element, we have a T by E tensor, T rows and E columns. What you want to do is compute the average of all the rows, because every single row is a time step with E columns, and we want to end up with one single vector of size E. So you take this row, this row, this row — all the rows — and average them together until you end up with one horizontal vector of size E. Then, for every single batch element, for each independent sentence or string that's passed into this model, we will have one vector of size E encapsulating that sentence and its meaning. That's then going to be used as input to the final linear layer of the model, which we'll get to in a bit, and which would then get us a single number for every single element across the batch dimension. In code, how we would do that in the forward method is take embedded, which is B by T by E, pass it into torch.mean, and say dim=1, because that's the dimension we want to collapse out, ending up with something that is B by E. This averaging is actually called the bag-of-words model in NLP. The reason it's called that is you can think of taking all these vectors, for every single word in a sentence, and just jumbling them all up together, mixing them, averaging them in a single bag, so to speak. You're not worrying about the fact that one word comes after another, not worrying about the sequential order of the words in the sentence; you're just taking all of them and averaging their embedding vectors into a single bag. And that's why we call it the bag of words.
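The averaging step is a single torch.mean call; this small sketch (with made-up shapes) confirms that the B by T by E tensor collapses to B by E:

```python
import torch

B, T, E = 2, 4, 16
embedded = torch.randn(B, T, E)            # one E-dim vector per token
averaged = torch.mean(embedded, dim=1)     # collapse the time dimension

assert averaged.shape == (B, E)
# Each output row is the average of that example's T token vectors
assert torch.allclose(averaged[0], embedded[0].mean(dim=0))
```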
So the next part in the neural network is to take these 16 neurons. Once we have our B by E tensor — B by 16 — we know that for every batch element (remember, we look at those independently, in parallel) there are 16 numbers: a feature vector of size 16. We want to collapse that into a single number, one number that encapsulates the model's prediction for that example. So this is one neuron, just one number. And we know that this neuron, if you're familiar with linear regression, should have weights W1 through W16 and, of course, that optional constant term or bias B. This linear layer is then going to learn the values of W1 through W16 as well as the bias, such that we can minimize the error and the model actually has a valid prediction for this one number. However, this number is not guaranteed to be between zero and one as we desired. That number is just some number: it could be negative, could be positive, could be greater than one, could be like negative 5,000 — it doesn't have any restrictions. And we want to squash the model's predictions into the range of 0 to 1. That's where the sigmoid function, which we used in previous problems like the handwritten digit classification problem, becomes very useful.
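As a quick illustration (the input numbers are chosen arbitrarily), sigmoid squashes any unbounded linear output strictly into the interval (0, 1):

```python
import torch

raw = torch.tensor([-20.0, -1.0, 0.0, 2.5])  # unrestricted linear-layer outputs
squashed = torch.sigmoid(raw)                # every value now strictly in (0, 1)

assert torch.all((squashed > 0) & (squashed < 1))
assert torch.isclose(squashed[2], torch.tensor(0.5))  # sigmoid(0) = 0.5
```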
So we will toss in an nn.Sigmoid call. We'll define it in the constructor and then call the forward method of nn.Sigmoid at the end of the forward method for this problem, such that we end up with — sorry, not a B by E tensor — a B by one tensor, where all those numbers are between zero and one. Okay, now let's jump into the code.
We know the first thing that we need to declare in the constructor is the embedding layer; that's the first layer of the neural network. So we can go ahead and say self.embedding_layer = nn.Embedding. The first thing that nn.Embedding requires us to pass in is the number of rows of this trainable table, which is going to be vocabulary size. And the size of each vector will be 16; that's the size of each trainable feature vector for each word. The next layer that we need — well, first we're going to do the averaging, but that occurs in the forward method — the next trainable layer is the linear layer. So we'll say self.linear = nn.Linear, and we know the input number of neurons is 16 and the output number of neurons is just one. And of course we need our sigmoid layer, so we can say nn.Sigmoid, and that's it for the constructor. Next we can move on to writing the forward method, also known as get_model_prediction. We know we need the embedded version of the input, so that would be self.embedding_layer(x); that's just calling the forward method of this embedding layer module, which is itself a subclass of nn.Module, meaning that it is a neural network model, or at least we can think of it as a trainable layer which would make up a neural network model. Then we can go ahead and average, because we want to get the B by embed_dim tensor, just as the comment says over here. So that would be averaged = torch.mean(embedded), and again, it's B by T by embed_dim and we want to squeeze out the T, so we say dim equals not zero, not two, but one. We also want to pass this into the linear layer, so we would say something like projected — and the reason this name makes sense is that you're kind of projecting this vector of size 16 down into a single number — so you can say self.linear of averaged. And lastly, of course, we need to get these between zero and one to interpret them as emotion, zero being completely negative and one being completely positive. So that would be predictions = self.sigmoid of projected, and this is what we actually want to return. We just need to round our answer to four decimal places, so let's return torch.round of predictions with decimals equals 4. And we're done, and we can see that the code works.
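Putting the dictated pieces together, the full solution looks roughly like this (layer and variable names are my own choices; as noted above, the untrained output is essentially random):

```python
import torch
import torch.nn as nn

class SentimentModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, 16)  # vocab_size x 16 table
        self.linear = nn.Linear(16, 1)                       # 16 features -> 1 number
        self.sigmoid = nn.Sigmoid()                          # squash into (0, 1)

    def forward(self, x):
        embedded = self.embedding_layer(x)      # B x T x 16
        averaged = torch.mean(embedded, dim=1)  # B x 16 (bag of words)
        projected = self.linear(averaged)       # B x 1
        predictions = self.sigmoid(projected)   # B x 1, between 0 and 1
        return torch.round(predictions, decimals=4)

model = SentimentModel(vocab_size=170_000)
out = model(torch.tensor([[1, 2, 3, 4, 0, 0]]))  # one zero-padded example
assert out.shape == (1, 1)
assert 0.0 <= out.item() <= 1.0
```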
Embeddings are a super important concept in natural language processing, and they're actually the first layer in the neural network for ChatGPT. I definitely recommend understanding the code for this problem, as well as the concepts behind it, super well. If you need to, you can refer back to the background video linked in the description; that's a whiteboarding video where I explain embeddings in more detail. Now that we've solved this problem, the next problems to jump into — part two in this series of problems — are all going to be building us up towards coding ChatGPT: building the data set to train a ChatGPT replica, as well as coding up the neural network layers themselves. So hopefully I'll see you soon.
In this video, we're going to solve this machine learning programming question. It's part of a collaborative project between my colleague Nab and me.
We'll go over the description and a basic test case on the left and solve the problem on the right. The problem is titled GPT Dataset. For large language models like ChatGPT, a very special kind of data set is used for training, and this programming problem asks us to build and return that data set. Let's go over the concepts behind this problem before jumping into the code. You've probably heard that LLMs are trained on the entire internet, and this is true. We can simply feed massive chunks of text into the model, and there's no need to label each sentence or paragraph. This is different from training, say, a sentiment analysis model that classifies each input sentence as positive or negative; the data set to train a sentiment analysis model would consist of sentences where every data point is labeled as positive or negative. So that's the good news: to train an LLM, we don't need to label any data. We can effectively just feed the raw text into the model, and during training, if we passed in this giant block of text, the model learns to predict the next word in the sequence for every possible context. So the model learns that after the context "cricket", the word "is" can come next. The model learns that after the context "cricket is", the word "a" can come next. The model learns that after the context "cricket is a", the phrase "bat and ball" can come next, and so on.
The way LLMs learn to read and write in a language is by memorizing all the likely sequences of words that can come up. They're effectively just large autocomplete models that can predict what word comes next in a sequence extremely well. When we actually prompt the LLM, it generates the response one word at a time, based on what's most likely to come next. Formally, we would say that the neural network is learning a probability distribution over the entire language. But don't worry: this video doesn't require a background in neural networks or probability. Okay, so why does this programming question even exist? Why can't we just feed the entire data set as one long sequence into the model for training? It's because of a limitation in the transformer architecture known as context length. To summarize, there's a maximum number of words that an LLM can remember or process at once; the LLM forgets, or more accurately cannot factor in, words that are outside of this window. The context length is a hyperparameter for training an LLM, meaning that we decide its value before training the model. GPT-4 has a context length of 32,000 words, and the code that we'll write at the end of this video will depend on the context length passed in. So at every iteration of training, we want to select a random sequence from the entire, very large training data set and have the model memorize the different autocomplete sequences within it. One more clarification: to make training more efficient, we don't just pass in one random sequence at every iteration; we actually pass in multiple of them. The batch size hyperparameter tells us how many sequences the model will learn from at each iteration. Those batch-size different sequences have nothing to do with each other; they're completely independent, and the model learns from them in parallel.
Okay, now we're ready to jump into the code. Small side note: if you're enjoying the video, it'd be great if you hit like so that YouTube can recommend more of these videos to you. Back to the code. This function would be called at every iteration of training to generate the batch. Let's take a quick look at the example test case. The raw data set, context length, and batch size are provided to us. We have to return X and Y, where both X and Y have length equal to the batch size of two. The first entry in X corresponds to the first entry in Y, and the second entry in X corresponds to the second entry in Y. The first random sequence chosen is "darkness my old". We know that "my" follows "darkness", "old" follows "my", and "friend" follows "old". The same logic applies to the second random sequence chosen. Okay, the first step is to split the input string into a list of words, since we're operating on a word level. Next, we want to randomly generate batch-size different starting indices. That way, we can simply grab the context-length words that follow from each starting index.
We'll use PyTorch's randint function. If you're looking to learn PyTorch, I have a couple of intro-to-PyTorch videos already uploaded, and I'll be releasing a more visual, animated tutorial in a few days. But the good news is that making this function call won't require significant PyTorch knowledge. We just need to specify the highest random number that could be chosen as a starting index. That should be the number of words minus the context length, and this range is exclusive of that number, so that we don't go out of bounds of the data set. So let's specify low and high, and also specify how many random numbers we want to select. Lastly, let's convert the returned tensor to a normal Python list. Next, we simply need to grab the sequences X and Y. We'll set up the lists, iterate over all the starting indices, index out a sequence to store in X, and index out a sequence to store in Y, which is effectively the same sequence, just shifted one unit over, since Y contains the words that need to be predicted via autocomplete. Append those to X and Y, and we're finished. Our code passes the test cases as well.
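Assembled from the steps above, a batch_loader sketch might look like this (the sample text is for illustration, and the real problem may expect token-id tensors rather than word lists):

```python
import torch

def batch_loader(raw_dataset, context_length, batch_size):
    words = raw_dataset.split()
    # Random start indices; high is exclusive, so sequences never run past the end
    starts = torch.randint(low=0,
                           high=len(words) - context_length,
                           size=(batch_size,)).tolist()
    X, Y = [], []
    for s in starts:
        X.append(words[s : s + context_length])
        Y.append(words[s + 1 : s + 1 + context_length])  # same sequence, shifted by one
    return X, Y

text = "Hello darkness my old friend I have come to talk with you again"
X, Y = batch_loader(text, context_length=3, batch_size=2)
assert len(X) == len(Y) == 2
# Y is X shifted one word over: the words the model must predict
assert all(x[1:] == y[:-1] for x, y in zip(X, Y))
```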
And that's the explanation and solution to GPT Dataset. If you made it to the end of the video, you might enjoy the ML community; just drop your email at the link in the description to learn more about it and receive more free ML resources. I hope you found this programming question useful, and I'll see you soon.
Okay, let's solve GPT Dataset. Before we start building the actual GPT class and talking about how we can generate text from transformers, we should do one problem where we prep the data set that's actually used to train GPTs. You may have heard that ChatGPT was trained on the entire internet, where the entire internet can be thought of as some body of text. In this problem, we're going to break down what that means in code: how we can take some giant body of text and, from it, create examples for the model to keep learning to predict the next token based on different contexts. These chatbots, these large language models — the way they work is they keep predicting the next token, which might be the next character, the next word, or the next subword. They keep doing this over and over again until they have some big paragraph that they've given back to you. What they're really good at is predicting the next token in a sequence: given some words, some incomplete sequence, they can complete it. It's like autocomplete on steroids. But the way they actually learn to do that — or rather, the data set that helps them do that — is what we're going to learn in this problem. We're going to write a function called batch_loader which is going to set up a batch of training examples, and that tensor needs to be of size batch size by context length. We also need the appropriate labels for this data set, so that during training we can calculate the loss, or the error, between the model's predictions and the correct labels, the true answers. We will explain what context length and batch size are soon.
And there's just an implementation tip on what function we're supposed to use.
So we're given some string, which is just the raw dataset, before we do any processing on it to actually create the examples. We're also given something called context length: how many tokens back the model can factor into its response to you. How far back is it taking into account, how many tokens back can it quote-unquote read? This is also going to be the length of each training example. Each training example we create, which will be a substring from this giant body of text, is going to be of length capital T, which represents the context size. And then how many sequences, or independent examples, do we want to generate? That's just the batch size. And we just need to return X and Y.

So let's look at an example now. The way we actually get our data points X is we pick batch size, or I'll just refer to that as capital B, this many random different starting indices for our substrings of length capital T. All those starting indices just need to be valid starting indices inside our raw dataset. And when I say valid starting indices, they need to be far enough left in the dataset such that you can actually extend capital T tokens to the right. So we're going to develop those starting indices first. And that's
going to then explain our example over here. So we have "hello darkness my old friend" as a string, capital T = 3, batch size = 2. Let's say the first random index we choose is one. That corresponds to "darkness" being the starting token, and with a context length of three, we would take "darkness my old". So that ends up being our first example in X. The way Y works is that Y contains the tokens that the model is supposed to predict, the right answers. When we have a sequence, it turns out there's a bunch of training examples within it. If we have "darkness my old", well, the model can learn to predict what comes next given a context of just "darkness", and we know "my" comes next. The model can also learn that if you have "darkness my", then "old" comes next. And the model also needs to learn that given this entire sequence "darkness my old", "friend" should come next. That's exactly what you'll find in the corresponding index for Y. The first label, the first right token that the model needs to learn to predict, is "my"; that's given just "darkness". Given "darkness my", we need to predict "old". And given "darkness my old", we need to predict "friend".
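To make the shift concrete, here's a tiny sketch of that example in plain Python (the variable names are just for illustration):

```python
# The example above: T = 3, and the first random starting index is 1.
words = "hello darkness my old friend".split()

T = 3          # context length
start = 1      # starting index chosen for this example

x = words[start : start + T]           # input sequence
y = words[start + 1 : start + 1 + T]   # labels: the same sequence shifted by one

print(x)  # ['darkness', 'my', 'old']
print(y)  # ['my', 'old', 'friend']
```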
The second random index that's chosen is actually index zero, so that would be "hello" as our starting token. And if you follow that same reasoning, then this example should also make sense. So our goal here is just to prep this dataset and specifically return these two tensors X and Y, and both of them essentially contain strings. It's highly recommended that you solve the NLP intro problem before this one, because that one actually explains the encoding step: before feeding this into the model for training, you would encode each word as an integer. You wouldn't directly pass strings into the model. But in this problem, it's okay to just return strings.

So in the code, we're going to start off by splitting up our raw dataset into words, since we know we're going to be taking substrings, or sublists, to get the words in our output tensors. The main thing we need to explain is how to actually use torch.randint. The lowest index is going to be zero. For size, we can pass in a tuple, which is just (batch_size,), since we essentially just want a vector, a tensor of length B, because we want B different random numbers. And we need to figure out what high should be. This upper bound is actually exclusive: it will not include the number passed for high; that won't be a valid random number. That's the case for a lot of Python functions. What we actually want to use is just the number of words minus capital T. Imagine capital T is one; then this is the number of words minus one, essentially the final index. But since high is exclusive, we would not include the final index as a valid starting index, because we want to still have capital T remaining tokens left after the starting index so that we can actually have the complete Y, right? Y contains our labeled answers. So if the starting index was the last index in our list of words, that wouldn't make sense, because there's no word after that for the model to learn to predict.
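As a quick sketch of that bounds argument (assuming torch is available; the variable names are illustrative):

```python
import torch

words = "hello darkness my old friend".split()
T = 3            # context length (capital T)
batch_size = 2   # capital B

# high is exclusive, so sampled indices fall in 0 .. len(words) - T - 1,
# leaving room for the label sequence Y that is shifted one token right.
indices = torch.randint(low=0, high=len(words) - T, size=(batch_size,))

print(indices.shape)  # torch.Size([2])
```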
Once we have our list of words, it just comes down to taking the right sublists. Whatever our starting index is (and of course this is different for X versus Y: Y has a starting index that is one greater, since we start predicting the next token), we just need to go all the way to the start index plus T.

So let's jump into the code. Let's start off by generating our random indices like the starter code tells us to. We can say something like indices. We're going to use torch.randint. We can say that indices is torch.randint, and we just want to set low equal to zero. We want to set high equal to however many words we have, right? So maybe we're going to do some sort of split and call that the words, minus the context length. And then we know the size can simply be a tuple, which is just of size batch size. So that's indices, but we need to know how many words we have; we need to actually get that list of words. So we can say words is raw_dataset.split(). By default, we split on whitespace, which breaks that string up into a list of words.

And now we can actually get X and Y. We can say X is going to be some sort of list, then Y is going to be some sort of list. And then we can say: for each index in indices, we generate each of our batch size different examples. The length of x will be batch size and the length of y will be batch size. So x.append: we can just take a sublist of words from index to index plus context length, and that's it for that entry in x. And then similarly in y, we want to start predicting the next token, so we'd say index + 1 all the way to index + 1 plus, again, context length. And that's it. We simply return a tuple of x and y, and we're done. And we can see that it works. So
this problem teaches us that training a transformer, training a language model, is just about generating a dataset where the model can keep learning to predict the next token in the sequence. So now that we know what the inputs and outputs are, let's start jumping into the neural network architecture for transformers. The next problem is on self-attention: a very complicated problem at first, although we'll definitely break it down and make it fairly intuitive, but it's definitely the crux of how transformers work.
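Before moving on, the batch loader we just walked through can be sketched end to end. This is a minimal version assuming word-level tokens and that returning lists of strings is acceptable, as the problem allows; the real starter code's signature may differ:

```python
import torch

def batch_loader(raw_dataset: str, context_length: int, batch_size: int):
    """Return batch_size input sequences X and their shifted-by-one labels Y."""
    words = raw_dataset.split()
    # high is exclusive, so every sampled index leaves room for the labels
    indices = torch.randint(low=0, high=len(words) - context_length,
                            size=(batch_size,))
    X, Y = [], []
    for idx in indices:
        i = int(idx)
        X.append(words[i : i + context_length])          # input sequence
        Y.append(words[i + 1 : i + 1 + context_length])  # next-token labels
    return X, Y

X, Y = batch_loader("hello darkness my old friend", context_length=3, batch_size=2)
```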
Chatbots in recent years have been all right, but now we have ChatGPT. So, what makes these newer models so effective at reading and writing like humans? It's a concept called self-attention. A Google search of this topic gives some confusing diagrams and equations, so let's break that down. I'll take the first few minutes of the video to give a high-level overview, and then after that, I'll make a second pass through the explanation, but the second time I'll add more of the math and dive even deeper.

So, let's consider these LLMs or GPTs as a black box. We know they take in some sequence of words, like an instruction or a question, and they output the response. But the response is actually generated word by word: the next word is generated, concatenated, the model is then called again, and so on. And inside this black box, the model is doing a ton of math to ultimately make its prediction for the next word in the sequence.

To get some inspiration for teaching computers to read, let's think about how humans read. Do we read each word completely independently, not considering its relation to the words that came before it? No. We read word by word, and each word has some relationship with the words that came before it. When we get to a noun, we realize it's the subject being described by the adjective that came before it. Or when we get to a question mark, we realize that a question was being asked based on a previous word like how, why, or what. We subconsciously consider the relationships between the different words in a sentence in order to totally understand it, rather than looking at words independently.

But how does a model do this? For an input sequence of length T, so T different words, the model calculates a T-by-T matrix of attention scores, where each entry signifies how strongly associated two words are with each other. For example, the word "how" can be used in the sentence "how are you" or the sentence "this is how you write". The word "how" has a slightly different meaning depending on the context, and we see a relatively high score for the connection between "how" and the question mark, indicating that "how" is used a particular way in this sentence. Once the model has this matrix, it knows how important each word is to the previous words, and it can do some calculations to predict the next word.

But here's the crazy part. I know it can be really annoying when people treat training as a black box, but this video would simply be too long if we also went into how training works. So for a 5-minute intro to training, check out the video in the top right. For now, we can just treat training, or learning, as some iterative process where the model gets better at its task of predicting the next word. But the crazy part is that during training, the model develops some complex math formula to calculate the entries of this matrix for the future sentences that will be passed in. The formula is learned from the data it's trained on, which is typically a massive body of text like Wikipedia fed into the model during training. So that's the high-level overview of attention.
The model learns to calculate a number signifying the affinity of every pair of words, which helps to accurately predict what comes next. Now let's dive deeper and make a second pass through this explanation. This time we won't leave out the math involved in calculating this really important attention matrix.

The main idea is that we want the words in a sentence to talk to each other and figure out which ones they should associate with. For example, we want the adjectives and the nouns they describe to seek each other out and associate, ultimately resulting in a high attention score. The way we do this is by having every word emit two vectors: a key and a query. I've only shown it for two words here, but every word emits its own key and query. A word's query vector represents what it's searching for, or querying for. And a word's key vector represents what it has to offer, the information it actually stores or encodes. If a noun is searching for the adjective that describes it, then its query might align with the adjective's key. Then we just take the dot product between every word's query and every other word's key, and those values populate the attention matrix. Remember that the dot product between two vectors is a measure of similarity: the higher the output, the closer two vectors are to each other. But how do you actually calculate the query and key vectors for every word? Linear regression. A simple single-layer neural network is used.
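A minimal sketch of that idea in PyTorch (the sizes here are made up for illustration): two linear layers emit a query and a key for every word, and the query-key dot products fill the T-by-T matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, C = 4, 8                  # 4 words, embedding dim 8 (illustrative)
emb = torch.randn(T, C)      # one embedding vector per word

query_net = nn.Linear(C, C, bias=False)  # emits what each word searches for
key_net = nn.Linear(C, C, bias=False)    # emits what each word offers

Q = query_net(emb)           # (T, C) queries
K = key_net(emb)             # (T, C) keys

scores = Q @ K.T             # every query dotted with every key: (T, T)
print(scores.shape)          # torch.Size([4, 4])
```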
Before a sequence of words is passed into the attention layer, we have every word represented as an embedding vector. So we have one linear network that takes in the embeddings for each word and outputs the query vectors, and another linear network that also takes in the embeddings but outputs the key vectors. Remember that each of the nodes in a neural network is just performing a linear regression based on this equation. Each node operates independently of the others and learns its weights through training. For a 10-minute refresher on neural networks, just check out this video.

So, now we've calculated the attention matrix. This T-by-T matrix is actually the crux of how LLMs read and write. There is one part still left to discuss, though. In addition to key and query vectors, another linear network calculates a value vector that is emitted by every word. We then multiply the attention matrix by the matrix of value vectors for every word, and this is the actual output of the attention layer. If the key is what information a word has or encodes, let's think of the value as what's actually relevant for the word to share, what it actually exposes. So let's see what happens in this matrix multiplication. This row is multiplied by this column to yield this value, this row is multiplied by this column to yield this value, and so on. We can see that we're taking a weighted average of the words from the past to end up with a new and transformed vector for this word. And we end up with a new and transformed vector for every word. So the ultimate output of the attention layer is actually the model's refined interpretation of the meaning of every word, far more nuanced than the crude embeddings that were passed in. The model has factored in the context of neighboring words to generate a new representation of each word, ultimately culminating in the model accurately predicting the next word. If you found this video helpful and are interested in more videos breaking down the transformer, leave a comment, and I'll see you soon.

Okay, let's explain attention, and specifically self-attention. So here we have a really complicated-looking diagram, which you may have seen before; it's called the transformer architecture. It's essentially the neural network architecture that was used for ChatGPT.
Specifically, we don't worry about the left part, which is called the encoder; we only use the right part of this architecture, which is called the decoder. And here we can see the different layers in this neural network. We can see that embeddings are used in this neural network; that's a really big part of ChatGPT. Then we also have linear layers, obviously, those traditional feed-forward neural networks. There's also another one over here. The distinction between this linear layer and the feed-forward is that the feed-forward includes nonlinearities like ReLU and sigmoid, right? But actually, one of the most important parts of this architecture, the part that makes ChatGPT and modern-day chatbots so effective, way more effective than chatbots in previous years, is the concept of attention. And attention is one of the most important topics in NLP right now. So, let's go ahead and explain exactly how it works.

So, what is the point of attention? Let's say we're working with some chatbot and we pass in something like "write me a poem". This is kind of a complicated instruction for a computer to parse, but we know that ChatGPT can actually respond really well to this. Well, the input will be something like B by T, right? Let's just say we're passing in only one sentence, one example, right now. So B is one, and T will be the number of tokens. If we're breaking things up on a word level, then T equals 4 here. The first step would actually be the embedding layer, right? This was actually the first step in, say, our sentiment analysis model. And as you may have seen in the transformer diagram previously, the first step is the embedding layer. So for every token, right, for every word, every time step, however you want to think about it, we are going to generate the embedding vector, and that will be of size C, or embedding dim; those might be the symbols used for that. So this is what I've represented over here.
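As a small sketch of that first step (with a made-up vocabulary size and made-up token ids):

```python
import torch
import torch.nn as nn

B, T, C = 1, 4, 2            # one sentence, 4 tokens, embedding dim C = 2
vocab_size = 10              # hypothetical vocabulary size

embedding = nn.Embedding(vocab_size, C)

# "write me a poem" already encoded as integer ids (a made-up mapping)
tokens = torch.tensor([[5, 1, 0, 7]])   # shape (B, T)

emb = embedding(tokens)
print(emb.shape)             # torch.Size([1, 4, 2]), i.e. B by T by C
```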
If B is one, right, then we don't have to worry about B by T by C; let's just say it's T by C. Then in the first row we have the embedding for the token "write", and that's some vector. In the second row we have the embedding for the token "me", and that's some vector. But how does the model actually combine or aggregate all this information to get a sense of what the instruction or the task is? How does the model understand the relationships between tokens? Because there are actually many pairs of relationships that are important within this task, this command that the chatbot is given. For example, what is the model supposed to write? It's supposed to write a poem, not a movie or a book. There are many pairs of relationships between tokens that the model needs to take into account. And the model somehow needs to aggregate or combine all these embeddings together, because right now these are all separate, independent rows. Each row in this tensor is a time step, right? But we need to somehow aggregate and combine them together.

Well, the simplest way to do that is just by averaging. Let's say we have a sequence of tokens: we have "write" at the zeroth time step, and then "me", and then "a", and then "poem". "Write" is the first token, so let's say it doesn't get averaged with anything. But then "me", well, "me" is the second token. So, to get a more complex representation that actually factors in the sequential nature of this sentence, or at least the spatial relationships that are going on, let's say we average "me" and "write" together. It's kind of like a running average. And then we have "a", so "a" would get averaged with "write", "me", and "a". And then finally "poem". Well, "poem" would be averaged with itself and everything that came before it, so the representation for "poem" could then be the representation for the whole sentence averaged together. This can work, and we can get decent results in neural networks with this, but not the greatest results. This might remind you of what we did in the sentiment analysis problem, where we just averaged our embeddings together. We just did a simple average; it's a simple aggregation. But we actually want to develop more and more complex aggregations. So before we can do that, let's first start to think of this simple aggregation we did as a matrix multiplication.

So here we have our T by C, right? You can think of each row in this tensor as being the embedding for a token. And I'm just following row-column notation here.
So this is row one, column one, right? Row one, column two. That's the subscript notation there. And this simple running weighted average that we just talked about, you can think of it as a matrix multiplication. So imagine that we left-multiply. On the left we have this tensor that is T by T. T is four in this case, right? If it's "write me a poem", that's four tokens. So we have a T by T tensor that is going to multiply against the T by C tensor, and this is going to help us end up with a T by C tensor again. And in every row, what we're going to have, at least at the entries that are nonzero, is just one divided by the number of tokens. Everything that comes after the first token is zero for the first row, everything that comes after the first two tokens is zero for the second row, and so on.

And once we actually do this matrix multiplication, remember, we're doing this row times this column to get this number, right? And then similarly, we would do the first row times this column to get this number, and similarly we would be doing this row times this column to get this number. Once we actually do that, let's see what we get. Initially, just for the first row, which deals with the first time step, the first token, it looks like it doesn't get averaged or added with anything else; it just gets scaled by a factor of 1/4 for both of the entries in the two embedding dimensions we have. Then for the second time step, we can see that we're essentially taking 1/4 times what came before it and also 1/4 times what we currently have. And we can see the same thing being done here: 1/4 of E12, which is essentially what came before it, and then we add in 1/4 times E22, which is exactly what was there. And this is essentially that running weighted average we were talking about earlier. Because what is an average, right? You're adding everything up and you're dividing by the number of examples. But if your number of examples is essentially constant, you can pull that out. The four that you would be dividing by, if you were adding everything up and then dividing by four, which is what a running average would be doing, we can essentially just distribute that 1/4 to each number that's being averaged.

So just to make this super clear: here we have 1/4 times E11. Here we have 1/4 times E11 plus 1/4 times E21, so that's kind of averaging "write" and "me". And then here we have 1/4 times E11 plus 1/4 times E21 plus 1/4 times E31, so this is kind of the average of "write me a". As we get more words and move further along into the sentence, more and more context gets taken into account. And this is being done for this column over here, and a similar thing is being done in this column over here, because remember, in this case C equals 2; the embedding dimension is two, but we are averaging along the time dimension. So this is essentially nothing more than what we already talked about over here. We talked about how we want some way for the model to know what's important to pay attention to and what the relationships are between the tokens. So every token gets averaged with itself and what came before it. "Write" gets averaged with itself and what came before it, which is nothing. "Me" gets averaged with itself and everything that came before it, which is just "write", and so on. And that can actually be represented as a matrix multiplication, where we just take the T by C embeddings and we multiply on the left with this T by T tensor.
But the fact that this tensor, which contains our weights for this weighted average or aggregation, is T by T actually means something. It means that for row i, column j inside this tensor, we can interpret that entry as the strength, or affinity, or score, as you may hear, between token i and token j. So we can think of the rows of this T-by-T matrix as corresponding to "write", then "me", then "a", then "poem", and similarly the columns correspond to "write", "me", "a", "poem". So if I were to look at this number right here, well, that's the row for "a" and the column for "me". So this 1/4 over here that I'm circling should actually be a number that symbolizes the strength of the connection, the relationship, between the words "me" and "a" in the input sentence. But that doesn't seem like a particularly important connection, right? What's probably more important is the connection between the tokens "write" and "poem", because what are we writing? A poem. The model needs to understand that there is a strong relationship between these two tokens. So ideally, we would actually want this number over here to be very high, because this row is for "poem" and this column is for "write".

So we have this T by T tensor, and we interpret row i, column j as being the strength of the connection between the token for row i and the token for column j. But we want to actually have a weighted average. This is just a simple average; everything is just 1/4. We want a weighted average so the model can actually pay attention to some tokens more than others. Because in a given sentence, "write me a poem", the word "a" isn't really that important to pay attention to. Maybe it's somewhat related to the word "poem", because the model needs to output only one poem instead of multiple poems, as opposed to if we said "write me poems". So the word "a" does have some significance in that command we're giving the chatbot, but it's not the most important thing to pay attention to. We don't want to weigh everything equally with this whole 1/4, or 1 over T, situation. Instead, what we want is a weighted average. But how does the model, for any arbitrary command that it's given, any arbitrary sequence of tokens, actually learn what the weights inside this T-by-T tensor should be? That's what the self-attention layer accomplishes. And there is a bit of a complex mathematical formula, but we'll break down exactly how it works. So
just to make it super clear what the whole point of this self-attention layer in a neural network is supposed to accomplish: it's supposed to come up with weights. Specifically, it's supposed to come up with a T-by-T tensor of attention weights, or attention scores, where we can figure out how important each token is to every other token. So I went ahead and crossed out the future tokens. So for "how": "how" is right here in the first row. "How" shouldn't be able to look at any of the future tokens. We only want to look at the current token and what came before it to figure out what comes next, right?
And I didn't exactly make this explicit earlier, but how these language models like chatbot work, chatbots work is continuously prediction. What's going to
continuously prediction. What's going to come next in the sequence? We'll talk
more about that later, but let's focus on trying to understand this T by Tensor for now. One number that we can see that
for now. One number that we can see that is particularly high is for the row corresponding to the question mark and the column corresponding to high. We can
see that we have a number 73 there which is relatively high. So why would that number be relatively high in the t byt tensor for this particular input. Well
the fact that how is associated with the question mark means that we're we're actually asking a question right. The
word how can be used in various contexts in English. We can say that's how that
in English. We can say that's how that works. that's just kind of declaring
works. that's just kind of declaring something. It's stating some kind of
something. It's stating some kind of information. But if how is associated
information. But if how is associated with the question mark, then the model can learn that the word how is being used to form a question here. So the
goal is to actually have a layer in our neural network that given our embeddings for every single token. We can then generate a T byT tensor that tells us for every pair of tokens how important
they are to each other. And we actually want that to be like trainable and learnable through gradient descent based on our training data. And then we can use that T byt tensor multiply it against the embeddings like we did
previously. Right? What we did is we
previously. Right? What we did is we took our T byt tensor and we matrix multiplied it against our T by C embeddings. That's what we want to do.
embeddings. That's what we want to do.
And then our output can then be kind of sent on and forwarded through later parts of the neural network. So now
let's try to understand how this T by T tensor is actually generated. What's the formula for this layer? There's kind of an ugly-looking formula for self-attention. If you just Google it, here it is: the T by T scores are softmax(Q K^T / sqrt(d_k)), and the layer output is those scores times something called V, which plays a similar role to the thing we multiplied the scores by previously. The main things we need to talk about here are: what in the world is Q, what in the world is K, why are we transposing K and multiplying them together, and what is V? d_k is also something we'll talk about; it's very similar to the embedding dimension. And the softmax is pretty straightforward. It's kind of like the sigmoid function, in the sense that it squeezes our values to be between 0 and 1. If we look back at the previous picture, those values weren't constrained to any range, and we want to normalize them into a fixed range where the highest value is one and the lowest is zero, while the relative scale, the ratios, are still kept. That's what softmax does. But the main things we really need to explain are what Q is and what K is, because those are the heart of self-attention. So let's break that down and make it really intuitive.
So let's explain the Q in that formula. Q stands for query. In our attention model, every single token in the sentence is going to be able to, quote unquote, talk to the other tokens and communicate with them until they pair up and we figure out the right T by T scores. The way we do that is by having every token emit a vector called a query. I've only drawn it for "write" and "poem" right now, but every token emits one. Just as the word "query" suggests, this vector contains information representing what that token is searching for. For example, if we were using a character-level language model and our token was the letter Q, then that token might emit a vector representing what it's searching for, and we know that in English, U almost always follows Q. So the query that the token Q emits might contain numbers similar to the information that U has, and Q would then match up with U, because U emits its own information as well. Every token emits this query vector, and it will be of size attention dim, which is a parameter specified to you. That can be represented as a linear layer: it has embedding dim as its input number of features, because before we even generate the queries, every token is represented by a vector of size embedding dim. The linear layer then changes the dimension to attention dim, so that every token in the sequence emits a query of size attention dim representing what that token is looking for.
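As a concrete sketch of that query layer (the sizes and variable names here are made up for illustration, not taken from the course's code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, embedding_dim, attention_dim = 4, 16, 8  # hypothetical sizes
x = torch.randn(T, embedding_dim)           # one embedding vector per token

# A linear layer maps each token's embedding to its query vector.
query_layer = nn.Linear(embedding_dim, attention_dim)
q = query_layer(x)                          # shape: (T, attention_dim)
print(q.shape)  # torch.Size([4, 8])
```

Every row of `q` is one token's query: what that token is searching for.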
So if every token emits a query vector representing what it's looking for, every token should also emit a key vector representing the information that it has. Then we can match up the queries with the keys, the tokens can pair up with the tokens that are actually relevant to them, and the T by T scores get learned. So every token is also going to emit a key vector of size attention dim. Again, I've only drawn this for "write" and "poem" right now, but it will be emitted for every single token, just as the queries were, and it's also generated using a linear layer, which of course has those trainable W's and biases. Then what we do is take the query matrix, which is T by A, because for every token at every time step we have a query vector of size A, the attention dim, and multiply it by K transpose. Instead of being T by A, the transposed key matrix is A by T, and it holds the key vector for every token. So here each row is a query, and here each column, because we transposed the matrix, meaning we flipped its rows and columns and essentially put the matrix on its side, is a key: the key for "write," the key for "me," the key for "a," the key for "poem." When we think about what this matrix multiplication is doing: the query for "write" gets dot-producted with the key for "write," then the query for "write" gets dot-producted with the key for "me," then with the key for "a," then with the key for "poem." And the same thing happens for every other query: all the queries and keys get dot-producted.
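Putting the query and key layers together, the Q times K-transpose product can be sketched like this (sizes are again hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, E, A = 4, 16, 8                      # tokens, embedding dim, attention dim
x = torch.randn(T, E)

query_layer = nn.Linear(E, A)
key_layer = nn.Linear(E, A)

q = query_layer(x)                      # (T, A): what each token is looking for
k = key_layer(x)                        # (T, A): what each token has

# (T, A) @ (A, T) -> (T, T): entry [i, j] is the dot product of
# token i's query with token j's key.
scores = q @ k.transpose(-2, -1)
print(scores.shape)  # torch.Size([4, 4])
```

The resulting `scores` tensor is exactly the T by T tensor of pairwise importances we've been after.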
And if you recall the dot product operation from linear algebra, it's a measure of how similar two vectors are to each other. I'll have a separate explanation for that right after this clip, which you can jump to. But remember, every token is emitting a query and every token is emitting a key. If we dot-product them, and the dot product is a gauge of how similar two vectors are, then the tokens whose queries and keys match up, maybe the query for "write" matches up with the key for "poem," or, using the character example from earlier, the query for Q matches up with the key for U, are important to each other, and we'd see a higher number coming out of the dot product. And if we check the shape of the tensor after this multiplication, T by A multiplied by A by T gives us exactly the T by T tensor we were looking for. So Q times K transpose, where Q is the output of the trained query linear layer and K is the output of the trained key linear layer, gives us the T by T attention scores we were looking for all along. So let's give a quick explanation of why the dot product represents how similar two vectors are. Remember, when we do that matrix multiplication, we're computing dot products: this row times this column. That ends up being this 1 times this 0, plus this 0 times this 1. For these two vectors, the dot product ends up being zero, and if you were to plot them, you'd see they're completely perpendicular to each other, not pointing in the same direction at all; we'd say they're completely orthogonal. That's different from these two vectors over here. Let's imagine this is the query for one token and this is the key for another, and we're trying to figure out how similar they are. These are at least partially pointing in the same direction; they're not completely orthogonal, not pointing away from each other like the previous pair. So if we take, say, (3, 2) dot-producted with (2, 3), we get 12, which is much farther from zero, indicating that these two vectors do share some kind of similarity. So that's why we compute the query times K transpose, and that's why it gives us the T by T scores.
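The two little dot products from this explanation, checked in code:

```python
import torch

# Perpendicular vectors: no shared direction, dot product is zero.
a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.0, 1.0])
print(torch.dot(a, b).item())  # 0.0

# Vectors pointing in a similar direction: large positive dot product.
c = torch.tensor([3.0, 2.0])
d = torch.tensor([2.0, 3.0])
print(torch.dot(c, d).item())  # 12.0
```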
Okay, so we just have one or two things left to explain in our T by T tensor. The next step, if you remember the attention formula I showed at the start, is to apply softmax to it. You can think of softmax as a multi-dimensional sigmoid: it squashes everything to be between 0 and 1, and we can see that being done over here for each of these values. The way the formula works, it exponentiates every value, raising e to the power of each entry, which makes everything positive, and then divides by the sum of the entire vector. So not only does it squash everything to be between 0 and 1, it also makes everything sum to one. On our T by T tensor of attention scores, when we apply softmax, every row ends up between zero and one, and every row sums to one. So for a given row, which corresponds to a given token, with the columns going left to right, if those values sum to one, we can think of them as scores, or even probabilities, that each token might be relevant to the token corresponding to that row. Again, the mathematical details of softmax aren't super important here, but it is nice that it normalizes our values and squashes everything to be between 0 and 1, instead of leaving arbitrary numbers like the 73 and 96 we saw in the previous tensor, with a totally arbitrary range. So once we have those normalized T by T scores, to actually get the output of this attention layer, why do we multiply by something called V, instead of just multiplying by the input like we originally did in the example with all the 1/4s in the matrix? Well, in addition to emitting a query and a key, every token is also going to emit a value: another vector of size attention dim, learned and trained with a linear layer. The reason we do this is to add another level of expressiveness to the model. If the query is what the token is searching for, and the key is what the token actually has, then the value is what the token is actually willing to share. There are various pieces of information associated with every token, and those live in the key; but the value, not the Q, not the K, but the V, says: what information is actually relevant, what do I actually want to emit and share? We don't necessarily want to share the entire raw input, as we would if we multiplied the T by T scores by the input directly. Instead we multiply by V, where V is learned, so the model can learn, for every token, what information is actually relevant to share with the other tokens. So: what the token is looking for is the query; what information the token has, which should match up with the queries of other tokens, is the key; and what information it's actually willing to share is the value.
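Here's a quick sketch of softmax applied row-wise to some made-up scores, showing that each row lands between 0 and 1 and sums to one:

```python
import torch
import torch.nn.functional as F

# Unnormalized attention scores for T = 3 tokens (made-up numbers,
# echoing the arbitrary 73s and 96s from the earlier tensor).
scores = torch.tensor([[73.0, 10.0, 5.0],
                       [20.0, 96.0, 30.0],
                       [15.0, 40.0, 88.0]])

# Softmax along each row: every entry lands in (0, 1) and each row sums to 1.
weights = F.softmax(scores, dim=-1)
print(weights.sum(dim=-1))  # each row sums to 1.0
```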
So the last thing we need to explain is why we divide by the square root of d_k before applying softmax, where d_k is the attention dim. That's what d_k actually is. This is something you'll find in neural networks and deep learning all the time: researchers experiment with different scale factors. We're just dividing by the square root of d_k, which is a single number here, no matrix multiplication or matrix division or anything like that. Neural networks can suffer from something called an exploding gradient or a vanishing gradient, where the values of the derivatives during training get either way too big or way too small, and training becomes unfeasible. So it's often better to scale our values down or up by some scale factor, which is exactly what we're doing here. The researchers who created self-attention, at Google in 2017, found that this achieved far better results, and it's become standard when coding up self-attention. And the @ operator that I've been using here, by the way, is matrix multiplication in PyTorch. So now you're ready to code up self-attention.
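Putting all the pieces together, here is a minimal sketch of the full formula, softmax(Q K^T / sqrt(d_k)) V, for a single unbatched sequence. The masking of future tokens is left out here, and the layer names and sizes are illustrative assumptions, not the course's exact code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, E, A = 4, 16, 8                       # hypothetical sizes

x = torch.randn(T, E)                    # token embeddings

# Q, K, V each come from their own trainable linear layer.
query = nn.Linear(E, A)
key = nn.Linear(E, A)
value = nn.Linear(E, A)

q, k, v = query(x), key(x), value(x)     # each (T, A)

# softmax(Q @ K^T / sqrt(d_k)) @ V
scores = q @ k.transpose(-2, -1) / math.sqrt(A)   # (T, T), scaled down
weights = F.softmax(scores, dim=-1)               # each row sums to 1
out = weights @ v                                 # (T, A)
print(out.shape)  # torch.Size([4, 8])
```

Dividing by sqrt(A) before the softmax is the scale factor discussed above.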
Let's solve self-attention. We're finally ready to code up this super important component of transformers. I highly recommend you check out this video for an explanation of the concepts, though I'll also give an overview of them in the solution video. Similarly to how we first learned linear regression by implementing it from scratch and then unlocked the nn.Linear class, which we simply used as a building block in neural networks, used extensively in handwritten digit recognition, in sentiment analysis, in many neural networks, now that we're on our journey to coding up the transformer architecture that makes up GPT, there's a whole bunch of new layers we need to talk about. nn.Linear is a component of GPTs, but there are other layers in this complicated-looking neural network architecture that we're going to break down and make simple. One of the most important components of GPT, and of transformers in general, is self-attention. We can see it appears here and here, and it's one of the most powerful components of a transformer. It's actually the thing that makes transformers unique compared to other neural network architectures.
By this point, I'll assume you've solved the GPT dataset problem and are generally familiar with the input we pass into these GPTs during training. It's just a sequence of tokens or words, and there are actually many training examples embedded within even one sentence. During training, this model is just learning to predict the next token over and over again, given a bunch of contexts. So if we pass this sentence in during training as one of the sentences in a batch, the model can learn that given a context of "write," "me" comes next; given "write me," "a" comes next; given "write me a," "poem" comes next. As information flows through this neural network, it all culminates in the model's prediction for the next token, which we'll talk about later, but it ends up being a bunch of probabilities for all the possible next tokens, and then we may do something like take the highest probability, or something more complicated depending on the scenario. There are many neural network layers that factor into the model's prediction of the next token in the sequence, and one of the most important layers helping the model achieve that is the attention layer. Before we get into the attention layer, let me talk a bit about what happens before it. We can see there's some sort of embedding layer over here, and it's exactly the same embedding layer we used in the sentiment analysis model. Given a sequence of tokens of length T, here with "write me a poem" capital T equals 4, for every token we get an embedding, or feature vector, which should encapsulate the meaning of that word or token, and this is learned through training, through gradient descent. Let's say the embedding dimension we choose is capital E. Our input of size capital T has now become T by E, since for every token we have a vector of size E, and this T by E tensor is what gets fed into the attention layer over here. What the attention layer outputs is something of size T by A, where A is not the embedding dim but the attention dim. So for every token at every single time step, here's one time step in the sequence, here's the next, we now have a different vector. This vector may or may not be the same size, depending on whether A equals E, but that's not the important thing. The important thing is that this transformed vector for every token now contains a slightly different piece of information: it's a transformed version of the vector that encapsulates what the model needs to attend to, or focus on. What's actually important for the model to pay attention to? That's where the term self-attention comes from. And the way the self-attention layer generates this vector of what the model should pay attention to, for every token, is by aggregating information from the other tokens. Say the model now has a new representation for the word "me": the model has factored in everything that came before "me," all those other tokens, and aggregated that information together into a new, transformed representation of the token, one that represents what the model actually needs to pay attention to. So now let's dive into how that actually works and make this a bit less abstract.
Before we dive into exactly how attention works, let's make sure we're clear on everything in the problem description. The forward method will return a batch size by context length by attention dim tensor; context length is the capital T I was just talking about. And we can see that the input dimensionality is the embedding dim, so the input is B by T by capital E. We're also given the attention dim, which is like the capital A I was just talking about. People also call this the head size. We don't need to worry about the word "head" right now; that will be explained more in the next problem, where we talk about multi-headed attention. The third input we receive is the actual B by T by embedding dim tensor, which is the input to the forward method for this self-attention layer. Again, we are coding up the self-attention class, which will simply be used as a layer later on in the GPT class, just like nn.Linear. In the constructor for the self-attention class we'll define the relevant instance variables and the layers that make up self-attention, and the forward method will be just like getting the model's prediction for that layer: we pass the B by T by E input into the layer, and it returns the B by T by attention dim tensor. Just to make that super clear: for this input tensor, we have two different 2 by 2 tensors, so this is 2 by 2 by 2. That's B by T by E, so capital T, the context length, equals two in this case, the batch size equals two, and capital E equals 2. Then the output should be 2 by 2 by 3, since 3 is the attention dim while the batch size and capital T, the number of tokens, remain the same. And we can see that that is the shape of this tensor over here. The numbers might feel a bit meaningless right now; this example is mostly to help you understand the shapes of the inputs and outputs. They'll make sense soon, once we explain how self-attention works at the lowest level.
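A minimal sketch of what such a SelfAttention class might look like, assuming the shapes from the problem description. This is my own illustration, not the course's solution; it uses the Q, K, V layers and the masking of future tokens described earlier:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: (B, T, E) -> (B, T, A)."""
    def __init__(self, embedding_dim: int, attention_dim: int):
        super().__init__()
        self.query = nn.Linear(embedding_dim, attention_dim)
        self.key = nn.Linear(embedding_dim, attention_dim)
        self.value = nn.Linear(embedding_dim, attention_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, E = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)  # each (B, T, A)
        # Scaled dot-product scores: (B, T, T).
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])
        # Mask out future tokens so each position only attends to itself
        # and to what came before it.
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        return weights @ v                                   # (B, T, A)

torch.manual_seed(0)
layer = SelfAttention(embedding_dim=2, attention_dim=3)
out = layer(torch.randn(2, 2, 2))   # B=2, T=2, E=2, as in the example above
print(out.shape)  # torch.Size([2, 2, 3])
```

The 2 by 2 by 2 input produces a 2 by 2 by 3 output, matching the shapes just discussed.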
lowest level. Let's build some more intuition for how this works with a simple example. Let's say we had a
simple example. Let's say we had a sentence like dog is cute that we were feeding into GPT. And again, we don't really need to worry about the inner workings of GPT for now and that complex transformer diagram. Let's just build
transformer diagram. Let's just build some intuition for how this concept of self attention works. And let's just say B equals 1. So that we are only dealing with one example at a time. We're not
processing like other sentences in parallel in this batch. So T equals 3 if we're doing a word level breakdown of the sentence. And let's say E equals 2.
the sentence. And let's say E equals 2.
So every token would be represented with a vector of size 2. And this is our input X that is kind of that T by E tensor that is fed into the attentions forward method. And we can see that the
forward method. And we can see that the first row is for dog, the second row is for the word is, and the third row is for the word cute. So let's say we had
these embeddings learned by the model.
And let's just take them to be like what they are at face value. Let's not worry about how the where these numbers came from. Just that this is the model's
from. Just that this is the model's learned representation of these words.
And this will make sense soon. So if we wanted to plot them, we can see that dog is over here is is over here. And then
cute is over here.
So how does self-attention generate this new tensor of size T by A, where every token gets a transformed vector holding the important information for the model to look at? The way this is generated is that the model first produces a T by T tensor. You can think of this as a tensor of scores, or weights. If we have T tokens and we look at a T by T tensor, that's T squared entries, right? So we're considering every single pair of tokens, every possible pair of words, and matching them up. If you index this tensor at row i, column j, where row i corresponds to the i-th token and column j corresponds to the j-th token, the way we interpret that entry, and by the way, we're going to make these numbers be between zero and one, is as a score for how important those two words are to each other in the sentence. So in a sentence like "dog is cute," you might imagine that the row for "dog" and the column for "cute" holds a really high number, because "cute" is important for describing the word "dog" in that sentence. It would be a score between 0 and 1, with one meaning very important and zero meaning not important to each other at all. This T squared sized tensor contains the scores for every single pair of tokens: how strong of an association do they have with each other, and how important is that pair for actually understanding and processing the meaning of the input? For now, let's not worry about the exact formula the model uses to generate this tensor; we'll explain that in a bit. Let's assume the model did generate this T by T tensor for the input "dog is cute," and just understand what's in it. We can see that for row one, column one, "dog" and "dog" have some relation to each other, right? It's the same word. For "is" with "is" and "cute" with "cute" I put the same number, 0.7, because obviously a word is important to itself; these diagonal entries are kind of meaningless, but the model will still learn some number for them. The more important entries are the ones off the diagonal. We can see that for the row for "dog," the future tokens that come after "dog," like "is" and "cute," are completely zeroed out, and this is intentional. You may have heard of something called masking in self-attention; it's okay if you haven't heard that term before. Essentially, if the whole goal of these models is to predict the next token in the sequence, then we shouldn't let the model look at those future tokens during training. If the goal is to predict that "is cute" comes after "dog," why would we let the model look at those future tokens? We completely zero out those attention scores. Instead, for every single token at every single time step, we only let the model see the current time step and everything that came before it. So if we look at the second row, which is for "is," we see that the word that came before it, "dog," gets 0.5: it has some importance to the word "is," which is kind of a helping verb in this sentence, so maybe not crazy important. Then obviously "is" is important to itself, and we don't let the model look at the future token "cute" that comes after "is." A really important entry in the matrix, though: "dog" comes before "cute," so for the row for "cute," we consider all the tokens that came before it, up to and including "cute" itself, and we see a really high number here. This makes sense, right? "Cute" is the adjective describing "dog" in the sentence. For the model to truly understand the language and meaning of the sentence, it needs to realize that this pair of tokens, "dog" and "cute," is very strongly associated; hence the value of 0.9. And maybe 0.5 for "is," which is just a helping verb linking "dog" and "cute," so we just have a 0.5 there.
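The zeroing-out of future tokens in that T by T matrix can be reproduced with torch.tril; the numbers below are hypothetical unmasked scores chosen so the masked result matches the "dog is cute" example:

```python
import torch

# Hypothetical unmasked scores for "dog is cute" (T = 3).
scores = torch.tensor([[0.7, 0.4, 0.9],
                       [0.5, 0.7, 0.6],
                       [0.9, 0.5, 0.7]])

# Zero out everything above the diagonal: token i may only look at
# tokens 0..i, never at future tokens.
masked = torch.tril(scores)
print(masked)
# tensor([[0.7000, 0.0000, 0.0000],
#         [0.5000, 0.7000, 0.0000],
#         [0.9000, 0.5000, 0.7000]])
```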
We'll talk about how the model generates this T by T tensor, one of the biggest parts of coding up the self-attention class, in a bit. But first we want to understand the whole value of having this T by T tensor. So now let's explain how the model uses it to actually get the output. On the right is going to be the output of the self-attention forward method. In the left matrix we have that T by T tensor, and here we have the actual input: the T by E tensor of embedded vectors for each token, where each row represents a time step. What if we matrix multiply them together? What would that actually give us?
this is as deep as kind of like getting into the actual like matrix multiplications we're going to go is.
This is just kind of the one time we're going to really look at it is it really does help with the understanding. Won't
do like any crazy unnecessary math. So
we have this row here 0.7 0 0 and we multiply that with this column over here which ends up getting us this entry over here. So we have the dotproduct of those
here. So we have the dotproduct of those two vectors and that gets us 0.7 over here. And again let's look at this
here. And again let's look at this matrix over here. It's technically we said the output was t by a. Let's just
say a equals e in this scenario. So the
output is still t by e or t by a. And so
each row here is also a time step and this is dog is cute. So the vector here, the vector here should be like the transformed version of the word or the entry for dog. And this this transformed
vector should actually contain the information that we need to pay attention to. So let's try and
attention to. So let's try and understand what this multiplication was actually doing. We have the scores here
actually doing. We have the scores here for how dog is relating to all the other tokens in the sequence. And this vector over here is essentially the first
column of every single row in our feature vectors or embedding vectors.
Right? So this this entry over here is like kind of half the representation of the word dog. Of course, we're not including this column over here. This is
part of the representation of the word is. This is part of the representation
is. This is part of the representation of the word cute. And the score for dog dog is being multiplied with this thing for dog. The score for dog is is being
for dog. The score for dog is is being multiplied with, you know, the number for is. The score for dog cute is being
for is. The score for dog cute is being multiplied with the the number for cute in this column over here. And remember,
our goal is that we want this new vector for every token/ row to actually aggregate or factor in the information from the other tokens, right? Because
for the model to again truly understand text, the model needs to understand the relationships between the different tokens in a sentence, right? We know
that we can't just be looking at these words independently. Rather, the model
words independently. Rather, the model needs to actually understand the different relationships between the tokens in a sentence. However, when we say aggregate or factor in information
from the other tokens, we really only mean the information from the tokens that came before and like up to including the current token, right? We
don't want to let the model look at the future tokens. So let's see what's been going on here for this first row, the new transformed version of dog. If we actually do the matrix multiplications, we have the same row again over here times this column, and we actually just end up with zero over here. So it doesn't seem like a drastic change for dog. We just have 0.70.
But what this actually represents is that, well, dog is the first token in the sentence, right? So that's why we have these zeros over here. Dog wasn't even allowed to look at the future tokens, which is why it doesn't seem like we really factored in much from the future tokens.
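To make that arithmetic concrete, here's a toy sketch (values loosely based on the ones mentioned in the video, so treat them as illustrative) of a masked score matrix aggregating one embedding column:

```python
import torch

# Toy causal scores for "dog is cute": row i holds token i's weights
# over tokens 0..i; future tokens are zeroed out by the mask.
scores = torch.tensor([[1.0,  0.0,  0.0],
                       [0.4,  0.6,  0.0],
                       [0.9, -0.5, -0.7]])  # hypothetical values

embeddings = torch.tensor([[0.70], [0.20], [0.10]])  # one feature column

# Each row of the result is a weighted sum of the embedding entries.
transformed = scores @ embeddings
print(transformed[0])  # row 0 only sees dog itself: 1.0 * 0.70 = 0.70
```

The zeros in the upper triangle are exactly why dog's transformed value is unchanged: it was only allowed to look at itself.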
Let's not worry too much about this second row over here, since it's probably not going to give us a ton of important information. As we saw earlier when we were looking at the scores, is was not a super critical word in the sentence. However, the relationship between dog and cute, and the effect of having this 0.9 in this entry of the matrix over there, is about to become apparent. So let's say we are trying to generate this third row over here. So we
know that is done by taking this row over here and multiplying it with this column over here. So we get 0.9 minus 0.5
minus 0.7.
So, if we take a look at the effect of that, this 0.9 is much larger than 0.5 and 0.7, and that 0.9 gets multiplied with this entry for dog, whereas these
smaller weights over here are getting multiplied with the entries for these other tokens over here. So in this sum that actually becomes this entry over here, the effect of dog and cute, this score or association of 0.9, is weighted far more heavily and is actually contributing far more to the overall sum than these other terms, which are actually bringing this number down. And if you actually finish out the matrix multiplications and plot those new vectors for each token, this is what we end up with. And it might seem like a pretty small change. We'll notice that dog has shrunk a bit, is has actually shifted a bit, and cute has shifted a bit. And this
might seem small to us, but this is actually helping the model learn the relationship between all the tokens and ultimately make its prediction for the next token in the sequence with the final layer of the model. So how is that tensor of T by T scores actually generated? If you look up the formula for self-attention, you're going to see something like this, which seems a little ugly, but it actually has an intuitive explanation. So, we said
we wanted to consider every single pair of tokens and how important they are to each other. Well, the way the model achieves that is by actually breaking down these invisible barriers between the tokens, and now, instead of processing the tokens independently, it starts letting the tokens talk to each other. Attention is actually a communication mechanism between words.
And that sounds a little crazy at first, but what I actually mean by that is that for every single token in the sentence, we're going to emit two vectors. The
model is going to generate and learn two vectors for every single token. And one
of those vectors is called Q for query, and one of those vectors is called K for key. And the query actually represents what the token is searching for or querying for. So the token is talking to other tokens. It's generating a query vector and the token is saying, "Okay, I am a token. Here is what I am looking for. If you meet my criteria, hopefully we can meet up and then get a high score together in this tensor." So what I actually mean by that is a noun might be searching for an adjective to describe itself. So the query for dog, this vector, might have similar information to what is actually embedded in the vector for cute, because we know that is the adjective that describes dog. So we're actually going to look at all these queries and keys and see which ones match up. So if the query is what information the token is searching for, and that is emitted by every single token, then the key vector emitted by every single token represents what information that token actually has. What information does that token actually contain? So the query for dog might match up with, say, the key for cute, since we know this is a noun, we know this is an adjective, maybe they're looking for each other. So every single token is going to generate a key and a query.
So for every single token we have a vector, and that vector will be of size attention dim. Then the query tensor or the query matrix is T by A. Same thing for the key matrix or the keys: that would be T by A. So then if we actually multiplied Q by K transpose, where transpose just means to swap the rows and columns in a matrix, then that's a T by A (this is Q) multiplied by an A by T (this is K transpose). So then the A's are actually going to cancel out. This makes the matrix multiplication dimensions work out, and we end up with something that is actually T by T, which is exactly what we wanted. And then if you kind of take a look at an
example and see the row for a query being multiplied with a column in K transpose, you'll see that that's literally aligning the queries and keys, because we want to see which queries and keys actually match up, right? We want the query for, say, dog to actually align with the vector that represents the key for cute. And then that dot product of those two vectors is some sort of number that's stored at entry (i, j) in this T by T tensor. So maybe i is the row for dog and j is the column for cute. Or actually, as we talked about earlier, that entry would be masked out to be zero, because we wouldn't let dog look at future tokens. So it might be more accurate to say it's the row for cute: maybe i corresponds to cute and j corresponds to dog. It would actually be the row for cute and the column for dog. And then that dot product of that query and that key is a number in this T by T tensor which represents the score or association between dog and cute.
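As a shape-level sketch of that Q times K transpose step (assuming T = 3 tokens and attention dim A = 4; the untrained linear layers here are stand-ins for whatever the model has learned):

```python
import torch
import torch.nn as nn

T, E, A = 3, 8, 4  # tokens, embedding dim, attention dim (head size)
embedded = torch.randn(T, E)  # stand-in for the embedded sentence

get_queries = nn.Linear(E, A, bias=False)
get_keys = nn.Linear(E, A, bias=False)

Q = get_queries(embedded)  # T by A: what each token is searching for
K = get_keys(embedded)     # T by A: what each token contains

scores = Q @ K.T           # (T, A) @ (A, T) -> T by T
print(scores.shape)        # torch.Size([3, 3]): one score per token pair
```

Entry (i, j) of `scores` is exactly the dot product of token i's query with token j's key.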
So then what is this little scale factor that's square rooted in the denominator over here? This is just one of the contributions from the 2017 paper "Attention Is All You Need," where they actually introduce this idea of scaled attention. So d_k, or the dimensionality of K, is actually just this attention dim, the capital A or head size number that we talked about earlier. And the researchers found that when you take this Q times K transpose, we know what that represents, that's our T by T tensor, and scale it by dividing by the square root of d_k, you actually end up with a much smoother training process, and you won't run into any extremely small or large gradients, which makes the training process a lot easier. So this part of the formula isn't as important for our conceptual understanding, but this part, Q times K transpose, is definitely really important to understand. One additional
thing we're going to do to actually ensure that all our entries in that T by T tensor are between 0 and 1 is we're going to apply the softmax function. And the softmax function is kind of like a multi-dimensional sigmoid. We talked about the sigmoid function earlier in these ML problems as a function that makes anything that's inputted into it be between zero and one. The higher the input is, the closer the output is to one; the lower the input to sigmoid is, the closer it is to zero. So softmax is pretty similar, except not only will it make every entry in your vector between zero and one, but it will also make them sum to one. So we're actually going to apply the softmax function to each row of this T by T tensor. So then every single entry in this tensor will be between 0 and 1, and each row will also sum to one. So if I looked at the row for dog, right, the row for dog, and the columns are dog, is, cute, then this actually is going to force all of these numbers to sum to one. And we like that, because then we can kind of think of this as a probability. We know all the probabilities have to sum to one. So we can actually think of entry (i, j) as the probability that the i-th token and the j-th token are important to each other.
To clarify exactly what we mean by softmax, let's say we had some vector here: 1, 2, 3. To apply softmax to this vector means to raise e to the power of every term in the vector. So e to the power of 1, e to the power of 2, e to the power of 3. And then all of these entries in the resulting vector are divided by the same thing: the sum of all of them. That's what makes everything be between zero and one and actually sum all the way up to one when you add up all of these numbers over here.
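As a quick numeric sketch of that (assuming PyTorch is available):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0])

# Exponentiate every entry, then divide each by the common sum.
manual = torch.exp(x) / torch.exp(x).sum()
builtin = F.softmax(x, dim=0)

print(manual)        # approximately [0.0900, 0.2447, 0.6652]
print(manual.sum())  # 1.0: entries are between 0 and 1 and sum to 1
```

Note that the largest input (3) gets the largest share of the probability mass, which is the "soft" version of taking the max.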
So, represented as a formula, the softmax at entry i is just e to the x_i, where x_i is the entry at the i-th position in our vector, which we can call x, and then we divide by the sum of all the e to the x_j's. That's kind of an ugly-looking formula, but here's what it would
look like as an example. And in the code, it's actually a B by T by T tensor that we're applying softmax to. The whole time we've been talking about a T by T tensor, but that's just for one example; if we're processing examples independently in a batch, then for every example there's a T by T tensor. So just to be really clear, we want to apply the softmax to every row of the T by T tensor, but for every batch. The way you would do this in code is nn.functional.softmax, which is the function we use to call softmax, and then what we're actually going to do is pass in the tensor. So let's just call this scores. These are our scores. You'll go ahead and pass in scores, and you'll pass in the dim. And the dim, since we have a three-dimensional tensor, is either 0, 1, or 2. But since we actually want to make each row of this T by T tensor for every batch sum to one, you're simply going to pass in dim=2. One final thing before
we're ready to talk about the coding. So the forward method is actually going to return the softmax of our scores, that's this first factor that I've just underlined here, times V, which is our value. Earlier, when we were doing that first dummy example, we multiplied our 2 by 2 T by T scores times our input, the T by E input. I essentially said that we were doing our scores times what we can call X, the input to the forward method. But that's not actually what's done in practice. And that might have been why the numbers looked a little weird.
So what's actually done in practice is, for every single token in a sentence, let's say we have a sentence like "dog is cute," for every single token, for every single word, not only do we emit a query, not only do we emit a key, but for every single token (I'm only drawing it for dog, but it's also the case for every other token), we're also going to emit something called a value, a value vector. So if the query is what the token is searching for, and the key is what information the token actually has, the value will be another learned vector for every token, and this is what information the token wants to share. Because there might be a whole bunch of information that a token has, but maybe not all of it is actually relevant to share with the other tokens. So when we actually do our T by T scores multiplied by our input to get our aggregated output, we're actually going to multiply by V, where V is T by A, just like Q and K. Every token is going to emit a vector of size A called the value, which represents what information is actually relevant to share. Now we're ready to talk about how to actually code this up and the implementation details. So nn.Linear is going to be our friend again. In order
to generate our keys, queries, and values, we're going to have, in our constructor for the self-attention class, three instances of nn.Linear: one of them for generating the keys, one of them for generating the queries, and one of them for generating the values. And
you can imagine that in_features would be the embedding dim, or E, and out_features would be the attention dim, or A. Next, when we actually matrix multiply Q and K in the self-attention formula, the @ operator in PyTorch actually does matrix multiplication, or you can use torch.matmul.
You can actually check the torch documentation for that. You're going to want to do Q matrix-multiplied, so the @ operator, with torch.transpose of K, where you transpose the first and second dimensions. So that will actually give us the B by T by A tensor, which is Q, matrix-multiplied by the transposed version of K, which is B by A by T. The reason we're transposing the first and second dimensions is because originally it's B by T by A for K, the keys, but we want to switch the A and the T. Remember, this is dimension zero, this is dimension one, this is dimension two. So if we swap those, that gets us the matrix multiplication that we actually want. It'll preserve this batch dimension; it's kind of like a parallel-processing, independent leftmost dimension. And then for every single pair in the batch, we're doing the T by A times the A by T. And we know a T by A times an A by T gives us the T by T that we want. Another thing is how do we actually do the masking that we talked about earlier? The best way to implement this is to use torch.ones and pass in capital T, capital T, which will give you a tensor of all ones of shape T by T. And then you can pass that into another function called torch.tril, which is short for lower triangular; it's a linear algebra term. And then this is going to give us this tensor over here, which actually has the future tokens masked out. And then we can maybe store that in a variable
called premask. Because what we're actually then going to want to do is, let's say we have our scores, right? Scores is what we have after we've applied Q times K transpose and we've done our division by the square root of d_k. Let's say that is the state of scores at this point, but it's before we've applied softmax. We're going to want to do the masked fill on scores before we apply softmax. We want to get rid of those future tokens for every single time step before we apply softmax.
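Here's a minimal sketch of that masked fill on a toy T by T score matrix (random values, just to show the shape of the operation):

```python
import torch
import torch.nn.functional as F

T = 3
scores = torch.randn(T, T)  # pretend these are scaled Q @ K-transpose scores

# Lower-triangular premask: 1 where a token may look, 0 at future tokens.
premask = torch.tril(torch.ones(T, T))

# Fill future positions with -inf BEFORE softmax so they contribute zero.
scores = scores.masked_fill(premask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

print(weights)  # upper triangle is zero; each row still sums to 1
```

Because e to the negative infinity is zero, the masked entries drop out of both the numerator and the denominator of the softmax.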
Otherwise, say you were to zero out the scores for those future indices after applying softmax. Well, then, for the ones that are not zeroed out in every single row, because of how the softmax formula works, where it's doing e to whatever's in the first entry, e to whatever's in the second entry, even if you later mask those out, the denominator was still factoring those in. So you want to erase those future tokens' contribution before you apply softmax. And the way you'll do that is, with scores, you will simply call the masked_fill method from PyTorch, and then in the first entry you want to pass in a mask, which is just a tensor of trues and falses. So you would say premask, which is just this tensor over here, equals equals zero, which would get us the trues and falses, the trues where we actually want to do the masking. And then at those entries you want to pass in negative infinity, and the reason you want to pass in negative infinity is because then, after you apply softmax, you would get zero. And let's
explain why that's the case. So just a quick math review on why this is the case. Again, this is probably as deep as the math is going to go. What happens when you pass that negative infinity, or whatever the lowest value in Python is, into the e function in the softmax formula? Well, e to the negative infinity is 1 over e to the infinity. That's just an algebra rule. And then e to the infinity: if we think about what the e function looks like, it's an exponentially growing function. So, as the input approaches infinity, this function is approaching infinity. So I can replace 1 over e to the infinity with 1 over infinity, and this is just zero, which is exactly what we wanted. We didn't want those future tokens to have any contribution; we wanted to zero them out. And the way to zero them out is to set them to negative infinity before you apply softmax. So why is 1 over infinity zero? It's just a quick review from maybe calculus one. 1 over 2 is 0.5, 1 over 5 is 0.2, 1 over 10 is 0.1, 1 over 100 is 0.01. As this denominator is getting bigger and bigger, the numbers are getting smaller and smaller. So 1 over infinity is essentially zero. Okay, now
we're ready to code it up. So we know we're going to need three linear layers: one each for generating the keys, queries, and values. So we can say self.get_keys is nn.Linear, where the in features are of size embedding dim, since for every token we originally have a vector of size embedding dim, and we want to get, for every token, a vector of size attention dim. Similarly, we might have get_queries as nn.Linear of the embedding dim and then the attention dim, and then again the same thing for the values: nn.Linear, embedding dim, attention dim. And those are actually the only instance variables we need here. Just a small adjustment: remember we don't use biases in the linear layers for getting the keys, queries, and values. In linear regression, when we're doing w1 * x1 + w2 * x2 + w3 * x3, at the end there is an optional constant term b, called a constant term since it's not actually being multiplied by any of our input features, right? We're just adding it as a constant. And it can optionally also be learned through training, through gradient descent. But if we pass in bias=False, this will not be included; we'll simply have our W's and X's. And in the case of self-attention, it's actually been found that we can get slightly better results without the bias. So there's no need for a bias here.
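Putting the pieces we've described together, here's a sketch of where this is headed: the full class, as I reconstruct it from this walkthrough, with assumed names like get_keys (the forward method is walked through step by step next; the rounding at the end just makes the output easier to read):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embedding_dim, attention_dim):
        super().__init__()
        # One linear layer each for keys, queries, and values; no biases.
        self.get_keys = nn.Linear(embedding_dim, attention_dim, bias=False)
        self.get_queries = nn.Linear(embedding_dim, attention_dim, bias=False)
        self.get_values = nn.Linear(embedding_dim, attention_dim, bias=False)

    def forward(self, embedded):
        # embedded is B by T by E; K, Q, V are each B by T by A.
        K = self.get_keys(embedded)
        Q = self.get_queries(embedded)
        V = self.get_values(embedded)

        B, T, A = K.shape
        # (B, T, A) @ (B, A, T) -> B by T by T, scaled by sqrt(A).
        scores = Q @ torch.transpose(K, 1, 2) / (A ** 0.5)

        # Mask out future tokens, then softmax each row so it sums to 1.
        premask = torch.tril(torch.ones(T, T))
        scores = scores.masked_fill(premask == 0, float("-inf"))
        scores = nn.functional.softmax(scores, dim=2)

        transformed = scores @ V  # B by T by A
        return torch.round(transformed, decimals=4)
```

For example, SelfAttention(embedding_dim=4, attention_dim=8) applied to a (2, 3, 4) input returns a (2, 3, 8) tensor.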
So now let's actually use those layers to get our queries, keys, and values. So we can say K is self.get_keys of the embedded version of our input. So if embedded is B by T by E, this is B by T by A. Similarly, Q should be self.get_queries of, again, embedded, and has the same dimensions, and then the values: self.get_values of embedded, and again the same dimensions. So now we actually want to do the matrix multiplication, right? We have to do Q matrix-multiplied with K transpose. That's how we actually align the queries and keys and see which ones match up. So I might say scores is Q matrix-multiplied with torch.transpose of K. And again, I want to swap the T and the A, so I would say 1, 2 (it'll also work if you pass in 2, 1). And then we also are going to need the value of the attention dim, right? As we know, we
actually are going to want to take the square root and divide by it, right? So we can get the attention dim from a variety of places. The easiest place to get it is just K, since we know it's B by T by A. So we can just say B, T, A = K.shape; we're unpacking that whole .shape tuple. And then here is when we can do our scaling: you can say that scores is equal to scores divided by the square root of the attention dim. So now we have scaled the scores. Now we want our lower triangular tensor, so we'll do something like torch.tril, just like we talked about earlier, on torch.ones of T, T. And then maybe this is our premask that I talked about earlier; that's before we
actually get the tensor of trues and falses that would be the true definition of a mask. And then we might say that our true mask is simply premask, but we want to look at the entries where this thing is equal to zero. Then we can say something like scores = scores.masked_fill, go ahead and pass in the mask, and then Python's version of negative infinity, or whatever the smallest system value is: that will just be float("-inf"). And now we can actually apply softmax. So we can say something like scores = nn.functional.softmax, pass in the scores, and again here is where we want to say dim=2, because scores is B by T by T, but we actually want to do this for every T by T. And then, given a T by T, if you wanted to set every row to sum to one, you're trying to apply softmax to every row, going across the columns. So it's actually that final dimension over here, dim=2. And now that we have the scores, we can use them to get our transformed output. So
we would say something like: our transformed output is simply scores matrix-multiplied with our values, and then we can simply round this. So return torch.round of transformed with decimals=4. And we're done, and we can see that it works. I hope this was helpful. And
definitely leave a comment if there's anything that you would like me to explain in more detail. There's definitely a ton of concepts embedded within coding up this class, and we're a few problems away from having a working GPT from scratch. Okay, in this video we're going to explain multi-headed attention. And as you can see from the transformer architecture, which you may have seen before, it's one of the most important parts of the model. It actually occurs over here. We won't include this other attention block right here, since that has to do with the encoder, which I've actually cropped out of the diagram; large language models like GPTs don't use encoders. But attention, specifically this masked multi-head attention (and we're going to really focus on explaining the multi-head aspect of attention in this video), is one of the aspects of the GPT that makes it so powerful. It was introduced in 2017 by researchers at Google, and we're going to break down exactly how it works. But before we break down multi-headed attention, let's do a quick breakdown of how single-headed, or just normal, attention works. So when ChatGPT or any language model receives an
instruction like "write me a poem," we know this is something that ChatGPT does pretty well on. Well, the model actually needs to know what parts of this input sentence to focus on, specifically to pay attention to. Not all of these tokens are important, right? Obviously, the more important ones in this sentence are "write" and "poem." Imagine if you just said "write poem" to ChatGPT. It would still understand what you're requesting. That shows us that these are the two most important tokens, and the model needs to learn to pay attention to these when it's given a real sentence like "write me a poem." And this is actually what attention solves. So this is what the attention layer solves. It's actually a layer in a neural network, and here's what it will take in: it will take in the embedded version of a sentence. So the first step in any kind of language model is to actually tokenize the sentence. So we'll break up the sentence into tokens. And
for now, let's just assume we're using a word-level tokenizer. So we have some sort of matrix, or a tensor in PyTorch, that is actually going to represent this sentence. And we're going to say this is T by E. The reason this has T rows is because we have T tokens in every sentence. If we break this down on the word level, we'll have four tokens here, so we would actually have four rows here. And then E is something called the embedding dimension. Let's just assume an embedding dimension of two. Then we know our matrix or tensor looks something like this. And the embedding for every single token, so for write, for me, for a, for poem, this is going to essentially be a vector that encapsulates the meaning of the word. So you can think of it as a feature vector that summarizes the information in that word in a way the model can actually understand, and that will actually be learned through the process of training a neural network.
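As a sketch of that input (the vocabulary here is hypothetical, and in a real model the embedding table is learned during training rather than random):

```python
import torch
import torch.nn as nn

# Hypothetical word-level vocabulary for the example sentence.
vocab = {"write": 0, "me": 1, "a": 2, "poem": 3}
T, E = 4, 2  # T tokens, embedding dimension of 2

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=E)
token_ids = torch.tensor([vocab[w] for w in ["write", "me", "a", "poem"]])

embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([4, 2]): the T by E tensor attention takes in
```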
But what does the attention layer output? That was just what the attention layer takes in. The attention layer is actually going to output a tensor that, instead of being T by embedding dimension, is going to be T by attention dimension. So T by A; we can call this A for the attention dimension, also sometimes called the head size. And this is also technically going to be a vector for every token, but it's going to be a transformed vector that encapsulates the relevant parts of the token that the model actually needs to pay attention to. So if I had to summarize attention in just a couple of sentences, it's actually a communication mechanism. Attention says we don't want to look at each token in a sentence completely independently. We want to kind of break these walls. We want to let the tokens talk to each other, and we want to let them actually have some time, specifically through the attention layer in a neural network. And
we want to let them match up with each other until the tokens that are actually relevant to each other have paired up, right? Because every token has some kind of relationship, right? The relationship between write and poem is that poem is what needs to be written. The relationship between me and, let's say, write: well, if you remember your grammar (again, not super important), me is kind of like the indirect object here; the poem is being written for me. So clearly, for a model to really understand a language, it needs to understand the relationships between the tokens in a sentence. So to summarize, attention is essentially a communication mechanism that lets tokens talk to each other. Okay, so now let's explain what we mean by multi-headed. Let's say attention dim
equals 8 and embedding dim equals 4. So then the input to our attention layer, which I'm just calling A in this black box, is a 4 by 4 tensor. And I'm just assuming T equals 4, using the same sentence, "write me a poem," and then the output would have to be 4 by 8. But what if we wanted to do something called two heads of attention, so multi-headed attention? Well, we would have two separate instances of A, where A is doing self-attention. And these heads, or these two separate instances, are going to operate in parallel, completely separately. They'll have their own parameters and weights that are trained.
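As a sketch of that idea (the Head module here is a minimal unmasked single head, just to make the example self-contained; the names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of self-attention: (T, E) in, (T, head_size) out."""
    def __init__(self, embedding_dim, head_size):
        super().__init__()
        self.get_keys = nn.Linear(embedding_dim, head_size, bias=False)
        self.get_queries = nn.Linear(embedding_dim, head_size, bias=False)
        self.get_values = nn.Linear(embedding_dim, head_size, bias=False)

    def forward(self, embedded):
        K = self.get_keys(embedded)
        Q = self.get_queries(embedded)
        V = self.get_values(embedded)
        scores = Q @ K.transpose(0, 1) / (K.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    def __init__(self, embedding_dim, attention_dim, num_heads):
        super().__init__()
        # Each head shrinks to attention_dim // num_heads so that the
        # concatenated output is still attention_dim wide.
        head_size = attention_dim // num_heads
        self.heads = nn.ModuleList(
            Head(embedding_dim, head_size) for _ in range(num_heads)
        )

    def forward(self, embedded):
        # Heads run independently, each with its own parameters;
        # concatenate their outputs along the last dimension.
        return torch.cat([h(embedded) for h in self.heads], dim=1)

out = MultiHeadAttention(embedding_dim=4, attention_dim=8, num_heads=2)(torch.randn(4, 4))
print(out.shape)  # torch.Size([4, 8]): two (4, 4) head outputs concatenated
```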
Let's keep the output dim equal to eight. So if we want to do that, then we're actually going to have to change the attention dim for each instance of A, each head as they call it, to four. That way we can still have 4 + 4 equal to 8. But if we have two 4 by 4 outputs, how do we get the output dim equal to 8? Well, we can use concatenation, right? We know there's a function in PyTorch; don't worry if you're not familiar with this, but we can actually concatenate two tensors and tell PyTorch along which dimension to concatenate. And we want to concatenate not along the zeroth dimension, but rather the first dimension. And that will actually give us a 4 by 8 tensor. And let's say we had an attention dim of 16 and we wanted to have four heads of A. Then each head size would actually be four, since 4 * 4 is 16. So we call this number the head size, and we can call this the overall attention dimension. So what multi-headed attention is, is just doing this normal attention but having a bunch of instances of it operating in parallel, each of them with their own trainable weights, biases, and parameters that are learned through gradient descent. And then we're going to take all their outputs and concatenate them, and that's what the output of multi-headed attention's forward method would be. So
why does multi-headed attention actually yield better results in neural networks than just having one single head of attention that has the same attention dim as the output dimension we get when we concatenate all our single heads together? Why does it give better results to have these two separate instances of self-attention, or sometimes four or six or eight? Well, the number of parameters is staying the same, right? The number of weights, the number of trainable or learnable parameters in the model, is staying exactly the same. We know that the queries, keys, and values (if you've seen the video on the lower-level details of attention) are linear layers that occur in each single head of attention, for the model to actually figure out the relationships and scores between pairs of tokens. But if we're shrinking the attention dim for each head, then the number of parameters is staying the same. So why are we getting better results when we do two different heads of attention, one here and one here, and then go ahead and concatenate the results? Well,
Well, if we think about it, each head of attention gets to operate on the input separately. The embedded input, which is T by embedding dim, is passed separately, in parallel, to each head of attention. So each head gets to operate and do a bunch of math on that input tensor. And since each head of attention has its own parameters, each head can actually learn something different. In fact, what we find after language models are trained is that each head of attention specializes in learning some component of the language. When researchers did an analysis of BERT (BERT is a very popular transformer that Google uses for all sorts of things, such as Google Translate), they found that one head of attention was specialized at looking at direct objects of verbs, and another head was specialized at looking at indirect objects of verbs. So intuitively, we can see these large language models, these models with billions of parameters, as having different heads and different components of the neural network that specialize in learning different parts of the language and its grammar. That's the main benefit of multi-headed attention: we have different heads that can each specialize in learning something different, even though the number of parameters stays the same whether we use one head or many.
So that's multi-headed attention. It's actually not a terribly complicated concept, but it is, most people would agree, the crux of a GPT. It is the most powerful layer of a GPT, and it is what allows these modern-day language models to learn English, and really any language, so well. So I'd highly recommend trying to code it up now. After you do that, you'll be ready to code up the transformer block, this giant rectangular block you can see in the picture, and then you'll almost have a working GPT. In this problem, we're going to solve multi-headed attention. And again, the first thing I want to say is that I highly recommend you solve single-headed attention before this problem, because if you understand single-headed attention, this problem is not that hard; it's actually kind of easy. Single-headed attention, on the other hand, which this problem builds on, is a lot harder, just because so many new concepts are introduced in that problem. So we're going to assume you've solved that problem, and then this one isn't going to be that bad. We'll still generally explain the idea of attention, but we won't go into as much detail as we did in the single-headed attention video.
So, the problem statement says that this layer is what makes LLMs so good at talking like real people. Single-headed attention is pretty good, but this whole multi-headed thing, which we're going to explain in this problem, is what made the latest transformer neural networks so much better at being the chatbots you might have seen in recent years. I've also made an explanation video of the concepts if you'd like to watch that as well. So we have to code up the multi-headed self-attention class, and the single-headed attention class is given: its solution code is in the starter code for this problem, and we're going to make instances of that class to solve the multi-headed attention problem. We're told that the forward method needs to return a B by T by attention dim tensor, so we return a tensor of the same shape as in the single-headed attention problem.
We're given embedding dim; that's our first input, and it's something we were also given in the previous problem. We're also given attention dim. We're given embedded, which is B by T by E; this was also given in the previous problem. However, our new input, the one we really need to pay attention to, is num heads. This is the number of times we're going to make an instance of the given self-attention class. And we have the constraint that attention dim mod num heads is zero, so attention dim is a multiple of num heads. There's an example here. The numbers aren't going to make a ton of sense, but they do help us understand the shapes. Our input is supposed to be B by T by E; that's the shape of embedded. We can infer that the batch size is two, and T, the context length, is also two; you can infer that from the example over here. Then E is two as well, which makes sense given that we have two different 2x2s in our input embedded. And if attention dim is three, and B and T are supposed to remain the same, then we have two different 2x3s, so the output is 2 by 2 by 3, and that is exactly what we have over here. So the shapes do make sense.
So what actually is multi-headed attention? Previously, we would take an instance of single-headed attention, pass in our B by T by E input, and whatever single-headed attention returned was the output. Now, depending on what the num heads parameter is set to, we'll make that many instances of self-attention and have them operate in parallel. Let's say num heads was three; then we have three heads over here. If our input is that B by T by E tensor embedded (I'm just going to simplify it to X for now), we pass X into all of our heads of self-attention. If I had a fourth one over here, X would also be passed in there. Then we concatenate all of these outputs together, the outputs of each head. Each head of self-attention, each of these boxes over here, was able to operate independently, process X, and obtain a different output, because each of these layers, being separate instances, has different weights. They're learning different keys, queries, and values, the linear layers that make up single-headed self-attention. Ultimately, they're going to generate something different since, again, they have different numbers, different weights; they're trained separately even if they're given the same input. So we concatenate all their outputs together. We can call this the concatenated version, and that is what we return from the multi-headed attention forward method.
And that's it. Then, in the transformer neural network architecture (and remember, we're getting pretty close to coding up the full GPT class), we're going to use the multi-headed attention class as a layer. It becomes one of our building blocks for this giant block over here, which is called the transformer block. So why does this actually work better than single-headed attention? Remember that attention dim, which we can call A, mod num heads is zero. For each head, when we instantiate that single-headed attention object, we make each head have a head size of A floor-divided by num heads, whatever that integer is. So our total number of learnable parameters in the neural network isn't changing. If one option is a single head with an attention dim of A = 64, and the other option is eight heads with A = 8 for each of them, then the total number of learnable parameters hasn't really changed, even though we have multiple heads each operating in parallel. Each of them gets to process the input, that input called embedded, and learn parameters, albeit each head learns fewer parameters than if we just had one giant head. So why does this yield better results? Because the general trend is: more parameters, bigger model, the model can learn a more complex relationship, and we get better results.
Well, even though each head has fewer parameters than if we had one massive head, it turns out that each head gets to learn something different, because our input X is independently passed into each head. So each head independently gets to learn something different. And when we look at how the heads and their weights get activated on different inputs, researchers have found that you'll actually have one head that might specialize in attending to the next token, one head that really specializes in attending to the direct objects of verbs (this is all about which parts of the input we pay attention to), and one head that might specialize in the indirect objects of verbs. We know that for a model to truly understand language, it needs to understand grammar, right? It needs to understand the structure of the language and how all the words are put together, or how all the tokens come together. So we find that each head specializes in maybe a different grammar rule, and this seems to be why multi-headed attention boosts the performance of the neural network, rather than having one head that's responsible for learning all the rules of the language.

Okay, now let's jump into the code. The first thing we need to think about is that we're going to have a bunch of instances of this given single-headed attention class, so we're probably going to need to store those in some kind of collection in our constructor over here. But it turns out that in any class that is a subclass of nn.Module, and this multi-headed self-attention class is a subclass of nn.Module, instance variables that hold layers also need to be registered as neural network parameters. So if I did something like self.heads and made it a normal Python list, this isn't going to work correctly with PyTorch, because of the restriction I just mentioned. There's actually a class for this called nn.ModuleList. So I can say self.heads = nn.ModuleList(), and it works just like a normal Python list: you can append to it, but it is restricted to only storing other modules, that is, neural network layers.
So then we can say something like: for i in range(num_heads), append those instances. So self.heads.append, and we instantiate here; for these inner classes, we have to refer to the class through self, so self.SingleHeadAttention, and we pass in the embedding dim. But the attention dim for each head is just the size of that head, and we know that should be attention_dim floor-divided by num_heads. So now we have all our heads defined; we just need to write the forward method.
We need to actually get all those outputs and concatenate them together. We can make a list for all our outputs, so we say outputs is just an empty list. Then we want to call the forward method of every head of single-headed attention, so we can say: for head in self.heads, append to the list with outputs.append, and that would be head.forward(embedded), or simply the default call, head(embedded). Now it's just about concatenating those together. We can say concatenated = torch.cat(...). torch.cat expects two things: the first is a collection of the things to concatenate, so our collection here will be the list outputs, and the second is a dimension, something like 0, 1, 2, and so on. If we think about the size of each element in outputs, each one is B by T by head size, since that's the output dimension from each head of attention over here. We want to concatenate them along that last dimension, so that when we concatenate them all together, it becomes B by T by the overall attention dim that was given as a parameter in the constructor for multi-headed attention. So I would simply say dim=2, or you can say dim=-1 for the final dimension.
Either would work. Then we simply round our output, so return torch.round of the concatenated tensor with decimals=4, and we're done. We can see that it works. If you found this helpful, definitely leave a comment, or leave a comment if there's anything else you'd like me to go into more in depth on. Now that we have this multi-headed self-attention class done, we can use it as a layer when we code up the transformer block in the next problem. So definitely check that one out next, as it's our next step in getting closer to having a working GPT all the way from scratch.
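Putting the pieces described above together, a minimal sketch might look like the following. The course's real SingleHeadAttention class is provided in its starter code, so a toy stand-in (just a linear projection, to make the shapes work) is used here; the multi-head wrapper is the part the walkthrough builds:

```python
import torch
import torch.nn as nn

# Toy stand-in for the SingleHeadAttention class from the starter
# code; the real class computes keys, queries, and values. This one
# only reproduces the output shape: B x T x head_size.
class SingleHeadAttention(nn.Module):
    def __init__(self, embedding_dim, head_size):
        super().__init__()
        self.proj = nn.Linear(embedding_dim, head_size)

    def forward(self, embedded):
        return self.proj(embedded)

class MultiHeadedSelfAttention(nn.Module):
    def __init__(self, embedding_dim, attention_dim, num_heads):
        super().__init__()
        # nn.ModuleList registers each head's parameters with PyTorch;
        # a plain Python list would not.
        self.heads = nn.ModuleList()
        for _ in range(num_heads):
            # Each head's size is attention_dim // num_heads.
            self.heads.append(
                SingleHeadAttention(embedding_dim, attention_dim // num_heads)
            )

    def forward(self, embedded):
        # Run every head on the same input, then concatenate the
        # B x T x head_size outputs along the last dimension.
        outputs = [head(embedded) for head in self.heads]
        return torch.cat(outputs, dim=-1)  # B x T x attention_dim

x = torch.randn(2, 2, 4)                # B=2, T=2, E=4
mha = MultiHeadedSelfAttention(4, 8, 2)  # attention dim 8, two heads
print(mha(x).shape)                      # torch.Size([2, 2, 8])
```

The rounding step (torch.round with decimals=4) from the grader is omitted here since it only matters for matching the expected output.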
Okay. In this video, we're going to completely break down the transformer block and explain how every single one of its components works. We'll touch a little bit on the embeddings and on the final two layers, linear and softmax, but we're primarily going to focus on explaining the transformer block: the attention, the add-and-norm, and the feed forward. We're going to break these down one by one, starting with the add.

The add actually refers to a concept called skip connections, or residual connections, and here's how it looks visually. We have some arbitrary layer in a neural network; this might be a linear layer, this might be an attention layer. It takes in some input X, which we can see over here. But instead of just taking the output of the layer as the output, we'll also let some portion of X completely bypass the layer and get added to the output of the layer over here. In code, if we were writing the forward method for some kind of layer and wanted to incorporate skip or residual connections, we would return layer(x) + x: call the forward method of the layer, passing in X, but also add X. To get some intuition for what this means: the model will learn the right weights and biases for this layer so that the right proportion of X is either sent through the layer or allowed to bypass it. Maybe we don't actually want to transform X with this layer; maybe we want to retain X's original identity and pass most of it through, or at least incorporate it into our output for this layer. So the model can, essentially through training, through minimizing the loss on the training data, figure out how much of X should bypass the layer and how much should be sent through it. That's some of the intuition for what a skip connection is doing.
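The layer(x) + x pattern just described can be sketched as a small wrapper module. The Residual class name here is hypothetical, not from the video; it just packages the idea:

```python
import torch
import torch.nn as nn

# A hypothetical wrapper illustrating the residual pattern: the
# output is the layer's transformation of x plus x itself.
class Residual(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        return self.layer(x) + x  # the "add" in add-and-norm

# The wrapped layer must preserve the shape so the addition works.
block = Residual(nn.Linear(8, 8))
x = torch.randn(4, 8)
print(block(x).shape)  # torch.Size([4, 8])
```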
However, why do we actually need to do this addition? Because technically, shouldn't the model just be able to learn the right weights and biases in this layer so that it doesn't change X at all? Why do we have to add X to achieve our desired results? Well, there's actually another big benefit of skip connections, and it's the main reason they were originally added to deep neural networks, and that has to do with solving the exploding or vanishing gradient problem, also called the exploding or vanishing derivative problem. We know that calculating derivatives and gradients during training is very important for gradient descent, for minimizing our loss by updating our parameters. However, something that can occur with these super deep neural networks, networks with tons of layers left to right, is that gradients can become either super big or super small. Recall our update rule for gradient descent: for any weight or parameter in a neural network, the new weight is just equal to the old weight minus alpha, the learning rate, times the value of the derivative or gradient. If this derivative value is too small, if it's vanishing and going basically to zero, then this whole term goes basically to zero and the new weight is almost the same as the old weight, so we were essentially not able to update the parameter at all. Similarly, if the value of the derivative is way too large, then the new weight is going to be drastically different from the old weight, and that won't be great for the neural network's performance.
So we would like to mitigate this problem in some way or another, and it turns out that this simple addition does end up reducing it. Recall from calculus that if you have some function f(x) that is just the sum of two other functions, g(x) + h(x), then when you take the derivative, f'(x) = g'(x) + h'(x). So when the model calculates all the derivatives, when we call loss.backward() in PyTorch, the model will have an additive term when calculating the necessary gradients for this layer. That significantly helps reduce the exploding and vanishing gradient problem: the fact that we're doing an addition instead of a multiplication means the gradients don't drastically compound, drastically increasing or decreasing in value, as the network gets deeper and deeper. So the short answer is that incorporating some kind of addition seems to keep the gradient values from getting too big or too small. Researchers have found this time and time again with neural networks, and that is why skip and residual connections are included in the transformer.
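In symbols, the sum rule mentioned above applies directly to a skip connection. Writing the layer's transformation as F(x):

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x} = F'(x) + 1
```

so the gradient flowing backward through the layer always carries an additive identity term of 1, no matter how small F'(x) becomes; that is the additive term that keeps the gradient from vanishing as layers stack up.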
So the norm in the transformer block refers to something called layer normalization. This is a module of its own in PyTorch that you can instantiate by saying nn.LayerNorm and passing in, in our case, the embedding dimension. When you tell the LayerNorm that you're passing in the embedding dimension, that is the dimension along which it will normalize.

So what does it actually mean to normalize, in statistics? It essentially means to recenter our data so that it all revolves around the mean, so it will look kind of like a bell curve where the mean is the most probable data point. The way we normalize data is that for every single data point x, you subtract out the mean and divide by the standard deviation, and that recenters your data around the mean and gives it that shape. However, this would kind of defeat the whole point of a neural network if there were nothing learnable or trainable about it; it would just be a direct formula for this layer that restricts the output to have this shape. So, to still give the neural network some learning capacity, for every data point we take this (x minus mean) divided by standard deviation, multiply it by some other number gamma (the symbol is not super important), and add some other number beta. These two are adjustable and learnable across the iterations of gradient descent that we perform.

But the question should be: why does this actually improve the performance of neural networks? It has been found to increase the training speed of neural networks and make deep learning far more effective. Researchers are actually still not entirely sure why normalization tends to improve the training of neural networks, but their current hypothesis is this. We know that the neural network starts off with totally random parameters, totally random weights, and to some extent our data could be random as well. And if, during training, there are drastic shifts in the nature of the data, meaning in the two main attributes that characterize data, the mean and the standard deviation, if these change drastically in random ways, then neural network training tends to be really slow and just not as effective in terms of convergence. So by centering the data using the mean and the standard deviation, while still adding in some learnable parameters so this isn't too restrictive, researchers have found that the performance of the neural network does increase, and that's why layer normalization is used in a transformer.
One additional thing I just want to clarify is what data we're actually normalizing. The input to this layer would be something that's T by E, where T is the time step, the length of some kind of sequence like our sentence, say the number of words in a sentence, and E is the embedding dimension. You can think of E as the size of the feature vector for every token: for every word we have a vector that encapsulates the meaning of that word, our embedding vector of size E, or size embedding dim. When we say we're doing layer normalization, it means we normalize the actual features, not the time steps or the batch dimension. If we had multiple examples being fed in, so batch by T by E, we would still normalize along the embedding dimension: for every single time step we have a vector, and we normalize along that dimension.
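A quick sketch of normalizing along the embedding dimension, as described above; the sizes are illustrative:

```python
import torch
import torch.nn as nn

# LayerNorm over the embedding dimension: each token's E-sized
# feature vector is normalized independently, across every batch
# element and every time step.
B, T, E = 2, 3, 8            # batch, time steps, embedding dim
norm = nn.LayerNorm(E)       # normalize along the last (E) dimension
x = torch.randn(B, T, E)
y = norm(x)

# Each token's vector now has (roughly) zero mean and unit variance;
# gamma and beta are norm.weight and norm.bias, learned in training.
print(y.shape)               # torch.Size([2, 3, 8])
print(y.mean(dim=-1))        # close to zero for every (batch, step)
```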
Okay, masked multi-headed attention; let's explain how this works. During training, let's say the model is being fed an example, this sentence: "write me a poem". We'll quickly explain what the masked part means. Because this model is always predicting the next token in the sequence, the model does not get to look at future tokens. So when the model is fed a sentence like this, there are actually a few training examples within it. The first one is that given "write", "me" comes next. The next one is that given "write me", "a" should come next. And finally, the model sees that given "write me a", "poem" should come next. The masked part means that when the model is making its next-token prediction, given all the tokens before it, the model doesn't actually get to see what comes next. So if the model is tasked with predicting "poem", and the model has the context "write me a", then in our code, in our tensors or matrices, we would mask out the word "poem" so that the model cannot see the answer before it has actually made its prediction. That's what the masked part means. Now, let's explain the multi-headed attention part, which is definitely far more significant.
So, when the model receives a complicated instruction like this, the model needs to know what part of the input to pay the most attention to, because not all words in the input are equally important. Moreover, there are actually relationships between every pair of tokens. We know there's a strong relationship between this pair of tokens, "write" and "poem", because what are we writing? We're writing a poem. The model needs to pay attention to that. We're not writing a book, we're not writing a script, we're not writing a movie; the model needs to know what to write. So this is clearly an important pair of tokens. What does attention do? Attention is the model letting these tokens talk to each other, and every single pair of tokens will be considered until the model has figured out which pairs of tokens are important and which ones aren't. Then the model is able to generate some sort of new feature vector, instead of the embeddings that were initially passed in, and the model now knows the important parts of the input to focus on. So attention, if we had to summarize it, is a communication mechanism. Attention, at a high level, is the model letting the tokens talk to each other until the right relationships between them are learned.
And the multi-headed component refers to the fact that the model performs this attention process in multiple attention heads on the same input. The same input goes into this attention head as well as into further attention heads; we might have three, four, five, or six, and so on. The model will then concatenate the output of all the attention heads and pass that to the next layer of the neural network. Attention is a highly powerful mechanism for the model to gain a deeper understanding of the language, and because it's so powerful, we want many different heads that each learn their own weights and each operate on the input. This head of attention operates on the entire input and gets to learn its own weights and biases; the entire input is also passed into this other head of attention, which learns its own set of parameters. So by exploiting multi-headed attention, the model is able to learn and understand language at a far deeper level.
Okay, feed forward. It might actually surprise you that feed forward is extremely simple: it's just a traditional, or vanilla, neural network. That means there won't be any attention in this network, nothing complicated except linear layers: linear layers with an arbitrary number of nodes or neurons, plus some nonlinear activations like the ReLU activation, and we may have dropout as well. But it is still a traditional, fully connected neural network once we toss in some nonlinearities and some dropout. And it turns out that first having the communication in the attention layers, and then, after these tokens have paired up and the model has figured out which tokens are relevant to focus on, doing a bunch of computation, essentially a very complex mathematical formula with weights, biases, and a bunch of matrix multiplication (which, remember, is essentially just doing linear regression), is actually highly effective for the model's performance and its ability to ultimately predict the next token.
token. So the last part of the transformer architecture and specifically the decoder is actually the linear and softmax components. And if
you're coding up the transformer block, you won't have to worry about this yet, but you absolutely will once you actually code the GPT class. So the
whole point of that final linear layer in softmax is to actually get an interpretable prediction. So for that
interpretable prediction. So for that linear layer, it will actually look something like this. It would be nn.linear of the attention dim. So
nn.linear of the attention dim. So
attention dim as that's kind of the new dimension after we've done all of this attention after the embeddings. So it
would be the attention dim, but that's actually not the most important thing.
The output features or the number of neurons in this layer should actually be vocab size. And the reason for that is
vocab size. And the reason for that is for every single time step that this model receives in every single token in the input sequence, the model wants to predict what token comes next. And
that's going to be a bunch of probabilities, right? So we want to get
probabilities, right? So we want to get a vector of size vocab size where we can interpret each number in this vector, each entry in this vector as the
probability that the token or character or word corresponding to that index is going to come next in the sequence. And
then to squash those numbers between 0 and one and actually make them all add up to one so we can interpret them as a series of probabilities. That's what the softmax function is for. And that will
be the decoder transformer. So, if
you're following along with the code, you now have the information you need to code up the transformer block class, which will actually you'll be given the multi-headed attention and feed forward
classes. And then you can go ahead and code the GPT class, where you'll be given the block class. You train the model, and then we'll finally code up generate, where we will finally see our trained language model generate text that looks pretty good and pretty similar to English. So, I recommend jumping into the code next. It's finally time to code up the transformer class. This is actually the last class, and maybe the most important class, that we're going to write in defining the GPT model. The next problem is going to be writing the GPT class, and we're going to use this transformer block class there. And this transformer block class that we're about to write is going to make use of the multi-headed attention class that you've previously written.
So the transformer block is actually this gray rectangular block that is repeated Nx times in this diagram, and it is itself a giant neural network layer that we're going to use in the GPT class. There is also a really in-depth explanation of the transformer block over here that breaks it down layer by layer and explains what every piece of this block is doing; I recommend checking that out. But in this video, we're also going to go over how the transformer block works at a general level, and we're going to code it up as well. This transformer block class is going to subclass nn.Module, and it's going to have a forward method, because we're going to pass some sort of input into the transformer block and get some sort of output. And it needs to return a tensor that is B by T, where T is our context length, by model dim; let's just call that capital D. One thing I want to clarify is that in the previous problems when we were introducing attention, I made a distinction between embedding dim and attention dim, but they're actually the same thing. There's just one giant parameter called model dim, and they're the same number. Model dim mod num heads is zero, and the head size for multi-headed attention is model dim divided by num heads. So let's make some sense of this example. We don't really need to worry about the numbers; we just need to understand the shapes. The input, we call that embedded, since it's actually the output from the embedding layers of this neural network, right before the transformer block starts in the diagram that we just showed. So the input is B by T by D, where D is the model dimensionality, and the output is also B by T by D. We're told that the model dimensionality is four, as we can tell from these four columns over here in the input and the output, and then we can infer from this that T equals 2, the context length is two, and we can also infer that the batch size, the outermost dimension, is two. So
now let's jump into how the transformer block works. One small change from this diagram is that this middle multi-headed attention part, which is actually factoring in information from the left part of the transformer diagram, we're not going to incorporate at all. Transformers like ChatGPT only use this right part of the transformer, which is actually called a decoder, although that term is not super important for now. They only use the decoder and don't include the left part of the transformer at all, so we're just going to completely ignore that. The next thing we need to explain is that in the original transformer diagram, they described this as add and norm over here, and the same thing over here. But it would actually be more accurate to say norm and add: researchers have found that norming first and then doing our skip connection, or add, works better. We'll definitely explain what these actually mean in a sec, but norm first is actually what's done.
So let's explain the norm layer. This refers to something called layer norm, which you can actually use as a layer: make an instance variable inside the constructor for the transformer block class, and that's going to be nn.LayerNorm, and you simply pass in the model dimension. This tells the layer that whatever tensor is passed in, and it's B by T by D, go ahead and do the normalization along this last dimension. And what we mean by normalize is that when you have some sort of data, to normalize it just means to center it around some sort of fixed mean and standard deviation. So this is kind of an example of normalizing data. We can see that the data is all centered around the mean, the symbol mu. The mean is the center of the data, and the standard deviation reflects the width, how the data is stretched out from left to right. And researchers have found that this really does boost the performance of transformers.
And researchers are actually not entirely sure why layer normalization seems to improve the training of neural networks. But it seems to be something along the lines of: we start off with some random distribution of the weights in the neural network, and we don't want them to change so drastically that it makes our training process really unstable. So next let's explain the add, also known as skip connections. One additional thing that is mentioned in the problem is that the norm is actually going to come before the attention, instead of attention before the norm. So now we can actually explain this add in add and norm, also known as skip connections. We have our input to the transformer block, X, which is going to pass through layer norm, and then the output of layer norm goes into multi-headed self attention. But then we actually add the unchanged X all the way back to the output of multi-headed self attention, and this is the output of at least the first part of the transformer block over here. And how we would actually do this in the forward method is we would just say something like X plus multi-headed self attention of the norm. Remember, we're going to norm first and pass X into the norm. And again, this is covered in more detail in the transformer block background video, but the short explanation for why this skip connection over here, letting some portion of the input bypass these layers entirely and actually adding X to the output over here, is that having some sort of additive term, instead of having only multiplication in this neural network, seems to actually smooth out our gradients and slightly mitigate the exploding and vanishing gradient problem during training. So
then the output of this first part of the transformer block over here is then passed into the second part of the transformer block over here. And again, we're actually going to apply the norm before the feed forward, even though in the original diagram it says feed forward and then norm. So we're going to do norm, then feed forward, and then add this component, the one which came out of the first part of the transformer block, back in over here. And that's our second skip connection. So, what's the whole point of this feed forward component of the neural network? Well, this is also called a multi-layer perceptron: multi-layer because we're going to have multiple linear layers and then nonlinearities; that's maybe the perceptron part of this term, if you want to think of it that way. It's also
called a vanilla or just a standard neural network. Well, if the attention component of the neural network over here is our communication mechanism, where we let the tokens talk to each other (and again, I highly recommend understanding multi-headed attention through those coding problems before we move into this transformer block problem), then the feed forward part of the neural network is our computation mechanism. This is where we have a bunch of neurons in each linear layer of this feed forward neural network that are learning a bunch of weights and biases, w1 times x1 plus w2 times x2, doing a ton of linear regression. So after we've let the tokens talk to each other, after we've let the tokens do their communication, we want to let the tokens have some time, as they pass through these linear layers, and of course nonlinearities like ReLU as well, to actually do computation. And this is all going to factor into the model's prediction in the final layer or two, where the model predicts the next token. And just
looking ahead a bit, this is not going to be part of the code for this problem.
It's going to be part of the next problem, where we actually code the full GPT class. But just looking ahead a little bit: after the model has done all this communication and all this computation in this transformer block (and in the GPT class, by the way, the transformer block is going to be repeated some number of times, so we would take the output of the transformer block, pass it back in, and do this some number of times, maybe six times, maybe twelve times), after the model has done all this learning within the transformer block, let's take a look at what's to come. There's going to be a linear layer where in features is model dim. This should make sense given the dimensions used in the transformer block. And then we're actually going to project to a dimension of vocab size, so this linear layer has vocab size different neurons. We're going to have essentially vocab size different numbers for every single token at every single time step. And the reason we have that many different numbers being predicted is because each number is going to correspond to a probability, the chance that the corresponding token comes next. And we know we have exactly this many different tokens in our vocabulary, so any one of them could come next in the sequence as this model is learning to predict the next token. So we're going to need that many different numbers to get the full prediction. So then just to clarify, looking ahead a little bit as to how this model is going to be trained once we've fully written the GPT class: we know we have a B by T input, a
batch of sequences, each of length capital T. And if you've solved the GPT dataset problem (I highly recommend solving that one before doing any of these transformer problems), we know that we have B by T labels, because for every context, for all the sequences of tokens, the model is trying to predict the next token. So the correct answer, the labels, is just the tokens offset by one, and that becomes clear in the GPT dataset problem. And then our output is B by T by V: for every single batch, and for every single token, every single time step in the sequence, we have a vector of size V, the vocabulary size, which we're thinking of as a list of probabilities. The softmax layer over here is going to make all the numbers between 0 and 1, so they can actually be interpreted as probabilities. We have a bunch of probabilities for which token is going to come next in the sequence; that's the model's prediction. And then the loss, or error, is going to be calculated using two things: the model's prediction of the probability for what token comes next, which is in this vector of size V, and the token we actually know comes next, which is in our labels, extracted from the raw dataset. That is used to calculate the loss. We drive the loss down during training using an optimizer, following the gradient descent algorithm. And in the end, we have a model that can predict the next token really well.
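A minimal PyTorch sketch of that loss calculation, with made-up sizes; one thing worth noting is that PyTorch's cross entropy is fed the raw scores (logits) and applies the softmax internally:

```python
import torch
import torch.nn as nn

B, T, V = 2, 4, 10  # batch size, context length, vocab size (made-up values)
logits = torch.randn(B, T, V)          # model output: one score per vocab token
labels = torch.randint(0, V, (B, T))   # the correct next token at every time step

# cross_entropy expects (N, V) scores and (N,) labels, so flatten batch and time
loss = nn.functional.cross_entropy(logits.view(B * T, V), labels.view(B * T))
print(loss.item())  # a single number that the optimizer drives down
```

Every one of the B times T positions contributes its own next-token prediction to this single loss number.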
Before we jump into the code, I just wanted to clarify one thing. In the constructor for the transformer block class, you will definitely need to make two different instances of nn.LayerNorm, one for this layer norm and one for this layer norm, so that we can learn different parameters across training. So now let's code this up. We know the first thing we need is our instance of multi-headed self attention. So we can say self dot multi-headed self attention, using the class given below. The only things we really need to pass in to that constructor, well, why don't we check the constructor for multi-headed self attention? We just need to pass in model dim and num heads, as is visible over here. So scrolling up, we can pass in model dim and num heads. The next thing we're going to want to do is make our first layer norm. So self dot first layer norm is nn.LayerNorm, and this takes in model dim, the channel or dimension along which to normalize. Then we need our second layer norm, so second layer norm, and that's nn.LayerNorm, and again we pass in model dim. And we're also going to need a feed forward layer, right? The vanilla neural network. So we can say self dot feed forward equals the vanilla neural network class. This is actually given below if you want to check it out; it's just a couple of linear layers with a ReLU function thrown in between. And then we go ahead and again pass in model dim. And that's it for the constructor.
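Here's a sketch of where we're heading, the constructor plus the forward method we're about to write. The course provides its own MultiHeadedSelfAttention and VanillaNeuralNetwork classes; as stand-ins so this sketch runs, I use PyTorch's built-in nn.MultiheadAttention (which, unlike the course's masked version, is not causal here) and a small two-layer network with a hidden size of 4 times model dim, a common convention but my assumption, and I leave out the exercise's final rounding step:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, model_dim, num_heads):
        super().__init__()
        # stand-in for the course's MultiHeadedSelfAttention(model_dim, num_heads)
        self.mhsa = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.first_layer_norm = nn.LayerNorm(model_dim)
        self.second_layer_norm = nn.LayerNorm(model_dim)
        # stand-in for the course's VanillaNeuralNetwork(model_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(model_dim, 4 * model_dim),
            nn.ReLU(),
            nn.Linear(4 * model_dim, model_dim),
        )

    def forward(self, embedded):
        # first part: norm first, then attention, then the skip connection (the add)
        normed = self.first_layer_norm(embedded)
        attn_out, _ = self.mhsa(normed, normed, normed)
        first_part = embedded + attn_out
        # second part: norm, feed forward, and the second skip connection
        return first_part + self.feed_forward(self.second_layer_norm(first_part))

x = torch.randn(2, 3, 8)  # B=2, T=3, model_dim=8 (made-up values)
block = TransformerBlock(model_dim=8, num_heads=2)
print(block(x).shape)     # same B by T by D shape in and out
```

Note the two separate nn.LayerNorm instances, matching the clarification above.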
Now we can go ahead and write the forward method. We know that for the first skip connection, the first part of the transformer block, we just want to take embedded, which is like the X that we talked about earlier, and then do self dot multi-headed self attention, and inside that invoke our first layer norm, self dot first layer norm, passing in embedded. So this does reflect the diagram that we talked about earlier. And then we can save this in a variable called, like, first part. This is the first part of the transformer block, the part that doesn't have the feed forward layer. And then what we're going to do is take first part and add the result of the second part. So then we can say self dot feed forward, and here's where we'll invoke our second layer norm, self dot second layer norm, passing in the first part. So this also reflects the diagram that we talked about earlier. And this is exactly what we want to return; we just need to round it. So we can say that this is our result, like res, and then we just round it: we return torch.round of res with decimals equals 4. And we're done, and we can see that the code works. We've finally written the transformer block. And in the next problem, we're going to code up the GPT class. So that's definitely going to be an interesting problem, so definitely
check that one out. Next, let's give a high-level explanation of the transformer architecture. Specifically, we're going to focus on the decoder for this video, not the encoder, since that's what GPT uses. So why don't we just start with the embedding layers. We're actually going to have two embedding layers: one of them is called the token embedding layer, and the other is called the positional embedding layer. So we'll discuss this one over here. The goal of these embedding layers is to learn, or train, feature vectors. Specifically for the token embedding layer, the goal is to learn a feature vector for every single token in our vocabulary. So let's say we are doing a word-level language model. We can think of this embedding layer as a lookup table, and that lookup table will be of size vocab size by embedding dim, where vocab size is the number of different words this model can recognize, maybe the total number of unique words in the body of text that we're training the model on. Embedding dim is our choice, and the higher we make embedding dim, the more complex of a relationship the model can learn. Next
is the positional embedding. This is also another lookup table, except this one is going to be context length by embedding dim. You may have heard of context length discussed in the context of large language models; for example, GPT-4 Turbo has a context length of 128,000 tokens. And context length is essentially how many tokens back, when you're talking to these language models, the model can read in the sequence. Because at some point it has to cut off and say, okay, the model's not going to factor in anything that was this far back. But how far back can the model actually read? That is the context length. And similarly during training, when we are feeding in batches and batches of training examples, we will actually be feeding in B by T tensors, where B is the batch size and T is the length of each, say, sentence in this training batch, and T will be the context length during training. So the positional embedding is essentially another lookup table that has context length rows in it. The rows, or indices, of that table go from zero to context length minus one, and the number of columns, the size of each row, is also the embedding dimension, where we are essentially going to learn a vector of size embedding dim for every single possible position.
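The two lookup tables just described can be sketched in PyTorch like this, with made-up sizes; we'll explain the positional indexing next:

```python
import torch
import torch.nn as nn

vocab_size, context_length, embedding_dim = 50, 8, 16  # made-up values

token_embedding = nn.Embedding(vocab_size, embedding_dim)         # vocab_size rows
position_embedding = nn.Embedding(context_length, embedding_dim)  # context_length rows

tokens = torch.randint(0, vocab_size, (2, 8))  # a B=2 by T=8 batch of token ids
positions = torch.arange(8)                    # 0, 1, ..., T-1

# look up a vector for each token and for each position, then add them
embedded = token_embedding(tokens) + position_embedding(positions)
print(embedded.shape)  # B by T by embedding_dim
```

The position vectors broadcast across the batch dimension, so every sequence in the batch gets the same positional information added in.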
So what are we actually passing in to the positional embedding layer? Because we know that for the token embedding layer, we're literally just going to pass in the input tokens themselves, and the embedding layer here is going to look up the feature vector for each token, because we have vocab size number of rows, where each row corresponds to a token in our vocabulary. However, if the rows here are going from zero to context length minus one, then that's actually corresponding to positions, or indices, in a sequence. If we sent in "write me a poem", we would say that "write" is at position zero, "me" is at position one, "a" is at position two, and so on. So what are we going to pass into the forward method for the positional embedding layer? We're actually going to pass in torch.arange of T, where torch.arange generates a tensor of size T, sorted in order, with values going from zero to capital T minus one. And that's how we would want to index our embedding lookup table. And once we have the token embeddings generated and the positional embeddings generated, we simply add those two tensors together before we feed them into the transformer block. Next is the transformer block. Once we have added our two embeddings together, the token and position embeddings, we're going to go ahead and pass that in to the transformer block, this gray rectangular box over here. And this part actually does the heavy lifting within a transformer. This is what allows the model to predict the next token in the sequence so well. Now, we won't go super in-depth about this entire box over here (I actually have a separate video on that), but we will give a high-level explanation of what it's doing and how to code it up. The most important part is probably the masked multi-headed attention. That might sound like a mouthful, but this is essentially the communication part of a transformer. We know that we pass sequences of words into transformers. So let's say we passed in "write me a poem". At some point the model needs to realize which parts of this sequence of words to pay attention to, and specifically which pairs of tokens are more important. So in the attention layer the model will consider every single pair of tokens, this pair, this pair, this pair, and actually every single pair of tokens in any sequence, and the model will figure out which pairs of tokens are relevant to each other. We know that "write" and "poem" must be relevant to each other, because how else would the model know what to write? The model shouldn't write a book; the model shouldn't write a movie. Rather, the model needs to know that a poem needs to be written. So the model will let these tokens talk to each other, so to speak. The model will consider every single pair of tokens until the importance of every token pair relationship is determined, and then the model can focus on certain parts of the input and appropriately predict the next token in the sequence. The other most
important part of the transformer block is the feed forward component. The feed forward component is actually just a bunch of linear layers stacked together, as well as nonlinearities like the ReLU function. And if attention is communication, we can say that the feed forward part, or vanilla neural network part, is computation. The model is going to learn a ton of weights and biases, similar to a linear regression, in order to encapsulate the entire relationship between all the words in the input, so that the model can appropriately predict the next token. So that's the transformer block; it does the heavy lifting within the transformer. However, if you were coding up the GPT class, let's say we are coding up a class called GPT, you'll actually be given the transformer block class, and you can treat the transformer block class as a black box. However, when you are coding up the full transformer in the GPT class, you're going to notice this Nx in the diagram, and that actually indicates that we're going to have many blocks in sequence. So what actually occurs in a transformer is that the output of the transformer block, so the output over here, is actually passed back in to the transformer block N number of times, where N is essentially the number of transformer blocks, and this is predetermined beforehand. Obviously, as we increase the number of blocks, the model gets more and more complex, and the model can learn a more and more complex relationship with more parameters. But
to actually implement this into code, we're going to use something called nn.Sequential. So in your GPT class, in the constructor, you will define an nn.Sequential, and you can treat it just like a Python list. You can call .append on an nn.Sequential, with the only restriction being that the only thing this list can contain is other neural network layers. And then once you call the forward method of the nn.Sequential object, it will actually call the forward method of every single block that was passed in to this list, in order from left to right.
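Here's a sketch of that mechanism. In the real GPT class each element would be a transformer block; plain Linear layers stand in here just to show how nn.Sequential chains the forward calls:

```python
import torch
import torch.nn as nn

model_dim, num_blocks = 8, 4  # made-up values

# build the stack with .append, treating nn.Sequential like a Python list;
# in the GPT class each element would be a TransformerBlock(model_dim, num_heads)
blocks = nn.Sequential()
for _ in range(num_blocks):
    blocks.append(nn.Linear(model_dim, model_dim))

x = torch.randn(2, 3, model_dim)
out = blocks(x)  # calls each layer's forward in order, feeding outputs forward
print(out.shape)
```

One call to blocks(x) replaces the whole loop of passing the output back in N times.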
So that entire process of passing the output of one transformer block all the way back in would be handled by that line of code, and you will find that incredibly useful when coding up the GPT class. One additional
component that's actually not included in this original famous transformer architecture is that in GPTs, something that has been found to work really well is, after the last transformer block, so right here, an additional layer norm. So that's an additional nn.LayerNorm, and we know roughly, at a high level, the benefits of a layer norm: by making sure that our data is centered around the mean and has an appropriate standard deviation, we find that we don't have such a crazy, extreme range of values in our neural network, and the training process is actually much smoother. So we're going to have one additional layer norm before the linear layer and the softmax layer. And the linear layer is something we can think of as a vocabulary projection layer, because it will have vocab size neurons; out features for that linear layer would be vocab size. And that's because the model needs to figure out what the next token is, and it's going to output a bunch of probabilities, specifically vocab size number of probabilities, as we will have a probability for every single token in our vocabulary, since every single token in the vocabulary could technically come next. And of course, last, we will have the softmax layer over here, which will squash all our values to be between zero and one, so that we have a true probability distribution.
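A tiny demonstration of that squashing, with made-up scores standing in for one row of the vocabulary projection layer's output:

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.1, -1.0])  # raw scores from the linear layer
probs = torch.softmax(scores, dim=-1)

print(probs)        # every entry is now between 0 and 1
print(probs.sum())  # and they add up to 1: a true probability distribution
```

The largest raw score stays the largest probability, so the model's preferred next token is unchanged; softmax just makes the numbers interpretable.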
Next, I would highly recommend jumping into the code and coding up the GPT class. Your job will be to write the constructor, or init, as well as the forward method. And once this is done, you can actually train this GPT and then use it for text generation. So, I highly recommend jumping into the code. We're finally ready to code up the GPT class, and once we've written this class, in the next problem we'll actually get to see the model generate text. This neural network that we're about to code up follows the architecture that almost all LLMs use, including ChatGPT as well as many others. And we are going to be given a class called transformer block in the starter code, so that we don't have to code up this entire gray rectangular box that is repeated Nx times in the diagram. That class is in turn going to make use of other classes, like the multi-headed attention and feed forward neural network classes. So you might have previously coded up the transformer block in a different problem; now we're actually going to use the transformer block to code up the entire architecture from start to finish. And this diagram is from Google's 2017 paper, Attention Is All You Need, which is one of the most influential papers in machine learning to date. Usually I would say papers are not super important to actually read, but if you're interested, definitely check that one out. And the forward method should return a batch size by context length by vocab size tensor. We can see that over here, and we'll explain why that's the appropriate shape of the return tensor in a second.
And the output layer will have vocab size neurons. So that's specifically talking about this linear layer over here, the one in purple, and each neuron in that layer corresponds to the probability, or likelihood, of a particular token coming next. So just like in the handwritten digit problem, if you solved that one, we have 10 output neurons corresponding to digits 0 through 9, and the number in each output neuron represents the probability that the passed-in image is that digit. So each neuron here is like the probability that the token corresponding to that neuron comes next. Let's say we had V, for vocab size, total words or characters or tokens in our vocabulary. Then we can think of our neurons as being indexed from zero to V minus one, each holding the probability that that particular token comes next. So let's
take a look at the inputs. The model's constructor takes in vocab size. This is just the number of different tokens the model recognizes, and once we get to our embedding table, we'll see that it's actually the number of rows in that table. Context length is the number of tokens back the model can actually read. So when you're talking to a large language model, at some point in the conversation it's not factoring in stuff that's really far back. So what's that cutoff point? How many words or tokens are we talking about?
That's the context length. The model dim is the same thing as the embedding dim or the attention dim. In previous problems, if you saw a separate number for embedding dim and attention dim, that was just to help with understanding that those are different layers in the network. But in reality, when we have multi-headed attention, we just say that the embedding dim is equal to the overall attention dim, where the overall attention dim is the number of heads times each individual head size. I definitely recommend reviewing multi-headed attention if you're a little rusty on that; I can link a problem or a video for that in the description. But we just refer to the embedding dim or the attention dim as the overall model dimensionality. The larger this number is, the more complex the model is. Just for reference, the smallest GPT-2 model, the one before GPT-3, had a model dim of 768, and this obviously got even bigger with GPT-3 and then GPT-4. Num blocks is how many instances of the transformer block class we want to make. We know in the diagram that it's repeated N times, so what is N? That's num blocks. And then num heads: we need to know how many heads of self-attention we're going to do if we're doing multi-headed self-attention.
And then context: this is what the forward method takes in. In order to actually train the model, we need some sequence of text; we can see an example over here. Or, if we were using the model for generation, before we generate text the model needs something to start with. That's what we store in context, and that's what the forward method of the GPT class takes in.
Okay, let's explain the input example. Here we're just defining those constants that the constructor takes in. Those aren't super important to focus on. However, I do want to explain this input, its shape, and why it should make sense. We know the input to the forward method is supposed to be B by T because we train these neural networks in batches. We'll have many sequences passed in in parallel, and how many we're passing in in parallel is the batch size, capital B, which is just one here. There's only one sentence, one sequence, but based on the shape of this 2D list or 2D tensor, we can tell that there is still an additional batch dimension. And then T: during training, T is just the context length. We can see that five tokens were passed in, "with great power comes great," and the model is supposed to make predictions for the next token given all the training examples within this sequence of tokens. Given the context of "with," the model will predict what comes next. Given the context of "with great," the model will predict what comes next, and so on, all the way until the model has the entire input, "with great power comes great," and it predicts what comes next. We see that the context length is five. And it makes sense: why would capital T, how long the sequence is, ever be greater than five? The context length is the greatest number of tokens the model can even factor in to its next-token prediction. "With great power comes great" is already five tokens in a row, so the model couldn't factor in anything beyond five tokens. If we added another token here, the model would have to drop the first token to predict what comes next. So that's just an example of context length during training. So let's
take a look at this dictionary now. We're going to assume that the model is consistently using this internal mapping. We know models don't actually take in strings; each of these tokens needs to be converted to an integer. That's how we feed strings into models for natural language processing. And any model's mapping is pretty arbitrary; as long as it's consistent, that's all that matters. So we can say "with" maps to 0, "great" maps to 1, "power" maps to 2, and so on. Then in the final layer of the neural network, where we have vocab size or capital V neurons, we'll say that the zeroth neuron corresponds to "with," the first neuron corresponds to "great," and so on, all the way to the V minus 1 neuron.
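A quick runnable sketch of this encoding step (the dictionary is the example mapping described here; `word_to_id` and `context` are names I'm choosing for illustration):

```python
import torch

# The consistent (but arbitrary) word-to-integer mapping from this example.
word_to_id = {"with": 0, "great": 1, "power": 2, "comes": 3, "responsibility": 4}

# Encode the input sequence "with great power comes great" as token IDs.
tokens = "with great power comes great".split()
context = torch.tensor([[word_to_id[w] for w in tokens]])  # extra brackets = batch dim

print(context)        # tensor([[0, 1, 2, 3, 1]])
print(context.shape)  # torch.Size([1, 5]) -> (B=1, T=5)
```

Note that "great" appears twice in the sequence, so the ID 1 shows up twice; the mapping only has to be consistent, not unique per position.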
Okay, so for the returned output, first let's explain its shape, B by T by V. We know the batch size was one, as we saw in the input. So it makes sense that this whole 2D tensor over here is wrapped in another one; we can see that there's another pair of brackets, so the batch size of 1 is still maintained. For every element in the batch dimension, we get a capital T by V tensor. That's because for every single time step, for every single context or training example, the model generates a vector of size capital V. And the reason we have a vector of size capital V is because that's essentially a list of probabilities, a probability for every possible token that could come next. We'll actually see that the first row in this output tensor is the model's probability output, a vector of size V. We can see that there are V columns here, and V equals 5. There are five unique words in "with great power comes responsibility"; those are our five tokens. This dictionary has five keys in it, so our vocab size is clearly five. Those are the only five tokens that this dummy model is dealing with. So
back to the first row. The first row in this output tensor is the model's prediction, a vector of size V, for the first context. And the first context is just looking at "with." Then the second row, which is over here, is 0, then 0, and then we can see a 0.1. The second row is the model's prediction, another vector of size V, for the second training example in the context. The second training example is "with great," and the model is trying to predict what comes next. We know "power" comes next, but the model doesn't actually get to see that. The reason the model doesn't get to see the future tokens, the actual answer for each training example, is because in the self-attention implementation (and again, I highly recommend reviewing self-attention and multi-headed attention if you're a little rusty on that) we actually apply a mask in the code. If you look at the single-headed attention class, which the multi-headed attention class makes use of, we apply a mask there and mask out those future tokens so that the model can't look at the answer for what it's trying to predict. So let's
actually think about why this output would make sense. Let's say the model has actually been trained, so by this point it's pretty good at predicting the next token. It's not just an initial neural network, before training is done, whose weights and parameters are completely random. Let's assume the model's been trained. For the first example, we know the only token the model is looking at is "with." So the model, in a trained state, will hopefully be pretty good at predicting that the word "great" comes next. And we know "great" corresponds to index 1, and we can actually see that over here: the model is saying there's an 80% chance that "great" comes next if you're just looking at the word "with." There also seems to be a 10% chance that the word "power" comes next. Let's see if that makes sense: "with power." That could be the start of a logical sentence, so it seems reasonable for the model to still assign that some chance. And then the final column corresponds to "responsibility"; it's the fourth column using zero indexing, and there's a 10% chance that "responsibility" comes next. That would be "with responsibility," which seems reasonable too. But thankfully, the model is still saying there's an 80% chance that "great," essentially token number one, the 1 column, comes next
given the context of "with." Let's take a look at the size-V vector for the second row. The second training example is the context "with great," and hopefully the model does a pretty good job of predicting that "power" comes next. We see there's a 90% chance for the third column and a 10% chance for the last column. The third column corresponds to index 2 using zero indexing, and that corresponds to "power." So the model is saying that with a context of "with great," there's a 90% chance that "power" comes next. The final column corresponds to token ID 4, which is "responsibility." Does "with great responsibility" make sense? Yeah, that could be reasonable; it seems like a logical sentence in English. So the model is still saying there's a 10% chance that "responsibility" comes next, but after training has been done (let's say this sentence was actually in its training body of text), the model is fairly confident that "power" comes next given the context of "with great." So that's good. I definitely recommend making sense of this row and this row, but why don't we skip to the last row and see if the model's prediction makes sense there as well. We have this vector of size V, capital V equals 5, and this is supposed to be the model's prediction given the entire context, "with great power comes great." We can see there's a 90% chance corresponding to the fifth column; with zero indexing, that's token ID 4, which is "responsibility." The model is saying there's a 90% chance that "responsibility" comes next. So we can see that this B by T by V shape does make sense. Okay. So now let's run
through the transformer architecture, so we can see how we would actually code this up. The first layer is this pink-looking rectangle: that's the embedding layer. I actually crossed out the word "output" from this layer, because this original transformer diagram created by Google was used for translation; they were working on models that could translate between different languages. When we use the transformer for generating text, the architecture is slightly different. You may have noticed, if you've seen the original version of this diagram, that we cut off what was on the left. That's because large language models like ChatGPT don't use that at all, and I explain that in more detail in my encoder-decoder course. So this first layer of the neural network, this pink embedding layer over here, is how the model learns the meanings of the different tokens or words. We know that
we have some sort of mapping where we map every single token to an integer, and for a given sentence the model receives a list where each token is represented as an integer. However, those integers are completely arbitrary, and we need the model to learn some deeper representation that encapsulates the meaning of each word. Ideally, the model would learn a vector that represents the meaning of each word. We call those embedding vectors, and they're trainable and learnable through gradient descent. We can imagine this layer as a lookup table: given a token, we look up the corresponding row for that token, and all the columns associated with that row are the embedding vector, or feature vector, learned for that word. The number of rows in this table should be the vocabulary size, capital V, because that's how many tokens we want to learn a feature vector for. And the number of columns should be the model dimensionality, since that's the dimensionality of our embeddings and how many features we have to learn the meaning of each word. So in the constructor for the GPT class, the way we instantiate this embedding layer is by using nn.Embedding. We only have to pass in two things to nn.Embedding: the first is the number of rows of the table, and the second is the number of columns of the table.
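Here's a minimal sketch of that instantiation and a lookup through it (the sizes V = 5 and D = 16 are just example values, and the variable names are mine):

```python
import torch
import torch.nn as nn

vocab_size, model_dim = 5, 16  # V rows, D columns (example sizes)

# The lookup table: one learnable row of size model_dim per token.
token_embedding = nn.Embedding(vocab_size, model_dim)

# A batch of one sequence of 5 token IDs: "with great power comes great".
context = torch.tensor([[0, 1, 2, 3, 1]])  # shape (B=1, T=5)

token_embeds = token_embedding(context)
print(token_embeds.shape)  # torch.Size([1, 5, 16]) -> (B, T, D)
```

Indexing with a (B, T) tensor of IDs plucks out one row per token, which is exactly the lookup-table behavior described above.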
Next, let's talk about the other embedding layer, which generates our positional encodings. When we have a sentence in English, it might be one we pass into ChatGPT; let's say it's a command like "write me a poem." This might sound kind of obvious, but it's something we should clarify: when it comes to teaching computers to understand language, the order of the tokens is really important. Look at the positions of each token in this sentence: this token is at index 0, this token is at index 1, and so on. If we jumble the order into something like "poem me write a," the meaning is completely lost. So we need the model to learn embeddings for each token's position as well. That means we're going to have another lookup table, where the number of rows is capital T, the context length, so the model can learn a feature vector, again of size model dim, for each token position from 0 to capital T minus 1. So when we call this layer in the forward method, what do we pass in to our positional embedding layer? It's going to be a separate nn.Embedding instance from the one over here. What we pass in is simply a vector of size capital T that has all the numbers from 0 to capital T minus 1, since that plucks out the rows of the table for each position in our sequential input string. The way you can generate that tensor is using torch.arange: it produces the numbers from 0 up to, but not including, whatever the input to the function is. So you would pass torch.arange(T) into the positional embedding layer in the forward method of the GPT class. And of course, maybe I didn't clarify this before, but just to make it super clear: the raw B by T input is what you pass in to the actual token embedding layer.
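A small sketch of the positional lookup, again with illustrative sizes and names of my choosing:

```python
import torch
import torch.nn as nn

context_length, model_dim = 5, 16

# One learnable vector per position 0 .. T-1.
positional_embedding = nn.Embedding(context_length, model_dim)

T = 5
positions = torch.arange(T)          # tensor([0, 1, 2, 3, 4])
pos_embeds = positional_embedding(positions)
print(pos_embeds.shape)              # torch.Size([5, 16]) -> (T, D)

# Broadcasting lets us add this (T, D) tensor directly to a
# (B, T, D) token-embedding tensor in the forward method.
```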
Then in the forward method, what we're going to do is just add the positional and token embeddings together, and then pass those into a series of transformer blocks. Let's treat the transformer block as a black box for now; we have an entirely separate coding problem explaining the inner workings of this box: masked multi-headed attention, the add-and-norm, the feed-forward, all the components within the transformer block. That class is given to us, so we just need to know how to instantiate it in the constructor for the GPT class and how to call it in the forward method that we're going to write. Since we're going to be instantiating a bunch of these transformer blocks, based on the num blocks parameter that was given to us, we're going to make use of something called nn.Sequential. You can instantiate this in the constructor for the GPT class and then treat it like a normal Python list: you can call the append method, with the only restriction that anything you pass in needs to be a neural network class.
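As a sketch of that pattern, the loop of appends might look like this. `DummyBlock` is a stand-in I'm defining just for illustration; in the actual problem you would append the provided TransformerBlock class instead:

```python
import torch
import torch.nn as nn

# Stand-in for the course's TransformerBlock (whose real internals live in a
# separate class); any nn.Module subclass can go into nn.Sequential.
class DummyBlock(nn.Module):
    def __init__(self, model_dim: int, num_heads: int):
        super().__init__()
        self.linear = nn.Linear(model_dim, model_dim)

    def forward(self, x):
        return self.linear(x)

model_dim, num_heads, num_blocks = 16, 4, 3
blocks = nn.Sequential()
for _ in range(num_blocks):
    blocks.append(DummyBlock(model_dim, num_heads))  # append like a list

x = torch.randn(1, 5, model_dim)   # (B, T, D) total embeddings
out = blocks(x)                    # each block is called in order
print(out.shape)                   # torch.Size([1, 5, 16])
```

Note that `nn.Sequential.append` requires a reasonably recent PyTorch (1.13 or later); on older versions you'd build the module list first and pass it to the constructor.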
It needs to be something that also subclasses nn.Module. So each time you append to this list, maybe looping over the range of num blocks, you append an instance of the transformer block. Every time you instantiate the transformer block, the only things you have to pass in are the model dimensionality, which is an input given to us, and the number of heads, which is simply the number of heads of self-attention. So you would define blocks, your nn.Sequential, in the constructor for the GPT class. Then, just to be super clear, the way you would use this in the forward method for the GPT class is: you pass your total embeddings, your positional embeddings plus your token embeddings, into the default calling method, which is essentially the forward method of the blocks. The way nn.Sequential works is that, for the list of modules you've passed in, it just calls them in order. It passes total embeddings into whatever's at the zeroth index of the list, then passes the output of that into the next transformer block, and so on, for as many elements as we have in the nn.Sequential. So you'll simply get the output like this: you pass total embeddings into blocks, and this is the output of the transformer blocks. Okay. So what's left? The bulk
of the code is done. Even though the transformer architecture as originally developed, this diagram, does not include an additional layer norm over here, after the transformer blocks and before this linear layer, it is customary to include one: an additional instance of nn.LayerNorm defined in our constructor that we then call in the forward method. Researchers have found that these large language models get much better results if you do one additional normalization, an additional layer norm, before the final linear layer. And again, for a refresher on what layer norm does, I highly recommend checking out the transformer block video, where I explain every single inner working inside the transformer block.
So the final linear layer is over here, and this is what gets the output into the shape we want, so that we can extract predictions for the next token. We know the input to it is B by T by D, where D is the model dimensionality, and we want something that's B by T by V, where V is the vocabulary size. The way we'll do that is by having a linear layer defined in the constructor of the GPT class: the in features for this linear layer will be model dim, since that's capital D, and we pass in vocab size for out features. This gets us the predictions in the shape we want. And this does make sense given the example we talked
about at the start. And lastly, for the softmax layer: the whole point of the softmax layer is so that we actually get numbers between zero and one, because we want those final predictions in the B by T by V tensor to be between zero and one so we can think of them as probabilities. You can instantiate a softmax layer in the constructor of the GPT class; that would be nn.Softmax, and you pass in the dimension along which you want to normalize, the dimension along which everything should add up to one. That's just the final dimension, since the vector of size capital V is our list of probabilities.
So you could pass in dim=2, the dimensions being 0, 1, and 2, or you could say dim=-1 for the final dimension. Then you would simply call your softmax instance in the forward method of the GPT class. Or, since softmax is sometimes thought of as simply a function rather than a layer, you can just call nn.functional.softmax in the forward method of your GPT class, and you don't have to worry about instantiating anything in the constructor. That function takes in two things: your tensor as the first input, the thing you actually want to perform softmax on, as well as the dimension, which we would just pass in -1 for. And
that's it. We're finally done explaining this entire transformer architecture that we've been building up to across all these videos, trying to understand how this neural network works. Before we jump into the code, I just wanted to say thank you for making it this far into the videos. And then two other things. One is that we're actually not done: even after this coding problem, we have one more, and that's where we finally generate text from the model. We'll see this neural network architecture in action, working like a GPT. In the next problem, you're going to write the generate function, which is a bit different from the forward function for this class; we'll actually get to return a string, and that's how the test cases will work. We'll see if the string your GPT returns, based on some sort of prompt, actually matches what the string is supposed to be. So that's going to be a pretty cool problem, testing whether you can write the generation code, the logic that actually makes these transformers generate text. And the next thing I wanted to say is that I highly recommend reviewing, if you need to, any of the previous videos or problems in the series, all the way from the first problem we started with, gradient descent, through linear regression, sentiment analysis, multi-headed attention, and so on. These problems and videos will always be free, and if you need to review any of these concepts at any time, I highly recommend them. Okay, let's finally jump into the code for the GPT class. I recommend reviewing the other classes provided below just to refresh yourself on those concepts, but for the most part we'll treat the transformer block class like a black box. So the first layer in the neural
network is the word embedding, or token embedding, layer. That'll be something like self.token_embedding = nn.Embedding, where the number of rows is the vocab size and the number of columns, the size of each vector, is the model dimensionality. Then we're going to have our positional embeddings layer: positional embeddings is another nn.Embedding, and the number of rows here is the context length, since we go all the way from 0 to T minus 1; the number of columns is also the model dimensionality. Then we need our transformer blocks. We can say something like self.blocks = nn.Sequential, and then, for i in range(num_blocks), we append to it: self.blocks.append, passing in an instance of the transformer block, using that class that's below. We pass in the two things it takes; if you check its constructor, which is way below, it takes in the model dimensionality and the number of heads. Then we need our final layer norm, which comes before the final linear layer. We can say something like self.final_layer_norm = nn.LayerNorm. If you've solved the transformer block problem before, you know that the layer norm constructor just needs to take in the dimension along which to normalize. This layer takes in something that's B by T by D and returns something that's still B by T by D, but normalized along that model dimensionality, the third dimension. So we just pass in model dim. Then we want our final linear layer. We can call this the vocab projection, because we're projecting down to the vocabulary-size dimension. The vocab projection is nn.Linear; the input features is model dim, the output features is vocab size. And that's actually it for our GPT constructor. One small thing, though: since the transformer block is an inner class, we do need to refer to it through self in Python. And for the softmax, we'll just use the function and do that in the forward method below.
Okay, so now let's pass our input, also called context, all the way through the transformer architecture. The first thing we need is our embeddings. We can say something like token_embeds = self.token_embedding, and we pass in our context. This is going to be something that is B by T by D, since for every token we get a vector of size D. Then we need our positional embeddings, so we call the positional embeddings layer. What do we want to pass in? We want to pass in torch.arange(T). But how do we actually get T? We can get it from any of these tensors: we can say something like B, T, D = token_embeds.shape, which unpacks the shape tuple, and then we simply pass T in over here. Then we know that our total embeddings, the sum of embeddings, is token embeds plus positional embeds, and this is what we want to pass through the blocks. So we can say self.blocks, passing in total embeddings, and that gets passed through all N of our transformer blocks. Then this goes through the final layer norm: self.final_layer_norm, and we pass that in. The output of the layer norm goes into the vocab projection, the linear layer, so self.vocab_projection takes in that output. This is almost the final output; we can say it's unnormalized, right? It's not between zero and one yet; we have to apply softmax for that. So nn.functional.softmax, and we pass in the unnormalized tensor and dim=-1, as we talked about earlier. This will be our normalized probabilities; they're between 0 and 1, so I'll just call it probs. And then this is what we want to return. We just need to make sure to round our answer to four decimal places so that the test cases are consistent. So we can say return torch.round of probs, with decimals equal to 4. And we're done.
And we can see that it works. We've
finally written a working GPT class.
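Pulling the whole walkthrough together, here's a consolidated sketch of the class. The TransformerBlock internals are stubbed out with a single linear layer, since the real class is provided separately in the course; variable names are my choices, not necessarily the exact starter-code names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    # Stand-in for the provided TransformerBlock class (the real one holds
    # masked multi-headed attention, add & norm, and a feed-forward network).
    class TransformerBlock(nn.Module):
        def __init__(self, model_dim, num_heads):
            super().__init__()
            self.ff = nn.Linear(model_dim, model_dim)

        def forward(self, x):
            return self.ff(x)

    def __init__(self, vocab_size, context_length, model_dim, num_blocks, num_heads):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, model_dim)    # V x D table
        self.pos_embedding = nn.Embedding(context_length, model_dim)  # T x D table
        self.blocks = nn.Sequential()
        for _ in range(num_blocks):
            # Inner class, so we refer to it through self (as noted above).
            self.blocks.append(self.TransformerBlock(model_dim, num_heads))
        self.final_layer_norm = nn.LayerNorm(model_dim)
        self.vocab_projection = nn.Linear(model_dim, vocab_size)      # D -> V

    def forward(self, context):
        token_embeds = self.token_embedding(context)         # (B, T, D)
        B, T, D = token_embeds.shape
        pos_embeds = self.pos_embedding(torch.arange(T))     # (T, D), broadcast over B
        total = token_embeds + pos_embeds
        out = self.blocks(total)                             # through all the blocks
        out = self.final_layer_norm(out)
        logits = self.vocab_projection(out)                  # (B, T, V)
        probs = F.softmax(logits, dim=-1)
        return torch.round(probs, decimals=4)                # match the test precision

model = GPT(vocab_size=5, context_length=5, model_dim=16, num_blocks=2, num_heads=4)
probs = model(torch.tensor([[0, 1, 2, 3, 1]]))
print(probs.shape)  # torch.Size([1, 5, 5]) -> (B, T, V)
```

Because of the four-decimal rounding, each row sums to 1 only approximately, which is exactly the behavior the test-case discussion below relies on.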
There are so many concepts embedded within this, so definitely leave a comment if there's anything you'd like me to go more in depth on, and I'll make a video on that. If you're curious how I wrote the test case for this problem, how it's actually checking the correctness of your code: I simply created a random context, and the weights of your model are randomly initialized, which is done automatically by PyTorch. Then I check whether the probabilities, the output tensor returned by your forward method, match the correct solution code's output tensor to four decimal places. And by the use of seeds, I'm able to ensure that everything is reproducible and there's no randomness or inconsistency in the inputs and outputs. Next, I would highly recommend jumping into the make-GPT-talk-back problem. In that problem, you'll implement the logic to make GPT generate text, and it's going to be a really nice, satisfying ending to this sequence of problems.
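As a preview of that problem, here's a hedged sketch of a standard autoregressive generation loop. Greedy decoding is shown for simplicity; the actual exercise may sample from the probabilities instead, and the function signature here is my assumption:

```python
import torch

def generate(model, context, max_new_tokens, context_length):
    """Greedy autoregressive generation sketch.

    model: maps (B, T) token IDs -> (B, T, V) probabilities, as described above.
    context: (B, T) starting token IDs.
    """
    for _ in range(max_new_tokens):
        # Crop to the last context_length tokens; the model can't see further back.
        cropped = context[:, -context_length:]
        probs = model(cropped)                       # (B, T, V)
        next_probs = probs[:, -1, :]                 # distribution for the next token
        next_token = torch.argmax(next_probs, dim=-1, keepdim=True)  # greedy pick
        context = torch.cat([context, next_token], dim=1)
    return context
```

The key idea is the loop: each newly picked token is appended to the context and fed back in, which is exactly what makes the model "talk back."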
So, how do we actually generate text from language models? Let's treat these language models, these transformers like GPTs, as a black box. How do they actually generate text? After we watch this video, there will be a coding exercise to generate new Drake lyrics using a trained model, and I highly recommend doing that exercise and then playing around with the model. So how do we actually generate the lyrics? First, let's treat the model as a black box. This is our model, some sort of trained GPT, and we'll explain what it takes in and what it outputs. It actually takes in a tensor of size B by T. B is the batch size, and it represents how many independent examples or requests we're processing in parallel when generating text. Let's just say batch size equals 1, so B equals 1. The T is actually really important, though. T is equal to the length of the input sequence. So if we're using word-level tokenization and we said something like "write me a poem", that would actually be a capital T of four. What the model outputs is actually something of size B by T by V, where V is the vocabulary size. So it's the number of unique words or tokens that this model recognizes.
And this should make sense, because for every single time step, for every single position in the input sequence, maybe an input sequence of four words like "write me a poem", we're getting a vector of size V. And that vector of size V is actually a list of probabilities. We can think of index zero in that vector as corresponding to the probability that the token associated with index zero comes next, and the same for index one and index two and so on, all the way to index vocab size minus one.
So in short, we have vocab size, or capital V, entries for every time step, where every entry in that vector corresponds to the probability that the corresponding token comes next. So after we've passed in our input, something like "write me a poem", so that's B = 1 and T = 4, we get something that's B by T by V. So,
we'll explain how the next token is chosen in more detail later in the video. But given that list of probabilities, there's going to be some sort of sampling algorithm. You can think of this as reaching into a bag of marbles, where the marbles that occur more often are more likely to get picked out. It's essentially the same concept here: we have a bunch of probabilities for which token could come next, and using those probabilities, we're actually going to pick one token to come next in the sequence. So let's say the input was "write me a poem", a bunch of probabilities came out of the model, and the token that we ended up choosing to come next is "stanza". What we do to generate text continuously from the model is append, or concatenate, this word to the previous input. So now the input says "write me a poem stanza", and this is going to be passed into the model. We call the model again and we get another output. This time, after we choose from the probabilities, let's say we get the token "one". Then we would append that, get "stanza one", and pass "write me a poem stanza one" into the model again. The model will output another probability distribution. This process of continuously calling the model over and over again in a cyclic manner is how we keep getting the next token in the sequence and how we can generate long text.
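The append-and-call-again loop just described can be sketched in a few lines of plain Python. Here `fake_next_token` is a made-up stand-in for the model plus the sampling step; a real GPT would return probabilities that we sample from:

```python
def fake_next_token(context):
    """Stand-in for model + sampling: picks the next word deterministically
    from a tiny made-up vocabulary, just to show the loop structure."""
    vocab = ["write", "me", "a", "poem", "stanza", "one"]
    return vocab[len(context) % len(vocab)]

context = ["write", "me", "a", "poem"]
for _ in range(2):                       # generate two new tokens
    next_token = fake_next_token(context)
    context = context + [next_token]     # append the chosen token to the input
print(" ".join(context))                 # → write me a poem stanza one
```

The key point is the cycle: the model's own output gets concatenated onto the input before the next call.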
And just to make sure this is super clear: after we pass "write me a poem stanza one" into the model, the next token that comes out should hopefully be either a colon, so that we can start the poem, or the first token of the poem itself. Then we would keep calling the model repeatedly to generate the entire poem. So how is this actually going to work in our
coding exercise? Instead of passing in an instruction, all we're going to pass in is a start token. It's essentially a dummy token that tells the model to start generating text. This is actually how text is generated from language models when they're in their pre-trained state, before we fine-tune them on a Q&A data set so we can talk back and forth with the model. Even so, they can still generate text very well. Our start token, or initial context, is just going to be a zero in a tensor, and that will be given. After calling the model once, it's going to output something that is 1 by 1 by vocab size, assuming a batch size of one and T equals one. After that first call, we concatenate the model's output onto the initial starting context, that start token. So after calling the model twice, the output is going to be B by 2 by V, and of course, after calling the model three times, the output is going to be B by 3 by V. We actually only care about the final time step, right?
The next token that's going to be predicted in the sequence is based on everything that's come so far, all the previous tokens. So whenever we extract the output of the model's forward, we always need to focus on the last time step, and this is how we can index the output to do so. Once we get our probabilities, such as by applying softmax to the model's output so we have numbers between 0 and 1, we need to actually choose the next token, because we don't know which token is going to come next; all we have is a bunch of probabilities. One option is just to pick the token with the highest probability, but this actually leads to really boring results, and the models don't give humanlike text generation. So
we're going to call a function in PyTorch called torch.multinomial, which is going to simulate the process of sampling. The process of sampling can be likened to choosing marbles from a bag. The marbles that occur more often, the colors that occur more often, are going to be the ones that you get most of the time. But every once in a while you will get a marble that occurs less often, because there's still a chance that that marble could get chosen. The same holds for the output of the model, which is of size V, or vocab size. The tokens that have higher probabilities are the ones that are more likely to get chosen, but we still want the less likely tokens to get chosen sometimes, instead of only choosing the max by calling torch.argmax or something like that. So once we have our probabilities, we're going to call torch.multinomial, and torch.multinomial is going to sample, or pick, a token for us. Then once we have the returned tensor from torch.multinomial, which will essentially be a tensor with just one element in it, the next character or the next token in our sequence, we're going to concatenate it with our growing context. And at the next iteration of the loop, we're going to call the model again, call the forward method for the
model again. However, there is one limitation that we still need to take into account and it is actually something called the context length. The
context length of these language models. You might have heard that GPT-4 has a context length of 128K. The context length represents how many tokens back into the past of the sequence the model can read. This has to be cut off somewhere, as we don't have infinite compute. Obviously, higher numbers for the context length are going to yield better results, but the context length does have to be established somewhere. So in our loop for generating new text, where we keep continuously calling forward for the model, we know that this context is going to be growing. We saw earlier that we had "write me a poem", then "write me a poem stanza", then "write me a poem stanza one". As this context gets too long, if it ever exceeds the context length, it is an important implementation detail that we are going to have to truncate the tokens at the start, so that we're only looking at the previous context-length number of tokens. So
next, I would highly recommend jumping into the code. You will actually get to run your code against test cases and see if the string that you return is correct, the string that represents GPT's prediction, and then you will have a working model that can generate new Drake lyrics. Of course, you could just switch the data set to a different text file if you wanted to generate some other sort of text. So I'd highly recommend jumping into the code, and then you can check out the Colab linked on the problem, which will allow you to play around and generate new lyrics. Let's test our code and see if it works. We can go ahead and run this cell, and these actually look like real lyrics. In fact, we can run it again and see if we get real lyrics as well. And yeah, we get some different lyrics. And of course, we can increase the number of characters here and get even more lyrics.
Okay, this is our final problem in this list of problems, and we're finally going to generate text using all the code we've written so far. This problem is going to combine concepts from this entire series, and we're finally going to use the GPT class that you wrote. One of the inputs is actually an instance of that exact class, and we're going to use it to generate text. I took the model that you coded up in the previous problems, the GPT class, made an instance of it, and trained it on a raw data set of all of Drake's songs. So what's being passed to you is a model that specializes in generating Drake songs, since that is the body of text it learned to model really well during training. And we're finally going to get to write the code that actually generates text from these large language models. This idea of using a trained model and its weights to generate new things is called inference, and it's a bit more complicated than just calling the forward method from the GPT class over and over again. This is a valuable skill to really understand, because then you could download open-source weights from online, implement inference for them on your own, and actually generate text; that can be a really fun side project. The reason it's more complicated than just calling forward in a loop is because forward is not outputting a clear-cut answer for
what the next token in the sequence should be. If we pass "my name is" into this model, treating the model as a black box, and let's say we know the answer is "Bob", and the model has been trained really well and knows the next word is "Bob", it's still not going to give us a clear-cut answer like that. That's not what this model outputs. This model outputs a vector of size vocabulary size, or capital V, where V is the number of tokens the model was trained on. We have a bunch of numbers, each between zero and one, and if we go to the index where "Bob" is, that's going to be a really high number, maybe like 0.95. So the model is outputting a probability for each possible next token, whether that's a character (in this model, it's actually going to be a character) or a word; it's just some token. So given a bunch of probabilities, how do you actually choose the next token? It turns out just taking the highest-probability token doesn't necessarily yield the greatest results, and we'll
explain why in a bit. The model is also only allowed to read context-length number of tokens back into the past. As we're generating text, these generations can keep growing and growing, and we're going to need to truncate off tokens that are farther back than this threshold at every iteration. The way your code will be validated is we'll just check to see if the string that GPT is returning back to you, whatever text it generates, actually matches what it should be.
So let's try and understand this example. We're given an instance of the GPT class, and that's going to be the model variable. Then we're told how many new characters to generate, and we're given the starting context. So if the model is a black box, this is our model. We pass in some sequence of tokens of length capital T. We're going to start off by passing in just one token, specifically zero. Since we just have one token, we can say that capital T equals 1. If you look at the vocabulary dictionary, remember we encode our tokens as numbers when we're doing NLP, and we can see that zero corresponds to the newline token. That's going to be our start token, a token that tells the model, hey, start generating text. In this case, since we trained the model on lyrics, you can maybe think of the newline token as saying, okay, start generating a new stanza or a new verse, right? We're then going to get a probability distribution, pick a token, and once we decide that, we're going to concatenate it to our original input. So now the input will be of length T + 1, and then we'll call the model again inside some sort of loop.
We'll get the next token, and then this process just continues. So we need to be told how many new characters to actually generate, and that's this input over here. Of course, as this input grows, as we are generating new tokens and appending them back onto the context variable, at some point those tokens from the past, maybe those starting tokens, need to be cut off if the input is getting too long for the model to process. That's our context length. And then this is just a dictionary for mapping integers to characters. This is going to be a character-level model instead of a word-level model, but all the concepts are exactly the same. And although your output when you run the code for this problem is not going to match this exactly, just due to computational limits within the browser, I went ahead and ran this model. I used the solution code for this problem with the GPT class that you wrote in the previous problems, and this is the output I got for only 60 characters. You can count this; it's 60 characters. We can see that it's something that actually resembles Drake lyrics, so that's pretty cool. Now let's jump into how it works.
So this is going to be our general workflow for this generation loop. We're going to call the forward method of the given model at every iteration using our growing context, which, again, we have to remember to truncate if it gets too long. We are then going to get probabilities. However, the output of this model, as the starter code says, is from before softmax is applied, so the values are not between zero and one yet. To get those probabilities, we're going to use nn.functional.softmax, and of course we're going to use the last dimension, so we say something like dim equals -1 to make each row actually sum to one. However, the model's output before we apply softmax is of shape B by T by V. For every single token, at every single time step, we have this vector of size V, all the probabilities. But we've been calling this model over and over again inside the loop, and this T has been growing: it starts off with T = 1, then T = 2, then T = 3 as we generate the tokens. Each time, we only really care about the next token that's going to come in our sequence, so we only care about the last time step here. We don't care about the model's predictions for the previous time steps, because we've already chosen those tokens from the probabilities and done our whole sampling thing, which we're about to explain. So we just want to index the model output at the final time step. Let's say we had the model's prediction stored in some kind of variable. You would want to leave the batch dimension the same, take the last time step, and grab all the vocabulary entries, all the probabilities, so you would leave the vocabulary dimension the same. Then you can apply softmax, and from those probabilities we're going to do sampling. So let's talk about sampling.
So given a bunch of probabilities, one choice is just to keep choosing the highest-probability token: look at the index of the max probability and simply choose that token. However, we're instead going to do something called sampling. Sampling can be likened to drawing marbles from a bag. Let's say there's a bunch of marbles of different colors, and some colors appear more often than others. If you keep repeatedly drawing from the bag, the ones that appear most often are going to pop out more often, but occasionally you can still get the lower-count marbles. That's exactly what we're going to do here, and there's a function called torch.multinomial that is going to simulate this sampling process. All we have to do is pass in the probabilities (that's after we applied softmax), the number of drawings that we want to do from this bag of probabilities (that's just one), and then, for reproducibility, so that your test case output is not random every time, you need to pass in the generator. Why are we doing sampling instead of just taking the max-probability token? I go into this in more detail in the background video for this problem, but the short answer is that if we let the model occasionally choose the second or third most likely token, we can get a lot more diverse and interesting outputs, because the model isn't forced to take just one path. Instead, there are different paths the generation can go down. We know that at a given iteration, if we choose a character like A instead of a character like B, that is going to affect the future tokens that are generated, because they're all conditioned on, all dependent on, the previous tokens that came before. So we can get more interesting and fun outputs if we occasionally choose the slightly less likely token at a given iteration, because this allows the model to go down a different path and generate something different.
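The marble-drawing idea can be simulated in plain Python with `random.choices`, a rough stand-in for torch.multinomial; the probabilities below are made up:

```python
import random
from collections import Counter

random.seed(0)                 # like passing a generator, for reproducibility
probs = [0.7, 0.2, 0.1]        # hypothetical probabilities for tokens 0, 1, 2
draws = [random.choices([0, 1, 2], weights=probs, k=1)[0] for _ in range(1000)]
counts = Counter(draws)
# Token 0 wins most draws, but the less likely tokens 1 and 2 still show up,
# which is exactly why sampled text is more diverse than always taking the max.
print(counts)
```

Always taking the max would emit token 0 every single time; sampling keeps the other paths alive.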
Okay, we're finally ready to jump into the code. We're going to be returning a string that grows at every iteration depending on which characters we generate, so first I'm going to declare a variable called result, which is just going to hold the list of generated characters, and then we can convert this into a string at the end. Again, to maintain reproducibility here, we have this generator, which is going to keep track of all the random numbers and make sure everything's consistent. You actually have to set the generator to the initial state at every iteration to make sure everything's consistent. If that doesn't make a ton of sense, don't worry, it's not too important. The only thing we really need to do is have our code here, which is going to be some arbitrary number of lines, and then call torch.multinomial on this line over here, because we know there's going to be a call to that sampling function at some point. And like we said earlier, we're going to pass in the generator. Then, right after that line, we need to set the state again and finish out however many lines of code the body of the loop takes. And then it
starts over again. So the first thing we're going to do is check whether we even need to do any truncation. Has the context grown long enough that it's longer than the context length? The way I'm going to do this: since context is B by T, I'm going to check the length of the second dimension. An easy way to check the second dimension is to take the length of the transposed version, because len always returns the first dimension. If context is B by T, len of context is B; but if you transpose it, you flip it into T by B, and you could say len of context.T (T for transpose, not to be confused with the sequence-length T). So if the length of the transposed context is greater than the context length, then it's too long and we need to truncate. We could say context equals context, preserve the batch dimension, and then slice from negative context length onwards. Negative context length means going context-length tokens backwards from the end, and the colon goes all the way to the end, so we'll preserve everything from that point onwards.
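Using nested lists as a stand-in for a B-by-T tensor, the truncation step looks like this; a simpler alternative to the transpose trick, in PyTorch, is reading the second dimension from the shape directly:

```python
context_length = 4
context = [[5, 9, 2, 7, 1, 3]]          # shape (B=1, T=6) as a nested list
T = len(context[0])                      # in PyTorch: context.shape[1]
if T > context_length:
    # keep only the last context_length tokens of every batch row,
    # which is what context[:, -context_length:] does on a tensor
    context = [row[-context_length:] for row in context]
print(context)  # → [[2, 7, 1, 3]]
```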
Then, based on this context, we need to get the model prediction. So we can say prediction equals model.forward, or again just use the default call syntax, and pass in the truncated context. Then we want to focus on the last time step, because again we only care about getting the next token. So we take prediction and index it with colon, -1, colon, as we talked about previously. Instead of being B by T by V, this is just B by V, which is what we wanted. Then we can get our probabilities by calling softmax. So nn.functional.softmax: pass in that last time step, and we want to normalize along the vocabulary dimension, making every row in this tensor sum to one with all the entries between 0 and 1, so pass in dim equals -1. And now we can actually call torch.multinomial. So we can say next_char, or next character, is torch.multinomial, and we know we need to pass in the probabilities and the number of times we want to draw out of this bag. That's called num_samples, so you can say num_samples equals 1, or you can just directly pass in one. Then we do have to pass in the generator just to make sure the sampling is consistent. And then we want to set the state so everything's consistent, but we don't have to touch that at all. Here's where we're going to finish out the loop. We know we need to grow the context and actually append the new character we just got. We can use torch.cat for that concatenation: torch.cat, and we pass in a tuple. We just want to concatenate context with the new character, and then we can say dim=1 over here. That takes it from something that's B by T, the initial shape of context, to something that's B by T + 1. That's what passing in dim=1 does; we're focusing on this dimension over here.
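What torch.cat((context, next_char), dim=1) does can be mimicked row by row on nested lists; the token values here are made up:

```python
context = [[0, 4, 2]]     # shape (B=1, T=3)
next_char = [[7]]         # shape (B=1, 1), like the tensor torch.multinomial returns
# Concatenating along dim=1 glues the new column onto the end of each batch row
context = [row + new for row, new in zip(context, next_char)]
print(context)  # → [[0, 4, 2, 7]], shape is now (B, T + 1)
```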
And then we can actually append to result. So we can say int_to_char, that's the dictionary we had earlier, and we want to index it with whatever token we chose. So we can say next_char, and next_char is supposed to be an integer, because the model is thinking in terms of these token IDs, the integers that represent each string, not the literal strings themselves. However, this is actually still going to be a tensor, since it's something returned by torch.multinomial. So we can simply call item, and item extracts the number out of a tensor of size one. If you have a tensor wrapping a single constant and you call item on it, it gives you the actual number, the actual scalar. That would give us five over there. This is just what we want to append to result, so we can say result.append of whatever character that dictionary gave us. And now we're ready to return the string. However, we need to return a string, not a list. So why don't we just join all the entries in the list with the empty string: we can say return empty string dot join of result, which is going to join all the elements of the list with the empty string, and we can see that the code works. Next, I highly, highly recommend checking out the linked Google Colab notebook in the description. You don't have to write any more code. In fact, you're going to see your code, this exact generate function that you wrote here, being used in that notebook. All you have to do is click run on each cell, one by one, and every time you run it, you're going to get new Drake lyrics. So that'll be really interesting. "Bumping Justin Bieber, but her favor ain't left. She know what she need. All her need, all she bless. Giving you my best. Yeah, I got my heart."
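Putting the whole loop together, here is a toy, pure-Python version of the generate function. Everything in it is a made-up stand-in: `fake_model` replaces the trained GPT (it returns next-token probabilities directly, whereas a real model returns B-by-T-by-V scores that need the last-time-step indexing and softmax), and the three-character vocabulary is invented:

```python
import random

int_to_char = {0: "\n", 1: "a", 2: "b"}   # hypothetical tiny character vocabulary
context_length = 8

def fake_model(context):
    """Stand-in for model.forward: looks at the last token and returns
    made-up next-token probabilities over the 3-character vocabulary."""
    table = {0: [0.1, 0.6, 0.3], 1: [0.2, 0.2, 0.6], 2: [0.5, 0.3, 0.2]}
    return table[context[0][-1]]

def generate(new_chars, context, seed=0):
    rng = random.Random(seed)            # like passing a generator, for reproducibility
    result = []
    for _ in range(new_chars):
        if len(context[0]) > context_length:              # truncate if too long
            context = [row[-context_length:] for row in context]
        probs = fake_model(context)                       # get probabilities
        next_tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]  # sample
        context = [row + [next_tok] for row in context]   # grow the context
        result.append(int_to_char[next_tok])              # decode token id to char
    return "".join(result)               # join the character list into a string

print(repr(generate(20, [[0]])))
```

Same seed, same output: the `seed` here plays the role the generator plays in the real code.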
Welcome to GPT Learning Hub, where we simplify complex ML concepts. In this
video, we'll be going over the two types of machine learning models, or more accurately, the two ways to train models, supervised and unsupervised.
Supervised models learn from a labeled data set. Imagine a model that predicts how many votes a candidate will receive based on a few input attributes. The data set might have X1, the age of a candidate, X2, the number of laws the candidate has passed, and X3, the number of years they've been in politics. Each row represents a prior candidate, and for every row we also have a label, which is the recorded number of votes received. The model learns to predict how many votes a future candidate will receive based on this data set. Learning from labeled data is the essence of supervised learning. What's actually inside this black box? The model could be any mathematical formula, but the simplest model would use the equation Y = W1*X1 + W2*X2 + W3*X3 + B to make its prediction for Y given X1, X2, and X3 as input. During the learning phase, the values for W1, W2, W3, and B are updated until we're satisfied with the model's accuracy. W1 factors in how important X1 is for calculating Y, W2 factors in how important X2 is for calculating Y, and so on. The B tells us the value of Y if X1, X2, and X3 were all zero. On to unsupervised models.
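As a quick aside, that supervised prediction can be sketched in plain Python. The parameter values below are invented for illustration; in practice, training would find them:

```python
# Hypothetical learned parameters -- training, not us, would find these values
w1, w2, w3, b = 1200.0, 350.0, 80.0, 5000.0

def predict_votes(x1, x2, x3):
    """y = w1*x1 + w2*x2 + w3*x3 + b; each w weighs how important its input is."""
    return w1 * x1 + w2 * x2 + w3 * x3 + b

# A made-up candidate: age 50, 12 laws passed, 20 years in politics
print(predict_votes(50, 12, 20))  # → 70800.0
print(predict_votes(0, 0, 0))     # → 5000.0 (with all inputs zero, y is just b)
```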
all zero. On to unsupervised models.
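First, though, here's that supervised prediction equation as a quick sketch in Python. This is just an illustration of the formula y = w1*x1 + w2*x2 + w3*x3 + b; the weight values below are made up for the example, not learned:

```python
# Sketch of the supervised voting model: y = w1*x1 + w2*x2 + w3*x3 + b.
# The weights are invented for illustration; training would find the real values.

def predict_votes(x1_age, x2_laws_passed, x3_years_in_politics):
    w1, w2, w3 = 120.0, 800.0, 350.0  # how important each attribute is
    b = 5000.0                        # the prediction when all inputs are zero
    return w1 * x1_age + w2 * x2_laws_passed + w3 * x3_years_in_politics + b

print(predict_votes(50, 12, 20))  # one hypothetical candidate: prints 27600.0
```

Training would adjust those four numbers until predictions on the labeled rows are accurate enough.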
They learn from unlabeled data. GPT and other LLMs are trained through unsupervised learning. We simply pass in giant chunks of text, like all of Wikipedia, during training, and there's no need for each sentence or paragraph to be labeled with any additional information. In our voting example, each row was a training data point. But when feeding text into an LLM, there are actually tons of training data points inside a single text snippet. Consider a phrase like "Who are you? I am a helpful AI." LLM creators add phrases like these to the training data set, which is just a giant body of text. There are actually nine different training data points in this phrase. Let's see why. Given the sequence "who", the model learns that the word "are" can follow. Given the sequence "who are", the model learns that the word "you" can follow. Given the sequence "who are you", the model learns that a question mark can follow, and so on. Once training is complete, the model has learned to respond to the question, "Who are you?" with the response, "I am a helpful AI." We don't have to label any of the data points; we already have many sequences where the model can learn to predict the next word. That's the main benefit of unsupervised learning. Most Gen AI models are trained with this technique since training on large amounts of data is much easier. If
you've made it to the end of the video, congratulations. You're on your way to mastering this topic. If you like my teaching style, you'll love our 6-day ML challenge. Each day, I'll explain a new concept in your inbox in an easy-to-digest format. You can sign up for free by dropping your email at the link in the description. See you soon. You may
have seen this weird-looking diagram before. It's actually an RNN, or recurrent neural network. So, let's explain exactly how it works. As usual, we're going to start pretty high level and then get down to the actual equations that are used for RNNs. There are also specific types of RNNs, like LSTMs and GRUs, but we'll go over those in a later video since they're just extensions of the vanilla RNN we're about to cover. In 2017, Google came out with the transformer, and since then, RNNs haven't been as popular, but they are still used today in Google Translate. Google Translate uses a hybrid model: it uses both a transformer and an RNN. So, RNNs are definitely still important to understand. Definitely leave a comment below if you'd be interested in a video on Google Translate. So, how do these RNNs actually work? We're told that they take in some input at time t, get an output at time t, and we also get a hidden state at time t, which is fed back into the model. So now the model has two inputs: the actual input at time t, displayed over here, and the hidden state from the previous time step. So what does all of this mean, and why do we have time steps? The thing about RNNs is that they take in sequential data, and this can come in many different forms. We can have literal time series data, like prices from the stock market. At each time step, our x, our input data, may hold a different price. This may be our starting day t minus one, then we have the price on the next day of whatever stock we're modeling, then the price on the day after that, and the ultimate downstream goal may be to predict the price of the stock on a future day. The model takes in x, the actual price, which we can see over here, and we also take in v. V is interchangeable with h; it's the previous hidden state. We might initialize that hidden state to be some random vector, and then at every time step some calculations are done inside this blue box, or black box, and we get a model output, which is some sort of vector, as well as another hidden state to be passed in to the next time step.
RNNs can also be used to model text-based data. We may know by now that we can think of text as a sequence of words, right? So we can think of every word, or subword, or character as a time step. We can convert each word to some sort of number, maybe an embedding vector, and that will be the x that we pass into these models. Maybe our ultimate goal is to do autocomplete and get the next word in the sequence, or maybe it's to gauge the sentiment or emotion in our text. Regardless of our NLP application, we can pass text into these RNNs simply by modeling each word as a time step. So
let's continue with the stock price example. Let's say we have a couple of days of stock prices, and the goal is to predict the price on day three. We have our day one price fed into the RNN over here, and then there is some computation done inside the RNN. We'll get into the equations for how that computation is done soon. We then have the day two price over here fed into the next unit, as it's called, of the RNN, and in this box over here there is some very similar math being done. Don't forget that we have this initial hidden vector: since we don't have any day zero data, we just randomly initialize a hidden vector to pass into the RNN on the left. Then we get a day one output, which can be thought of as the model's prediction for the day two price based on the day one price. However, we know the actual true day two price, and our ultimate goal is to predict the day three price using the computation and equations going on inside this box over here. We then generate a day one hidden vector, which is passed into the next unit. The RNN factors in both the day two price and the hidden information, as it's called, from day one to ultimately get the day two output. So we have the day two output over here as well as a day two hidden vector. And depending on the situation, we might say that our model's prediction for the day three price is either the day two output or the day two hidden vector. This can keep going for as many days of data as we have. And maybe ultimately we only care about the final day: we want to predict the future price of a stock. But what equations are actually being used inside these RNN units to get these outputs and hidden vectors? Here are the main equations. All we really do is use the current hidden state that's passed in, as well as x, to calculate the next hidden state. Then this next hidden state is used to calculate the output for that time step.
And this set of equations, this pair of equations, is used in every single time step. However, what we do need to talk about is: what is W, what is U, what is B? Note that this W is actually different from this W, and this B, or bias, is different from this one. Those are the parameters of the model that need to be learned through training. To understand what W, U, and B, those matrices and biases, are actually doing, all we really need to do is revisit standard, or vanilla, neural networks. We have some sort of input here, which in code would be represented by a vector. So maybe we have an input with some number of attributes. We might call this one X1, this one X2, and this one X3. Although we were talking about feeding plain numbers into our model, the stock prices in our previous example, we might imagine embedding each stock price as some sort of vector. As a result of the calculations going on inside these complicated edges, we end up with a four-dimensional vector. We can say that this is output number one, or O1, this is output number two, or O2, then output number three, and lastly output number four. So this linear layer took a three-dimensional vector and transformed it into a four-dimensional vector. What we need to remember is that each of the four hidden nodes performs linear regression.
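Concretely, that three-to-four linear layer is just a matrix multiply plus a bias vector. A minimal sketch (the numbers are arbitrary, and whether the matrix is stored as 4x3 or 3x4 is only a convention):

```python
import numpy as np

# A linear layer from 3 inputs to 4 outputs: a 4x3 weight matrix plus 4 biases.
# Each row of W holds the w1, w2, w3 of one hidden node's linear regression.
W = np.arange(12, dtype=float).reshape(4, 3)  # 12 weights, arbitrary values
b = np.ones(4)                                # one bias per hidden node

x = np.array([1.0, 2.0, 3.0])  # input vector: x1, x2, x3
o = W @ x + b                  # output vector: o1..o4, one entry per hidden node
```

Counting parameters: 12 weights plus 4 biases gives the 16 total parameters discussed below.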
So that means each of those four hidden nodes is storing a W1, a W2, and a W3, the weights of a linear regression model, one for each of our three input attributes, as well as a bias. That gives us four parameters per hidden node. And given that we have four nodes, that leaves us with 16 total parameters, which could be represented as a 4x4 matrix. Of course, in practice we might do something like a 3x4 weight matrix, covering the four sets of weights over the three input attributes, and store our biases in a separate variable instead of in that original matrix. The idea is clear: each weight matrix is simply the parameters for a linear layer. So let's return to these equations. W over here represents a linear layer with learned parameters. The same goes for U and for B, as well as for this B and for this W.
Each W, B, and U learns its own parameters. However, we saw that in the original diagram we had many time steps going left to right. Each of those RNN boxes uses the same set of W, U, and B, as well as this W and this B: the parameters are shared across time steps. So, the main component we still need to talk about is g. What are these g functions? The g refers to some sort of nonlinear activation function. Here we have the sigmoid function, which always outputs numbers between zero and one. Here we have the tanh function, which always outputs a number between negative one and one. And here we have ReLU, the function which takes the max of the given input and zero. So if the given input is negative, it gets squashed down to zero, as we can see over here, and if the input is positive, then the output is just the same as the input; there it's the identity function. These activation functions can really impact the performance of a neural network. They can drastically improve performance in some cases, and in other cases they can worsen it if the wrong activation is chosen. And when I say wrong activation, I mean a function where, when we calculate our derivatives and gradients, we end up with really poor numbers that might be really close to zero and make it hard to train the network. Ultimately, it comes down to trial and error. Researchers will try different activations for different use cases and see which one results in the best neural network performance. So I
want to end this video by talking about some of the limitations of RNNs. One significant limitation is that we cannot take advantage of parallel processing. We have modern-day GPUs that are very specialized for parallel processing. However, for RNNs, the input for the next time step depends on the output from the previous time step. Because of this, we're limited by the length of the sequence: the longer the sequence, the longer the processing will take. So we cannot exploit parallel processing as much as we would like to. Another significant issue with RNNs is called the vanishing gradient problem. When we calculate the derivatives that are ultimately used to update the weights through gradient descent, which is how we train and optimize any kind of neural network, a lot of them end up going to zero as the sequences grow longer. So as sequences grow longer, RNNs face a significant limitation in training. The reason why longer sequences result in a vanishing gradient requires a little bit of calculus, and I can definitely make a future video on this if there's interest, so leave a comment if that's something you're interested in. I hope that was helpful. And if you're looking to get practice, I have coding problems and quizzes on my website in the description. For each of these practice problems, I have a background and solution video, and for the coding problems, you can run your code against the test cases. There's also a full playlist on my channel that goes in order through each of these problems, and that should pop up on the screen soon, hopefully. I'll see you soon. Okay. LSTM networks. This
stands for long short-term memory network. They're a special kind of RNN, which stands for recurrent neural network. But don't worry if you're not familiar with these, since this video doesn't require background on them. This video is going to start super high level, and then we're going to get more detailed and go into more depth on this diagram as the video goes on. LSTMs are used to make predictions on sequence-based data. This might be a sentence, where we want to predict the next word, kind of like an autocomplete model. Or we might pass in a bunch of stock prices from different days and predict a future price. The transformer neural network, which I have a whole playlist on, and it should pop up somewhere in the top right in a second, is used more commonly these days. However, LSTMs, these neural networks over here, are definitely still worth learning for a few reasons. One, they're still used in products like Google Translate, so they're not entirely obsolete. Two, the head researcher at OpenAI, Ilya, actually said that he thinks RNNs could make a comeback. And last, understanding the issues with LSTMs, since they're not perfect, helps us understand why the transformer neural network over here was originally developed. By the way, if you're already familiar with the general idea of an RNN, you can skip to the timestamp in the description. So, let's
start pretty high level. Let's say we're building a model that is kind of like autocomplete: we want to predict the next word in some sort of sequence, aka a sentence. So let's say we have "I grew up in France. I speak fluent..." and then the model is supposed to predict what word comes next. We know that the word "French" should come next, but we want the model to be able to do this. So let's start over here. We're passing some sort of input x into the model, and x is simply going to be our sequence of words so far, this entire sequence over there. Then we get some sort of output, and we also pass something else, which we can see over here, back into the model, and then our sequence continues again. When we unfold this, what we actually see is the initial time step. We'll think of each token, or each word in our sentence, as a time step. So maybe the first token or word is t equals zero, the next one is t equals one, and so on. Let's say we pass in the word "I" over here, and then the word "grew" over here, and so on. The ultimate goal is to grab the next word in the sentence. So when we pass in the word "I" over here, the model outputs some information in the form of a vector that will be used in the next time step. Then, when we pass in "grew" over here, the model factors in both "I" and "grew" to ultimately make its prediction for the next word in the sentence, which we know is "up". So the fundamental idea here is that the model can factor in previous information to make a future prediction. The model can factor in words from far back in the past, rather than only the previous word that came in the sentence.
For now, let's not really worry about all these symbols like V and H, and let's go back to the general idea of an LSTM. Let's say that between those two sentences we have a bunch of irrelevant information that doesn't help the model predict which word should come over here. We would say that the model needs to remember some information from far in the past and factor that into its response over here. That's where LSTMs come in. Now that we have a high-level understanding of LSTMs, let's dive into these weird-looking symbols and break down how they work. The fundamental difference between a normal RNN and an LSTM is just what's going on inside this box over here. The general idea of passing in information from our previous time steps, getting some sort of output, which we can see over here and over here, passing those into the later time steps, where we also factor in whatever word is at our current time step, and simply continuing this for further time steps until we get the output we're interested in: this fundamental idea remains the same. One of the core concepts behind LSTMs is this black line over here, called the cell state. It runs through all the time steps. And the core idea inside this LSTM is that all these operations we see going on over here, as well as to the right, are involved with updating or modifying that cell state.
So let's look at the first gate that we have in the network, right over here. If we scroll down, we can see that what's actually going on is we concatenate xt, the word at our current time step, and the vector ht minus one, the vector that came from the previous time step. The concatenated vector is then passed into a single linear layer. So that's just a standard neural network. If you're not familiar with neural networks, that's all right. As a quick review, we're essentially passing in some sort of vector as the input, and we can think of a number being stored in each of these nodes over here. So this is x1, this is x2, and this is x3: a vector with three entries. Then what's going on in each of these hidden nodes is linear regression. Y1 through y4 is calculated for each of those nodes based on this equation. For each of those nodes, the x1, x2, and x3 from the input layer are used, and each node learns, over the process of training the LSTM, a separate w1, w2, w3, and b, or bias. Then each of those entries y1 through y4 is passed into this sigmoid function over here, which always outputs a number between zero and one. So ultimately, after this layer, or this gate, symbolized by the sigmoid, or Greek sigma letter, we have a vector of entries where each entry is between zero and one. Conceptually,
whatever vector is calculated and output over here factors in the current word as well as whatever vector came in from the previous time step. And we're going to use that to update the cell state. The way we update the cell state, symbolized by this X and essentially represented by this equation, is to take the element-wise multiplication of the previous cell state vector and our sigmoid output. Since the entries in this vector over here are between zero and one, we can think of that as taking a fraction of each entry in the previous cell state vector: we're multiplying it by something between zero and one and reflecting that in the updated cell vector. Just to make that super clear, we can see that if we multiply two vectors element-wise, where one of the vectors has entries between zero and one, this is equivalent to keeping, or preserving, some fraction of the information. That means there's an intuitive explanation for what that sigmoid gate is doing. It's referred to as the forget gate. We know that some of the information in a body of text is irrelevant for predicting the next word in the sentence. Going back to our example, where we wanted to predict the word "French": we had a bunch of irrelevant information in between in that body of text, and the main thing we wanted to remember from the past was that the speaker grew up in France. But some of the information in our current time step, and maybe even in the previous time step, needs to be forgotten by the neural network. That's why we refer to this as the forget gate. If we're multiplying the previous information in the cell state by some number between zero and one, then we are essentially forgetting some of that information and also preserving some of it. The closer the sigmoid output is to zero, the more information we're forgetting. The closer that output is to one, the more information we're remembering, since if we multiply something by one, it doesn't change at all; we're preserving that previous information.
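The forget-gate arithmetic can be sketched in a few lines of NumPy. The concrete numbers below are made up; the point is the element-wise multiply that keeps a fraction of each cell-state entry:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

c_prev = np.array([2.0, -4.0, 1.0])  # previous cell state
z = np.array([10.0, 0.0, -10.0])     # linear-layer output from [x_t, h_{t-1}]
f = sigmoid(z)                       # forget gate: every entry lands in (0, 1)

c_forgotten = f * c_prev             # element-wise: keep a fraction of each entry
# f is roughly [1.0, 0.5, 0.0]: remember the first entry fully,
# keep half of the second, and forget the third almost entirely.
```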
And through the process of training the LSTM, the right parameters are learned for each of these nodes, this top node as well as the three below it, so the gate learns to forget and preserve the right information from xt and ht minus one. The other gates, to the right of the first gate, work in a very similar way. We also need to figure out what information to add to the cell state after we've figured out what information to forget and what to preserve. And our final gate, which can be seen over here, simply helps us determine what information we should output. In the interest of keeping this video concise, let's quickly explain how this next gate works. There are two steps to adding new information into our cell state. The first step is figuring out what information we want to add: that's what this tanh gate does. And the next step is figuring out how much of that new information we ultimately want to add. The sigmoid gate over here works exactly the same as the previous sigmoid gate. If we scroll down, we see that we just concatenate xt and ht minus one and pass them through a linear layer, although we should note that the linear layer used in this gate, the new sigmoid gate we're talking about, has a separate set of weights learned for each of the nodes y1 through y4. And we follow that up with a sigmoid function, which we can see over here.
The tanh gate over here works extremely similarly. We have a linear layer, with nodes in the hidden layer learning weights based on the xt and ht minus one that are passed in. However, we follow up the output of the linear layer with a tanh activation, which outputs values between negative one and one instead of between zero and one as in the sigmoid function. This should make sense, since the tanh gate is supposed to calculate what new information we want to add into our cell state, what new information is relevant for the model to remember, while this sigmoid gate figures out how much of that information is ultimately relevant to add. So before we add the information into the cell state, we simply multiply those two outputs together, since we know that multiplying vectors element-wise, where one vector has entries between zero and one, decides how much of the second vector should be preserved. All that's left to figure out is how we calculate ht, the output of this time step, which is also used in the next time step: figuring out what needs to be output for this time step. We should probably draw that information from our updated cell state, after the forget gate has done its calculations and the addition gate has done its calculations. The updated value of the cell state at this point factors in our previous information and our current word at this time step. And not only has it factored that information in, but over here it's discarded what's irrelevant, and here it's added in what is relevant.
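Putting the pieces together, here's a sketch of one full LSTM time step in the standard formulation, with the forget, input, and output gates. The sizes and random parameters are illustrative only; this isn't the video's code:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h = 3, 4  # input and hidden sizes, chosen arbitrarily

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Each gate has its own learned weights over the concatenated [h_{t-1}, x_t].
Wf, Wi, Wc, Wo = (rng.normal(size=(n_h, n_h + n_in)) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_h)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])  # concatenate h_{t-1} and x_t
    f = sigmoid(Wf @ z + bf)           # forget gate: what to keep from c_{t-1}
    i = sigmoid(Wi @ z + bi)           # input gate: how much new info to add
    c_tilde = np.tanh(Wc @ z + bc)     # candidate new information, in (-1, 1)
    c_t = f * c_prev + i * c_tilde     # updated cell state
    o = sigmoid(Wo @ z + bo)           # output gate
    h_t = o * np.tanh(c_t)             # hidden state / output for this time step
    return h_t, c_t

h, c = np.zeros(n_h), np.zeros(n_h)
for x in [np.ones(n_in), np.zeros(n_in)]:  # a toy two-step sequence
    h, c = lstm_step(x, h, c)
```

The last two lines of `lstm_step` compute the output, which is exactly what the transcript describes next.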
That's exactly what the LSTM does. It passes the cell state into a tanh gate over here, and then we element-wise multiply that with the output of yet another sigmoid gate. This tanh layer factors in the current value of the cell state, which we might think of as ct, and this sigmoid layer over here factors in xt and ht minus one. As usual, the element-wise multiplication decides how much of the information that came from the cell state is relevant to preserve and ultimately pass on as our output, over here and over here. The fact that the sigmoid output is between zero and one is what makes that work. So that's the overview of LSTMs, but there are still some clarifications we need to make. LSTMs aren't perfect, and they have issues. The primary issue is that we cannot take advantage of parallel processing as much as we would like to with modern GPUs. We need the outputs from previous time steps during training to calculate future time steps. That means we're limited by the sequence length, and the longer the sequence, the longer training will take. Transformers, on the other hand, process all tokens, which are essentially words at different time steps, in parallel. Next, it may be unclear, based on the concept overview we just gave, how to actually implement an LSTM and train it. We know there's going to be a lot of matrix multiplications, a lot of sigmoid functions, and a lot of tanh functions. If you're interested in a video on implementing an LSTM, definitely leave a comment. Lastly, learning requires practice. That's why I've created a bunch of coding problems and quizzes, which should be popping up on the screen soon. They're all free on my website. Every practice problem has a video associated with it in the playlist that's about to pop up. So definitely check it out, and hopefully I'll see you soon. Large language models are capable
soon. Large language models are capable of many different tasks. You can
roleplay with them, ask them to generate code, and even write poems. But in some cases, we want to customize them or have them specialize in a particular task, like generating code in C or speaking a
really niche language that wasn't in the original training data. This is where fine-tuning is useful. This means
further training of the model, actually updating the weights and biases using gradient descent, except this time we're using a new data set, typically much smaller than the original data set, and
we start off with the pre-trained version of the model. Although some
models like GPT are closed source, many models like the llama family from Meta are open source. You can download the model weights, chat with the model locally, and even fine-tune it on a personal data set. But these models have
billions of parameters. So, how can you fine-tune them without the expensive GPUs used to pre-train them? The answer
is Laura. Now that the NPCs have clicked off the video, let's get into it. If
you're still here, you're the kind of person I would want in my ML community, which I'll talk about at the end. Back
to Laura. There are some other tricks like quantization, but lowering adaptation is almost always used for fine-tuning. There's two main questions
fine-tuning. There's two main questions to ask. First, of the billions of
to ask. First, of the billions of parameters in the model and all the layers, or all of them necessary to update? The answer is a resounding no.
update? The answer is a resounding no.
We can achieve almost the same performance by freezing most of the weight matrices and only updating the ones used in the attention layer. one of
the core components of an LLM. The
second question to ask is: for the matrices we do update, do we actually need to directly update each entry through gradient descent? For an N by N matrix, that's N squared total parameters, and N can be in the thousands for LLMs. So, we can make a second simplification that achieves almost the same performance. Freeze the original weight matrix W naught. Instead, let's calculate delta W and add it to W naught. Delta W is the product of two smaller matrices B and A. B is N by R and A is R
by N. The entries in B and A are the ones updated through gradient descent, and this is only 2NR parameters. R is typically much, much smaller than N. If N is in the thousands, R might only be, say, 16. Dropping the constant factor, this is on the order of N entries to update, which is far less than N squared. By the
way, the original LoRA paper recommends initializing B to zero and A to a normal distribution for optimal results. Side
note, if you're interested in an end-to-end walkthrough with every line of code explained, I created a full course on fine-tuning exclusively for the students in our ML community. In
addition to requiring far less compute, there's another benefit to LoRA. Let's
look at a tree where we have different versions of an LLM, each serving a use in production. The most intensive step here is loading the root model into memory. But that only needs to be done once. To switch between customized states of the model, we simply add and subtract delta W, which is just the product BA for each node in this tree. Adding and subtracting these smaller matrices is far less intensive than reloading the full model, which is billions of parameters large. LoRA is
incredibly easy to use on free GPUs like the one provided in Google Colab and will likely be used for years to come.
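The parameter savings described above can be sketched in plain Python. The sizes and the `matmul` helper are illustrative assumptions; a real LLM would have N in the thousands and use a tensor library.

```python
import random

def matmul(B, A):
    """Multiply an N x R matrix B by an R x N matrix A to form the N x N update delta W."""
    n, r = len(B), len(B[0])
    return [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(n)] for i in range(n)]

# Illustrative sizes: N would be in the thousands for a real LLM; R stays small.
N, R = 8, 2

# Per the LoRA paper's recommendation: B starts at zero, A starts from a normal distribution.
B = [[0.0] * R for _ in range(N)]
A = [[random.gauss(0.0, 0.02) for _ in range(N)] for _ in range(R)]

# Only the entries of B and A are trained: 2*N*R parameters
# instead of the N*N entries of the full weight matrix.
trainable = 2 * N * R   # 32 here; 2NR dwarfed by N^2 when N is in the thousands
full = N * N            # 64 here

# The frozen weight W naught is used as W naught + delta W at inference time.
delta_W = matmul(B, A)  # all zeros before any training, so fine-tuning starts from W naught
print(trainable, full)  # 32 64
```

Since B starts at zero, delta W is the zero matrix at the start of fine-tuning, so the adapted model initially behaves exactly like the pre-trained one.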
If you found this video helpful and are interested in more fine-tuning videos, leave a comment and I'll see you soon.
You may have asked ChatGPT a question before and it says, "Since my last knowledge update, there's no way to answer your question." How do we keep LLMs up to date with new information or
let them access information that wasn't in the training data? One option is to continue training the model or fine-tune it on new data, but this is expensive
and time-consuming. The alternative is RAG, or retrieval-augmented generation. This
approach originated from a 2020 paper written by researchers at Facebook's AI research division and has quickly become an extremely popular tool. On to how RAG actually works. First, let's start with some bank of information that we want an LLM to have access to so that the LLM's responses to our questions can actually
factor that bank of information in. This
might be a new version of a textbook that just wasn't in the training data.
Or it might consist of internal company documents that obviously weren't in the training data, which is usually just the internet, but are still necessary for a company's internal LLM to have access
to. That is the general idea. Here is the RAG workflow that helps us achieve this. And if any parts seem a bit confusing at first, don't worry. We'll
go over all of them. We have a user that sends a query or question to an LLM. But
let's not send the query directly to the LLM. We'll send the query to an embedding model which will output a vector representation of the query. This
hits a retriever model which will search a vector database for knowledge that might help answer the question. This
database outputs the top K documents that were most similar to the question being asked where K is a parameter we get to choose. The higher K is, the more information we're giving to the LLM to
answer the original question. But keep
in mind that K can't be too large, since we might end up including irrelevant information, introducing additional latency into our workflow, or simply exceeding the LLM's context length, which
is the maximum number of tokens or words it can take in at once. When actually
prompting the LLM, we can say something like, "Here are some documents that might help answer the following question." And then we concatenate the question and the top K documents. The
output from the LLM is then given back to the user. That's a high-level overview, but let's go over the embedding generator and the retriever. Regarding
the LLM, it's just a standard transformer like ChatGPT. Let's treat
that as a black box for this video. We
pass in a prompt and we get a response back. Okay, let's go back to the
back. Okay, let's go back to the knowledge bank from earlier. We break it into chunks. This might be the chapters
into chunks. This might be the chapters of a textbook, the subsections within a long company document, or even something arbitrary like each paragraph or a total
number of words for each chunk. How we
choose to chunk the knowledge bank will definitely affect the results and depends on our use case. Then we'll
generate a high-dimensional vector representation for each chunk, which is essentially an embedding. We're going from raw strings to high-dimensional vectors that capture the meaning of the text. We'll then store those embeddings in a vector database. But what neural network is actually used to generate the embedding of each chunk? This is
typically just another transformer which is trained to generate embeddings for the text passed in. If you're interested in how transformers work, check the second link in the description. This
neural network is typically just the left side of the transformer, known as the encoder. We're encoding the input into a meaningful high-dimensional vector. To summarize, before anything in this workflow happens, we break up the knowledge bank into manageable chunks which are sent through the embedding model and then stored in the vector database. Then during the actual RAG workflow, each question we want to ask the LLM is also sent through the embedding model and then the retriever searches the vector database for the
similar documents that might help answer the question. But let's talk a bit more about the retriever. A class of algorithms called maximum inner product search, or MIPS, is used here. For our case, inner product essentially means dot product, which we know is a measure of similarity between two vectors. So a maximum inner product search means we're looking for the top K chunks in the vector database that have the highest dot product, or similarity, with the embedded question. And that's a high-level overview of RAG. If you're an aspiring data scientist or machine learning engineer, this is a must-know topic. In
this video, we're going over the vision transformer, and it won't require understanding all the details of this diagram. The transformer is a special kind of neural network that powers large language models like GPT. It was developed in 2017 and was presented in the paper "Attention Is All You Need" from Google. And in 2021, Google developed the vision transformer for processing images instead of text. What's crazy is that vision transformers can outperform CNNs, or convolutional neural networks, the go-to model for image classification. Attention is the main mechanism in a transformer that makes the model so effective at processing sequences. The model learns which parts of the sequence are important to, well, pay attention to. In processing text, this means that the model might associate adjectives and nouns. And with the vision transformer, the model pays attention to the most important parts of the image. Before we
go over how this actually works, this is your reminder to sign up for GPT Insiders so that you don't miss our next edition. GPT Insiders is my free emailing list where each day I share a different insight or resource for your ML journey. No spam, just valuable tips and resources. You can sign up at the link in the description. Let's get into it. We're going to treat the transformer as a black box. Let's focus on the inputs and the outputs. The input is always a sequence. In the case of GPT, a sequence of words or tokens where each
is represented as a number. How can we break an image down into a sequence? One
option is to pass in every single pixel as an integer, but for an image that's 100 by 100, which is still very low resolution, the sequence would have 10,000 elements. This is infeasible since the computational cost of attention is n squared, where n is the length of the input sequence. The solution is to break the image into patches. So our input sequence's length is now equal to the number of patches.
This would just be the size of the image divided by the size of a single P by P patch. At each entry in the input sequence, we have a vector which is just the patch flattened into a 1D list. The
patch size is critical. If we make the patches too small, the sequence will be too long and compute demands will be too high. If we make the patches too large, the sequence will be shorter and compute demands will be less, but we risk oversimplifying the input and obscuring important information from the model.
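The sequence-length tradeoff above can be sketched directly; the 224 and 16 below are just example sizes (the ones the original ViT paper used), not a recommendation.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of patches when a square image is split into non-overlapping P x P patches."""
    assert image_size % patch_size == 0, "patch size must divide the image size"
    per_side = image_size // patch_size
    return per_side * per_side

# A 224x224 image with 16x16 patches gives a sequence of length 196,
# versus 50,176 if every pixel were its own sequence element.
print(num_patches(224, 16))  # 196

# Halving the patch size quadruples the sequence length, and attention cost
# grows with the square of that length.
print(num_patches(224, 8))   # 784
```

Each patch is then flattened into a vector (patch_size * patch_size values, times 3 for an RGB image) before entering the transformer.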
The optimal patch size can be found experimentally. In summary, the transformer always takes in a sequence of length n with a vector for each entry in the sequence. But what about the output? After the model performs its calculations, the transformer outputs another sequence of n different vectors.
But what if our goal is to use the model for image classification? We might want to classify the input image as a dog, cat, or bird. There are ways that we can force the model to output vectors of
size three where each entry corresponds to a probability. But we have n different vectors. Which vector do we consider to be the model's prediction?
The solution is to prepend the input sequence with another dummy vector. Then
we can look at the model's output vector for that corresponding index. This is
the model prediction, which will be improved over many iterations of training. And that's the gist of vision transformers. There is one catch, though. Vision transformers must be trained on extremely large amounts of data to outperform CNNs. On the other hand, they require substantially less compute to train. The bottom line is that while vision transformers are extremely powerful and promising, CNNs aren't going away anytime soon. If you're
interested in a deeper discussion of the vision transformer, I have a huge announcement for you. Beginner's
blueprint is finally available. This is
the exact study plan I wish I had when I was first getting started with machine learning. Everyone told me to read papers, but I had no idea which papers to read. Once I figured that out, I had no idea how to understand them, dissect them, much less implement them, and code up the main concepts. And lastly, Workday was a nightmare. I had no idea how to present these projects on my resume and actually land more interviews. The beginner's blueprint will solve all of these problems for you so that you can make progress faster than I did. Take it from someone from the IIT Madras class of 2025. I
personally provided him with a roadmap helping him get started with the implementation ASAP. Or Shear. He's an
NLP expert and his personal favorite resources are our ML programming questions which are accessible for free.
Lastly, someone like Chang. He's an ex-Yahoo AI/ML engineer and he knows his stuff. It's been a blast making videos on this channel for the last year and I'm excited to help even more of you with premium personalized instruction.
Our launch sale is now active and you can secure the entire blueprint for 50% off. Head to the link in the description
off. Head to the link in the description to learn more and I'll see you on the other side. CNNs, or convolutional neural networks, are one of the most powerful models for giving computers the gift of vision. They're used in self-driving, image captioning networks, and even GANs, or generative adversarial networks. The central idea is to have the model learn which features and patterns in an image are the most important. Here is my promise to you. I will provide the clearest explanation of CNNs you've ever seen in return for 5 minutes of your time. Let's get
started. First, let's understand the input. We pass images into CNNs, which are typically represented as arrays of pixel values. For a 4x4 image, we would have a 4x4 array of numbers. The main
calculation inside a CNN is called a convolution, and each filter performs a convolution. Each filter is just another array of numbers that slides over the input image left to right and top to bottom, performing multiplications and additions. Now is a good time to clarify what convolution really is. Given a 4x4 image, let's slide this 2x2 filter over the image and see what the output is. We
overlay the filter on every 2x2 subgrid within the image. Multiply the corresponding numbers and add them up. This is the calculation for the top left subgrid.
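The sliding-window calculation just described can be sketched in a few lines of Python. The pixel and filter values below are hypothetical, since the on-screen numbers aren't in the transcript.

```python
def convolve(image, kernel):
    """Slide the kernel over the image (stride 1), multiplying overlapping
    entries and summing them, to produce the convolved output."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(k) for b in range(k))
            row.append(total)
        out.append(row)
    return out

# Hypothetical 4x4 image and 2x2 filter.
image = [[4, 4, 1, 0],
         [0, 4, 0, 2],
         [1, 0, 1, 3],
         [0, 2, 0, 1]]
kernel = [[1, 2],
          [0, 2]]

result = convolve(image, kernel)
print(len(result), len(result[0]))  # 3 3 -- a 4x4 image and 2x2 filter give a 3x3 output
print(result[0][0])                 # top left entry: 4*1 + 4*2 + 0*0 + 4*2 = 20
```

The same function works for any square image and filter sizes, which is why the output shrinks to (n - k + 1) on each side.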
Slide one unit right. This process
repeats for the entire image. Also,
CNNs have many filters. So, if we had 10 filters, we would have 10 outputs. Since each output is 3x3, at this stage we have a 3x3x10 array of numbers. But
let's say the ultimate task of this model is to classify the input image as a dog, cat, or bird. We would want to output a vector of size three where each entry corresponds to the probability
that the input image is a dog, cat, or bird. So given this 3x3x10 array, the model would use a series of reshape operations, nonlinear functions, matrix multiplications, and even more convolutions to arrive at the final
output vector of size three. But there's
one crucial detail we still haven't discussed: training. When we initialize a CNN, the entries in all the filters are random numbers. So the initial model predictions will be inaccurate. Over the
course of training, the entries of the filters are updated until we're satisfied with the model's predictions.
Through training, each filter learns to detect a unique pattern in the image.
For example, one filter might detect a dog's tail, meaning when we slide that filter over the image, the output would be nonzero. But if we slide that filter over an image with no dog tail, the output would be mostly zero. But how are
the filter entries actually updated during training? That brings us to gradient descent, the first ML algorithm I think everyone should learn. I have a three-minute video breaking down gradient descent which should pop up in the top right. If you enjoyed the visuals in this video, you'll love the visuals in the gradient descent video. I can't recommend it enough. See you soon. CNNs, or convolutional neural networks, are a kind of model that specializes in detecting patterns in images. They're
used in self-driving, image captioning networks, and even GANs or generative adversarial networks, which can be used to create deep fakes. The core idea is to have many weight matrices called
filters that slide over an input image, multiplying and adding up numbers. The
model learns the right numbers for each filter so that special characteristics in the image can be detected. Let's answer a few quiz questions about CNNs.
Question one: what algorithm is used to update the filter weights during training? A, linear regression; B, gradient descent; C, dynamic programming; D, self-attention. The answer is gradient descent, a minimization algorithm. This equation is used to update each entry in the filter matrix at each iteration of training. If
you're not familiar with gradient descent, I have a 3-minute video breaking this equation down. It's the second link in the description. Question two: if we have a 4x4 image and a 2x2 filter, what shape would the output, called the convolved image, have? A, 4x4; B, 3x3; C, 4x3; D, 3x4.
Understanding these dimensions is important since libraries like PyTorch and TensorFlow require you to specify them when instantiating CNNs. If you
want to try the calculation yourself, pause the video here. Okay, let's assume a stride of one, which means that the filter slides one unit at a time. This
filter can slide from left to right a total of three times and downwards a total of three times. And all the convolutions look like this.
And we end up with a 3x3 output.
Question four of this quiz will cover how the calculation is actually done.
Question three: which of the following about CNNs is true? A, gradient descent works better on normal neural networks than CNNs; B, the many filters of a convolutional layer are applied in sequence; C, each filter is independently applied in parallel; D, CNNs typically don't contain other layers like linear or sigmoid layers. The answer is C: each filter is independent and applied in parallel. Since each filter operates on the input image independently, each can learn a unique pattern in the image, like a vertical edge or horizontal edge. If you're wondering about the other choices, gradient descent works just fine on CNNs, and CNNs do use other layers in addition to convolutions to form the final prediction. Question
four, given this 4x4 image and the numbers at each pixel and this 2x2 filter, what would the top left entry in the output be? If you want to try the
calculation yourself, pause the video here. Okay. The top left entry is found by overlaying the filter on the top left corner of the image. We multiply the corresponding numbers and add them up.
That's all a convolution is. That's 4 * 1 + 4 * 2 + 0 * 0 + 4 * 2, which comes out to 20. Keep in mind that the output
matrix from this convolution is passed on to other layers of the model which ultimately give us a single vector of the model's predictions. Then at each iteration of training, the numbers in
the filter are adjusted to make the final model prediction more accurate.
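The arithmetic in question four can be double-checked in a couple of lines, using the overlapping values from the worked example:

```python
# Top left 2x2 subgrid of the image (values from the worked example) and the 2x2 filter.
subgrid = [[4, 4],
           [0, 4]]
filt = [[1, 2],
        [0, 2]]

# Overlay, multiply corresponding entries, and add them up.
top_left = sum(subgrid[a][b] * filt[a][b] for a in range(2) for b in range(2))
print(top_left)  # 4*1 + 4*2 + 0*0 + 4*2 = 20
```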
That wraps up our quiz on CNNs. If you watched through the full video, then I think you'll benefit from my ML community. I offer one-on-one AI/ML mentorship and help you build projects for your resume. You can read more about it at the link in the description. I
hope you found this video quiz useful and I'll see you soon. This video is going to be different. We're going to learn about something other than neural networks for once. By the way, if you need a refresher on neural networks,
just check the second link in the description. So, that means we're not going to talk about deep learning today, which gets all the attention these days.
Deep learning is just the field of AI that's focused specifically on neural networks, a type of model that's far more powerful than simple statistical models like linear regression. By the
way, it's a small technicality, but technically deep learning is a subfield of ML and ML is a subfield of AI. So today we're going to talk about the K-means algorithm, which is a classical ML method that does not fall under deep learning. Sometimes neural networks are overkill. For simple ML problems, neural networks are not necessary, and that means we can build a high-accuracy model without neural networks. We prefer this when possible since training neural networks is expensive and time-consuming.
Let's say we have a thousand data points of information about dogs and each data point contains five numbers describing a particular dog, their weight, height,
tail length, etc. But for each dog, we have no idea which breed they belong to.
In other words, our data is unlabeled.
Can we create an algorithm that groups the similar dogs together? Beforehand,
we must decide how many groups or breeds there might be. We'll call this variable K. And let's represent each dog from the data set as a vector with five entries. We want the similar dogs to end up in the same group or breed. Let's start off by randomly assigning each dog to one of our K different groups or clusters. And over
some number of iterations, let's update which group each dog belongs to. And at
the end, the similar dogs will be assigned to the same group or breed.
Here's what the final output of the algorithm after it finishes running might look like for the case where K equals 3 and we only have two attributes
for each dog. Attribute one on the x-axis and attribute 2 on the y-axis.
But we've been treating this algorithm like a black box. At each iteration, how do we actually update which group or breed each dog belongs to? Step one,
calculate the average vector of each group. Step two, store the average as the centroid. Step three, assign each dog to the group whose centroid is nearest. Step four, repeat. Let's talk about step
one. By the way, if you're still watching the video, then I know you're not just mindlessly scrolling through YouTube and that you're actually interested in ML. You're the kind of person I want to join my ML community,
which you can read about at the second link in the description. Back to step one. Before the algorithm, we randomly assigned each dog to a group, where the groups are indexed from one to K. To
calculate the mean of each cluster, we simply average all the vectors assigned to a given group together. Specifically,
this average is done element by element for all the vectors, and the resulting vector is called the centroid of that group. In the image from earlier, the black dots are the centroids for each cluster after the algorithm finishes. But while the algorithm is running, each dog isn't necessarily assigned to the nearest centroid. The photo below should clarify what that means. We start off with totally random clusters and centroids. After each iteration, the centroid approximations get better. At each iteration, we assign each data point to the group whose centroid is nearest, which shifts the centroids, and this process repeats. Let's focus on
the final blue cluster. Initially, that cluster centroid was here. After reassigning data points based on the closest centroid, that cluster centroid shifted over here. We performed one more iteration and nothing changed, so we knew the algorithm concluded. Let's talk briefly about step three. We want to assign each dog to the nearest centroid, or to the nearest cluster. That means we need to calculate the distance between every data point and each of the K centroids so that we can find which cluster is closest to each data point.
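The four steps above can be sketched in plain Python. The data points and k value below are illustrative assumptions, and real implementations add refinements like smarter initialization.

```python
import math
import random

def distance(p, q):
    """Euclidean distance; works for 2-dimensional or 5-dimensional vectors alike."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, iterations=100):
    # Start off by randomly assigning each point to one of the k clusters.
    assignments = [random.randrange(k) for _ in points]
    for _ in range(iterations):
        # Steps one and two: the centroid of each cluster is the element-wise mean.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if not members:              # keep an empty cluster from breaking the mean
                members = [random.choice(points)]
            dim = len(members[0])
            centroids.append([sum(p[i] for p in members) / len(members) for i in range(dim)])
        # Step three: reassign each point to the group whose centroid is nearest.
        new_assignments = [min(range(k), key=lambda c: distance(p, centroids[c]))
                           for p in points]
        if new_assignments == assignments:   # nothing changed, so the algorithm concluded
            break
        assignments = new_assignments        # step four: repeat
    return assignments, centroids

# Two tight groups of 2D points; with k=2 they should end up in separate clusters.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (9.0, 9.0), (9.1, 8.9), (8.9, 9.2)]
labels, centers = k_means(points, k=2)
print(labels)
```

The same code handles five-dimensional dog vectors unchanged, since `distance` and the element-wise mean both work for any number of entries.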
We can simply use the distance formula, which looks like this for two-dimensional data and looks like this for five-dimensional data. After enough iterations, which could be hundreds or thousands, the K-means algorithm is complete. There are other clustering algorithms, but K-means is the simplest one. Let's say you're working with a new data set and there are no labels. You might want to gain an initial understanding about your data or group the data points into buckets. K-means might be the best place to start. There are also many efficient implementations of K-means, and leave a comment if you'd be interested in a video breaking down the code. I hope you found the break from deep learning interesting and I'll see you soon. Four must-know ML concepts.
Whether you're a student, a data scientist, or engineer, this video will serve as either a refresher of the basic fundamentals or simply a way to bridge the key concepts together. Instead of
talking about each concept independently, I'll bridge them together like a story. Let's get started. The
first is training. It can be really annoying when people talk about models learning or training without being clear as to what that means. So let's start off with a high-level explanation, and later
in the video we'll talk about the actual equation used for training. Over some
number of iterations we adjust the model and improve its performance. Let's say
our model learns to predict how tall someone will be from a data set where we have various information about people as well as their final height once they're finished growing. How do we actually increase the model's accuracy? Well, we need to remind ourselves of what a model is. It's just a mathematical formula that outputs a prediction. There are two
components to the formula. First, the
inputs that might affect someone's final height. This might be their current weight, their current height, and the average of their parents' heights. Second, the parameters. They're also called weights. Let's look at the simplest kind of ML model, represented by this equation. The input numbers are X1, X2, and X3, while W1, W2, W3, and B are the parameters of the model. The parameters are just random numbers at first, so the output number H won't be a very accurate prediction of how tall someone will be. But over training, we update the values of the parameters, improving the model's accuracy. And
that's all training is. We're just
adjusting the parameters of the formula.
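That formula can be written out directly. The input values and parameter values below are hypothetical, just to show the shape of the calculation.

```python
# The simplest model: a weighted sum of the inputs plus a constant term B.
def predict_height(x1, x2, x3, w1, w2, w3, b):
    """h = w1*x1 + w2*x2 + w3*x3 + b"""
    return w1 * x1 + w2 * x2 + w3 * x3 + b

# Hypothetical inputs: current weight (kg), current height (cm),
# and the average of the parents' heights (cm).
x1, x2, x3 = 60.0, 165.0, 172.0

# The parameters start as arbitrary numbers; training would adjust them.
h = predict_height(x1, x2, x3, w1=0.1, w2=0.5, w3=0.4, b=5.0)
print(h)  # roughly 162.3
```

Training changes only the w's and b, never the formula itself, which is exactly the point being made here.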
LLMs, for example, are also just large math formulas. They are formulas that predict the exact sequence of words that best answers your prompt. The formula is a lot more complex since we're dealing with words. And before we pass words into a math formula, we have to represent each word as a number. And then when numbers are output by the LLM, we have to convert them back into words so that we can actually read the response. But at the end of the day, a large language model is also still a math formula. In concept number four, the last concept of this video, we'll see the actual equation used to update the parameters at each iteration of training. But that concludes our overview of training for now. Concept
number two is linear regression. This
one will take less time to explain. The
super simple model we talked about earlier actually has a name: linear regression. Here's the equation again.
This model factors in three pieces of information about the input person, X1, X2, and X3. But if we wanted to factor in more characteristics about someone,
we could add more terms. There is one issue with linear regression, though. We can't square or raise any of the inputs to any power. And we also can't use any nonlinear functions like the sigmoid function. For many cases, linear regression is effective. But sometimes the relationship between input and output is more complex, and in these cases, linear regression yields poor accuracies. But what if we just ran the training algorithm for more iterations?
In those cases, even if we update the parameters repeatedly, the model's accuracy will remain low. Fortunately,
neural networks solve this issue, which brings us to concept number three.
Neural networks are another kind of model and they're much more powerful than linear regression. Here is a standard neural network. They use linear regression as well as nonlinear
functions like the sigmoid. The three
input nodes store x1, x2, and x3 respectively. Each of the four nodes in the next layer actually uses this equation. Each node in that layer stores its own set of parameters, which are W1 through W3 plus the constant term B. Those parameters are updated during training. Just to be clear, that's actually four parameters per node in this middle layer, for a total of 16 parameters in that column of four nodes. But here's where neural networks differ from simply having each node do a linear regression in parallel. At this point, we have four y values, y1 through y4.
And before passing those values on to the next and final layer, let's pass each y value through the sigmoid function. Here it is again. Its outputs are always between zero and one. The larger the input, the closer the output is to one. So what do the two nodes in the output layer, which we'll symbolize with the letter O, actually do? They use this equation. But the x's, or the inputs for this layer, are actually the outputs of the sigmoid function, which transforms the previously calculated y values. Just to be clear, here's the relationship between the y's and these x's. Each of those two output nodes learns and stores its own set of parameters w1 through w4, plus a constant term b as well. That gives us two output values from the model. Let's simply take the average of them and report that number as the final height prediction.
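The full forward pass just described, four hidden linear units, a sigmoid on each y value, then two output units whose results are averaged, can be sketched as follows. All parameter values are made up for illustration.

```python
import math

def sigmoid(z):
    """Outputs are always between 0 and 1; larger inputs push the output toward 1."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(inputs, weights, b):
    return sum(w * x for w, x in zip(weights, inputs)) + b

def forward(x, hidden_params, output_params):
    ys = [linear(x, w, b) for (w, b) in hidden_params]   # four y values, y1..y4
    activations = [sigmoid(y) for y in ys]               # squashed to (0, 1)
    outputs = [linear(activations, w, b) for (w, b) in output_params]
    return sum(outputs) / len(outputs)                   # average the two output values

# Hypothetical parameters: 4 hidden nodes with 3 weights + a bias each
# (16 parameters in that column), and 2 output nodes with 4 weights + a bias each.
hidden = [([0.1, -0.2, 0.05], 0.0) for _ in range(4)]
output = [([1.0, 1.0, 1.0, 1.0], 0.0), ([0.5, 0.5, 0.5, 0.5], 0.0)]

print(forward([60.0, 165.0, 172.0], hidden, output))
```

Deleting the `sigmoid` line would collapse the whole network back into one big linear regression, which is exactly why the nonlinearity matters.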
And here's the bottom line. Adding
nonlinear functions in between the linear regression layers is powerful.
The model is now capable of learning a far more complex relationship. That
brings us to concept number four. The
actual equation for learning. The
equation or algorithm is called gradient descent. Here are the formulas used at every iteration of training. Don't
worry, no crazy calculus needed here.
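The on-screen formula isn't captured in the transcript, but in standard notation the update rule being described, for any parameter w (or constant term b) and error function E, is:

```latex
w \;\leftarrow\; w \;-\; \alpha \cdot \frac{\partial E}{\partial w}
```

Here alpha is the learning rate and the derivative term is the gradient, matching the discussion that follows.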
Also, the same formula is used to update the W's and the constant terms B, so we can just write one equation. Alpha is called the learning rate. It's typically a small value like 0.01, but it can vary. And the derivative of the error is called the gradient. We subtract out the product of alpha and the gradient at each iteration. And with enough iterations, the parameters will eventually stop changing. At this stage, the model's error function is minimized. That's actually the main idea. This equation, or algorithm, gradient descent, can be used to minimize any function. In this case, we use it to minimize the error function, specifically the error between the model's predictions and the true answers. The higher alpha is, the higher the product is, so we are subtracting out a greater number at each iteration. Higher alpha will cause parameters to change quickly from iteration to iteration, and vice versa.
That's why we call it the learning rate.
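The update loop described above can be shown on a toy error function. This is a minimal sketch under an assumed example: E(w) = (w - 3)^2, whose minimum is at w = 3 and whose derivative is 2 * (w - 3).

```python
# Gradient descent on a toy error function E(w) = (w - 3)^2.
alpha = 0.01          # the learning rate
w = 0.0               # start from an arbitrary initial parameter value

for _ in range(2000):
    gradient = 2 * (w - 3)      # derivative of the error with respect to w
    w = w - alpha * gradient    # subtract out alpha times the gradient

print(w)  # after enough iterations, w settles near 3, where E is minimized
```

With a much larger alpha the steps are bigger and w moves faster, but too large a value can overshoot the minimum and diverge, which is why finding a sweet spot matters.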
We want to find a sweet spot for alpha, and it depends on the model we're training. But why exactly does the derivative of the error function come into this equation? I actually have a separate video to visualize this concept. It's only 3 minutes long, and I promise it's worth it. It should pop up in the top right, and I really recommend checking it out right now. It'll also pop up at the end in case you don't click it now. Okay,
that wraps up our four must-know concepts. If you're still watching this video, I think you're the kind of person that would love our ML community. Check out the link in the description to learn more about it and to grab the free LLMs course I created. It's over 10 hours long with 25 different modules and practice problems. Let me know if you have any questions, and I'll see you