Attention is all you need (Transformer) - Model explanation (including math), Inference and Training

By Umar Jamil

Summary

Topics Covered

Two Critical Problems With RNNs
Why Transformers Were Created to Solve RNN Limitations
Words Are Just Numbers: Word Embeddings Explained
One Time Step: The Power of Transformers
Greedy vs Beam Search Inference Strategies Explained

Full Transcript

Hello guys, welcome to my video about the Transformer. This is actually version 2.0 of my series on the Transformer. I had a previous video in which I talked about the Transformer, but the audio quality was not good, and since that video had a huge success, my viewers suggested that I improve the audio quality. This is why I'm doing this video. You don't have to watch the previous series, because I will be doing basically the same things but with some improvements, so I'm compensating for some mistakes I made and adding some improvements. After watching this video, I suggest watching my other video about how to code a Transformer model from scratch: how to code the model itself, how to train it on data, and how to run inference with it. Stick with me, because it's going to be a bit of a long journey, but for sure worth it.

Now, before we talk about the Transformer, I want to first talk about recurrent neural networks, the networks that were used for most sequence-to-sequence tasks before the Transformer was introduced. So let's review them.

Recurrent neural networks existed a long time before the Transformer, and they allowed us to map one sequence of input to another sequence of output. In this case our input is X and we want an output sequence Y. What we did before is split the sequence into single items. We gave the recurrent neural network the first item as input, X1, along with an initial state usually made up of only zeros, and the recurrent neural network produced an output, let's call it Y1. This happened at the first time step. Then we took the hidden state of the network from the previous time step, along with the next input token, X2, and the network had to produce the second output token, Y2. Then we did the same procedure at the third time step, in which we took the hidden state of the previous time step along with the input token at time step 3, and the network had to produce the next output token, Y3. If you have n tokens, you need n time steps to map an n-token input sequence into an n-token output sequence. This worked fine for a lot of tasks but had some problems. Let's review them.

The first problem with recurrent neural networks is that they are slow for long sequences. Think of the process we did before: we have a kind of for loop in which we do the same operation for every token in the input, so the longer the sequence, the longer the computation, and this made the network not easy to train for long sequences.

The second problem was vanishing or exploding gradients. You may have heard these terms on the internet or in other videos, but I will try to give you a brief insight into what they mean on a practical level. As you know, frameworks like PyTorch convert our networks into a computation graph. This is not a neural network: I will be making a computation graph that is very simple and has nothing to do with neural networks, but it will show you the problems that we have. Imagine we have two inputs, x and another input, let's call it y. Our computation graph first multiplies these two numbers, so we have a first function, let's call it f(x, y), that is x multiplied by y. The result, let's call it z, is given to another function, let's call it g(z), which is equal to z squared.

What PyTorch does, for example, is this: usually we have a loss function, and PyTorch calculates the derivative of the loss function with respect to each weight. In this case we just calculate the derivative of the g function, the output function, with respect to all of its inputs. So the derivative of g with respect to x is equal to the derivative of g with respect to f, multiplied by the derivative of f with respect to x. These two df terms kind of cancel out. This is called the chain rule. Now, as you can see, the longer the chain of computation, that is, the more nodes we have one after another, the longer this multiplication chain. Here we have two factors, because the distance from this node to this one is two, but imagine you have 100 or 1000.
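The two-node graph from this passage is small enough to differentiate by hand. The sketch below is only an illustration of the chain rule on f(x, y) = x * y and g(z) = z squared; the function names and the input values 3.0 and 2.0 are assumptions chosen for the example, and the finite-difference check is just a sanity test, not how a framework computes gradients.

```python
def f(x, y):
    # First node of the graph: z = x * y
    return x * y


def g(z):
    # Second node of the graph: g(z) = z squared
    return z ** 2


def dg_dx(x, y):
    z = f(x, y)
    dg_df = 2 * z   # derivative of g with respect to its input z = f(x, y)
    df_dx = y       # derivative of f with respect to x
    return dg_df * df_dx  # chain rule: dg/dx = dg/df * df/dx


def numerical_dg_dx(x, y, eps=1e-6):
    # Central finite difference, used only to check the analytic result
    return (g(f(x + eps, y)) - g(f(x - eps, y))) / (2 * eps)


x, y = 3.0, 2.0  # illustrative inputs (assumption)
print(dg_dx(x, y))            # 24.0, i.e. 2 * (x * y) * y
print(numerical_dg_dx(x, y))  # close to 24.0
```

With only two nodes the product has two factors; every extra node in the chain adds one more factor to the same product.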

Now imagine this number is 0.5 and this number is also 0.5. The result of multiplying them together is a number that is smaller than the two initial numbers: it's going to be 0.25, because one half multiplied by one half is one fourth. So if we have two numbers that are smaller than one and we multiply them together, they will produce an even smaller number, and if we have two numbers that are bigger than one and we multiply them together, they will produce a number that is bigger than both of them. So if we have a very long chain of computation, it will eventually become either a very big number or a very small number, and this is not desirable. First of all, our CPU or GPU can only represent numbers up to a certain precision, let's say 32-bit or 64-bit. And if the number becomes too small, the contribution of this number to the output will become very small, so when PyTorch, or let's say our framework, calculates how to adjust the weights, the weights will move very, very slowly, because the contribution of this product will be a very small number.
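To make the effect of chain length concrete, here is a small sketch that multiplies the same factor many times. The factor 0.5 comes from the example above; the factor 1.5 and the chain lengths are assumptions picked only to show the opposite, exploding case.

```python
def chain_product(factor, length):
    # Multiply `factor` by itself `length` times, like a chain of
    # `length` local derivatives that all happen to have the same value.
    product = 1.0
    for _ in range(length):
        product *= factor
    return product


for length in (2, 10, 100):
    shrinking = chain_product(0.5, length)  # factors smaller than one
    growing = chain_product(1.5, length)    # factors bigger than one
    print(f"length={length:3d}  0.5^n={shrinking:.3e}  1.5^n={growing:.3e}")
```

At length 2 the shrinking product is the 0.25 from the example; at length 100 it is far too small to move a weight noticeably, while the growing product has become enormous.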

This means that the gradient is vanishing, or in the other case it can explode and become a very big number, and this is a problem.

The next problem is the difficulty of accessing information from a long time ago. What does it mean? As you remember from the previous slide, the first input token is given to the recurrent neural network along with the first state. Now we need to think of the recurrent neural network as a long graph of computation: it will produce a new hidden state, then we will use the new hidden state along with the next token to produce the next output. If we have a very long input sequence, the last token will have a hidden state whose contribution from the first token has nearly gone because of this long chain of multiplication. So the last token will not depend much on the first token, and this is also not good, because, for example, we know as humans that in a quite long text the context we saw, let's say, 200 words before is still relevant to the context of the current words. This is something that the RNN could not capture, and this is why we have the Transformer.

The Transformer solves these problems of recurrent neural networks, and we will see how. We can divide the structure of the Transformer into two macro blocks. The first macro block is called the encoder, and it's this part here. The second macro block is called the decoder, and it's the second part here. The third part, which you see here on the top, is just a linear layer, and we will see why it's there and what its function is. The two blocks, the encoder and the decoder, are connected by this connection you can see here, in which some output of the encoder is sent as input to the decoder, and we will also see how.

Let's start first of all with some notation that I will be using during my explanation and that you should be familiar with, and let's also review some maths. The first thing we should be familiar with is matrix multiplication. Imagine we have an input matrix which is a sequence of, let's say, words, so sequence by d_model, and we will see why it's called sequence by d_model. Imagine we have a matrix that is 6 by 512, in which each row is a word, and this word is not made of characters but of 512 numbers. So each word is represented by 512 numbers, like this: imagine you have 512 of them along this row, 512 along this other row, etc. One, two, three, four, five, so we need another one here. The first word we will call A, the second B, then C, D, E and F.

If we multiply this matrix by another matrix, let's say the transpose of this matrix, which is a matrix where the rows become columns (so three, four, five and six), this word will be here, then B, C, D, E and F, and we have 512 numbers along each column, because before we had them on the rows and now they are on the columns. This is a matrix that is 512 by 6, so let me add some brackets here. If we multiply them we get a new matrix: we cancel the inner dimensions and we keep the outer dimensions, so it will be six by six, that is, 6 rows by 6 columns. So let's draw it.
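The shape rule can be checked with a few lines of plain Python. This is a sketch under assumptions: the helper names are made up for the example and the matrix is filled with random numbers rather than real word vectors. Each entry of the result is the dot product of one row with one column, which here means one word vector with another.

```python
import random


def transpose(m):
    # Rows become columns
    return [list(col) for col in zip(*m)]


def dot(u, v):
    # Multiply element by element, then add everything together
    return sum(a * b for a, b in zip(u, v))


def matmul(a, b):
    b_cols = transpose(b)
    return [[dot(row, col) for col in b_cols] for row in a]


random.seed(0)
seq_len, d_model = 6, 512
# Six "words" (A to F), each represented by 512 random numbers
x = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]

out = matmul(x, transpose(x))   # (6 x 512) times (512 x 6)
print(len(out), len(out[0]))    # 6 6
```

Here `out[0][0]` is A multiplied by A and `out[0][5]` is A multiplied by F, in the sense of the dot product.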

How do we calculate the values of this six by six output matrix? This value is the dot product of the first row with the first column, so it is A multiplied by A. The second value is the first row with the second column, the third value is the first row with the third column, and so on until the last column, so A multiplied by F, etc. What is the dot product? You take the first number of the first row (here we have 512 numbers, and here we have 512 numbers) and the first number of the first column and multiply them together; then the second value of the first row and the second value of the first column and multiply them together; and then you add all these products together. So it will be this number multiplied by this, plus this number multiplied by this, plus this number multiplied by this, and so on: you sum all these numbers together, and this is the dot product of A with A. We should be familiar with this notation because I will be using it a lot in the next slides.

Let's start our journey through the Transformer by looking at the encoder. The encoder starts with the input embeddings. So what is an input embedding? First of all, let's start with our sentence. We have a sentence of, in this case, six words. What we do is tokenize it: we transform the sentence into tokens. What does it mean to tokenize? We split the sentence into single words. It is not necessary to always split the sentence into single words; we can even split the sentence into parts that are smaller than a single word. So we could split this sentence into, let's say, 20 tokens by splitting each word into multiple pieces. This is usually done in most modern Transformer models, but we will not be doing it, otherwise it's really difficult to visualize. So let's suppose we have this input sentence, we split it into tokens, and each token is a single word.

The next step is to map these words into numbers, and these numbers represent the position of these words in our vocabulary. Imagine we have a vocabulary of all the possible words that appear in our training set; each word will occupy a position in this vocabulary. For example, the word "your" will occupy position 105, the word "cat" will occupy position 6500, etc. As you can see, this "cat" here has the same number as this "cat" here, because they occupy the same position in the vocabulary. We take these numbers, which are called input IDs, and we map each of them into a vector of size 512.
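As a rough sketch of these two steps, the snippet below splits the example sentence "your cat is a lovely cat" into words, assigns input IDs and looks up one 512-number vector per ID. Everything beyond the sentence and the size 512 is an assumption: the vocabulary is built from this one sentence (so the IDs are not the 105 and 6500 mentioned above), and the embedding values are random placeholders standing in for parameters that training would adjust.

```python
import random

sentence = "your cat is a lovely cat"
tokens = sentence.split()  # one token per word, as in the talk

# Hypothetical vocabulary built from this sentence only; a real one
# would cover every word in the training set.
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

input_ids = [vocab[token] for token in tokens]
print(input_ids)  # [0, 1, 2, 3, 4, 1]: both "cat" tokens share an ID

d_model = 512
random.seed(0)
# One row per vocabulary entry; random values stand in for learned ones
embedding_table = [
    [random.gauss(0, 1) for _ in range(d_model)] for _ in range(len(vocab))
]

# Embedding lookup: the same input ID always maps to the same row
embeddings = [embedding_table[i] for i in input_ids]
print(len(embeddings), len(embeddings[0]))  # 6 512
```

The lookup is just row indexing, which is why the two occurrences of "cat" end up with identical vectors at this stage.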

This vector is made of 512 numbers, and we always map the same word to the same embedding. However, these numbers are not fixed: they are parameters of our model, so the model will learn to change them in such a way that they represent the meaning of the word. So the input IDs never change, because our vocabulary is fixed, but the embeddings change during the training process: the embedding values change according to the needs of the loss function. So the input embeddings basically map each single word into an embedding of size 512, and we call this quantity, 512, d_model, because that is the name used in the paper "Attention Is All You Need".

Let's look at the next layer of the encoder, which is the positional encoding. So what is positional encoding? What we want is that each word should carry some information about its position in the sentence, because so far we have built a matrix of words that are embeddings, but they don't convey any information about where that particular word is inside the sentence, and this is the job of the positional encoding. We want the model to treat words that appear close to each other as close and words that are distant as distant. So we want the model to see the spatial information that we see with our eyes. For example, when we see the sentence "what is positional encoding", we know that the word "what" is farther from the word "encoding" than from the word "is", because we have this spatial information given by our eyes, but the model cannot see this. So we need to give the model some information about how the words are spatially distributed inside the sentence, and we want the positional encoding to represent a pattern that the model can learn, and we will see how.

Imagine we have our original sentence, "your cat is a lovely cat". We first convert it into embeddings using the previous layer, the input embeddings, and these are embeddings of size 512. Then we create some special vectors called the positional encoding vectors, which we add to these embeddings. This vector we see here in red is a vector of size 512 which is not learned: it's computed once and not learned along with the training process. It's fixed, and this vector represents the position of the word inside the sentence. This gives us an output that is again a vector of size 512, because we are summing this number with this number, this number with this number, so the first dimension with the first dimension, the second dimension with the second, and so on. So we get a new vector of the same size as the input vectors. How are these positional encodings calculated? Let's see.

Imagine we have a smaller sentence, let's say "your cat is". You may have seen the following expressions from the paper. What we do is create a vector of size d_model, so 512, and for each position in this vector we calculate the value using these two expressions with these arguments. The first argument indicates the position of the word inside the sentence, so the word "your" occupies position zero. For the even dimensions, so 0, 2, 4, ..., 510, we use the first expression, the sine, and for the odd dimensions of this vector we use the second expression, the cosine. We do this for all the words in the sentence. So this particular value is calculated as PE(1, 0), because it's word one, dimension zero: this 1 represents the argument pos and this 0 represents the argument 2i. PE(1, 1) means word one, dimension one, so we use the cosine with position one, and 2i + 1 is equal to 1.
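The two expressions are shown on the slide but not spelled out in the transcript. In the paper they are PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model)). A minimal sketch of that computation follows; the function name and the three-word sequence length are assumptions made for the example.

```python
import math


def positional_encoding(seq_len, d_model):
    # pe[pos][dim] holds the encoding for word `pos`, dimension `dim`
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # `even_dim` plays the role of 2i in the paper's formulas
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            pe[pos][even_dim] = math.sin(angle)          # even dimension: sine
            if even_dim + 1 < d_model:
                pe[pos][even_dim + 1] = math.cos(angle)  # odd dimension: cosine
    return pe


# Three positions, as in "your cat is", with d_model = 512
pe = positional_encoding(seq_len=3, d_model=512)
print(pe[0][0], pe[0][1])  # 0.0 1.0 (sin(0) and cos(0))
print(pe[1][0], pe[1][1])  # sin(1) and cos(1)
```

Because the values depend only on the position and the dimension, the same table can be computed once and added to the embeddings of any sentence.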

We do the same for the third word, etc. If we have another sentence, we will not have different positional encodings: we will have the same vectors even for different sentences, because the positional encodings are computed once and reused for every sentence that our model sees during inference or training. So we only compute the positional encodings once, when we create the model, we save them and then we reuse them. We don't need to compute them every time we feed a sentence to the model.

So why did the authors choose the cosine and sine functions to represent positional encodings? Let's look at the plot of these two functions. The plot is by position, so the position of the word inside the sentence, and this depth is the dimension along the vector, so the 2i that you saw before in the previous expressions. If we plot them, we can see, as humans, a pattern here, and we hope that the model can also see this pattern.

Okay, the next layer of the encoder is the multi-head attention. We will not go inside the multi-head attention right away: we will first visualize single-head attention, so self-attention with a single head. Let's do it. So what is self-attention? Self-attention is a mechanism that existed before the Transformer was introduced; the authors of the Transformer just changed it into multi-head attention. So how does self-attention work? Self-attention allows the model to relate words to each other. We had the input embeddings, which capture the meaning of the word, then the positional encoding, which gives the information about the position of the word inside the sentence, and now we want self-attention to relate words to each other.

Imagine we have an input sequence of six words with d_model of size 512, which can be represented as a matrix that we will call Q, K and V. Our Q, K and V are the same matrix representing the input, so the input of six words with dimension 512, where each word is represented by a vector of size 512. We basically apply this formula from the paper to calculate the attention, self-attention in this case. Why self-attention? Because each word in the sentence is related to the other words in the same sentence. We start with our Q matrix, which is the input sentence. Let's visualize it: we have six rows, and on the columns we have 512 columns. They are really difficult to draw, but let's say we have 512 columns, and here we have six rows. Now, according to this formula, we multiply it by the same sentence but transposed, so the transpose of K, which is again the same input sequence, we divide by the square root of 512, and then we apply the softmax. As we saw before in the notation section, when we multiply a matrix that is 6 by 512 with another matrix that is 512 by 6, we obtain a new matrix that is six by six, and each value in this matrix is a dot product: this one is the dot product of the first row with the first column, this one is the dot product of the first row with the second column, etc. The values here are actually randomly generated, so don't concentrate on the values. What you should notice is that the softmax makes the values in each row sum up to one: this row, for example, sums up to one, this other row also sums up to one, etc. This value we see here is the dot product of the embedding of the first word with itself, this value here is the dot product of the embedding of the word "your" with the embedding of the word "cat", and this value here is the dot product of the embedding of the word "your" with the embedding of the word "is". Each value represents a score of how intense the relationship is between one word and another.

Let's go ahead with the formula. For now we have just multiplied Q by the transpose of K, divided by the square root of d_k and applied the softmax, but we didn't multiply by V. So let's go forward: we multiply this matrix by V and we obtain a new matrix which is 6 by 512. If we multiply a matrix that is 6 by 6 with another that is 6 by 512, we get a new matrix that is 6 by 512. One thing you should notice is that the dimension of this matrix is exactly the dimension of the initial matrix from which we started. What does it mean? We obtain a new matrix with six rows and 512 columns, in which the rows are our words: we have six words and each word has an embedding of dimension 512. Now this embedding represents not only the meaning of the word, which was given by the input embedding, and not only the position of the word, which was added by the positional encoding: these values are a special embedding that also captures the relationship of this particular word with all the other words. Likewise, the embedding of this word here captures not only its meaning and its position inside the sentence, but also the relationship of this word with all the other words. I want to remind you that this is not the multi-head attention; we are just looking at self-attention, so one head. We will see later how this becomes the multi-head attention.

Self-attention has some properties that are very desirable.
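Before turning to those properties, the single-head computation walked through above, softmax(Q K^T / sqrt(d_k)) V with Q, K and V all equal to the input, can be sketched in plain Python. The helper names are assumptions, and the input is random numbers standing in for embeddings plus positional encodings, so only the shapes and the row sums are meaningful here.

```python
import math
import random


def softmax(row):
    # Subtract the maximum first for numerical stability
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def self_attention(x):
    d_k = len(x[0])                            # 512 for a single head
    q = k = v = x                              # no learned parameters yet
    k_t = [list(col) for col in zip(*k)]       # transpose of K: d_k x seq
    scores = matmul(q, k_t)                    # seq x seq dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled] # each row sums to one
    return weights, matmul(weights, v)         # output is seq x d_model


random.seed(0)
seq_len, d_model = 6, 512
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]

weights, out = self_attention(x)
print(len(weights), len(weights[0]))  # 6 6
print(len(out), len(out[0]))          # 6 512
```

The output has the same shape as the input, which is what lets each row be read as a new embedding of the same word.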

First of all, it's permutation invariant (more precisely, permutation equivariant). What does that mean? Suppose we have a matrix, before it was six words, but in this case let's say just four words, A, B, C and D, and suppose that applying the formula from before produces this particular matrix, in which there is a new special embedding for the word A, a new special embedding for the word B, and new special embeddings for the words C and D. Let's call them A', B', C' and D'. If we swap the position of these two rows, the values will not change; the position of the output rows will change accordingly. So the values of B' will not change, it will just change position, and C' will also change position, but the values in each vector will not change, and this is a desirable property.

Self-attention, as of now, requires no parameters. I mean, I didn't introduce any parameter that is learned by the model. I just took the initial sentence of, in this case, six words, we multiplied it by itself, we divided it by a fixed quantity, the square root of 512, and then we applied the softmax, which does not introduce any parameters. So for now self-attention didn't require any parameters except for the embeddings of the words. This will change later when we introduce the multi-head attention.

Also, because each value in the softmax matrix comes from a dot product of a word embedding with itself or with the other words, we expect the values along the diagonal to be the highest, because they come from the dot product of each word with itself.

There is another property of this matrix that applies before we apply the softmax. Suppose we don't want the words "your" and "cat" to interact with each other, or we don't want the words "is" and "lovely" to interact with each other. Before we apply the softmax, we can replace this value with minus infinity, and also this value with minus infinity, and when we apply the softmax, it will replace minus infinity with 0.
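A short sketch of this masking trick on a single row of scores is shown below. The four score values and the choice of which entry to mask are made up for the example.

```python
import math


def softmax(row):
    # Subtract the maximum first for numerical stability
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores of one word against four words, before the softmax
scores = [2.0, 1.0, 0.5, 1.5]

# Block the interaction with the second word by writing minus infinity
masked = list(scores)
masked[1] = float("-inf")

weights = softmax(masked)
print(weights)       # the masked position gets weight 0.0
print(sum(weights))  # the remaining weights still sum to one
```

The masked entry contributes nothing when the row is later multiplied by V, while the other weights are renormalized among themselves.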

This is because, as you remember, the softmax uses e to the power of x, and if x goes to minus infinity, e to the power of minus infinity becomes very, very close to zero, so basically zero. This is a desirable property that we will use in the decoder of the Transformer.

Now let's have a look at what multi-head attention is. What we just saw was self-attention, and we want to convert it into multi-head attention. You may have seen these expressions from the paper, but don't worry, I will explain them one by one. So let's go.

Imagine we have our encoder, so we are on the encoder side of the Transformer, and we have our input sentence, which is, let's say, 6 by 512: six words by 512, the size of the embedding of each word. In this case I call it sequence by d_model: sequence is the sequence length, as you can see in the legend at the bottom left of the slide, and d_model is the size of the embedding vector, which is 512. Just like the picture shows, we take this input and make four copies of it. One is sent along this connection we can see here, and three are sent to the multi-head attention with three respective names. So it's the same input that becomes three matrices that are equal to the input: one is called query, one is called key and one is called value. So basically we are taking this input and making three copies of it, which we call Q, K and V. They have, of course, the same dimension.

What does the multi-head attention do? First of all, it multiplies these three matrices by three parameter matrices called W_Q, W_K and W_V. These matrices have dimension d_model by d_model, so if we multiply a matrix that is sequence by d_model with another one that is d_model by d_model, we get a new matrix as output that is sequence by d_model, so basically the same dimension as the starting matrix. We will call the results Q', K' and V'.

Our next step is to split these matrices into smaller matrices. Let's see how. We could split this matrix Q' along the sequence dimension or along the d_model dimension. In multi-head attention we always split along the d_model dimension, so every head will see the full sentence but a smaller part of the embedding of each word. So if we have an embedding of, let's say, 512, it will become smaller embeddings of size 512 divided by four, and we call this quantity d_k. So d_k is d_model divided by h, where h is the number of heads. In our case we have h equal to 4.

We can calculate the attention between these smaller matrices, so Q1, K1 and V1, using the expression taken from the paper, and this results in small matrices called head 1, head 2, head 3 and head 4. The dimension of head 1 up to head 4 is sequence by d_v. What is d_v? It's basically equal to d_k; it's just called d_v because the last multiplication is done by V, and in the paper they call it d_v, so I am sticking to the same names.

Our next step is to combine these small heads by concatenating them along the d_v dimension, just like the paper says. So we concatenate all the heads together and we get a new matrix that is sequence by h multiplied by d_v. As we know, d_v is equal to d_k, so h multiplied by d_v is equal to d_model, and we get back the initial shape: sequence by d_model. The next step is to multiply the result of this concatenation by W_O. W_O is a matrix that is h multiplied by d_v, so d_model, with the other dimension being d_model, and the result is a new matrix, the output of the multi-head attention, which is sequence by d_model.

So the multi-head attention, instead of calculating the attention between the full matrices Q', K' and V', splits them along the d_model dimension into smaller matrices and calculates the attention between these smaller matrices. Each head is looking at the full sentence but at a different aspect of the embedding of each word. Why do we want this? Because we want each head to look at different aspects of the same word. For example, in the Chinese language, but also in other languages, one word may be a noun in some cases, a verb in other cases, or an adverb in others, depending on the context. So what we want is that one head maybe learns to relate that word as a noun, another head maybe learns to relate that word as a verb, and another head learns to relate that word as an adjective or adverb.
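The whole multi-head pipeline described above (project with W_Q, W_K and W_V, split along the d_model dimension into h heads, attend per head, concatenate, multiply by W_O) can be sketched as follows. This is an illustration under assumptions: d_model is shrunk to 16 so the pure-Python loops stay fast (the talk uses 512), h is 4 as in the talk, the weight matrices are random rather than learned, and all helper names are hypothetical.

```python
import math
import random


def matmul(a, b):
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V for one head
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def split_heads(m, h):
    # Split along the d_model dimension: every head keeps all rows (words)
    # but only d_k = d_model / h columns of each embedding
    d_k = len(m[0]) // h
    return [[row[i * d_k:(i + 1) * d_k] for row in m] for i in range(h)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, w_q, w_k, w_v, w_o, h):
    q_p, k_p, v_p = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)  # Q', K', V'
    heads = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(q_p, h), split_heads(k_p, h), split_heads(v_p, h))
    ]
    # Concatenate the heads along the d_v dimension: back to seq x d_model
    concat = [[value for head in heads for value in head[i]] for i in range(len(x))]
    return matmul(concat, w_o)


rng = random.Random(0)
seq_len, d_model, h = 6, 16, 4  # d_model reduced from 512 for speed (assumption)
x = random_matrix(seq_len, d_model, rng)
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model, rng) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, h)
print(len(out), len(out[0]))  # 6 16
```

Each head only ever sees a 4-column slice of Q', K' and V' here, which is the sense in which different heads can pick up different aspects of the same word.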

So this is why we want multi-head attention. Now, you may also have seen online that the attention can be visualized, and I will show you how. When we calculate the attention between the Q and the K matrices, so when we do this operation, the softmax of Q multiplied by the transpose of K divided by the square root of d_k, we get a new matrix, just like we saw before, which is sequence by sequence, and it represents a score for the intensity of the relationship between two words. We can visualize this, and it produces a visualization similar to this one, which I took from the paper, in which we see how all the heads work. For example, if we concentrate on this word here, "making", we can see that "making" is related to the word "difficult" by different heads: the blue head, the red head and the green head. But the violet head is not relating these two words together, so "making" and "difficult" are not related by the violet or the pink head. The violet head or the pink head relate the word "making" to other words, for example to this word, "2009". Why is this the case? Maybe this pink head could see a part of the embedding that these other heads could not see, which made this interaction between these two words possible.

You may also be wondering why these three matrices are called query, keys and values. The terms come from database terminology or from Python-like dictionaries, but I would also like to give my own interpretation with a very simple example that I think is quite easy to understand. Imagine we have a Python-like dictionary or a database in which we have keys and values. The keys are the categories of movies and the values are the movies belonging to each category; in my case I just put one value per category. So we have the romantic category, which includes Titanic, we have action movies, which include The Dark Knight, etc. Imagine we also have a user who makes a query, and the query is "love". Because we are in the Transformer world, all these words are actually represented by embeddings of size 512.
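The dictionary analogy builds toward a "soft" lookup in which the query is scored against every key instead of matching exactly one. The sketch below illustrates that idea with made-up 3-number embeddings (the talk uses size 512); the vectors are hand-picked assumptions so that "love" lies closest to "romantic", and no real embeddings are implied.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical key embeddings for the movie categories (3 numbers each)
key_embeddings = {
    "romantic": [0.9, 0.1, 0.0],
    "comedy": [0.5, 0.5, 0.0],
    "action": [0.1, 0.9, 0.1],
    "horror": [-0.8, 0.0, 0.6],
}

# Hypothetical embedding of the query word "love"
query = [1.0, 0.0, 0.0]
d_k = len(query)

names = list(key_embeddings)
# Dot product of the query with every key, scaled by sqrt(d_k)
scores = [
    sum(q * k for q, k in zip(query, key_embeddings[name])) / math.sqrt(d_k)
    for name in names
]
weights = dict(zip(names, softmax(scores)))

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

With these particular numbers "romantic" receives the largest weight and "horror" the smallest; in a real model the weights would then be used to mix the corresponding values.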

So what our Transformer will do is convert this word "love" into an embedding of size 512 (all these keys and values are already embeddings of size 512) and calculate the dot product between the query and all the keys, just like the formula. As you remember, the formula is the softmax of the query multiplied by the transpose of the keys, divided by the square root of d_k. So we are doing the dot product of all the queries with all the keys, in this case the word "love" with all the keys one by one, and this results in a score that will amplify some values and not others. In this case our embeddings may be such that the words "love" and "romantic" are related to each other, and the words "love" and "comedy" are also related to each other, but not as intensely as "love" and "romantic", so it's a less strong relationship. But maybe the words "horror" and "love" are not related at all, so maybe their softmax score is very close to zero.

Our next layer in the encoder is the Add & Norm, and to introduce the Add & Norm we need layer normalization, so let's see what layer normalization is. Let's make a practical example. Imagine we have a batch of n items, in this case n is equal to three: item one, item two, item three. Each of these items has some features. It could be an embedding, so for example a vector of features of size 512, but it could also be a very big matrix of thousands of features, it doesn't matter. What we do is calculate the mean and the variance of each of these items independently from each other, and we replace each value with another value that is given by this expression. So basically we are normalizing so that the new values of each item have mean 0 and variance 1.
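The expression on the slide is not reproduced in the transcript; the sketch below assumes the standard layer normalization formula, (x - mean) / sqrt(variance + eps), followed by the learnable scale gamma and shift beta that the talk introduces right after this point. The value of eps, the scalar gamma and beta (they are usually one value per feature), and the three small items are assumptions for illustration.

```python
import math


def layer_norm(item, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics are computed over the features of ONE item only
    mean = sum(item) / len(item)
    var = sum((v - mean) ** 2 for v in item) / len(item)
    # Normalize, then apply the scale (gamma) and the shift (beta)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in item]


# A batch of n = 3 items with 4 features each (made-up numbers)
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [-1.0, 0.0, 1.0, 2.0],
]

# Each item is normalized independently of the others
normalized = [layer_norm(item) for item in batch]

for item in normalized:
    mean = sum(item) / len(item)
    var = sum((v - mean) ** 2 for v in item) / len(item)
    print(f"mean={mean:.6f} var={var:.6f}")  # mean close to 0, variance close to 1
```

Even though the second item is ten times larger than the first, both come out with the same normalized values, because each item uses its own mean and variance.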

actually we also multiply this new value with a parameter called gamma and then we add another parameter called beta and this gamma and beta are learnable

parameters and the model should learn to multiply and add these parameters so as to amplify the value that it wants to be

Amplified and not amplify that value that it doesn't want to be Amplified uh so we don't just normalize we actually introduce some parameters

I found a really nice visualization from paperswithcode.com in which we see the difference between batch norm and layer norm. As we can see, in layer normalization, if n is the batch dimension, we calculate the statistics over all the values belonging to one item in the batch, while in batch norm we calculate them over the same feature for all the items in the batch. So in batch norm we are mixing values from different items of the batch, while in layer normalization we treat each item in the batch independently, and each item has its own mean and its own variance. Let's look at the decoder now.
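The difference between the two normalizations comes down to which values share a mean and a variance. The sketch below only computes the statistics, not the full normalization, on an assumed toy batch of 3 items with 4 features each. The helper name `mean_var` is hypothetical.

```python
def mean_var(values):
    """Return the (mean, variance) of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Rows are items in the batch, columns are features (assumed values).
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0, 14.0],
]

# Layer norm: one (mean, variance) per item, computed over that item's features.
layer_stats = [mean_var(item) for item in batch]

# Batch norm: one (mean, variance) per feature, computed across all items.
batch_stats = [mean_var(list(column)) for column in zip(*batch)]

print("layer norm statistics (per item):   ", layer_stats)
print("batch norm statistics (per feature):", batch_stats)
```

Layer norm yields three pairs of statistics, one per item, while batch norm yields four, one per feature, and each of those mixes values from all three items.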

In the encoder we saw the input embeddings. In this case they are called output embeddings, but the underlying working is the same. Here also we have the positional encoding, and it is the same as in the encoder. The next layer is the masked multi-head attention, and we will see it now. We also have a multi-head attention here. We should see that the encoder produces the output, which is sent to the decoder in the form of keys and values, while the query, so this connection here, comes from the decoder. So this multi-head attention is not a self-attention anymore, it's a cross-attention, because we are taking two sentences. One is sent from the encoder side, so let's write "encoder": we provide the output of the encoder and use it as keys and values, while the output of the masked multi-head attention is used as the query in this multi-head attention.

The masked multi-head attention is the self-attention of the input sentence of the decoder. So we take the input sentence of the decoder, transform it into embeddings, add the positional encoding, and give it to this multi-head attention, in which the query, keys, and values are the same input sequence. We do the Add & Norm, then we send this as the queries of the multi-head attention, while the keys and the values come from the encoder. Then we do the Add & Norm. I will not be showing the feed forward, which is just a fully connected layer. We then send the output of the feed forward to the Add & Norm and finally to the linear layer, which we will see later.

So let's have a look at the masked multi-head attention and how it differs from a normal multi-head attention. Our goal is to make the model causal: the output at a certain position can only depend on the word at that position and the words at the previous positions, so the model must not be able to see future words. How can we achieve that?
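One common answer, which the next passage walks through on the slide, is to put minus infinity in every score above the main diagonal before the softmax, so that those positions get a weight of exactly zero. The following is a small sketch of that idea under assumed toy scores. The function names are hypothetical, and a real implementation would work on tensors rather than Python lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Replace scores above the main diagonal with -inf, then softmax each row."""
    masked = [
        [s if j <= i else float("-inf") for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
    return [softmax(row) for row in masked]

# Assumed toy scores for a 4-token sequence (row = query word, column = key word).
scores = [
    [0.5, 1.2, -0.3, 0.8],
    [0.1, 0.9, 0.4, -1.0],
    [1.5, -0.2, 0.3, 0.6],
    [0.0, 0.7, -0.5, 1.1],
]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but every entry to the right of the diagonal is 0, so a word only attends to itself and to the words before it.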

As you saw, the output of the softmax in the attention calculation formula is this matrix, sequence by sequence. If we want to hide the interaction of some words with other words, we delete this value and replace it with minus infinity before we apply the softmax, so that the softmax will replace this value with zero. We do this for all the interactions that we don't want. So we don't want "your" to watch future words, so we don't want "your" to watch "cat is a lovely cat", and we don't want the word "cat" to watch future words, but only the words that come before it or the word itself. So we don't want this, this, this, this, and the same for the other words. So we can see that we are replacing all the values that are above this diagonal, the principal diagonal of the matrix, with minus infinity, so that the softmax will replace them with zero.

Let's see at which stage of the multi-head attention this mechanism is introduced. When we calculate the attention between these smaller matrices, so Q1, K1, and V1, before we apply the softmax we replace these values, so this one, this one, this one, and so on, with minus infinity. Then we apply the softmax, and the softmax takes care of transforming these values into zeros. So basically we don't want these words to interact with each other, and if we don't want this interaction, the model will learn not to make them interact, because it will not get any information from this interaction. It's as if these words cannot interact.

Now let's look at how inference and training work for a Transformer model. As I said previously, we will be dealing with the translation task, because it's easy to visualize and it's easy to understand all the steps. Let's start with the training of the model. We will go from an English sentence, "I love you very much", into an Italian sentence. It's a very simple sentence, it's easy to describe. Let's go.

We start with a description of the Transformer model, and we start with our English sentence, which is sent to the encoder. To our English sentence we prepend and append two special tokens: one is called start of sentence and one is called end of sentence. These two tokens are taken from the vocabulary, so they are special tokens in our vocabulary that tell the model where a sentence starts and where it ends. We will see later why we need them. For now just think that we take our sentence, prepend a special token, and append a special token. Then, as you can see from the picture, we take our inputs, transform them into input embeddings, add the positional encoding, and send the result to the encoder. So this is our encoder input, sequence by d_model. We send it to the encoder, and it produces an output which is also sequence by d_model, called the encoder output. As we saw previously, the output of the encoder is another matrix that has the same dimensions as the input matrix. We can see it as a sequence of embeddings in which each embedding is special, because it captures not only the meaning of the word, which was given by the input embedding, and not only the position, which was given by the positional encoding, but also the interaction of every word with every other word in the same sentence. This is the encoder, so we are talking about self-attention: the interaction of each word in the sentence with all the other words in the same sentence.

We want to convert this sentence into Italian, so we prepare the input of the decoder, which begins with start of sentence. As you can see from the picture of the Transformer, the outputs here are marked "shifted right". What does it mean to shift right? Basically it means we prepend a special token called SOS, start of sentence.

You should also notice that when we code the Transformer, so if you watch my other video on how to code a Transformer, you will see that we make these two sequences of fixed length, so that whether we have a sentence like "ti amo molto" or a very long sequence, when we feed them to the Transformer they all become the same length. How do we do this? We add padding tokens to reach the desired length. So if our model supports, let's say, a sequence length of 1000, and in this case we have four tokens, we add 996 padding tokens to make this sentence long enough to reach the sequence length. Of course I'm not doing it here, because it would not be easy to visualize otherwise.

OK, we prepared this input for the decoder. We transform it into embeddings, add the positional encoding, then we send it first to the masked multi-head attention, along with the causal mask. Then we take the output of the encoder and send it to the decoder as keys and values, while the queries come from the masked multi-head attention. So the queries come from this layer, and the keys and the values are the output of the encoder.
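To make the roles concrete, here is a rough single-head sketch of cross-attention in which the queries come from the decoder side and the keys and values are both the encoder output. It leaves out the learned projection matrices and the multiple heads, and the tiny vectors are assumed values chosen only for illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

# Assumed toy data with d_model = 4: 5 encoder positions, 3 decoder positions.
encoder_output = [
    [0.1, 0.3, -0.2, 0.5],
    [0.7, -0.1, 0.4, 0.0],
    [-0.3, 0.8, 0.2, 0.1],
    [0.5, 0.5, -0.5, 0.2],
    [0.0, -0.4, 0.6, 0.9],
]
masked_attention_output = [
    [0.2, 0.1, 0.0, -0.3],
    [0.6, -0.2, 0.3, 0.4],
    [-0.1, 0.9, 0.5, 0.0],
]

# Queries from the decoder; keys and values from the encoder.
cross = attention(masked_attention_output, encoder_output, encoder_output)
print(len(cross), "x", len(cross[0]))
```

The result has one row per decoder position and d_model columns, which matches the "sequence by d_model" shape on the decoder side.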

The output of this whole big block here will be a matrix that is sequence by d_model, just like for the encoder. However, this is still an embedding, because each row is a vector of size d_model, so 512. How can we relate this embedding back to our dictionary? How can we understand which word in our vocabulary it is? That's why we need a linear layer that maps sequence by d_model into sequence by vocabulary size. So for every embedding it sees, it tells us the position of that word in our vocabulary, so that we can understand what the actual token output by the model is. After that we apply the softmax, and then we have our label, which is what we expect the model to output given this English sentence. We expect the model to output "ti amo molto" followed by end of sentence, and this is called the label or the target. What do we do when we have the output of the model and the corresponding label? We calculate the loss, in this case the cross-entropy loss, and then we backpropagate the loss to all the weights.
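The chain from decoder output to loss can be sketched as follows. Everything numeric here is an assumption for illustration: a five-word vocabulary, d_model = 8, random stand-ins for the decoder output and for the linear layer's weights, and a loss averaged over positions. Backpropagation is not shown.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weight, bias):
    """Map one d_model vector to vocab_size logits (weight: vocab_size x d_model)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(probs, label_ids):
    """Average negative log-probability assigned to the correct token."""
    losses = [-math.log(p[t]) for p, t in zip(probs, label_ids)]
    return sum(losses) / len(losses)

vocab = ["<SOS>", "<EOS>", "ti", "amo", "molto"]  # assumed toy vocabulary
d_model = 8                                        # 512 in the video
rng = random.Random(0)

# Random stand-ins for the decoder output (4 positions) and the linear layer.
decoder_output = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(4)]
weight = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]
bias = [0.0] * len(vocab)

logits = [linear(x, weight, bias) for x in decoder_output]  # sequence x vocab_size
probs = [softmax(row) for row in logits]

# The label: the tokens we expect the model to output at each position.
label_ids = [vocab.index(t) for t in ["ti", "amo", "molto", "<EOS>"]]
loss = cross_entropy(probs, label_ids)
print("loss:", round(loss, 4))
```

Because the weights are random, the loss is only a placeholder value. Training would adjust the weights so that the probability of each label token goes up and the loss goes down.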

Now let's understand why we have these special tokens called SOS and EOS. You can see that here the sequence length is 4. Actually it is 1000 because we have the padding, but let's say we don't have any padding, so it's four tokens: start of sentence, "ti", "amo", "molto". What we want is "ti", "amo", "molto", end of sentence. So when our model sees the start of sentence token, it will output the first token, "ti". When it sees "ti", it will output "amo". When it sees "amo", it will output "molto". And when it sees "molto", it will output end of sentence, which indicates that the translation is done. We will see this mechanism in the inference.

This all happens in one time step, just like I promised at the beginning of the video. I said that with recurrent neural networks we need n time steps to map an input sequence of n tokens into an output sequence, and that this problem would be solved by the Transformer. Yes, it has been solved, because you can see that here we didn't do any for loop, we did it all in one pass. We gave an input sequence to the encoder and an input sequence to the decoder, we produced some outputs, we calculated the cross-entropy loss with the label, and that's it. It all happens in one time step, and this is the power of the Transformer, because it made it very easy and very fast to train on very long sequences, with the very nice performance that you can see in ChatGPT, in GPT, in BERT, etc.
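The three sequences used in one training pass can be built as in the sketch below. The token strings `<SOS>`, `<EOS>`, and `<PAD>`, the helper names, the whitespace tokenization, and the sequence length of 10 are assumptions for illustration. The video uses a sequence length of 1000 as its example.

```python
SOS, EOS, PAD = "<SOS>", "<EOS>", "<PAD>"

def pad(tokens, seq_len):
    """Append padding tokens until the sequence reaches seq_len."""
    if len(tokens) > seq_len:
        raise ValueError("sentence is longer than the supported sequence length")
    return tokens + [PAD] * (seq_len - len(tokens))

def make_training_example(src_tokens, tgt_tokens, seq_len):
    encoder_input = pad([SOS] + src_tokens + [EOS], seq_len)
    decoder_input = pad([SOS] + tgt_tokens, seq_len)  # target shifted right
    label = pad(tgt_tokens + [EOS], seq_len)          # what the model should output
    return encoder_input, decoder_input, label

encoder_input, decoder_input, label = make_training_example(
    "I love you very much".split(), "ti amo molto".split(), seq_len=10
)

# Position by position: the token the decoder sees and the token it should predict.
for seen, expected in zip(decoder_input, label):
    if seen != PAD:
        print(f"{seen:>6} -> {expected}")
```

All four (seen, expected) pairs are presented to the model in the same forward pass, which is why training needs no loop over time steps.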

Let's have a look at how inference works. Again we have our English sentence, "I love you very much", and we want to map it into an Italian sentence.

We have our usual Transformer. We prepare the input for the encoder, which is start of sentence, "I love you very much", end of sentence. We convert it into input embeddings, add the positional encoding, and send it to the encoder. The encoder produces an output which is sequence by d_model, and we saw before that it's a sequence of special embeddings that capture the meaning, the position, and also the interaction of all the words with the other words.

For the decoder, we give it just the start of sentence token, and of course we add enough padding tokens to reach our sequence length. We convert this single token into embeddings, add the positional encoding, and send it to the decoder as decoder input. The decoder takes its input as the query and the keys and values coming from the encoder, and it produces an output which is sequence by d_model. Again we want the linear layer to project it back to our vocabulary, and the output of this projection is called the logits. We then apply the softmax, and given the logits we select the position in the vocabulary that has the maximum score. This is how we know which word to select from the vocabulary.

This hopefully should produce the first output token, which is "ti", if the model has been trained correctly. This, however, happens at time step one. When we train the Transformer model it happens in one pass: we have one input sequence and one output sequence, we give them to the model, we do it in one time step, and the model learns from it. When we inference, however, we need to do it token by token, and we will also see why this is the case.

At time step 2 we don't need to recompute the encoder output, because our English sentence didn't change, so the encoder would produce the same output for it. What we do is take the output of the previous step, so "ti", append it to the input of the decoder, and feed it to the decoder again together with the output of the encoder from the previous step. This produces an output sequence from the decoder side, which we again project back into our vocabulary, and we get the next token, which is "amo". As I said before, we are not recalculating the output of the encoder for every time step, because our English sentence didn't change at all. What is changing is the input of the decoder, because at every time step we append the output of the previous step to the input of the decoder. We do the same for time step 3 and for time step 4, and hopefully we will stop when we see the end of sentence token, because that's how the model tells us to stop inferencing.
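The token-by-token loop described above can be sketched like this. The functions `toy_encode` and `toy_decode_step` are hypothetical stand-ins for a trained encoder and for the decoder plus linear layer plus softmax. They return hard-coded values, so only the control flow is meaningful: encode once, then repeatedly pick the most probable next token and append it to the decoder input until end of sentence appears.

```python
SOS, EOS = "<SOS>", "<EOS>"

def toy_encode(source_tokens):
    """Hypothetical stand-in for the encoder; called once per sentence."""
    return tuple(source_tokens)

def toy_decode_step(encoder_output, decoder_input):
    """Hypothetical stand-in for decoder + linear + softmax.

    Returns a probability for every vocabulary word as the next token.
    """
    next_word = {SOS: "ti", "ti": "amo", "amo": "molto", "molto": EOS}
    best = next_word[decoder_input[-1]]
    vocab = [EOS, "ti", "amo", "molto"]
    return {token: (0.7 if token == best else 0.1) for token in vocab}

def greedy_decode(source_tokens, max_len=10):
    encoder_output = toy_encode(source_tokens)  # computed once, then reused
    decoder_input = [SOS]
    steps = 0
    while len(decoder_input) < max_len:
        steps += 1
        probs = toy_decode_step(encoder_output, decoder_input)
        next_token = max(probs, key=probs.get)  # token with the highest probability
        if next_token == EOS:
            break                               # the model signals it is done
        decoder_input.append(next_token)        # grow the decoder input
    return decoder_input[1:], steps

translation, steps = greedy_decode("I love you very much".split())
print(translation, "in", steps, "time steps")
```

The `max_len` cap is an added safeguard, not something from the video: it stops the loop if the model never produces the end of sentence token.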

And this is how the inference works, and why we needed four time steps. When we inference a model like, in this case, the translation model, there are many strategies for inferencing. What we used is called the greedy strategy: at every step we take the word with the maximum softmax value. This strategy usually works not badly, but there are better strategies, and one of them is called beam search. In beam search, instead of greedily taking the maximum softmax value at each step, which is why the other strategy is called greedy, we take the top B values, and then for each of these choices we evaluate the possible next tokens. At every step we keep only the B most probable sequences and delete the others. This is called beam search, and it generally performs better.

So thank you guys for watching. I know it was a long video, but it was really worth it to go through each aspect of the Transformer. I hope you enjoyed this journey with me, so please subscribe to the channel, and don't forget to watch my other video on how to code a Transformer model from scratch, in which I not only describe again the structure of the Transformer model while coding it, but I also show you how to train it on a dataset of your choice and how to inference it. I also provided the code on GitHub and the Colab notebook to train the model directly on Colab.

Please subscribe to the channel and let me know what you didn't understand so that I can give more explanation, and please tell me what the problems are in this kind of video, or in this particular video, that I can improve for the next videos. Thank you very much and have a great rest of the day.
