How Attention Mechanism Works in Transformer Architecture
By Under The Hood
Summary
## Key takeaways

- **Embeddings alone miss context**: Embeddings capture semantic meaning but fail to distinguish words with multiple meanings, assigning a single vector regardless of context. For example, 'Apple' could refer to the company or the fruit, but a single embedding value is used. [01:15]
- **Self-attention contextualizes word embeddings**: Self-attention transforms static semantic embeddings into contextual representations by allowing each token to interact with others and assign weights. This process captures relationships between words within their specific context. [01:40], [10:45]
- **Query, Key, Value drive attention scores**: Each token generates Query, Key, and Value vectors. A token's Query vector interacts with other tokens' Key vectors to compute attention scores, indicating relevance. These scores, after scaling and softmax, determine how much of each token's Value vector contributes to the new contextual representation. [03:04], [04:43]
- **Causal attention prevents future-peeking**: In generative models, causal self-attention uses a mask to prevent tokens from attending to future tokens in the sequence. This ensures that predictions are based only on preceding context, crucial for tasks like next-word prediction. [10:53], [12:37]
- **Multi-head attention captures diverse relationships**: Instead of a single attention mechanism, multi-head attention uses multiple parallel attention heads. Each head, with its own learned projections for Query, Key, and Value, can focus on different aspects of the input relationships, such as syntax or long-range dependencies. [14:13], [14:53]
- **GPT-2 architecture layers stabilize training**: The GPT-2 architecture incorporates layer normalization and residual connections throughout its decoder layers. These components stabilize training, improve gradient flow, and help retain important features across layers, enabling deeper networks. [18:29], [19:14]
Topics Covered
- Embeddings fall short for context; attention is key.
- Self-attention: Query, Key, and Value unlock contextual meaning.
- Softmax needs normalization to balance attention weights.
- Multi-head attention captures diverse relationships in parallel.
- GPT-2's decoder stack uses attention, normalization, and feed-forward.
Full Transcript
the Transformer architecture has been a
huge breakthrough in AI due to this
large language models have become
incredibly powerful they can generate
human-like text translate languages
summarize articles answer complex
questions and even write code their
ability to understand and generate text
with high accuracy has revolutionized
the way we interact with AI first large
language models tokenize the text
breaking it down into smaller units
called tokens then these tokens are
converted into embeddings if you're
unfamiliar with embeddings I have
explained embeddings in detail and I
suggest you watch the embedding video
before watching this one embeddings
transform text into high-dimensional
dense vectors that capture the meaning
of each token for example the word cat
has an embedding Vector close to similar
words like dog and rat these embeddings
come from an embedding Matrix where the
number of rows equals the total
vocabulary size and the number of
columns represents the embedding
Dimensions as we discussed in the
previous video each token has its own
embedding lookup position so words like
apple orange and banana are close in
embedding space because they are all
fruits the embedding has learned this
relationship however there's a problem
embeddings alone do not distinguish
between words with multiple meanings for
example in our text Apple refers to the
tech company not the fruit but the
embedding table assigns a single
embedding value for Apple regardless of
context ideally we want the embedding of
Apple to adapt based on its meaning and
context in this case it should be closer
to other tech companies rather than
fruits this is where self attention
comes in the self-attention mechanism
transforms semantic representations into
contextual representations making sure
words are understood correctly in their
specific context self attention is
extremely powerful it was introduced in
the paper attention is all you need
where the Transformer architecture uses
attention mechanisms in both its encoder
and decoder in the architecture it first
tokenizes the text then gets embeddings
that capture semantic meaning and
finally applies self attention to
incorporate contextual
meaning so now you know what the
attention mechanism does let's
understand how it works in the paper
self attention is represented by this
equation let's write it down on our
screen to understand this we first need
to understand what query key and value
are as shown in the equation
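For reference, the equation referred to here is the scaled dot-product attention from the Attention Is All You Need paper, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$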
to get context information a single
token interacts with other tokens and
assigns weights to each of them then
based on these weights it takes a
proportionate amount from each token to
form a new representation that contains
contextual information so these
different values for a single token
query key and value each have their own
role using the same method all tokens
interact with each other assigning
importance to each token this assigned
importance is called the attention
weight which determines how much a token
attends to another token now let's
understand query key and value in detail
we start with token embeddings from
these embeddings we derive three new
vectors query key and value these are
high-dimensional vectors like embeddings
but have different values and are unique
to each other if we have three tokens in
a text each token has its own query key
and value Vector when a token needs to
interact with other tokens to determine
the weight assignment it uses its query
Vector for example if token a wants
determine how much weight to assign to
token B it uses its query Vector as if
asking hey do you have anything useful
for me in response token B uses its key
Vector to answer here's what I can offer
we then take the dot product of these two
vectors query and key to obtain a scalar
value which gives a score this score
tells us how well token B matches token
a in order to help create a new
representation for token a each token
follows this process asking itself and
all other tokens for scores to determine
how much they match a simple and
efficient way to do all of this at once
is by performing matrix multiplication
between the query Matrix which contains
all tokens query vectors and the key
Matrix which contains all tokens key
vectors we first transpose the key
Matrix and then perform matrix
multiplication the result is a set of
scores that each token receives from all
other tokens the scores obtained after
multiplying the query and key matrices
are called attention scores for two
tokens we get four different values if
we have five tokens in a text we take a
query Matrix containing all the query
vectors and a key Matrix containing all
the key vectors after transposing the
key Matrix to match the shape we compute
25 attention scores since each token
interacts with all other tokens to
determine what information is useful for
forming its new
representation now you understand query
and key the query is used to ask for
information and the key is used to
provide that information
resulting in the attention score the
value Matrix then helps us compute the
final
representation for example to create a
new representation for the word Apple it
first interacts with other tokens then
using the computed attention scores it
determines how much importance to assign
to each token based on these values it
extracts the corresponding portion from
each token's value Vector to form a new
contextual embedding now that you
understand the purpose of query key and
value let's see how we can obtain these
vectors now that we have an embedding
for each token from the embedding layer
we transform this embedding using weight
matrices these matrices are specifically
for query key and value after applying
these Transformations we obtain the
query key and value vectors for each
token these weight matrices are
trainable parameters meaning they are
learned during training their purpose is
to transform the original embeddings
into meaningful query key and value
representations when I say these weights
are learned and transform the embedding
into its own query key and value I mean
that a neural network with linear
activations is used to train and
optimize them over time once we obtain
the query key and value vectors we
compute the attention scores by
multiplying the query and key matrices a
high score means the token assigns more
importance to that specific token after
obtaining these scores we use them along
with the value Matrix the value Matrix
is obtained by multiplying the embedding
with the weight Matrix of value this is
then used along with the attention
scores to get the final contextual
embeddings so a quick recap here to get
a new contextual representation for the
token Apple it first queries all other
tokens then gets the scores and finally
uses the appropriate proportion from
each token to get the final
representation the value each token
assigns to other tokens is called the
attention weight these weights follow a
specific pattern and we use this pattern
to compute the new embedding up until
now this is the overall process but
there is one more step left so far we
have used the query Vector to ask other
tokens and the key Vector to answer we
then multiplied them to get the
attention score however this is not the
final value that we multiply with the
value Matrix if we look at the attention
equation after multiplying query and key
we divide by a certain value and then
apply the softmax function only after
this step we multiply with the value
Matrix to get the final representation
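As a rough sketch of the pipeline described so far, here is how embeddings could be projected into query, key, and value with weight matrices, and how the raw attention scores come from multiplying the query matrix by the transposed key matrix. The sizes and random weights are purely illustrative, not taken from the video:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 8                # illustrative sizes

X = rng.normal(size=(n_tokens, d_model))        # one embedding vector per token

# weight matrices for query, key and value (random here, learned during training)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v             # query, key and value vectors per token

scores = Q @ K.T                                # raw attention scores
print(scores.shape)                             # (5, 5) -> 25 scores for 5 tokens
```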
first let's understand what softmax does
softmax is a mathematical function that
takes a list of numbers and converts
them into a list of values that can be
interpreted as
probabilities however softmax has a
problem when given a list of numbers
with a high range or large variance it
assigns a very high weight to the
largest number while the others get
values close to zero on the other hand
if the input numbers are very small or
similar all values become almost equal
to handle this issue we normalize
the attention scores by dividing them by
the square root of the dimension of the
key Vector before applying softmax this
ensures that the final attention weights
are well balanced the reason for using
the square root of the dimension of key
is to keep the attention scores at a
reasonable scale so that softmax does
not become too sensitive to large values
the exact mathematical reasoning behind
this is beyond our discussion here but
in simple terms it helps balance the
scaling effect as the dimension of key
increases
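A small numeric illustration of this saturation effect and how dividing by the square root of the key dimension softens it; the score values and d_k here are made up for the example:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                 # subtract max for numerical stability
    return e / e.sum()

d_k = 64                                    # illustrative key dimension
raw = np.array([8.0, 16.0, 24.0])           # made-up raw scores with a large spread

print(softmax(raw))                         # ~[0.0000, 0.0003, 0.9997], nearly one-hot
print(softmax(raw / np.sqrt(d_k)))          # ~[0.09, 0.24, 0.67], more balanced
```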
in the original paper they mentioned
that the dot product of query and key is
divided by the square root of the key
Dimension to counteract the effect of
increasing attention scores when the key
Dimension grows larger so if this is our
raw attention score we first divide it
by the square root of the key Dimension
to normalize it after applying softmax
across each row we get values that sum
to one and act as the attention weights
representing how much importance each
token assigns to the others for example
here token a assigns 25% importance to
itself 1% to token B 68% to token C and 5%
to token D this tells us how much token
a attends to each token similarly token
C here assigned 26% to itself 1% to
token D and so on in this way we can
interpret the attention weights now
let's summarize the steps first we
multiply query and key to get the raw
attention scores then we scale the
scores by dividing by the square root of
the key Dimension finally we apply
softmax to get the attention weights now
that we have attention weights we use
them along with the value Vector to
compute the final embedding each token
now uses its attention weight to
determine how much it should take from
other tokens for example token a assigns
60% weight to itself and 40% weight to
token B token B assigns 70% weight to
token a and 30% weight to itself we take
that proportion of the value vectors and
sum them up to get the new
representation for each token this
entire process is actually just matrix
multiplication taking the weighted sum
of each token's value is done using
matrix multiplication so after
multiplying the attention weights with
the value Matrix we get the weighted sum
which results in the new contextual
embeddings for all tokens now you know
what query key and value are how we
generate them and their roles in the
attention mechanism this entire process
from transforming the embeddings into
query key and value to scaling and
normalizing to Computing attention
weights and finally taking the weighted
sum is the self-attention mechanism
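Putting the recap together, a minimal self-attention sketch could look like this, with random untrained weights and illustrative sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n_tokens, d_model) embeddings -> (n_tokens, d_k) contextual embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled attention scores
    weights = softmax(scores, axis=-1)        # attention weights, each row sums to 1
    return weights @ V                        # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                               # 5 tokens, embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)             # (5, 8)
```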
self attention is what transforms a
static embedding into a contextual
embedding allowing us to get a
meaningful representation of text that
captures relationships between words in
the decoder architecture of a large
language model there is a variation of
self-attention called causal
self-attention in the data set
preparation video I mentioned that we
prepare training data by selecting a
context window here of size 10 and
setting the training label as the next
token in the sequence the reason we do
this is to train the model to predict
what comes next given the provided
context this is a classification task
but unlike regular classification tasks
in large language models we do not just
predict the next token after the full
context we train the model to predict
the next token at every position in the
sequence this means that for each token
the model learns to use the previous
words to predict what comes next as a
result the label size is the same as the
training token size but the labels are
shifted one step ahead of the input
tokens now let's take a simple training
example with a context size of five
first we tokenize the text and obtain
its embedding vectors after that we
compute the query and key vectors and
use them to calculate attention scores
however there's a problem here since
each token in a language model is
trained to predict the next token the
first token should only be using itself
to predict the second token but when
calculating the attention score we are
allowing the first token to attend to
all other tokens meaning it already has
access to the Token it is supposed to
predict similarly the second token is
attending to its Target token and so is
the third token meaning all tokens are
looking ahead into the future only the
last token should be allowed to attend
to all previous tokens because it
uses the full context to predict the
next token which is not in the context
window to prevent tokens from attending
future tokens we use a mask of the same
size as the sequence this mask contains
negative Infinity values in the
positions where tokens should not attend
we then add this masking Matrix to the
original attention score effectively
replacing the values that represent
attention to Future tokens with negative
Infinity the reason we use negative
Infinity is that when we apply softmax
these values will effectively be ignored
and become zero ensuring that the tokens
do not assign any weight to Future
tokens here the first token will give
100% weight to itself because all future
tokens are masked the second token will
attend only to the first and itself the
third token will attend only to the
first two and itself and so on applying
softmax to the masked attention scores
row by row gives us the attention
weights and this method is called causal
self attention which prevents tokens
from attending to Future tokens after
this the process Remains the Same we use
the attention weights to compute a
weighted sum of the value vectors giving
us the final contextual embeddings
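A sketch of how such a mask could be applied in code: the positions above the diagonal (future tokens) are set to negative infinity before the softmax, so they receive zero weight. The scores here are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                 # stand-in for the scaled Q @ K.T scores

mask = np.triu(np.full((n, n), -np.inf), k=1)    # -inf above the diagonal, 0 elsewhere

weights = softmax(scores + mask, axis=-1)        # future positions end up with weight 0
print(np.round(weights, 2))                      # lower-triangular rows, each summing to 1
```

for a sequence of n tokens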
the attention Matrix is n * n so
doubling the context size results in
four times more attention scores as the
context size increases the attention
weight Matrix grows quadratically in
size meaning the number of
attention scores that need to be
computed and stored increases rapidly
this leads to higher memory usage and
computational cost making it challenging
to scale Transformers to very long
sequences
efficiently now let's understand
multi-head attention so far we have seen
a single self attention mechanism that
takes embeddings and produces contextual
embeddings but in Transformer
architecture instead of using a single
attention head multiple self attention
mechanisms are used in parallel the idea
is that instead of performing a single
attention operation to get the
contextual embedding it is beneficial to
linearly project the query key and value
using different learned linear
projections which are lower dimensional
and then concatenate their outputs to
get the final embedding here is the
equation for multihead attention let's
write it down the reason for this is to
capture different aspects of
relationships between tokens a single
attention head May focus on word meaning
while another May focus on syntax and
another on long range dependencies this
allows the model to understand different
perspectives of the input
simultaneously let's assume the
embedding size of our model is
512 we use four attention heads and each
head has an inner dimension of
128 this means that for each attention
head the query key and value vectors
will have a dimension of 128 instead of
512 additionally we have another new
weight Matrix W out which is the output
projection the weight matrices for query
key and value are the same as in self
attention but now each attention head
has its own learnable weights and
projects the input into a lower
Dimension each individual attention head
works exactly like the self attention
mechanism we discussed earlier it
computes attention scores applies softmax
takes a weighted sum and produces an
output embedding since we have four
attention heads this process is repeated
four times each with its own learned
projection
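In the paper this is written as MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where each head is scaled dot-product attention over its own lower-dimensional projections. A rough NumPy sketch with the sizes used in this example (embedding size 512, four heads of dimension 128) and random untrained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

d_model, n_heads, d_head, n_tokens = 512, 4, 128, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))

# each head has its own lower-dimensional projections for query, key and value
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_out = rng.normal(size=(n_heads * d_head, d_model))      # output projection W out

heads = [attention(X @ W_q[h], X @ W_k[h], X @ W_v[h]) for h in range(n_heads)]
out = np.concatenate(heads, axis=-1) @ W_out              # concatenate, then project
print(out.shape)                                          # (10, 512), same as the input
```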
after that we concatenate the outputs
from all heads resulting in a 512
dimensional Vector which is the same as
our original embedding size finally we
apply a linear projection using W out to
get the final contextual
representation the reason for doing this
is that it allows the model to combine
information from multiple attention
heads capturing different types of
dependencies between words this is how
multihead attention works we take
multiple self attention mechanisms
each with its own learnable weights for
query key and value each head produces
an output embedding in a lower Dimension
these outputs are concatenated and then
linearly projected to obtain the final
contextual embedding this is multi-head
attention and it is represented by this
diagram in the Transformer
architecture now that you understand
multi-head attention let's see where it
is used in the Transformer architecture
the Transformer architecture applies
multi-head attention in both the encoder
and decoder in the encoder it uses
multi-head attention here the attention
is not masked and allows each token to attend to
all other tokens in the decoder it first
applies masked multi-head attention
right after token embedding then another
multi-head attention layer follows which
we call cross attention it is called
cross attention because it allows the
decoder to focus on different parts of
the encoder's output rather than just
attending to itself this helps in
sequence to sequence tasks like
translation where the decoder needs to
extract relevant information from the
encoder's representation here it gets the
key and value from the encoder and uses
the query from its own
decoder however since we are focusing on
large language models which only use the
decoder part of the Transformer cross
attention is not needed here the
architecture also includes layer
normalization and residual Connections
in a specific order this specific
architecture is the gpt2 model the gpt2 small
model has an embedding size of 768 a context
size of 1024 12 decoder layers and 12
attention heads let's see how the input
tokens flow in the architecture first
1024 tokens are sent into an embedding
layer which acts as a lookup table for
all tokens in the gpt2
tokenizer each token receives its
embedding vector and positional
information is added the result first
goes through a layer normalization step
in this normalization layer the model
normalizes the activations of each input
token across all features this ensures
that the values remain in a stable range
preventing the issue of exploding or
Vanishing gradients it does this by
Computing the mean and variance for each
input token and then adjusting the
values to have zero mean and unit
variance this helps the model train more
effectively and speeds up convergence
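A minimal sketch of this normalization step for a batch of token activations; the real GPT-2 layer also has a learned scale and shift, omitted here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
tokens = rng.normal(loc=3.0, scale=10.0, size=(4, 768))   # 4 tokens, embedding size 768

normed = layer_norm(tokens)                               # shape is unchanged: (4, 768)
print(normed.mean(axis=-1).round(4), normed.std(axis=-1).round(3))
```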
importantly the shape of the input
remains unchanged because normalization
only affects the values not the
structure after this the tokens pass
through multi-head attention a residual
connection is added here ensuring the
shape Remains the Same due to element
wise addition layer normalization and
residual connections are also found in
other deep learning architectures their
purpose is to stabilize training allow
better gradient flow and ensure that
important features are retained across
layers residual connections specifically
help in preventing the problem of deep
networks struggling to pass information
effectively by allowing the
gradient to flow through multiple layers
unimpeded here another layer
normalization follows again keeping the
shape unchanged before passing through a
feed forward neural network now comes
the feed forward neural network layer
this feed forward Network applies a
transformation that expands the input
representation into a higher dimensional
space allowing the model to capture more
complex relationships then it reduces
the dimensionality back to its original
shape so that it matches the input this
means the overall structure Remains the
Same while enriching the token
representation another residual
connection is applied here if You
observe the output of the decoder has
the same shape as its input this output
is then sent to the next decoder layer
and is repeated multiple times in gpt2
small we have 12 identical decoder
layers each learning new patterns and
capturing different levels of context
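As a rough outline rather than a faithful reimplementation, the per-layer flow described above (layer norm, masked attention, residual, layer norm, feed-forward, residual) could be sketched like this; a single-head attention stands in for the full multi-head block, and the feed-forward inner dimension of 4x the embedding size matches GPT-2 small:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 768, 4 * 768, 16   # GPT-2 small sizes; 16 tokens for the demo

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_attention(x, W_q, W_k, W_v):
    # single-head stand-in for the masked multi-head attention block
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)   # mask future tokens
    return softmax(scores) @ V

def feed_forward(x, W1, W2):
    # expand to the inner dimension, apply a nonlinearity, project back
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU here; GPT-2 actually uses GELU

def decoder_block(x, params):
    x = x + causal_attention(layer_norm(x), *params["attn"])   # pre-norm + residual
    x = x + feed_forward(layer_norm(x), *params["ffn"])        # pre-norm + residual
    return x                                                   # same shape as the input

params = {
    "attn": [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(3)],
    "ffn": [rng.normal(scale=0.02, size=(d_model, d_ff)),
            rng.normal(scale=0.02, size=(d_ff, d_model))],
}
x = rng.normal(size=(n_tokens, d_model))
print(decoder_block(x, params).shape)   # (16, 768) -- shape is unchanged
```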
after all decoder layers we apply a
final layer normalization followed by a
linear layer as I explained in my
previous embedding video this linear
layer functions similarly to the initial
embedding layer but instead it outputs
logits for the next token prediction in
gpt2 the weights of the embedding layer
and final linear layer are shared
meaning both the start and end of the
decoder influence the embedding process
the logits are then passed through a
softmax function to obtain a probability
distribution over all possible tokens
using labeled data we calculate the loss
at each token position to predict the
next token we then compute gradients and
apply back propagation through all the
layers updating the model's weights each
decoder layer is repeated multiple times
with each layer learning different
aspects of language the multi-head
attention mechanism captures different
levels of context in different decoder
layers but we don't fully understand why
large language models scale so well and
what these attention mechanisms learn
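To make the training objective concrete, here is a toy sketch of the per-position next-token loss: the labels are the input tokens shifted one step ahead, the logits go through softmax, and cross-entropy is averaged over every position. The token ids and logits are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

vocab_size = 10
tokens = np.array([3, 7, 2, 5, 1])        # toy token ids for one training window

inputs = tokens[:-1]                      # model input at each position
labels = tokens[1:]                       # labels are shifted one step ahead

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), vocab_size))   # stand-in for the model output

probs = softmax(logits)                                # probability over all tokens
loss = -np.log(probs[np.arange(len(labels)), labels]).mean()   # cross-entropy per position
print(inputs, labels, round(loss, 3))
```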
now that you've seen all the trainable
parameters in the model you understand
why these models have millions and
billions of parameters this is just gpt2
small in modern large language
models we see context sizes in the thousands
hundreds of attention heads dozens or
even hundreds of decoder layers
embedding dimensions in the tens of
thousands this is why we don't just call
them language models we call them large
language models that was a short summary
of the gpt2 architecture with the attention
mechanism hope you enjoyed the video
thank you for watching and see you in
the next one