How Attention Mechanism Works in Transformer Architecture
By Under The Hood
Summary
## Key takeaways

- **Embeddings alone miss context**: Embeddings capture semantic meaning but fail to distinguish words with multiple meanings, assigning a single vector regardless of context. For example, 'Apple' could refer to the company or the fruit, but a single embedding value is used. [01:15]
- **Self-attention contextualizes word embeddings**: Self-attention transforms static semantic embeddings into contextual representations by allowing each token to interact with others and assign weights. This process captures relationships between words within their specific context. [01:40], [10:45]
- **Query, Key, Value drive attention scores**: Each token generates Query, Key, and Value vectors. A token's Query vector interacts with other tokens' Key vectors to compute attention scores, indicating relevance. These scores, after scaling and softmax, determine how much of each token's Value vector contributes to the new contextual representation. [03:04], [04:43]
- **Causal attention prevents future-peeking**: In generative models, causal self-attention uses a mask to prevent tokens from attending to future tokens in the sequence. This ensures that predictions are based only on preceding context, crucial for tasks like next-word prediction. [10:53], [12:37]
- **Multi-head attention captures diverse relationships**: Instead of a single attention mechanism, multi-head attention uses multiple parallel attention heads. Each head, with its own learned projections for Query, Key, and Value, can focus on different aspects of the input relationships, such as syntax or long-range dependencies. [14:13], [14:53]
- **GPT-2 architecture layers stabilize training**: The GPT-2 architecture incorporates layer normalization and residual connections throughout its decoder layers. These components stabilize training, improve gradient flow, and help retain important features across layers, enabling deeper networks. [18:29], [19:14]
Topics Covered
- Embeddings fall short for context; attention is key.
- Self-attention: Query, Key, and Value unlock contextual meaning.
- Softmax needs normalization to balance attention weights.
- Multi-head attention captures diverse relationships in parallel.
- GPT-2's decoder stack uses attention, normalization, and feed-forward.
Full Transcript
the Transformer architecture has been a
huge breakthrough in AI due to this
large language models have become
incredibly powerful they can generate
human-like text translate languages
summarize articles answer complex
questions and even write code their
ability to understand and generate text
with high accuracy has revolutionized
the way we interact with AI first large
language models tokenize the text
breaking it down into smaller units
called tokens then these tokens are
converted into embeddings if you're
unfamiliar with embeddings I have
explained embeddings in detail and I
suggest you watch the embedding video
before watching this one embeddings
transform text into high-dimensional
dense vectors that capture the meaning
of each token for example the word cat
has an embedding Vector close to similar
words like dog and rat these embeddings
come from an embedding Matrix where the
number of rows equals the total
vocabulary size and the number of
columns represents the embedding
Dimensions as we discussed in the
previous video each token has its own
embedding lookup position so words like
apple orange and banana are close in
embedding space because they are all
fruits the embedding has learned this
relationship however there's a problem
embeddings alone do not distinguish
between words with multiple meanings for
example in our text Apple refers to the
tech company not the fruit but the
embedding table assigns a single
embedding value for Apple regardless of
context ideally we want the embedding of
Apple to adapt based on its meaning and
context in this case it should be closer
to other tech companies rather than
fruits this is where self attention
comes in the self-attention mechanism
transforms semantic representations into
contextual representations making sure
words are understood correctly in their
specific context self attention is
extremely powerful it was introduced in
the paper attention is all you need
where the Transformer architecture uses
attention mechanisms in both its encoder
and decoder in the architecture it first
tokenizes the text then gets embeddings
that capture semantic meaning and
finally applies self attention to
incorporate contextual
meaning so now you know what the
attention mechanism does let's
understand how it works in the paper
self attention is represented by this
equation let's write it down on our
screen to understand this we first need
to understand what query key and value
are as shown in the equation
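For reference, the equation referred to here is the scaled dot-product attention from the Attention Is All You Need paper, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$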
to get context information a single
token interacts with other tokens and
assigns weights to each of them then
based on these weights it takes a
proportionate amount from each token to
form a new representation that contains
contextual information so these
different values for a single token
query key and value each have their own
role using the same method all tokens
interact with each other assigning
importance to each token this assigned
importance is called the attention
weight which determines how much a token
attends to another token now let's
understand query key and value in detail
we start with token embeddings from
these embeddings we derive three new
vectors query key and value these are
high-dimensional vectors like embeddings
but have different values and are unique
to each other if we have three tokens in
a text each token has its own query key
and value Vector when a token needs to
interact with other tokens to determine
the weight assignment it uses its query
Vector for example if token a wants
determine how much weight to assign to
token B it uses its query Vector as if
asking hey do you have anything useful
for me in response token B uses its key
Vector to answer here's what I can offer
we then take the dot product of these two
vectors query and key to obtain a scalar
value which gives a score this score
tells us how well token B matches token
a in order to help create a new
representation for token a each token
follows this process asking itself and
all other tokens for scores to determine
how much they match a simple and
efficient way to do all of this at once
is by performing matrix multiplication
between the query Matrix which contains
all tokens query vectors and the key
Matrix which contains all tokens key
vectors we first transpose the key
Matrix and then perform matrix
multiplication the result is a set of
scores that each token receives from all
other tokens the scores obtained after
multiplying the query and key matrices
are called attention scores for two
tokens we get four different values if
we have five tokens in a text we take a
query Matrix containing all the query
vectors and a key Matrix containing all
the key vectors after transposing the
key Matrix to match the shape we compute
25 attention scores since each token
interacts with all other tokens to
determine what information is useful for
forming its new
representation now you understand query
and key the query is used to ask for
information and the key is used to
provide that information
resulting in the attention score the
value Matrix then helps us compute the
final
representation for example to create a
new representation for the word Apple it
first interacts with other tokens then
using the computed attention scores it
determines how much importance to assign
to each token based on these values it
extracts the corresponding portion from
each token's value Vector to form a new
contextual embedding now that you
understand the purpose of query key and
value let's see how we can obtain these
vectors now that we have an embedding
for each token from the embedding layer
we transform this embedding using weight
matrices these matrices are specifically
for query key and value after applying
these Transformations we obtain the
query key and value vectors for each
token these weight matrices are
trainable parameters meaning they are
learned during training their purpose is
to transform the original embeddings
into meaningful query key and value
representations when I say these weights
are learned and transform the embedding
into its own query key and value I mean
that a neural network with linear
activations is used to train and
optimize them over time once we obtain
the query key and value vectors we
compute the attention scores by
multiplying the query and key matrices a
high score means the token assigns more
importance to that specific token after
obtaining these scores we use them along
with the value Matrix the value Matrix
is obtained by multiplying the embedding
with the weight Matrix of value this is
then used along with the attention
scores to get the final contextual
embeddings so a quick recap here to get
a new contextual representation for the
token Apple it first queries all other
tokens then gets the scores and finally
uses the appropriate proportion from
each token to get the final
representation the value each token
assigns to other tokens is called the
attention weight these weights follow a
specific pattern and we use this pattern
to compute the new embedding up until
now this is the overall process but
there is one more step left so far we
have used the query Vector to ask other
tokens and the key Vector to answer we
then multiplied them to get the
attention score however this is not the
final value that we multiply with the
value Matrix if we look at the attention
equation after multiplying query and key
we divide by a certain value and then
apply the softmax function only after
this step we multiply with the value
Matrix to get the final representation
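As a rough sketch of the pipeline described so far, here is how embeddings could be projected into query, key, and value with weight matrices, and how the raw attention scores come from multiplying the query matrix by the transposed key matrix. The sizes and random weights are purely illustrative, not taken from the video:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 8                # illustrative sizes

X = rng.normal(size=(n_tokens, d_model))        # one embedding vector per token

# weight matrices for query, key and value (random here, learned during training)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v             # query, key and value vectors per token

scores = Q @ K.T                                # raw attention scores
print(scores.shape)                             # (5, 5) -> 25 scores for 5 tokens
```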
first let's understand what softmax does
softmax is a mathematical function that
takes a list of numbers and converts
them into a list of values that can be
interpreted as
probabilities however softmax has a
problem when given a list of numbers
with a high range or large variance it
assigns a very high weight to the
largest number while the others get
values close to zero on the other hand
if the input numbers are very small or
similar all values become almost equal
to handle this issue we normalize
the attention scores by dividing them by
the square root of the dimension of the
key Vector before applying softmax this
ensures that the final attention weights
are well balanced the reason for using
the square root of the dimension of key
is to keep the attention scores at a
reasonable scale so that softmax does
not become too sensitive to large values
the exact mathematical reasoning behind
this is beyond our discussion here but
in simple terms it helps balance the
scaling effect as the dimension of key
increases
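A small numeric illustration of this saturation effect and how dividing by the square root of the key dimension softens it; the score values and d_k here are made up for the example:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                 # subtract max for numerical stability
    return e / e.sum()

d_k = 64                                    # illustrative key dimension
raw = np.array([8.0, 16.0, 24.0])           # made-up raw scores with a large spread

print(softmax(raw))                         # ~[0.0000, 0.0003, 0.9997], nearly one-hot
print(softmax(raw / np.sqrt(d_k)))          # ~[0.09, 0.24, 0.67], more balanced
```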
in the original paper they mentioned
that the dot product of query and key is
divided by the square root of the key
Dimension to counteract the effect of
increasing attention scores when the key
Dimension grows larger so if this is our
raw attention score we first divide it
by the square root of the key Dimension
to normalize it after applying softmax
across each row we get values that sum
to one and act as the attention weights
representing how much importance each
token assigns to the others for example
here token a assigns 25% importance to
itself 1% to token B 68% to token C and 5%
to token D this tells us how much token
a attends to each token similarly token
C here assigned 26% to itself 1% to
token D and so on in this way we can
interpret the attention weights now
let's summarize the steps first we
multiply query and key to get the raw
attention scores then we scale the
scores by dividing by the square root of
the key Dimension finally we apply
softmax to get the attention weights now
that we have attention weights we use
them along with the value Vector to
compute the final embedding each token
now uses its attention weight to
determine how much it should take from
other tokens for example token a assigns
60% weight to itself and 40% weight to
token B token B assigns 70% weight to
token a and 30% weight to itself we take
that proportion of the value vectors and
sum them up to get the new
representation for each token this
entire process is actually just matrix
multiplication taking the weighted sum
of each token's value is done using
matrix multiplication so after
multiplying the attention weights with
the value Matrix we get the weighted sum
which results in the new contextual
embeddings for all tokens now you know
what query key and value are how we
generate them and their roles in the
attention mechanism this entire process
from transforming the embeddings into
query key and value to scaling and
normalizing to Computing attention
weights and finally taking the weighted
sum is the self-attention mechanism
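Putting the recap together, a minimal self-attention sketch could look like this, with random untrained weights and illustrative sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n_tokens, d_model) embeddings -> (n_tokens, d_k) contextual embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled attention scores
    weights = softmax(scores, axis=-1)        # attention weights, each row sums to 1
    return weights @ V                        # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                               # 5 tokens, embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)             # (5, 8)
```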
self attention is what transforms a
static embedding into a contextual
embedding allowing us to get a
meaningful representation of text that
captures relationships between words in
the decoder architecture of a large
language model there is a variation of
self-attention called causal
self-attention in the data set
preparation video I mentioned that we
prepare training data by selecting a
context window here of size 10 and
setting the training label as the next
token in the sequence the reason we do
this is to train the model to predict
what comes next given the provided
context this is a classification task
but unlike regular classification tasks
in large language models we do not just
predict the next token after the full
context we train the model to predict
the next token at every position in the
sequence this means that for each token
the model learns to use the previous
words to predict what comes next as a
result the label size is the same as the
training token size but the labels are
shifted one step ahead of the input
tokens now let's take a simple training
example with a context size of five
first we tokenize the text and obtain
its embedding vectors after that we
compute the query and key vectors and
use them to calculate attention scores
however there's a problem here since
each token in a language model is
trained to predict the next token the
first token should only be using itself
to predict the second token but when
calculating the attention score we are
allowing the first token to attend to
all other tokens meaning it already has
access to the Token it is supposed to
predict similarly the second token is
attending to its Target token and so is
the third token meaning all tokens are
looking ahead into the future only the
last token should be allowed to attend
to all previous tokens because it
uses the full context to predict the
next token which is not in the context
window to prevent tokens from attending
future tokens we use a mask of the same
size as the sequence this mask contains
negative Infinity values in the
positions where tokens should not attend
we then add this masking Matrix to the
original attention score effectively
replacing the values that represent
attention to Future tokens with negative
Infinity the reason we use negative
Infinity is that when we apply softmax
these values will effectively be ignored
and become zero ensuring that the tokens
do not assign any weight to Future
tokens here the first token will give
100% weight to itself because all future
tokens are masked the second token will
attend only to the first and itself the
third token will attend only to the
first two and itself and so on applying
softmax to the masked attention scores
row by row gives us the attention
weights and this method is called causal
self attention which prevents tokens
from attending to Future tokens after
this the process Remains the Same we use
the attention weights to compute a
weighted sum of the value vectors giving
us the final contextual embeddings
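A sketch of how such a mask could be applied in code: the positions above the diagonal (future tokens) are set to negative infinity before the softmax, so they receive zero weight. The scores here are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                 # stand-in for the scaled Q @ K.T scores

mask = np.triu(np.full((n, n), -np.inf), k=1)    # -inf above the diagonal, 0 elsewhere

weights = softmax(scores + mask, axis=-1)        # future positions end up with weight 0
print(np.round(weights, 2))                      # lower-triangular rows, each summing to 1
```

for a sequence of n tokens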
the attention Matrix is n * n so
doubling the context size results in
four times more attention scores as the
context size increases the attention
weight Matrix grows quadratically in
size meaning the number of
attention scores that need to be
computed and stored increases rapidly
this leads to higher memory usage and
computational cost making it challenging
to scale Transformers to very long
sequences
efficiently now let's understand
multi-head attention so far we have seen
a single self attention mechanism that
takes embeddings and produces contextual
embeddings but in Transformer
architecture instead of using a single
attention head multiple self attention
mechanisms are used in parallel the idea
is that instead of performing a single
attention operation to get the
contextual embedding it is beneficial to
linearly project the query key and value
using different learned linear
projections which are lower dimensional
and then concatenate their outputs to
get the final embedding here is the
equation for multihead attention let's
write it down the reason for this is to
capture different aspects of
relationships between tokens a single
attention head May focus on word meaning
while another May focus on syntax and
another on long range dependencies this
allows the model to understand different
perspectives of the input
simultaneously let's assume the
embedding size of our model is
512 we use four attention heads and each
head has an inner dimension of
128 this means that for each attention
head the query key and value vectors
will have a dimension of 128 instead of
512 additionally we have another new
weight Matrix W out which is the output
projection the weight matrices for query
key and value are the same as in self
attention but now each attention head
has its own learnable weights and
projects the input into a lower
Dimension each individual attention head
works exactly like the self attention
mechanism we discussed earlier it
computes attention scores applies softmax
takes a weighted sum and produces an
output embedding since we have four
attention heads this process is repeated
four times each with its own learned
projection
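In the paper this is written as MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where each head is scaled dot-product attention over its own lower-dimensional projections. A rough NumPy sketch with the sizes used in this example (embedding size 512, four heads of dimension 128) and random untrained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

d_model, n_heads, d_head, n_tokens = 512, 4, 128, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))

# each head has its own lower-dimensional projections for query, key and value
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_out = rng.normal(size=(n_heads * d_head, d_model))      # output projection W out

heads = [attention(X @ W_q[h], X @ W_k[h], X @ W_v[h]) for h in range(n_heads)]
out = np.concatenate(heads, axis=-1) @ W_out              # concatenate, then project
print(out.shape)                                          # (10, 512), same as the input
```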
after that we concatenate the outputs
from all heads resulting in a 512
dimensional Vector which is the same as
our original embedding size finally we
apply a linear projection using W out to
get the final contextual
representation the reason for doing this
is that it allows the model to combine
information from multiple attention
heads capturing different types of
dependencies between words this is how
multihead attention works we take
multiple self attention mechanisms
each with its own learnable weights for
query key and value each head produces
an output embedding in a lower Dimension
these outputs are concatenated and then
linearly projected to obtain the final
contextual embedding this is multi-head
attention and it is represented by this
diagram in the Transformer
architecture now that you understand
multi-head attention let's see where it
is used in the Transformer architecture
the Transformer architecture applies
multi-head attention in both the encoder
and decoder in the encoder it uses
multi-head attention here the attention
is not masked and allows each token to attend to
all other tokens in the decoder it first
applies masked multi-head attention
right after token embedding then another
multi-head attention layer follows which
we call cross attention it is called
cross attention because it allows the
decoder to focus on different parts of
the encoder's output rather than just
attending to itself this helps in
sequence to sequence tasks like
translation where the decoder needs to
extract relevant information from the
encoder's representation here it gets the
key and value from the encoder and uses
the query from its own
decoder however since we are focusing on
large language models which only use the
decoder part of the Transformer cross
attention is not needed here the
architecture also includes layer
normalization and residual Connections
in a specific order this specific
architecture is the gpt2 model the gpt2 small
model has an embedding size of 768 a context
size of 1024 12 decoder layers and 12
attention heads let's see how the input
tokens flow in the architecture first
1024 tokens are sent into an embedding
layer which acts as a lookup table for
all tokens in the gpt2
tokenizer each token receives its
embedding vector and positional
information is added the result first
goes through a layer normalization step
in this normalization layer the model
normalizes the activations of each input
token across all features this ensures
that the values remain in a stable range
preventing the issue of exploding or
Vanishing gradients it does this by
Computing the mean and variance for each
input token and then adjusting the
values to have zero mean and unit
variance this helps the model train more
effectively and speeds up convergence
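A minimal sketch of this normalization step for a batch of token activations; the real GPT-2 layer also has a learned scale and shift, omitted here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
tokens = rng.normal(loc=3.0, scale=10.0, size=(4, 768))   # 4 tokens, embedding size 768

normed = layer_norm(tokens)                               # shape is unchanged: (4, 768)
print(normed.mean(axis=-1).round(4), normed.std(axis=-1).round(3))
```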
importantly the shape of the input
remains unchanged because normalization
only affects the values not the
structure after this the tokens pass
through multi-head attention a residual
connection is added here ensuring the
shape Remains the Same due to element
wise addition layer normalization and
residual connections are also found in
other deep learning architectures their
purpose is to stabilize training allow
better gradient flow and ensure that
important features are retained across
layers residual connections specifically
help in preventing the problem of deep
networks struggling to pass information
effectively by allowing the
gradient to flow through multiple layers
unimpeded here another layer
normalization follows again keeping the
shape unchanged before passing through a
feed forward neural network now comes
the feed forward neural network layer
this feed forward Network applies a
transformation that expands the input
representation into a higher dimensional
space allowing the model to capture more
complex relationships then it reduces
the dimensionality back to its original
shape so that it matches the input this
means the overall structure Remains the
Same while enriching the token
representation another residual
connection is applied here if You
observe the output of the decoder has
the same shape as its input this output
is then sent to the next decoder layer
and is repeated multiple times in gpt2
small we have 12 identical decoder
layers each learning new patterns and
capturing different levels of context
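As a rough outline rather than a faithful reimplementation, the per-layer flow described above (layer norm, masked attention, residual, layer norm, feed-forward, residual) could be sketched like this; a single-head attention stands in for the full multi-head block, and the feed-forward inner dimension of 4x the embedding size matches GPT-2 small:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 768, 4 * 768, 16   # GPT-2 small sizes; 16 tokens for the demo

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_attention(x, W_q, W_k, W_v):
    # single-head stand-in for the masked multi-head attention block
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)   # mask future tokens
    return softmax(scores) @ V

def feed_forward(x, W1, W2):
    # expand to the inner dimension, apply a nonlinearity, project back
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU here; GPT-2 actually uses GELU

def decoder_block(x, params):
    x = x + causal_attention(layer_norm(x), *params["attn"])   # pre-norm + residual
    x = x + feed_forward(layer_norm(x), *params["ffn"])        # pre-norm + residual
    return x                                                   # same shape as the input

params = {
    "attn": [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(3)],
    "ffn": [rng.normal(scale=0.02, size=(d_model, d_ff)),
            rng.normal(scale=0.02, size=(d_ff, d_model))],
}
x = rng.normal(size=(n_tokens, d_model))
print(decoder_block(x, params).shape)   # (16, 768) -- shape is unchanged
```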
after all decoder layers we apply a
final layer normalization followed by a
linear layer as I explained in my
previous embedding video this linear
layer functions similarly to the initial
embedding layer but instead it outputs
logits for the next token prediction in
gpt2 the weights of the embedding layer
and final linear layer are shared
meaning both the start and end of the
decoder influence the embedding process
the logits are then passed through a
softmax function to obtain a probability
distribution over all possible tokens
using labeled data we calculate the loss
at each token position to predict the
next token we then compute gradients and
apply back propagation through all the
layers updating the model's weights each
decoder layer is repeated multiple times
with each layer learning different
aspects of language the multi-head
attention mechanism captures different
levels of context in different decoder
layers but we don't fully understand why
large language models scale so well and
what these attention mechanisms learn
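To make the training objective concrete, here is a toy sketch of the per-position next-token loss: the labels are the input tokens shifted one step ahead, the logits go through softmax, and cross-entropy is averaged over every position. The token ids and logits are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

vocab_size = 10
tokens = np.array([3, 7, 2, 5, 1])        # toy token ids for one training window

inputs = tokens[:-1]                      # model input at each position
labels = tokens[1:]                       # labels are shifted one step ahead

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), vocab_size))   # stand-in for the model output

probs = softmax(logits)                                # probability over all tokens
loss = -np.log(probs[np.arange(len(labels)), labels]).mean()   # cross-entropy per position
print(inputs, labels, round(loss, 3))
```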
now that you've seen all the trainable
parameters in the model you understand
why these models have millions and
billions of parameters this is just gpt2
small in modern large language
models we see context sizes in the thousands
hundreds of attention heads dozens or
even hundreds of decoder layers
embedding dimensions in the tens of
thousands this is why we don't just call
them language models we call them large
language models that was a short summary
of the gpt2 architecture with the attention
mechanism hope you enjoyed the video
thank you for watching and see you in
the next one