Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lec. 2: Pytorch, Resource Accounting
By Stanford Online
Summary
## Key takeaways

- **Training cost scales with parameters and data**: The total computational cost for training a model, including forward and backward passes, is approximately six times the product of the number of data points and the number of parameters. This 'napkin math' helps estimate training resource needs. [50:00], [02:14:00]
- **Memory usage depends on data type and model components**: The memory required for a model is determined by the number of parameters, activations, gradients, and optimizer states. Using lower precision data types like BF16 or FP8 can reduce memory footprint but may introduce instability. [02:30:00], [06:03:00]
- **Matrix multiplication dominates compute**: Matrix multiplications are the most computationally intensive operations in deep learning. The flop count for a matrix multiplication is roughly two times the product of its three dimensions, making it the primary factor in performance. [23:49:00], [37:37:00]
- **Hardware performance varies by data type**: The theoretical peak performance (flops per second) of hardware like the H100 is highly dependent on the data type used. FP32 operations are significantly slower than BF16 or FP8 operations, highlighting the importance of mixed-precision training. [35:35:00], [44:10:00]
- **Parameter initialization prevents training instability**: Initializing model parameters with a standard Gaussian distribution can lead to exploding values, especially in deep networks. Rescaling by the square root of the input dimension, a technique related to Xavier initialization, helps maintain stable training. [01:00:43], [01:01:33]
- **MFU measures hardware utilization**: Model Flops Utilization (MFU) compares the actual computed flops by a model to the theoretical peak performance of the hardware. An MFU above 0.5 is considered good, indicating efficient use of computational resources, especially when matrix multiplications dominate. [44:20:00], [45:02:00]
Topics Covered
- The Myth of Full Precision: FP32 vs. BF16 in Deep Learning
- Flops vs. Flops/sec: Differentiating Computation and Hardware Speed
- The Unsung Heroes: Why Matrix Multiplications Dominate Compute
- The 6x Rule: Understanding Model Training Costs
- Memory Accounting: Tensors, Gradients, and Optimizer States
Full Transcript
Okay, so last lecture I gave an overview
of language models and what it means to
build them from scratch and why we want
to do that. Also talked about
tokenization which is going to be the
first half of the first assignment. Um
today's lecture will be going through
actually building a model. Um we'll
discuss the primitives in PyTorch that
are needed. Um, we're going to start
with tensors, build models, optimizers,
and training loop. And we're going to
pay close attention to efficiency, in
particular, how we're using resources,
both um memory and
compute. Okay. So, to motivate things a
bit, here's some here's some
questions. Okay. Um, these questions are
going to be answerable by napkin math.
So, get your napkins out. Um, so how
long would it take to train a 70 billion
parameter dense transformer model on 15
trillion tokens on 1,024
H100s? Okay, so you know, I'm just going
to sketch out the sort of the give you a
flavor of the type of things that we
want to do. Okay, so here's how you go
about reasoning it. Um, you count the
total number of flops um needed
to um, you know, train. So that's six
times the number of parameters times the
number of tokens. Okay. And where does
that come from? That will be uh what
we'll talk about in this lecture. Um you
can look at the promised number of flops
per second that H100 gives
you. Um the MFU, which is something
we'll see later. Let's just set it to
0.5. Um and you can look at the number
of flops uh per day that your hardware
is going to give you at this particular
MFU. So 1,024 of them
um for um you know one day and then you
just divide the total number of flops
you need to train the model by the
number of flops that you're supposed to
get. Okay. And that gives you about
144 days. Okay. So this is very simple uh
calculations at the end of the day. Um
we're going to go through a bit more
where uh these numbers uh come from and
in particular where the six times number
of um parameters times number of tokens
comes from. Um okay so here's a
question. What is the largest model you
can train on eight H100s using AdamW
if you're not being uh too
clever. Okay so um H100 has 80 gigabytes
of HBM uh memory. Um the number of bytes
per parameter that you need for the
parameters the gradients optimizer state
is uh 16 and we'll talk more about where
that comes from. Um and the number of
parameters uh is basically total amount
of memory divide by number of bytes you
need per parameter and that gives you uh
about 40 uh billion parameters. Okay.
And this is very rough because it it
doesn't take you into activations which
depends on batch size and sequence
length which I'm not really going to
talk about but will be important for
assignment
one. Okay. So this is a rough back
envelope calculation. And this is
something that you're probably not used
to doing. You just implement a model,
you train it and what happens happens.
But remember that efficiency is the name
of the game. And to be efficient, you
have to know exactly how many flops
you're actually expending because when
these numbers get large, these directly
translate into dollars and uh you want
that to be as small as possible. Okay,
so we'll talk more about the the details
of how these numbers uh arise.
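As a rough sketch of these two back-of-envelope calculations (the H100 peak of about 989e12 dense BF16 flop/s is the spec-sheet number discussed later; the MFU of 0.5 and 16 bytes per parameter are the assumptions used above):

```python
# Question 1: training time for a 70B-parameter model on 15T tokens with 1,024 H100s.
total_flops = 6 * 70e9 * 15e12                 # 6 * parameters * tokens
flops_per_sec = 1024 * 989e12 * 0.5            # 1,024 GPUs * peak flop/s * assumed MFU of 0.5
days = total_flops / flops_per_sec / (60 * 60 * 24)
print(days)                                     # roughly 144 days

# Question 2: largest model trainable on 8 H100s with naive AdamW.
total_memory = 8 * 80e9                         # 8 GPUs * 80 GB of HBM each
bytes_per_param = 16                            # parameters + gradients + optimizer state
num_params = total_memory / bytes_per_param
print(num_params)                               # about 40 billion parameters
```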
Um you know, uh we will not actually go
over the transformer. So Tatsu is going
to talk at a um over the conceptual
overview of that next time. Um, and
there's uh many ways you can learn about
transformer if you haven't already uh
looked at it. There's assignment one. If
you do assignment one, you'll definitely
know what a transformer is. Um, and the
handout actually does a pretty good job
of walking through all the different
pieces. There's a mathematical
description. If you like pictures,
there's pictures. There's a lot of stuff
you can uh look on online. So, but
instead I'm going to work with simpler
models and really talk about the
primitives and the resource accounting
piece. Okay. So remember last time I
said what kinds of knowledge can you
learn? So mechanics um in this lecture
it's going to be just PyTorch and
understanding how PyTorch works at um a
fairly primitive uh level. So that
will be pretty straightforward. Mindset
is about resource accounting and it's
not hard. It's just you just have to do
it. And intuitions um unfortunately this
is just going to be broad strokes uh you
know for now. Actually, there's not
really much intuition that I'm going to
talk about in terms of how um anything
we're doing translates to good models.
Um this is more about the M mechanics
and
mindset. Okay. So, let's start with
memory accounting. Um and then I'll talk
about compute accounting. Um and then
we'll build up bottom up. Okay. So, the
best place to start is a tensor. So
tensors are the building block for
storing everything in deep learning:
parameters, gradients, optimizer state,
data, activations. They're sort of
the atoms. Um you can read lots of
documentation about them. Um you're
probably very familiar with how to
create tensors. There's creating tensors
different ways. Um you can also create a
tensor and not initialize it and use
some special uh initialization um for
the parameters um to um if you want. Um
okay so those are um you know
tensors. Um so let's talk about memory
and how much uh memory tensors take up.
So every tensor that we'll probably be
interested in is stored as a
floating-point number. And so there's many
ways to represent floating point. So the
most default uh way um is float 32. And
float 32 has 32 uh bits. They're
allocated one for sign, eight for
exponent, and 23 for the you know the
fraction. So exponent gives you dynamic
range and fraction gives you you know uh
different basically specifies different
values. Um so float 32 is also known as
FP32 or single uh you know precision um
is sort of the gold uh standard um in
you know computing. Some people also
refer to float 32 as full precision.
That's a little bit confusing because
full is really depending on who you're
talking to. Um, if you're talking to a
scientific computing person, they'll
kind of laugh at you when you say float
32 is uh is really full because they'll
use float 64 or even more. Um, but if
you're talking to a machine learning
person, float 32 is the max you'll ever
probably need to go because deep
learning is is kind of sloppy like
that. Okay. So, let's look at the
memory. So, the memory is it's very
simple. It's determined by the number of
values you have in your tensor and the
data type of each value. Okay, so if you
create a torch tensor of a 4x8 matrix, um
by default it will give you a type of
float 32, um the size is 4 by 8, and the
number of elements is 32. Each element
size is four bytes. Um 32 bits is four
bytes. Um and the memory usage is simply
the number of elements times the
size of each element, and
that'll give you 128 bytes. Okay, so
this is should be pretty um easy and
just to give some intuition, if you take
one matrix in the feedforward layer of
GPT-3, it's this number by this number and
that gives you 2.3 gigabytes. Okay, so
that's one matrix. You know, these
matrices can be pretty big.
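A minimal sketch of this memory accounting; the GPT-3 feedforward shape used at the end (12288 x 49152) is an assumed example, not the exact figure quoted in lecture:

```python
import torch

x = torch.zeros(4, 8)                       # default dtype is float32
assert x.dtype == torch.float32
assert x.numel() == 4 * 8                   # 32 elements
assert x.element_size() == 4                # 4 bytes per float32 value
assert x.numel() * x.element_size() == 128  # 128 bytes total

# One GPT-3-scale feedforward matrix (assumed shape 12288 x 49152):
ffn_bytes = 12288 * 49152 * 4               # elements * 4 bytes each
print(ffn_bytes / 1e9)                      # about 2.4 GB, on the order of the 2.3 GB figure
```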
Okay, so uh float 32 is the default. Um
but of course these matrices get big. So
um naturally you want to make them
smaller so you use less memory. And also
it turns out if you make them smaller
you also uh make it go faster too. Okay.
Um so another type of uh representation
is called float 16 and as the name
suggests it's uh 16 bits um where both
the exponent and the fraction are um
shrunk down from um 8 to 5 and 23 to 10.
Okay, so this is known as half precision
and it cuts down half the memory. Um,
and that's all great except for the
dynamic range uh for these float 16
isn't great. So, for example, if you try
to make a number like
1e-8 in float 16,
um it uh basically rounds down to zero
and you get underflow. Okay. So uh the
float 16 is not great for representing um
very small numbers or or very big
numbers as a matter of fact. Um so if
you use float 16 for training, for
small models it's probably going to be
okay but for large models when you're um
having lots of matrices and you can get
uh instability or underflow or overflow
and bad things happen. Um okay so um one
um thing that has happened which is nice
is there's been another representation
of bfloat16, which stands for brain
float um this was developed in 2018 to
address the issue that for deep learning
um we actually care about dynamic range
more than uh we care about this
fraction. So basically BF16 allocates
more to the exponent and less to the
fraction. Okay. So it uses the same
memory as um floating point 16 but it
has a dynamic range of float 32. Okay.
So that sounds really good. Um and it
actually uh the catch is that this
resolution which is determined by the
fraction is worse. Uh but this doesn't
matter as much uh for deep learning. So
now if you try to create a a tensor with
1e-8 in BF16 then you get
something that's not
zero. Okay. So um you can dive into the
details. I'm not going to go into this
but you can stare at the actual full
specs of all the different floating
point you know
operations. Okay. Okay, so BF16 is
basically what uh you will typically use
to do computations because um it's it's
sort of you know good enough um for your
computations. It turns out that um for
storing optimizer states and parameters
you still need float 32, um otherwise
your training will go um haywire.
Um so if you're bold, there's now um
something called FP8, or uh 8-bit, and as
the name suggests um this was developed
in 2022 um by you know Nvidia. So now
they have um essentially, if you look at
FP8 next to BF16, it's like this, and with FP8, wow,
you really don't have that many bits to
store stuff, right? So it's very crude. Um
there's there's two sort of variants
depending on if you want to have more
resolution or more um dynamic range. Um
and um I'm not going to say too much
about this, but FP8 is you know supported
by H100. It's not really available on
previous generations.
Um but but at a high level, you know,
training with float 32, which is I
think, you know, is is what you would do
if you're not trying to optimize, you
know, too much and it's it's sort of
safe. It requires more memory. Um you
can go down to FP8 or
BF16. Um and but you can get some
instability. Um, basically I don't think
you would probably want to use uh a
float 16 at this point for for deep
learning. And um you can become more
sophisticated by looking at particular
places in your pipeline either forward
pass or backward pass or optimizers or
gradient accumulation and really figure
out what the minimum precision you need
at these particular places. And that
gets into kind of mixed precision
training. So for example, some people
like to use float 32 for the you know the
attention to make sure that doesn't kind
of uh you know get messed up, whereas
for simple forward passes with matmuls,
BF16 is
fine. Okay, pause a bit for for
questions. So we talked about tensors
and we looked at uh depending on how
what representation how much storage
they take.
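A quick sketch of the precision trade-off described above: same value, different data types, and 1e-8 only survives where the exponent has enough dynamic range.

```python
import torch

# Compare the layouts: float16 and bfloat16 use the same 16 bits, but
# bfloat16 spends more of them on the exponent (dynamic range).
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(dtype, "bits:", info.bits, "smallest normal:", info.tiny, "eps:", info.eps)

print(torch.tensor(1e-8, dtype=torch.float16))   # underflows to 0
print(torch.tensor(1e-8, dtype=torch.bfloat16))  # a small nonzero value near 1e-8
```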
Yeah. Can you just clarify about the mixed
precision, like when you would use 32
versus BF? Yeah. So the question is when
would you use um uh float 32 or BF16?
Um I don't have time to get into the you
know the exact uh details and it sort of
varies depending on the model size and
everything but um generally for the the
parameters and optimizer states you use
float 32. You can think about BF16 as
something that's more transitory like
you you basically take your parameters
you cast it to BF16 and you kind of run
ahead with that model but then the thing
that you're going to accumulate over
time you want to have higher precision.
Yeah. Okay. So now let's talk about uh
compute. So that was memory.
Um so
uh compute obviously depends on what the
hardware you know is by default tensors
are stored on the CPU. So for example if you
just in PyTorch um say x equals
torch.zeros(32, 32), then it'll put it on your CPU.
It'll be in um the CPU memory. Um now,
of course, that's no good because if
you're not using your GPU, then you're
going to be orders of magnitude too
slow. So, you need to explicitly say in
PyTorch that you need to move it to the
GPU. And just to make it very clear
in pictures,
there's a CPU, it has RAM, and that um
has to be moved over um to the the GPU.
There's a data transfer um which is costly,
um which takes some work um takes some
time. Okay so whenever you have a tensor
in PyTorch you should always keep in
your mind where is this residing because
just looking at the variable or just
looking at the code you can't always
tell and if you want to be careful about
computation and data movement you have
to really know where it is um you can
probably do things like assert
um where it is in various places of code
just to um document or be
sure. Okay. So
um so let's look at uh what hardware we
have. So we have um in this case we have
one GPU. Um this was run on the H100
clusters that you guys have access to.
And um this um GPU is a you know H100 80
GB of high bandwidth memory
um and there's a you know it gives you
the cache size and so on. Okay. So
um so if you have, remember, the x is on
uh CPU, um you can move it uh just by
calling you know .to, which is a kind
of a general PyTorch function. Um you
can also create a tensor directly on a
GPU so you don't have to move it at all.
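A small sketch of both ways of getting a tensor onto the GPU, plus the memory sanity check described next (assumes a CUDA device is available):

```python
import torch

if torch.cuda.is_available():
    before = torch.cuda.memory_allocated()

    x = torch.zeros(32, 32)                    # lives in CPU RAM
    y = x.to("cuda:0")                         # explicit copy over to GPU memory
    z = torch.zeros(32, 32, device="cuda:0")   # created directly on the GPU

    after = torch.cuda.memory_allocated()
    # Two 32x32 float32 tensors now live on the GPU: 2 * 32 * 32 * 4 bytes.
    print(after - before)                      # 8192
```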
Um and if everything goes well, I'm
looking at the memory allocated before
and after. Uh the difference should be
exactly uh two 32 by 32 um matrices
of four-byte floats. Okay. So um it's
8192 bytes. Okay. Okay, so this is just a
sanity check that um the code is is uh
is doing what is
advertised. Okay, so now you have your
tensors on the GPU. Um what do you uh
do? So there's many operations um that
you'll be needing for assignment one and
in general to do any deep learning
application. Um and most tensors you
just create by performing operations on
other tensors and each operations has
some memory and compute uh footprint. So
let's make sure we understand uh that.
Um so first of all um what is actually a
tensor in PyTorch, right? Tensors are a
mathematical object; in PyTorch they're
actually pointers into some allocated
memory. Okay. So if you have let's say
um a matrix 4x4 matrix um what it
actually looks like is a long
array and what the tensor has is
metadata that specifies how to get to
address into that array. And the
metadata is going to be um you know a
stride for each, or actually
one number per dimension of
the tensor. In this case, because
there's um you know two dimensions it's
uh you know stride zero and stride one
um stride zero uh specifies if you were
in dimension zero to get to the next row
to increment that index how many do you
have to skip and so going down the rows
you skip four so stride uh zero is four
and um to go to the next column um you
skip one. So stride one is one. Okay. So
with that, to find an element, let's say
(1, 2), um it's simply
just um multiply the indices by the
strides and you get to your index, which
is uh 6 here. So that would be here or
or here. Okay. So that's basically
what's going underneath the hood um for
for for tensors.
Okay, so this is um relevant because um
you can have multiple tensors that use
the same storage and this is useful
because you don't want to copy the
tensor all over the all over the place.
So imagine you have a 2x3 matrix here.
Um many operations um don't actually
create a new tensor. They just create a
different view and doesn't make a copy.
So you have to make sure that um um,
with your mutations, uh if you start mutating
one tensor it's going to cause the other
one to
mutate. Okay. So um for example if you
just get um row
zero okay so remember y is uh this
tensor and sorry x is 1 2 3 4 5 6 and y
is x[0], which is just the first
row. Okay. And um you can sort of double
check. There's this you know function I
wrote that says if you look at the
underlying storage whether these two
tensors have the same storage or not.
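The exact helper isn't shown in the transcript, so here is one possible sketch of such a same-storage check, using untyped_storage from recent PyTorch versions:

```python
import torch

def same_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Two tensors share storage if their underlying buffers start at the same address.
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
y = x[0]              # a view of the first row, no copy
assert same_storage(x, y)

y[0] = 100.           # mutating the view mutates x as well
assert x[0, 0] == 100.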
Okay. So this um definitely doesn't copy
the tensor. It just creates a view. Um
you can get column one. Um this also
doesn't uh copy the tensor. Um oops
don't need to do that. Um you can call a
view function which can take any tensor
and uh look at it in terms of the
different dimensions. Um 2x3 uh um
actually this should be maybe the other
way around. Uh as a 3x2 tensor
um so that's also doesn't you know
change uh do any copying. Um you can
transpose um that also doesn't copy. Um
and then you know like I said if you
start mutating X um then Y actually gets
mutated as well because X and Y are just
pointers into the same underlying
storage. Okay. So things are um one
thing that you have to be careful of is
that some views are contiguous which
means that if you run through the tensor
it's like just sliding through the
uh array in your storage. Um but
some are not. So in particular if you
transpose it now you're you know what
does it mean when you're you're
transposing it you're you're sort of
going down now so you're kind of if you
imagine going through the tensor you're
kind of skipping around and if you have
a non-contiguous tensor um then if you
try to further view it in a different
way then this is not going to
work. Okay. So in some cases if you have
a non-contiguous tensor you can make it
contiguous first and then uh you can
apply whatever oper viewing operation
you want to it and then in this case x
and y um do not have the same storage
because contiguous in this case uh makes
a
copy. Okay. So um this is just ways of
slicing and dicing a tensor. Um uh views
are free, so feel free to use them, define
different variables um to make it
sort of easier to read your code um
because they're not allocating any
memory. Um but you know remember that
contiguous or reshape which is basically
contiguous view um uh can uh create a
copy and so just be careful what you're
doing. Okay, questions before moving
on. All right. So, um hopefully some a
lot of this will be review for those of
you who have uh kind of done a lot of
PyTorch before, but it's uh helpful to
just do it systematically, make sure
we're on the same page. So, um here are
some operations that uh do create new
tensors and in particular elementwise
operations all create new tensors
obviously because you need a somewhere
else to store the new value. Um there's
also you know triu, an elementwise
operation that uh comes in handy when
you want to create a causal attention
mask um which you'll need for your um
your assignment. Um but nothing
that interesting uh here. Um
um okay so let's talk about matmuls.
So the bread and butter of deep learning
is matrix multiplications. And I'm sure
all of you have done a matrix
multiplication, but just in case this is
what it looks like. Um you take a 16x32
times a 32x2 matrix, you get a 16x2
matrix.
Um
and but in
general when we do you know our machine
learning application um all operations
are you want to do in a batch and in the
case of language models this usually
means um for every example in a batch
and for every sequence in a batch you
want to do something. Okay, so generally
what you're going to have in instead of
just a matrix is you're going to have a
tensor where the dimensions are
typically batch sequence and then
whatever thing you're you're trying to
do. In this case, it's a matrix for
every um token in your in your data set.
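A small sketch of this batched pattern, with made-up sizes; PyTorch broadcasts the matmul over the leading batch and sequence dimensions (as described just below):

```python
import torch

batch, seq, m, n, p = 2, 3, 16, 32, 2
x = torch.ones(batch, seq, m, n)   # a 4-dimensional tensor: one small matrix per (batch, seq)
w = torch.ones(n, p)               # a single matrix

y = x @ w                          # broadcast over the first two dimensions
assert y.shape == (batch, seq, m, p)
```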
And
so you know PyTorch is nice enough to
make this uh work well for you. So when
you take this
four-dimensional tensor and this matrix,
um what actually ends up happening is
that for every batch every example in
every token you're multiplying these two
um matrices. Okay. And then the result
is that you get your your resulting
matrix for each of the first two
elements. So this is just like a there's
nothing uh fancy going on but this is
just a pattern that um I think uh you is
helpful to you know think
about okay so I'm going to take a little
bit of a a digression and talk about um
einops. Um and so the motivation
for einops is the following. So normally in
PyTorch you define some um you know
tensors, and then you see stuff like this,
where you take x and multiply by
y.transpose(-2, -1), and you're
you kind of look at this and you say,
okay, what is -2? Um well I think
that's uh this sequence, and then -1
is this hidden, because you're indexing
backwards. Um, and it's really easy to
mess this up because if you look at your
code and you see minus one, minus two,
um, you're kind of if you're you're
good, you write a bunch of comments, but
then the comments are can get out of
date with the code and then, um, you you
have a bad time debugging. So the the
solution is to use einops here. So this
is inspired by Einstein summation
notation. Um, and the idea is that we're
just going to name all the dimensions.
um instead of you know relying on
indices essentially. Okay. So there's a
library called jaxtyping, which um is
is is helpful for um as a as a way to
specify the dimensions in the types. So
normally in PyTorch you would just
define write your code and then you
would comment oh here's what the
dimensions would be. So if you use
jaxtyping, then um you have this notation
where as a string you just write down
what the dimensions are. So this is a
slightly kind of more natural way of
documenting. Um now notice that there's
no enforcement here right because
PyTorch types are sort of a a little bit
of a lie in in in PyTorch. Uh so it can
be enforced. You you can use a checker,
right? Yeah, you can write a check but
um not by default.
Yeah.
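A minimal sketch of this annotation style; the dimension names and the function here are illustrative, not the lecture's exact code, and as noted the shapes are documented rather than enforced by default.

```python
import torch
from torch import Tensor
from jaxtyping import Float

def pairwise_scores(
    x: Float[Tensor, "batch seq hidden"],
    y: Float[Tensor, "batch seq hidden"],
) -> Float[Tensor, "batch seq seq"]:
    # The annotations name the dimensions; a runtime checker can enforce them,
    # but nothing happens by default.
    return x @ y.transpose(-2, -1)
```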
Um okay. So let's look at uh the you
know einsum. So einsum is basically
matrix multiplication on steroids with
good bookkeeping. Um so here's our
example here. We have X which is let's
just think about this as uh you have a
batch dimension, you have a sequence
dimension and you have uh four um
hiddens and Y is the same size. um you
originally had to do this thing. Um and
now what you do instead is you basically
write down the the dimensions uh names
of the dimensions of the two uh tensors.
So batch sequence one hidden batch
sequence two hidden and you just write
what dimensions should appear in the
output. Okay. So I write batch here
because I just want to uh basically uh
you know carry that over and then I
write seq1 and seq2, and notice
that I don't write hidden and any
dimension that is not named in output is
just summed over and any dimension that
is um named is sort of just iterated
over. Okay. So
um once you get used to this, this is
actually very very uh you know helpful.
Maybe if you're seeing this
for the first time it might seem a bit
you know strange and long, but uh trust me,
once you get used to it, it'll be better
than um doing -2, -1. Um if
you're a little bit uh you know slicker
you can um use dot dot dot to represent
broadcasting over any number of
dimensions. So in this case um uh
instead of writing batch I can just
write dot dot dot and this would handle
the case where instead of maybe batch I
have batch one batch two or or some
other uh arbitrary long sequence.
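A small sketch of the named-dimension einsum described above, using einops; the tensor sizes are made up for illustration:

```python
import torch
from einops import einsum

x = torch.randn(2, 3, 4)   # batch, seq1, hidden
y = torch.randn(2, 3, 4)   # batch, seq2, hidden

# "hidden" is not named in the output, so it is summed over;
# "batch", "seq1", and "seq2" are carried through.
z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2")
assert z.shape == (2, 3, 3)

# The same thing with ... broadcasting over any number of leading dimensions.
z2 = einsum(x, y, "... seq1 hidden, ... seq2 hidden -> ... seq1 seq2")
assert torch.allclose(z, z2)
```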
Yeah question does for compile this like
is it guaranteed to compile to
like I guess so the question is is it
guaranteed to compile to something
efficient? Um this uh I think the short
answer is yes. Um I don't know if you
have any you know nuances. We will
figure out the best way to reduce the
best order of uh dimensions to reduce
and then use that. If you're using it
within torch.compile, it'll only do that one
time and then you know reuse the same
implementation over and over again
better than anything designed by hand.
Yeah. Okay.
So um so let's look at reduce. So reduce
operates on one tensor and it basically
aggregates some dimension or dimensions
of the tensor. So you have this tensor
uh before you would write mean to average
over the the final dimension, and now
you basically say um, actually okay, so
let's replace this with sum. So reduce,
and um again you say hidden, and hidden
has disappeared, which means
that you are um aggregating over that
dimension. Okay. So you can check that
this indeed kind of works over
here. Okay. So, so maybe one final
example of uh this is um sometimes
um in a you know tensor one dimension
actually represents multiple dimensions
and you want to unpack that and operate
over one of them and pack it back. So um
in this case uh let's say you have batch
uh sequence and then this
eight-dimensional vector is actually a
flattened representation of number of
heads times some you know hidden
dimension. Okay so and then you have a
way a vector that is um needs to operate
on that hidden dimension. So you can do
this very elegantly uh using um einops uh
by um calling rearrange, and this
basically, you can think about it: we saw
view uh before it's kind of like um kind
of a you know fancier version which
basically looks at the same data but uh
um you know differently. Um so here uh
it basically says this dimension is
actually heads in hidden one. I'm going
to explode that into two dimensions. Um,
and uh, you have to specify the the
number of heads here because there's
multiple ways to split a number into,
you know, two. Um, let's see. This might
be a little bit long. Um, okay, maybe
it's not worth looking at right now.
Um and given that x you can uh
perform your transformation using
einsum. So this is something hidden1,
which corresponds to x, and then hidden1
hidden2, which corresponds to w,
and that gives you um something
hidden2.
Okay. Um and then you can rearrange
back. So this is the just the inverse of
breaking up. So you have the your two
dimensions and you group it into one. So
that's just a flattening operation
that's uh um you know with uh everything
all the other dimensions kind of left
alone. Okay. So there is a tutorial um
for for this that I would recommend you
go through and it gives you a bit more.
So um you don't have to use this um
because you're building it from scratch.
So you can kind of do anything you want
but um in assignment one we do give you
guidance and it's something probably to
invest
in.
Okay.
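A minimal sketch of the unpack-operate-repack pattern just described, along with reduce; the sizes (2 heads, hidden1 = hidden2 = 4) are assumptions for illustration:

```python
import torch
from einops import rearrange, reduce, einsum

x = torch.randn(2, 3, 8)   # batch, seq, (heads * hidden1) with heads=2, hidden1=4
w = torch.randn(4, 4)      # maps hidden1 -> hidden2

# Explode the flattened dimension into (heads, hidden1); heads must be given
# because there are multiple ways to split 8 into two factors.
x = rearrange(x, "batch seq (heads hidden1) -> batch seq heads hidden1", heads=2)
x = einsum(x, w, "... hidden1, hidden1 hidden2 -> ... hidden2")
# Group the two dimensions back into one (the inverse, a flattening operation).
x = rearrange(x, "batch seq heads hidden2 -> batch seq (heads hidden2)")
assert x.shape == (2, 3, 8)

# reduce: any dimension missing from the output is aggregated.
y = reduce(x, "batch seq hidden -> batch seq", "sum")
assert y.shape == (2, 3)
```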
Um so now let's talk about
uh the computational cost of tensor
operations. So we introduce a bunch of
operations um and you know how much do
they cost? So a floating-point operation
is uh any operation on floating-point numbers, like
addition or multiplication. These are
them and um these are kind of the you
know main ones that are going to um I
think matter in terms of flop count. Um,
one thing that is uh is sort of a pet
peeve of mine is that when you say
flops, it's actually unclear what you
mean. So you could mean flops with a
lowercase s, which stands for number of
floating-point operations. This measures the
amount of computation that you've done,
or you could mean FLOPS, um also written
with an uppercase S, which means floating-point
operations per second, which is used to
measure the speed of hardware. So um
we're not going to in this class use
uppercase s because I find that very
confusing and just write slash s to
denote that this is floating point per
second.
Okay. Okay. So just to give you uh some
intuition about flops. Uh GPT-3 took
about uh 3e23 flops. GPT-4 was 2e25
flops, um speculation. And there was a US
executive order that any foundation
model with over 1e26 flops had to be
reported to the government, um which now uh
has been revoked, um but the EU still
has something,
the EU AI act, which is
1e25, which hasn't been revoked.
um so uh you know some intuitions uh
A100 has a peak performance of 312
teraflop per second.
Um and um H100 has a peak performance of
um 1979 teraflop per second with uh
sparsity and approximately 50% without
um and um if you look at uh you know the
Nvidia has these specification sheets um
so you can see that um the the flops
actually depends on what you're trying
to do. So if you're using FP32,
it's actually really really bad. Like
the if you run FP32 on H100, you're not
getting um it's it's orders of magnitude
worse than if you're doing um FP16 or um
and if you're willing to go down to FP8,
then it can be even faster. And you
know, for the for when I first read it,
I didn't realize, but there's an
asterisk here and this means with
sparsity. So uh usually, a lot
of the matrices we have in this class
are dense. So you don't actually get
this. You get something like uh you know
it's exactly half that number. Exactly
half.
Okay. Okay.
So now you can do a back-of-envelope
calculation. Eight H100s for two
weeks is um just you know eight times
the number of flops per second times the
number of seconds in a in a week.
Um actually this is this might be one
week. Okay, so that's one week, and
that's uh 4.7e21,
um which is you know some number, and you
can kind of contextualize the flop
count against other models' flop counts.
Yeah. Sparsity mean? So
what does sparsity mean? That means if
your matrices are uh sparse. It's a
specific, like, structured sparsity. It's
like two out of four elements in each
like group of four elements is zero.
That's the only case if you get that
that speed. No one uses it.
Yeah, it's a marketing department uses
it.
Okay. So, um let's go through a simple
example. So, remember we're not going to
touch the transformer, but I think even
a linear model gives us a lot of the
building blocks and intuitions. So
suppose we have n points, each point is
d-dimensional, and the linear model is
just going to map each d-dimensional
vector to a k-dimensional uh vector.
Okay. So let's set some number of points
um is B, dimension is D, K is the number
of outputs. Um and let's create our data
matrix X um our weight matrix uh uh W
and the linear model is just some map.
So nothing you know too too interesting
going on. And
um you know the question is how many
flops was that?
And uh the way you would you look at
this is you say well um when you do the
matrix multiplication, you have basically,
for every (i, j, k) triple, I have to multiply
two numbers
together and I also have to add that
number to the the the total. Okay. So
the answer is 2 times
um the basically the product of all the
dimensions involved. So the the left
dimension, the middle dimension and the
right
dimension. Okay, so this is something
that you should just kind of remember if
you're doing a matrix multiplication.
The number of flops is two times um the
product of the three dimensions.
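A small sketch of that 2 x (product of the three dimensions) rule for the linear model above, with made-up sizes:

```python
import torch

B, D, K = 1024, 256, 64            # made-up sizes: data points, input dim, output dim
x = torch.randn(B, D)
w = torch.randn(D, K)
y = x @ w                          # for every (i, j, k): one multiply and one add

matmul_flops = 2 * B * D * K
print(matmul_flops)                # 2 * 1024 * 256 * 64 = 33,554,432 flops
```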
Okay, so the flops of other operations
are usually kind of linear in the size
of um the the matrix or tensor. Um and
in general, no other operation you
encounter in deep learning is as expensive
as matrix multiplication for large
enough uh matrices. So this is why I
think a lot of the napkin math is very
simple because we're only looking at the
matrix multiplications that are going
are performed by the model. Um now of
course there are regimes where if your
matrices are um small enough then the
cost of other things starts to dominate
but generally that's not a good regime
you want to be in because the hardware
is designed for big matrix multiplications.
So um it's a little bit
circular, but we kind of end up in
this regime where we only consider uh
models where the matmuls are the
dominant um uh you know
cost. Okay any questions about this this
number two times the product of the
three dimensions. This is just a useful
thing. Would the algorithm of matrix
multiplication always be the same? Because
the chip might have optimized
that. Are they always the same?
Yeah. So the question is like uh is the
does this essentially does this depend
on the the matrix multiplication you
know algorithm. Um in general I guess
we'll look at this the next week when we
or the week after when we look at
kernels. I mean actually there's a lot
of optimization that goes underneath
under the hood when it comes to matrix
uh multiplications. Um and there's a lot
of specialization depending on the
shape. Um so this is I would say this is
just a kind of a crude you know estimate
um that is ba basically like the right
order of
magnitude. Okay. So um yeah uh additions
and multiplications are considered
equivalent. Yeah, additions and
multiplications are considered are
equivalent.
So um one way I find helpful to
interpret this so at the end of the day
this is just a matrix multiplication but
I'm going to try to give a little bit of
meaning um to this um which is why I've
set up this as kind of a little toy
machine learning problem. So B really
stands for the number of data points, and
D times K is the number of parameters. So for
this particular model the number of
flops that's required for forward pass
is two times the number of tokens or
number of data points times the number
of
parameters. Okay. So this turns out to
actually generalize to transformers.
There's an asterisk there because uh
um there there's you know the sequence
length and other stuff but this is
roughly right uh if your sequence
length isn't isn't too
large.
Um
so okay so
now this is just a number of
floating-point operations, right? Um so how
does this actually translate to wall-clock
time, which is presumably the thing
you actually care about how long do you
have to wait for your run um so let's
time this so I have this function that
is just going to
um
um do it five times and I'm going to
perform the matrix multiply operation.
Um we'll talk a little bit later about
this um um two weeks from now why the
other code is here. Um but um for now we
get an actual time. So that matrix multiply uh
took you know 0.16 seconds,
um and the actual flops per second, which
is how many flops it did per second,
is uh
5.4e13. Okay. So now you can compare this
with you know the marketing materials
and um for the A100 and H100
um and you know as we look at the spec
sheet the the flops depends on the data
type and we see that the the promise
flops per second which um you know for
H100 for I guess this is for float 32 is
you know 67 you know teraflops as we
looked
and so that is the number of promise
flops per second we we had um and now if
you look at the uh there's a helpful
notion called model flops utilization or
MFU which is the actual number of flops
divided by the promise flops okay so you
take the actual number of flops remember
which
what you actually uh you know witnessed
the number of floatingoint operations
that that are useful for your model um
uh divided by the actual time it took
divided by this promise flops per second
which is you know from the glossy
brochure, um you can get an MFU of uh
0.88.
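A sketch of the timing-plus-MFU calculation above; the matrix sizes are made up, and the 67e12 flop/s denominator is the H100 FP32 spec-sheet figure quoted earlier:

```python
import time
import torch

def time_matmul(x: torch.Tensor, w: torch.Tensor, trials: int = 5) -> float:
    torch.cuda.synchronize()                 # flush queued GPU work before timing
    start = time.time()
    for _ in range(trials):
        y = x @ w
    torch.cuda.synchronize()                 # matmuls run asynchronously, so sync again
    return (time.time() - start) / trials

if torch.cuda.is_available():
    B, D, K = 16384, 32768, 8192             # made-up sizes, large enough to keep the GPU busy
    x = torch.randn(B, D, device="cuda")
    w = torch.randn(D, K, device="cuda")
    seconds = time_matmul(x, w)
    actual_flops_per_sec = (2 * B * D * K) / seconds
    mfu = actual_flops_per_sec / 67e12       # actual flop/s divided by promised flop/s
    print(actual_flops_per_sec, mfu)
```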
Okay. So, usually you see people talking
about their MFUs and um something
greater than 0.5 is you usually
considered to be good and if you're like
you know 5% MFU that's uh considered to
be really bad. usually can't get, you
know, close to that close to, you know,
uh, you know, 90 or 100 um because this
is sort of ignoring all sort of
communication and overhead. It's just
like the literal uh computation of the
flops. Okay. And usually MFU is much
higher if the matrix
multiplications dominate. Um Um, okay.
So that's MFU. Any any questions about
uh this? Yeah. Uh you're using the
promise flop per sec uh not considering
the sparse.
So this promised flop per sec is uh not
considering the sparsity. Yeah.
Um one note is like uh, you know, there's also
something called hardware you know flops
utilization, and the motivation
here is that we're trying to
look at the model:
it's called model flops utilization because we're
looking at the number of effective
useful operations that the model is you
know performing okay And uh so it's a
way of kind of standardizing. It's not
the actual number of flops that are are
done because you could have optimization
in your code that cache a few things or
redo um you know um recomputation of
some things and in some sense you're
still computing the same model. So what
matters is that you're this is sort of
trying to look at the model complexity
and you shouldn't be penalized in your MFU just
because you were clever, if
you were clever and you didn't actually
do the flops um but you you said you
did. Okay. So you can also do the same
with
BF16.
Um Oops.
And here we see that for uh BF16, the um
the time is actually much uh better,
right, so 0.03 um instead
of 0.16, so the actual flops per second is
higher. Um even accounting for
sparsity, the the promised flops is uh
still quite high, so the MFU is actually um
lower for BF
16. Um this is you know maybe
surprisingly low but
um but sometimes the promise flops is a
bit of a you know
optimistic. So always benchmark your
code uh and don't just uh kind of assume
that you're going to get certain um
levels of
performance. Okay. So uh just to
summarize um matrix multiplications u
dominate the the compute and the general
rule of the thumb is that it's two times
the product of the dimensions
flops. Um
the flops per second floating points per
second depends on you know the hardware
and also the data type. So the fancier
the hardware you have, the higher it is,
the the smaller the data type, the
usually the faster it is. Um and uh MFU
is a useful notion to look at how well
you've you're essentially squeezing you
know your your hardware.
Yeah, I I've heard that often uh in
order to get like the maximum
utilization, you want to use these like
tensor cores on the machine. And so like
does PyTorch by default use these tensor
cores and like are these uh accounting
for that? Yeah. So the question is uh
what about those tensor cores? So if you
go to this um spec sheet um you'll see
that
um you know these are all on the tensor
core. So the tensor core is basically
you know you know specialized hardware
to do um matmuls. Um so if you are
um you know if you're so by default it
should use it and if you especially if
you're using PyTorch you know compile it
will generate the the code that will um
use the the hardware
properly. Okay. So um let's talk a
little about you know gradients. So, and
the the reason is that we've only looked
at matrix multiplication or in other
words basically feed forward um forward
passes and the number of flops. Um but
there's also a computation that comes
from computing gradients and we want to
track down how much uh that is. Um
okay so just to consider a simple
example simple linear model um where you
take the uh the prediction of a linear
model and you um you look at the MSE
with respect to five. though not a very
interesting loss but I think it'll it's
illustrative for uh looking at the the
gradients. Okay. So remember in the
forward pass you have your x you have
your your w which um you want to compute
the gradient with respect to you make a
prediction um by taking a linear product
and then you you have your loss. Okay.
And in the backward pass uh you just
call loss backwards. And um in this case
the the gradient which is this variable
attached to the tensor is um turns out
to be what you what you
want. Okay. So everyone has done uh you
know gradients in PyTorch before.
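A minimal sketch of that forward/backward example: a linear prediction and an MSE-style loss against the constant 5.

```python
import torch

x = torch.tensor([1., 2., 3.])
w = torch.tensor([1., 1., 1.], requires_grad=True)   # the parameter we differentiate w.r.t.
pred = x @ w                                          # forward pass: a linear prediction
loss = 0.5 * (pred - 5) ** 2                          # not a very interesting loss
loss.backward()                                       # backward pass populates w.grad
print(w.grad)                                         # d loss / d w = (pred - 5) * x
```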
Um so let's look at how many flops uh
are required for computing gradients.
Um okay.
So, let's uh look at a slightly more
complicated model. So, now it's uh a
two-layer linear model where you have X
which is B by D times um W1 which is D
by D. So that's the first layer and then
you take the your hidden activations um
H1 and you pass it through another
linear layer W2 and to get a K
dimensional vector and you do some
compute some
loss. Okay, so this is a two-layer
linear network. And just as a kind of
review, if you look at the number of
forward flops, um what you had to do was
um you have to multiply look at W1. You
have to multiply X by W1
um and add it to your
H1 and you have to take H1 and W2 and
you have to add it to your your H2.
Okay, so the total number of flops
again is 2 * the the product of all the
dimensions in your matmul plus two
times the product of the dimensions of your matmul
for the second matrix. Okay, in other
words, two times the total number of
parameters in this
case. Okay, so so what about the
backward pass? So this part will be um a
little bit more
involved. Um so we can recall the model
uh x to h1 to h2 and the loss. So in the
backward path you have to compute a
bunch of gradients and the gradients
that are relevant is you have to compute
uh the gradient with respect to um h1
um you know h2 w1 and
w2 of of the loss. So D loss D each of
these variables. Okay. So how long does
it take to compute that? Let's just look
at W2 for now. Okay. So the things that
touch W2 um you can compute by looking
at the the chain rule. Um so
w2.grad. So the gradient, d loss /
d w2, is um, you sum um h1 times uh the
gradient of the loss with respect to h2.
Okay. So that's just a chain rule for um
for w2. And this is um so all the
gradients are the you know the same size
as the underlying you know vectors.
Um
so this turns out to be um essentially
looks like a matrix uh you know
multiplication and so the same you know
calculus holds which is that it's 2
times the number of uh the product of
all the dimensions B * D *
K okay but this is only the gradient
with respect to uh
W2 We also need to compute the gradient
with respect to h1 because we have to
keep on back propagating um to w1 and
and so on. Okay. So that is uh going to
be um you know the product of w2
um you know times uh you know
h2, sorry, I think this should be the
grad of h2, h2.grad.
Um so that turns out to also be
essentially you know um looks like the
matrix multiplication and it's the same
number of flops for for computing the
gradient of uh each
one. Okay. So when you add the two uh so
that's just for w2. Um you do the same
thing for W1 and that's uh which has D *
D parameters and when you add it all uh
up it's uh so for this um for W2 the
amount of computation was 4 * B * D * K
and for W1 it's also 4 * B * D * uh D,
because W1 is D by D.
Okay. So, um I know there's a lot of
symbols here. I'm going to try also to
uh give you a visual account for this.
So, this is from a a blog post that I
think um may work uh better. We'll see.
Um okay, I have to wait for the
animation to loop back. So, so basically
this is uh one layer of the neuron now.
It has um you know the hiddens and then
the weights to the next layer. And so I
have to Okay, the problem with this
animation is I have to
wait. Okay, ready set. Okay, so first I
have to multiply W and A and I have to
add it to this. That's a forward pass.
And now I'm going to multiply this these
two and then add it to to that. And I'm
going to multiply and then add it to
that. Okay.
Any
questions? I wish there were a way to slow this
down. Um, but you know, the details
maybe I'll I'll let you kind of ruminate
on, but um the the high level is that
there's two times the number of
parameters for the forward pass and four
times the number of parameters for the
backward pass. And um we can just kind
of work it out uh via the chain rule
here. Yeah. For the homeworks, are we
also using the you said some PyTorch
implementation is allowed, some isn't.
Are we allowed to use autograd, or are we
doing the gradient entirely by hand? Uh so the question is, in
the homework, are you going to compute
gradients by hand? And the answer is no.
Uh you're going to just use PyTorch
gradients. This is uh just to break it
down so we can do the the um counting
flops. Okay. Any any questions about
this before I move on.
Okay, just to summarize the forward pass
is for this particular model is 2 times
the number of data points times the
number of parameters and backward is
four times the number of data points
times the number of parameters which
means that total it's six times number
of data points times parameters okay and
that explains why uh there was that
six in the beginning when I asked the
motivating uh
question So now this is for a simple you
know linear model. Um it turns out that for
many models this is uh basically the
bulk of the computation,
um when essentially every computation
you do uh you know touches
essentially new parameters, roughly,
right and you know obviously this
doesn't hold you can find models where
this doesn't hold because you can have
like one parameter through parameter
sharing um and have you know a billion
flops but that's generally what not what
models look like.
Okay, so let me um move
on. Um so far I've I've basically
finished talking about the the resource
accounting. So we looked at tensors. We
looked at some computation on tensors.
We looked at how uh much tensors take to
store and also how many flops takes
tensors uh take when you do various
operations on them. Um now let's start
building up uh different you know
models. Um I think this part isn't
necessarily going to be um you know that
uh you know conceptually you know
interesting or challenging but it's more
for maybe just you know
completeness. Okay. So
um so parameters in PyTorch are stored
as these nn.Parameter
um objects. Um let's talk a little bit
about parameter initialization.
Um so if you have let's say a
uh um you know parameter that has uh
okay so you you generate um a okay do
your sorry your w parameter is an input
dimension by hidden dimension matrix um
you're still in the linear model case so
let's just generate an input and let's
feed it through to get the output. Okay, so randn,
a unit Gaussian,
uh you know seems innocuous. Um what
happens when you do this is that if you
look at the output um you get some
pretty large numbers right and this is
because, you know, the
numbers uh grow as essentially the
square root of the the hidden dimension.
Um and so when you have large models, uh
this is going to you know blow up and um
and training can be very unstable.
Um so so typically what you want to do
is initialize in a way that's um you
know invariant to hidden or at least you
know when you you're guaranteed that
it's not going to you know blow up. And
one simple way to do this is just
rescale by the one over square root of
number of you know of inputs. Um so
basically let's redo this w equals you
know parameter where I simply divide by
the square root of the input dimension.
Um and then now when you uh feed it
through the output now you get things
that are stable around you know this
will actually concentrate to um you know
something like Normal(0, 1).
Okay, so this is basically you know this
has been explored pretty extensively in
deep learning literature is known uh up
to a constant as Xavier initialization
and typically I guess it's uh fairly
common, if you want to be extra safe, uh
you don't trust the normal because it
has unbounded tails, and
you just say I'm going to truncate to
[-3, 3] so I don't get any large values
and I don't want anything to to mess with
that.
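A small sketch of the rescaled initialization above (up to a constant, this is the Xavier-style scaling), plus the truncated-Gaussian variant; the sizes are made up:

```python
import math
import torch
import torch.nn as nn

input_dim, hidden_dim = 1024, 1024

# Plain unit Gaussian: outputs grow like the square root of the input dimension.
w_bad = nn.Parameter(torch.randn(input_dim, hidden_dim))

# Rescale by 1/sqrt(input_dim) so outputs stay on the order of 1.
w = nn.Parameter(torch.randn(input_dim, hidden_dim) / math.sqrt(input_dim))

# Extra-safe variant: truncate the Gaussian to [-3, 3] before rescaling.
w_trunc = nn.Parameter(
    nn.init.trunc_normal_(torch.empty(input_dim, hidden_dim), std=1.0, a=-3.0, b=3.0)
    / math.sqrt(input_dim)
)
```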
Okay.
Um Okay. So
um so let's build a you know just a
simple model.
Um it's going to have um dimensions and
two layers. There's this you know I just
made up this name cruncher. It's a
custom model which is a deep linear
network which has um num layers layers
and each layer is a a linear um model
which has essentially just a matrix
multiplication. Okay.
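The lecture's exact code isn't shown here, so this is one possible sketch of such a deep linear "Cruncher" model, with each layer being just a matrix multiplication:

```python
import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        # Rescaled initialization, as discussed above.
        self.weight = nn.Parameter(torch.randn(dim_in, dim_out) / dim_in**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight       # just a matrix multiplication

class Cruncher(nn.Module):
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([Linear(dim, dim) for _ in range(num_layers)])
        self.head = Linear(dim, 1)   # final layer / head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return self.head(x)
```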
Um so um the parameters of this
model look like um: I have um
um layers, the the first layer um
which is a D by D matrix, uh the second
layer which is also a D by D matrix, and
then I have a um a head um or a final
layer. Okay. So if I get the number
of parameters of this model, then it's
going to be um d^2 + d ^2 + d. Okay, so
nothing too surprising there. Um and I'm
going to move it to the GPU because I
want this to run fast. And um
I'm going to generate some random data
and feed it through the model. And the
forward pass is just going through um
the layers um and then finally applying
the
head. Um okay. So with that model I
let's try to I'm going to use this model
and do some you know stuff with it. Um
but just one kind of general digression.
Um, randomness is something that uh is
is sort of can be annoying in some cases
if you're trying to reproduce a bug for
example. It shows up in many places
initialization dropout data ordering.
Um, and just a best practice is we
recommend you always fix a random
seed. Um, so you can reproduce your your
model or at least as well as you can.
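A minimal sketch of fixing every source of randomness, as recommended here:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)          # Python's random module (e.g., data shuffling)
    np.random.seed(seed)       # NumPy randomness
    torch.manual_seed(seed)    # PyTorch initialization, dropout, etc.

set_seed(0)
```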
Um and in particular having a different
random seed for every source of
randomness is is you know nice because
then you can for example fix
initialization or fix the data ordering
but vary other other things. Um
determinism is your your friend when
you're debugging. Um and you know in
code unfortunately there's you know many
places where you can use randomness and
just be uh cognizant of you know which
one you're using and just if you want to
be safe just uh set the seed for all of
them. Um data loading um I guess I'll go
through this quickly. It's it's not um
it'll be useful for your your
assignment. Um so in language modeling
data is typically just a sequence of
integers because this is remember output
by the tokenizer. Um and you serialize
them into um, you can serialize them into
NumPy arrays. Um and one I guess thing
that's maybe useful is that uh you don't
want to load all your data into memory
at at once, because for example the Llama
data is 2.8 terabytes. Um, but you
can sort of pretend to load it by using
this uh handy function called memmap,
which gives you essentially a variable
that is mapped to a file. So when you
try to access um the data, it
loads the file on demand, and then
using that you can uh create a data
loader that um you know
samples batches from your um data. So I'm
going to skip over that just in interest
of time. Um let's talk a little bit
about you know optimizer. So we've
defined our
model.
Um so there's many optimizers. Um just
kind of maybe going through the
intuitions behind some of them. So of
course there's stochastic gradient
descent. You compute the gradient of
your batch. You take a step in that
direction. No questions asked. Um
there's an idea called momentum, which
dates back to classic optimization (Nesterov),
um where you uh have a running average
of your um gradients and you update
against the you know the running average
instead of your instantaneous you know
gradient. Um and then you have um AdaGrad,
which um you you know you scale the
gradients by uh your
um your the average over the the norms
of your, or I guess not the norms, the the
square of the gradients. You also have
RMSProp, which uh is an improved version
of AdaGrad which uses exponential
averaging rather than just like a flat
average. And then finally Adam, which
appeared in 2014, which is essentially
combining RMSProp and momentum. So
that's why you're maintaining both uh
your running average of your gradients
but also running average of your
gradient squared. Okay. So since you're
going to implement uh Adam um in
homework one I'm not going to do that
instead I'm going to implement um you
know AdaGrad. So so the way you implement
an
optimizer in in you know PyTorch is
that you override the the Optimizer
class and you have to um uh let's see
maybe
I'll and I'll get to the implementation
once we step through it Um so let's
define some data. Um compute the forward
pass on the loss and then you compute
the gradients and then you when you call
optimizer.step
um this is where the optimizer uh
actually is active. So what this looks
like is um your parameters are grouped u
by for example you have one for the
layer zero layer one and then the final
you know weights um and you can access a
state which is a a dictionary from
parameters to um you know whatever you
want to store as optimizer state. um the
gradient of that parameter you assume is
already uh calculated um by the the
backward pass. Um and now you can do
things like um you know
in in AdaGrad you're storing the sum of
the gradient uh squares. So you can get
that G2 variable um and you can update
that based on the square of the
gradient. So this is element wise
squaring of the um the gradient and you
put it back into the
state. Okay. So then your obviously your
optimizer is responsible for updating
the parameters and this is just the you
know you update the learning rate times
the gradient divided by this uh scaling.
So now this state is kept over across
multiple uh invocations of um you know
the
optimizer. Okay. So um and then at the
you know end of your optimizer step you
can you know free up the memory uh just
to um which is I think going to actually
be more important when you look when we
talk about model parallelism.
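The walkthrough above is easier to follow with code in front of you, so here is a minimal AdaGrad sketch along those lines; the lecture's exact implementation isn't shown, so treat this as one possible version.

```python
import torch

class AdaGrad(torch.optim.Optimizer):
    """Minimal AdaGrad: the state stores the running sum of squared gradients."""

    def __init__(self, params, lr: float = 0.01, eps: float = 1e-10):
        super().__init__(params, dict(lr=lr, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:          # parameters are grouped (layers, head, ...)
            lr, eps = group["lr"], group["eps"]
            for p in group["params"]:
                if p.grad is None:               # gradients were computed by loss.backward()
                    continue
                state = self.state[p]            # per-parameter optimizer state
                if "g2" not in state:
                    state["g2"] = torch.zeros_like(p)
                state["g2"] += p.grad ** 2       # element-wise squared gradient, accumulated
                p -= lr * p.grad / (state["g2"].sqrt() + eps)  # scaled parameter update
```

Usage follows the steps described above: construct it from model.parameters(), call loss.backward(), then optimizer.step(); the state dictionary persists across invocations.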
Okay, so let's talk about the memory
requirements of uh the optimizer states
and actually basically at this point
everything. So um you need to uh the
number of parameters in this model is d²
times the number of layers plus d for
the final uh head. Okay.
Um the number of
activations. So this is something we
didn't do before but now for this simple
model it's fairly easy to do. It's just
um B times you know D times the the
number of of layers you have for every
layer for every data point for every
dimension you have to hold the
activations.
Um for the gradients um this is the same
as the number of parameters and the
number of optimizer states in for
AdaGrad it's uh you remember we had to
store
um the the gradient squared. So that's
another copy of the parameters. So
putting all
together we have um the total memory is
assuming you know FP32 which means four
bytes times the number of parameters,
number of activations, number of
gradients and number of optimizer
states. Okay. And that gives us uh you
know some number which
um is 496
here. Okay. So this is um a fairly
simple calculation in the assignment
one. Um you're going to do this for the
transformer which is a little bit more
involved because you have to there's not
just matrix multiplications but there's
um many matrices there's attention and
there's all these other things but the
general form of the calculation is the
same. You have parameters activations
gradients and optimizer states.
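A small sketch of that accounting for the deep linear model, assuming float32 (4 bytes) everywhere and made-up sizes:

```python
B, D, num_layers = 1024, 256, 2      # made-up sizes: batch, dimension, layers

num_parameters = num_layers * D * D + D     # the layers plus the final head
num_activations = B * D * num_layers        # one value per layer, data point, dimension
num_gradients = num_parameters              # one gradient per parameter
num_optimizer_states = num_parameters       # AdaGrad keeps one squared-gradient sum per parameter

total_bytes = 4 * (num_parameters + num_activations + num_gradients + num_optimizer_states)
print(total_bytes)
```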
Okay. And
the so and the flops required again for
this model is six times uh the number of
tokens or number of data points times
the number of parameters and um you know
that's basically concludes the resource
uh accounting for this particular model.
Um and uh if for reference if you're
curious about working this out for for
transformers you can um consult um some
of these
articles. Okay. So in the remaining time
I think um maybe I'll pause for
questions. Um and we talked about
building up the tensors and then we
built a kind of a very small model and
you know we talked about optimization
and how many how much memory and how
much uh compute was required.
Yeah.
So the question is why do you need to
store the activations? So naively you
need to store the activations because
when you're uh when you're doing the
backward pass, the gradients um of let's say
the first layer depend on the
activation. So the gradients of the i-th
layer depend on the activation there.
Um now if you're smarter you don't have
to store the activations, or you don't
have to store all of them. You can
recompute them, and that's a
technique uh called activation
checkpointing which we're going to talk
about later.
Okay, so let's just do this quickly. Uh
you know, actually there's not much to say
here, but you know here's your typical
you know training loop, um where you
define the model, define the optimizer,
and you get the data, um feed forward,
backward, and take a step in parameter
space.
Um and
um I
guess it'll be more interesting, I I
guess next time I should show like an
actual plot, which isn't available
in this uh in this version yet. Um.
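A minimal sketch of such a training loop, reusing the Cruncher model and AdaGrad optimizer sketched earlier (both are assumptions, not the lecture's exact code); random data stands in for a real data loader:

```python
import torch

D, B, num_steps = 256, 8, 100
model = Cruncher(dim=D, num_layers=2).to("cuda")
optimizer = AdaGrad(model.parameters(), lr=0.01)

for step in range(num_steps):
    x = torch.randn(B, D, device="cuda")     # stand-in for a real data loader
    y = torch.randn(B, 1, device="cuda")
    loss = ((model(x) - y) ** 2).mean()      # forward pass and loss
    loss.backward()                          # backward pass computes gradients
    optimizer.step()                         # take a step in parameter space
    optimizer.zero_grad(set_to_none=True)    # clear gradients for the next step
```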
so one note about checkpointing
um so training language models takes a
long time and you certainly will crash
at some point so you don't want to lose
all your progress.
Um so you want to periodically save your
model to disk and just to be very clear
the thing you want to save is both the
model and uh the
optimizer and probably you know which
iteration you're on. Um I should add
that and then you can just load it up.
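A small sketch of that checkpointing pattern: save the model, the optimizer, and the iteration, then load them back after a crash.

```python
import torch

def save_checkpoint(path, model, optimizer, iteration):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "iteration": iteration,
    }, path)

def load_checkpoint(path, model, optimizer):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["iteration"]           # resume training from this iteration
```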
Um one maybe final note and I'll uh end
is um I alluded to kind of mixed precision
training. Um you know choice of the
data type has the different trade-offs.
If you have higher precision it's uh
more accurate and stable um but it's
more expensive, and lower precision vice
versa. Um and as we mentioned before, by
default the recommendation is use float
32, um but try to use BF uh 16 or even
FP8 whenever possible. So you can use um
lower precision for the forward pass
um but float 32 for the rest. Um and this
is an idea that goes back to the you
know 2017 there's exploring mixed
precision training. Um, PyTorch has um
some tools that automatically allow you
to do, you know, mixed precision
training. Um, because it can be sort of
annoying to have to specify which parts
of your model it needs to be, you know,
what precision. Um, generally you define
your model as, you know, sort of this
clean modular thing and and the
specifying the precision is sort of like
a, you know, something that needs to cut
across that.
Um and one I guess maybe one kind of
general comment is that people are
pushing the envelope on what precision
is needed. there's some you know papers
that show you can actually use um FP8
you know all the way you know through um
there's um I guess one of the challenges
is of course when you have lower
precision it gets very numerically un
unstable but then you can do various
tricks to you know control the uh the
numerics of your model during training so
that you don't get into these you know
bad regimes. So this is where I think um
the systems and the model architecture
design kind of are synergistic, because
you want to design models; now,
a lot of model design is just
governed by hardware. So even the
transformer, as we mentioned last time, is
governed by having GPUs, and now
we notice that you know Nvidia chips
have the property that um lower
precision, even like int4 for example,
is a thing. Now if you can make your
model training actually work in int4,
which is I think you know quite quite
hard then you can get massive you know
speed ups and your model will be more
you know efficient. Um
now there's another uh thing which we'll
talk about later which is you know often
you'll train your model using more sane
you know floating point but when it
comes to inference you can go crazy and
you take your pretrained model and then you
can quantize it and get a lot of the
gains from very very uh aggressive
quantization. So somehow training is a
lot um more difficult to do with low
precision but once you have a trained
model it's much easier to um make it low
precision. Okay. So um I will wrap up
there just to conclude um we have talked
about the different primitives to use to
train a model building up from tensors
all the way to the training loop. We
talked about u memory accounting and
flops accounting. um for these simple
models. Um hopefully once you go through
assignment one, all these concepts will
be really solid because you'll be um uh
applying these ideas for the actual
transformer. Okay, see you next time.