
Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lec. 2: Pytorch, Resource Accounting

By Stanford Online

Summary

Key takeaways

  • **Training cost scales with parameters and data**: The total computational cost for training a model, including forward and backward passes, is approximately six times the product of the number of data points and the number of parameters. This 'napkin math' helps estimate training resource needs. [50:00], [02:14:00]
  • **Memory usage depends on data type and model components**: The memory required for a model is determined by the number of parameters, activations, gradients, and optimizer states. Using lower precision data types like BF16 or FP8 can reduce memory footprint but may introduce instability. [02:30:00], [06:03:00]
  • **Matrix multiplication dominates compute**: Matrix multiplications are the most computationally intensive operations in deep learning. The flop count for a matrix multiplication is roughly two times the product of its three dimensions, making it the primary factor in performance. [23:49:00], [37:37:00]
  • **Hardware performance varies by data type**: The theoretical peak performance (flops per second) of hardware like the H100 is highly dependent on the data type used. FP32 operations are significantly slower than BF16 or FP8 operations, highlighting the importance of mixed-precision training. [35:35:00], [44:10:00]
  • **Parameter initialization prevents training instability**: Initializing model parameters with a standard Gaussian distribution can lead to exploding values, especially in deep networks. Rescaling by the square root of the input dimension, a technique related to Xavier initialization, helps maintain stable training. [01:00:43], [01:01:33]
  • **MFU measures hardware utilization**: Model Flops Utilization (MFU) compares the actual computed flops by a model to the theoretical peak performance of the hardware. An MFU above 0.5 is considered good, indicating efficient use of computational resources, especially when matrix multiplications dominate. [44:20:00], [45:02:00]

Topics Covered

  • The Myth of Full Precision: FP32 vs. BF16 in Deep Learning
  • Flops vs. Flops/sec: Differentiating Computation and Hardware Speed
  • The Unsung Heroes: Why Matrix Multiplications Dominate Compute
  • The 6x Rule: Understanding Model Training Costs
  • Memory Accounting: Tensors, Gradients, and Optimizer States

Full Transcript

Okay, so last lecture I gave an overview

of language models and what it means to

build them from scratch and why we want

to do that. Also talked about

tokenization which is going to be the

first half of the first assignment. Um

today's lecture will be going through

actually building a model. Um we'll

discuss the primitives in PyTorch that

are needed. Um, we're going to start

with tensors, build models, optimizers,

and training loop. And we're going to

place close attention to efficiency, in

particular, how we're using resources,

both um memory and

compute. Okay. So, to motivate things a

bit, here's some here's some

questions. Okay. Um, these questions are

going to be answerable by napkin math.

So, get your napkins out. Um, so how

long would it take to train a 70 billion

parameter dense transformer model on 15

trillion tokens on 1,024

H100s? Okay, so you know, I'm just going

to sketch out the sort of the give you a

flavor of the type of things that we

want to do. Okay, so here's how you go

about reasoning it. Um, you count the no

total number of flops um needed

to um, you know, train. So that's six

times the number of parameters times the

number of tokens. Okay. And where does

that come from? That will be uh what

we'll talk about in this lecture. Um you

can look at the promised number of flops

per second that H100 gives

you. Um the MFU, which is something

we'll see later. Let's just set it to

0.5. Um and you can look at the number

of flops uh per day that your hardware

is going to give you at this particular

MFU. So 1,024 of them

um for um you know one day and then you

just divide the total number of flops

you need to train the model by the

number of flops that you're supposed to

get. Okay. And that gives you about

144 days. Okay. So these are very simple

calculations at the end of the day. Um
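
As a rough sketch, that napkin math looks like this in code (the 989 teraflop/s dense BF16 figure and the 0.5 MFU are the assumptions quoted above):

```python
# Napkin math: training a 70B-parameter dense transformer on 15T tokens on 1,024 H100s.
total_flops = 6 * 70e9 * 15e12            # ~6 * parameters * tokens (forward + backward)
h100_flops_per_sec = 989e12               # promised H100 BF16 throughput, dense (no sparsity)
mfu = 0.5                                 # assumed model flops utilization
flops_per_day = 1024 * h100_flops_per_sec * mfu * 60 * 60 * 24
print(total_flops / flops_per_day)        # roughly 144 days
```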

we're going to go through a bit more

where uh these numbers uh come from and

in particular where the six times number

of um parameters times number of tokens

comes from. Um okay so here's a

question. What is the largest model you

can train on eight H100s using AdamW

if you're not being uh too

clever. Okay so um H100 has 80 gigabytes

of HBM uh memory. Um the number of bytes

per parameter that you need for the

parameters the gradients optimizer state

is uh 16 and we'll talk more about where

that comes from. Um and the number of

parameters uh is basically total amount

of memory divide by number of bytes you

need per parameter and that gives you uh

about 40 uh billion parameters. Okay.
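
As a sketch of that calculation, assuming the 16 bytes per parameter mentioned above (parameters, gradients, and AdamW optimizer state, before activations):

```python
# Napkin math: largest model trainable on 8 H100s with AdamW, ignoring activations.
hbm_bytes_per_gpu = 80e9                  # 80 GB of HBM per H100
bytes_per_parameter = 4 + 4 + (4 + 4)     # fp32 params + grads + Adam first/second moments
num_parameters = 8 * hbm_bytes_per_gpu / bytes_per_parameter
print(num_parameters)                     # ~40 billion parameters
```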

And this is very rough because it

doesn't take into account activations, which

depends on batch size and sequence

length which I'm not really going to

talk about but will be important for

assignment

one. Okay. So this is a rough back

envelope calculation. And this is

something that you're probably not used

to doing. You just implement a model,

you train it and what happens happens.

But remember that efficiency is the name

of the game. And to be efficient, you

have to know exactly how many flops

you're actually expending because when

these numbers get large, these directly

translate into dollars and uh you want

that to be as small as possible. Okay,

so we'll talk more about the the details

of how these numbers uh arise.

Um you know, uh we will not actually go

over the transformer. So Tatsu is going

to talk at a um over the conceptual

overview of that next time. Um, and

there's uh many ways you can learn about

transformer if you haven't already uh

looked at it. There's assignment one. If

you do assignment one, you'll definitely

know what a transformer is. Um, and the

handout actually does a pretty good job

of walking through all the different

pieces. There's a mathematical

description. If you like pictures,

there's pictures. There's a lot of stuff

you can uh look on online. So, but

instead I'm going to work with simpler

models and really talk about the

primitives and the resource accounting

piece. Okay. So remember last time I

said what kinds of knowledge can you

learn? So mechanics um in this lecture

it's going to be just PyTorch and

understanding how PyTorch works at um a

fairly primitive uh level. So that's

will be pretty straightforward. Mindset

is about resource accounting and it's

not hard. It's just you just have to do

it. And intuitions um unfortunately this

is just going to be broad strokes uh you

know for now. Actually, there's not

really much intuition that I'm going to

talk about in terms of how um anything

we're doing translates to good models.

Um this is more about the mechanics

and

mindset. Okay. So, let's start with

memory accounting. Um and then I'll talk

about compute accounting. Um and then

we'll build up bottom up. Okay. So, the

best place to start is a tensor. So

tensors are the building blocks for

storing everything in deep learning:

parameters, gradients, optimizer state,

uh data, activations. There's sort of

these atoms. Um you can read lots of

documentation about them. Um you're

probably very familiar with how to

create tensors. There's creating tensors

different ways. Um you can also create a

tensor and not initialize it and use

some special uh initialization um for

the parameters um to um if you want. Um

okay so those are um you know

tensors. Um so let's talk about memory

and how much uh memory tensors take up.

So every tensor that we'll probably be

interested in is stored as a

floating-point number. And so there's many

ways to represent floating point. So the

most default uh way um is float 32. And

float 32 has 32 uh bits. They're

allocated one for sign, eight for

exponent, and 23 for the you know the

fraction. So exponent gives you dynamic

range and fraction gives you you know uh

different basically specifies different

values. Um so float 32 is also known as

FP32 or single uh you know precision um

is sort of the gold uh standard um in

you know computing. Some people also

refer to float 32 as full precision.

That's a little bit confusing because

full is really depending on who you're

talking to. Um, if you're talking to a

scientific computing person, they'll

kind of laugh at you when you say float

32 is uh is really full because they'll

use float 64 or even more. Um, but if

you're talking to a machine learning

person, float 32 is the max you'll ever

probably need to go because deep

learning is is kind of sloppy like

that. Okay. So, let's look at the

memory. So, the memory is it's very

simple. It's determined by the number of

values you have in your tensor and the

data type of each value. Okay, so if you

create a torch sensor of 4x8 matrix, um

the default it will give you a type of

uh float 32 um the size is 4 by 8 and the

number of elements is 32. Each element

size is four bytes. Um 32 bits is four

bytes. Um and the memory usage is simply

the number of elements times the number

of um uh size of each element and

that'll give you 128 bytes. Okay, so

this is should be pretty um easy and

just to give some intuition, one matrix

in the feedforward layer of uh

GPT-3 is this number by this number and

that gives you 2.3 gigabytes. Okay, so

that's one matrix is you know can these

matrices can be pretty big.
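
Here is a small sketch of that memory accounting in PyTorch (the 12288 x 49152 feedforward shape is just an illustrative GPT-3-scale guess):

```python
import torch

x = torch.zeros(4, 8)                         # dtype defaults to torch.float32
print(x.dtype, x.numel())                     # torch.float32, 32 elements
print(x.element_size())                       # 4 bytes per float32 element
print(x.numel() * x.element_size())           # 32 * 4 = 128 bytes

# A single big feedforward weight matrix (illustrative GPT-3-scale shape) in float32:
print(12288 * 49152 * 4 / 1e9)                # a couple of gigabytes for one matrix
```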

Okay, so uh float 32 is the default. Um

but of course these matrices get big. So

um naturally you want to make them

smaller so you use less memory. And also

it turns out if you make them smaller

you also uh make it go faster too. Okay.

Um so another type of uh representation

is called float 16 and as the name

suggests it's uh 16 bits um where both

the exponent and the fraction are um

shrunk down from um 8 to 5 and 23 to 10.

Okay, so this is known as half precision

and it cuts down half the memory. Um,

and that's all great except for the

dynamic range uh for these float 16

isn't great. So, for example, if you try

to make a number um like 1e-8

in float 16

um it uh basically rounds down to zero

and you get underflow. Okay. So uh the

float 16 is not great for representing um

very small numbers or or very big

numbers as a matter of fact. Um so if

you use float 16 u for training for

small models it's probably going to be

okay but for large models when you're um

having lots of matrices and you can get

uh instability or underflow or overflow

and bad things happen. Um okay so um one

um thing that has happened which is nice

is there's been another representation

of B float 16 which stands for brain

float um this was developed in 2018 to

address the issue that for deep learning

um we actually care about dynamic range

more than uh we care about this

fraction. So basically BF16 allocates

more to the exponent and less to the

fraction. Okay. So it uses the same

memory as um floating point 16 but it

has a dynamic range of float 32. Okay.

So that sounds really good. Um and it

actually uh the catch is that this

resolution which is determined by the

fraction is worse. Uh but this doesn't

matter as much uh for deep learning. So

now if you try to create a a tensor with

one E minus 8 in BF16 then you get

something that's not

zero. Okay. So um you can dive into the

details. I'm not going to go into this

but you can stare at the actual full

specs of all the different floating

point you know

operations. Okay. Okay, so BF16 is

basically what uh you will typically use

to do computations because um it's it's

sort of you know good enough um for your

computations. It turns out that um for

storing optimizer states and parameters

you still need float 32 um for otherwise

your training will go um haywire.

Um so if you're bold so now we have um

something called FP8 or uh 8bit and as

the name suggests um this was developed

in 2022 um um by you know Nvidia so now

they have um essentially if you look at

FP16 and BF16 it's like this and with FP8, wow,

you really don't have that many bits to

store stuff right so it's very crude Um

there's there's two sort of variants

depending on if you want to have more

resolution or more um dynamic range. Um

and um I'm not going to say too much

about this but FP8 is you know supported

by H100. It's not really available on

previous generation.
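
A quick sketch of the dynamic range point above: float16 underflows at 1e-8 while bfloat16, which keeps float32's exponent bits, does not.

```python
import torch

for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    x = torch.tensor(1e-8, dtype=dtype)
    # float16 rounds 1e-8 down to 0 (underflow); bfloat16 trades resolution for range.
    print(dtype, x.element_size(), "bytes, value =", x.item())
```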

Um but but at a high level, you know,

training with float 32, which is I

think, you know, is is what you would do

if you're not trying to optimize, you

know, too much and it's it's sort of

safe. It requires more memory. Um you

can go down to FP8 or

BF16. Um and but you can get some

instability. Um, basically I don't think

you would probably want to use uh a

float 16 at this point for for deep

learning. And um you can become more

sophisticated by looking at particular

places in your pipeline either forward

pass or backward pass or optimizers or

gradient accumulation and really figure

out what the minimum precision you need

at this particular places. And that's

called gets into kind of mixed precision

training. So for example, some people

like to use float 32 for the you know the

attention to make sure that doesn't kind

of get messed up, whereas for

simple forward passes with matmuls,

BF16 is

fine. Okay, pause a bit for

questions. So we talked about tensors

and we looked at uh depending on how

what representation how much storage

they take.

Yeah. Can you just clarify about mixed

precision, like when you would use 32

versus BF16? Yeah. So the question is when

would you use um float 32 or BF16?

Um I don't have time to get into the you

know the exact uh details and it sort of

varies depending on the model size and

everything but um generally for the the

parameters and optimizer states you use

float 32. You can think about BF16 as

something that's more transitory like

you you basically take your parameters

you cast it to BF16 and you kind of run

ahead with that model but then the thing

that you're going to accumulate over

time you want to have higher precision.

Yeah. Okay. So now let's talk about uh

compute. So that was memory.

Um so

uh compute obviously depends on what the

hardware you know is by default tensors

are stored in CPU. So for example if you

just in PyTorch um say x equals

torch.zeros(32, 32), then it'll put it on your CPU.

It'll be in um the CPU memory. Um now,

of course, that's no good because if

you're not using your GPU, then you're

going to be orders of magnitude too

slow. So, you need to explicitly say in

PyTorch that you need to move it to the

GPU and this is uh you it's actually

just to make it very clear in pictures,

there's a CPU, it has RAM, and that um

has to be moved over um to the the GPU.

there's a data transfer um which

takes some work um takes some

time. Okay so whenever you have a tensor

in PyTorch you should always keep in

your mind where is this residing because

just looking at the variable or just

looking at the code you can't always

tell and if you want to be careful about

computation and data movement you have

to really know where it is um you can

probably do things like assert

um where it is in various places of code

just to um document or be

sure. Okay. So
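
For example, a small sketch of checking where a tensor lives and what a move costs in memory (this assumes a CUDA GPU is available):

```python
import torch

x = torch.zeros(32, 32)
assert x.device == torch.device("cpu")        # tensors live on the CPU by default

before = torch.cuda.memory_allocated()
y = x.to("cuda:0")                            # explicit CPU -> GPU transfer
z = torch.zeros(32, 32, device="cuda:0")      # or create directly on the GPU
after = torch.cuda.memory_allocated()

assert y.device == torch.device("cuda", 0)
print(after - before)                         # 2 * 32 * 32 * 4 = 8192 bytes
```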

um so let's look at uh what hardware we

have. So we have um in this case we have

one GPU. Um this was run on the H100

clusters that you guys have access to.

And um this um GPU is a you know H100 80

GB of high bandwidth memory

um and there's a you know it gives you

the cache size and so on. Okay. So

um so if you have remember the x is on

uh CPU um you can move it uh just by

specifying you know .to(), which is a kind

of a general PyTorch function. Um you

can also create a tensor directly on a

GPU so you don't have to move it at all.

Um and if everything goes well, I'm

looking at the memory allocated before

and after. Uh the difference should be

exactly uh two 32 by 32 um matrices

of four-byte floats. Okay. So um it's

8192. Okay. Okay, so this is just a

sanity check that um the code is is uh

is doing what is

advertised. Okay, so now you have your

tensors on the GPU. Um what do you uh

do? So there's many operations um that

you'll be needing for assignment one and

in general to do any deep learning

application. Um and most tensors you

just create by performing operations on

other tensors and each operations has

some memory and compute uh footprint. So

let's make sure we understand uh that.

Um so first of all um what is actually a

tensor in PyTorch right tensors are a

mathematical object in PyTorch they're

actually pointers into some allocated

memory. Okay. So if you have let's say

um a matrix 4x4 matrix um what it

actually looks like is a long

array and what the tensor has is

metadata that specifies how to get to

address into that array. And the

metadata is going to be um you know two

numbers, a stride for each, or actually

one number per dimension of

the tensor. In this case because

there's um you know two dimensions it's

uh you know stride zero and stride one

um stride zero uh specifies if you were

in dimension zero to get to the next row

to increment that index how many do you

have to skip and so going down the rows

you skip four so stride uh zero is four

and um to go to the next column um you

skip one. So stride one is one. Okay. So

with that to find an element let's say

(1, 2), um it's simply

just um multiply the indexes by the

stride and you get to your index which

is uh six here. So that would be here or

or here. Okay. So that's basically

what's going underneath the hood um for

for for tensors.
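
Concretely, here's a sketch of how strides address into the underlying storage:

```python
import torch

x = torch.arange(16).view(4, 4)      # a 4x4 tensor over a single length-16 storage
print(x.stride())                    # (4, 1): skip 4 to move down a row, 1 to move across a column

# Finding element (1, 2) by hand: index = 1 * stride[0] + 2 * stride[1] = 6.
row, col = 1, 2
index = row * x.stride(0) + col * x.stride(1)
print(index, x[row, col].item())     # 6, and storage position 6 holds that element
```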

Okay, so this is um relevant because um

you can have multiple tensors that use

the same storage and this is useful

because you don't want to copy the

tensor all over the all over the place.

So imagine you have a 2x3 matrix here.

Um many operations um don't actually

create a new tensor. They just create a

different view and doesn't make a copy.

So you have to keep make sure that um um

your mutations uh if you start mutating

one tensor it's going to cause the other

one to

mutate. Okay. So um for example if you

just get um row

zero okay so remember y is uh this

tensor and sorry x is 1 2 3 4 5 6 and y

is x[0] which is just the first

row. Okay. And um you can sort of double

check. There's this you know function I

wrote that says if you look at the

underlying storage whether these two

tensors have the same storage or not.

Okay. So this um definitely doesn't copy

the tensor. It just creates a view. Um

you can get column one. Um this also

doesn't uh copy the tensor. Um oops

don't need to do that. Um you can call a

view function which can take any tensor

and uh look at it in terms of the

different dimensions. Um 2x3 uh um

actually this should be maybe the other

way around. Uh as a 3x2 tensor

um so that's also doesn't you know

change uh do any copying. Um you can

transpose um that also doesn't copy. Um

and then you know like I said if you

start mutating X um then Y actually gets

mutated as well because X and Y are just

pointers into the same underlying

storage. Okay. So things are um one

thing that you have to be careful of is

that some views are contiguous which

means that if you run through the tensor

it's like just slide going through the

the uh array in your storage. um but

some are not. So in particular if you

transpose it now you're you know what

does it mean when you're you're

transposing it you're you're sort of

going down now so you're kind of if you

imagine going through the tensor you're

kind of skipping around and if you have

a non-contiguous tensor um then if you

try to further view it in a different

way then this is not going to

work. Okay. So in some cases if you have

a non-contiguous tensor you can make it

contiguous first and then uh you can

apply whatever viewing operation

you want to it and then in this case x

and y um do not have the same storage

because contiguous in this case uh makes

a

copy. Okay. So um this is just ways of

slicing and dicing a tensor. Um uh views

are free so feel free to use them, define

different variables um to make it

sort of easier to read your code um

because they're not allocating any

memory. Um but you know remember that

contiguous or reshape which is basically

contiguous view um uh can uh create a

copy and so just be careful what you're

doing. Okay, questions before moving

on. All right. So, um hopefully some a

lot of this will be review for those of

you have uh kind of done a lot of

PyTorch before, but it's uh helpful to

just do it systematically, make sure

we're on the same page. So, um here are

some operations that uh do create new

tensors and in particular elementwise

operations all create new tensors

obviously because you need somewhere

else to store the new value. Um there's

a you know triangular (triu) elementwise

operation that uh comes in handy when

you want to create a causal attention

mask um which you'll need for your um

your assignment. Um but nothing is

that interesting uh here um

um okay so let's talk about matmuls.

So the bread and butter of deep learning

is matrix multiplications. And I'm sure

all of you have done a matrix

multiplication, but just in case this is

what it looks like. Um you take a 16x32

times a 32x2 matrix, you get a 16x2

matrix.

Um

and but in

general when we do you know our machine

learning application um all operations

are you want to do in a batch and in the

case of language models this usually

means um for every example in a batch

and for every sequence in a batch you

want to do something. Okay, so generally

what you're going to have in instead of

just a matrix is you're going to have a

tensor where the dimensions are

typically batch sequence and then

whatever thing you're you're trying to

do. In this case, it's a matrix for

every um token in your in your data set.

And

so you know PyTorch is nice enough to

make this uh work well for you. So when

you take this for

um the dimension tensor and this matrix

um what actually ends up happening is

that for every batch every example in

every token you're multiplying these two

um matrices. Okay. And then the result

is that you get your your resulting

matrix for each of the first two

elements. So this is just like a there's

nothing uh fancy going on but this is

just a pattern that um I think uh you is

helpful to you know think

about okay so I'm going to take a little

bit of a a digression and talk about um

einops um and so the motivation

for einops is the following so normally in

PyTorch you define some um you know

tensors. And then you see stuff like this

where you take x and multiply by y

transpose(-2, -1) and you're

you kind of look at this and you say

okay what is minus2 um well I think

that's uh this sequence and then minus1

is this hidden because you're indexing

backwards. Um, and it's really easy to

mess this up because if you look at your

code and you see minus one, minus two,

um, you're kind of if you're you're

good, you write a bunch of comments, but

then the comments are can get out of

date with the code and then, um, you you

have a bad time debugging. So the the

solution is to use inops here. So this

is inspired by Einstein's summation not

notation. Um, and the idea is that we're

just going to name all the dimensions.

um instead of you know relying on

indices essentially. Okay. So there's a

library called jaxtyping which um is

is is helpful for um as a as a way to

specify the dimensions in the types. So

normally in PyTorch you would just

define write your code and then you

would comment oh here's what the

dimensions would be. So if you use

jaxtyping then um you have this notation

where as a string you just write down

what the dimensions are. So this is a

slightly kind of more natural way of

documenting. Um now notice that there's

no enforcement here right because

PyTorch types are sort of a a little bit

of a lie in in in PyTorch. Uh so it can

be enforced. You you can use a checker,

right? Yeah, you can write a check but

um not by default.

Yeah.

Um okay. So let's look at uh the you

know einsum. So einsum is basically

matrix multiplication on steroids with

good bookkeeping. Um so here's our

example here. We have X which is let's

just think about this as uh you have a

batch dimension, you have a sequence

dimension and you have uh four um

hiddens and Y is the same size. um you

originally had to do this thing. Um and

now what you do instead is you basically

write down the the dimensions uh names

of the dimensions of the two uh tensors.

So batch sequence one hidden batch

sequence two hidden and you just write

what you dimension should appear in the

output. Okay. So I write batch here

because I just want to uh basically uh

you know carry that over and then I

write seq1 and seq2 and notice

that I don't write hidden and any

dimension that is not named in output is

just summed over and any dimension that

is um named is sort of just iterated

over. Okay. So

um once you get used to this, this is

actually very very uh you know helpful.

It may maybe looks if you're seeing this

for the first time it might seem a bit

you know strange and long but u trust me

once you get used to it'll be better

than um doing minus2 minus one. Um if

you're a little bit uh you know slicker

you can um use dot dot dot to represent

broadcasting over any number of

dimensions. So in this case um uh

instead of writing batch I can just

write dot dot dot and this would handle

the case where instead of maybe batch I

have batch one batch two or or some

other uh arbitrary long sequence.
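
Putting that together, here's a small sketch with jaxtyping-style annotations and einsum (the shapes here are made up for illustration):

```python
import torch
from einops import einsum
from jaxtyping import Float

x: Float[torch.Tensor, "batch seq1 hidden"] = torch.randn(2, 3, 4)
y: Float[torch.Tensor, "batch seq2 hidden"] = torch.randn(2, 5, 4)

# Old style: positional dims plus a comment you hope stays accurate.
z_old = x @ y.transpose(-2, -1)      # (batch, seq1, seq2)

# einsum style: name the dims; "hidden" is missing from the output, so it is summed over.
z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2")

# "..." broadcasts over any number of leading (e.g. batch) dimensions.
z_b = einsum(x, y, "... seq1 hidden, ... seq2 hidden -> ... seq1 seq2")

assert torch.allclose(z, z_old) and torch.allclose(z, z_b)
```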

Yeah question does for compile this like

is it guaranteed to compile to

like I guess so the question is is it

guaranteed to compile to something

efficient? Um this uh I think the short

answer is yes. Um I don't know if you

have any you know nuances. We will

figure out the best way to reduce the

best order of uh dimensions to reduce

and then use that. If you're using it

within torch.compile it'll only do that one

time and then you know reuse the same

implementation over and over again

better than anything designed by hand.

Yeah. Okay.

So um so let's look at reduce. So reduce

operates on one tensor and it basically

aggregates some dimension or dimensions

of the tensor. So you have this tensor

uh before you would write mean to sum

over the the the final dimension and now

you basically say um actually okay so

this replace this with sum. So reduce

and um again you say hidden and hidden

is disappear disappeared. So which means

that you are um aggregating over that

dimension. Okay. So you can check that

this indeed kind of works over

here. Okay. So, so maybe one final

example of uh this is um sometimes

um in a you know tensor one dimension

actually represents multiple dimensions

and you want to unpack that and operate

over one of them and pack it back. So um

in this case uh let's say you have batch

uh sequence and then this

eightdimensional vector is actually a

flattened representation of number of

heads times some you know hidden

dimension. Okay so and then you have a

way a vector that is um needs to operate

on that hidden dimension. So you can do

this very elegantly uh using um einops

by um calling rearrange and this

basically you can think about it we saw

view uh before it's kind of like um kind

of a you know fancier version which

basically looks at the same data but uh

um you know differently. Um so here uh

it basically says this dimension is

actually heads in hidden one. I'm going

to explode that into two dimensions. Um,

and uh, you have to specify the the

number of heads here because there's

multiple ways to split a number into,

you know, two. Um, let's see. This might

be a little bit long. Um, okay, maybe

it's not worth looking at right now.

Um and given that x you can uh

perform your transformation using

einsum. So this is something hidden one

which corresponds to x and then hidden

one hidden two which corresponds to w

and that gives you um something hidden

too.

Okay. Um and then you can rearrange

back. So this is the just the inverse of

breaking up. So you have the your two

dimensions and you group it into one. So

that's just a flattening operation

that's uh um you know with uh everything

all the other dimensions kind of left

alone. Okay. So there is a tutorial um

for for this that I would recommend you

go through and it gives you a bit more.

So um on you don't have to use this um

because you're building it from scratch.

So you can kind of do anything you want

but um in assignment one we do give you

guidance and it's something probably to

invest

in.
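
Here's a sketch of that reduce / rearrange pattern (the sizes, two heads of hidden dimension four, are arbitrary):

```python
import torch
from einops import einsum, rearrange, reduce

x = torch.randn(2, 3, 8)          # (batch, seq, heads * hidden1) with heads=2, hidden1=4
w = torch.randn(4, 4)             # maps hidden1 -> hidden2

# reduce: the named dim that disappears from the output ("hidden") is aggregated over.
totals = reduce(x, "batch seq hidden -> batch seq", "sum")

# rearrange: unpack the flattened last dim, transform it, then pack it back up.
x_split = rearrange(x, "batch seq (heads hidden1) -> batch seq heads hidden1", heads=2)
y = einsum(x_split, w, "... hidden1, hidden1 hidden2 -> ... hidden2")
y_flat = rearrange(y, "batch seq heads hidden2 -> batch seq (heads hidden2)")
print(totals.shape, y_flat.shape)  # torch.Size([2, 3]) torch.Size([2, 3, 8])
```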

Okay.

Um so now let's talk about

uh computation uh the cost of tensor

operations. So we introduced a bunch of

operations um and you know how much do

they cost? So a floating-point operation

is uh any operation on floating point numbers like

addition or multiplication. These are

them and um these are kind of the you

know main ones that are going to um I

think matter in terms of flop count. Um,

one thing that is uh is sort of a pet

peeve of mine is that when you say

flops, it's actually unclear what you

mean. So you could mean flops with a

lowercase s, which stands for number of

floating-point operations. This measures the

amount of computation that you've done

or you could mean flops um also written

with a uppercase s which means floating

points per second which is used to

measure the speed of hardware. So um

we're not going to in this class use

uppercase s because I find that very

confusing and just write slash s to

denote that this is floating point per

second.

Okay. Okay. So just to give you uh some

intuition about flops. Uh GPT-3 took

about uh 3e23 flops. GPT-4 was 2e25

flops, um, speculation, and there was a US

executive order that any foundation

model with over 1e26 flops had to be

reported to the government, um which now

has been revoked, um but the EU

still has something that hasn't been:

the EU AI Act, which has a

1e25 threshold and hasn't been revoked

um so uh you know some intuitions uh

A100 has a peak performance of 312

teraflop per second.

Um and um H100 has a peak performance of

um 1979 teraflop per second with uh

sparsity and approximately 50% without

um and um if you look at uh you know the

Nvidia has these specification sheets um

so you can see that um the the flops

actually depends on what you're trying

to do. So if you're using FP32,

it's actually really really bad. Like

the if you run FP32 on H100, you're not

getting um it's it's orders of magnitude

worse than if you're doing um FP16 or um

and if you're willing to go down to FP8,

then it can be even faster. And you

know, for the for when I first read it,

I didn't realize, but there's an

asterisk here and this means with

sparsity. So uh usually you're in a lot

of the m matrices we have in this class

are dense. So you don't actually get

this. You get something like uh you know

it's exactly half that number. Exactly

half.

Okay. Okay.

So now you can do a back-of-the-envelope

calculation. Eight H100s for two

weeks is um just you know eight times

the number of flops per second times the

number of seconds in a in a in a week.

Um actually this is this might be one

week. Okay, so that's one week and

that's uh 4.7e21

um which is you know some number and you

can kind of contextualize the the flop

counts with other um model counts.

Yeah. What does sparsity mean? So

what does sparsity mean? That means if

your matrices are uh sparse it's a

specific like structured sparsity. It's

like two out of four elements in each

like group of four elements is zero.

That's the only case if you get that

that speed. No one uses it.

Yeah, it's the marketing department that uses

it.

Okay. So, um let's go through a simple

example. So, remember we're not going to

touch the transformer, but I think even

a linear model gives us a lot of the

building blocks and intuitions. So

suppose we have n points, each point is

d-dimensional, and the linear model is

just going to map each d-dimensional

vector to a k dimensional uh vector.

Okay. So let's set some number of points

um is B, dimension is D, K is the number

of outputs. Um and let's create our data

matrix X um our weight matrix uh uh W

and the linear model is just some map.

So nothing you know too too interesting

going on. And

um you know the question is how many

flops was that?

And uh the way you would you look at

this is you say well um when you do the

matrix multiplication you have basically

for every i, j, k triple I have to multiply

two numbers

together and I also have to add that

number to the the the total. Okay. So

the answer is 2 times

um the basically the product of all the

dimensions involved. So the the left

dimension, the middle dimension and the

right

dimension. Okay, so this is something

that you should just kind of remember if

you're doing a matrix multiplication.

The number of flops is two times um the

product of the three dimensions.
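
As a sketch of that flop counting for the linear model above (sizes are arbitrary):

```python
import torch

B, D, K = 1024, 2048, 512          # number of points, input dim, output dim
x = torch.randn(B, D)
w = torch.randn(D, K)
y = x @ w                          # each of the B*K outputs needs D multiplies and D adds

flops = 2 * B * D * K              # rule of thumb: 2 x product of the three dimensions
print(f"{flops:.2e} flops")        # ~2.1e9 flops for this matmul
```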

Okay, so the flops of other operations

are usually kind of linear in the size

of um the the matrix or tensor. Um and

in general, no other operation you

encounter in deep learning is expensive

as matrix multiplication for large

enough uh matrices. So this is why I

think a lot of the napkin math is very

simple because we're only looking at the

matrix multiplications that are going

are performed by the model. Um now of

course there are regimes where if your

matrices are um small enough then the

cost of other things starts to dominate

but generally that's not a good regime

you want to be in because the hardware

is designed for big matrix multiplication.

So um sort of by it's a little bit

circular but by kind of we end up in

this regime where we only consider uh

models where the matmuls are the

dominant um uh you know

cost. Okay any questions about this this

number two times the product of the

three dimensions. This is just a useful

thing. Would the algorithm of matrix

multiplication always be the same? Because

the chip might have optimized

that. Is it always the same?

Yeah. So the question is like uh is the

does this essentially does this depend

on the the matrix multiplication you

know algorithm. Um in general I guess

we'll look at this the next week when we

or the week after when we look at

kernels. I mean actually there's a lot

of optimization that goes underneath

under the hood when it comes to matrix

uh multiplications. Um and there's a lot

of specialization depending on the

shape. Um so this is I would say this is

just a kind of a crude you know estimate

um that is ba basically like the right

order of

magnitude. Okay. So um yeah uh additions

and multiplications are considered

equivalent. Yeah, additions and

multiplications are considered are

equivalent.

So um one way I find helpful to

interpret this so at the end of the day

this is just a matrix multiplication but

I'm going to try to give a little bit of

meaning um to this um which is why I've

set up this as kind of a little toy

machine learning problem. So B really

stands for the number of data points and

D times K is the number of parameters. So for

this particular model the number of

flops that's required for forward pass

is two times the number of tokens or

number of data points times the number

of

parameters. Okay. So this turns out to

actually generalize to transformers.

There's an asterisk there because uh

um there there's you know the sequence

length and other stuff but this is

roughly right uh if your sequence

length isn't too

large.

Um

so okay so

now this is just a number of

floating-point operations right um so how

does this actually translate to

wall-clock time which is presumably the thing

you actually care about how long do you

have to wait for your run um so let's

time this so I have this function that

is just going to

um

um do it five times and I'm going to

perform the matrix multiply operation.

Um we'll talk a little bit later about

this um um two weeks from now why the

other code is here. Um but um for now we

get an actual time. So that matrix uh

took you know 0.16 seconds

um and the actual flops per second which

is how many flops did it do per second

is uh 5.4

E13. Okay. So now you can compare this

with you know the marketing materials

and um for the A100 and H100

um and you know as we look at the spec

sheet the the flops depends on the data

type and we see that the the promise

flops per second which um you know for

H100 for I guess this is for float 32 is

you know 67 you know teraflops as we

looked

and so that is the number of promise

flops per second we we had um and now if

you look at the uh there's a helpful

notion called model flops utilization or

MFU which is the actual number of flops

divided by the promise flops okay so you

take the actual number of flops remember

which

what you actually uh you know witnessed

the number of floatingoint operations

that that are useful for your model um

uh divided by the actual time it took

divided by this promise flops per second

which is you know from the glossy

brochure um you can get an MFU of about

0.8.

Okay. So, usually you see people talking

about their MFUs and um something

greater than 0.5 is you usually

considered to be good and if you're like

you know 5% MFU that's uh considered to

be really bad. usually can't get, you

know, close to that close to, you know,

uh, you know, 90 or 100 um because this

is sort of ignoring all sort of

communication and overhead. It's just

like the literal uh computation of the

flops. Okay. And usually MFU is much

higher if the matrix

multiplications dominate. Um, okay.
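
Here's a rough sketch of that MFU calculation (it assumes a CUDA GPU; the 67 teraflop/s "promised" number is the H100 float32 figure quoted above):

```python
import time
import torch

def time_matmul(x, w, trials=5):
    torch.cuda.synchronize()                   # wait for pending GPU work before timing
    start = time.time()
    for _ in range(trials):
        y = x @ w
    torch.cuda.synchronize()
    return (time.time() - start) / trials

B, D, K = 16384, 32768, 8192
x = torch.randn(B, D, device="cuda")
w = torch.randn(D, K, device="cuda")

actual_time = time_matmul(x, w)
actual_flops_per_sec = 2 * B * D * K / actual_time
promised_flops_per_sec = 67e12                 # promised float32 throughput from the spec sheet
print(actual_flops_per_sec / promised_flops_per_sec)   # the MFU
```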

So that's MFU. Any any questions about

uh this? Yeah. Uh you're using the

promised flops per sec uh not considering

the sparsity?

So this promised flops per sec is uh not

considering the sparsity. Yeah.

Um one a note is like uh this is

actually a you know there's also

something called hardware you know flops

utilization and the motivation

here is that we are all we're trying to

look at

the it's it's called model because we're

looking at the number of effective

useful operations that the model is you

know performing okay And uh so it's a

way of kind of standardizing. It's not

the actual number of flops that are are

done because you could have optimization

in your code that cache a few things or

redo um you know um recomputation of

some things and in some sense you're

still computing the same model. So what

matters is that you're this is sort of

trying to look at the model complexity

and you shouldn't be penalized just

because you were clever in your MFU if

you were clever and you didn't actually

do the flops um but you you said you

did. Okay. So you can also do the same

with

BF16.

Um Oops.

And here we see that for uh BF16 um

the time is actually much uh better

right, so 0.03 um instead

of 0.16, so the actual flops per second is

higher um but even accounting for

sparsity the the promised flops is uh

still quite high so the MFU is actually um

lower for BF

16. Um this is you know maybe

surprisingly low but

um but sometimes the promise flops is a

bit of a you know

optimistic. So always benchmark your

code uh and don't just uh kind of assume

that you're going to get certain um

levels of

performance. Okay. So uh just to

summarize um matrix multiplications

dominate the the compute and the general

rule of thumb is that it's two times

the product of the dimensions

flops. Um

the flops per second floating points per

second depends on you know the hardware

and also the data type. So the fancier

the hardware you have, the higher it is,

the the smaller the data type, the

usually the faster it is. Um and uh MFU

is a useful notion to look at how well

you've you're essentially squeezing you

know your your hardware.

Yeah, I I've heard that often uh in

order to get like the maximum

utilization, you want to use these like

tensor cores on the machine. And so like

does PyTorch by default use these tensor

cores and like are these uh accounting

for that? Yeah. So the question is uh

what about those tensor cores? So if you

go to this um spec sheet um you'll see

that

um you know these are all on the tensor

core. So the tensor core is basically

you know you know specialized hardware

to do um matmuls. Um so if you are

um you know if you're so by default it

should use it and if you especially if

you're using PyTorch you know compile it

will generate the the code that will um

use the the hardware

properly. Okay. So um let's talk a

little about you know gradients. So, and

the the reason is that we've only looked

at matrix multiplication or in other

words basically feed forward um forward

passes and the number of flops. Um but

there's also a computation that comes

from computing gradients and we want to

track down how much uh that is. Um

okay so just to consider a simple

example simple linear model um where you

take the uh the prediction of a linear

model and you um you look at the MSE

with respect to five. though not a very

interesting loss but I think it'll it's

illustrative for uh looking at the the

gradients. Okay. So remember in the

forward pass you have your x you have

your your w which um you want to compute

the gradient with respect to you make a

prediction um by taking a linear product

and then you you have your loss. Okay.

And in the backward pass uh you just

call loss backwards. And um in this case

the the gradient which is this variable

attached to the tensor is um turns out

to be what you what you

want. Okay. So everyone has done uh you

know gradients in PyTorch before.

Um so let's look at how many flops uh

are required for computing gradients.

Um okay.

So, let's uh look at a slightly more

complicated model. So, now it's uh a

two-layer linear model where you have X

which is B by D times um W1 which is D

by D. So that's the first layer and then

you take the your hidden activations um

H1 and you pass it through another

linear layer W2 and to get a K

dimensional vector and you do some

compute some

loss. Okay, so this is a two-layer

linear network. And just as a kind of

review, if you look at the number of

forward flops, um what you had to do was

um you have to multiply look at W1. You

have to multiply X by W1

um and add it to your

H1 and you have to take H1 and W2 and

you have to add it to your your H2.

Okay, so the total number of flops

again is 2 times the the product of all the

dimensions in your matmul plus two

times the product of the dimensions of your matmul

for the second matrix. Okay, in other

words, two times the total number of

parameters in this

case. Okay, so so what about the

backward pass? So this part will be um a

little bit more

involved. Um so we can recall the model

uh x to h1 to h2 and the loss. So in the

backward path you have to compute a

bunch of gradients and the gradients

that are relevant is you have to compute

uh the gradient with respect to um h1

um you know h2 w1 and

w2 of of the loss. So D loss D each of

these variables. Okay. So how long does

it take to compute that? Let's just look

at W2 for now. Okay. So the things that

touch W2 um you can compute by looking

at the the chain rule. Um so

w2.grad. So the gradient, d loss /

d w2, is um you sum um h1 times uh the

gradient of the loss with respect to h2.

Okay. So that's just a chain rule for um

for w2. And this is um so all the

gradients are the you know the same size

as the underlying you know of vectors.

Um

so this turns out to be um essentially

looks like a matrix uh you know

multiplication and so the same you know

calculus holds which is that it's 2

times the number of uh the product of

all the dimensions B * D *

K okay but this is only the gradient

with respect to uh

W2 We also need to compute the gradient

with respect to h1 because we have to

keep on back propagating um to w1 and

and so on. Okay. So that is uh going to

be um you know the product of w2

um you know times uh you know

h2, sorry, I think this should be the

grad of h2, h2.grad.

Um so that turns out to also be

essentially you know um looks like the

matrix multiplication and it's the same

number of flops for for computing the

gradient of uh each

one. Okay. So when you add the two uh so

that's just for w2. Um you do the same

thing for W1 and that's uh which has D *

D parameters and when you add it all uh

up it's uh so for this um for W2 the

amount of computation was 4 * B * D * K

and for W1 it's also 4 * B * D * uh D

because W1 is D by D.

Okay. So, um I know there's a lot of

symbols here. I'm going to try also to

uh give you a visual account for this.

So, this is from a a blog post that I

think um may work uh better. We'll see.

Um okay, I have to wait for the

animation to loop back. So, so basically

this is uh one layer of the neuron now.

It has um you know the hiddens and then

the weights to the next layer. And so I

have to Okay, the problem with this

animation is I have to

wait. Okay, ready set. Okay, so first I

have to multiply W and A and I have to

add it to this. That's a forward pass.

And now I'm going to multiply this these

two and then add it to to that. And I'm

going to multiply and then add it to

that. Okay.

Any

questions? Wish there were a way to slow this

down. Um, but you know, the details

maybe I'll I'll let you kind of ruminate

on, but um the the high level is that

there's two times the number of

parameters for the forward pass and four

times the number of parameters for the

backward pass. And um we can just kind

of work it out uh via the chain rule

here. Yeah. For the homeworks, are we

also using the you said some PyTorch

implementation is allowed, some isn't.

Are we allowed to use autograd or are we

doing it like entirely by hand, doing

the gradient? Uh so the question is in

the homework are you going to compute

gradients by hand? And the answer is no.

Uh you're going to just use PyTorch

gradient. This is uh just to break it

down so we can do the the um counting

flops. Okay. Any any questions about

this before I move on.
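
A small sketch tallying this up for the two-layer linear model above:

```python
B, D, K = 1024, 256, 64            # data points, hidden dim, output dim (arbitrary)
num_params = D * D + D * K         # W1 is D x D, W2 is D x K

forward_flops = 2 * B * D * D + 2 * B * D * K     # = 2 * B * num_params
# Backward: per weight matrix, one matmul-like pass for the weight gradient and
# one for the gradient flowing back into the previous activations.
backward_flops = 4 * B * D * D + 4 * B * D * K    # = 4 * B * num_params
total_flops = forward_flops + backward_flops      # = 6 * B * num_params

assert total_flops == 6 * B * num_params
```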

Okay, just to summarize the forward pass

is for this particular model is 2 times

the number of data points times the

number of parameters and backward is

four times the number of data points

times the number of parameters which

means that total it's six times number

of data points times parameters okay and

that explains why uh there was that

six in the beginning when I asked the

motivating uh

question So now this is for a simple you

know linear model. Um it turns out that

for many models this is uh basically the

bulk of the computation,

um when essentially every computation

you do uh you know touches

essentially new parameters, roughly

right and you know obviously this

doesn't hold you can find models where

this doesn't hold because you can have

like one parameter through parameter

sharing um and have you know a billion

flops but that's generally not what

models look like.

Okay, so let me um move

on. Um so far I've I've basically

finished talking about the the resource

accounting. So we looked at tensors. We

looked at some computation on tensors.

We looked at how uh much tensors take to

store and also how many flops takes

tensors uh take when you do various

operations on them. Um now let's start

building up uh different you know

models. Um I think this part isn't

necessarily going to be um you know that

uh you know conceptually you know

interesting or challenging but it's more

for maybe just you know

completeness. Okay. So

um so parameters in PyTorch are stored

as these nn.Parameter

um objects. Um let's talk a little bit

about parameter initialization.

Um so if you have let's say a

uh um you know parameter that has uh

okay so you you generate um a okay do

sorry, your w parameter is an input

dimension by hidden dimension matrix um

you're still in the linear model case so

let's just generate an input and let's

feed it through the output. Okay so randn,

a unit Gaussian,

is uh you know seems innocuous. Um what

happens when you do this is that if you

look at the output um you get some

pretty large numbers right and this is

because when you you know have the the

number uh grows as essentially the

square root of the the hidden dimension.

Um and so when you have large models, uh

this is going to you know blow up and um

and training can be very unstable.

Um so so typically what you want to do

is initialize in a way that's um you

know invariant to hidden or at least you

know when you you're guaranteed that

it's not going to you know blow up. And

one simple way to do this is just

rescale by the one over square root of

number of you know of inputs. Um so

basically let's redo this w equals you

know parameter where I simply divide by

the square root of the input dimension.

Um and then now when you uh feed it

through the output now you get things

that are stable around you know this

will actually concentrate to um you know

something like a Normal(0, 1).

Okay, so this is basically you know this

has been explored pretty extensively in

deep learning literature is known uh up

to a constant as Xavier initialization

and typically I guess it's uh fairly

common if you want to be extra safe uh

you don't trust the normal because it

doesn't have it has unbounded tails and

you just say I'm going to truncate to

minus 3 to 3 so I don't get any large values

and I don't want any to to mess with

that.
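
As a sketch of that initialization point (the 1/sqrt(input_dim) rescaling plus the optional truncation; the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 1024, 2048
x = torch.randn(4, input_dim)

# Naive: each output is a sum over input_dim terms, so it grows like sqrt(input_dim).
w_naive = nn.Parameter(torch.randn(input_dim, hidden_dim))
print((x @ w_naive).std())         # roughly sqrt(1024) = 32: blows up for large models

# Rescaled (Xavier-style up to a constant): outputs concentrate around Normal(0, 1).
w = nn.Parameter(torch.randn(input_dim, hidden_dim) / input_dim ** 0.5)
print((x @ w).std())               # roughly 1

# Extra-safe variant: truncate the standard normal to [-3, 3] before rescaling.
w_trunc = nn.Parameter(
    nn.init.trunc_normal_(torch.empty(input_dim, hidden_dim), a=-3, b=3) / input_dim ** 0.5
)
```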

Okay.

Um Okay. So

um so let's build a you know just a

simple model.

Um it's going to have um dimensions and

two layers. There's this you know I just

made up this name cruncher. It's a

custom model which is a deep linear

network which has um num layers layers

and each layer is a a linear um model

which has essentially just a matrix

multiplication. Okay.

Um so um this parameters of these this

model is looks like um I have um

um layers for the the first layer um

which is a D byD matrix uh the second

layer which is also a D byD matrix and

then I have a um a head um or a final

layer. Okay. So if I get the number of

of parameters of this model, then it's

going to be um d^2 + d ^2 + d. Okay, so

nothing too surprising there. Um and I'm

going to move it to the GPU because I

want this to run fast. And um

I'm going to generate some random data

and feed it through the model. And the

forward pass is just going through um

the layers um and then finally applying

the

head. Um okay. So with that model I

let's try to I'm going to use this model

and do some you know stuff with it. Um

but just one kind of general digression.

Um, randomness is something that uh is

is sort of can be annoying in some cases

if you're trying to reproduce a bug for

example. It shows up in many places

initialization dropout data ordering.

Um, and just a best practice is I we

recommend you always fix a random

seed. Um, so you can reproduce your your

model or at least as well as you can.

Um and in particular having a different

random seed for every source of

randomness is is you know nice because

then you can for example fix

initialization or fix the data ordering

but vary other other things. Um

determinism is your your friend when

you're debugging. Um and you know in

code unfortunately there's you know many

places where you can use randomness and

just be uh cognizant of you know which

one you're using and just if you want to

be safe just uh set the seed for all of

them. Um data loading um I guess I'll go

through this quickly. It's it's not um

it'll be useful for your your

assignment. Um so in language modeling

data is typically just a sequence of

integers because this is remember output

by the tokenizer. Um and you serialize

them into um you can serialize them into

NumPy arrays. Um and one I guess thing

that's maybe useful is that uh you don't

want to load all your data into memory

at at once because for example the llama

data is 2.8 terabytes. Um, but you

can sort of pretend to load it by using

this uh handy function called memmap. So,

which gives you essentially a variable

that is mapped to a file. So when you

try to access um the data it actually on

the on demand loads the file and then

using that you can uh create a data

loader that um you know is a you know

samples data from your um batch. So I'm

going to skip over that just in interest

of time. Um let's talk a little bit

about you know optimizer. So we've

defined our

model.

Um so there's many optimizers. Um just

kind of maybe going through the

intuitions behind some of them. So of

course there's stochastic gradient

descent. You compute the gradient of

your batch. You take a step in that

direction. No questions asked. Um

there's an idea called momentum which

dates back to classic optimization (Nesterov),

um where you uh have a running average

of your um gradients and you update

against the you know the running average

instead of your instantaneous you know

gradient. Um and then you have um AdaGrad

which um you you know you scale the

gradients by uh your

um your the average over the the norms

of your or I guess not the norms the the

square of the gradients. You also have

RMSProp which uh is an improved version

of AdaGrad which uses an exponential

averaging rather than just like a flat

average. And then finally Adam which

appeared in 2014 which is essentially

combining RMSProp and momentum. So

that's why you're maintaining both uh

your running average of your gradients

but also running average of your

gradient squared. Okay. So since you're

going to implement uh Adam um in

homework one I'm not going to do that

instead I'm going to implement um you

know AdaGrad. So the way you implement

an

optimizer in in you know PyTorch is

that you override the the optimizer

class and you have to um uh let's see

maybe

I'll and I'll get to the implementation

once we step through it Um so let's

define some data. Um compute the forward

pass on the loss and then you compute

the gradients and then you when you call

optimizer.step

um this is where the optimizer uh

actually is active. So what this looks

like is um your parameters are grouped u

by for example you have one for the

layer zero layer one and then the final

you know weights um and you can access a

state which is a a dictionary from

parameters to um you know whatever you

want to store as optimizer state. um the

gradient of that parameter you assume is

already uh calculated um by the the

backward pass. Um and now you can do

things like um you know

in in AdaGrad you're storing the sum of

the gradient uh squares. So you can get

that G2 variable um and you can update

that based on the square of the

gradient. So this is element wise

squaring of the um the gradient and you

put it back into the

state. Okay. So then your obviously your

optimizer is responsible for updating

the parameters and this is just the you

know you update the learning rate times

the gradient divided by this uh scaling.

So now this state is kept over across

multiple uh invocations of um you know

the

optimizer. Okay. So um and then at the

you know end of your optimizer step you

can you know free up the memory uh just

to um which is I think going to actually

be more important when you look when we

talk about model parallelism.
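
Here is a minimal sketch of such an AdaGrad optimizer (not necessarily the exact code from the lecture): the state dictionary holds a running sum of squared gradients per parameter, and step applies the scaled update.

```python
import torch
from torch.optim import Optimizer

class AdaGrad(Optimizer):
    def __init__(self, params, lr=0.01, eps=1e-10):
        super().__init__(params, dict(lr=lr, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:            # e.g. layer 0, layer 1, final weights
            lr, eps = group["lr"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]               # per-parameter optimizer state
                g2 = state.get("g2", torch.zeros_like(p))
                g2 += p.grad ** 2                   # accumulate elementwise squared gradients
                state["g2"] = g2
                p -= lr * p.grad / (g2.sqrt() + eps)   # scaled gradient step
```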

Okay, so let's talk about the memory

requirements of uh the optimizer states

and actually basically at this point

everything. So um you need to uh the

number of parameters in this model is d²

times the number of layers plus d for

the final uh head. Okay.

Um the number of

activations. So this is something we

didn't do before but now for this simple

model it's fairly easy to do. It's just

um B times you know D times the the

number of of layers you have for every

layer for every data point for every

dimension you have to hold the

activations.

Um for the gradients um this is the same

as the number of parameters and the

number of optimizer states in for

AdaGrad it's uh you remember we had to

store

um the the gradient squared. So that's

another copy of the parameters. So

putting all

together we have um the total memory is

assuming you know FP32 which means four

bytes times the number of parameters,

number of activations, number of

gradients and number of optimizer

states. Okay. And that gives us uh you

know some number which

um is 496

here. Okay. So this is um a fairly

simple calculation in the assignment

one. Um you're going to do this for the

transformer which is a little bit more

involved because you have to there's not

just matrix multiplications but there's

um many matrices there's attention and

there's all these other things but the

general form of the calculation is the

same. You have parameters activations

gradients and optimizer states.
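
A sketch of that overall accounting for the deep linear model, assuming float32 (4 bytes) everywhere:

```python
B, D, num_layers = 64, 1024, 2       # batch size, dimension, number of layers (arbitrary)

num_parameters = num_layers * D * D + D    # one D x D matrix per layer plus the final head
num_activations = B * D * num_layers       # one D-vector per layer per data point
num_gradients = num_parameters             # one gradient value per parameter
num_optimizer_states = num_parameters      # AdaGrad: one squared-gradient sum per parameter

total_bytes = 4 * (num_parameters + num_activations + num_gradients + num_optimizer_states)
print(total_bytes)
```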

Okay. And

the so and the flops required again for

this model is six times uh the number of

tokens or number of data points times

the number of parameters and um you know

that's basically concludes the resource

uh accounting for this particular model.

Um and uh if for reference if you're

curious about working this out for for

transformers you can um consult um some

of these

articles. Okay. So in the remaining time

I think um maybe I'll pause for

questions. Um and we talked about

building up the tensors and then we

built a kind of a very small model and

you know we talked about optimization

and how many how much memory and how

much uh compute was required.

Yeah.

So the question is why do you need to

store the activations? So naively you

need to store the activations because

when you're uh when you're doing the

backward pass the gradients um of let's say

the first layer depend on the

activation. So the gradients of the I

layer depends on the activation there.

Um now if you're smarter you don't have

to store the activations or you don't

have to store all of them. You can

recompute them and that's a

technique uh called activation

checkpointing which we're going to talk

about later.

Okay, so let's just do this quick uh you

know actually there's not much to say

here but you know here's your typical

you know training loop um where you

define the model define the optimizer

and you get the data um feed forward

backward and take a step in a parameter

space
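
In code, a loop of that shape looks roughly like this (the model, optimizer, and batch here are stand-ins for whatever you defined above):

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 1)                                   # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 64), torch.randn(32, 1)             # stand-in batch of data

for step in range(100):
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss = ((model(x) - y) ** 2).mean()    # forward pass
    loss.backward()                        # backward pass: populates .grad
    optimizer.step()                       # take a step in parameter space
```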

um And

um I

guess it'll be more interesting I I

guess next time I should show like

actual one plug which isn't available

on this uh on this version but

um

so one note about checkpointing

um so training language models takes a

long time and you certainly will crash

at some point so you don't want to lose

all your progress.

Um so you want to periodically save your

model to disk and just to be very clear

the thing you want to save is both the

model and uh the

optimizer and probably you know which

iteration you're on. Um I should add

that and then you can just load it up.
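
A sketch of that checkpointing, saving the model, the optimizer state, and the iteration together:

```python
import torch

def save_checkpoint(path, model, optimizer, iteration):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "iteration": iteration,
    }, path)

def load_checkpoint(path, model, optimizer):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["iteration"]
```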

Um one maybe final note and I'll uh end

is um I alluded to kind of mix precision

your training. Um you know choice of the

data type has the different trade-offs.

If you have higher precision it's uh

more accurate and stable um but it's

more expensive and lower precision vice

versa. Um and as we mentioned before by

default the recommendation is use float

32 um but try to use BF16 or even

FP8 whenever possible. So you can use um

lower precision for the forward pass

um but float 32 for the rest. Um and this

is an idea that goes back to the you

know 2017 there's exploring mixed

precision training. Um, PyTorch has um

some tools that automatically allow you

to do, you know, mixed precision

training. Um, because it can be sort of

annoying to have to specify which parts

of your model it needs to be, you know,

what precision. Um, generally you define

your model as, you know, sort of this

clean modular thing and and the

specifying the precision is sort of like

a, you know, something that needs to cut

across that.
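
For example, a sketch with torch.autocast, which runs the matmul-heavy forward pass in bfloat16 while the parameters and optimizer step stay in float32 (assumes a CUDA GPU):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()                   # parameters stay in float32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()                    # matmuls inside run in bfloat16
loss.backward()                                        # gradients land back in float32
optimizer.step()                                       # optimizer state stays float32
```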

Um and one I guess maybe one kind of

general comment is that people are

pushing the envelope on what precision

is needed. there's some you know papers

that show you can actually use um FP8

you know all the way you know through um

there's um I guess one of the challenges

is of course when you have lower

precision it gets very numerically un

unstable but then you can do various

tricks to you know control the uh the

numeric of your model during training so

that you don't get into these you know

bad regimes so this is where I think um

the systems and the model architecture

design kind of are synergistic because

you want to design models now that we

have a lot of model design is just

governed by hardware. So even the

transformer as we mentioned last time is

governed by the having GPUs and now if

we notice that you know Nvidia chips

have the property that if um lower

precision even like INT4 for example

is one thing. Now if you can make your

model training actually work in INT4

which is I think you know quite quite

hard then you can get massive you know

speed ups and your model will be more

you know efficient. Um

now there's another uh thing which we'll

talk about later which is you know often

you'll train your model using more sane

you know floating point but when it

comes to inference you can go crazy and

you take your pretrained model and then you

can quantize it and get a lot of the

gains from very very uh aggressive

quantization. So somehow training is a

lot um more difficult to do with low

precision but once you have a trained

model it's much easier to um make it low

precision. Okay. So um I will wrap up

there just to conclude um we have talked

about the different primitives to use to

train a model building up from tensors

all the way to the training loop. We

talked about u memory accounting and

flops accounting. um for these simple

models. Um hopefully once you go through

assignment one, all these concepts will

be really solid because you'll be um uh

applying these ideas for the actual

transformer. Okay, see you next time.
