YouTube Video
By Unknown
Summary
Topics Covered
- Always Profile Before Optimizing
- CPU and GPU Run Asynchronously
- Kernel Fusion Can Yield 8x Speedups
- Triton Matches Hand-Written CUDA
- torch.compile Beats Hand-Written Code
Full Transcript
Today we're going to go into the details of writing high-performance code for GPUs. Part of assignment two is that you're going to have to do a bunch of profiling, and you will have to write your own Triton kernel for FlashAttention-2. You will need to make all of this stuff very high performance. So in this lecture we're going to drill down a little bit and try to write some high-performance code for standard components in a language model. The plan for this lecture is that we're going to do a brief amount of review about GPU stuff.
That's just to make sure you have, once again, the basic components of GPUs that we need to understand in order to follow the rest of the lecture. Then I'm going to show you a bunch of really basic things about benchmarking and profiling, which will be helpful both for the assignment and in general if you want to write high-performance PyTorch or deep learning code. And then we're going to write some kernels. We're going to write CUDA kernels in C++. We will then do the same thing in Triton. And lastly, we're going to do the easy but very good thing of using PyTorch's existing JIT compiler to have it optimize things for us. Then we'll compare all of those, and profile and benchmark things.
And throughout, we're going to dig in deep. We're going to go all the way down to the PTX, so pretty close to the machine code, to understand what the GPU is actually doing under the hood when we write all this code. Hopefully we'll have time, and I think we will, to finish by writing a fast Triton implementation of softmax at the very end. Okay. So assignment one has come to a close. There's still a leaderboard, and you can still submit and update things there. Some of you may be using late days, so please finish up assignment one. Assignment two is now out, and as I said before, there's a bunch of systems stuff that you're going to need to do. There are fun parts that you can do now involving GPU kernels, and then next week we're going to talk about parallelism, which is the other half of the assignment: writing fast parallel code, like data parallelism and so on. We will get to that next week. All right. So now, remember how GPUs work, right?
So when we have something like an A100 or an H100, we're going to have a whole bunch of SMs, streaming multiprocessors. Within each SM is a large number of units that can do computation. We have INT32 ones or FP32 ones. Each SM is going to launch a large number of threads. And we have the memory hierarchy, which is that we have DRAM, or global memory, which is big and slow.
And then we've got caches that are much faster. In fact, you see here there's this thing called a register file. This is very, very fast memory that each thread can access, and we're going to be making heavy use of these registers as we write high-performance code for GPUs today. The basic structure of the execution model is that we have a collection of thread blocks, and a block is going to be scheduled on a single SM. So this is kind of the atomic unit that we're going to be thinking about, especially when we write code in things like Triton. Within each block there's a whole bunch of threads, and the threads are actually the ones doing the computation. So if you have a vector and you're operating over elements of that vector, you're going to write code where each thread goes in and maybe operates over a few elements of that vector at once, and all the threads together process the vector completely. So why do we have these things called thread blocks?
Why not just have threads and your big global context? Well, the threads in a thread block can communicate with each other. There's shared memory within the SM that's pretty fast. So when you need to do something like matrix multiplication, you're going to need to pass information from thread to thread. Within a thread block that's very fast; across thread blocks, or across these groups, it's going to be very expensive. So any data that you need, you're going to want to keep within the same thread block, or within the same tile, and that's going to keep things very, very fast. That's going to be about as fast as an L1 cache, and that's a great place to be.
And so you can use this to synchronize across threads. But you can't, for example, synchronize across blocks; you can't really control what's going to happen there. And remember the thing that I mentioned last week: there's this thing called warps. Warps aren't an inherent thing that you normally think about, but for performance they are an important component. When we actually run these things, the threads are grouped into consecutive blocks of 32 threads, and that's a warp, and that gets executed all at once in an SM. So one thing that we would like to do is make sure all the warps have an equal amount of computation. We can't always do that, but if we can, we would like to. We also want the number of thread blocks to ideally be a multiple of the number of SMs, so that every SM gets an equal amount of work. So we're ideally going to have a lot more thread blocks than SMs, and we're going to try to make that happen as we write high-performance code.
Okay. And then the last concept, and maybe among the most important concepts here, is arithmetic intensity. We would like to keep arithmetic intensity high: we would like to have more FLOPs than we have bytes of memory movement. This is because, if you remember the scaling plot from last lecture, our compute scaling is much, much faster than memory scaling. So a lot of the time computations are going to end up being memory bound, and we're not actually getting all of the work done. As a general rule, matrix multiplication is compute bound if we do it cleverly. Everything else is going to be memory bound, and we're going to try to cleverly reduce the number of things that are memory bound, or how badly they are memory bound. Okay. So that's our very, very brief review of GPUs. Hopefully everyone remembers this and still has a fresh memory of the execution model. Feel free to stop me and ask questions if any of you have lingering doubts or questions about how this is all going to work.
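To make the arithmetic-intensity point from the review concrete, here is a rough back-of-the-envelope sketch. It assumes an idealized kernel that reads each input matrix once and writes the output once, with 4-byte (FP32) elements; real kernels move more data than this, so treat the numbers as upper bounds. The function names are illustrative and not from the lecture.

```python
def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for multiplying two n x n matrices (idealized data movement)."""
    flops = 2 * n**3                              # one multiply + one add per inner-product term
    bytes_moved = 3 * n * n * bytes_per_element   # read A, read B, write C once each
    return flops / bytes_moved


def add_intensity(n: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for adding two n x n matrices elementwise."""
    flops = n * n                                 # one add per output element
    bytes_moved = 3 * n * n * bytes_per_element   # read A, read B, write C once each
    return flops / bytes_moved


if __name__ == "__main__":
    for n in (128, 1024, 2048):
        print(f"n={n:5d}  matmul: {matmul_intensity(n):8.2f} FLOPs/byte   "
              f"add: {add_intensity(n):.4f} FLOPs/byte")
```

Under these assumptions the matmul intensity grows like n/6 while the elementwise add stays fixed at 1/12 FLOPs per byte, which is one way to see why matrix multiplies can be compute bound and most other operations end up memory bound.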
Yes? What was the function of a warp? A warp is essentially a group of threads that get executed together. The reason why warps exist is that they reduce the amount of control machinery that's needed. Because you're executing all these threads at the same time, you don't need a control unit for each thread; you need one for each block of 32. And so you see, for example, that there are a lot more compute units than there are warp schedulers. So you're able to do a lot more parallel work without worrying about control. This is one of the trade-offs with CPUs: CPUs dedicate a lot more silicon area to control and branch prediction and things like this, whereas GPUs put much more emphasis on computation, with simpler controls.
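The bookkeeping behind warps and block scheduling can be sketched with a little integer arithmetic. This is a simplified model, not how the hardware scheduler is actually implemented: it assumes 32 threads per warp (as stated above) and that each SM runs a fixed number of blocks at a time. The helper names and the SM count used in the usage note are illustrative assumptions.

```python
import math

WARP_SIZE = 32  # threads per warp, as described in the lecture


def warps_per_block(threads_per_block: int) -> int:
    """Warps needed to cover one thread block (the last warp may be partly idle)."""
    return math.ceil(threads_per_block / WARP_SIZE)


def idle_lanes(threads_per_block: int) -> int:
    """Thread slots in the block's last warp that have no thread assigned."""
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block


def block_rounds(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> int:
    """Rounds of scheduling needed if every SM runs `blocks_per_sm` blocks at once."""
    return math.ceil(num_blocks / (num_sms * blocks_per_sm))
```

For example, a block of 100 threads needs 4 warps and leaves 28 lanes idle, and with a hypothetical 132 SMs, launching 133 blocks takes 2 rounds where the second round keeps only one SM busy. That is the motivation for choosing block counts that are a multiple of the SM count.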
Okay, so now we're going to get into newer content. I think if there's one high-level thing to remember, it's that if you want to write high-performance code, you should remember to benchmark and profile your code. That seems very obvious, but I've seen a lot of cases where students or other people go in and say, well, I think this is the bottleneck, so I'm going to spend three hours optimizing it. And it turns out it wasn't the bottleneck at all. I'm sure it was fun, but it was time that was misallocated. If you actually use a high-performance or very detailed profiler, you can see exactly where your bottlenecks are and exactly what the machine is doing. Once you have that, you can go and spend your efforts on the most important parts of your code execution. That's the high-level thing I want to get across, because some of the details about GPU execution, and how you write a softmax kernel, are going to change, and maybe you even want to just rely on the torch.compile auto-JIT thing. But the fact that you should profile isn't really going to change, no matter what the tools are. So I want you to internalize the idea that you should always be profiling if you want to be writing high-performance code.
And really, there's a limit to the theory. I think systems is a part of this course that you can reason about pretty well. Architecture is somewhat hard to reason about, and you can really think about the roofline model and so on. But how fast is your matrix multiply? Well, maybe that depends on the library version or your hardware, and on which things are bottlenecking for what reason. There are all sorts of microcode things that you don't really fully know. So in the end you have to do end-to-end benchmarking whenever you're developing these things.
Okay. So I'm going to have an example computation. This is the simplest thing we can run compared to all the things you are doing in assignment one: I'm going to run a very simple MLP. It's going to have 128 dimensions, it's going to have 16 layers, it's going to have some batch size, and it's going to have five steps. I'm just going to do forwards and backwards for five different steps here. Just to make the code clear, it's something like this. I define an MLP model, which I'll show you in a moment. Then I define a random Gaussian input, and I run it for five steps, where I compute a forward pass, then a backward pass, and then return the result, which is just the mean of the output of my MLP. There's not even a loss. It's so simple: you run the MLP forward and just average-pool at the end. And the MLP is the simplest thing you can imagine. It's just a bunch of linear layers stacked on top of each other, which is this bit, with a GeLU in between. So this is just linear, GeLU, linear, GeLU, and so on and so forth. Everything is nice and square. So hopefully this is a very simple MLP that you all feel pretty comfortable with. Let's go back. Oh, sorry, I want to go back up to here. Okay, good. So now I have this MLP code that I want to run, and I'm going to do two things. I'm going to benchmark.
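For reference, the MLP workload described above might look roughly like the following. This is a reconstruction from the spoken description rather than the code shown on screen: the names `MLP` and `run_mlp` are assumptions, the activation is assumed to be GeLU, and the sketch falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Square linear layers, each followed by a GeLU nonlinearity (assumed)."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = F.gelu(layer(x))
        return x


def run_mlp(dim: int, num_layers: int, batch_size: int, num_steps: int):
    """Build the model and input once; return a closure that does the timed work."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = MLP(dim, num_layers).to(device)
    x = torch.randn(batch_size, dim, device=device)  # random Gaussian input

    def run() -> torch.Tensor:
        for _ in range(num_steps):
            y = model(x).mean()   # no real loss: just average the outputs
            y.backward()          # backward pass through every layer
        return y.detach()

    return run
```

Returning a closure keeps model construction and input allocation out of the region that later gets timed or profiled, so only the forward and backward passes are measured.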
So I'm going to do some timings: I want to know how long this function takes to run. And then I'll do profiling, which is to go inside the function and ask where I am spending all of my time. Let's start with benchmarking. Benchmarking is just the measurement of the wall-clock time of performing these operations. I'm only looking for the end-to-end execution time of, in this case, my MLP function. There are some subtleties to this. You're sitting there thinking, why am I being told how to invoke, I don't know, the timeit function? But you do have to be a little bit careful about how you measure times, and if you're not paying attention, you will run into these pitfalls when you do assignment two. So what are we doing this for? We're going to compare implementations later. We're going to compare our Triton to our handwritten C++ to PyTorch's implementation and torch.compile, and we want to know whether it was worth it to write that CUDA kernel. We'd also like to understand, when I make my matrix multiplies bigger, how much slower it gets. So we'd like to do some empirical benchmarking of those. Throughout this lecture I'm going to be using this benchmark function, which is a wrapper function. I'll step through it. Benchmark does the following things. It takes a function that I want to benchmark, which is run. Then I do some number of warm-up iterations, and then some number of trials.
You might wonder, what's this warm-up thing that we're doing here? Well, one thing that's really important is that when you first run your PyTorch code, and let's say it dispatches something to the GPU, it might look very fast and transparent to you. But the very first time something is executed, machine code is being compiled in the background, and instructions might be being sent to the GPU. There are all sorts of things that happen to initialize your code. So you always want to do some warm-up iterations to make sure that you're not measuring the startup speed. Instead, you want to measure the steady-state speed. If you're running thousands and thousands of iterations, what you're interested in is that part, not necessarily how fast you can do on-the-fly compilation of your CUDA code. So that's why we have warm-up, and you should always have a bit of warm-up.
And then another thing that's really important, and I'll get to this once we get to the profiler, is that you want to call this thing called torch.cuda.synchronize. What is that? Well, the GPU and the CPU are basically two independent compute units in your computer, and they can run kind of independently. Their execution model is this: the Python code that I have here lives on the CPU, and when I run something, it dispatches a bunch of CUDA kernels to the GPU. It says, "Please run these things for me," and the GPU goes off and executes them. The CPU will actually go on and keep running. It doesn't wait for those CUDA executions to stop. That's great for writing high-performance code, but you should hopefully see the immediate problem if you want to do benchmarking. If you're benchmarking and you've got this model where the GPU runs off on the side while your CPU is doing something different, you're not actually measuring the GPU execution time. So torch.cuda.synchronize basically says: let's make sure that the GPU and CPU are in the same state, that there are no queued things still running, and that we're at the same point in terms of the code that's being executed. Now the GPU and CPU are in the same state, and I'm going to time it for real. I time something some number of times. I run the computation, which in this case is the sleep command, and I do it three times. Since I'm sleeping for 50 milliseconds, that's the time I get at the end. And of course I'm also calling torch.cuda.synchronize at the end of run to make sure that the GPU and CPU states are the same. If the CPU is running ahead, it's going to wait for the GPU execution to actually finish here, and vice versa. Then I average, because each single measurement might fluctuate because of things like the thermal properties of the GPU, so you want to take multiple replicates, take the mean, and return that. That's our benchmarking code. Very simple, but remember the two important pieces here: always do a warm-up, and make sure to call torch.cuda.synchronize. If you do those, it's very simple. If you forget to do those, you'll get pretty crazy numbers, like your big matrix multiply finishing instantly, which is definitely not true.
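A minimal sketch of a benchmarking helper with the two ingredients just described, warm-up and synchronization, could look like this. It is not the lecture's exact code: the signature and defaults are assumptions, and the `torch` import is made optional here so the sleep example also runs on a machine without PyTorch or a GPU, in which case synchronization is a no-op.

```python
import time
from statistics import mean
from typing import Callable

try:
    import torch  # optional: only needed when GPU work must be synchronized
except ImportError:
    torch = None


def sync() -> None:
    """Block until all queued GPU work has finished (no-op without CUDA)."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def benchmark(run: Callable[[], object], num_warmups: int = 1, num_trials: int = 3) -> float:
    """Return the mean wall-clock time of `run` in milliseconds."""
    for _ in range(num_warmups):   # warm-up: absorb compilation and initialization costs
        run()
    sync()                         # make sure warm-up work is finished before timing

    times_ms = []
    for _ in range(num_trials):
        start = time.perf_counter()
        run()
        sync()                     # wait for the GPU, or we only time the kernel launch
        times_ms.append((time.perf_counter() - start) * 1000)
    return mean(times_ms)


if __name__ == "__main__":
    print(f"sleep(50 ms): {benchmark(lambda: time.sleep(0.05)):.1f} ms")
```

Calling `sync()` inside the timed region is the important design choice: without it, the timer stops as soon as the CPU has queued the kernels, not when the GPU has finished them.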
Okay. So now we can do some benchmarking of matrix multiplies. I'm going to walk through some of these. They're just putting numbers to things that we already know, but I want to walk through it and make sure we're on the same page. I ran this on the class H100 GPUs. I'm going to do matrix multiplies over these sizes, and I'm going to collect a whole bunch of matrix multiply timings for each of these dimensions, stepping through this benchmark result. As we expect, we see superlinear scaling of our runtimes as we increase the matrix size. Of course, at the smallest sizes, like 1024 and 2048, we actually see that the times don't grow at all, because there's constant-factor overhead in just doing these matrix multiplies: these numbers have to get shipped from the CPU to the GPU, and there's overhead in launching the kernel.
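To put a number on the superlinear scaling one would expect, a dense multiply of two n x n matrices performs about 2n³ floating-point operations, so doubling n should multiply the work by 8 once fixed overheads stop dominating. The sizes below are hypothetical and only illustrate the ratio; they are not the timings from the lecture.

```python
def matmul_flops(n: int) -> int:
    """FLOPs for multiplying two n x n matrices: n^2 outputs, n multiply-adds each."""
    return 2 * n**3


if __name__ == "__main__":
    sizes = [1024, 2048, 4096, 8192]  # hypothetical sizes, each double the last
    for small, large in zip(sizes, sizes[1:]):
        ratio = matmul_flops(large) / matmul_flops(small)
        print(f"{small} -> {large}: {ratio:.0f}x the FLOPs")
```

Measured runtimes only approach this 8x ratio at larger sizes, because at small sizes the constant launch and transfer overhead described above hides the cubic term.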
So it's not the case that it's superlinear all the way down to zero. But once the matrices get big enough, we see exactly the kind of scaling that we expect to see with our matrix multiplies. Okay, hopefully that's straightforward. Now let's try to benchmark our MLP. What are we going to do? We're going to make our MLP bigger. We're going to have 256 dimensions, four layers, a batch size of 256, and take two steps. What's the time it takes to do that? Well, it takes 6.2 seconds. Now I can do some basic things. I can scale the number of steps from two to five and benchmark all of those, and I get two, three, four, and then five steps. Unlike in the matrix multiply case, if I'm scaling the number of steps, so the number of forward and backward passes on my MLP, how do I expect the runtime to behave? Well, I expect linear scaling, and that's kind of what we see. There's about five seconds per MLP execution, and we see it's about n times five for the end-to-end runtime. Okay, let me see if I can reset the thing that's being monitored here. Oh, nope, I can't. I'm going to zoom out a little bit, sorry about that. Now we can also scale the number of layers from two to three, four, and five. What does that give us? Well, it gives us increasing runtimes, once again linear in the number of layers. This time, once again, one layer takes about five seconds, a little bit less than that, and so we get about four times the number of layers, and linear scaling shows up again. Unsurprising, right? Both steps and layers obviously have linear relationships with the runtime, and that is exactly what we end up seeing here. I'm going to skip the batch size thing, because this is getting a little bit unwieldy in terms of the number of things being tracked here. Okay. All right.
So that's the end of this benchmarking bit. We can make this nice function that does a little bit of warm-up, calls torch.cuda.synchronize, and measures the runtime of anything we want. This is good, and you should do this all the time in your code. You can measure how long it takes for your new fancy architecture to run. But if you want to fix some problems, benchmarking is a very coarse-grained tool. It tells you that your code is slow, but it doesn't tell you where the time is being spent. So what we would like to do instead is profiling, which is a much more fine-grained thing. Profiling is really nice because it not only helps you see where the time is being spent, in which functions, but it also shows you what is being called. Usually you interact with the PyTorch interface, the parts of PyTorch that you call, but beneath PyTorch there's this whole universe of CUDA stuff that's being called. When you run a profiler, you can see all the way down to the low-level calls, what is actually being called. So you can get a much nicer intuition for how the program is actually being executed on the hardware. We'll step through profiling a few simple functions and get a little bit of intuition about what is happening. One of the nice things is that if you want basic profiling, PyTorch has a very nice built-in profiler that you can use. This allows you to not leave the Python and PyTorch world and still get some fairly reasonable-looking outputs.
So I've profiled some functions here, and you can see the output of this as well. I've taken the sleep example from before. Here is the sleep function, and when we profile it, the profile function looks something like this. I have a warm-up again, I have torch.cuda.synchronize, and then I call the profiler, tracking both CPU and GPU times. Then I run something, synchronize again, and print out the average table across all the time.
Okay. So I go back now, and I'm going to profile the sleep function. What happens here? Well, 100% of the time is being spent on something called cudaDeviceSynchronize, because there's no GPU work being done. This is just a no-op; it's kind of a silly thing to be profiling. So now let's look at something non-trivial: this basic operation of adding two matrices. I defined an add function that takes in an A and a B and adds them together, and this is a helper function that instantiates two random Gaussian matrices and then invokes whatever is in the operation argument. So this is adding two 2048-size matrices together.
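The profiling helper and the add example described above can be sketched with PyTorch's built-in profiler as follows. The function names `profile_fn` and `make_binary_op` are hypothetical stand-ins for the lecture's helpers, and the sort key and row limit are arbitrary choices. The sketch records CUDA activity only when a GPU is present, so it also runs on a CPU-only machine.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_fn(run, num_warmups: int = 1, row_limit: int = 10) -> str:
    """Profile one call to `run` and return the per-operator summary table."""
    use_cuda = torch.cuda.is_available()
    for _ in range(num_warmups):            # warm-up so one-time setup is not profiled
        run()
    if use_cuda:
        torch.cuda.synchronize()

    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        run()
        if use_cuda:
            torch.cuda.synchronize()        # wait for GPU kernels so they are recorded

    return prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=row_limit)


def make_binary_op(dim: int, operation):
    """Create two random dim x dim matrices and return a closure applying `operation`."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(dim, dim, device=device)
    b = torch.randn(dim, dim, device=device)
    return lambda: operation(a, b)


if __name__ == "__main__":
    print(profile_fn(make_binary_op(2048, lambda a, b: a + b)))
```

Swapping the lambda for `lambda a, b: a @ b` profiles a matrix multiply with the same helper, which makes it easy to compare which low-level kernels each operation dispatches to.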
Okay. So now I'm going to profile this. I call the profiler, and I get back something that looks like this block over here. I'm going to have to zoom back out, because this is not going to be... all righty. Okay. Is this visible from the back? Can someone give me a thumbs up if it's visible from the back, or a thumbs down if it's not? Okay, good. All right. When we call the add function in Python, this is all that we interact with: this add function, a plus b. That's all we think about. But underneath, beneath the tip of the iceberg so to speak, there's a lot more that happens. This gets dispatched to the GPU. First, there's this thing called ATen, which is the C++ interface for PyTorch. This wrapper gets called, and it says, okay, I'm going to add some numbers. That's the outer wrapper. That then dispatches to a particular kernel called vectorized_elementwise_kernel<4, at::native::CUDAFunctor_add...>, and this is the thing that's actually doing the adding. Then there's this other thing called cudaLaunchKernel that's taking some time. This is the CPU taking the command and sending it over to the GPU. That's the kernel launch, and that takes some time. And finally there's cudaDeviceSynchronize: we're waiting for the GPU to finish and send things back to us, and that also takes some time. The mere act of having a synchronization barrier is going to cost us some time. So in the end we have the total time here: 1.44 milliseconds on the CPU and 17 microseconds on CUDA. So it's really fast on the GPU, slower on the CPU. And if we look at the CPU time that's being spent, which is the self CPU time, we see that the C++ interface, or the C interface, is actually the thing that's costing us a whole bunch of CPU time. There's overhead to doing anything where we're sending stuff over to the GPU. So that's the add function, and we see what's happening under the hood.
Same story here if I want to do a matrix multiply. I'm doing A multiplied by B, so this is a matrix multiply of A and B, using 2048-size matrices once again. Then I do profiling. This time I see aten::matmul, which is the lower-level interface for doing matrix multiplies. This dispatches to CUTLASS, which is NVIDIA's high-performance matrix multiply CUDA library, and then to a very particular CUTLASS kernel, which is going to have some tile size. The names are truncated here; I'll show you a more detailed version in a minute. This is basically pointing towards a very particular set of tile sizes, number of blocks, and so on. So this thing is parameterized, and that's what's actually doing the matrix multiply. Once again we see the same two things at the bottom: the kernel launch and the synchronization of CUDA devices. And you can see once again the CPU time and CUDA time split. We're spending way more time in CUDA, because matrix multiplies do take more time than just adding two vectors.
Okay. Any questions so far? I can pause for a moment here. I think I've just been going very quickly, and on my own, through the profiler, so if anyone has questions I can stop for a moment. If not, I can keep going. Oh, yes? In this case our CUDA time is greater than our CPU time, but we did have a barrier that told the CPU to wait for it to synchronize. So shouldn't the CPU time always be at least the same? Is it counting that time? Yeah, I don't think this counts that time. Cool. Oh yes, sorry, there's too much there. Is there any particular reason why, when we switch from adding to matmul, the CPU time went down? Is there a reason why the CPU time goes down when we go from adding to matmul? That I am not sure about, to be entirely honest. Yes? Is there overhead in the profiler that can distort things compared to running it in the real world? Yes, there is overhead in the profiler; the barriers will do that. I'll show you a more advanced profiler from NVIDIA, and you can add things like annotations that will also slightly distort the timings, but not by much. The really large-scale things that you see aren't going to be distorted much by the profiler. So if you're looking at micro-timings, yes, probably. But for a lot of the things that we care about in the class, no.
Yes? Just to make sure I'm interpreting this correctly: for the add case, is the 98% the CPU being utilized over the millisecond time period? That's right. Yeah, so this is the percentage of time, of the actual millisecond time as you can see, that aten::add was executing in some capacity on the CPU. So it's not the percentage of what the CPU is doing? Yeah, that's right. This is the time that the CPU is active, not percentage utilization. So this is not like the total amount of CPU FLOPs or something. This is the percentage of time that the CPU is doing something. Yes. Okay, cool. All right.
Here's another example of a matmul. This one has a different dimensionality: I'm multiplying 128-dimensional matrices here, so 128 by 128, much smaller.
And you'll see that now it's directly executing a different command. It's executing xmma_gemm. GEMM is a matrix multiply type, and this one is float32, float32.
You can see from the naming of this kernel what's actually happening here, which is that this is a tiled matrix multiply of some kind, and it's not going through CUTLASS. It's executing this particular command directly. So for a small matrix multiply, you see that it's dispatching to a different kernel.
So now you can see the complexity of matrix multiply. When we're operating at this high level of abstraction, we just think of matrix multiply as a single thing, right? We call a @ b and we're done.
But under the hood, depending on the dimensionality that you have and the hardware that you have, it will actually dispatch to very different matrix multiply primitives. And that will manifest in very different performance characteristics.
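To make the "tiled matrix multiply" idea a bit more concrete, here is a rough pure-Python sketch. It is only an illustration of the loop structure: the tile size of 2 is an arbitrary assumption, and a real GPU kernel would pick tile shapes to suit the hardware and run the tiles in parallel rather than in nested loops.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), one tile x tile output block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # tile row of the output
        for j0 in range(0, m, tile):          # tile column of the output
            for k0 in range(0, k, tile):      # march along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(matmul_tiled(a, identity))
```

Different tile sizes give the same answer but very different memory access patterns, which is one reason a library may keep several kernels around and choose between them by matrix shape.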
One fun tip: torch.compile, which I will talk about later, actually has an option to microbenchmark the matrix multiply performance on your hardware, and it will then pick the highest-performing matrix multiply subroutines for your model, which in the past I've found gives you something like 10% speedups for free. It's very cool that optimizing for these things actually gives you free gains out in the real world.
Okay, so that's another matmul example. The cool thing about the profiler compared to just the raw benchmarking is that we can now see which CUDA kernels are being called.
We can see that different sizes of matrices lead to different CUDA kernels. And we see names like cutlass_80_simt, where CUTLASS is this CUDA linear algebra library, and the name tells us things like the tile size.
So far these operations are very boring in a way. Matrix multiplies and adds are basically one-to-one: you have an operation on the CPU side, it translates to a GPU operation, and it just gets shipped over. So there's just a single operation in all of these that does anything on the GPU.
So I want to look at some more complicated operations, two more of these, that have more compound behavior. First, I want to look at this operation called torch.cdist, which computes the pairwise Euclidean distance between two sets of vectors. So this is going to be a big distance matrix computation between the a's and the b's. That's cdist.
And this is obviously a much more complicated operation: if you want to compute Euclidean distances, you're going to need to compute dot products, you're going to need to compute square roots, and we're going to see that once we run cdist. So now here is the profiled output of cdist.
We see that this torch Python command does map, in the C interface, to some lower-level cdist: this is aten::cdist, which then maps to aten::_euclidean_dist. And then this decomposes into a whole bunch of things like aten::matmul, aten::pow, and then sum, because these are all primitives that you're going to need in order to compute the Euclidean distances between all of your vectors.
And for each one of these, the matrix multiplies and concatenation and taking the powers, you have a corresponding CUDA command that is being called here. We have GEMM, which we've become familiar with. This is a matrix multiply, and it's taking 78% of our compute time on the GPU.
We've got copies and concatenation of arrays, which take 6% of the execution time. Then this vectorized elementwise kernel, which is taking the power, takes 5% of the GPU time, and 3% goes to the sum.
So now we get this very nice low-level breakdown of where my GPU is spending all of its time. And from this, I can get some sense of where I should spend my time optimizing. Maybe I think I can optimize my matrix multiply. That would be great, because that's 70-plus percent of the time spent on the GPU.
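One way to see why a distance computation would break down into matrix multiplies, powers, and sums is the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. The sketch below uses that expansion in plain Python. It is an assumption that this is roughly what the library does internally; the profile above only shows which primitives get called, not the exact formula.

```python
import math


def cdist_decomposed(a, b):
    """Pairwise Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b."""
    a_sq = [sum(x * x for x in row) for row in a]    # pow, then sum
    b_sq = [sum(x * x for x in row) for row in b]    # pow, then sum
    out = []
    for i, row_a in enumerate(a):
        row = []
        for j, row_b in enumerate(b):
            dot = sum(x * y for x, y in zip(row_a, row_b))   # one matmul entry
            d2 = a_sq[i] + b_sq[j] - 2.0 * dot
            row.append(math.sqrt(max(d2, 0.0)))      # clamp tiny negative rounding error
        out.append(row)
    return out


dists = cdist_decomposed([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0]])
print(dists)
```

The dot products for all pairs are exactly the entries of a matrix product, which is consistent with the matrix multiply dominating the GPU time in the breakdown.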
The final two examples that I want to talk about are GELU and softmax. So these will be our running... oh, sorry, there's a question. [inaudible]
Okay, so I will maybe answer that question in a few minutes, because there's a cooler profiler that shows you a much nicer picture. I could gesticulate here, but I think it'll be better to show that with pictures.
Okay, so now I'm going to talk about the GELU and the softmax. The GELU is going to be our running example throughout the class. This is a nonlinearity; if you remember, it's the Gaussian error linear unit. And that's going to be the input multiplied by the Gaussian CDF, which is often approximated with a tanh, if I remember right.
So we're going to have all sorts of operations. We're going to add a and b and then call GELU, simulating the linear-plus-nonlinear structure that we might have in our MLP. And we see once again basically the same sort of mapping. We see aten::add corresponding to a + b, and then we have the CUDA equivalent, and then we actually have a GELU function implemented in CUDA, which is all the way down here, and that takes about 33% of the compute. Okay, fairly reasonable.
And then we have once again the softmax. I won't go through all of these in gory detail, since they all start to look the same after a while. But the thing to really point out, which I think is cool, is that for a lot of these really core primitives like softmax and GELU, there are kernels written for them. So it's not like the GPU is executing the basic primitives one at a time. There's a fused operator that computes all of this, so there's no back and forth between CPU and GPU for all of these. So, okay.
I mentioned before that I was going to answer this question of what the CPU was doing. So let's think about something a little more sophisticated. I had the MLP example that I started with for benchmarking, and let's say I would like to optimize that MLP and make it run really fast. How can we do that?
Well, ideally we would profile this in a nice, fine-grained way. If we use the torch profiler, this is what we would get. If you remember the MLP, there are stacked linear layers, and there's a forward and a backward. And you see roughly that there's this backward thing that's happening, there's a matrix multiply, there's linear, and then there's the accumulate-grad operation for the backward. And here's the matrix multiply kernel. There are only 10 things that can fit here, so I think this gets cut off at a certain point.
But this is nice. It does tell you that most of the time is being spent in the matmuls. But you do wonder where all the rest of the time goes. Why does only 31% of my time show up here, and where's the 60% here? It's an aten::mm, but there's no corresponding kernel. This is a little bit mysterious, and for a very complex module this is not a very good visualization.
For that, I think we have to get out a real grown-up profiler, and we will ask you to look at this thing, which is NVIDIA's Nsight Systems. This is NVIDIA's detailed way of looking at GPU behavior and performance, and so we will actually see exactly what is happening as we run this MLP.
Actually, in the back, can you see this tiny text over here? Thumbs up? Okay. All right. If you can see it, then I'm not going to zoom in, but it does seem small even from here. All right. So
basically, if we look here, we see several different things. We see CUDA HW over here, and then we see threads. The top half, this CUDA part, is what the GPU is doing. And in this threads part, we see what the CPU is doing.
I can also pull up the code, I think. Yes, the code here: when I profiled it, I added a few annotations. Okay, this one I'll zoom in on for sure. Excellent. All right.
So I've annotated the code with this set of things that says NVTX, which basically annotates my code with markers. So when the profiler comes in here, it will know that this piece of code belongs to a block called define_model. And, for example, this part that says range_push and range_pop for the step: this range here, from line 77 to line 55, should be annotated with something that says step_ followed by the step number.
Okay, so I've added all these annotations in my code before calling my profiler. So let's go back here.
So now if we go to this line that says NVTX, we can see define_model, which is what I wrapped my model construction call in. And then I see step 0, step 1, step 2, step 3, step 4, step 5. So each step is now nicely annotated in this profiler, and we can see all of the things that the model is doing as it goes along.
I'll start on this side. One thing we see is that this piece of code doesn't do very much work. It takes only 14 seconds, so actually most of the time in the profile is spent on overhead. The part up until roughly here is things like just loading the libraries, and that takes a long time: apparently 7.5 seconds just to initialize everything.
Then, at least on the GPU, at 7.5 seconds or so into the program, it starts actually building the model. And you see here on the memory footprint that this is the place where memory is being allocated, and on the GPU the memory usage starts to grow. The model is now constructed at this point.
And then step 0 is where the action starts to happen. You were asking earlier what's happening between the CPU and the GPU. The way the execution model works is this: here is step 0 on the CPU, and I'm starting right here, and here's the forward pass, and this is layer 0. So let's just think through what's happening.
As I said before, when you first encounter, or first call, a piece of code in PyTorch, it doesn't just directly execute. It will actually do things like compile things on the fly. So this thing, the runtime-triggered module loading, is overhead work that's being done in order to initialize the layer and the computation and move various bits of code onto the GPU. So this takes a long time.
And then after layer 0 is done, if I look at any slice here, let's zoom in to the selection, we'll see that each of these layers is really, really quick. And what happens here is that when I highlight this layer 1 over here on the CPU side, notice that that's not where layer 1 is on the GPU side.
As I said before, the CPU and GPU are two different execution devices. So I start at layer 0, I'm done with layer 0, I start layer 1. Now, the CPU is actually just sending all of the CUDA commands; it's launching all the CUDA kernels to the GPU already at this point. So when the CPU is saying, "I'm doing layer 1," what it's actually doing is queuing commands into the GPU. It says, "Now run this thing next. Run this thing next. Run this thing next."
And so the CPU is running way ahead of the GPU. By the time layer 1 starts executing on the GPU, we're actually already at layer 9 on the CPU. The CPU is running way ahead, and there's basically a queue that the CPU maintains, where it's sending a fixed number of CUDA kernels to the GPU. Once you hit that queue depth, it's going to stop running ahead. But until that point, it's just going to keep going and going as far as it can.
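The run-ahead behavior with a bounded queue can be imitated with a toy timeline model. Everything numeric here is an assumption made up for illustration (1 time unit to launch a kernel, 5 units to run it, a queue depth of 8); the real launch costs and queue limit depend on the driver and hardware.

```python
def simulate_launches(num_kernels, launch_cost, kernel_cost, queue_depth):
    """Return (issue, start): when the CPU issues each kernel and when the GPU starts it."""
    cpu_time, gpu_free = 0.0, 0.0
    issue, start, finish = [], [], []
    for i in range(num_kernels):
        # Bounded queue: kernel i cannot be issued until kernel i - queue_depth is done.
        if i >= queue_depth:
            cpu_time = max(cpu_time, finish[i - queue_depth])
        cpu_time += launch_cost              # CPU work to enqueue one kernel
        issue.append(cpu_time)
        begin = max(cpu_time, gpu_free)      # GPU executes kernels in order
        start.append(begin)
        gpu_free = begin + kernel_cost
        finish.append(gpu_free)
    return issue, start


issue, start = simulate_launches(num_kernels=20, launch_cost=1.0,
                                 kernel_cost=5.0, queue_depth=8)
ahead = [s - i for i, s in zip(issue, start)]   # how long each kernel waits in the queue
for k in (0, 1, 7, 8, 9):
    print(f"kernel {k}: issued at {issue[k]}, starts at {start[k]}")
```

In this model the CPU issues the first eight kernels back to back, then gets throttled to the GPU's pace once the queue is full, which is the "stop running ahead" behavior described above.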
And in this case, let me zoom out again. Okay, undo the zoom. There we go. In this case, this gets a little extreme, because if I zoom out once more, notice how in these steps I'm running way ahead. Step 0 is here, step 2 is here. This was step 1, which basically took no time at all, and step 2 is here. So the CPU is basically running one entire step, forward and backward, ahead of the GPU.
One interesting thing: if you're writing code for training a language model, one normal thing that you might do is, let's go back to the code, something like printing your losses in between iterations. This seems like it should have no effect on what the GPU is doing, right? You think, well, it's a print statement, how much could it do?
But if you think about it for a moment, this will have a big impact on the execution layout on the GPU, because this print statement happens on the CPU, and the CPU needs to get the loss. That means it needs to wait for the GPU to compute that loss.
So let's look at what happens. Here, as I said, step 4 on the CPU happens way before the GPU equivalent. Now let's switch back. This is the version that I profiled that has the print statement. And now I zoom in to the selection here. See how step 1 and step 2 are basically synchronized now? That's because I have to wait for the loss to get computed.
And you look at this and you say, "Oh, but it's still a little offset, right? Step 2, step 1, they aren't exactly aligned with each other." So let's zoom back in and see what happened to step 1 on the CPU. Basically, the end point of step 1 on the CPU is also where the optimizer step starts.
So by the time that forward is done... sorry, this cudaStreamSynchronize is the thing. This cudaStreamSynchronize command on the CPU is basically saying, "I'm just waiting for the GPU, because I can't run ahead. I'm waiting for this loss to be computed and sent back to me." So this is a kind of dummy operation where the CPU waits and waits and waits.
Then the backward step is done, so now I can print the loss. I've printed the loss. Okay, now the CPU can start running ahead, and it does run ahead and starts sending step 2 stuff. And then once it hits here, it's run out of commands. It's waiting for the loss again. cudaStreamSynchronize: wait, wait, wait. The backward step is done. Now I can print the loss. Now I run ahead again.
So in this case the GPU is still at essentially full utilization in both cases. But in extreme cases, where let's say you're printing tons of stuff all the time, you're actually going to introduce a CPU bottleneck, because the CPU has to keep waiting for the GPU and it can't launch the kernels ahead of time.
So that's a really cool thing that you can see with the profiler: the CPU versus the GPU, and that they're actually different devices that communicate with each other. It's not a single unified object, and you wouldn't see that unless you started to look at some of these more advanced profilers.
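The effect of a per-step synchronization, such as printing the loss, can be put into rough numbers with another toy model. The costs below are invented for illustration, and the model ignores the queue-depth limit to keep things short.

```python
def cpu_wait_time(num_steps, kernels_per_step, launch_cost, kernel_cost, sync_each_step):
    """Return (time the CPU spends blocked on the GPU, total wall time)."""
    cpu_time, gpu_free, waited = 0.0, 0.0, 0.0
    for _ in range(num_steps):
        for _ in range(kernels_per_step):
            cpu_time += launch_cost                          # enqueue and move on
            gpu_free = max(gpu_free, cpu_time) + kernel_cost
        if sync_each_step and gpu_free > cpu_time:           # e.g. reading the loss to print it
            waited += gpu_free - cpu_time
            cpu_time = gpu_free                              # CPU resumes once the GPU is done
    return waited, max(cpu_time, gpu_free)


no_sync = cpu_wait_time(3, 4, 1.0, 5.0, sync_each_step=False)
with_sync = cpu_wait_time(3, 4, 1.0, 5.0, sync_each_step=True)
print("no sync:  ", no_sync)
print("with sync:", with_sync)
```

With these made-up costs the total time barely changes, because the GPU stays busy either way, but the CPU goes from never waiting to spending most of its time blocked. That matches the picture above: the steps line up on the two timelines, and the risk only becomes a real slowdown when synchronization is frequent enough that the GPU starts to idle.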
Any questions about that set of things? Cool. Okay. The other thing that I want to show you is that the profiler view that I was playing with before, you can also generate very similar views in Nsight Systems, where you select some range of things that you want.
Let's do a warm-up; I said we should exclude the first couple of steps, so we'll start at step 3 and we'll measure some steps. In this range we can take the kernels; this is what's doing the computation. And you can see that there are actually many different kinds of matrix multiply. This is one matrix multiply kernel. This is a different matrix multiply kernel. There's a different vectorized elementwise kernel. And all of these are taking different amounts of computation.
We can take this and say, show me in the events view all the things that are happening. And I can also see in the stats view all of the time that it takes. Wait, let's see. We want the average time... no, sorry, we want the CUDA kernel execution summary. Yeah, we want the total duration of the kernels, and so we can see which kernels are taking the most time, aggregated across these views.
So this is actually a very powerful tool that can give you both the aggregate view of what's slow and what's fast, as well as the individual kernels that are being launched, when they're launched, and where the CPU commands for them came from.
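The aggregate view is conceptually just a group-by over kernel launches. Here is a small sketch of that aggregation, assuming the per-launch events have already been exported as (name, duration) pairs. The kernel names and durations below are hypothetical placeholders, not real profiler output.

```python
from collections import defaultdict


def summarize(events):
    """events: iterable of (kernel_name, duration_us). Rows sorted by total time, largest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, duration in events:
        totals[name] += duration
        counts[name] += 1
    grand_total = sum(totals.values())
    rows = [(name, counts[name], total, 100.0 * total / grand_total)
            for name, total in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical kernel names and durations, for illustration only.
events = [("gemm_a", 120.0), ("elementwise", 10.0), ("gemm_b", 40.0),
          ("gemm_a", 130.0), ("elementwise", 12.0), ("gemm_b", 38.0)]
rows = summarize(events)
for name, calls, total, pct in rows:
    print(f"{name:12s} calls={calls} total={total:7.1f}us {pct:5.1f}%")
```

Sorting by total duration rather than average duration is what surfaces a cheap kernel that is launched very many times.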
I guess one final side note here: this is one of the reasons why it doesn't matter that we're programming in Python, and that Python's not a very high-performance language. The CPU usually isn't the bottleneck, because the CPU can run ahead and queue commands into the GPU.
This detaching, or disconnecting, between the GPU and the CPU is one of the key reasons why we can use this nice high-level programming language and yet still get full utilization out of our GPUs.
Cool. Okay. Any questions before I switch back? Because I'm going to leave Nsight Systems for good for this lecture at this point. Cool. But you'll get to play with it in assignment two, and I think you'll appreciate it, because it gives you a really interesting view into what your hardware is actually doing to make these language models train.
So, okay, that was benchmarking and profiling. Now you have all the tools you need to be able to do performance work. And now we're going to write some kernels in the remaining time. So,
remember kernel fusion, right? This was the image that I showed you in lecture: there's a little factory, and every time I need to do an operation, I need to ship things from the warehouse to the factory and back.
So if I naively do a bunch of operations in sequence without thinking about it, I'm paying for a lot of shipping cost back and forth from the warehouse. What I should do is have one factory that does all the operations at once, so I don't pay for this cost multiple times. That's very important. So now we're going to do GELU.
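The shipping-cost picture can be turned into back-of-the-envelope arithmetic. The sketch below assumes a deliberately simplified model in which every unfused elementwise operation reads one array from global memory and writes one back; the choice of 8 operations and 4-byte elements is an illustrative assumption, not a measurement.

```python
def memory_traffic_bytes(num_elements, num_ops, bytes_per_element=4, fused=False):
    """Bytes moved to and from global memory for a chain of elementwise ops."""
    round_trips = 1 if fused else num_ops    # each trip is one read plus one write
    return 2 * round_trips * num_elements * bytes_per_element


n = 1 << 20                                  # about a million elements
unfused = memory_traffic_bytes(n, num_ops=8)
fused = memory_traffic_bytes(n, num_ops=8, fused=True)
print(f"unfused: {unfused} bytes, fused: {fused} bytes, ratio: {unfused // fused}")
```

Under this model the traffic ratio equals the number of operations in the chain, since the arithmetic itself is the same in both cases and only the trips to memory differ.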
And we're going to write a kernel for GELU. I'm going to write that kernel in several different ways, and we're going to look at the performance impact of doing that.
So we have the PyTorch implementation of GELU, and that looks just like this: torch.nn.functional.gelu. And I invoke approximate="tanh" because I want this to exactly match the naive thing that I'm going to do next. So this is not going to be actually multiplying by the CDF of the Gaussian; it's going to be some approximation to that that's easier to compute. Okay, so that's the PyTorch GELU.
And now I'm going to do the dumb thing. You're going to look at this code and say, this is going to be low performance. I'm going to go in and, in PyTorch, write GELU as 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3))). It's a magic formula, but this is a good approximation to the GELU; you can look it up or convince yourself this is true.
But if you do this, you see that there are a lot of operations that happen. There's a tanh, there's an x cubed, there's a multiplication by a constant, an addition, and a multiplication by 0.5 and x. If this involves multiple different CUDA kernels, it's probably going to be slow. That should be our intuition at this point from fusion. So let's see if that's true.
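As a quick sanity check on the "magic formula", here is a scalar sketch in plain Python that compares the tanh approximation against the exact definition, x times the Gaussian CDF, using `math.erf`. It works on single floats rather than tensors, so it says nothing about speed; it only checks that the two expressions agree closely.

```python
import math


def gelu_exact(x):
    """x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """The tanh approximation written out term by term."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


xs = [i / 10.0 for i in range(-50, 51)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max abs difference on [-5, 5]: {max_err:.2e}")
```

Each arithmetic step in `gelu_tanh` (the cube, the scaling, the addition, the tanh, the final multiplications) corresponds to a separate elementwise operation when the same expression is written on tensors without fusion.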
Okay, so these two are the same. You can see at the top left that they compute the exact same numbers, and we can systematically check this on random Gaussians. Now let's benchmark the two.
The manual time is 8.1 milliseconds for a really, really big GELU, and the PyTorch time is 1.1 milliseconds. So the fused version is significantly faster; in fact, about eight times faster. Wow, a big difference from writing a simple kernel.
Of course, your matmuls are probably still going to be the bottleneck, but it would be really cool if we could go from that 8 milliseconds to that 1 millisecond. That would feel very satisfying.
So we're going to try to get close to that 1.1 milliseconds in the next few parts of the lecture. Now let's look at what's happening under the hood. I don't need to look at Nsight Systems, because all I really want to know is some very high-level stuff.
For the manual GELU, just like I said, it's going to do a whole bunch of operations. It's going to do a bunch of multiplications. It's vectorized, but it's a bunch of CUDA kernels being launched here. And notice on the right that this CUDA kernel gets called three times, because we have a whole bunch of multiplications floating around here. We've also got an addition, and we've got a tanh. Each one of these is probably kind of slow, and in the end we're incurring fairly large overhead doing this.
Now let's do the same thing with the PyTorch GELU. And this is really great: there's a single CUDA kernel launch. It happens once and it just processes the whole thing. This is what we'd like to see. And of course this is very, very fast, because it's just a single CUDA kernel.
So this is really nice, and we would like to somehow get to that single CUDA kernel. The first thing you might think, depending on how much you know about writing efficient GPU code, is: all right, the PyTorch people must have written this in the lowest-level language possible, so we're going to do the same thing. We're not going to go to the lowest level possible, but we're going to go to the C++ API and write the CUDA kernel in C++. So let's open it up and write our own CUDA kernel.
So how is that going to work? Okay, so we have gone in and created a C++ version of the whole thing. CUDA, when we say CUDA, is actually the C++ API for interfacing with and programming GPUs.
And just like the logical model of a GPU that we described, we're going to write some function f, and then when we invoke this CUDA kernel, it's going to automatically call f on all the elements of a vector or a matrix, and so we get to compute everything that we want in parallel.
As nomenclature, we're going to have a grid, which is a collection of thread blocks. Think of this as: I have a task, and I'm going to cut it up into pieces. There's going to be a number of blocks; in a 2D grid, for example, there's going to be a row coordinate and then a column coordinate, and this will be very useful if you're working with matrices.
Then there's the size of each of these blocks, meaning how big they are in terms of the number of threads; this is the dimension of the blocks. And then there's a collection of threads within these blocks. So this is the coordinate that, for example, one thread block lives at, and then each thread lives within a block. So there's a hierarchical structure here.
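The grid, block, and thread hierarchy can be mimicked in plain Python to show how a thread works out which element it owns. This is a sequential stand-in written for illustration: `launch` and `double_kernel` are hypothetical helpers, the example is one-dimensional only, and on a real GPU the per-thread calls would run in parallel rather than in a loop.

```python
def launch(kernel, num_blocks, block_size, *args):
    """Call kernel once per (block, thread) pair, one after another."""
    for block_idx in range(num_blocks):
        for thread_idx in range(block_size):
            kernel(block_idx, block_size, thread_idx, *args)


def double_kernel(block_idx, block_dim, thread_idx, x, y, n):
    i = block_idx * block_dim + thread_idx    # global element index for this thread
    if i < n:                                 # the last block may overhang the data
        y[i] = 2.0 * x[i]


x = [float(v) for v in range(10)]
y = [0.0] * len(x)
launch(double_kernel, 3, 4, x, y, len(x))     # 3 blocks of 4 threads cover 10 elements
print(y)
```

The bounds check matters because 3 blocks of 4 threads give 12 threads for only 10 elements, so the last two threads must do nothing.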
There's a grid, there are thread blocks inside the grid, and there are threads inside each block. And basically each function is going to take in three things. It's going to take the block index, meaning which thread block I belong to; the block dimensions; and my thread index. With these I can work out which coordinate I'm at in the matrix or the vector, and then I can decide what logic I want.
One last thing before we go through the actual C++ code: whenever you're trying to debug CUDA, you want to launch with CUDA_LAUNCH_BLOCKING=1. This will allow you to actually debug your CUDA kernel. It will give you error messages back, at a cost in terms of the runtime. If you don't do that, you are going to have a bad time if you're writing CUDA code and needing to debug.
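For reference, one way to turn this on from inside a Python script is to set the environment variable before anything initializes CUDA. The snippet below only sets and reads back the variable; the assumption that it has to be in place before the first CUDA call (in practice, before importing the GPU library) reflects how such runtime settings are normally picked up at initialization.

```python
import os

# Set this before the first CUDA call, otherwise an already-initialized
# runtime may not pick it up. It makes kernel launches synchronous, so an
# error is reported at the launch that caused it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same setting can be given on the command line by prefixing the launch command with `CUDA_LAUNCH_BLOCKING=1`. Since it removes the CPU run-ahead described earlier, it is best kept for debugging sessions rather than timing runs.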
So, okay, here is my GELU code. Let's go through it piece by piece, and then I'll talk about what all the pieces are doing. This will probably take the longest out of the things that we're going to walk through, other than the machine code. And once you understand this, you should be able to understand all the other pieces, so we'll go through it a little slowly.
There are two parts to this code. The first part, this gelu kernel piece up here, is the actual kernel. This does the computation. It's going to get sent to the GPU, it's going to do the computation, and then it will return the results. This piece, the gelu function here, is a wrapper. It lives on the CPU, and it's going to orchestrate the launch of the kernel, which is actually going to go out and live on the GPU.
So maybe we can start with this wrapper piece, the gelu function, first. We're always going to check two things. Basically, in the Triton or the CUDA code, we're always going to check... oh, sorry, there's a question back there.
Okay, sorry, that's my bad. Let me zoom in. That is an easy fix, but I needed to know that you can't see. Okay, good. All right. Is this good?
Okay, excellent. So we're going to start with the gelu function. There are two things that we're always going to need to do. The first one is to make sure that x lives on the GPU device, that it's a CUDA tensor of some kind. If it's not, well, that's going to be a problem; we're not going to be able to do anything on the GPU.
The second thing, which is maybe less obvious, is that we want to check to make sure x is contiguous. What that means is that it lives in a contiguous block of memory, because when we index into x, we're going to do a whole bunch of indexing arithmetic, and we're going to assume that x lives in one block of memory. If it doesn't, it's just going to be basically impossible to do this with any level of generality.
And so when we compute the GELU, we take in an input x and we output a y, so we need to allocate an output: torch::Tensor y = torch::empty_like(x). This is just saying: give me an output tensor, or a pointer to an output tensor, with the same dimensions as x. And notice that I'm not calling zeros. This saves extra operations: I don't need to zero out these y's because I'm going to write into them anyway. It's a minor optimization, but you might as well do it. And then, in basically all the code that we write, we're going to need to figure out the grid. What's the total number of elements that I have? What's the size of each block, that is, the number of threads in each block? And how many blocks do I have in total? To figure out the number of blocks, I call cdiv, which essentially takes the ratio of num elements to block size and then takes the ceiling. I need to round up to make sure that the very last set of elements, the part that isn't divisible by block size, still gets computed. So I take the ceiling rather than the floor. This is all very simple bookkeeping stuff.
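The grid bookkeeping just described can be sketched in a few lines of plain Python. This is only an illustration of the ceiling division, not the lecture's wrapper code; the element count and the block size of 1024 are assumed values chosen so that the last block is only partly full.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: the smallest integer k such that k * b >= a."""
    return (a + b - 1) // b


num_elements = 1_000_003   # hypothetical size, deliberately not a multiple of block_size
block_size = 1024          # threads per block (assumed value)

num_blocks = cdiv(num_elements, block_size)

# Threads launched beyond the end of the data; these must do nothing in the kernel.
surplus_threads = num_blocks * block_size - num_elements

print(num_blocks, surplus_threads)
```

Using floor division here instead would silently skip the final partial block, which is why the wrapper rounds up and leaves it to the kernel to ignore the surplus threads.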
And then I say, all right, launch the kernel. The GELU kernel gets launched, and the angle brackets say: launch it with the given number of blocks and the given size of each block. That gets passed into the kernel launch. Then I pass in the pointers to x and y, not the actual values of x and y, plus the total number of elements, which I need in order to handle the boundary conditions of my kernel. So now let's go to the actual kernel itself. I have __global__ void gelu_kernel, and I get pointers for in and out, and I have the number of elements. The rendering on the website has mangled the keyword a little bit, but you should think of it as __global__, with underscores. This keyword distinguishes it as a CUDA kernel function. So what am I doing? Each thread is supposed to operate on a single element i, but I don't get i as input. The code doesn't tell me which coordinate of the vector I'm at, so I need to compute where I am. How am I going to do that? I take my block index, and I only have one dimension, so it's blockIdx.x, just the first coordinate, and multiply it by the size of each block, blockDim.x. That gives me the starting point of my current block. Then I add in threadIdx.x. So I know where the start of my current block is, I add the offset to where I am within the block, and that gives me my global coordinate i. It's some bookkeeping computation just to get the coordinates. The next part is important too, and you see this pattern in basically all the CUDA code that people write. There's no out-of-bounds checking by default, so once I have my coordinate, I check that I'm supposed to be processing something that's in bounds. Some of the threads at the very end of the last block would otherwise be processing stuff that's out of bounds in memory, and you do not want them to touch it. So you condition on i less than num elements, and you do nothing if you're outside of that.
Sorry, yes? Sorry? Oh, this is just the extension that you write the CUDA code in, to distinguish it from your standard C code. So the .cu is just a file name thing. There's nothing particularly special about it. Okay. And now, within the bounds check, we just do our computation. I have my input in, I index into the i-th element, I compute my GELU just like I did before, and I assign it to out of i. And then I'm done. That's all I need to do. Since this is all pointer stuff, I don't need to worry too much about what is actually happening here. So that's basically it. I can then take the CUDA GELU code that I have, load this C++ code inline, and have it compile into a module, all within Python. It's all very nice and convenient. You don't really have to go out onto the command line and do things. And so now we have CUDA GELU defined. It's basically a compiled version of this, and I can call it from within Python; we'll use the C bindings to call it. Okay, we're done calling CUDA GELU.
I can check that the manual GELU and the CUDA GELU give the same outputs. And now let's benchmark the two. I have the time that it takes to run PyTorch, and just like last time, it's about 1.1 milliseconds. The manual time, remember, is 8.1 milliseconds. And so, drum roll, what is our CUDA time? Well, we've gotten it down to 1.8 milliseconds. Not quite as good as PyTorch's implementation, but we're getting pretty close to PyTorch time. We've gone from 8 milliseconds to 1.8 milliseconds, which is not bad, because that C code wasn't that hard to write. We can also do some profiling and see what is happening here now. It's calling the GELU kernel, which is the code that got shipped off to the GPU. Then it's calling empty_like, which is the initialization, and then empty_strided, and then cudaLaunchKernel and cudaDeviceSynchronize. And that's basically all that's happening. Notice how, once again, a single CUDA kernel eats up 100% of the GPU time, which is what we wanted. There's some further optimization we could do, but this has already solved the problem of kernel fusion: we fused all the operators together. So, pretty good. These kinds of elementwise operations are easy to write in CUDA. If you have a new kind of, I don't know, nonlinearity, you could easily write a CUDA kernel for it yourself if you really wanted to. But more interesting operations are going to require reading multiple values, like doing reductions, and those get a little more complicated. Flash attention will be a little bit more complicated, but not too much so, when you have to do it in the assignment. Okay. Any questions on the simple C++ CUDA kernel? Yes.
The check at the beginning, yeah. Does that throw an error? Is it like caller kernel?
Yeah. So the question was, what happens if it's not contiguous? At least in the code that we wrote, it will just throw an error, because it's an assert. You could potentially write code to handle it, but there's almost no reason for memory to be fragmented, because it will allocate contiguously, and you won't deallocate the middle of a block of memory unless you're doing something really tricky. So unless you're doing something pretty advanced, you should expect to have contiguous memory.
Sometimes you do, like, a transpose or some other operation that makes memory not contiguous. So when you're coding at a higher level, should you be careful to force things to be contiguous before calling the operation?
Yeah. So the question was: if you're transposing, then you're no longer going to be contiguous. You're going to have a jump between all the elements in the index, if you're row-traversing something that's column-stored. Yeah, so I think transposes, or views, or essentially shuffling dimensions, is the one exception to this. But that's handleable in the outer part, the wrapper. You can basically pass it something that is contiguously indexed.
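The transpose case can be illustrated with strides alone. The following is a simplified pure-Python sketch, not PyTorch's actual implementation (whose contiguity check has extra rules, for example for size-1 dimensions); the helper names are hypothetical.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a densely packed tensor of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    # Simplified check: the strides must be exactly the packed row-major strides.
    return tuple(strides) == contiguous_strides(shape)


shape = (3, 4)
strides = contiguous_strides(shape)

# A transpose is a view: it swaps shape and strides and moves no data.
t_shape, t_strides = shape[::-1], strides[::-1]

print(is_contiguous(shape, strides), is_contiguous(t_shape, t_strides))
```

The transposed view walks memory with a jump of 4 elements along its last dimension, so flat indexing arithmetic inside a kernel would read the wrong values. In PyTorch, calling .contiguous() on the tensor in the wrapper produces a packed copy that is safe to pass down.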
And for a lot of the matrices, you won't really care. Yes? What would happen if you were to choose a different block size? Right, so what would happen if you chose a different block size? The GPU-related concerns would kick in: do you have enough blocks to saturate your SMs, and do you have enough work within each block? Those are the two things that could matter here. But my guess is that for block sizes that are relatively large, like 1024, it probably won't matter past a certain point, because we're not doing anything advanced. It's all entrywise operations in this very, very simple example. Yeah? Is the reason that our non-GPU version was so slow that it has to do a small operation, go to the GPU, and come back?
So the question was: why was our non-CUDA-kernel, manual version so slow? It's not that it's sending things back from the GPU to the CPU per se. x is going to live on the GPU; we allocate it on the GPU with device equals cuda. But it's not going to stay in the SMs the whole time. Once we do x squared, that's a CUDA kernel, so that multiplication will read the vector from global memory into the SMs, do the computation, and write it back. So the cost is all in the DRAM-to-SM communication, rather than the CPU-to-GPU communication. Of course, if you write device equals cpu, then you'll pay the CPU transfer cost in addition to the DRAM transfer cost.
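A back-of-the-envelope model of that global-memory traffic is sketched below. It is deliberately crude and rests on stated assumptions: every elementwise kernel reads each element once and writes each element once, values are 4-byte floats, caches are ignored, and the count of eight separate kernels for the unfused version is hypothetical.

```python
def bytes_moved(num_elements: int, num_kernels: int, bytes_per_element: int = 4) -> int:
    # Each kernel reads every element once and writes every element once.
    return num_kernels * 2 * num_elements * bytes_per_element


n = 1 << 20        # hypothetical vector length
unfused_ops = 8    # hypothetical number of separate elementwise kernels

unfused = bytes_moved(n, unfused_ops)
fused = bytes_moved(n, 1)
ratio = unfused // fused

print(unfused, fused, ratio)
```

Under this model the traffic scales linearly with the number of separate kernels, which is why fusing them into one kernel helps so much for memory-bound elementwise work.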
Okay, so now you've seen that, and that was not too painful. But it would be really nice if we had nicer Python abstractions for writing CUDA kernels, and this is what Triton is. Triton is quite nice: it's a middle ground where you don't have to manage literally everything about the GPU. Triton is a domain-specific language developed by OpenAI in 2021, and it makes GPU programming much more accessible. You write everything in Python, and you don't really think about threads anymore; you think about thread blocks. Triton manages a lot of stuff that is annoying but can be automatically optimized. It can manage coalescing of memory. Remember that from DRAM you get four adjacent values at once with something called burst mode, so you really want your memory retrievals to be grouped into adjacent calls of four or more elements at once. Triton will handle that automatically and group them. It will do shared memory management, when you need to manage which memory you're writing to within the SM with multiple threads. Within each SM, you might need to stop or start threads; that's all managed automatically. But scheduling across SMs, or what different SMs do, is manual. So the programming model is that you think at the SM-centric level, and the compiler handles a lot more of the lower-level details. And Triton is quite nice because it can outperform a lot of PyTorch implementations by quite a bit. It's kind of like going all the way to writing CUDA, but you're still in very familiar Python land. And I think a very underappreciated advantage is, as it's written here, it's all in Python. You can step through it and debug it fairly nicely.
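The burst-mode point above can be made concrete with a toy model. This sketch assumes memory is served in aligned groups of four elements, as in the lecture's description, and simply counts how many groups a set of 32 threads touches; real hardware transaction sizes differ, and the function name is hypothetical.

```python
def bursts_touched(indices, burst_elems: int = 4) -> int:
    """Number of distinct aligned burst_elems-wide groups that the indices fall into."""
    return len({i // burst_elems for i in indices})


threads = range(32)                      # 32 threads reading one element each
adjacent = [t for t in threads]          # thread t reads element t
strided = [t * 4 for t in threads]       # thread t reads element 4 * t

print(bursts_touched(adjacent), bursts_touched(strided))
```

Adjacent accesses share bursts, while the strided pattern needs a separate burst for every thread and discards three of the four values each one delivers. Grouping accesses so they look like the first pattern is the kind of work Triton takes off your hands.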
So let's step through a Triton kernel. Once again we're going to write GELU, and we're going to do it in Triton. I've structured the code to be as similar as possible to our other code. This is the CPU-side code, so to speak. This is the wrapper code.
It takes in x, which is a torch tensor, and I've got my two asserts at the top. I allocate an output tensor y using empty_like once again. It has the exact same coordinate computation components, and even the kernel launch looks very similar. I've got this num blocks annotation, and my block size is at the end here, not inside the brackets, but basically I'm passing the same information to my kernel. And the Triton kernel is this code over here. It does the same thing as what we were doing before, but now it's nicely written in Python. The mental model is: the input is at the x pointer, the y pointer is the starting coordinate of the output vector, the block size is how big each of my blocks is, and num elements is the very end of my array. Now look at this set of lines, 557 to 561. This is doing the computation of my index. Before, I did i equals some formula; this is the same calculation. I'm calculating where the start of my current block is. That's my block ID times the size of the block. Say I live in block one: that gets me this point right here in the middle. Next I need to know where I live within my block, and that's the offset. But notice one difference. I don't get a single offset, because I'm not programming threads; I'm programming blocks. What does that mean? My offsets are actually a vector, not a single value, because I'm going to do a vectorized operation, where the vectorized operation is handled by different threads. So my offsets are the start of the block plus a vector, this range of block size offsets. My offsets are all of these coordinates within block one at once.
Of course, if I'm at the very end, I might go off the edge, so I need a mask to handle anything that lives off the boundary of my vector. Now I load everything at once, in a single vectorized operation: x pointer plus offsets, masked. These are the values I'm responsible for, and they're loaded into x, which is my internal temporary vector. With this temporary vector I do exactly the old GELU computation. There's no tanh here, so I compute that manually, but you can convince yourself that this formula is the same as what we had before. Then y is the formula computed up here. Once I'm done, I need to write it back into my output buffer, so I compute my targets, y pointer plus offsets, take my temporary values y, and store them. So this is very, very similar to what came before, but this one is the vectorized version: I get to operate on an entire block at once. Instead of thinking from the perspective of a thread, I'm thinking from the perspective of a block. But it's not too different; this is all fairly similar stuff. So now I've written my Triton GELU. All right, I will do this next part fairly quickly.
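The block-level view of the Triton kernel can be mimicked on the CPU with explicit lists standing in for the vectors. This is a plain-Python sketch of the logic, not Triton code: the function names are hypothetical, the constants 0.79788456 and 0.044715 are assumed to be the ones in the tanh approximation, and tanh is built from exp because the lecture notes that tanh had to be computed manually.

```python
import math


def tanh_via_exp(a: float) -> float:
    # tanh(a) = (e^(2a) - 1) / (e^(2a) + 1); fine for the small inputs used here,
    # though math.exp would overflow for very large a.
    e = math.exp(2.0 * a)
    return (e - 1.0) / (e + 1.0)


def block_program(x_buf, y_buf, pid, num_elements, block_size):
    """What one program instance (one block) does, written with explicit lists."""
    block_start = pid * block_size
    offsets = [block_start + j for j in range(block_size)]            # a vector, not a scalar
    mask = [off < num_elements for off in offsets]                    # lanes that are in bounds
    x = [x_buf[off] if m else 0.0 for off, m in zip(offsets, mask)]   # masked load
    a = [0.79788456 * (v + 0.044715 * v ** 3) for v in x]
    y = [0.5 * v * (1.0 + tanh_via_exp(t)) for v, t in zip(x, a)]
    for off, m, val in zip(offsets, mask, y):                         # masked store
        if m:
            y_buf[off] = val


x_buf = [0.25 * k - 1.0 for k in range(10)]
y_buf = [None] * len(x_buf)
block_size = 4
num_blocks = (len(x_buf) + block_size - 1) // block_size
for pid in range(num_blocks):          # on a GPU these would run in parallel
    block_program(x_buf, y_buf, pid, len(x_buf), block_size)

print(y_buf)
```

Compared with the per-thread version, the index is now a whole vector of offsets and the bounds check becomes a mask applied to the load and the store.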
One last thing. I will only point out a few things here, because I don't want to get so far in the weeds that you all get up and leave. But the one last cool thing we can do is this: Triton of course compiles into low-level, almost machine code for the GPU, and we can look at this very low-level code, called PTX, after the Triton compiler goes over it. It's actually kind of cool. You can see how the GPU actually works at the thread level. So this is the Triton GELU kernel as generated by the compiler. At first it does some really basic stuff. What's it doing here? It's saying, I'm going to need to store some values, some intermediate computations. b means untyped, basically just bits, so I need untyped registers that are 32 bits in size. I need floats for doing computations, called f. And I need another set of registers that are 64 bits. So I have all these registers that I need for temporary computations.
And then, starting here, I start computing... sorry, this part is loading the various arguments to the function, so things like the x pointer and the y pointer get loaded here. Starting here, I compute the coordinate offsets of my Triton kernel. And once I get down here, this ld.global is the instruction that loads the values from the x pointer into my temporary registers. It's basically saying: load R2, R3, R4, R5 using the memory position in RD1. Notice how it's loading four things at once, because it's cleverly handling coalescing. We know we can get four values for free, so we should operate on all four of them at once. Then you do the same thing again here. And then you start to get the floating-point operations, mul.f32, which go through and do the tanh computation. I'm not going to explain all the different pieces, but here it's multiplying by a constant. It does x cubed by multiplying the same numbers multiple times. And then here it computes 2 to the x, but we want e to the x, so it first multiplies the exponent by log base 2 of e to change the base. You can really see all of the literal step-by-step operations that the GPU does in order to get you the final result. I'll skip all the way to the end; this is all floating-point computation that it needs to do. At the very end, it stores the values that it has, R38 through R41, into RD4, which is the memory position of our output. So this is what's actually happening at the low level. We see that each thread is operating on four values at a time, and its temporary storage is the registers, which is the really high-speed storage that it has very locally. So just by looking at it, we can see this is probably going to be pretty fast code. Okay.
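The base change mentioned in the PTX walkthrough, where the hardware computes 2 to a power but the formula needs e to a power, relies on the identity e^x = 2^(x * log2(e)). A minimal Python check of that identity follows; the function name is hypothetical and this says nothing about the exact instructions the compiler emits.

```python
import math

LOG2_E = math.log2(math.e)   # the constant the exponent gets multiplied by


def exp_via_exp2(x: float) -> float:
    # e^x = 2^(x * log2(e)): rescale the exponent, then use a base-2 exponential
    return 2.0 ** (x * LOG2_E)


for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, exp_via_exp2(x), math.exp(x))
```

The two columns printed for each x agree up to floating-point rounding, which is why a base-2 exponential plus one extra multiply is enough to implement exp.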
So that was the PTX, and we can go through and see what it's doing for all sorts of things. But now let's go back and actually benchmark things. We've got manual GELU at 8.1 milliseconds, PyTorch time 1.1 milliseconds, CUDA time 1.84 milliseconds, and Triton time 1.848 milliseconds. So we didn't get any faster, but it was much easier to write the Triton code. We wrote it in Python, we thought about blocks, and we could do vectorized additions. If you're doing more sophisticated stuff, Triton will handle a lot of the memory stuff for you. So it's actually pretty good. And profiling once again, we see a single kernel launch that consumes all of the GPU time. So that's great, and that's Triton kernels. The last thing... whoops, one second here. Okay. The last thing that I want to talk about is torch.compile. Of course, writing CUDA kernels is cool and it makes you feel really good, but maybe we don't need to do that. The things that we were doing here were very simple. We were just taking these x cubed and exponentiation operations and shoving them all into a single CUDA kernel. So maybe we can get that without doing much. We've shown you several different ways, but the last one I want to talk about is this thing called torch.compile, which takes non-optimized PyTorch code and writes more optimized code. Here it's going to attempt to automatically do optimizations like kernel fusion. And this compiled GELU is going to be equivalent in the actual outputs that it generates.
But now let's look at the run times. We've got some run-to-run variation, but basically the same kind of numbers: 8.1 milliseconds manual, 1.1 milliseconds PyTorch, 1.8 milliseconds for the kernels we wrote, and then 1.47 milliseconds with torch.compile. So the punch line here is that modern JIT compilers are pretty good. They can do optimizations like operator fusion without you having to do very much at all. And if you look under the hood, you can see that there's basically, once again, one thing that happens: a fused add-multiply-tanh Triton kernel. So it's generating Triton under the hood that does similar kinds of things to our Triton code, but it's actually slightly more optimized than what we did, and so it's getting slightly better performance than even our code. So torch.compile is quite nice. Yes?
How do you know when [inaudible] compiled? Like, you're going to try to implement your own version, like, it can't do flash attention, right?
Yeah. So I guess the better way to phrase that question is: when do you know you can do better than torch.compile? That's the relevant question. For simple stuff, like simple operator fusion, or the other thing it's very good at, optimizing matrix multiplies, I doubt you can get much better. As I said before, if torch.compile knows the shapes of the matrices, it can figure out which kernels to dispatch. It is very good at those things. But there are things like flash attention one, two, and three, if you've seen them. Those are pretty non-trivial optimizations. These days torch.compile and JAX's XLA compiler can do those, but that's because we know in hindsight that those are the right optimizations to do. Some of those things are a bit non-trivial to figure out: Flash Attention 3 has additional hardware-level optimizations that leverage the H100 hardware, and that's not obvious to do with a JIT compiler. So there are some things that are quite hard for torch.compile where I think you could do better. But in general, the point here is that you shouldn't go home and say, I'm going to write CUDA kernels for every single part of my language model. That's probably not a good use of your time. But if you're writing a new architecture with some complicated piece, and you're not getting utilization but you think you can, that's maybe the time to really bust out the Triton.
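To put the GELU timings quoted in this lecture side by side, the small sketch below turns them into speedups over the manual implementation. The numbers are the ones reported in the lecture, taken to be in milliseconds; the dictionary keys are just labels chosen for this illustration.

```python
# GELU timings as quoted in the lecture, in milliseconds.
timings_ms = {
    "manual": 8.1,
    "pytorch": 1.1,
    "cuda": 1.84,
    "triton": 1.848,
    "torch.compile": 1.47,
}

baseline = timings_ms["manual"]
speedups = {name: baseline / t for name, t in timings_ms.items()}

# Print fastest first.
for name, s in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{name:>14}: {s:.2f}x")
```

Most of the gain over the manual version comes from having a single fused kernel at all; the differences among the fused variants are comparatively small, which supports the advice to reach for hand-written kernels only when the compiler leaves real utilization on the table.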
Okay, so we're basically at time, but we can quickly go through one last Triton example, softmax. Maybe this will be useful for you in assignment two. One difference is that until now we were doing just basic elementwise operations, and that's really easy, because you just operate on each element and there's no complexity to it. Softmax has a reduction operation, where you have to add across all the elements. So how do we do that? What we want to do is normalize across each row of the matrix, and we'd like to make this fast. A naive version of this is going to be pretty slow, so now we're going to write the Triton kernel. If I wanted to be lazy, the easiest way to do this is... okay, actually, you can think for a moment about what the easiest way to do this is. Let's say you want to write a softmax, so you're going to normalize each row of a matrix, and imagine these matrices are pretty small. You're just writing a kernel for small matrices. If you're doing this, what's the right kind of block design? Well, maybe our grid should just be the rows, so each SM handles a single row. That's kind of the optimal thing to do, because if we can fit a whole row into an SM, then we just sum across that row in the SM and then we divide. That's great. And
so that's going to be the simple design for our very naive softmax kernel here. All we're going to do is make each block a row. So the block size should be the number of columns plus a little bit of buffer to be able to fit all the columns. This is triton.next_power_of_2 of n, which is a nice way of padding out your columns. And I'm going to make each block a row, so the number of blocks is exactly the number of rows. Then I have my Triton softmax kernel, which is written in kind of the way that you'd expect. Now we have a matrix rather than a vector, so we have the x pointer and the y pointer, and we need the strides of the matrices. Then we can figure out what row index I'm in, and I can get the column offsets. This is the same kind of code as before. In fact, getting the row offset is simpler, because each row is a block. And then I do basically the same kind of stuff. I load each row into my SM's local memory, and then I do the computation exactly in a way that looks like a softmax. I have my row, I subtract my max, I take the exponent, I sum it, and then I divide, which gives me my softmax-normalized row, and I write it back to global memory. No complexity at all. Whenever your computations fit nicely in an SM, writing Triton code looks very similar to writing normal Python code, just with a little bit of load and store and keeping track of where the blocks are.
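The one-row-per-block design can be sketched on the CPU as follows. This is plain Python rather than Triton, with hypothetical function names; it assumes the padded lanes are filled with negative infinity on load so that they contribute nothing to the row sum, which is one common way to mask a reduction.

```python
import math


def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def softmax_row_program(x_rows, y_rows, row_idx, num_cols, block_size):
    """One program instance: normalize a single row that fits in one block."""
    col_offsets = range(block_size)
    # Masked load: padding lanes get -inf, so exp() of them is exactly 0.
    row = [x_rows[row_idx][c] if c < num_cols else float("-inf") for c in col_offsets]
    row_max = max(row)                                  # reduction 1: max, for stability
    numer = [math.exp(v - row_max) for v in row]
    denom = sum(numer)                                  # reduction 2: sum
    for c in col_offsets:                               # masked store
        if c < num_cols:
            y_rows[row_idx][c] = numer[c] / denom


x = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [-1.0, 0.0, 1.0, 1000.0, 2.0],
]
num_rows, num_cols = len(x), len(x[0])
block_size = next_power_of_2(num_cols)       # pad the 5 columns up to a power of two
y = [[0.0] * num_cols for _ in range(num_rows)]
for row_idx in range(num_rows):              # grid: one block per row
    softmax_row_program(x, y, row_idx, num_cols, block_size)

print(y)
```

Because the whole row sits in one block, both reductions happen locally and each row is read from and written to global memory only once. Subtracting the row maximum first keeps the third row, with its entry of 1000, from overflowing in exp.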
Right? So life is pretty simple. Let's go back. Oh wait, where were we? To the Triton. Here we go. And then we can see how fast all of our different pieces of code are. I'll zoom out again, just to make sure. Okay, so the manual time is 3.7 milliseconds, our torch.compile time is 1.3 milliseconds, the PyTorch time is 1.5 milliseconds, and the Triton time is 1.9 milliseconds. It's still a little bit slow. torch.compile can actually do better than the native PyTorch implementation, especially when it knows about the shapes and sizes of certain operations. Finally, we can look in the profiler. The manual softmax is kind of a disaster here. You see all sorts of crazy operations happening all over the place. Let me clear this. If we go back up here... okay, yep. We see all sorts of operations happening: we have exp, we have max, we have sum, because we've implemented things naively, and we've got memory reads and writes everywhere. The compiled softmax is just one fused softmax operation that goes quite fast. Then we've got the PyTorch softmax, which is also one CUDA kernel call, and the same thing with our Triton softmax: we have our nice Triton softmax kernel, a single fused kernel for everything. Okay, I won't go through the PTX code for this. We're kind of at time, and I don't want to drag you through that low level again. But hopefully this has given you a flavor of lower-level GPU programming for the purpose of making language models go fast. And hopefully you'll have fun doing assignment two. Thanks.