
Lecture 57: CuTe

By GPU MODE

Summary

## Key takeaways

- **CuTe: CUDA Tensors**: Cris Cecka invented CuTe, short for CUDA tensors, to simplify tensor contractions after struggling with BLAS extensions in FFT and low-rank ML research. A single-file CuTe GEMM example beat cuBLAS by 25% in the first week of an intern's work. [01:08], [03:16]
- **CuTe Powers CUTLASS Abstractions**: Everything inside CUTLASS is implemented in terms of CuTe, providing abstractions for productivity and performance across all scopes of dense linear algebra on GPUs. CUTLASS has 4.2 million downloads per month and 7,500 GitHub stars as NVIDIA's most popular open-source project. [05:32], [06:17]
- **Layouts Generalize Beyond Row/Column Major**: CuTe layouts express data layouts far beyond row-major and column-major, using hierarchical shapes and strides to map logical coordinates to physical indices via an inner product. This enables folding rank-3 tensors into matrices symmetrically in any orientation for batched GEMMs. [09:43], [22:47]
- **Universal Copy from a Rank-Agnostic Loop**: CuTe's copy algorithm uses a single rank-agnostic loop over integer coordinates, handling 1-D arrays, multi-D tensors, gathers, scatters, broadcasts, transposes, and zero-stride cases optimally with static unrolling. Static shapes, common in CUDA tiles, eliminate runtime overhead. [44:36], [46:04]
- **Partitioning via Layout Composition**: Tensor partitioning is functional composition of a thread-value layout with the tensor layout followed by slicing, enabling arbitrary partitioning of tensors of any rank. This directly encodes the tensor-core partitioning patterns from the ISA docs for Volta, Ampere, and Hopper. [01:01:17], [01:05:46]
- **Algebra Enables GEMM and Convolutions**: CuTe's layout algebra supports generic GEMM on strided matrices, tensor contractions by mode grouping, and convolutions via im2col layouts, reusing the same tiling and optimizations without extra code. Common sublayouts enable automatic vectorized copies. [53:28], [54:26]

Topics Covered

  • Layouts Generalize Beyond Row Column Major
  • Static Shapes Eliminate Runtime Overhead
  • Tensor Contractions Are Just GEMMs
  • One Copy Fits All Tensor Operations
  • Partitioning Is Layout Composition Slicing

Full Transcript

All right. Can folks on YouTube hear us? Okay, let me double-check. Yeah, looks like we're good.

All right. Welcome, everyone, to another episode of GPU MODE. This is lecture 57, and today I'm really thrilled to have Cris Cecka, who's made a few cameos in a few of our past lectures. Cris is the creator of CuTe, and he's here to teach us basically everything we need to know about tensor algebra. I'm also joined by one of my friends and colleagues, Driss Guessous, who does a lot of CuTe- and attention-related work in PyTorch and might also have a lot of questions. Cris already told us before the cameras got rolling that he loves questions, so please, if you have any, drop them in chat and we'll interrupt him. So yeah, without further ado, Cris, please take it from here.

Okay, thanks, Mark. I am Cris Cecka, the original inventor of CuTe, and I wanted to come on here and tell you the story of CuTe and how it works. CuTe stands for CUDA tensors. Let's jump right into it. I'll give you a quick history.

I have a big background in numerical linear algebra. I was doing a lot of FFT research, for example, and for that research I needed BLAS extensions for tensor contractions. Tensor contractions were difficult to do, and I didn't understand why they had to be difficult. Sorry, this slide is messing with me. There we go. BLAS extensions for tensor contractions were difficult to do, and I didn't understand why. I moved on to some low-rank machine learning research. This, for example, is a Kronecker product of two canonical polyadic decompositions, which is my favorite tensor decomposition. But again, tensor contractions were hard, and I didn't understand why. Why can't I just have my tensor contractions? Tensor contractions are just GEMMs.

So I started poking around inside NVIDIA: what could we do about tensor contractions? I collaborated with Paul Springer for a very long time. He was building cuTENSOR at the time and encountering a lot of CUTLASS 2 pain. I understood some of that pain, and I still didn't understand why this is hard. This seemed like the most obvious and simple thing to me. So I started building CuTe, and I applied CuTe to Volta, Ampere, and Hopper. A really amazing intern, Muhammad Osama, came along, and I knew I was onto something when, in the first week of his internship, he was able to grok a single-file CuTe example of a speed-of-light GEMM, modify it, and beat cuBLAS by 25%.

That was unheard of, because at the time GEMMs at NVIDIA were pretty much being built directly from assembly, and having an actually hackable single-file C++ GEMM that you could modify and do something interesting with was amazing. So we kept going. We implemented what's called Stream-K and published a paper. Tensor cores: this is my favorite tensor core. I finally applied CuTe to the tensor contractions that I had initially started with, and was able to show off a bunch of things that got CUTLASS's attention. And then Vijay, whom I nicknamed the golden intern, took these ideas, ran insanely fast with them, and helped build out CUTLASS 3, which covered Hopper, Blackwell, and TMA. That's kind of the story of CuTe's development and history.

What is CUTLASS? CUTLASS is a set of abstractions for productivity and performance for all scopes and scales of building dense linear algebra for GPUs. It's multi-level: you can enter at any one of these layers. Excuse me: inside NVIDIA, SOL stands for speed of light. Speed of light is as fast as a certain chip will go. We can calculate how fast we can possibly go on a certain architecture, and that is the speed of light to us. If you hit that speed, good job. If you don't, if you're 1% off, let's go find that 1%. We need to go the speed of light; we have to go as fast as we possibly can.

So CUTLASS presents a bunch of speed-of-light kernels. If you're just interested in a GEMM, we have a GEMM, and you can stamp it out and run it. If you want to start hacking a little bit, you can get into the collective layer. If you want to start hacking even more, you can use CuTe. Everything inside of CUTLASS is implemented in terms of CuTe, and I'm going to show you some of that base-level stuff. Inside of here there are CuTe examples, which are really these bare, single-file GEMMs that serve as examples both for internal NVIDIA engineers and for external people interested in CuTe, and they show you how to use some basic operations. And then there are over a hundred CUTLASS examples now, I believe, covering everything from tensor contractions, convolution, flash attention, multi-headed attention, grouped GEMM, FP4, Stream-K, epilogues, and distributed GEMMs: optimizations on top of optimizations. It's amazing. This is the most popular NVIDIA open-source project, period. Hands down. It's got 4.2 million downloads per month and 7,500 GitHub stars. This is a wildly successful project, and I am ecstatic to be working on it.

This next slide I copied from a presentation at GTC, and I have never seen it before in my life. So let's take a look at what it says. Major pain points with C++: C++ templates are inconvenient. Oh, this feels like a personal attack. Additional mental load when writing compile-time logic; error messages are longer than novels. Whoever wrote this doesn't read. They don't like reading. They don't like full information in their error messages. C++ templates suffer from slow compilation time. Sounds like a skill issue. Front end too generic for our purposes. Wow. Okay. Anyway. Oh, my code's compiling; it takes forever. I can probably fix this slide real quick, I think. There we go.

Okay. So I am a C++ fanboy. I grew up in C++. C++ has an amazing architecture model and compute model. But I agree: C++ is hard, it's slow to compile, people have complaints. That's okay, because we just introduced CUTLASS 4.0. For all of the things that I'm about to show you, we have this cute little python here, reading a CuTe DSL book. All the things I show and talk about in this talk are now being ported to Python. The benefits of that, of course, include integration with things like PyTorch and all of your ML kernels. More benefits include blazing-fast compile times, over 100x improvements, which C++ has a hard time with. And you can just start writing CUDA kernels in native Python itself. This is a hello world, and there are a ton of resources if you go to the NVIDIA CUTLASS GitHub repo: lots of documentation, lots of new examples. I highly recommend checking it out.

But I'm here to talk about CuTe. Everything that I say here should also apply to Python. If it doesn't, let us know about it; we would like to know. So what is CuTe? CuTe, in a slide, is two things. CuTe layouts are a representation: they allow us to express data layouts that go far beyond row-major and column-major, and they let us write extremely generic algorithms like copy, GEMM, and reduce. We only need to write these once. They allow us to encapsulate metadata; I'll explain what that means later. CuTe layouts are also an algebra. What does that mean? It means I can combine layouts to create new layouts, and I can implement way more generic tiling and partitioning than I have ever seen before in any other system. So let's see how some of this works.

The way that I think about CuTe: okay, let's talk about loops first. If we consider this arbitrary for loop, and I wanted to represent it, I could write all of its parameters down inside this box: the loop starts at two, it ends at 50, it increments by three, it has a base pointer of A, and it does this weird index transformation inside of the loop. That's complicated to me. But an exactly equivalent code can be written like this: just transform the loop's starting position to zero, transform the loop's increment to one, perform some index math on the base pointer, and transform any index transformations that are inside the loop. And this, I claim, is much more interpretable, because now I can immediately say that there are 17 elements that I'm modifying, those 17 elements are all strided away from each other by 21, and they start at this particular position in memory. Just by glancing at the original for loop and its parameters, I could not tell you that. We can derive it, of course, but I could not immediately tell you that. So this is a 17-element vector somewhere in memory whose elements are strided away from each other by 21.
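A quick sketch of the normalization in Python. The slide's actual index transformation and pointer aren't reproduced in the transcript, so this assumes a hypothetical transformation f(i) = 7*i and an inclusive bound i <= 50, which together reproduce the 17 elements strided by 21 described above:

```python
# Toy illustration of normalizing a for loop (not CuTe code).
# Original form: for (i = 2; i <= 50; i += 3) use A[f(i)], with a
# hypothetical index transformation f(i) = 7*i.

def original_offsets():
    return [7 * i for i in range(2, 51, 3)]

# Normalized form: start at 0, increment by 1; the constant offset is
# folded into the base pointer and the strides are combined.
def normalized_offsets():
    base = 7 * 2       # constant offset accumulated into the pointer
    stride = 7 * 3     # 17 elements, each 21 apart
    return [base + stride * k for k in range(17)]

print(original_offsets() == normalized_offsets())  # True: same 17 addresses
```

The point is that the normalized form makes the element count, the stride, and the starting offset readable at a glance.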

We can do the same thing for two-dimensional loops; this is not constrained to one dimension. For a two-dimensional loop the parameters start exploding very quickly. But again, we can transform it so that the base index is zero and the increment is one. All the transformations are happening inside the loop, and all the constant offsets are accumulated into the pointer, so we just have a base pointer. And here it looks like we have a two-dimensional object taking two indices, where one index iterates to four and the other iterates to 20. So maybe we could interpret this as a 4x20 matrix. And then it's got some strides, which tell us where the elements of that matrix actually live. Again, by just looking at the original code, I couldn't really tell you those things. I couldn't tell you what's going on there.

So by simplifying the shapes that we're looking at, and by simplifying the base pointers that we're looking at, we can start talking about building this layout object, which is going to be a mathematical function that tells us how to map logical matrix coordinates to physical value indices. And this is not new; it's not particularly interesting.

For example, if I have a 2x3 matrix, I can represent it in column-major, and I'll list out the strides explicitly. As I go down the first mode, of size two, the stride is one: D follows A, E follows B, F follows C, as you can see in the physical representation. Row-major is the opposite: the stride of one just appears in the other mode. We can immediately generalize this: there's no reason why I have to use three here or two here. I can stride by four, which gives me a padded column-major: the actual values of the matrix live in these positions in memory, and the other positions are just dead space. They're just not used. Okay, that's fine.

This immediately extends to tensors as well. If I have a rank-3 tensor, I can give each of its modes independent strides, and that tells me where the elements live in physical memory. In general, if I want to map from my logical coordinate to my physical offset, I just take the inner product of the coordinate with the stride. Does that make sense to people?
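The inner-product rule is easy to sketch in a few lines of Python (a toy model of the idea, not the CuTe API), using the 2x3 matrix from the example:

```python
# Map a logical coordinate to a physical offset: inner product with the stride.
def crd2idx(coord, stride):
    return sum(c * d for c, d in zip(coord, stride))

# A 2x3 matrix traversed in logical column-major order (i fastest):
coords = [(i, j) for j in range(3) for i in range(2)]

col_major = [crd2idx(c, (1, 2)) for c in coords]  # stride 1 down a column
row_major = [crd2idx(c, (3, 1)) for c in coords]  # stride 1 along a row
padded    = [crd2idx(c, (1, 4)) for c in coords]  # padded column-major (leading dim 4)

print(col_major)  # [0, 1, 2, 3, 4, 5]
print(row_major)  # [0, 3, 1, 4, 2, 5]
print(padded)     # [0, 1, 4, 5, 8, 9] -- offsets 2, 3, 6, 7, ... are dead space
```

Only the stride tuple changes between the three cases; the coordinate-to-offset function is the same inner product throughout.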

So, Chris, a lot of the insight here is that all of these mappings from logical to physical are constrained to be affine functions, right? So I guess, are you also suggesting that this is the only kind of useful transform? You made a face, so maybe you disagree with me. I'm curious.

Yeah. So I agree these are all affine transformations, and those happen to be the most useful and the most common. But they are not the only transformations you could imagine: triangular matrices, if you wanted to do those; you could have no memory at all and compute values on the fly. You can do lots of things with this layout. But these are the most common; these are the ones people like to start with. Row-major and column-major: I think everybody's implemented something like that at one point. To that point, I do think of CuTe as a generalization of these kinds of really common patterns. For example, I compare CuTe tensors to mdspan very often, but I would also claim CuTe is much, much easier to use, at least for all of the very common applications we're going to be talking about, which do happen to be affine. You can do fancier things in CuTe; I'll touch on that if we have time.

But yeah, we'll focus on these first. If you're familiar with std::mdspan: I don't like its interface, and I don't like a lot of its abstractions. I should just be able to create what I call a layout, with a certain number of rows and a certain number of columns. Give me an iterator to whatever data you want to back this tensor with, give me a layout that describes this coordinate-to-index mapping, and then we're off to the races. We can have our double for loops, we can index into our matrix and set it to whatever we want, and we can assert that the rank of this matrix is two; I gave it two modes here. Cool.

A tensor is a view, not a container. It can take any kind of iterator you want. The rank is the number of modes of a tensor, sometimes called the number of dimensions: a matrix is rank two, a vector is rank one. The size is the extent of a mode; here we're just asking what the size is along the zeroth mode and what the size is along the first mode.

And the stride is the distance between consecutive elements of a mode. The cool thing about CuTe, the first thing I did in CuTe: very often I know what these values are, either the shapes or the strides, at compile time, and I don't want such an integer to require any storage in memory or any compute at runtime. So you can just replace these dynamic integers with static integers. This Int<22> is a static integer. We didn't change any other code: we're still just going to make a shape out of the static integer and the dynamic integer. We can still assert that this is a rank-2 matrix, we can now static_assert that the size of the zeroth mode is statically known to be 22, and we can keep the same loop structure. We don't have to change anything else, so this is effectively an optimization.

And here I've printed out the matrices as well. This first one, you can see, has a pointer and the address of that pointer: 32-bit elements, because we've got a pointer to floats. It's 22 by 19 with the default strides, and it defaults to column-major: a stride of one (that's a suspicious-looking underscore) and a stride of 22. If I make this mode static, then I can again print the matrix, and you can see the underscore: the underscore indicates that the value is statically known. So this mode is statically known to be size 22, this mode is dynamically known to be size 19, and now both strides are static, which is pretty cool. I didn't tell it the strides; these are the default column-major strides. So for default column-major we've got a static stride of 1 and a static stride of 22, which means that when I index into this matrix, the inner product can build in these statically known integers. And the compiler is actually very, very good about making that much faster than with dynamic integers.

And then of course we can specify the strides as well. We can make an arbitrary shape and an arbitrary stride, and those strides can be both dynamic and static: anything you want. And we can extend this to arbitrary affine layouts. LayoutLeft is the default column-major that I was telling you about. LayoutRight is the opposite, default row-major: it reads from right to left instead of left to right. And then here I'm just explaining these static integers.

So that's all well and good, and this is all kind of one-to-one: you can do the equivalents with std::mdspan or pretty much any other tensor library. Well, except for the static-int enhancement, which is pretty cool, I think, and other tensor libraries have their own ways of doing this. I think this is very terse and easy to work with.

Now we can get into the interesting stuff, though. In all my work with tensors, it is very common to be in the following situation. Consider a 2x2x2 tensor, rank three, and I've got its layout here: it's shaped 2x2x2 and strided by (4,1,2). That means that going across the row the stride is one, so A and B, and C and D, are stored next to each other, and so on; follow the strides to figure out what the physical storage pattern is. It's very common to want to take a tensor like this and fold it into a 2x4 matrix, and in this case we can absolutely do that. This is really cool: I can just grab that same pointer and replace the rank-3 layout with this rank-2 layout, and everything just works. This one mode goes across the row, so A, B, E, F, and there it is, A B E F, in the physical memory. Everything's happy.

And I was able to do this, and published papers on this to great effect, where pretty much any tensor contraction can be mapped to a batched GEMM. We do that by grouping the modes together. We can identify modes and group them: call this the row mode, call this the column mode, call this the reduction mode because it's repeated along here. The batch mode appears in all three of these tensors. And we canonicalize this tensor contraction into a batched GEMM, which is equivalent to just folding the tensor into a matrix and then passing it off to cuBLAS or your favorite BLAS library.

library. The problem that always bugged me was the following. I can fold this tensor into a

following. I can fold this tensor into a 2x4 matrix, but I can't fold the tensor into a 4x2 matrix.

There is no stride that I can put here.

So A C E G A C. So that's a stride of four. But

A C. So that's a stride of four. But

then E is a stride of two away from A.

And we actually went backwards, right?

We went from A to C, backwards to E, and then all the way to G.

So there's no integer stride here that will describe what is where my elements are stored down this mode. So this asymmetry bugged the crap

mode. So this asymmetry bugged the crap out of me for a long time. Why can't I fold it one way but I can't fold it the other way? And this is preventing me

other way? And this is preventing me from just using all these all these gem and blaws libraries off the shelf to perform whatever tension injection I want to perform.
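The asymmetry can be checked numerically. Here's a toy sketch (plain Python, not CuTe) using the (2,2,2):(4,1,2) layout from the example:

```python
# The rank-3 tensor's coordinate-to-offset map: shape (2,2,2), stride (4,1,2).
def idx(i, j, k):
    return 4 * i + 1 * j + 2 * k

# Fold to 2x4: column c enumerates (j, k) colexicographically, c = j + 2*k.
row0 = [idx(0, c % 2, c // 2) for c in range(4)]
print(row0)  # [0, 1, 2, 3] -- a flat integer stride of 1 works for this mode

# Fold to 4x2: row r enumerates (i, j), r = i + 2*j.
col0 = [idx(r % 2, r // 2, 0) for r in range(4)]
print(col0)  # [0, 4, 1, 5] -- steps of +4, -3, +4: no single integer stride
```

The 2x4 fold's combined mode happens to be affine in its linear index, while the 4x2 fold's combined mode is not, which is exactly the asymmetry described above.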

So I went back to this work and said: oh, well, we're just combining modes and identifying them as other modes, so let's do that. Instead of trying to fold this as a flat 4x2 matrix, I know I can embed more modes inside of this mode, and actually keep track of the fact that there is a matrix inside of this particular mode. That's kind of weird, but according to this inner-product rule it still works, as long as I index into this matrix with the right coordinates. Now the coordinates for a row are going to be two-dimensional, which is weird, but it works according to the inner product. It's a nice little generalization, and we can see it just works if we want to index into a 4x2 matrix this way. But I don't want to, right? Because if I have to index into a 4x2 matrix using these coordinates as a row index, then it's not really a matrix; it's not the same as just treating it as a 4x2 matrix. Okay.

I mean, we could use indices here and just internally map this row index into this row coordinate and then perform the inner product, and that would work. And then, if we squint our eyes and kind of forget about this hierarchical shape, and use integer indices for the rows instead of coordinates, then it really is a 4x2 matrix. I can treat it in code exactly identically to a 4x2 matrix. That's kind of cool. That's really cool.

But this still bugs me; it still feels like an asymmetry, right? The 2x4 view was really, really simple, while the 4x2 view requires this hierarchical shape and this row-index transformation. What's actually happening? Why isn't this perfectly symmetric?

Then you can take a step back and say: oh, this 4 mode is actually a folded version of the other two modes. It's just that we recognized that the transformation from the row indices to the row coordinates, followed by the inner product with this particular column-major stride, is exactly equivalent. So this (2,2):(1,2) is exactly equivalent to 4:1. The indexing is the same; the result is the same. So the top one is exactly equivalent to the bottom one. We can index both of these with raw indices, and we can index the top one with a coordinate. But if I'm writing code, I'm just going to use raw indices. So that was really cool, and I'm like: oh, so this is all symmetric.
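This can be made concrete with a small Python sketch of hierarchical layouts (a toy model of the idea, not the actual CuTe implementation): a layout maps a coordinate to an index via a recursive inner product, and linear indices are converted to coordinates colexicographically.

```python
# Toy model of hierarchical (shape, stride) layouts.

def size(shape):
    # Total number of elements in a (possibly nested) shape.
    if isinstance(shape, int):
        return shape
    p = 1
    for s in shape:
        p *= size(s)
    return p

def idx2crd(idx, shape):
    # Decompose a linear index into a colexicographic coordinate.
    if isinstance(shape, int):
        return idx
    coord = []
    for s in shape:
        n = size(s)
        coord.append(idx2crd(idx % n, s))
        idx //= n
    return tuple(coord)

def crd2idx(coord, shape, stride):
    # Recursive inner product of coordinate and stride.
    if isinstance(shape, int):
        return coord * stride
    return sum(crd2idx(c, s, d) for c, s, d in zip(coord, shape, stride))

# (2,2):(1,2) computes exactly the same function of a linear index as 4:1 ...
a = [crd2idx(idx2crd(i, (2, 2)), (2, 2), (1, 2)) for i in range(4)]
print(a)  # [0, 1, 2, 3] -- identical to stride-1 indexing
# ... while (2,2):(4,1), the 4x2 row mode from before, cannot be flattened:
b = [crd2idx(idx2crd(i, (2, 2)), (2, 2), (4, 1)) for i in range(4)]
print(b)  # [0, 4, 1, 5] -- no single integer stride produces this
```

Squinting at raw indices, the hierarchical (2,2):(1,2) mode and the flat 4:1 mode are literally the same function, which is the symmetry being described.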

I can fold any tensor I want into any matrix I want just by combining modes like this. And instead of trying to flatten, which we call coalescing, all the modes, I can just keep them next to each other and perform these computations.

So let's start formalizing this a little bit. How do we build up these types? I'm going to call this a shape. A shape is this hierarchical tuple concept, and it holds ints: we define it by saying a shape is an int or a tuple of shapes. It's recursively defined like that, and, for lack of a better term, that is a pretty cute definition.

A stride is similar. For all intents and purposes you can think of strides as integers as well, but actually they can be anything that supports inner products with integers; in mathematical terms, that's called an integer module. We are free to think of strides as only being integers for the time being, though you can do some really clever things by playing with your stride objects. But we do want the stride to be congruent with the shape, meaning it has the same hierarchical tuple structure. A layout is simply a pair of a shape and a stride, and it maps logical coordinates within the shape to an index. And a tensor is just a pair of a pointer, or an accessor, and this layout function. So any pointer you have lying around, an array, tagged pointers, implicit iterators, for example transform iterators out of Thrust: any kind of iterator you've got lying around, throw it into a tensor. Now it's shaped and strided, and you can do all kinds of things with it. So this is nice from a representational perspective, but...

I apologize, Chris, this is just a very newbie question: isn't an iterator a form of stride, or did I miss something? Basically, the offsets at which you stride are effectively an iterator. I'm confused about what counting and transform iterators actually are.

Yeah. So, okay, let me tell you a story about iterators. In Cutlass 2.x, everything was implemented with an iterator, and the iterator abstraction is a poor abstraction for multi-dimensional linear algebra, because an iterator is jumping between memory points arbitrarily. And so in Cutlass 2.x all of the iterators were bespoke implementations, and we had all of these complex layouts and patterns that we wanted to represent, and each one was represented completely independently. The revolution here was in replacing every single one of those iterators with this CuTe layout.

So in this perspective, an iterator is just an array. It's the data backing — a random-access thing where I can ask, hey, data backing, what have you got at index five? And that's all it is. An iterator should be dumb. You can start getting clever with it, but think of it as just data.

But not too clever, right? Because the composition of two affine iterators would still be an iterator. So I guess the abstraction doesn't let you do wacky stuff where you're going back and forth, presumably.

It does let you — oh. Oh, it does let you.

Okay, no worries. Oh, but we're not going to talk about that. Okay, I see.

Yep.

Um, okay. So this is a pretty compact representation. Again, all I need is an iterator: something that, when I send it an index, can look up in memory, or do whatever else it wants to do, and give me back a value — as long as that value is consistent. It should be a consistent lookup.

And then the really fascinating part is that you can run with this really far and create what I call an algebra of these layouts, because these are affine. Like Mark just mentioned, you can combine affine functions, and you can start asking, hey, what can I do with two layouts? I can concatenate layouts, certainly. I can functionally compose two layouts, and it turns out that the functional composition of two layouts is almost always another layout. I can take the right inverse of a layout and get another layout. I can take the left inverse of a layout and get another layout. I can take the complement — complement is a weird definition. There is a notion of a product of two layouts, which you can think of as: I replace every element of one layout with a completely other layout, so a layout of layouts. I can form what we call the divide of a layout, so I can split a layout in two according to all of the elements that are in another layout. And this starts to get really powerful really fast.

So just to be clear — and I don't want to jump into the algebra quite yet — the shape defines these coordinate mappings, right? We can go from one-dimensional coordinates to logical n-dimensional coordinates to logical hierarchical coordinates. And then finally we take the inner product with the stride, and that gives us our physical storage index. We take that physical storage index, we ask our pointer, hey, what have you got, and the pointer comes back and tells us our value.
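As a concrete illustration of that mapping, here is a small Python sketch (hypothetical helper names, not the CuTe C++ API): a layout is a shape/stride pair, a 1-D coordinate is unflattened into the shape with the leftmost mode varying fastest, and the physical index is the inner product of the resulting coordinate with the strides.

```python
# Illustrative sketch of a CuTe-style layout: shape + strides, mapping a
# 1-D logical coordinate to a physical storage index.

def coord_from_index(i, shape):
    """Unflatten a 1-D coordinate into an n-D coordinate, leftmost mode fastest."""
    coord = []
    for s in shape:
        coord.append(i % s)
        i //= s
    return coord

def layout(shape, strides):
    """Return the layout function: 1-D coordinate -> physical index."""
    def f(i):
        coord = coord_from_index(i, shape)
        return sum(c * d for c, d in zip(coord, strides))
    return f

col_major = layout((4, 8), (1, 4))   # stride 1 down a column
row_major = layout((4, 8), (8, 1))   # stride 1 across a row

print(col_major(5))  # coord (1, 1) -> 1*1 + 1*4 = 5
print(row_major(5))  # coord (1, 1) -> 1*8 + 1*1 = 9
```

Row-major and column-major differ only in the strides; the logical coordinate system is identical.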

So, a bunch of quick examples. Obviously we have column-major. We have row-major. It's very easy to do a padded column-major — that's just a different stride. Cutlass 2 had these column-major interleaved iterators, but what it was effectively trying to model was this layout, where it kind of goes halfway and then scoots back and then comes back up to the top — so it's kind of two blocks. And then you can do this thing, which doesn't even have a name. Why are we trying to name all of these things? Just tell me the layout and we'll go from there. But the point is that all of these layouts can be interpreted as 4×8 matrices, and I can index into them like they are 4×8 matrices, and I don't necessarily care where all of the data lives.

So, Chris, a question here: could you literally just put your finger somewhere on this tensor, draw randomly, and build arbitrary layouts? I'm trying to think about what is possible to express versus not.

Yes, it is possible to express some really crazy, bonkers layouts. I have lots of examples if you would like one.

And swizzle layouts are even crazier.

So here's a crazy one. Let's look at it. This is iterators versus layouts. Recall again that in Cutlass 2, all of these kinds of loops were implemented with iterators, which take me from one block to the next block, right? And this is how it was implemented, on the right here. If you want to write this code and you want to maintain this code — kudos. But we were able to replace every single implementation. I did a count at one point — I should have put the numbers in here — something like two to three hundred implementations of iterators spread across 80 files, composing something like 30,000 to 40,000 lines of code. And none of these iterators work with each other: you need this iterator with this computation, with that math, and when I partition from the grid level to the thread level, then I need to use only that iterator. It was a disaster.

But the point is you can replace all of that. So this is a layout. You can see I've got a relatively complicated layout here. I've got this magical swizzle function, which is doing something funny that turned out to be useful inside of the architecture. And then I can do this nice little tile_to_shape. I have this layout, and how big is it? It's a 16×16 layout, right, if I take the product of these two shapes along these two modes. But I didn't want a 16×16 layout; I wanted a 128×64 layout. Oh, but I can do that, because I can just take the 16×16 layout and tile it, and now I've got a 128×64 layout.

And we can print this too, by the way. Just print_latex and we get this nice little printout. Here's my 16×16 layout. I've tried to color it to be reasonable, but it's still not very reasonable. Point is, I would like to write this once and then forget about it.

I have a much simpler question after this digression. The iteration order of a layout — there's always that natural indexing from zero to the size of the layout, right? I guess the question is: is it always left-to-right, increasing along an individual mode?

I mean, write your loop however you want. If you're just talking about writing your for loop, you're welcome to write your for loop however you want. The traversal — I think the figure is on the next slide.

Would this just be: from left to right, exhaust the size of an individual mode and then increment the next one?

So, normally what happens in applications — and we can get there in the next one; we're going to write algorithms with these tensors — normally what happens is I would have two for loops, one iterating across the rows, one iterating across the columns. And when I access that (i, j) pair, that's the value, that's the offset that I'm going to be looking up. That's all it is.

Cool. So we have these slightly more advanced patterns. If you're familiar with Morton codes, this happens to be a three-level Morton encoding expressed as a CuTe layout. This is an 8×8 layout. You can see — I've alluded to these algebraic operations we have — I create a small 2×2 layout, take the product with itself to create a bigger one, and then take a product again to create this 8×8 one. This is what it creates. It doesn't matter how I produced it; I could write this down manually, right? But it's cool to just take products of layouts and make things like this too.

I can index into this in multiple ways. Like I said, I can treat this like an 8×8 matrix and index into it with row and column indices, but there's no reason why I can't also just use an integer coordinate that gets mapped to the exact same thing. I can use these hierarchical coordinates at any level of the hierarchy that I'm interested in. They all get mapped to the same position — these are all ways of asking for position 49.

And of course I can do matrix-y things. I can slice along logical sub-boundaries. So if I wanted the second column, this is how I would write that, and it gives me back — obviously this is not just a vector or anything — another tensor. It's got its own layout. We can figure out the pointer offset and the new layout, and go off and do whatever we need with the second column of this matrix.
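The Morton example can be sketched in Python. Assuming the standard three-level Morton strides for an 8×8 layout (flattened shape (2,2,2,2,2,2) with strides (1,4,16,2,8,32), i.e. row bits interleaved with column bits), the same physical position — 49 in the talk's example — is reachable from a (row, col) pair or from a single 1-D coordinate:

```python
# Illustrative sketch of hierarchical coordinates on a Morton-style layout.
# Assumed strides: row bits at index positions 0, 2, 4; col bits at 1, 3, 5.

SHAPE   = (2, 2, 2, 2, 2, 2)       # ((2,2,2),(2,2,2)) flattened
STRIDES = (1, 4, 16, 2, 8, 32)     # interleaved row/col bit strides

def index(coord):
    """Inner product of an n-D coordinate with the strides."""
    return sum(c * d for c, d in zip(coord, STRIDES))

def split(i, sizes):
    """Split an integer coordinate across modes, leftmost fastest."""
    out = []
    for s in sizes:
        out.append(i % s)
        i //= s
    return out

row, col = 5, 4
# 2-D coordinate: split row and col into their three bit-level modes each.
print(index(split(row, (2, 2, 2)) + split(col, (2, 2, 2))))  # -> 49
# 1-D coordinate row + 8*col resolves to the same position:
print(index(split(row + 8 * col, SHAPE)))                    # -> 49
```

Every level of the hierarchy — 1-D, 2-D, or fully split — lands on the same physical index.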

There are nice mathematical properties here. The shapes form a poset. So for example, if I write an algorithm, and this particular algorithm requires that you pass me an 8×4 matrix — or in this case, a 6×6 matrix — then in principle you can pass me any one of these as well, and that algorithm will be happy, because I can index into a 6×6 matrix the same way for every single one of these tensors. Which is really, really attractive. So I can start doing really clever, crazy things with where all of these values live and how they're being accessed. So, like I said, this was a great success for representing everything that Cutlass ever wanted to represent, and more.

So let's write some algorithms. This is cool.

Let's write copy. I want to copy from one tensor to another tensor. Okay, I've left a lot of space here. We probably need to consider the rank, right? So if the ranks are two-dimensional, then we need two for loops; we'll probably iterate over the rows or the columns at once, but we need to do that together. What about three-dimensional tensors? It gets a little harder.

But all of these tensors accept integer coordinates. And so that's my implementation of copy. And that's universal — it will work on any tensor in CuTe.
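The universal copy loop can be sketched in Python (the `Tensor` class here is a hypothetical stand-in, not CuTe's): one rank-agnostic loop over integer coordinates, with each tensor's own layout turning the coordinate into a physical offset.

```python
# Illustrative sketch of a rank-agnostic copy over integer coordinates.

class Tensor:
    def __init__(self, data, shape, strides):
        self.data, self.shape, self.strides = data, shape, strides

    def size(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

    def offset(self, i):
        """1-D coordinate -> physical offset via the layout."""
        off = 0
        for s, d in zip(self.shape, self.strides):
            off += (i % s) * d
            i //= s
        return off

def copy(src, dst):
    assert src.size() == dst.size()
    for i in range(src.size()):        # one loop, any rank
        dst.data[dst.offset(i)] = src.data[src.offset(i)]

# Transpose falls out for free: copy an 8x3 column-major tensor
# into an 8x3 row-major tensor.
a = Tensor(list(range(24)), (8, 3), (1, 8))   # column-major
b = Tensor([0] * 24, (8, 3), (3, 1))          # row-major
copy(a, b)
```

The same loop handles gathers, scatters, and broadcasts, because all of that variation lives in the strides, not in the algorithm.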

Is this considered a natural ordering of a layout? Because, I guess, is every layout defined by this sequential index up to its size? That's kind of what I was referring to, where in your previous slide you have this iterating point of movement, and I'm curious what the little line —

That was just for exposition. It just shows you the physical order within the storage. But yeah, it doesn't necessarily have anything to do with the iteration order or anything.

So this is pretty immediately interesting. We can start pointing out some properties. I would call copy fundamentally a rank-one algorithm: I do not care that you are passing me these super high-dimensional tensors or anything.

And what's most interesting to me — again, remember this static versus dynamic stuff — is that for static shapes, this is optimal. By that I mean that if the shapes of these two tensors are both static, then this loop can be completely unrolled, the coordinate transformations that I have to apply can be computed completely statically, and the inner product to get the physical offset can also often be computed statically. If the strides are static, then that will also be completely statically computed. If they're dynamic, it'll still be optimal, because I know the hierarchical coordinate I'm performing the inner product with statically. And the point is that static shapes are extremely common, especially in CUDA programming, because RMEM — register memory — is fundamentally static. Shared memory: we almost always know the layout of shared memory that we want to be computing with. Tensor memory in Blackwell is always static. And global memory — that might be a dynamic shape, but the very first thing we do to every global memory tensor when we're writing a CUDA kernel is tile it, and that tile is static. So the subtile that we're working on within the block of our CUDA kernel is almost always a statically sized tile of global memory.

So this starts to tickle you a little bit, and you can think about what you can do with this implementation. We can obviously copy one-dimensional arrays to each other. We can copy multi-dimensional arrays to each other — so this would be a rank-three tensor copy. We can gather a bunch of elements that are distributed around, away from each other, and gather those all into a vector. We can do the opposite: scatter a vector into a bunch of elements that are spread around everywhere. These are kind of interesting: zero stride is a thing. I can broadcast. So this is a logically eight-element vector where every element of that vector is the same value in memory, and I can copy that to a non-broadcast vector where every logical element is a distinct element. Similarly, I could copy a constant vector to another constant vector. Or I could perform a transpose — copy is an implementation of transpose. I can take an 8×3 column-major tensor and copy it to an 8×3 row-major tensor, and this implementation of copy will just do all of these for me.
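The zero-stride broadcast case can be sketched the same way: assuming a source layout (8,):(0,), every logical coordinate aliases one memory location, and the ordinary copy loop fans it out into a dense destination. Plain lists stand in for memory here; this is an illustration, not the CuTe API.

```python
# Sketch of the zero-stride broadcast: a logically eight-element vector
# whose every element aliases the same memory location, copied into an
# ordinary stride-1 vector.

def offset(i, shape, strides):
    """1-D coordinate -> physical offset via a shape/stride layout."""
    off = 0
    for s, d in zip(shape, strides):
        off += (i % s) * d
        i //= s
    return off

src_mem = [7.0]                  # one value in memory
dst_mem = [0.0] * 8

# src: shape (8,), stride (0,)  -> every coordinate maps to index 0
# dst: shape (8,), stride (1,)  -> ordinary dense vector
for i in range(8):
    dst_mem[offset(i, (8,), (1,))] = src_mem[offset(i, (8,), (0,))]

print(dst_mem)   # -> [7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0]
```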

We got an interesting question in the chat: in your last slide, what is "optimal" with respect to?

In the context of: I don't want any runtime overhead whatsoever. I've suggested that we should be doing these coordinate computations and these inner products everywhere, everything like that. I don't want that to cost anything at runtime. And so long as the shapes are static, that almost certainly will not cost anything at runtime.

And I guess, Chris, another newbie question: is this the heart of the complaint regarding slow compilation times? Especially when I looked at your slide 37 — you're generating, straight up, an algorithm; the alternative is someone writing separate functions for each case. Is this the source of the compilation problem, or is it something else entirely?

Yeah, of course. It's propagating all the static information around, and I make absolutely sure that I never lose track of any static information, ever, because that's the death of runtime performance.

And related: we had Ismail ask a question, which was, why templates and why not things like constexpr? Again, forgive me, I'm not a C++ expert.

The compiler is very, very bad at propagating constexpr around. There are very easy ways to lose track of constexpr, and then all of a sudden you've got a dynamic value in your code.

All right, last question for me. If you go to slide 37 — yeah, next. So, things like, for example, when I look at your 1D array copy versus transpose — those seem to me like, oh, you can be clever algorithmically and do something. But things like gather and scatter are things where the math needs to touch the real world. So I'm curious: there's what you can express, and then there's what low-level APIs you need to call in the Nvidia ecosystem, and I'm wondering if you could help me under— okay, it sounds like it's coming later.

I can show you an example of that right now. This is more of my CuTe copy stuff — we can talk about optimizations, but I don't want to get into that right now. But imagine I have a global memory tensor. First thing we do, we tile into it, right? I have very generic ways of tiling into it. For example, these are four different ways of tiling into global memory, and I claim that every single one of them is a 4×8 subtensor. It's just going to have different strides, whatever. Cool. Okay, now here's the shared memory that I want to copy it into — here are four different ways that I can arrange my shared memory. So I can define my shared memory layout, and I might want to define it one way or the other in terms of access patterns or vectorization or what have you. There's one way to copy these: no matter what I have, I can copy a 4×8 tensor to a 4×8 tensor. Any one of these tiling patterns, any one of these shared memory layout patterns — I'm done. That's the reason why this is so powerful.

And of course there are optimizations that we can make inside of copy. And we can do those very well, in fact, because we can inspect the layouts and make deductions about what we can do better — asking questions like, where's the stride-one mode? Should I be iterating this way or that way? Maybe I can vectorize elements automatically. Maybe I can prove that there are no race conditions. Maybe I can eliminate redundant instructions, or do a lot of fancy things — use more complex instructions to perform an equivalent copy.

Same thing for GEMM. GEMM is fundamentally a rank-three algorithm. This is my implementation of GEMM. And again, we can start using it for a huge breadth of applications. So all of the NN, NT, TN, TT BLAS GEMMs — I'm not going to list them all out, but they're just swapping LDA and 1 in the layouts. We can do an NT GEMM where I'm transposing the C matrix instead. Of course you can do this in BLAS as well, but you've got to use the transpose trick, and half the people probably don't know what I'm talking about, but it comes up at least a couple of times a month in our Slack. The BLIS GEMM — BLIS is another BLAS-like library — uses arbitrary strides along both modes of the matrix. I always thought that was kind of interesting. It's cool — we get it for free. We can do tensor contractions, of course: if we have multiple modes, we can do a tensor contraction with multiple m modes, multiple k modes, doesn't matter. We can do convolutions. Convolutions are a little crazy, but im2col fundamentally transforms a convolution into a matrix multiplication. We can represent the im2col transformation with a CuTe layout and then just call GEMM, and you're done. And of course we have to make proper optimizations, but the point is that convolutions can be tiled the exact same way that GEMMs can be tiled. In principle, we don't have to write lots of extra code for it; we can reuse the same optimizations and kernels that we've written for matrix multiplication.
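A layout-generic GEMM can be sketched in Python as one plain triple loop over (m, n, k), with transposition expressed purely through the strides — a sketch of the idea, not Cutlass's implementation. Swapping a stride pair between (LDA, 1) and (1, LDA) is all it takes to switch between the N and T variants.

```python
# Sketch of a stride-generic GEMM: C[m,n] += sum_k A[m,k] * B[k,n],
# where each operand is flat memory plus a stride tuple.

def idx(coord, strides):
    """Multi-dim coordinate -> physical index (inner product with strides)."""
    return sum(c * d for c, d in zip(coord, strides))

def gemm(M, N, K, A, sA, B, sB, C, sC):
    for m in range(M):
        for n in range(N):
            acc = 0
            for k in range(K):
                acc += A[idx((m, k), sA)] * B[idx((k, n), sB)]
            C[idx((m, n), sC)] += acc

M = N = K = 2
A = [1, 2, 3, 4]          # row-major [[1,2],[3,4]] with strides (K, 1)
B = [1, 0, 0, 1]          # identity matrix
C = [0, 0, 0, 0]

# Row-major A and B; a "transposed" A would just use strides (1, M) instead.
gemm(M, N, K, A, (K, 1), B, (N, 1), C, (N, 1))
print(C)   # -> [1, 2, 3, 4]
```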

How much time do I have, Mark? Do I have a hard stop?

You can go as long as you like, really. I've been really enjoying this. So if you want to go three, four hours, it's totally your choice.

Okay. Let me get through this section then, and I'll stop after this section and we can chat about where to go from there.

So: wouldn't the TMA load be more useful than the universal copy? Yes, it would — if only we had a way of constructing TMA copies with CuTe. Yeah, we have lots of TMA examples with CuTe.

So, CuTe composition is the workhorse of the CuTe algebra. All of these products, divides, tilings — they're all implemented with composition. I'm going to show you what that means. Let's go back to how our layout mapping works. We transform our coordinates, and then eventually we take the inner product of those coordinates with the strides in order to get our physical index. If we squint, or get a little drunk or whatever, we can ignore most of that and say, oh, a layout is just a function from integers to integers. And the first thing I can think of to do with a function from integers to integers is functional composition.

And it turns out that the functional composition of two layouts almost always produces another layout R. So I'm going to define the composition of two layouts A and B to produce another layout R such that R is shape-compatible with B — this is my notation for shape-compatible; this is that poset of shapes that I showed earlier. I want it to be compatible with what I'm composing with, which is just a shape constraint, and such that we satisfy the composition property: R(c) = A(B(c)), i.e. A evaluated at B of c is exactly what the composed layout gives you. So layout R is compatible with layout B, which means that all coordinates of layout B can be used as coordinates of layout R. We have left and right identities, obviously; this gives you associativity; this gives you left distributivity upon surjectivity of B; a bunch of other cute stuff.
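The composition property can be checked with a tiny Python sketch (hypothetical helpers, not the CuTe API). Here the composite of two one-mode layouts is itself expressible as a shape:stride pair, which is the interesting claim:

```python
# Sketch of functional composition of layouts: (A o B)(c) = A(B(c)),
# and the composite is again a layout.

def make_layout(shape, strides):
    """Return the layout as a function: 1-D coordinate -> physical index."""
    def f(i):
        off = 0
        for s, d in zip(shape, strides):
            off += (i % s) * d
            i //= s
        return off
    return f

A = make_layout((8,), (3,))   # a stride-3 vector layout
B = make_layout((4,), (2,))   # selects every other coordinate of A's domain
R = lambda c: A(B(c))         # functional composition

# The composite agrees pointwise with the layout (4,):(6,)
R_layout = make_layout((4,), (6,))
print([R(c) for c in range(4)])          # -> [0, 6, 12, 18]
print([R_layout(c) for c in range(4)])   # -> [0, 6, 12, 18]
```

R is compatible with B by construction: every coordinate of B (here 0..3) is a valid coordinate of R.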

Point is: let's suppose we have this composition function. Okay, let's do something interesting with it. Imagine I have a vector here. I'm not going to tell you the layout of this vector or anything else; I'm not going to tell you where the data lives inside of it. I'm just going to give you a vector. This is input to your program. Okay?

And now, inside of your program, you want to start partitioning this vector. So, for example, I want thread zero in my CUDA program to be responsible for these six elements. What I'm going to do is create another layout which points to those six elements — and by "points" I mean I'm going to list the coordinates, not the addresses or anything like that. The coordinates: coordinate 0, coordinate 1, coordinate 4, coordinate 5, coordinate 8, coordinate 9. I'm going to put those down in a list and say, hey, that's a layout. And I can represent that layout as (2,3):(1,4) — two elements strided by one, then three of those strided by four. I'm sorry — yeah, I want each thread to have these six values; I was doing it upside down.

And then I want the next thread to have these six values. So again, I list out their coordinates — not their addresses; I haven't told you what the addresses are, just the coordinates. And then we have two more threads lying around, and I want them to have the values in these positions as well. Okay.

So we've looked across the row here, and we've noticed that the values each thread receives form a layout. And now I look down the column, and I notice that where each thread starts is also a layout, and I can write that layout as (2,2):(2,12) — I stride by two first and then I stride by twelve.

I put these together and I can create a two-dimensional layout. Now, what is this object? This is a map from my thread index and my value index to a 1-D coordinate of my array. So if I am thread two and I want the fourth value of that thread, then that thread-value pair tells me I'm looking at coordinate 17, which is this one right here. Okay, that's a weird construction.

But if I have composition, then I can take a vector like this, functionally compose it with this thread-value layout, and I get another layout. Because the result has to be compatible with the layout I compose with, it's still 4×6. And if I want to partition it, then it's just a slice. So now thread one has all six values — these green values. This is thread one's ownership, thread one's subtensor of this vector. All six elements that thread one wanted, it now has.

And in code it literally looks like this. I have a tensor of input. I create a vector, right? I have this thread-value layout that describes this partitioning pattern. I literally call functional composition on the input tensor and my thread-value layout, and then I slice it along the thread index that I'm interested in. So this thread now owns this row of elements, which is these six elements in the original tensor.
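The worked example can be sketched in Python (hypothetical helpers, not the CuTe API): the thread-value layout ((2,2),(2,3)):((2,12),(1,4)) maps a (thread, value) pair to a 1-D coordinate of the vector, and slicing at a thread index recovers that thread's six coordinates.

```python
# Sketch of "partitioning = composition + slice" with the talk's
# thread-value layout.

THR_SHAPE, THR_STRIDE = (2, 2), (2, 12)   # where each thread starts
VAL_SHAPE, VAL_STRIDE = (2, 3), (1, 4)    # the six values per thread

def eval_modes(i, shape, strides):
    """1-D coordinate over the given modes -> offset."""
    off = 0
    for s, d in zip(shape, strides):
        off += (i % s) * d
        i //= s
    return off

def tv(thread, value):
    """(thread, value) -> 1-D coordinate into the underlying vector."""
    return (eval_modes(thread, THR_SHAPE, THR_STRIDE)
            + eval_modes(value, VAL_SHAPE, VAL_STRIDE))

# Thread 2, fourth value (value index 3) lands on coordinate 17,
# matching the worked example above.
print(tv(2, 3))                       # -> 17
# Slicing at thread 0 gives its six coordinates:
print([tv(0, v) for v in range(6)])   # -> [0, 1, 4, 5, 8, 9]
```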

I have a quick question. This makes sense. My curiosity question is: what if the thing you're partitioning isn't so well aligned? What's the general strategy for handling predication?

Predication's really cool — that would be an entire other talk. We have a formal and really strong approach to predication. So yes: just like we need to tile tensors into tiles where the tiles don't perfectly divide the matrix or something like that — that is totally fine. All of that is implemented with composition, and composition will work. It will do exactly what you think it does, which is: it will round up, but then you need this predication tensor to tell you which values are legal to access and which aren't.

Cool, makes sense.

And there are magic — and equally mathematical — ways of constructing that predicate tensor that will always protect you.

Yeah. Cool.

And so the moral of the story here is that, given a layout, given a tensor, partitioning — the term "partitioning" — is functional composition followed by slicing. And that's a pretty cool and powerful observation, because I can partition a vector like this in any way I want, just by defining the right thread-value layout, or however else I would like to partition it.

And then the punch line. We did this for a vector, right? We constructed this thread-value layout, which is a partitioning pattern, and we applied it to a vector of elements so that each thread would get a portion of that vector. But every tensor is also a vector: we can index into every tensor with integer coordinates, and we can think of every tensor like it's a vector. Which means what we've just done is devise a way to arbitrarily partition an arbitrarily ranked, arbitrarily laid-out tensor. And that is wildly, wildly powerful, because we can immediately apply it to MMAs — tensor cores — inside of CUDA.

So, for example, this is my favorite tensor core inside of CUDA.

This is the Volta 8×8×4 tensor core. This is what inspired CuTe — this is the first tensor core that I printed out in this way, using print_latex, and I was like, oh, that's so cool. If you read the ISA documentation, this is the partitioning pattern that is specified by the instruction. The instruction requires you to partition your input matrices so that thread zero has value zero here; thread one, thread two, thread three each have four going down. You can see this one's not even contiguous here. Thread zero has these eight elements here. The point is, these are the same thread-value layouts that we just did our exercise on, and we can encode that as metadata for this instruction. This is the only thing this instruction understands, so it needs to be part of the instruction's metadata in order for me to use this instruction.

And we can do this for every single tensor core in the universe. So this is a generic tensor core — a generic SIMT multiply. This is a DP4A, which was the predecessor to the tensor core. This is the first Volta tensor core. This is the Ampere double-precision 8×8×4 tensor core — looks pretty similar; different partitioning pattern for A and B, different partitioning pattern for C. We're getting bigger now: Ampere 16×8×8, for half-precision values. Again, just recording and writing down the partitioning patterns from the ISA documentation into these thread-value layouts. On Hopper it gets a little weird, because we have these shared memory descriptors that we need to construct, and the A and B tensors aren't even partitioned — but the C tensor is. Point is, we can construct these shared memory descriptors precisely because we can inspect the layout of the incoming matrices. We can validate that they have the right layouts, and we can construct the shared memory descriptor for you. So all of this is completely opaque.

So just a quick rundown: this is what we call an MMA atom. When we have the raw PTX and the PTX metadata, we combine those into what we call the MMA atom. It can check all of these layouts. It can create what we call fragments. But even cooler, we can just print them. So we have these MMA atoms: "show me what you're doing," print the LaTeX, and it shows me this partitioning pattern. Cool. This is exactly the thread-value layout that we were just talking about.

Then, the tiled MMA. We can multiply these, essentially. So if I have this instruction, I've printed out a single instruction's worth, right? This instruction operates at a one-warp level. I don't want a one-warp level; I want four warps. I want a 2x2 layout of this instruction. And so I take a product of that layout with a 2x2 column-major layout. It could equally be a 2x2 row-major layout, or a 4x1 layout, what have you. So now this is bigger and more complicated, but again, I can just print it, right? We can do fancy things, actually: we can permute the M mode, we can permute the N mode, and we can permute the K mode, and you still have an equally valid computational strategy. So this is an example of permuting the M mode. And what that does, it's hard to see, but it brings the threads' values next to each other, so it's slightly easier to design a shared memory layout that will have good access patterns for this MMA instruction. But effectively, what this is doing is interleaving multiple MMA tensor cores, which is really amazing.
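(The "take a product with a 2x2 layout" step can be modeled very roughly as a blocked product: repeat an atom's layout across a grid of tiles. The 8x8 atom, the 2x2 column-major tile grid, and the compact per-tile offsets below are illustrative assumptions, not the actual CuTe product definition.)

```python
def atom_offset(i, j, rows=8, cols=8):
    # A column-major 8x8 atom: offset = i + rows * j.
    return i + rows * j

def tiled_offset(i, j, rows=8, cols=8, grid=(2, 2)):
    # Blocked product: pick which tile we are in (tiles ordered
    # column-major over the 2x2 grid of warps), then apply the
    # atom's own layout within that tile.
    ti, tj = i // rows, j // cols
    tile_id = ti + grid[0] * tj
    return tile_id * (rows * cols) + atom_offset(i % rows, j % cols)

print(tiled_offset(0, 0))   # 0:   tile 0, atom origin
print(tiled_offset(8, 0))   # 64:  the tile below owns the next 64 offsets
print(tiled_offset(0, 8))   # 128: the tile to the right comes after both left tiles
```

Choosing a row-major or 4x1 tile grid instead just changes how `tile_id` is computed, which mirrors the "could equally be a 2x2 row-major layout or a 4x1 layout" remark.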

Finally, we can slice that, right? Each thread is going to need to access its particular data. So we take this giant tiled MMA that we've constructed, however we've constructed it, and we pass it our thread index. And now I've got an object which represents the partitioning pattern of the values this thread is going to inherit in A, B, and C. I've highlighted thread 24 here, and you can see it doesn't even make sense from a GEMM perspective, iterating over rows and columns and stuff. Like this one: what is this guy doing way over here? That's bonkers. I wouldn't want to keep track of this by hand. So how do we use this? How do we use this MMA?

We construct shared memory tensors. Again, these can have any layout that we want. We construct our MMA however we want, we get a slice of it, and then we can use that MMA to partition our shared memory. So we partition our shared memory, partition our shared memory, partition our shared memory. That gives us back our thread's subtensor, the one it's going to be responsible for in this instruction. And then we also need fragments. These are like registers. So these tensors are still shared memory; these tensors are register memory. The registers just have the same shape as the shared memory, which means that we can copy. So we copy from shared memory to registers, we copy from shared memory to registers, we clear the result accumulators, we write a triple for loop, really really basic, and call our really complex instruction, and then copy our registers back to global memory. This is global memory, not even shared memory, right? This one's global memory: gC is global memory, sB is shared memory. But we partition global memory the exact same way that the instruction expects, and so the instruction doesn't care if it's shared memory or global memory or what have you. So all of these are shaped right.

So I want to stop there. This is the basis of CuTe, and this is what made programming CUTLASS much, much, much easier, because clearly we have an extremely robust, extremely general representation of these tensor cores. We can access all of these subtensors; we can always see what's going on. If we have a bug, we can just print out these tensors, or print out these MMAs.

Optimizing very often just becomes figuring out the right shared memory layout so that this copy is fast, so that we can vectorize this copy internally. But otherwise, the MMA is just always right. It's difficult to write incorrect code when you adopt these kinds of patterns. So, that was only 15 minutes over. I'm pretty happy with that.

You don't have to stop now, by the way. I think you could go longer if you're interested. But maybe could you give us some sense of what else there is? Because I still see you're maybe a third of the way through your slides. What were some other ideas you were hoping to go over?

I've got a lot of slides.

There are just really cool things that we can do with this. I've already shown some pretty amazing results, in my opinion. But, for example, returning to copy: we have all of these possible inputs to copy, and you can ask about vectorization. What if I'm copying this layout to this layout? In principle, these two values are next to each other in physical memory, agreed? And these two values are next to each other in physical memory. So that means that I can perform a vectorized copy: I can issue one instruction of double the width instead of two instructions. And I can do that here for this blue element as well. The 24 and 25 I can slip into positions 18 and 19; they're next to each other in physical memory. And next to each other in physical memory has nothing to do with next to each other in logical memory. So 0, 1, 2, 3 and 0, 1, 2, 3: I can perform a four-way vectorization here in copying this tensor to this tensor. And that's because they have what we call a common sublayout. This one has a two-element common sublayout. This one has a four-element common sublayout. If only we could figure out what that common sublayout was.

I don't want to go too deep into it, but we can figure that out. We can start writing down the math for what exactly a common layout is. We invert that, and it turns into the inverse of one layout composed with the other layout. And that tells you exactly the overlap of those two tensors, and it tells you exactly the vectorization width that is safe to perform.
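(CuTe derives the safe width analytically from the inverse-and-compose formula; as a brute-force stand-in, here is a toy function that enumerates offsets and finds the largest power-of-two group size for which both layouts map consecutive logical coordinates to consecutive physical offsets. The example layouts are illustrative.)

```python
def offsets(shape, stride):
    # Physical offset of every logical coordinate, in column-major
    # logical order (first mode varies fastest).
    total = 1
    for s in shape:
        total *= s
    out = []
    for idx in range(total):
        off, rem = 0, idx
        for s, d in zip(shape, stride):
            off += (rem % s) * d
            rem //= s
        out.append(off)
    return out

def max_vector_width(layout_a, layout_b):
    # Largest power-of-two w such that every aligned group of w
    # logical coordinates is physically contiguous in BOTH layouts.
    oa, ob = offsets(*layout_a), offsets(*layout_b)
    v = 1
    while v * 2 <= len(oa):
        w = v * 2
        ok = all(o[i + 1] == o[i] + 1
                 for o in (oa, ob)
                 for base in range(0, len(o), w)
                 for i in range(base, base + w - 1))
        if not ok:
            break
        v = w
    return v

colmajor_4x2 = ((4, 2), (1, 4))   # fully contiguous in logical order
rowmajor_4x2 = ((4, 2), (2, 1))   # transposed physical order
pairs_4x2    = ((2, 4), (1, 8))   # contiguous only in pairs

print(max_vector_width(colmajor_4x2, colmajor_4x2))  # 8: vectorize everything
print(max_vector_width(colmajor_4x2, pairs_4x2))     # 2: two-element common sublayout
print(max_vector_width(colmajor_4x2, rowmajor_4x2))  # 1: no common contiguity
```

The two-element and full-width cases mirror the two examples from the slides: the widest safe vector load is exactly the size of the common sublayout.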

Is this another overall theme in your work, that you're basically mathematically proving a priori, at compile time, which optimizations are safe to perform?

Yes.

I see. Okay. Is another way to rephrase a common sublayout: is it like the minimum slice that is contiguous?

Yeah. So when I actually present this, and don't completely gloss over it, the question is: what even is a common sublayout, right? The simplest way to express it is that A and B have a common set of coordinates that map to consecutive offsets.

Makes sense.

Yeah. So there must exist a certain set of coordinates where they both produce the same consecutive offsets. And there are multiple ways of writing that down, right? Then we can play around with the math and the expressions here, and we end up at this really simple result: the inverse of B, composed with A.

I have a left-field question. I think the validity and power of CuTe for GEMM programming is very apparent. If someone at NVIDIA were to say, "Chris, write me a kernel to do X," are you going to reach for CuTe to express the partitioning of the data to threads and values, for any kernel you could think of?

Did you say "if someone approaches me"? It's someone every single week. And yes, I've reached for CuTe every single time. So for example, when we were trying to figure out how to program TMA: TMA is a very, very, very complex copy instruction inside of Hopper and Blackwell, and it requires an immense amount from the programmer. But I was able to use these CuTe operations. TMA has what's called the TMA descriptor, which describes the entire problem, and then you have to generate coordinates inside of that problem space. TMA takes the coordinate and the descriptor, does all the predication for you, and does the right thing, right? That descriptor is huge, and it's a pain in the butt to write, and it's confusing, and people get it wrong, and people don't optimize it quite properly. But I can inspect the global memory layout, and I can inspect the shared memory layout, and I can generate that TMA descriptor, and I can generate the right coordinates, and I can figure out where you are and how many bytes are going to be transferred and so on. Everything is done for you, and it's very safe, and it's relatively easy to write.

That makes sense. Yeah.

So this kind of programming model, yes, it's found a ton of applicability inside of GEMMs, but it's become more general than that, where we are using it for every copy operation, every compute operation that we have.

Copy operations are getting extremely complex. They're all collective. They're using multiple threads; they're using multiple warps. The source partitioning pattern is not equivalent to the destination partitioning pattern. So this is a copy atom, which is the equivalent of an MMA atom, but for copies, right? And this is for LDSM, which is a load from shared memory into registers. This is the partitioning pattern for the source, which is shared memory. This is the partitioning pattern for the destination, which is registers. They are not the same. That's mind-bending, and it's impossible for humans. We had Excel sheets on Excel sheets tracking shared memory layouts and operations: what is this register representing, and where does this live, and how do I get this all back in the epilogue, back to global memory, and how do I make sense of this? But at the end of the day, all we want to do is write code like this, where I have some kind of tiled copy. I apply that partitioning pattern to the source, I apply that partitioning pattern to the destination, and then I just start calling copies, and trivial for loops over the integer coordinates of these tensors.
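(The "source and destination patterns disagree" situation can be sketched with a toy Python copy: each "thread" reads its slice of the source under one hypothetical pattern and writes under a different one, and the whole copy is nothing but loops over integer coordinates. The patterns below are invented for illustration, not the real LDSM layouts.)

```python
# Toy "copy atom": 4 threads move an 8-element buffer from src to dst.
src = list(range(100, 108))
dst = [None] * 8

# Hypothetical partitioning patterns (thread -> list of offsets).
# As with LDSM, the source-side and destination-side patterns differ:
src_pattern = {t: [t, t + 4] for t in range(4)}          # strided reads
dst_pattern = {t: [2 * t, 2 * t + 1] for t in range(4)}  # contiguous writes

for t in range(4):        # "each thread"
    for v in range(2):    # trivial loop over the value coordinates
        dst[dst_pattern[t][v]] = src[src_pattern[t][v]]

print(dst)  # [100, 104, 101, 105, 102, 106, 103, 107]
```

No element is lost or duplicated even though no thread's reads line up with its writes, which is exactly the bookkeeping that used to live in those Excel sheets.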

That makes sense. Yeah. So, Chris, I want to ask you maybe one dumb question and one interesting one. Just the dumb question, considering I wish we had you for a bit longer: do you mind sharing your slides after the call? A lot of people want to go through them.

Yes, but I won't include all of them.

Okay, whatever you're comfortable with. I guess my second question is: when I look at the annoying parts of writing PyTorch code, at least the parts where people get tripped up the first time, it's very often when they're doing reshapes and cats and flattens. Basically people are manipulating shapes, and it's kind of hard for them to mentally map it in their heads, right? But it sounds like, while for CuTe maybe the upfront overhead is quite a bit higher, ultimately it gives you a mathematical algebra to reason about copies and reshapes, regardless of layouts, regardless of the number of devices, etc.

Yes. And there's a lot to say about that, whether you're talking about logical reshapings versus physical reshapings, and whether you actually care where everything lives. Sometimes you don't, sometimes you do. There are lots of these extra questions, so you do have to be careful. But yes, just the representation itself, and being able to fold tensors into matrices in more robust ways than are usually possible, has been immensely useful for getting this stuff off the ground.

Yeah.

And so would you say the two papers you linked there at the bottom, did those age well? Do you still view them as the best intro for people to get introduced to these ideas?

Oh no, absolutely not. These are academic papers on applications, for tensor contractions and such. If you want an introduction, and you want further work for CUTLASS and CuTe, go to NVIDIA CUTLASS. There is an immense amount of media and documentation on CUTLASS. I have my own documentation folder; most of these I wrote myself. And then there's the new Python DSL. You can see how to get started with the quick start here. The hello worlds also go through a lot of the CuTe operations: all of these products and divides, and how do I do tiling and predication, and such and such. That is the best place to get started. There are a lot of code examples as well that you can find in here, linked within this documentation. All of those are highly, highly recommended for continuing, getting in there, hacking on kernels, and doing that one weird thing that you were never able to do before.

Okay, I think this might be a natural time to call things. Chris, honestly, thank you so much. This is one of my favorite talks so far on the server. The recording will be out soon, folks. We'll hopefully get some slides, and if you liked this, please comment on the video so Chris comes again. Next week we're going to have a talk on disaggregated LLM inference by Juno Chen, and yeah, I'm looking forward to it. See everyone soon.

Cool. Thank you, Mark. Thank you.
