Gluon and Linear Layouts
By GPU MODE
Summary
Topics Covered
- Triton's growing compiler complexity makes optimally compiling complex kernels infeasible
- Gluon's killer feature is explicit warp specialization under programmer control
- Linear layouts: matrices over F2 that turn hardware problems into linear algebra
- Gluon achieves speed-of-light by exposing Blackwell's parallel execution model
Full Transcript
All right. Well, welcome everyone to another episode of GPU MODE. This lecture, for me, has been very long awaited, for a few reasons. One is that it's coming from a group of people who were largely responsible for building Triton into the kind of kernel DSL that took over the world. I think there was a time during 2023 when it seemed like there was just no other DSL that really mattered.
Since then there's been an explosion of DSLs, primarily all targeting really good performance on Blackwell. And Gluon, my understanding, is that take: it's a lower-level DSL than Triton. So I figured you all would be very interested in hearing from the team.
We have Keren, who's a professor, and Mario and Peter; they all work at OpenAI. Keren, I'm not sure if you officially work at OpenAI or not, but in case you do, that's great. And then, folks, please take it from here. If you have any questions, please just ask them. I think Peter will get started, so please take it from here.
Yeah, perfect. So I think it's good to start with a little bit of an introduction of what we're going to be talking about today. There are obviously three of us here. I'm going to start us out with an introduction to Gluon and give you a bit of the motivation, which Mark started to touch on: why we felt Triton wasn't quite fully serving our needs and why we needed something a little extra, and a little bit about how we designed the language, how you program in it, and what it's like to use.
Then Mario is going to follow up with a deep dive into linear layouts, which is the mathematical foundation that forms the backbone of the Gluon language, so it's very important to how we lower everything. And then Keren is going to follow up with a case study on an optimized Blackwell matmul, which as of fairly recently is matching cuBLAS and in some cases even exceeding it, so we're getting true speed-of-light performance, and we'll show all the optimizations you need to make to get there. Then we'll talk a little bit about the additional developer tools we've built to support you in developing Gluon: profilers, sanitizers, and even a layout visualizer that helps you write your Gluon code and understand these complex layouts.
So, a little bit of a recap of what Triton looks like and why it was such a great language at the time. It really hits a beautiful sweet spot between being very simple and productive, but also very high performance. What's shown on this slide is a full matmul kernel. It gets maybe about 80% of speed-of-light performance on Blackwell for very minimal effort, and you can spend a little more effort in the programming to get maybe around 95% of peak performance on Blackwell. But this really is a complete kernel.
The kind of magic to Triton was this tl.dot operator, which is your gateway into the tensor cores. We, as the compiler, are significantly transforming your code and generating something quite different at the actual hardware level. But all you have to do to get that is say num_stages=4 and we'll pipeline the loads for you, or warp_specialize=True and we will generate a warp-specialized kernel.
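The slide's kernel isn't reproduced in the transcript, so here is a minimal, standard Triton matmul of the kind being described, as a sketch: shapes are assumed to divide evenly by the block sizes so masks can be omitted, and num_stages=4 would be passed as a launch meta-parameter rather than appearing in the source.

```python
import triton
import triton.language as tl

# A minimal Triton matmul sketch (not the slide's exact code). Computes C = A @ B
# with M, N, K assumed divisible by the block sizes. num_stages/num_warps are
# supplied at launch time, e.g. matmul_kernel[grid](..., num_stages=4).
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)           # the compiler picks layouts, vectorization, pipelining
        b = tl.load(b_ptrs)
        acc = tl.dot(a, b, acc)       # the "gateway into the tensor cores"
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```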
But as we go into more complicated kernels, this leaves a lot for the compiler to do, and as it's happened, the compiler has grown in complexity over the years.
Back in the day, when Triton was first coming out, we were in the land of Ampere-style tensor cores, and what the hardware really wanted was reasonably well matched with the Triton programming model. Say you have four warps executing your code: all four of those warps work cooperatively to load their chunk of the input data, then they all sync and shuffle the data around to get it into the appropriate format for the MMA instruction, then they all cooperatively work on the MMA, and then we loop around. So the control flow of the program is completely in lockstep, and every single warp is executing the same control flow path as your Triton program.
But now fast forward to Blackwell, and things look very different. We still have our first four warps, but now they're literally just processing the epilogue. Instead we have these additional two warps that are doing nothing but loads and nothing but MMAs. This is enabled by new hardware like the TMA: instead of all the warps cooperatively working on loading the data, just one warp can issue an instruction to the TMA and it loads a massive chunk of data at once, and that can fully saturate the bandwidth of the hardware. And the same for MMA: one warp can issue a large MMA instruction that handles a lot of data and fully saturates the tensor core, just from that one warp.
So because what the hardware really wants now is significantly different from what you're programming in Triton, the complexity of the compiler transformations has grown. There are a lot more choices the compiler has to make, and a lot more room for things to be suboptimal. It happens to be the case that for a simple matmul kernel we can do a pretty good job, but as things get more complicated, like flash attention or any other kernel you can imagine, we can't have a compiler that deals with absolutely all of those in the perfectly optimal way. It's just not feasible.
So, Peter, maybe before you go on, let me know if this is an unfair question to ask at the beginning; we can always punt on it. But many of your colleagues who write compilers for languages as a business have different takes on this. For instance, some of our friends on TLX might say, "Well, no, you can add simple extension points and it'll still work." Our friends doing CuTile are like, "No, it's just a skill issue. Your compiler just needs to be smarter and it'll handle these things." And then of course you have people that are more ideologically aligned with what you're proposing, like the CuTe DSL team, saying, "No, you have to give the hardware what it wants." So I'm wondering if you all have a unique perspective here, because obviously, by virtue of writing a new language, you're saying that you don't think any of the other solutions are meeting your needs. I was wondering if you could help us mere mortals who don't write languages for a job understand why you say infeasible, basically; if you could help us unpack what that means.
So, I haven't done a lot with these other languages, but I can say that our team's experience with, for example, CuTile has been that it shows great results in the benchmarks, but when we actually threw our real production workloads at it, none of our kernels even compiled, I think; I don't think we managed to get any of them running. So if you narrow the scope of the language, maybe you can get good performance there. Also, if you're fixated on particular benchmarks that you know really well, it's easy to pattern match those exact benchmark kernels and do something very funky to get exactly the performance there, but what we found is that it's quite brittle. For example, when we were developing our warp specialization, we found that there are these parameters where you choose how many registers you want to give each warp. You pick a certain number, and PTXAS says, "Okay, that's fine, but I'm going to do lots of spilling and it's going to perform terribly." So then you allocate more registers, and PTXAS just gives up and crashes with some error. You can't really reason about it, because the register allocation is happening so deep down the compiler stack that we have no real insight there. You can do some heuristics, you can try some heroics, but there's just a big gulf. So a sufficiently smart compiler maybe could do it, but whether such a compiler exists is questionable. It's just a philosophical question, yes. Okay, I see. Makes sense, yeah. Thank you.
Cool.
But yeah, I will say that we can improve Triton with some of the developments we're making in Gluon. For example, we recently introduced multi-CTA matmul support; we can feed that back into Triton and be able to produce better kernels there. So it's all connected, it's the same compiler stack, and we can feed it back into Triton.
Yeah, so the goal with Gluon is to be lower level, to expose these hardware details, to let you program speed-of-light optimal kernels, but to still be slightly higher level, to give you abstractions, in particular tensor abstractions, to improve productivity, and also to be quite familiar to anyone who's written Triton before, just adding new things onto it. There are of course trade-offs: if you're programming directly against the hardware, your kernels aren't portable, and it's going to be harder to write because you have to understand the hardware quite deeply.
The first thing you'll notice if you've looked at any Gluon examples is that we have these layouts everywhere. So this example, if you squint a bit, is basically exactly like a Triton kernel, except we have this additional layout argument. What that does is control the actual mapping from the abstract mathematical tensor onto the hardware. I've kind of copped out here by not giving a real layout, but a common example for load and store would be a blocked layout, where you have a certain number of neighboring elements in a given thread (that's your vector size, basically), and we distribute that amongst the threads in a warp and the warps in a block, to allow you to do coalesced loads and things like that.
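As a rough illustration of "a Triton kernel plus a layout argument", here is a minimal copy-style kernel sketch; the module paths, the @gluon.jit decorator, and the BlockedLayout/arange/load/store spellings are written from memory of public Gluon examples and should be treated as assumptions rather than the code on the slide.

```python
# Hypothetical Gluon sketch: same shape as a Triton kernel, but the layout of the
# offsets (and hence of the loaded tensor) is chosen explicitly by the programmer.
from triton.experimental import gluon
from triton.experimental.gluon import language as ttgl

@gluon.jit
def copy_kernel(src_ptr, dst_ptr, N, BLOCK: ttgl.constexpr):
    # One element per thread, 32 threads per warp, 4 warps: this decides which
    # thread owns which element, and therefore whether the load coalesces.
    layout: ttgl.constexpr = ttgl.BlockedLayout(size_per_thread=[1],
                                                threads_per_warp=[32],
                                                warps_per_cta=[4],
                                                order=[0])
    pid = ttgl.program_id(0)
    offs = pid * BLOCK + ttgl.arange(0, BLOCK, layout=layout)
    mask = offs < N
    x = ttgl.load(src_ptr + offs, mask=mask)
    ttgl.store(dst_ptr + offs, x, mask=mask)
```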
We have other concrete layouts, like the slice layout, which basically takes a higher-dimensional layout and removes one of the dimensions; for example, the result of a reduction, or a tensor that you're about to broadcast. We also have these helper layouts, auto layout and coalesced layout, which let you defer the choice of layout, because it's not always easy to express the layout up front. For example, you can have a complex computation graph, set auto layout at the end, and it will try to back-propagate the layouts to all the leaf tensors.
But the rule is that we never make up a new layout; we always take the layout from something that you specified in the code, so you have complete control there. And coalesced layout in particular, instead of you manually setting a layout, will decide a layout that's good for a load: it looks at the alignment of all the pointers to see if you can do a vectorized load and picks the best coalesced layout.
These layouts generalize to anything in the class of linear layouts, which Mario is going to talk about later; it is a very expressive layout system, and most of the basic operations work with arbitrary linear layouts. The real advantage of being able to express the layout is that when you control where all of your data lives in the hardware, you're indirectly controlling how the tensor operations actually get lowered.
For example, if you have a reduction, you could choose a layout where the reduction dimension is local to a warp, so we only have to do warp shuffles and we don't have to touch shared memory. That's up to you. Another common pitfall in Triton is that if the compiler accidentally leaves a convert_layout in the middle of a hot loop, it can tank your performance. With Gluon, because you control the layouts, you choose exactly where you want to do layout conversions, or whether you want to recompute instead; it's all in your control. And you can also use generic linear layouts to experiment with layouts that the compiler wouldn't necessarily have created on its own. So that's just extra flexibility for the programmer.
But the real killer feature of Gluon is warp specialization: you can explicitly write all of the code that each worker, or partition, is doing. It follows a fork-join model, where you enter your Gluon program as normal, but when you call the special warp_specialize op, it forks execution into what we call the default worker or default partition, which is basically the continuation of the warps that were running your normal program, and then you can also have these additional worker warps, which in the matmul example would be issuing TMA loads on one worker and issuing the MMAs on another. And of course you can have more complicated kernels, like our flash attention example, where we have lots of different workers and some of them are doing computation. That's all now under the programmer's control. I should also say that when the warp-specialized block ends and all the workers finish, it does return to the parent execution, but most of the time you're basically done by then, so nothing particularly interesting happens there.
But now that we have this parallel execution going on, you have the added complexity of needing to communicate between the partitions. So we're now exposing shared memory directly to the programmer. You can just allocate a block of shared memory, and this is one of the few places in Gluon where you can specify a non-power-of-two dimension. This is particularly for pipelining, where you might have, say, three stages as in this example, and you might want three load operations happening in parallel. So you can have a non-power-of-two first dimension and then power-of-two remaining dimensions, and those remaining dimensions are associated with a linear layout, or any generic shared memory layout, which tells the compiler the mapping from the high-level tensor indices onto the actual hardware pointer addresses. And you interact with the shared memory as tensors, storing and loading entire tensors at a time, so it's consistent with the rest of the programming model.
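As a rough sketch of the pattern being described, a three-stage pipeline buffer in shared memory might look like the following; the allocate_shared_memory and NVMMASharedLayout names, their argument spellings, and the .index() slicing are assumptions based on public Gluon examples rather than the exact code on the slide.

```python
# Hypothetical sketch: allocating a 3-stage pipelined staging buffer in shared
# memory inside a Gluon kernel. Names/signatures are assumptions, not verbatim API.
from triton.experimental import gluon
from triton.experimental.gluon import language as ttgl

@gluon.jit
def staged_buffer_example(BLOCK_M: ttgl.constexpr, BLOCK_K: ttgl.constexpr):
    NUM_STAGES: ttgl.constexpr = 3          # the non-power-of-two leading dim
    smem_layout: ttgl.constexpr = ttgl.NVMMASharedLayout(
        swizzle_byte_width=128, element_bitwidth=16, rank=2)
    # One buffer per pipeline stage; the trailing dims stay powers of two.
    a_smem = ttgl.allocate_shared_memory(
        ttgl.float16, [NUM_STAGES, BLOCK_M, BLOCK_K], smem_layout)
    stage0 = a_smem.index(0)                # sub-view for the first stage
```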
In addition to that, we have hardware intrinsics, where you're interacting directly with the hardware's programming model; in this example, we have some Blackwell intrinsics. This somewhat sizable function is equivalent to the descriptor load call that we saw in the very first Triton kernel, and it is noticeably more complex. You have an mbarrier, which is a barrier that lives in shared memory; we have to prepare this mbarrier, we have to issue the async load, and we have to manually wait on the barrier, because the TMA hardware in actual fact operates asynchronously from the normal warp scheduler. The reason we want all of these different hardware intrinsics visible is that in practice you don't have them one after another like this: you have your load partition in the warp specialization issuing the async loads, and then another partition waiting on the mbarrier and using the shared memory there. By splitting these up, it allows you to directly program the asynchrony that's inherent to the hardware. I'm showing Blackwell here, but we also have intrinsics for Hopper, Ampere, and AMD generations as well.
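To make the split concrete, here is a rough sketch of the async-load and barrier-wait halves of that pattern; the nvidia.blackwell module path and the mbarrier/tma function names and signatures are assumptions drawn from memory of public Gluon examples, so treat this as pseudocode for the shape of the pattern rather than the slide's code.

```python
# Hypothetical sketch of the asynchronous TMA-load pattern described above.
# All names and signatures below are assumptions, not verbatim Gluon API.
from triton.experimental import gluon
from triton.experimental.gluon import language as ttgl
from triton.experimental.gluon.language.nvidia.blackwell import mbarrier, tma

@gluon.jit
def async_load_example(a_desc, BLOCK_M: ttgl.constexpr, BLOCK_K: ttgl.constexpr):
    smem_layout: ttgl.constexpr = ttgl.NVMMASharedLayout(
        swizzle_byte_width=128, element_bitwidth=16, rank=2)
    smem = ttgl.allocate_shared_memory(ttgl.float16, [BLOCK_M, BLOCK_K], smem_layout)
    bar = ttgl.allocate_shared_memory(ttgl.int64, [1], mbarrier.MBarrierLayout())
    mbarrier.init(bar, count=1)

    # In a real kernel this half lives in the load partition:
    mbarrier.expect(bar, BLOCK_M * BLOCK_K * 2)            # bytes the TMA will deposit
    tma.async_copy_global_to_shared(a_desc, [0, 0], bar, smem)

    # ...and this half lives in the partition that consumes the data:
    mbarrier.wait(bar, phase=0)
    tile = smem.load(ttgl.BlockedLayout(size_per_thread=[1, 8],
                                        threads_per_warp=[4, 8],
                                        warps_per_cta=[4, 1],
                                        order=[1, 0]))
```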
But then, because things are getting more complex as we break these operations down into their fundamental chunks, we also allow greater ability to build abstractions within the language. Triton previously had just named tuples, which let you gather data together into meaningful abstractions, but now we're going a step further and having code associated with them. In this example, we have a barrier counter, which is useful for pipelining: when you have multiple stages happening in parallel, you need to keep track of which barrier you should be addressing, but you also need to keep track of a phase, which is just a bit that flips between 0 and 1 every time you loop around. We've abstracted the counter increment into a little function here. One little quirk is that everything in Gluon is actually immutable, so you'll notice that we return a new barrier counter instead of modifying it in place, which is a little bit weird coming from Python programming.
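As a rough sketch of the kind of abstraction being described, a counter bundled with a method that returns an updated copy rather than mutating in place might look like this. The @ttgl.aggregate decorator name, the field typing, and the ttgl.where spelling are assumptions; the point is only the immutable-update shape.

```python
# Hypothetical sketch of an immutable pipelining counter abstraction.
# Decorator and helper names are assumptions; next() returns a new value
# instead of mutating self, since everything in Gluon is immutable.
from triton.experimental import gluon
from triton.experimental.gluon import language as ttgl

@ttgl.aggregate
class BarrierCounter:
    index: ttgl.tensor   # which barrier/stage we are currently addressing
    phase: ttgl.tensor   # flips between 0 and 1 each time we wrap around

    @gluon.jit
    def next(self, num_stages: ttgl.constexpr):
        wrap = self.index == num_stages - 1
        new_index = ttgl.where(wrap, 0, self.index + 1)
        new_phase = ttgl.where(wrap, self.phase ^ 1, self.phase)
        return BarrierCounter(new_index, new_phase)  # fresh value, self untouched
```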
Do you need to allocate shared memory for barriers? Yeah, so in the example here we're calling this allocate-barrier function, but it's actually just allocating shared memory and defining the layout for you as a little convenience.
Right. So now that we're dropping down to expose these hardware primitives, it might leave the question in your head: what's actually left for the compiler to do? The answer is actually quite a lot. What the compiler is giving you here is a lot of rather complicated lowerings for these operations. Load seems trivial, but we're actually analyzing all the pointers to check that they have sufficient alignment and contiguity to vectorize the loads, which you might not be doing otherwise. We have this somewhat magical convert_layout operator, which can take your input tensor in any arbitrary linear layout that fits our layout framework and convert it to any other linear layout, minimizing bank conflicts, maximizing vectorization, and even reducing shared memory: if you have really large tensors, it will split the conversion into multiple rounds of transfers. You could write that in Gluon yourself by just allocating some shared memory, storing to it, and loading from it, but this gives you a whole lot more power.
Also, something like a sum, which in CUDA or even CuTe DSL might be fifty lines of code, is just one line in Gluon. And it's not only giving you everything you expect, like warp shuffles and communicating through shared memory; it also handles things like swizzling the shared memory, which you might not necessarily do if you were writing it by hand.
Additionally, even though we are exposing hardware intrinsics, they're not actually just at the raw PTX level; they raise the abstraction level slightly. For example, the TMA async load has hardware limitations on the maximum block size that you might not care about: you want to issue a large block size that's good for matmul performance, and we automatically figure out the block size that actually fits within the TMA unit, split your load into however many instructions are required, and you don't have to think about it as the programmer. Similarly with MMA: there's a fundamental instruction size that you don't necessarily need to care about anymore.
Also, with shared memory allocation, in the CUDA language you can only allocate a fixed-size buffer, but we take this dynamic-allocation-style interface and statically figure out the best way to pack it, so that we use the smallest amount of shared memory still required for your program. On top of that, when you're accessing shared memory, it's presented to you as a single unified tensor operation, but of course in the hardware it's actually divided amongst threads: they individually store it and then load it, potentially in a different layout, potentially from a different thread, so you have a race condition there. But our compiler figures out where we need to synchronize and inserts whatever is required.
Another common gotcha is a bit of a hardware implementation detail: on Hopper and Blackwell, the TMA unit and the MMA unit actually access shared memory through a completely different channel. So if you're accessing shared memory through normal loads and stores and also using the TMA, you have to issue this weird fence instruction, and we figure out when that's required, so the programmer doesn't have to think about it in simple cases like this. Although I will say there are edge cases, like if you're storing in one warp-specialized partition and reading from another; the compiler doesn't have full visibility there, so you will have to put a fence in manually, but we are at least helping you in the common case.
Then, just to recap what we've talked about, let's look at the position of Gluon in this family of kernel DSLs. Good old Triton is very high level and abstract: you don't really have to think about the hardware. In Gluon, we still have the tile abstraction, but you have to specify the layouts, endowing the tensors with exactly where they're going to live on the hardware. And this contrasts with something like CuTe DSL, which I see as a bit lower level, where you're really programming at the level of the direct PTX intrinsics and you just have additional utilities that help you reason about the layouts that are going on. What that gives you: Triton gets about 80% of speed-of-light on most use cases, up to maybe 95% on matmul, but with Gluon we are actually able to achieve speed-of-light performance on convolutions, matmul, and attention. And of course, we have support for both AMD and Nvidia. The downside is that we're no longer portable, because you're programming directly against the hardware generation.
"Surprised that cute isn't considered multi-vendor or portable." Well, okay, maybe I'm wrong here, but I thought that cute was only supported by Nvidia. Is that right? And by portable, I just mean that you're programming directly against the hardware, so you can't run tcgen05 code on Ampere, for example.
Yeah, Peter, there's a lot of questions coming in. By the way, that cute DSL question is from Chris, who's the creator of cute DSL, so maybe he can elaborate on what he's trying to say. I'll let you tackle any of the questions, and I have a couple of my own as well, but please, if you could answer the questions from people in chat, I think that would be great.
Okay, cool. In terms of layout types, Mario is going to talk a lot about linear layouts, so I'll not cover that; we can hold on for that. In terms of warp specialization, do you have to manually set the register budget? Yes; I discussed this a little before. It's very hard for the compiler to figure out what the right register budget should be, because it affects PTXAS's optimization. As the compiler, we don't know what the register allocator within PTXAS is going to do, so it really needs to be something you discover by just trying multiple different settings.
"Cute has no fundamental ties to Nvidia hardware." Yeah, so the layout system isn't particularly tied to Nvidia hardware, but my understanding is that kernels you write in the DSL itself can only be run on Nvidia hardware. But yeah, maybe there are plans to change that.
What is the design reason why everything in Gluon is immutable? So, there was a very long discussion about this internally. In order to support mutability, we could do it the same way that variables can be reassigned at the Triton level, but because we have reference semantics in Python, it leads to really weird behavior where you expect things to be mutated globally but they just aren't. You could maybe hack something together, but within a reasonable implementation the behavior was sufficiently different from Python that I thought it would be quite confusing to the programmer.
Is there any visualization for Gluon, as mentioned? Oh boy. Well, we're going to talk about that at the very end, but yes, there is: there's an entire web tool for visualizing layouts. "AMD's Fly DSL is based on cute." Okay, that's cool; I haven't played around with that. So, I think that was all the questions I saw.
So, in other words, things are immutable because of the different memory model? It's because we're compiling to MLIR, which is static single assignment. We could try to mimic mutability with SSA form, but if you look at something like clang, its front end handles mutability by actually lowering everything to stack values and then having complicated compiler passes that raise them back up into SSA values, and that would just be far too complicated for us. And the consequence of hitting the stack in a GPU programming language is completely catastrophic, so we don't even want to risk that as a possibility. Yeah.
Cool. Were there any other questions? Mark, you said you had some.
Yeah, I guess when you say multi-vendor, this specifically means the user-facing front end is multi-vendor, but you don't expect a lot of code reuse; presumably you would expect less code reuse in Gluon than what you saw in Triton? Absolutely. That's why I said it's not portable: any code you write is probably going to rely on some hardware detail. You can write some kernels that are portable, for example if you're not using the tensor cores, if you're just doing a reduction kernel or something, but there are certain details, like the layout depending on how many threads there are per warp, and some AMD GPUs have 64 threads per warp instead of 32. So nothing is going to be completely portable when you're relying on the hardware like that. But yeah, multi-vendor in the sense that you can write Gluon that targets AMD or that targets Nvidia.
I apologize, I might have missed it, but could you redefine warp partitions for me? So, this is when you're organizing your warp-specialized code: you usually have groups of warps that are cooperating. In the simple cases, the load partition and the MMA partition in matmul are just single warps, so each of those is one partition. But the epilogue partition might have four or eight warps, and that's a single partition that you program together, with a tile abstraction over it. Did that make sense? Yeah, I guess maybe the word threw me off; it's more like a grouping and an assignment of roles within that group, basically. Yeah, you have a fixed number of warps and you're partitioning them into different groups that you care about. Okay, makes sense. Yeah.
Okay, I think we took a lot of questions, so we should probably let you continue your talk. Thank you. Yeah, that was a good point to pause for questions, though, because this is the end of my section, and I'm going to hand over to Mario to talk about linear layouts. Very cool. Okay.
All right, so: layouts. We've been talking a lot about layouts, but here we're first going to look quite deeply at what layouts really are. We're basically going to get into what linear layouts are, but first of all, what are layouts at all? Here there are a couple of examples of two layouts, which, as Peter has said quite a number of times, are just a way to describe how you take certain logical data and distribute it across the hardware; a layout is a way to distribute it. So here we have an example where we have two warps, each of 32 threads, each thread with four registers, and we have a 16 by 16 matrix that we might want to multiply by something else. And, can you actually see my mouse? Or not?
I don't. Could you move it? Sorry. All right. It would actually be very nice if I could point at things. It might be easier if you share your screen instead of sharing the slides. All right, fair play, let me try that. Can you see my screen now, and my mouse? Yes, yes. Amazing. I won't be able to read questions, so please stop me if there are questions and I can go back.
So, I guess I can go back and forth a bit, actually. But anyway: here we have a 16 by 16 matrix, and we have two warps of 32 threads each, with four registers per thread, and two different layouts. In the first layout, we're saying element (0, 0) is on warp zero, thread zero, register zero, and then (0, 1) is on warp zero, thread zero, register one, and so on and so forth. The second layout is basically a similar layout, but transposed: element (0, 1) is owned by register two of thread zero, warp zero, and then (0, 2) goes to thread eight. So these are two different layouts, and you can reorder these things quite a bit, but you can imagine these sorts of layouts coming up rather naturally in certain CUDA programs.
We used to write this layout as what's called a blocked layout; Peter has already mentioned blocked layouts. We call them blocked layouts because they're organized in blocks, no magic there. We see that we have a size per thread of four elements, split into a two by two square; the 32 threads are split four by eight across the two dimensions; and then the warps per CTA: we have two warps organized in a two by one fashion. The order is telling you whether we fill things up starting along dimension zero or dimension one; here we say you first fill up dimension one and then dimension zero, so it's a layout that is row major rather than column major in this two-dimensional case. In general, the order is a permutation of the dimensions zero to n minus one, and you use it to describe your majorness, basically. And in this blocked layout, the registers, lanes, and warps all have the same order: you go R0, R1, R2, R3 along dimension one first, and for threads you also first go along dimension one and then move along dimension zero; for warps, with two warps in a two by one arrangement, it doesn't really matter.
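As a rough sketch, the layout just described would be spelled something like this in Gluon; the exact constructor name and keyword spellings are assumptions based on public Gluon examples, but the parameters are the ones read off above.

```python
# Hypothetical spelling of the blocked layout described above: a 16x16 tensor
# over 2 warps x 32 threads x 4 registers. Constructor/keyword names are assumptions.
from triton.experimental.gluon import language as ttgl

layout = ttgl.BlockedLayout(
    size_per_thread=[2, 2],    # each thread owns a 2x2 square of elements
    threads_per_warp=[4, 8],   # 32 threads arranged 4 (dim 0) by 8 (dim 1)
    warps_per_cta=[2, 1],      # 2 warps stacked along dim 0
    order=[1, 0],              # dim 1 is the fastest-running ("row major")
)
```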
Before, we had plenty of layouts, like blocked layouts, and they were all enumerations of things like this; here are a few other examples of layouts we had. We had a slice layout for when you do a reduction, which basically takes a layout and says, "I have reduced over this dimension," so if the layout was a two-dimensional layout, the slice layout is a one-dimensional layout with that dimension removed. We also had layouts for mma.sync, that is, for Ampere tensor core operations, and there are really many of those, because there's one for A and B, the two inputs, one for C, the output, and different layouts for all the different dtypes. Then there's also the register layout for the WGMMA left-hand-side operand on Hopper. And then we have all the AMD layouts, where, for those of you who know, AMD feeds all the tensor cores through registers, so there's a ton of layouts there; so many layouts you wouldn't imagine. And then there are shared memory layouts: we have this NVMMA shared layout, which is the layout that is fed into the shared memory side of WGMMA and the tcgen05 tensor cores, and there's another layout that we used to use to go from a blocked layout into one of these mma.sync layouts while minimizing bank conflicts. And then there's a tensor memory layout... So, a ton of layouts. So many layouts.
And each of them is not terribly well defined. Sure, I can explain to you that a blocked layout is something like this, and you say, all right, good, I understand. But then there are edge cases, and there are always edge cases. We do things like broadcasting, where you can put a smaller layout onto a bigger tensor, or a bigger layout onto a smaller tensor. For example, you could take this layout, which you'd imagine requires a 16 by 16 matrix, and put it onto an 8 by 8 matrix, and we have rules for what that means; or you can put it onto a 32 by 32 matrix, and we also have rules for what happens when you do that.
And we had those rules for all of these layouts; basically, we had the same issue that CUTLASS had, CUTLASS 2, the issue that CuTe layouts came to solve. We had all these layouts, and the problem is that we had that convert_layout function: if you want to go from a blocked layout into one of these layouts, we had loads of code saying exactly how you have to do all the indexing to compute the offsets, where you have to put each of the registers; and then the same for a slice layout of any layout, and for all these other layouts, and for all combinations of these layouts. And that code was very buggy. You can imagine it's a ton of indexing, and it's terribly easy to get wrong and very, very difficult to get right. So, a year and a half ago, Triton was very good at generating matrix multiplication kernels, but when you tried to do something more complex, it was incredibly easy to break Triton. You would write a normal-looking kernel, and then you would transpose a tensor inside your kernel, and at best you would crash the compiler; at worst, it would generate code that was just wrong, and then again at best you would crash at runtime, and at worst you would silently get a wrong result.
So how do we solve this? Well, in CUTLASS they solved it with CuTe layouts; here we solved it with what are called linear layouts. Linear layouts were an idea from an engineer slash mathematician, really more mathematician than engineer, though he's a fantastic engineer as well: Adam Goucher, who works at OpenAI. He had this idea in 2018 when he was trying to figure out how to solve this same problem, but for vectorization and vector operations on CPU, and he came up with this beautiful framework that solves all of it. So we're going to discuss this linear layout framework a little now, and we're going to discuss it in a way that, hopefully, makes you feel that even though you are not Adam Goucher, you could perhaps have come up with it yourself if you had thought about it enough. But for that, I'm afraid we'll have to use some maths. Not too much maths, just a little bit of linear algebra and so on. It's going to be a slightly special kind of linear algebra, but linear algebra nonetheless, very similar to the linear algebra you learn at university.
So, any questions so far? Let me see if there are questions. I guess this one wasn't answered before: does choosing one type of layout limit what kind of instructions or operations you can use? Yes. Something that's very cool about Gluon is that since everything is at a higher level, we have loads of verifiers for different operations, which means an operation expects certain sorts of inputs, and if you don't give it that sort of input, it will yell at you at compile time and tell you, actually, this input and this input should have layouts compatible in a certain way. There are some operations that accept all sorts of layouts; for example, reductions are written in a generic way so that they work for any input layout. There are others, like scans, that need a blocked layout, because writing a scan for something that's not a blocked layout is just not very efficient, so we fail early and tell you we're not going to generate that code, because it would actually be quite bad.
Another one: is it possible to scale up to distributed arrays and express sharding constraints? No. Even though you can write multi-GPU code with Triton and Gluon, in Gluon we just think about one block: as Peter has mentioned a few times, Triton is block level, and Gluon you can think of as also roughly block level, a CTA, or a CGA if you're using multiple CTAs. So you always think about it as just one launched block of your computation.
All right.
So, all right, what are layouts? What are layouts, really? Let's talk a little about that. A reasonable way to represent the sorts of layouts I showed before, that blocked layout of this, this, and that, is through a function. We can think of it as a function from registers, threads, and warps, so from the hardware, into the logical tensor. In this case the logical tensor has two dimensions, so it will be mapping onto two dimensions, dimension zero and dimension one.
And we can look a little bit at how to form it. With four registers, so R going from zero to three, 32 threads, so T going from zero to 31, and W being zero or one, we can form this formula. We look at it and say: if we're dividing R by two and taking the floor, that means if I advance R by one, I don't move in dimension zero, but if I advance it by two, I add one there. And it's true: if I advance the register by two, I move one down dimension zero. Then, for the threads: if I advance from zero to seven, that term is still zero, but if I advance by eight threads, I get a one there and I move down by two; to advance another two I need to add another eight, so sixteen, and so on. And then if I advance by one warp, I jump eight in dimension zero. In dimension one, it's pretty much the same thing, but now with moduli. R mod two tells you how much you advance there: if I advance the register by one, I move one in dimension one. And then T mod eight, times two: if I advance by, say, three threads, I advance three times two, so six elements: one, two, three, four, five, six, and we're there. So basically, this formula fully describes this layout.
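The formula on the slide isn't visible in the transcript, but reconstructed from the verbal description above it should be roughly this (treat the exact spelling as an assumption):

```latex
\mathrm{dim}_0(R,T,W) \;=\; \left\lfloor \tfrac{R}{2} \right\rfloor \;+\; 2\left\lfloor \tfrac{T}{8} \right\rfloor \;+\; 8W,
\qquad
\mathrm{dim}_1(R,T,W) \;=\; (R \bmod 2) \;+\; 2\,(T \bmod 8)
```

with R in 0..3, T in 0..31, and W in {0, 1}, covering the full 16 by 16 tile.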
Is that a reasonable thing? Are there any questions? I see no questions.
Yeah, this is reminding me a lot of some of the introductory PyTorch dev-onboarding talks about how powerful striding is. So this is really generalizing it: it's like striding where the unit is the register, not an element of a matrix, and the owners are the thread and the warp, and then you build up a striding formula based on that premise. Is that correct, or not really? Feel free to say whatever. I mean, we have Chris checking in here, so always feel free to correct me, but I've always believed that that approach of going from strides, as in PyTorch, to a layout representation is basically the CuTe approach to this problem. It's a completely reasonable thing to do, it's exactly what CuTe does, and it does it very well. We're going to see that this approach is slightly different; we're going to take a turn in a minute. But for now we're just talking about what a reasonable way to describe these sorts of things would be, and you can absolutely see it as strides; that's a very fair thing to do. Peter has told me many times that that's actually how he's always thought about this, even without all this math: coming from PyTorch, he thinks of it very much as strides, and that can take you a very, very long way. So in some ways, yes, but we're going to see... It's like training wheels for the concept you might not understand yet.
Yes, 100%. All right, cool. So, this function and this representation are the same thing. All right, so we have some math, cool. What is this weird-looking formula? What can you do with it? I don't know, let's see.
So, let's think a bit about what the parts of this formula mean. First of all, something very interesting in Triton and Gluon, and in hardware generally, is that almost everything is a power of two. That's rather convenient: as anyone who has done any hardware work knows, everything that's efficient tends to be a power of two. So let's try to exploit that and really get some juice out of it. Here we see: what is taking T modulo eight, where T ranges over 32 values? Well, that's taking the last three bits of T.
And then, well, there was a misrendering here: this shouldn't be a bracket, it should be a floor. What happens when you take T over eight and take its floor? You're actually taking the top two bits of T, the bits worth 8 and 16. So you move around in dimension zero when you flip either of those two top bits.
And the same for this part: R divided by two and R mod two are just bit one and bit zero of R. So we can rewrite this very complex-looking formula just by looking at the bits of R, T, and W, rearranging them, and forming another number. Each coordinate should be read as a binary number with four digits, from the largest digit to the smallest; for dimension zero that's W, T4, T3, R1. We check: moving one in W takes us eight units in dimension zero, which is binary 1000; moving T4, the bit worth 16, moves us four elements in this dimension, which is binary 0100; and so on. Dimension one is the same kind of thing. So we can rewrite this complex expression as just a bit shuffle, which is quite interesting.
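Written out (again reconstructed from the description, so the exact notation is an assumption), with R = (R1 R0)_2, T = (T4 T3 T2 T1 T0)_2 and W a single bit:

```latex
\mathrm{dim}_0 \;=\; (\,W\;T_4\;T_3\;R_1\,)_2,
\qquad
\mathrm{dim}_1 \;=\; (\,T_2\;T_1\;T_0\;R_0\,)_2
```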
Questions so far? All right, there are many questions. No, no, it's just that this is cool, so please keep going. All right. I guess Chris is saying that he doesn't buy the power-of-two constraint; he's been fighting the power-of-two constraint for a long time now. I don't quite understand the context here, I guess. But yeah.
We're going to see that there's no free lunch here: CuTe layouts do support non-powers of two, and linear layouts do not, so you have to give something to get something. On the other hand, we're going to see the sorts of things you can do with linear layouts that you cannot do with CuTe layouts, and later on I can discuss the connection between CuTe layouts and linear layouts, in which ways they are somewhat equivalent and in which ways they are not. I'll discuss that in a minute.
So we have this, which is a pretty neat representation of this very complex thing: just shuffling the bits of the register, thread, and warp indices and putting them into our output coordinates. We can then go further and represent this permutation as a matrix. Any permutation can be represented as a matrix; that's something you may or may not have learned at university, but it's quite obvious what you have to do: you look at an output bit, say the output is R0, so you select R0 when multiplying, and so on. It's a particular type of matrix where each row and each column has exactly one one.
So you can represent this function by writing out a whole matrix, and here all we've done is concatenate the bits of R, T, and W. I've put some bars here so you can see that these are the first two columns, which apply to the first two bits, and so on. And here we have the two dimensions, and we see that this matrix is exactly the same thing as the function we described before. I've changed the convention a little here, which is annoying: we write vectors from V0 to Vn, but when we write them horizontally we write them from Vn down to V0, from most significant to least significant, so something that was up here is down there. Other than that, you can check that if you multiply this matrix by this vector, you get exactly this, where we have swapped the dimensions, because dimension one is the fastest running when you're doing hardware stuff, and dimension zero is the slowest running, which is annoyingly the opposite of how we represent it in math.
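For concreteness, here is one way to write that matrix down; the slide itself isn't in the transcript, so the ordering convention below is an assumption, but the content follows directly from the bit shuffle above. With the input bits stacked as (R0, R1, T0, T1, T2, T3, T4, W) and the output bits as the four bits of dim 1 followed by the four bits of dim 0 (least significant first), the layout is the permutation matrix

```latex
L \;=\;
\begin{pmatrix}
1&0&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&1&0&0&0&0\\
0&0&0&0&1&0&0&0\\
0&1&0&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&1
\end{pmatrix}
```

so, for example, bit 0 of dim 1 is R0, bit 1 of dim 1 is T0, and bit 3 of dim 0 is W.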
So, all right, I came up with this, but how do you actually do it? What is this sort of matrix, and how do I come up with it? This matrix, we would write this way in Gluon: it's called a distributed linear layout. All these mythical linear layouts are just matrices of ones and zeros, and you can represent this matrix this way. So how do you read off these numbers? We've split the matrix into two dimensions; we said this part is dimension one and this part is dimension zero, because of the convention we changed a little. Then, basically, we just read off the columns. What is this column? Well, this is coordinate zero, and reading upward is coordinate one, so it's (0, 1); and the next one, reading upward, is (1, 0); and these are the register bases. And then the lane bases are (0, 2) and so on and so forth. But that's a bit like: all right, sure, I understand this, but what's all this math for? Sure, you can put it all together if you write it all down, but what a fuss.
But basically, what you do is look at register zero: what happens if you move by one, so you go to register one? You move to (0, 1). That's what this is saying: when you move by one register, you go to (0, 1). What happens when you move by two registers? Then you go in the other direction, to (1, 0), so we put that there. Then the lane bases: what happens if you move by one thread? You go to (0, 2). Move by two threads, you go to (0, 4); move by four threads, (0, 8); move by eight threads, (2, 0); and so on and so forth. So what are you doing here? You're basically reading off the columns of the matrix by feeding in the canonical basis, as we say in math: you put in one, zero, zero, zero, and so on, and see where it's sent. The vector (1, 0, 0, 0, ...) is register bit zero, and it gets sent to this column, because only this part of the matrix acts on this bit; then you do the same with (0, 1, 0, 0, ...), and so on and so forth. These bases are the columns of the matrix, and that's basically how you read the little picture.
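Collected together, the bases read off above would look roughly like this; the DistributedLinearLayout constructor name and keywords are assumptions based on public Gluon examples, but the basis vectors themselves follow directly from the walkthrough.

```python
# Hypothetical spelling of the layout read off above. Each basis vector says
# where the owned element moves when you flip one bit of the register/lane/warp index.
from triton.experimental.gluon import language as ttgl

layout = ttgl.DistributedLinearLayout(
    reg_bases=[[0, 1], [1, 0]],                            # R0, R1
    lane_bases=[[0, 2], [0, 4], [0, 8], [2, 0], [4, 0]],   # T0..T4
    warp_bases=[[8, 0]],                                   # W
    block_bases=[],
    shape=[16, 16],
)
```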
All right, questions? I think we're good on questions; maybe we can keep going. All right, yeah, sounds good.
So here's another example, for people who have done this kind of programming: this is the tile for the BF16 output of mma.sync. We can do exactly the same thing we did before. This is from the PTX documentation, and you see that this description is exactly the same kind of thing we had before: it tells you how the elements a0 through a7 are distributed, so each thread holds eight registers. Then we can figure out the distributed linear layout associated with it, which is just a matrix, but we're going to show it like this because it's easier. Going from register zero to register one, you go to (0, 1), so we write that down; going to register two is (8, 0), so we put that here; and it's (0, 8) for going from register zero to register four. And then the lane bases are (0, 2), (0, 4), and then (1, 0), (2, 0), (4, 0), and so forth. And then, if we have four tiles like this stacked downwards, we can form a bigger tensor: this one is 16 by 16, and we can form a 64 by 16 tensor by adding two more bases, and so on. So you can read it off here and also get this linear layout.
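Again collected together (constructor name assumed as above, basis values reconstructed from the mma.sync accumulator layout being described), the 16 by 16 output tile would be roughly:

```python
# Hypothetical spelling of the 16x16 BF16 mma.sync output tile layout.
from triton.experimental.gluon import language as ttgl

mma_out_layout = ttgl.DistributedLinearLayout(
    reg_bases=[[0, 1], [8, 0], [0, 8]],                    # 8 registers per thread
    lane_bases=[[0, 2], [0, 4], [1, 0], [2, 0], [4, 0]],   # 32 lanes
    warp_bases=[],                                         # single-warp fragment
    block_bases=[],
    shape=[16, 16],
)
```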
And then we have reductions. We've talked a little about reductions; imagine you have this layout and you want to reduce it. What is the slice layout we mentioned earlier? The slice layout basically takes this layout and removes the dimension we have reduced over, so we zero out that whole dimension. But then in the register part we end up with a (0, 0) basis, which basically means R0 and R1 now map to the same place, and we can just remove that basis, because it would be an extra, redundant one. Let me talk a little about that now, actually. There are zero bases here: what does a zero basis mean? We've removed all these entries, but we still have zero bases among the lanes. If a whole column is zero, it means that whenever you multiply this matrix by a vector, the output does not depend on that coordinate. So since the map doesn't depend on those coordinates, if you move along T0, T1, T2, up to T7, because we have reduced over that dimension, they all point to the same element, meaning they all own the same data. To lower that, you have to lower the sum using butterfly shuffles rather than shuffle-downs, but that's perfectly fine; it performs just as well. And this then lets you do broadcasting more easily and so on; this is basically what we call broadcasting, and it says that these dimensions are being broadcast.
So now for the big guns: the definition of linear layouts. What's a linear layout? A linear layout is a linear map over F2.
And this is one of those definitions that, even though it's true, we mathematicians really like because it's very tight. But when someone reads it, it's like: all right, this implies a lot more than I can see at first sight. It's a bit like that definition of monads where they tell you, oh yeah, it's just a monoid in the category of endofunctors, and you're like, all right, yeah, sure.
So let's unpack it a little bit. It's a linear map, and we know that linear maps over a finite-dimensional vector space can be represented by matrices. So it's a matrix over F2, and F2 is just zero and one, so it's basically a matrix of zeros and ones, as we said before, where when we apply it to a vector we replace addition by XOR and multiplication by AND.
The XOR part we're going to get to in a minute, but the AND we've already been doing. These are bits, so when we multiply this row by this column, we are just performing an AND of the horizontal vector with the vertical vector, and that's what we are putting here. So that's fair play.
But yeah, the XOR: some of you may already see where it's coming from, but we'll get there in a minute.
So that's all there is. A linear layout is exactly that: a matrix of ones and zeros that, when applied to a vector, acts through XORs and ANDs.
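As a concrete illustration (plain Python, not part of Gluon), here is what "applying" such a matrix looks like when each column is stored as an integer: the set bits of the input index select columns (the ANDs), and the selected columns are combined with XOR. A dimension whose columns are all zero gives exactly the broadcasting behavior described a moment ago.

```python
# A minimal sketch of applying a linear layout over F2.
# The layout is stored as its columns (bases), each encoded as an integer
# whose bits are the output coordinates.
def apply_layout(bases: list[int], index: int) -> int:
    out = 0
    for bit, basis in enumerate(bases):
        if (index >> bit) & 1:   # AND: does this input bit contribute?
            out ^= basis         # XOR: accumulate the selected column
    return out

# The identity layout on three bits maps every index to itself.
identity = [0b001, 0b010, 0b100]
assert apply_layout(identity, 0b101) == 0b101

# All-zero bases: every input maps to the same output. This is exactly the
# "broadcasting" case where T0..T7 all own the same element.
broadcast = [0, 0, 0]
assert {apply_layout(broadcast, i) for i in range(8)} == {0}
```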
So what's the deal with XOR? Oh, swizzling. Oh boy. XOR shows up in swizzling, and swizzling is the sort of thing I'm sure all of you have heard about: they tell you about swizzling and you're like, all right, all right, very cool that you were able to come up with that formula, but how the hell did you come up with it?
Anyway, let's talk a little bit about swizzling in the context of linear layouts. It turns out that swizzling can be represented by a linear layout. And here we can do exactly the same thing that we did before.
We can see that this formula is correct, but it's a very cheeky formula: how do you read it off from here? Well, we know how to do it. We know that this is the zero dimension and this is the one dimension, so we can figure out what the bases should be, because we were reading off bases before. By the way, here we are representing shared memory, so all these numbers are offsets in shared memory.
Before we had registers, threads, and warps; here we just have an offset, the offset in shared memory. And here we want to say: if I go to offset one, which element do I have? Well, if I move by one, I get zero one, so we put a zero one here in this matrix. Then if we go to offset two, we have zero two, and so on and so forth; for four we have zero four.
And here is how we come up with the formula. We were looking for element four; now we look for element eight. Basically, what we are doing behind the scenes, for those of you that have studied vector spaces, is using a basis of F2 to the n, and that basis, represented as numbers, is just the powers of two. So we are really using what we call the canonical basis, 1, 2, 4, 8, and so on, and reading off the columns as we did before. So eight is at the one-one, so we write one-one here. Then sixteen is at the two-two, so we write two-two. And thirty-two is at the four-four, and that's this.
So it's this, which is this. So this thing can be represented very naturally through linear layouts, which is very neat, as we will show in a minute.
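Here is a small sketch of that reading-off, again in plain Python rather than Gluon. It assumes an 8-column tile and packs (row, column) into one integer purely for illustration; the point is that the bases 1 -> (0,1), 2 -> (0,2), 4 -> (0,4), 8 -> (1,1), 16 -> (2,2), 32 -> (4,4) reproduce the familiar XOR swizzle formula.

```python
# A minimal sketch of the swizzled shared-memory layout read off above.
# (row, col) is packed as row*16 + col only so that XOR acts on both
# fields at once; the real compiler representation is different.
def pack(row: int, col: int) -> int:
    return (row << 4) | col

swizzle_bases = [pack(0, 1), pack(0, 2), pack(0, 4),   # offsets 1, 2, 4
                 pack(1, 1), pack(2, 2), pack(4, 4)]   # offsets 8, 16, 32

def offset_to_coords(offset: int) -> tuple[int, int]:
    v = 0
    for bit, basis in enumerate(swizzle_bases):
        if (offset >> bit) & 1:
            v ^= basis
    return v >> 4, v & 0xF

# Equivalent closed form for an 8-wide tile: col = (offset % 8) ^ (offset // 8)
for off in range(64):
    assert offset_to_coords(off) == (off // 8, (off % 8) ^ (off // 8))
```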
Um, all right, no more questions. Cool.
Yeah, Mario, I guess in this case you basically showed us the picture and then you showed us the math and said, aha, look, they match. But given a picture, how would you generate the math quickly, basically?
Yes, yeah. So, exactly: in the same way that we have this distributed linear layout here, we have another class called shared linear layouts, which, lo and behold, is the same thing for shared memory. Only it doesn't have register, lane, and warp dimensions; it just has offset bases. So if you wanted to create this layout via linear layouts, you could basically do this computation, use the bases 01, 02, 04 and then 11, 22, 44, and it would give you this layout.
Mhm. Um, there's a good question in the chat I forgot to mention earlier: when you chose this approach, what was missing in the other layout solutions that works more easily with linear layouts?
So, for now I'm just describing what all this mathematical stuff is. But the first thing you could think about is that now we can represent everything with some very simple (well, simple-ish) math. There's still a finite field there and whatnot, but it's not terribly difficult. And you can start thinking about these objects mathematically and start converting problems, and we're going to see a few of those in the next slides.
We can go from problems that before were just, you know, how do we convert from one layout to another? We can define that mathematically, we can describe bank conflicts mathematically, and so on and so forth, and we can actually develop algorithms mathematically, looking at all these vector spaces and all that stuff. That stuff is a bit more gnarly and it needs a little more math, and we have it in the paper that we published. But we have done it, so you don't have to, and that's the cool thing about Gluon: you don't have to do all the tricky stuff that goes into developing algorithms with this. You know that if you feed a linear layout into a reduce, it will reduce it, and it will swizzle the shared memory automatically for you, so it minimizes bank conflicts and is as efficient as it can be.
Um, all right. So, here we are. What's all this linear layout stuff and why would we want it at all? Well, first of all, as I said, it gives us a unified framework on top of which to build Gluon.
Someone said that Cutlass is basically just a very nice interface on top of CUDA. So Gluon, we jokingly say, is a bit of a nice interface on top of linear layouts. It has basically fixed the source of most previous bugs, where we had tons and tons of handwritten code; now you can use math, linear algebra, to solve stuff, and I'm going to show you an example.
Here I wrote "completeness", and for those of you who work on programming languages that word has a very loaded sense. It's a complete layout system in the sense defined in the paper, in that it's the smallest layout system we could have in Gluon to make Gluon work.
I won't get into that. But basically, this layout system allows us to translate hardware problems into linear algebra problems. One such problem that's very common: I have two layouts, a shared memory layout and one of the million register layouts that we had, and I want to figure out how to store into this shared memory layout. I have to compute the offsets for every register that I have in a thread: given a thread ID, a warp ID, and a register number, where do I put it? And which instruction do I use to put it where it goes in shared memory, and so on.
Linear layouts give us basically a one-stop shop to solve all these problems. Because the offset is just this sort of thing: the matrix A takes us from registers, lanes, and warps into the logical matrix, and then the inverse of B, where B is the shared memory layout, takes us from the logical matrix into the offsets. So you're just multiplying by the inverse of B, and we can just run Gaussian elimination to compute this product.
And this also shows a bit why it was so difficult to implement the code that we had before for all the possible layouts. It's because you're basically solving this linear system for a parameterized family of matrices, which is a pain to get right; if you've done it in university, you know it's just horrible, and no one wants to do that. But what we declare here is that all these things are compile time. We have the actual matrices at compile time, so we can just solve the system for two actual matrices, get another matrix, apply it, and move on with life.
So, it's really a very brainy solution to a very complex problem.
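For intuition, here is a compile-time-style sketch of exactly that computation in plain Python, under the assumption that both layouts are square and invertible: invert B over F2 with Gaussian elimination, then compose it with A. Real layouts also handle non-square and broadcasting cases, which this toy ignores.

```python
# A minimal sketch: compose B^-1 with A over F2 to map (register, lane, warp)
# bits directly to shared-memory offset bits. Matrices are lists of columns,
# each column an integer of output bits. Assumes square, invertible layouts.
def apply(cols: list[int], v: int) -> int:
    out = 0
    for bit, c in enumerate(cols):
        if (v >> bit) & 1:
            out ^= c
    return out

def invert_f2(cols: list[int]) -> list[int]:
    n = len(cols)
    # Row i of [M | I] packed into one int: low n bits are M, high n bits are I.
    rows = [sum(((cols[j] >> i) & 1) << j for j in range(n)) | (1 << (n + i))
            for i in range(n)]
    for c in range(n):                       # Gaussian elimination over F2
        piv = next(r for r in range(c, n) if (rows[r] >> c) & 1)
        rows[c], rows[piv] = rows[piv], rows[c]
        for r in range(n):
            if r != c and (rows[r] >> c) & 1:
                rows[r] ^= rows[c]
    inv_rows = [r >> n for r in rows]        # rows of M^-1
    return [sum(((inv_rows[i] >> j) & 1) << i for i in range(n)) for j in range(n)]

def compose(m_cols: list[int], a_cols: list[int]) -> list[int]:
    return [apply(m_cols, a) for a in a_cols]   # columns of M composed with A

# Toy example: A swaps two bits, B is the identity, so B^-1 composed with A == A.
A = [0b10, 0b01]
B = [0b01, 0b10]
assert compose(invert_f2(B), A) == A
```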
Um, so something else going forward that linear layouts have given us is fully generic lowerings, which is very neat. Basically, when you have a swizzled shared linear layout and a distributed linear layout, you can compose the matrices, and you get a transition matrix from registers onto offsets. Then you can start asking: if I have this matrix, can I lower it using stmatrix, or its transposed version, or one of the other variants, or plain stores to shared memory? And if I can vectorize any of these, how much can I vectorize, and so on. That is a linear algebra problem, and we've developed all the tools, and they're just linear algebra tools.
We are not overfitting on "oh, if I get these particular layouts, I do this." No: you have fully specified your problem, so now it's just a linear algebra problem, and that's quite easy to solve.
And the last thing you can do with linear layouts, which is basically what you get in exchange for giving up the non-powers-of-two support that CUDA has, is these sorts of algorithms that we developed in the paper and that we believe are very cool. We effectively solve the swizzling problem for two arbitrary layouts.
Which means: if you have a layout in registers and you want to transform it into another register layout, in other words you want to implement convert_layout, and you want to choose the optimal swizzling so that you minimize bank conflicts and maximize vectorization, we have an algorithm for that. Again, it uses a bit more abstract linear algebra, but we prove that the algorithm is correct under the definitions we have in the paper: it chooses the best shared memory layout such that it uses any of the available operations to lower the loads and the stores, minimizes bank conflicts, and maximizes vectorization. And we prove that you can do both, minimizing and maximizing, at the same time, so you are not losing anything by vectorizing more.
So this automatically generates all those funky patterns for transposing a matrix and so on that I'm sure you've seen in many CUDA textbooks, but it does it generically for any pair of layouts, which is super cool. And it's also generic over the instructions you can use. We've recently added all the instructions for AMD, and this works out of the box: some AMD architectures have 64 banks, some have 64 threads per warp instead of 32, and they have their own funky load and store operations, and all of this generalizes seamlessly. So you can do this generically and it just works, which is very neat.
And the other cool thing: if the two layouts can be converted from one to the other via shuffles, we also derive optimal shuffle sequences at the SASS level. To the point that in the original FlashAttention-3 paper there was a rather gnarly shuffle sequence that I think Tri Dao and his team said, you know, we came up with this and we are very proud of it, it's super cool. Our algorithm is able to derive that sequence, and virtually any other sequence that's optimal for going from one register layout to another via shuffles.
So this is the sort of thing you can get through linear layouts. I don't have much more on this; we have the paper for those of you that want to get more into the maths.
Yes, so Chris rightfully says that these generic lowerings also exist in CuTe, and that's completely true. As I said, it's really the optimal layout conversion that separates linear layouts from CuTe, because, as you'll see in the paper, you could generalize it to CuTe layouts, but you would need some very complex maths. I'm sure the folks at Colfax and Chris could certainly pull it off, but the discussion we'd have for those algorithms wouldn't be as simple as the one we're having here; you'd need some heavier machinery.
Everything's implemented in the Triton GitHub repo, yes.
And basically, to connect what CuTe layouts without the swizzle functor are in the context of linear layouts: I think they're called compact layouts (there are many experts on CuTe layouts here, so feel free to correct me), which are dense layouts without holes in between. Those are represented as permutation matrices like the ones we had here, one one per row or per column, perhaps with a few zero columns added in between.
So that's basically my mental model for where CuTe layouts meet linear layouts. They don't represent swizzling in the layout itself; swizzling is represented on top of the layout by a functor. The nice thing about linear layouts is that swizzling is represented in the layout, so you can really get all these cool algorithms out of them.
Um, so yeah, that's all for me.
All right. Well, thank you, Mario. I guess on the differences in how to generalize CuTe layouts, maybe we can continue that in chat, Chris, if Mario has time.
Otherwise, I don't claim to be a CuTe expert, so I'm sure Chris can explain that much better than I do. And if I made any omissions or mistakes, for sure correct me.
Um, all right. So maybe the next section is Karen, right? Let me pull up her slides.
All right. So, um, I think you can see my screen, right? Is it showing okay? Yeah, I mean, it's smaller than I expected. All right, that's weird. I'm not sure, but let me just get started for now.
All right, so I assume most of you have already learned the basic foundations of linear layouts as well as Gluon. The good thing is that we have implemented a few wrappers in Gluon so that you don't need to care about all the mathematical details of linear layouts that we talked about before; we have taken care of linear layouts and their conversions in the compiler for you. In this section, I'm going to go through a simple matmul example and talk about how we implement an optimized matmul on NVIDIA's Blackwell GPUs using Gluon's primitives.
First, let's review matmul in general. To implement matmul on GPUs, suppose we have A * B = C. Basically, we're going to divide matrix A and matrix B into tiles, and each CTA handles a tile of A and a tile of B. The result is then stored into a C tile.
For each tile, the data pipeline is fairly simple on GPU generations prior to Hopper: you move data from global memory to shared memory, then you load data from shared memory to registers, then apply MMA or FMA instructions. After that, each warp holds the result in registers, and you store the register results back to global memory.
However, with Blackwell, things become much more complicated. We have specialized tensor memory and new tensor core engines, and we're going to use tcgen05.mma instructions for matmuls.
When we do matmuls using this instruction, the operands can come from tensor memory or shared memory, and the result is stored in an accumulator in tensor memory. Now the data pipeline becomes a complicated multi-stage one. First, we need to get data from global memory into shared memory, or store it there from registers. Then we can call the tcgen05.mma instruction to start a matmul computation. After the matmul computation is done, we store the computed data from tensor memory to registers using tcgen05.ld. Then, if we want high-bandwidth memory transactions from registers to global memory, we usually move the registers to shared memory first and then store to global memory using an asynchronous transaction.
Here it's illustrated where the layout stuff comes in. We have already provided a wrapper for tensor memory in Gluon called the tensor memory layout. You can just call this wrapper, which is a class in Python, to initialize a tensor memory layout. The first argument specifies the shape of the contiguous region in tensor memory, which is a little different from the actual size of the C tile; it can be smaller than that. Two additional arguments specify something like the CGA layout, basically how many CTAs are in the CGA cluster and how they are distributed, and whether you want to use two-CTA mode. Similarly, we can declare a shared memory layout using the NVMMA shared layout wrapper, which accepts arguments like block shape, data type, and CGA layout.
Any questions?
Uh, I don't see any questions, so I'll go to the next slide, which shows explicitly how we use tcgen05 through Gluon APIs.
Peter already mentioned that we can allocate memory barriers using Gluon APIs. Since all of these operations are asynchronous, we first need to allocate memory barriers, including the TMA barriers and the MMA barriers. The barrier count can be any positive integer. After that, we need to actually allocate memory in shared memory and tensor memory.
Then we use these barriers for synchronization. We want to know after how many bytes these TMA transactions are considered done. Basically, we move the A tiles and the B tiles into shared memory, and the total number of bytes we expect to be transferred is just the size of the A shared-memory tile plus the size of the B shared-memory tile. Then we wait until all these memory transactions are done. After that, we can directly call tcgen05.mma. Since this instruction is also asynchronous, we want a barrier wait after it as well.
And after the computation is done, we load the result into registers, store it to shared memory, and then issue another asynchronous memory transaction to copy it from shared memory to global memory.
I'm not saying this is the optimal way to do it, but it is the simplest way to get tcgen05 working in Gluon.
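One concrete detail from this flow is the byte count the TMA barrier waits on: it is simply the size of the A tile plus the size of the B tile in shared memory. A tiny sketch, assuming BF16 operands and made-up tile sizes:

```python
# Expected-transaction byte count for the TMA barrier: A tile + B tile in
# shared memory. Tile sizes and dtype here are illustrative assumptions.
BLOCK_M, BLOCK_N, BLOCK_K = 128, 128, 64
BYTES_PER_ELEM = 2                           # BF16
a_smem_bytes = BLOCK_M * BLOCK_K * BYTES_PER_ELEM
b_smem_bytes = BLOCK_K * BLOCK_N * BYTES_PER_ELEM
expected_tx = a_smem_bytes + b_smem_bytes    # value the barrier expects before waking waiters
print(expected_tx)                           # 32768 bytes for these tile sizes
```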
Cool. In the next few slides, I'm going to go through a few optimizations. The first optimization, specific to Blackwell, is the two-CTA mode.
Without two-CTA mode, for a matrix multiplication like this we would replicate the B tile on both CTA 0 and CTA 1. But with two-CTA mode, we can split the one B tile evenly across CTA 0 and CTA 1, and both CTAs, as long as they are in the same cluster, can access their peer's shared memory. This enables one larger MMA with reduced shared memory consumption, because you don't need to replicate the B tile anymore.
How do we enable two-CTA mode? It's rather straightforward. First, we set the two-CTA flags in the relevant operations, including the memory barrier allocation and the tensor memory layout; as you saw before, there are knobs for exactly this. Second, you need to set the correct CTA layout for loading the operand tiles. tcgen05.mma will figure out the two-CTA mode based on the layout; if you get it wrong, your program will not even compile.
The next optimization I'm going to cover is TMA multicast. TMA multicast and two-CTA are two orthogonal optimizations; you can use either one of them or both. Previously, when we loaded these tiles from global memory to shared memory, both CTA 0 and CTA 1 would issue load instructions. But on Hopper and Blackwell, because of TMA multicast, only one CTA issues the TMA instruction, and it replicates the tiles into multiple CTAs' shared memory. This reduces your L2 traffic.
Here's a concrete recipe for enabling TMA multicast together with two-CTA in Gluon. What we ultimately need to do is specify a special CTA layout, a linear layout that specifies how to split the tensor across CTAs with regard to the M and N dimensions of the matmul.
For the left-hand-side operand, we specify the layout with bases (1, 0) and (2, 0): the first basis points to (1, 0) and the second to (2, 0). This linear layout says that the four CTAs are going to evenly split the M dimension.
For the right-hand-side operand, we use the bases (0, 1) and (0, 0). The first basis, (0, 1), means CTA 0 and CTA 1 evenly divide the N dimension, so we know we're using two-CTA mode, because CTA 0 and CTA 1 share the B tile. The second basis is (0, 0) because we need to repeat the same data on CTA 2 and CTA 3 as on CTA 0 and CTA 1. This triggers multicast automatically. The resultant CTA layout is (1, 0) and (2, 0).
Next, we also want to set the multicast flag in the relevant operations, including the MMA barrier-count wrapper, the TMA asynchronous load, and tcgen05.mma.
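Written out as data (illustrative only, not actual Gluon calls), the two CTA layouts read off the slide are:

```python
# CTA layout bases read off the slide; each basis maps one bit of the CTA
# index within the cluster to an (M, N) tile offset. Illustrative only.
lhs_cta_bases = [(1, 0), (2, 0)]  # 4 CTAs split the M dimension of A evenly
rhs_cta_bases = [(0, 1), (0, 0)]  # CTA0/CTA1 split N (two-CTA MMA); the zero
                                  # basis repeats B on CTA2/CTA3 -> TMA multicast
```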
All right, let me check if there are any questions.
What if I need more than a two-CTA cluster; is Gluon restricted? No, you can definitely have more than two CTAs per cluster.
With warp-based granularity, how does the compiler understand whether something needs to be done by a single thread, or do you specify it? How do you specify, like, a consumer releasing data? That's already handled for you.
So, since Gluon, like Triton, is still a fairly coarse-grained programming model, for operations that actually use one thread the compiler just handles it automatically. The user doesn't need to issue something like elect-one.
So, just to add to what Karen said: elect-one is a very good example of something that's incredibly specific to CUDA. You'd think, why can't I just write thread ID equals zero and live my life? But for reasons I don't fully understand, you need to use elect-one, because otherwise everything else breaks. So when you're writing CUDA code you just sort of have to know this. With Gluon, everything's in the lowering, so we'll know that tcgen05.mma is issued from just one thread, and actually, in two-CTA mode, it will be issued just from CTA 0. We handle all these rather tricky things and generate very good code, as far as we know, for each of them, without you having to learn which operation needs to be executed by which thread.
Yeah, and I would actually add onto that even further: it's kind of a leak in the CUDA programming model that you need to use this elect instruction, because the real reason you need it is that the compiler is transforming your per-thread code into SIMD code. It needs to figure out that all of the threads in the warp reach the same location, and that we're not actually issuing this instruction from a single thread; we're issuing it from the warp collectively. So the programming model of Gluon, I think, actually fits the hardware better than CUDA does. So, yeah. Anyway.
So, I think the key message we want to deliver is that the design ensures all the programming APIs in Gluon are on the same level; there's no leak of parallelism levels where you suddenly need to switch to controlling a single thread.
All right, so the next optimization is a little bit unrelated to layouts, but it's still very useful on Blackwell. It's called cluster launch control.
Even before Blackwell, we would sometimes use static persistent kernels, where we launch the same number of CTAs as there are SMs so that we can keep the SMs busy. But sometimes the workload is imbalanced, and prior to Blackwell, idle SMs could not pick up new work. Using Blackwell's cluster launch control, we can dynamically schedule future work onto the existing SMs to keep them busy.
There are two ways to enable cluster launch control in Gluon. You can use the CLC APIs in the same work group as the groups doing computation or memory loads, or you can initialize an independent work group dedicated to CLC work scheduling. Here we mainly talk about the second way, using the independent work group, because it is better for latency hiding. In this work group, we wait for the other work groups to arrive at the barrier; then we check whether we can cancel the launch of the next work tile that is about to be launched. After it's been canceled, we get the program ID of that work tile, and we notify the other work groups waiting for this ID, telling them: on this same SM, here is the next task you need to handle.
Let's put all these optimizations together. We get a speed-of-light matmul implementation, and its execution pattern looks like this. We issue four work groups: the cluster launch control work group, the epilogue work group, the MMA work group, and the TMA load work group. The API, just to remind you, is gl.warp_specialize, where the parameters are these four individual functions.
The pipeline starts with the TMA load, which loads the memory needed for the first iteration. That memory is consumed by tcgen05. After the result is generated, the epilogue stores it back to global memory. Notice that the stages are fully overlapped here, because we can afford to overlap all these asynchronous transactions: TMA, tcgen05, as well as the epilogue handling. At the same time, the CLC can store the next work tile in a buffer, which has as many slots as accumulator stages. Then, after the current work tile is finished, it can directly fetch the next tile without waiting on things like canceling the next launch.
So, this pretty much finishes the core Gluon part. If you're interested in learning more about Gluon, feel free to check out our tutorials and examples.
Next, I'm going to go through a few developer tools, starting with profilers. The profiler for Gluon is Proton, our specialized profiling tool for Triton; it supports both Gluon and Triton.
To start profiling with Proton, you use its start API to get a session ID, which you can then use to control profiling activation and finalization. There are in total three ways to use Proton: coarse-grained profiling using the profiler API, fine-grained profiling specified with backend equals instrumentation, and PC sampling, where you change the mode to PC sampling.
Let's get started with the profiler API. To use the profiler API in Proton, we first need to specify what data to collect; here we give the profile the name matmul. What makes Proton convenient for benchmarking is that you can flexibly choose which regions and kernels to profile and skip the kernels or operations that are not interesting to you. In this case, since we're going to benchmark a single Gluon kernel, we can skip the kernels that clear the cache. Notice that we call a clear-cache function on an L2-sized buffer; this launches a torch kernel, but we can simply skip it so it isn't included in the final profile. So first we deactivate profiling, then we activate it right before the function we want to profile, and after all the profiling is done, we save the profile to disk by calling the finalize API.
Using this profiling functionality, we have analyzed the performance in TFLOPS on B200. This is the kernel we just optimized, the same as the matmul example we discussed before. After applying all these optimizations, we can compare Gluon's performance with torch under the same input shapes. You can see that for these three configurations, even without exhaustive autotuning (we haven't even swept over all the block sizes and numbers of stages), Gluon's performance is on par with cuBLAS. And based on the cuBLAS kernel-name suffix, we can see that it's also using the two-CTA mode.
The next profiling functionality is fine-grained profiling using the instrumentation API. To enable it, we choose the instrumentation backend. Optionally, you can configure the profiling mode using the mode argument: the profiling buffer, which lives in shared memory by default, can be either too large or too small, and in that case you may want to adjust the mode a little. After that, we finalize profiling using the same finalize API.
All of these APIs are called on the host side, on the CPU. On the device side, what we need to do is annotate regions using the profiler's language API or a context manager. For example, if we want to annotate the work in the matmul's load partition, to trace how that work behaves across the entire timeline, we annotate the region with enter-scope "load partition" and exit-scope with the same name. Similarly, we can annotate the matmul's MMA partition.
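For a sense of shape, the device-side annotation might look like the pseudocode below. The enter_scope/exit_scope names come from the talk, but the module path, decorator, and call placement are assumptions; treat this as a hedged sketch rather than verified Gluon or Proton API.

```python
# A hedged sketch of device-side scope annotation; module paths and the
# decorator are assumed, not verified.
from triton.experimental import gluon
import triton.profiler.language as pl   # assumed location of the device-side API

@gluon.jit
def matmul_kernel():                     # kernel arguments elided for brevity
    pl.enter_scope("load_partition")
    # ... TMA loads for this work tile ...
    pl.exit_scope("load_partition")
    pl.enter_scope("mma_partition")
    # ... tcgen05.mma issue and barrier waits ...
    pl.exit_scope("mma_partition")
```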
After that, we can visualize the profile we obtained using the profile visualizer. At a high level, without zooming in, we get the trace of the different SMs running this kernel. You can see there is still a little bit of workload imbalance, but mostly it's fine. Zooming into one of the CTAs and expanding the full timeline, we can see that the epilogue, load partition, MMA, and CLC partitions are mostly balanced. I deliberately chose the end of the timeline, so it may look more imbalanced, but it's actually quite well balanced already.
All right, so let me take a look at the questions first.
Out of curiosity, what are the general principles behind the different design choices in Gluon versus TLX?
Um, I don't think either Mario or Peter knows too much about TLX, but based on my shallow understanding, the purpose of TLX is to extend Triton with relatively small additions while still getting better performance than Triton, with knobs like warp specialization. But in Gluon, what we want to do is build a wrapper on top of TTGIR so users can use these hardware primitives, control memory allocation more closely and in more detail, and tune their algorithms easily.
The next question: are warp-level barriers better than CTA-level barrier instructions? I don't think we expose warp-level barriers; I probably need more context for this question.
So, meanwhile, I'll go through the sanitizer slides. In addition to performance tools, we have also developed a series of debugging tools for correctness. To enable them, all you need to do is turn on a knob specifying which tool to use; the runtime will then recompile and instrument specific instructions to check correctness.
First, we have the concurrency sanitizer, consan for short. It detects races within a Gluon program, such as consuming a TMA load before waiting on its barrier, or overwriting MMA inputs while the MMA is still using them. There is also an invalid-instruction sanitizer, which adds alignment checks on TMA load and store operations, because TMA instructions have specific alignment requirements.
We developed these sanitizers largely because NVIDIA's compute-sanitizer either doesn't have the corresponding checks or is too slow without knowing the compiler's semantics.
In addition to these sanitizers, we have also developed a floating point sanitizer and a global memory sanitizer. To enable them, you use the same knob, with instrumentation mode equal to either FSAN or GSAN.
For the floating point sanitizer, what we do internally is replace floating point arithmetic with integer operations so as to preserve mathematical identities like associativity. This enables testing optimizations that change rounding, meaning compiler updates or algorithm updates that change rounding, without relying on tolerances like assert-close.
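The motivation, for reference: floating point addition is not associative, so any transformation that reorders it can legitimately change the bits of the result, which is why plain bitwise comparison normally can't be used. A one-line illustration:

```python
# Floating point addition is not associative, so reordering changes results.
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)   # 0.6000000000000001 vs 0.6
```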
There's another sanitizer, the global memory sanitizer, that introduces runtime checks for race conditions between CTAs in a Gluon program. It supports NVLink communication and cross-kernel synchronization. It's noteworthy that you must allocate your tensors with GSAN's create-mem-pool to enable the global memory sanitizer.
There's also a question about bar.warp.sync. As we mentioned before, the programming granularity in Gluon is still the work group, not the individual warp, unless the work group only contains one warp.
Okay, so as Peter mentioned earlier, we indeed have a visualizer. This is a screenshot of the visualizer, which was developed by my student. I'm going to briefly introduce its functionality and then go through a demo. The purpose is to help students or developers who want to learn existing layouts, and compiler developers who want to debug custom layouts for new hardware or for Triton operators like reshape, split, and join. You can also use it to debug linear layout operators, including more complex ones like compose and invert.
To access it, we have launched a website, and the code is fully open source; there's also a manual explaining how to use it.
The main functionality of the visualizer includes selection (you can select preset layouts), tabs that let you switch between the different layouts you have open in your workspace, applying linear layout operators directly in the visualizer, and slicing to check the mapping between registers and their logical locations.
Next, I'm going to open up the website and show you the visualizer. I guess I need to switch and present another screen. Hey Mark, can you enable sharing? Okay, thanks.
Cool. So, to start with, I'm going to show you a simple layout to demonstrate the visualization functionality. This is a blocked layout.
To get into the specific mappings, we zoom in to the locations. Here you can see, based on the definition, that each thread has four registers, starting from T0 and followed by T1, which are coalesced along the column.
There are a few buttons here you may want to use. The first one is select: after choosing select mode and selecting corresponding locations on the blocked layout and the hardware layout, the selected locations are highlighted, so you can establish the mapping relationship. You can also switch from 2D view to 3D view, though I don't think it's used much in general. There are other helpful buttons, like hiding the tensor dimensions or the tensor names.
On the right side, there's the layout specification. This is the exact linear layout specification of this simple blocked layout, and you can find its definition in the layout matrix, where you can see that each column has only one bit set, so this satisfies the distributed layout property. If you want to initialize this layout in Gluon, you can copy the initialization code.
In the next tab, you can control which tensor to visualize; you can hide the hardware layout, for example, if it's not useful to you. And there's another functionality I think is pretty useful: slicing. Using slicing, we can choose which dimension to slice and then view the hardware or logical locations of these points. For example, suppose we choose to slice over the thread dimension: then we can step from thread one to thread two to thread three and see how the location mapping changes in the blocked layout. You can also select two dimensions.
And view the... I think your screen is not updating, so I don't think we are seeing what you're seeing.
Oh, now I get it. Yeah, probably Mark was sharing. Yeah, now it's working. Okay. All right. Let me start over.
Basically, on this website we can zoom in to view the detailed mappings like these; I guess you can see the text clearly now. You can see each thread has four registers, they are coalesced, and the next mapping is thread one. The select button lets you highlight the locations in both the hardware and blocked layouts. And on the right side, we can view the layout specification, which includes the actual linear layout matrix as well as the initialization code.
Another functionality is slicing. You can choose to slice either the logical layout or the hardware layout. If we slice over the hardware layout, we can choose which dimension to slice and then view the difference from thread equals zero to thread equals one and how it's mapped onto the blocked layout, which is quite convenient for studying. You can even select multiple dimensions, including thread and warp, and view how these locations change.
Next, we can view some layouts we've preloaded for you, such as the MMA A, B, and C layouts. One thing we have shown to be useful in linear layouts is swizzling, so I'm going to show you how swizzling looks. You can see, based on the swizzling bits, that we have permuted the locations of these columns across different rows. In the matrix, what you can see is that some columns have two bits set; those are the bits that actually perform the swizzling.
Even better, in the layout-operation tab you can apply arbitrary linear layout operators to test and verify your understanding. For instance, if you want to test the product operation, you can just copy the layout name and apply the product symbol, and it will render an interesting layout like this. I don't think this particular one is useful, but it's good to demonstrate that we have this functionality.
All right, the last functionality I want to demonstrate is that it gives you a better way to visualize the layouts that are embedded in the PTX documentation. Normally, if you want to find the corresponding layout in the PTX documentation, it can take a while to scroll through the very long page and locate it. But here, with our preset layout functionality, you can easily choose which GPU architecture to visualize, the corresponding instruction, and the matrix size. For example, I guess you're already familiar with MMA, so let's use ldmatrix.
Oh, Karen, I'm really sorry, I'm getting kicked out of my meeting room, so we might need to start wrapping things up. I might lose my connection.
All right. So anyway, this is how you visualize the ldmatrix layout, and you can turn on propagate-output and zoom in to get a better sense of it.
To wrap up this talk, I think we have basically covered the main features of Gluon. The key takeaways are: first, Gluon exposes hardware features.
Sorry, Karen.
Yeah, but I'm not sure if Mark allows me to change. Oh, all right. Give me one second.
Oh, you can change, you can change. It's just... yeah, glad you are still here. Sorry for the rush; it's a new office, so it's a bit chaotic.
I've switched my screen to share the slides. All right, so the key takeaways are: first, Gluon exposes hardware features. It has first-class primitives for hardware-specific capabilities and explicit layout control for fine-grained optimization. Second, it eases programming with its design: it still integrates a Triton-like programming model and abstractions, but it also automatically allocates memory and inserts synchronization instructions when necessary. Third, fine-grained profiling for Gluon is accurate, thanks to the minimal compiler transformation passes.
All right. Well, thank you, Karen, and I apologize again for rushing you at the end. For everyone in chat, we do have a Triton and Gluon channel on the server if you're curious about trying out Gluon or have any questions. On your end, would you be okay with sharing the slides? That way I can put them in the YouTube description and people can take a closer look.
Thank you so much, guys, I learned a lot. I can't wait to read the linear layouts paper. Mario, to your point, because it's matrices of zeros and ones, I can imagine operations on top of this being quite easy to visualize, try out, and code, so I'll certainly be giving it a shot. Thank you so much for coming, and please come back again soon.
Okay. Thank you, everyone. Thank you. Bye. Bye.