[ARRAY'25] Kuiper: verified and efficient GPU programming

By ACM SIGPLAN

Summary

Topics Covered

GPUs are an adversarial environment for program verification
Verification eliminates the safety-versus-performance tradeoff
Kuiper extracts verified code to plain CUDA C
GPU resource lacks formal definition and is opaque to model spawning restrictions
Permission-based logic handles data-dependent GPU loops

Full Transcript

Hi, welcome to the last session. Uh we

have two amazing papers. The first will be given by Gro Martinez who works for MSR a search engineer. So GO take it away. All right. Thank you. Um yeah,

away. All right. Thank you. Um yeah,

this works perfect. Uh yeah, so I'm going to tell you a bit about um Kyper which is a language we're developing for um doing safe first of all um GPU programming and also um verifiable in the sense that you can prove that it's

doing what it has to do and also not uh relinguishing uh efficiency. And this is joint work with a bunch of people. A

special shout out to Yona who was our intern last year. Um so the first question of course is why what do we want a language that's you know verified GPU programming. And the first and main

GPU programming. And the first and main reason is that GPUs are really hard which is probably not surprising to this crowd. Um there's uh there's two

crowd. Um there's uh there's two different memories. You need to deal

different memories. You need to deal with this weird parallelism with the blocks and threads. You have uh barriers and a synchrony and memory hierarchies and um errors are kind of may go unnoticed at sometimes and debugging is

also hard. So, it's like not a not a

also hard. So, it's like not a not a great place to be as a as a prover. Um,

and performance is also like paramount.

It's like pretty much if if you're using GPUs, you probably really do care much about performance and we don't want to um renounce um um you know full

performance or have any kind of dynamic checks or sacrifice um you know not not be able to do what we want to do just because we have to avoid bugs and be safe. um and we want very precise choice

safe. um and we want very precise choice of like uh the memory layout and when we copy stuff and etc. So there's kind of this trade-off between um safety and um

you know a correct programm and also being like 100% performance um when you're developing at scale. So um

verification is kind of a way to get rid of that trade-off and do not trade performance for safety. can do, you know, anything you want to do. Use every

any fancy instruction or or a pattern or whatever um that you want to as long as you just prove it that it's safe. So um

Kyper is a language that um starts with this this mindset. Um it allows you to do many many things. Um so it supports all of the usual suspect of GPU features. So there's heterogeneous

features. So there's heterogeneous memory, there's sharing between different threads, parallelism, synchrony and it's always memory safe and database free by construction. Um,

optionally you can even verify that your your programs are doing what you want them to do. So if you write a fancy gem that's splitting the thing into three levels of tiles and sharing between different blocks and whatever. Um, you

can anyway at the end prove that it's actually just computing a gem at the mathematical level and you don't need to trust the implementation. You just need to look at the specification. Um, and

there's few restrictions. You can write the fancy algorithm that you want as long as you just prove it that it's that it's safe. Um, so so far Kyper is

it's safe. Um, so so far Kyper is relatively new. It's maybe like a

relatively new. It's maybe like a yearish old. Um we have like several

yearish old. Um we have like several flavors of gems like fancy matrix multiplications um with no performance hit compared to a handwritten implementation. Um we also have programs

implementation. Um we also have programs using atomics. We have a parallel

using atomics. We have a parallel reduction implemented. We can do a

reduction implemented. We can do a synchron as synchronicity like launching kernels asynchronously. Uh use floating

kernels asynchronously. Uh use floating points and um do some forms of generic programming. So like a very um um free

programming. So like a very um um free style of polymorphism. Um Kyper is built on top of and pulse. Um Sar you might know is this proof oriented verification language. Uh polls you might not know

language. Uh polls you might not know about, but you're kind of in luck because in Friday we have a paper about it on PLI. So if you're curious, uh you can show up and and listen to it. Um I'm

going to show you a bit of pulse today, which is kind of like the the basics to get us going. Um so uh Pulse is a language that's based on um concurrency evation logic. That might sound a bit

evation logic. That might sound a bit scary. Um I'm just um I'm just going to

scary. Um I'm just um I'm just going to give you like a primer in one minute of what this is about. So it's an extension to logic. So pre and post conditions

to logic. So pre and post conditions that makes reasoning about pointers and data more compositional. So um the pre and post conditions in in separation logic represent resources which can be

spent. They're not just about truth. Um

spent. They're not just about truth. Um

and threads own resources. So um this are like the the prototypes are simple functions like allocate a reference and free it. Allocate in this case um I give

free it. Allocate in this case um I give it an initial value and it gives me back a reference and as a post condition it says R points to B and this is um in a way telling me that R does point to E in

the thread in the in the heap but also that I own this fact. I'm I am the single owner of this of this fact that R points to B and then I can free it. So

the uh this other function free here has a precondition that R points to B and then when I call it I give this precondition I give this resource and what I get back in the post condition is empty. I lose it once I free a reference

empty. I lose it once I free a reference it's no longer there and I cannot use it. Um

it. Um uh and then some resources like this one can be fractioned and shared. Um

fractioning means um for instance if I have a reference that points to a value B, I can fraction this resource into two um resources saying that R points to E but I I only know half of this and I only know another half of this and then

I can give this other half to another thread and to read a reference I only need a fraction of a permission. I don't

need to be the single owner to read. Uh

so this is the way that um concurrent read only accesses are permitted in in separation logic. So this fra notation

separation logic. So this fra notation is what represents a fractional resource. Um, and then for instance,

resource. Um, and then for instance, this last thing is copying Y into X. Uh,

I only need some read permission over Y.

And then my precondition is that X points to some value X0. And my postcond condition is that now it points to something different because I'm I'm changing it. And I am the single owner.

changing it. And I am the single owner.

If it doesn't set, hey, frack, I'm the single owner of this resource. Um, yeah.

So for instance, once you have this language set up, this is like a very tiny pulse example. If I allocate a reference one, R initialize to one, I get the fact that R points to one, I could free it and then I lose this fact.

And if I try to do anything with R after it, this fails. Like I don't own this resource anymore and this this is gone.

Um so look into pulse if you are curious or ask me about it um at any point. Um

so how does this help us in GPU programming? So I'm going to show you a

programming? So I'm going to show you a a very simple example of verifying a kernel thread. So um this is about this

kernel thread. So um this is about this is um an example of a kernel that takes two arrays in memory and pointwise multiplies them into a separate array um AR. So this is the what I'm showing on

AR. So this is the what I'm showing on the right here is actual Kyper code. Um

it has three parameters A1 A2 AR that are GPU arrays. Um there's a parameter about the uh that is the thread ID of the thread that is running. This is

similar to thread ID thread index.x in

CUDA but we make it explicit parameter because it looks nicer. Um and the implementation inside the braces here the body um is really just what you would expect. I read array A1 at tid.

would expect. I read array A1 at tid.

read array A2 at TID um multiply the values and save them into AR. Um now the implementation is direct. Um what about the specifications? So we need to add

the specifications? So we need to add some stuff. Um so first of all we need

some stuff. Um so first of all we need to say um this function when it's called owns some fraction of the array A1 and some fraction of the array A2. So it is

allowed to read them. If I did not own this, I cannot read them at all because they could be owned by some other thread who might free them. So they would be unsafe. Um so oh when I say preserves it

unsafe. Um so oh when I say preserves it means something that is required and also given back at the end. So we take some permission on A1 and A2 we give it back at the end. Um and the requires

another part of the precondition is that um we know and we own that the tit cell of the result array points to something and in the post condition we say uh we we change this v 0ero value and now it

points to something else. Um which

allows us to write over this thing and and this one is full permission over the cell. Um so this is like the separation

cell. Um so this is like the separation logic specification that you would expect for for this function. Um in

Kyper we need to add a few more things to make it this actually into a um a GPU kernel. Basically we need to mark this

kernel. Basically we need to mark this function as a GPU kernel. Um this thing is kind of a um a token resource that we use just to mark that this thing um will run on the GPU and it can be spawned

into a thread group and it can use GPU only operations like barriers to synchronize and etc. Um, another thing is that we need to explicit we explicitly encode the grid constraints.

So if I'm saying that this function will multiply two arrays of size len this needs to run in a block of size len um I cannot just run whatever number of threads I want. If I want this post

condition at the end that that they actually multiply to to two arrays um we need the block to be the proper size. So

this is encoded in the specification as well here. Um and then um when you when

well here. Um and then um when you when when we're verifying and trying to get you know decent reasonable properties out of this we need to have some add some proof steps proof steps once in a

while. Um so in this case if we want to

while. Um so in this case if we want to verify that at the end we get the um the pointwise multiplication of these two sequences. There's some convincing that

sequences. There's some convincing that we need to call um to to do to get positive to to accept this. Uh so it's those final two lines. Um but if we only if we only care about safety and we just

want a program that we know will just uh will not crash and will not have data races, this can be removed um just by weakening the specification and saying ar points to some other value, some new

value. Um so how do we execute these

value. Um so how do we execute these things? So I'm showing you this fancy

things? So I'm showing you this fancy language that looks kind of different from from CUDA or something and it has the specifications. Um the way we

the specifications. Um the way we execute these functions is by extracting them into CUDA C. So after we check these functions in in pulse in Kyper um we extract them and we get code that looks exact exactly like this actually.

So it's pretty much what you would expect. Um the the code is somewhat

expect. Um the the code is somewhat readable. Um all of the specifications

readable. Um all of the specifications and proofs are gone. You just get plain CUDA C code out of it. Um without any kinds of dynamic checks you we we don't do any kind of transformation and in

that sense and this can be used directly by CUDA clients. You don't need to buy into to write your entire program in Kyber to be able to use it. Um, does

this kind of make sense so far? Um,

good. Um, so I was just showing you how to verify one single um, one single thread in a way. This is this is the function that executes one single thread. Um, when we want to spawn a

thread. Um, when we want to spawn a kernel, we need to we have, you know, some big resource like I have two of my input matrices AB and I want to write the products into a matrix C. Um, so I

need to split this resources and uh, to be able to spawn a big kernel and allow them to compute. Um so the way this looks like in in separation logic is roughly that um at the beginning as the

main thread of my function let's say of of my program and this is just a CPU thread I fully own A and B and I fully own C and the f the first step I will do is split this resources in some

interesting way. So, for instance, um

interesting way. So, for instance, um let's say I'm going to spawn, let's say, 16 threads, but they don't fit on the slide, so I'm just going to show you four. Um I'm going to spawn a bunch of

four. Um I'm going to spawn a bunch of threads and um I want every thread to be able to read from A and B and to write into a single cell of C. So, the steps you need to do is um take the matrices A

and B and share them across all of the threads. So, in 16 equal pieces and

threads. So, in 16 equal pieces and everyone gets a some fraction of permission. And the output matrix, it's

permission. And the output matrix, it's it's a different kind of sharing.

explode it into cells and give to every thread full ownership of a single cell.

So they can mutate it, but they can do nothing else with the rest. Don't even

read them. Um after we do this, you can you can spawn a big kernel that works um every every thread works on one of these output cells. They modify them somehow

output cells. They modify them somehow and now they make them purple. Um they

cannot really modify the inputs because they all share them. So sharing them um modifying them when someone else can read them is a data race. Um and then at the end we'll also join all these resources and get back you know full

permission over AB and full permission over the now modified C. And I should say all of the computation really happens in in this middle step um here what I'm when I say of all of this um

sharing resources and and slicing the matrix and all this these are all ghost steps that um don't have any computational meaning. They're just to

computational meaning. They're just to convince um they're just proof artifacts in a way. Um

so with that in mind the way we model kernel launches in in Kyper is um if we if consider launching nth threads I'm going to I'm going to um let's say we have a simple example of let's launch

one block of nth threads and that's it.

And every thread I the thread index I has a precondition prei and a postcond condition post I. So if we want to launch all of these threads in parallel what we have to do is provide at the beginning um each of the preconditions

of the threads separately. We need to be able to split the the the pre split our our resources into separate compartments where um each thread has its own precondition separately from the others.

Um then we can run all the threads in parallel and we get back all the all the post conditions separately. Um now this is generic enough but it's it's in a shape that's not really useful. Um so

the pre and post conditions of the full thing have a a simpler way of usually have a simpler um way of expressing them. So we add some ghost steps to

them. So we add some ghost steps to tweak it. Um so a kernel The kernel

tweak it. Um so a kernel The kernel specification looks most like um there's a big precondition and a big post condition and then there is a way to uh split the big precondition into the individual preconditions run the threads

in parallel you get the separated post conditions and then you put them all back together into post. So at the end this looks more like uh okay you give me permission over the full matrix and I'll give you back permission over the full

matrix at the end even though in the middle it's all shared in this um slicing and sharing and this kind of interesting patterns. Um and here this

interesting patterns. Um and here this was an example of just one block of n threads. Usually a grid is at least has

threads. Usually a grid is at least has two levels. So you have to spawn blocks

two levels. So you have to spawn blocks and um each block spawns a number of threads. So the same idea applies just

threads. So the same idea applies just recursively. Um you do this two steps in

recursively. Um you do this two steps in a row. Um and then it allows you to to

a row. Um and then it allows you to to capture the specifications of of more complicated grids.

Um so once we have this kind of proof done, uh we can package it into what we call a kernel description. Um which

means that um so there is a a type in Paul called K description that says this is a package of a kernel that goes from pre to post and it does so by calling some kernel function and it has some

grid size and here are the proofs that um that show that this is all correct and once we defined um an element of this type we can forget about all the

details of the kernel. This is all um irrelevant basically and um since they encode the grid size and all of that uh we can just spawn a kernel a kernel description and the function that spawns

a kernel looks something like this give me a kernel description that goes from pre to post and uh I require pre and I ensure post and that's it and then everything else is happening um under the covers and there's no messing up

when you spawn a kernel there's no messing up the number of blocks and threads and etc which can happen included this is all encoded in the record Um

this makes sense so far. So um another um of Kyper features is that um going beyond like the safety and the the way the separation logic is involved is that

um we make heavy use of polymorphism and specialization. So Kyper inherits

specialization. So Kyper inherits polymorphism and type classes from from pulse and a star which are um which we built up on top of. Um so in this

example here this is a type of a matrix multiplication. um it's polymorphic over

multiplication. um it's polymorphic over the element type of the matrix. So the

type of every cell as long as that type is scalar. This is a type plus

is scalar. This is a type plus constraint in far. Um the type being scalar just means that it has a plus and a and a and a product and a zero and a one. Um and the implementation here is

one. Um and the implementation here is completely parametric over the type we're working on. And when you choose a type, you can specialize this and generate specialized CUDA code that works over the particular type um

without any runtime u like dynamic dispatch or anything like it. Um you can also use this for instance if you want to compute shortest distances in a graph. You can instantiate this with the

graph. You can instantiate this with the uh min and plus um semi- ring as air was telling us this morning. Um and this allows you to just reuse your matrix multiplication and specialize it to

compute um short distances in a graph at at no cost. Um

right going a bit further than this actually um we we can we can write programs that are not only polymorphic over the the type they work on and the operations to work on but also on the matrix layout. So um I keep saying we we

matrix layout. So um I keep saying we we have this matrices that are in memory some some way. Um there's many representations you can use for matrices. The most the two most basic

matrices. The most the two most basic ones are like row major and column major. So ideally I would like to write

major. So ideally I would like to write my algorithms just once and then just instantiate them for for different layouts. And Kyper supports this in a

layouts. And Kyper supports this in a very um very comfortable way. So in the in the type you see here on the right um it's saying um this function takes three

uh matrices GA, GBGC and they have some layout LA, LB and LC and this is defined in some way in in far um and there's a type of constraint saying that the layouts have to be concretizable and

computable. Um, and the layout, the only

computable. Um, and the layout, the only thing that the layout is defining is is where um where does the cell J of the matrix go in memory. And if it's row major, it's I * columns plus J. And if

it's column major, it's the other the dual. Um, and this layout doesn't even

dual. Um, and this layout doesn't even have to be regular. If it's a fixed size, like any crazy works. If you're

working over matrices of 4x4 and you decided that you want to store them by rows, but some row is backwards or whatever, this can all be modeled in Gyper. Um, and every function we wrote

Gyper. Um, and every function we wrote is genetic on top of this. um and

specialized implementation are also possible. So we're also we're currently

possible. So we're also we're currently looking at trying to model tensor cores inside Kyper and this we will definitely need to have specialized um um well they use a given particular particular

representation. So those are not

representation. So those are not polymorphic anyway. Um so all of these

polymorphic anyway. Um so all of these algorithms we write them um in this very polymorphic and um specializable way.

And once you do this you can specialize a given function. So for instance, here's a um this mattal GPU function here is a a very polymorphic matrix multiplication and there's a

specialization function that says okay specialize it now to floats and the two inputs are in column major and the output is in row major and outcomes this code which is kind of reasonable CUDA

for this. Um and if you look closely you

for this. Um and if you look closely you will see that the array accesses to GA and GB happen in in column major and that the the access to the output GC happens in row major. But I didn't have to think about any of this when I wrote

this. We just verify this once and we

this. We just verify this once and we also verify what column major and row major mean only once and then we we get to specialize every combination of these functions.

Um so we're getting special this specialized verified kernels for free with the same guarantees and there there's no runtime. This this CUDA code is exactly what you would expect and any parameter is specializable. So I I've

been telling you about um the type and the operations and the layout. Um

another important parameter here that you can specialize is a tile size. So

you can define um um you know many many versions of matrix modification that copy tiles into shared memory and do this a couple times and the tile size is a very important parameter. Um so what you can what you do in Kyper usually is

you define a function that does this with where the tile size is just an argument. So a dynamic argument um and

argument. So a dynamic argument um and you verify this function and then you can choose to specialize it to 2 4 8 16 32 and then you try to see you tune it to see which one works best. Um and this

it's easier to tune in a way this allows you to tune more easily since you know every every version is correct. They are

all correct by the fact that you verify the generic one. Um yeah just a minor fact we also support some basic forms of um rotation fusion. So if you let's say

you need to compute the product of a and b um a * b transposed well what you can do is compute a * b let's say that they're in row major you do the product into column major representation into

some buffer and then if you just flip your your view this is really the transpose in row major. Um so this can all be proven inside the tool so you can avoid like spirious steps of transposing

and etc. Um today by the way with this specialization trick we're generating around 15,000 lines of CUDA from around 15,000 lines of of star um or pulser. Um

this ratio is usually much lower but the you know the the the exponential verification here really kind of gives us an edge. Um some more fancy features

that we support as well are um barriers.

So in a block, you know, usually needs to synchronize between the threads. And

um so the way a barrier works in CUDA is um all the threads have to reach a barrier uh before they can all go. Um

and this allows you to you know make sure that some things have been copied or or whatever you're doing. And the way we model this in Kyber and in separation logic is that um betterers have a precondition path that uh the threads

have to provide. So from um 0 to n minus one. Let's say there are n threads. Each

one. Let's say there are n threads. Each

thread has has to give P of zero, P of one, P of two. So thread I has to give P of I into the barrier and then when they all get there um the we we model there

is some ghost shuffling of resources that happens um and every thread comes out of the barrier owning uh Q Q of I so thread zero gave P of zero and gets Q of

zero and the same for every every other resource. Um, and the only proof a user

resource. Um, and the only proof a user needs to do to show that a barrier is correct is that this is actually a shuffle. That if you take all of the PS,

shuffle. That if you take all of the PS, you can turn them into all of the cues.

Um, this is useful for instance, if you're using a local cache, like if you're copying global memory into shared memory um, which all of the threads in a block share, you use them to say um, at

the beginning of of the iteration, every thread in my block owns a single cell of the of this array. Um then u the so the barrier the precondition of the barrier is that I own a given cell of the array

and the um the post condition of the barrier is that I own the full matrix with just some fraction. Um so then what happens when you call sync threads is that after this every thread owns a um owns a fraction of the matrix. Um and

then you can also go back um with the same variable to the to the previous state and keep copying memory back and forth and sharing them and um sharing them between all the blocks and the threads all the threads in the block

sorry. Um

sorry. Um um another bit more fancy feature that Kyper supports is um a synchronous kernel launches. So usually um you know

kernel launches. So usually um you know including a kernel does not immediately wait for it. It's just launched and scheduled into the GPU to be to be executed and you can spawn many of them

at once. Um we also model this in Kip

at once. Um we also model this in Kip and allow you to reason about this stuff. So the way this works is um um we

stuff. So the way this works is um um we we use a reasoning um principle for this kind of stuff called uh pledges. Uh you

can see our our PLDI paper on on Friday to for a deeper explanation of it. But a

pledge is kind of like a coupon that tells you what will happen in the future and how you can get it. Um so the key reasoning principle of a pledge it's looks kind of like an implication. Um

it's a pledge says that if I own P and I also have a pledge that whenever P holds you you get Q. So pledge PQ, you should read it as kind of a coupon that tells you uh that you can cash it in. Um so

then when P holds um you get to keep P and also obtain the postcond condition Q. And uh the way we model this is a

Q. And uh the way we model this is a signalless launches is that u if I want to launch a kernel that has some precondition pre I need to give the precondition so the thread can execute and then what I get at the end is uh the

most important bit is this pledge that tells me when the current current epoch is done um which I'll tell you about what that is in a minute then you'll get the post but you don't get it immediately and the way you get that the

epoch is done is by calling sync device and sync device basically advances this epoch counter and tells you like okay the current epoch is now done you've waited for So then you get to call redeem pledge repeatedly over every

every uh block every um kernel you spawn to get all the post conditions. Um yeah

so that's about what I wanted to tell you. Um so um Kyper is this framework

you. Um so um Kyper is this framework for writing generic verified kernels at at zero cost so far like we don't want to compromise on performance. So

everything we're doing we're generating very first order CUDA code and checking that it's actually the performance we expect. Um we code a decent chunk of

expect. Um we code a decent chunk of cooldown. So um different grids, shared

cooldown. So um different grids, shared memory, barriers, atomics, um floating points with many precisions, a synchrony. Um we can also prove some

synchrony. Um we can also prove some stuff about floatingoint computations, but I'm I didn't have time to go into that, but ask me if you're curious. Um

we verified several building blocks already, many flavors of matrix multiplications. Um we have a parallel

multiplications. Um we have a parallel reduction using um using barriers, another uh parallel reduction using atomics across blocks. Um and a singular invocations and etc.

Um by the way this table right here is um uh for many different implementations of matrix multiplication. Um the ones that are not in bold font are just um

handwritten that we um you know just normal CUDA code um getting closer to Kubless like performance. The best one we've done in in Kyper is this 1D block tiling. So and we actually beat the

tiling. So and we actually beat the performance of the original um CUDA code that we were mimicking. For the others, we're pretty much exactly the same because the CUDA code is pretty exactly what you would expect. Um, and we plan

to keep going forward. Um, the the things we but we now need to make um, you know, make it bigger than this. Um,

to get closer to 100% is like um, world level operations and and uh, tensor cores and etc. Um, yeah. So, I'm going to do a shameless self luck. So, after

this uh, session, I'm going to be at the Microsoft booth. I'm going to give a

Microsoft booth. I'm going to give a demo of this stuff. So, if anyone is interested, you should feel free to come and and chat. And yeah, I'm happy to take any questions. So, uh, thank you.

Thank you. Questions.

Cool. Uh, thank thanks for the talk. Uh

so one question I was going to have is have you guys looked at uh so like Cutless and Cute have built a bunch of abstractions that actually have good semantics I think for doing this kind of

thing where for example like cutless exposes an atom which is like a MMA atom and everything is defined in terms of these layouts that are algebraic and so you could actually define them I think

where the precondition postition would sort of have like a an if statement and it effectively like what level of hierarchy you're doing but they they're modeling copying from shared memory to global and back to registers, but I think it actually would be a good

abstraction to think about because they've sort of done some of the modeling and then the layouts that they have are more general and that you can use them to cover not only the modes of that you have here like column row major

etc but like complicated tilings and shuffles and I don't know if you guys have looked at that but I think there's something my verification head and my kernel uh writing head or see like a connection here uh that I think would be

useful right we haven't but that's a great tip thank you yeah the layout general more complicated. I'm happy to talk about it afterwards, but I think it's it's it generalizes everything to like a layout is a pair of a shape and a

stride that generalizes like bjective or there's a bunch of conditions on them, but they have like co- operations, etc. So, so for now, our layouts are um kind of basic in the way that they're just a

bjection from J into a position in array. So, we don't particularly model

array. So, we don't particularly model strides so far. And this means we don't allow you right now to take like a submatrix because then you need a stride. Uh but we should definitely

stride. Uh but we should definitely extend it to do this kind of thing. So,

yeah. Right. The cool part is the copy operation then becomes generic. So I

think you could write a generic pre-post condition and then you could write an an implementation ideally that doesn't have to mention very much because it's sort of discharged by the condition of the the atom in quotes. So anyway something

we talk about but yeah nice thanks.

Time for one more question. Anyone in

the back please thanks for the talk. Uh could you speak a bit more about the GPU resource? Does

it have a definition and how do you use it in the verification? Because a lot of what you present can also be applied to parallel CPU programs, right? Yes. So

this the GPU resource uh you're right, it doesn't have a definition. It's more

exumatic. Um so we use it to control what you can spawn. So in when you're writing CUDA C, you cannot just put like triple brackets on any function. You

know, it has to be a global uh it must have been marked with a global attribute to say that this is a kernel and you can spawn it. Um, so the GPU resource is

spawn it. Um, so the GPU resource is kind of some it's kind of what we use here to to model this restriction, but it really is an opaque resource that which is passed back and forth. So what

should we probably really do is have a mode system in pulse that allows you to say this function is GPU and it's a different kind of function from others.

But this resource thing kind of gets us most of the way there, but it doesn't have a definition. Yes, good, good catch. Thanks. I have one question. Uh I

catch. Thanks. I have one question. Uh I

was not sure whether that while loop if changed slightly will give a problem. I

haven't analyzed it fully. You had a while loop this one. Uh yeah suppose we make uh the k uh thread idx or something like that. Then that sort of means that

like that. Then that sort of means that each thread will do a different number of iterations because the shared comes in from outside. So would that that will be a data dependent loop. So I was I

wasn't sure how. So K K K K K K K K K K K K K K K K K K K K K is initialized as I know I know but I'm looking at a thinking of a different program where that K might be T idx or something thread id. So you're comparing my thread

thread id. So you're comparing my thread ID with uh something that comes from above. So each thread will iterate a

above. So each thread will iterate a different number of times. Am I right?

Ina in that case the amount of uh readwrite accesses each thread performs is a data dependent. Yeah. So I should work out uh I don't know. Yeah. So there

could be in general data dependency is handled properly by permissionbased separation logic. Uh but I haven't

separation logic. Uh but I haven't thought through it. Yeah. Have you run into any cases where that might uh cause an issue or not? So not yet but I think

most of the things so for instance we're not for every read you get uh yeah you don't take any permissions for every yeah you take a fractional permission and things like that right but it's not

spent so we don't do any kind of reasoning of like oh every thread runs in the same constant time or they use the same memory access we don't do any of this right now should be good but we could possibly do it I think let's thank

this speaker once again for a great presentation thank [Applause]

Loading...

Loading video analysis...