
George Hotz | Programming | coding in UOps on AMD MI350X tickets.comma-con.com/events/comma/1859964

By george hotz archive

Summary

Key takeaways

  • **AMD MI350X UOps Programming**: The discussion delves into programming the AMD MI350X using UOps, highlighting the complexities and specific instructions involved in optimizing performance for this hardware. [01:28:15]
  • **Async Operations on the B200**: Programming the B200 is challenging due to its reliance on asynchronous operations, requiring persistent kernels, specialized instructions like tcgen05, and TMA for efficient data handling. [01:35:21], [01:37:31]
  • **Performance Bottlenecks**: A key bottleneck identified is being memory-bound: the system waits for memory copies rather than pipelining them, limiting overall speed. [03:58:04]
  • **AI and Software Development**: There's a concern that engineers are becoming overly reliant on AI coding assistants, to the point where they struggle to work without them, as evidenced by recent outages. [07:10:42]
  • **Tinygrad Adoption Challenges**: A user shared difficulty porting a PyTorch model to tinygrad, seeing significantly slower performance, suggesting potential optimization challenges for new users. [07:27:43]
  • **GPU Memory as a NUMA Node**: GPU memory can function as another NUMA node accessible by the system, and conversely the GPU can access system memory, indicating a more integrated memory architecture. [04:28:43], [04:29:02]

Topics Covered

  • AMD Employees React to Performance
  • B200 Server Compared to a Paperweight
  • Studying Tinygrad commits and kernel-level ML
  • Fighting Robots: The Ultimate Morning Motivation
  • Engineers unable to code without AI assistance

Full Transcript

dexterr145: hey

swimming_dragon: oh shit

cacaosteve: hi

ravroid: Good evening sir

tboah: helloooooo

0xTomOstSec: long time no seen

userismale: Yoo

ouroboredos: No sub modep lox

kristofferhelle: welcome back

m3ntors: Hey geooo hello man

endritpb: hi

leikowo: ey it’s the tiny guy

goudacheeseburgers: o7

dahorak: Oh shit

cuddlingusrl: non arc raiders stream

dantedante33: ai hotz or real one?

vox_bony: MAGA baby GoatEmotey GoatEmotey GoatEmotey elbunk4BEAST2 elbunk4BEAST2

tboah: it’s been a minute

Maoriratto: wellcom back

lemmywemmy: sup og

gallium_ow: is mic a bit low for anyone else?

dexterr145: @gallium_ow yeah

einqs: Quiet

swimming_dragon: how is HK

evilfjalar: wazzza

sambusa_123: yeah, very quiet

zee133: thank you, I was able to retire because of my long amd position

jmickelly9: Smart move

ChandyNoEgo: happy halloween, george!!!!

cjcostanzo: not watching the world series?

0xTomOstSec: nice 2 see you Georgeyyy

ravroid: Sports ball?

voynich_1209: Hello, any reason why u moved to HK?

Tony_Roubino: Hong kongers were moving to london few years ago lmaooo

nuttyodin1: nuttyodin1 subscribed at Tier 1. They've subscribed for 31 months!

nuttyodin1: Nice to se you after a long time

NatelBeats: NatelBeats subscribed at Tier 1. They've subscribed for 65 months, currently on a 42 month streak! Internet really used to be so much better

sambusa_123: sambusa_123 subscribed at Tier 1.

sambusa_123: i now have a voice!

nuttyodin1: ExtropicAI says they are the ones to bankrupt Nvidia

leikowo: any async mma on these ?

leikowo: how much does the vendor blas get ?

leikowo: I meant something like tcgen05.mma

Fleetwood2048: f32?

Fleetwood2048: f32 out ah right

Fleetwood2048: everyone

sambusa_123: not me!

leikowo: 1 thread per float acc!

leikowo: lmao

leikowo: cast acc!

leikowo: is ref same type ?

leikowo: what's bufferize ?

Fleetwood2048: looking more like TVM every day SeemsGood

leikowo: what's .after() ?

Fleetwood2048: can't stump mr chen

johnathanschmidt: johnathanschmidt subscribed with Prime. They've subscribed for 8 months!

leikowo: yeah but that's because they got cute

leikowo: yes

leikowo: cute::print_latex

leikowo: cute algebra is pretty cool ngl

leikowo: from ISA would be easier to follow

leikowo: from the pdf you reference

nickhitman21: nickhitman21 subscribed at Tier 1.

nickhitman21: hi, how are you?

leikowo: I asked at the beginning about the ref dtype x)

nickhitman21: good day, sir. can you explain to me as for a 5 year old kid what is happening? NotLikeThis

nickhitman21: claude code vibecoding

leikowo: do the inputs partition and acc partition match the formula under 7.1.4.1 and 7.1.4.2 ?

leikowo: your C tensor is empty, wouldn't we get numbers even without running the custom kernel ?

leikowo: (we also got global variables named M,N,K)

leikowo: B is 1

leikowo: pain(a, b, c) is hilariously accurate

leikowo: I don't know how you got that right tbh

leikowo: there was two values something iirc

RickAndMoreTea: geez rick am i too late to halloween party here

RickAndMoreTea: lets goo

leikowo: insane that there are no mentions of async in CDNA4 ISA

leikowo: really makes you want to divide A layout by (16,32)

gclawes: Man I wish I knew how to program GPUs even a little bit, this looks cool

leikowo: yeah it's pretty sick

leikowo: that's the shortest kernel calling a TC by hand I've seen

stevethesniperr: i have no idea what the hell is happening here im just hoping to learn via osmosis

leikowo: @stevethesniperr You have two matrices A and B, you multiply them: this operation is called a GEMM. GEMMs can be efficiently computed on GPUs, nowadays GPUs include custom cores for accelerating GEMMs. mr hotz is writing a clean gpu function doing this
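[Editor's note: the GEMM described in the message above can be sketched in a few lines of NumPy — a minimal illustration of the operation, not the tinygrad kernel being written on stream. The f32 accumulator mirrors the "f32 out" discussed in chat.]

```python
import numpy as np

# A is MxK, B is KxN; the GEMM computes C = A @ B (an MxN matrix).
# GPU tensor cores accelerate exactly this pattern on small tiles.
M, N, K = 64, 64, 32
A = np.random.randn(M, K).astype(np.float16)
B = np.random.randn(K, N).astype(np.float16)

# Naive triple loop: what the hardware instruction computes per tile.
C = np.zeros((M, N), dtype=np.float32)  # f32 accumulator, as on stream
for m in range(M):
    for n in range(N):
        for k in range(K):
            C[m, n] += np.float32(A[m, k]) * np.float32(B[k, n])

# Matches a library GEMM up to float rounding.
assert np.allclose(C, A.astype(np.float32) @ B.astype(np.float32), atol=1e-2)
```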

somewatson: somewatson subscribed at Tier 1.

somewatson: puddot1Hi puddot1Hi

somewatson: do you have physical access to the MI350X or is this a cloud-based instance?

leikowo: @somewatson pretty sure it's one of the servers they got from AMD

leikowo: (physically)

somewatson: oh i see, nice

somewatson: happy halloween btw

somewatson: by the way, I’m subbing because I really want better driver support for AMD GPUs, so hope this sub helps

leikowo: UOp.group victorizes stores ?

leikowo: It would take me a lot more lmao

leikowo: so now we do larger blocks ? like 2x2 TC blocks ?

somewatson: is there a huge difference between the mi300A and mi350x driver wise? I’m hoping any changes you make also help me out downstream since I’m running the mi300a

leikowo: is this calling the builtin ??

leikowo: B200 is a pain because it's all async

leikowo: yeye it's like an intrinsic

leikowo: B200 you have to do a persistent kernel, use tcgen05 (so new tmem which is annoying) and TMA for async loads & maybe async stores if needed too

fart_factory: OMG GEORGE IS BACK

leikowo: ah and I forgot thread block clusters and TMA multicast

somewatson: very nice, glad to hear about the contract. impressed that you were able to get them to focus on it

leikowo: fuck it, use MxK and NxK inputs

jmickelly9: What a trade you did back then with AMD stock. Also great work with the contract with AMD.

leikowo: 2200Mhz

leikowo: pretty cool that if you do the flops/clk on the different TC instrunctions the 32x32x16 is the same as 16x16x32

somewatson: would you say it worth trying to set up on my mi300’s and begin to write up docs or are things still in motion?

leikowo: maybe easier to hit flops with cause you just need to dispatch less TC instructions and can use the other threads for data movement ?

leikowo: K is 32

leikowo: banger reshape

leikowo: er warp

leikowo: per block sorry

leikowo: AddrSpace.LOCAL is smem ?

leikowo: can we do vector load from gem ?

leikowo: dtype.vector from gmem*

leikowo: can we slice ?

leikowo: A[gx, :, gy, :]

leikowo: x)

leikowo: CuTe DSL uses slicing for this

leikowo: did you ever look into CuTe ? like divides, local_tile, partitions

leikowo: I am sold

leikowo: watch the MMA_traits<MMA_OP> file, they describe EVERY TC layouts using cute layouts

leikowo: these really look like tensors tbh x)

leikowo: thr_copy.slice(A[gx, :, gy, :]).to("reg")

leikowo: should be like (num_mma_m, num_mma_n, 4)

leikowo: what's the register max per thread on amd ?

nuttyodin1: great day to hangout with you code

leikowo: all loops and K

leikowo: can you statically check that it's equivalent to matmul by unrolling all UOps and comparing with normal matmul ?
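[Editor's note: a static UOp-unrolling proof aside, the usual numeric version of the check asked about above is simple. A sketch follows: `custom_matmul` is a hypothetical stand-in for whatever hand-written kernel is under test, not a tinygrad API.]

```python
import numpy as np

def custom_matmul(A, B):
    # Stand-in for the hand-written kernel: a simple tiled matmul.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    BM = BN = 16  # hypothetical tile sizes
    C = np.zeros((M, N), dtype=np.float32)
    for m0 in range(0, M, BM):
        for n0 in range(0, N, BN):
            C[m0:m0+BM, n0:n0+BN] = (A[m0:m0+BM].astype(np.float32)
                                     @ B[:, n0:n0+BN].astype(np.float32))
    return C

def check_against_reference(trials=3, M=64, N=64, K=32, atol=1e-2):
    # Compare the custom kernel with a reference matmul on random inputs.
    for _ in range(trials):
        A = np.random.randn(M, K).astype(np.float16)
        B = np.random.randn(K, N).astype(np.float16)
        ref = A.astype(np.float32) @ B.astype(np.float32)
        if not np.allclose(custom_matmul(A, B), ref, atol=atol):
            return False
    return True

print(check_against_reference())
```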

leikowo: this is a lot lot of registers but I'm not familiar with AMD. I know nvidia wouldn't let you

leikowo: registers spilling ?

leikowo: larger BLOCK_K maybe

leikowo: unroll arg for UOp.range ?

leikowo: no more instruction cache x)

leikowo: no vector stores ? :c

leikowo: rdna renderer ?

leikowo: (BLOCK_M//TC_M, BLOCK_N//TC_N, 4) ?

leikowo: 64,64,128

leikowo: simple loads not vectorized no ?

leikowo: reading through the rdna4 doc, apparently you can load blocks of memory from gmem to rmem

leikowo: also DS_READ_B128

dp_smh: dp_smh subscribed at Tier 1.

leikowo: ohhh tensor cores got dedicated registers on amd

ibarres: Kappa

leikowo: would be nice to copy from smem to them

leikowo: what's these 16bit global loads

leikowo: can we cast to half8 before loading ?

leikowo: lots of flops for a single warp

leikowo: this asm looks very good

leikowo: we could try multiple warps per blocks

leikowo: I think the smem -> rmem load layout doesn't matter much

leikowo: are we even getting the BW from global men ?

leikowo: mem*

Cos_ighn: howdy

leikowo: wait wat

leikowo: who's lidx0

leikowo: has to be first dim I think

Cos_ighn: Cos_ighn subscribed with Prime. They've subscribed for 5 months!

leikowo: are you not making 64 warps of 4 threads ??

leikowo: no there shouldn't be any acc x)

leikowo: bank conflicts aren't **this** bad I think

ibarres: thats water bro

Santiago_LHC: Very nice tea setup

leikowo: please show the generated code again

leikowo: we don't see the screen lmao

Santiago_LHC: your cam stayed too big

Santiago_LHC: power tea

leikowo: divide BLOCK_M//4//TC_M

leikowo: nah it was for the acc, you did it already now

leikowo: AMD employees seeing you get 16TF out of a 2PF card

leikowo: (I got 22TF on a B200 like two weeks into my internship)

leikowo: cause K is good

leikowo: doesn't grow accumulator

leikowo: wait As stores

leikowo: shouldn't they also be loaded using multiple warps ?!

leikowo: probably not the issue though :(

leikowo: BLOCKM//4*BLOCK_N*BLOCK_K

leikowo: there is no way this is faster

leikowo: It should spill

leikowo: BLOCK_K should be involved

leikowo: yes, I just think we are memory bound and waiting for memory copies instead of pipelining them
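[Editor's note: the memory-bound suspicion above can be sanity-checked with a back-of-the-envelope roofline calculation. The peak numbers below are made up for illustration; real peak flops and bandwidth should come from the datasheet.]

```python
# Arithmetic intensity of an fp16 GEMM: flops per byte moved (ideal reuse).
def gemm_intensity(M, N, K, bytes_per_el=2):
    flops = 2 * M * N * K                           # one multiply-add per (m, n, k)
    bytes_moved = (M*K + K*N + M*N) * bytes_per_el  # read A and B, write C once
    return flops / bytes_moved

# Hypothetical machine balance: flops needed per byte to stay compute-bound.
peak_flops = 2000e12   # made-up 2 PF fp16 peak
peak_bw = 8e12         # made-up 8 TB/s HBM bandwidth
machine_balance = peak_flops / peak_bw

for n in (64, 256, 1024, 4096):
    ai = gemm_intensity(n, n, n)
    bound = "compute-bound" if ai >= machine_balance else "memory-bound"
    print(f"N={n}: intensity {ai:.0f} flop/byte -> {bound}")
```

Small tiles land far below the machine balance, which is exactly when the kernel sits waiting on copies unless the loads are pipelined against compute.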

leikowo: gmem

leikowo: can we diff fake and non fake kernel ?

leikowo: sorry, warp group 1 and warp group 4 not fake

leikowo: nonono

leikowo: I meant wg1 and wg4

leikowo: it's 6am sorry!

leikowo: just generated code

leikowo: like some loops should go away

leikowo: probably because we just duplicated the loaded data from gmem

ronmasas: How about brute forcing the best block_* combo?

codingfisch: The lonely knight against entropy still at work 🙌 You are not alone!

leikowo: was very cool to interact, it's 6:13am and I'm struggling to keep my eyes opened. good night! (I'll still listen but likely fall asleep)

leikowo: for the gmem loads: for A split M by WG and threads access 8 successive halfs each for B split N by WG and same 8 halfs for each thread
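[Editor's note: the split suggested above can be sketched as an offset calculation. Sizes here are hypothetical; 8 successive halfs per thread is one 16-byte (128-bit) vectorized load.]

```python
VEC = 8             # halfs per thread per load (8 x fp16 = 16 bytes)
THREADS_PER_WG = 64 # hypothetical wavefront size
NUM_WG = 4

def a_load_offsets(M, K):
    # For A (M x K, row-major): split rows of M across warp groups (WGs),
    # then give each thread in a WG a disjoint run of 8 consecutive halfs.
    rows_per_wg = M // NUM_WG
    offsets = []
    for wg in range(NUM_WG):
        for tid in range(THREADS_PER_WG):
            base = wg * rows_per_wg * K + tid * VEC  # element offset into A
            offsets.append(base)
    return offsets

offs = a_load_offsets(M=64, K=32)
# Adjacent threads touch adjacent 16-byte runs, so the loads coalesce,
# and no two threads overlap.
spans = sorted((o, o + VEC) for o in offs)
assert all(a_end <= b_start for (_, a_end), (b_start, _) in zip(spans, spans[1:]))
```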

jmickelly9: What’s the bandwidth speed on the USB4 - Nvidia GPU

dp_smh: NVDA reaching out before Apple responds 😒

leikowo: you want a B200 server, a single B200 is a paperweight xd

somewatson: whatttt why no meetings?

somewatson: LUL

somewatson: are you passing out Halloween candy today? it’s super dark out here so I’m guessing you’re not in California right now

codingfisch: They will. You are so cheap!

sunnergunner: nope :)

pupscub: yay I am free

leikowo: GB200 superchip would be nice, there is no software for it (the C2C)

somewatson: HK is fun. I highly recommend Saizeriya. it’s a Japanese chain restaurant that does Italian food

somewatson: ngl but the best Italian food in the world is in Japan

somewatson: oh the spark sounds pretty fun to play with

iashyam: Good to catch you live here george. I watch your stream during coding at job. Thank you so much for inspiring me.

somewatson: yes I always go every time I visit

codingfisch: Can you reiterate? I just joined but used quite a lot PTX in the last weeks

somewatson: LUL it’s better than sbarro for sure

leikowo: fix: get an nvidia gpu

somewatson: although we all know the best NYC pizza is sbarros

leikowo: did you see how the gb200 works ?

leikowo: gpu memory is another numa node and can be used by system

leikowo: (and gpu can access system memory)

codingfisch: How much more value does your programming provide compared to an average tinycorp employee?

leikowo: ds read ?

dp_smh: why have the load_loop blocks below the compute?

lostendasauce: use ai

tbullshit: when you say we're going to assembly do you mean tinygrad generating the assembly

ardorendk: George, how are you handling quantization and scaling into MXFP6 from FP16? per-tensor, per-channel, or dynamic?

Bfgrocks: is that party ticket on the wall xD

480i: @somewatson i learned that from the office

noseynice: What is the best dish you got in hong kong so far?

GIGA10101: anyone know the expected timeframe for the 1.0 release?

somewatson: or how about using duckduckgo's AI chat?

somewatson: @480i LUL

datdoc: ill ssh to your server and finish the code

therollerblader: what is bro coding

tbullshit: ai girlfriends, just really early stage

therollerblader: wow

therollerblader: not even using ai coding for ai project wow

dp_smh: it seemed the same to me

LilTankster33: I’ve been studying your Tinygrad commits. I’m new to kernel level ML work, but I’m trying to understand how memory tiling affects perf. Respect for showing the process raw like this.

standardpointer: o/

standardpointer: woohoo, 235

ardorendk: nice

maland3r: !f

utkarshgill: what is coalescing

ardorendk: faster by +0

lostendasauce: id try grok too

noseynice: Anything outside of what the median person talks about is very scarce in any LLM datset so less probable results?

GENUlNlTY: LuvPeekR

dawkinsisdoper: how are you doing all this with out a cursor chat? using actual brain?

Parasocializing: AI brainrot

tbullshit: twitter pilled

Aesthetic_Champ: screen frozen?

Parasocializing: @Aesthetic_Champ yeah

tbullshit: mr streamer your screen is fucked

whiterunknight: good morning

quagmire_8: is rangeify in complete working mode yet

UltraScientism: @noseynice What do you mean? They definitely ingest lots of academic sources, including obscure ones.

dawkinsisdoper: @tbullshit just transparent font

pika2u: yo <3

tbullshit: mr streamer it's frozen

UltraScientism: yes, it's frozen

utkarshgill: screen stuck

azertysdfghjk: @georgehotz we can't see what you type, check your obs settings please

tbullshit: im about to post a ascii of the austrian painter

tbullshit: nvm crisis averted

uranus4head: do u think terry davis was good programmer

want7000: want7000 subscribed with Prime.

want7000: Thoughts on humanity's progress with large language models writing software?

lostendasauce: computers have no insight

UltraScientism: @lostendasauce for a year or two

TigreDozer: AI is more coherent in the answers than 50% of human population...

jmickelly9: There was a twitter space where the sort of software that LLMs clearly struggle with understanding are browsers, compilers and device drivers.

lostendasauce: unless there is a breakthrough in neroscience where thoughts can be translated to electrical signals. still would have to figure out where the eureka moment comes from

Daniel_Ehrhardt: will there be a comma 4 soon?

Daniel_Ehrhardt: 🙈👌

utkarshgill: how to understand rangeify ? I was getting used to older abstractions and shapetracker, now VIZ=1 sometimes feels harder to grasp

UltraScientism: How fast are you currently on this chip compared to pytorch + amd's standard driver stack?

AK_ttv_: Sup

ardorendk: reducing complexity usually improves speed as well

rvx_4: hmm

Hoosier_Cruiser: what up G

dp_smh: a wise man told me this week if you want speed you need a profiler

AK_ttv_: Sorry if this is off topic: I want to create an AI thats connected to the entire Android system, the AI should be intelligent to basically do everything you need and make personalized user predictions by usage...

AK_ttv_: Just an idea.

ardorendk: george will enable sub only mode ^

AK_ttv_: Ok.

hlidskjalf77708: FeelsStrongMan

verxile: looking forward to comma con :D

AK_ttv_: My other idea is to basically make a robot that can replace or be like a doordash... lolol

AK_ttv_: Anyways geohot

jetSETcodecat: Looking forward to Comma Con as well and also looking forward to you developing those Unitree or Boston Dynamics robots. So I can fight it, of course :)

want7000: Will read that now, but as I thought; it creates awful non-functioning code at random and all encounters with ai are unwanted and waste time. have yet to find a counter-example.

titondev: Hi SUBtember

480i: i feel like google these days is rough

whiterunknight: did you finish reading gravitys rainbow

480i: @whiterunknight pynchon!

Bfgrocks: Have you seen Grokipedia

AK_ttv_: lolol

espacevecto: chat, is he using a mac or linux with apple theme ?

UltraScientism: well there are non-llm AIs that for example solve (mechanically prove ) new math Olympiad problems at a good level. That's more impressive to me than the improvements in search that llms brought.

lotr_may: No subchat? PogChamp Hiii

AK_ttv_: So geohot

AK_ttv_: Wsg

7vevex: @UltraScientism Steven Wolfram was talking about that on stream yesterday

jetSETcodecat: jetSETcodecat subscribed at Tier 1.

jetSETcodecat: Yay <3

whiterunknight: goodbye plebs

jetSETcodecat: Now I have free room to talk in depth about how I'm gonna curb stomp a clanker , that is, after I make it fold my clothes and be generally less-than-human enough to do my tasks rote and routine tasks - Slower !!!

jetSETcodecat: I'm sorry, I'll take it back. After 10 years time and I see one capable enough to FIGHT me

jetSETcodecat: Apologies if I'm adamant about fighting robots - It's what gets me up in the morning to be frank. Maybe I'll learn tinygrad & manage to team up with Unitree to UFC fighting robots

jetSETcodecat: Nah you're good, on a personal level

jetSETcodecat: Haha, sorry, I'll learn tinygrad then

whiterunknight: worlds first ai twitch viewer

whiterunknight: bro used an em dash

whiterunknight: will it even run on a fold 7

want7000: they don't care about normal people, enthusiastically marketing bullshit to other rich people makes stock go up, i think

jmickelly9: AWS and Anthropic had an outage recently and just saw many ‘engineers’ could not do any work for the whole day without using Claude. That is beyond horrifying to see that ‘engineers’ lost the ability to code without AI these days.

whiterunknight: God damm ssh

jetSETcodecat: Huzzah

jetSETcodecat: Oh babyt

whiterunknight: hell yeah

jetSETcodecat: My lord "It just works" wasn't an exaggeration or even false brother spittin' facts LETS GOOO

dp_smh: yeah own that poser

whiterunknight: thank you for the demo

MinistryOfWarHacker: @georgehotz What should I eat today for breakfast?

BRODZELi: thank you for sharing

warmrow: how much do u use ai code editors daily

AugusDogus: when do you expect to see more tinygrad adoption?

GrandeurMusic: Great Work!

FatherOfFamilyValues: @georgehotz Will you grace us again with more creative writing in your blog?

espacevecto: Congrats

born2zen: hhahaha

Tsoding: I saw that guy. He's the pathetic loser :)

pika2u: Ty George for all you do big heart big heart <3

AugusDogus: what matters?

born2zen: <3

AugusDogus: word

whiterunknight: <3

jetSETcodecat: <3

GrandeurMusic: Write in Assembly, and it becomes beautiful!

jmickelly9: The USB4 to NVIDIA GPU on a Macbook with tinygrad driving it is amazing. Can’t wait for it.

utkarshgill: I implemented sub 500 line beautiful_lander in torch. gets a 200+ score in lunar lander < 250 episodes, reliably. when I directly port it to tinygrad, it gets stuck at 10-12% and its super slow. if I work on it and submit a PR to tinygrad/examples/beautiful_lander.py, would it be useful?

jetSETcodecat: wdym today like today today you're joking thats fucking amazing

MIGGOXYZ: I aspire to be as cracked as you are one day

mokert2: Hi George, I have a conjecture that has been bouncing in my head for a while, do you think fundamentally model size is just bits? Like if you made the best possible 8GB param model in FP8 and say, an FP64 model with the same memory footprint, would they be equivalent? You may need a different net arch, activation etc, but are bits ultimately just bits regardless?

GrandeurMusic: Does it work for linux the same way (tinygrad) to NVDIA GPU

AugusDogus: What's the timeline look like for being faster than PyTorch?

GrandeurMusic: Great Stream! Thanks for the stream!

deepvision: I hate clean and fast code I get payed by the hour Kappa

parapuschel: hi geo hi chat

utkarshgill: I don't have a 4090

pika2u: I liked your recent blog posts!

Mach_34: Tea or coffee>

dp_smh: what did you think of WNY in your short time at RIT

Mach_34: Love chinese tea culture. I've been buying lots of puer

inbredcuckfromusa: is it this vod available on yt after yyour stream?

tolpa_zevak: Did George say anything about Durov’s and the Lieberman brothers’ projects on decentralized model training?

tohrxyz: tohrxyz subscribed at Tier 1. They've subscribed for 6 months, currently on a 6 month streak! Hey and good day!

jetSETcodecat: @inbredcuckfromusa geohotarchive on YT or type george hotz archive anywhere

jetSETcodecat: Thanks for the stream. See you next stream <3

MinistryOfWarHacker: from MONAD

ibarres: when simple in order VLIW isa gen

JVPost: are you saying, the truth will set you free?

kinglouisxvii: Based

tohrxyz: write that into them books

pika2u: Have you ever read Silver Surfer: Requiem? I have recently and really recommend it

MinistryOfWarHacker: truth is love

ivyharvey: Octopus card is the truth

ardorendk: did u just said slavery is good? Kappa

Y3absira: hello george

UltraScientism: what's so good about hong kong?

Y3absira: 2 2's to my word fam

hulkemist: thanks for stream

AugusDogus: cheers

tohrxyz: happy halloweeen guys

PFASpartout: Joy comes from making sense out of this mess

ardorendk: don't kill anyone

whiterunknight: :(

tohrxyz: dont be too scary on kids

GrandeurMusic: you are in Japan?
