
NVIDIA didn't want me to do this

By Alex Ziskind

Summary

Topics Covered

  • Clustering Accelerates with Faster Interconnects
  • Cable Speed Mismatches Cripple Performance
  • Latency Dominates in Multi-Node Clusters
  • Dense Models Scale Best Across Nodes
  • Terabyte VRAM Runs 800GB Frontier Models

Full Transcript

Well, I've got some good news and some bad news. I got a cluster of four of these working together. That's the good news. Why would you want to do that?

Well, each one of these DGX Sparks has 128 gigs of memory that can be used for large language models: training and inference, generating stuff. Stacking all these together gives me 512 GB of memory that I can use for these tasks.

So, yeah, I can run some pretty large models. But not only that. This is really hot. Also, this is using the ConnectX-7 interface from Nvidia, which is probably $1,000 to $1,500 on its own, adding quite a bit to the cost of these Sparks. But without those, we're not going to have the insanely fast connection. And they also allow us to do something else: tensor parallelism. What the heck is that?

Well, typically, if you're using a slow connection, like Ethernet with this kind of cable, you can still do clustering, but the more machines you add, the slower things get. With a faster connection that supports RDMA, remote direct memory access, it scales in such a way that you go faster the more machines you add. So, yeah, there's four here, and maybe I'll even do eight.

We'll see. Now, the bad news, this is really freaking hot. It's blowing air right on me. The exhaust is this way because if I have it go that way, it actually turns off my cameras.

This is a very quiet machine, but the air is super concentrated and it just blows out hot air in this direction.

That's not the only bad thing. See, I'm kind of dumb at this. I bought a whole bunch of cables to be able to connect those QSFP connections, which I had no idea what the heck that meant before I started this little project a few months ago.

The same-size connectors can be QSFP28 and QSFP56. The 56 ones are double the speed. There can also be double-density connectors, which is what I needed to connect four of these together. And I happened to have bought the wrong ones. So, for the last few months, I've been using QSFP28, but I hope to remedy that today. Now,

Nvidia has some pretty good documentation that works pretty well, with recipes where you can take one of the Sparks and use it on its own, or you can get a little cable like this that just connects two Sparks together, one and two. When you do that, you double your memory to 256 GB, so you can run pretty large models already, like Qwen 235B, and you can get that tensor parallelism going with these two as well. But no matter how much I asked them, they wouldn't show me how to connect more than two, and they wouldn't add that to the documentation. It wasn't until a community member came through that we can now get more than two connected together. By the way, I have a DGX Spark to give away to a lucky winner, courtesy of Nvidia. I'm not buying it. If you want to enter, there are going to be rules down below in the description. Also, make sure you leave a comment and subscribe to the channel, of course. And

this, we'll get to that. To get this set up, I had to buy this crazy MikroTik switch, which I saw on the ServeTheHome channel. The most expensive switch I ever bought: 1,300 bucks. But totally worth it if it lets me mess around with four Sparks. And it's got two 400G connections on it, which means I can plug a breakout cable like this into one port. These are very expensive, by the way. All these networking things start getting pretty pricey when you go with these kinds of connections. And literally, that was the first time I ever touched a QSFP cable. This is now going out to four Sparks so that they all have a mesh network. They can all talk to each other, just like this: Spark 1 is talking to all the other ones, Spark 2, Spark 3, and Spark 4. In other words, they can ping each other. They can SSH to each other without using passwords, which is something you need when you set up a cluster like this. But here's a problem.
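The passwordless-SSH mesh boils down to two standard steps per node. A minimal sketch, assuming hostnames spark1 through spark4 are already resolvable (adjust to your own host aliases):

```shell
# Passwordless-SSH mesh sketch; hostnames spark1..spark4 are assumptions.
mkdir -p "$HOME/.ssh"
key="$HOME/.ssh/id_ed25519"
# Generate a key once per node (no passphrase, for scripted cluster use):
[ -f "$key" ] || ssh-keygen -t ed25519 -N "" -f "$key"
# Push the public key to every other node (password asked once per host):
for host in spark1 spark2 spark3 spark4; do
  ssh-copy-id -i "$key.pub" "$host" || echo "could not reach $host"
done
```

Run on every node, this gives the full any-to-any mesh a cluster launcher needs.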

What the heck is this? When I run ethtool from node to node, I get five gigabit per second. Wait, sorry: 50,000 megabits per second, which is 50 Gbit per second. And I'm capped by the cable. This is a QSFP28 cable on this side. An expensive little discovery. So, of course, I went to Amazon and got another breakout cable.
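The cap shows up in ethtool's output, which reports link speed in Mb/s. The interface name below is an assumption; list yours with `ip link`:

```shell
# Real check (needs the hardware):  ethtool enp1s0f0np0 | grep -i speed
# ethtool reports the negotiated speed in Mb/s. Over the QSFP28 cable
# the link reported 50000 Mb/s; converting that to Gbit/s:
mbps=50000
echo "$((mbps / 1000)) Gbit/s"
```

A QSFP56 link should report 100000 Mb/s per interface instead.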

This is really crinkly.

Let's see if just changing the cable is going to double our speed. By the way, to validate this whole thing, I'm not going to use a huge model; that comes later. I'm going to use a small model so I can easily download it to one machine and easily distribute it to the other machines. So, I went with my trusty Qwen3 4B. By the way, this is not the 4-bit quant. This is the full BF16 version.

And the token generation is 51 tokens per second. But look at this. The PP 2048, that's the prompt processing, is 6,367. We're already seeing a huge boost in prompt processing just from having all these hooked up. And it's also the fact that I'm using PP 2048, not the default, which is something I discuss in more detail in one of the member videos: why we use that one. Now, this is an el cheapo at $185.

Yeah, I'm not even kidding. That is

cheap compared to some of the other ones that I've seen. But

it's a QSFP 56 on each of the ends instead of 28.

This is a monster.

Now, I thought plugging in an Ethernet cable was a satisfying click. This is an even more satisfying click.

All right, they're talking to each other. I have to look at the switch to make sure that things are working, because the Sparks don't have any kind of indication of whether they're on or whether Ethernet is working.

What? I'm still getting 50,000. Are you kidding me? This was supposed to double the speed. Well, even after a reboot, it's still showing the same thing. Now, I do have Claude involved in this whole business, because I can just mention something from one machine and it goes and does everything on all the machines, which is way easier than having to do things manually on each one. And Claude says that perhaps, even though the Amazon listing says QSFP56, which it most certainly does, right here, 400G PAM4 to 4x100G, the breakout ends are actually QSFP28 connectors. Now, I know that sometimes LLMs hallucinate, and that could be happening right now, and sometimes Amazon sellers lie, and that could be happening right now. So, I don't know which one it is.

Unfortunately, the only thing that's left to do is buy another cable and try it. So, these days, I'm always flipping between models: GPT for research, Claude for coding, Nano Banana for image generation, Veo, Kling, and Runway for video. Six tabs, six bills, and counting. Enter ChatLLM Teams. One dashboard houses every top LLM, and RouteLLM picks the right one: GPT Mini for ultra-fast answers, Claude Sonnet for coding, Gemini Pro for massive context. They recently added Gemini 3 and GPT-5.1 the moment they dropped. Create professional presentations with graphs, charts, and detailed deep-research content. Need human-sounding copy? Humanize rewrites text to defeat AI detectors. Need visuals? Pick frontier or open-source models: Nano Banana, Midjourney, and Flux for images, Magnific upscaling, plus Veo, Wan, and Sora for video. All built in. You also get Abacus.AI's Deep Agent to pretty much do anything: build full-stack apps, websites, and reports with just text prompts and deploy them on the spot. They have Abacus.AI Desktop, which is the brand-new coding editor and assistant that lets you vibe code and build production-ready apps. And the kicker: it's just $10 a month, less than one premium model. Head over to chatllm.abacus.ai or click the link below to level up with ChatLLM Teams. DGX Sparks are supposed to have 200-gig NICs, and the cable that Nvidia actually includes when you buy one of these double packs of DGX Sparks is a 400G cable. I know that works. And when I go to FS, which is where I bought the previous set of cables, I can search for DGX Spark, and they actually have cables that are labeled for the Spark. Look, it says DGX Spark right there in the name. And, yeah, a price to match. These are 400G broken out to two 200G QSFP56. So I ordered some more cables.

I'm not going to show you the price tag.

I got four of these, basically the same kind of cable that FS has. Now, these are breakout cables, from one to two, so I'm going to need two of them and use two ports on that switch. Protected, unprotected. Make sure you always wear protection on your Ethernet cables when you're doing dangerous but fun things like this.

Here we go. I'm going to try with the bottom two first, Sparks 1 and 2, to see if we're even getting anywhere. All right, let's see.

Uh, I'm still getting 50.

What the heck?

Come on.

>> Many hours later.

This is a whole new thing that I wish I knew, but I don't. And I'm glad that I have Claude to help me out here, because this switch is a managed switch, and you can go into its interface and manage it. There are just a lot of settings in here that I don't know anything about. But it has this handy little terminal, and you can SSH into it, which means I can give Claude access to SSH into the switch and work its magic. And it found out that the port that I'm plugging into is hardcoded to 50G. I wouldn't even know where to look for that. Ah, look at that: it should be configured to 100G. "I will plug the cable into port one on the switch. Make the adjustments." I love that.
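For context, MikroTik switches run RouterOS, where per-port speed is a property under /interface ethernet. This is only a sketch of the kind of commands involved; the interface name and the exact speed value are assumptions that vary by switch model and RouterOS version:

```
# Inspect the QSFP ports and their current speed settings:
/interface ethernet print detail where name~"qsfp"
# Pin a breakout sub-port to 100G instead of the hardcoded 50G:
/interface ethernet set qsfp28-1-1 auto-negotiation=no speed=100Gbps
```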

Is it going to work though? That's a

whole different story. Okay, it seems to have reconfigured it. Let's test this out.

100 Gbit/s between one and two. So, let's plug in the other two Sparks. Yes: 100 Gbit/s on each one of these ports. Now we can run our Llama Benchy. So, if you missed that, which I kind of did the first time also, there are two physical ports on each Spark. Each port is capable of 200 Gbit per second, and each port has two virtual interfaces. Each interface is capable of 100. So, yeah, it's a little confusing, and that's how we can get a 200G cable in here. But we need to set each interface to 100 gigs, so that two interfaces make 200 gigs.

And this is the solution on GitHub that makes everything possible. It's by this guy who hangs out on the Nvidia forums, called Yuger. I don't know if I'm saying that right, but he's also helped me out a little bit to get this going. And he created this repository, which he seems to be updating pretty often. Now, he's one of the first to figure out how to actually run vLLM on one of these without using a Docker container. So if you're interested in that, check out his repositories. Now, this one is a Docker-based vLLM run, and it's specifically for clustering a bunch of them. He simplifies a lot of the workflow in getting this running.
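For a sense of what's under the hood: vLLM exposes tensor parallelism through a single flag. The invocation below is a hypothetical sketch, not the actual script from the repository, and the model name and head count are assumptions:

```shell
# Hypothetical vLLM launch with tensor parallelism (the repository wraps
# a Docker-based, multi-node version of something like this):
#   vllm serve Qwen/Qwen3-4B --tensor-parallel-size 4
# Tensor parallelism splits each weight matrix across GPUs, so the
# model's attention-head count must divide evenly by the TP degree:
heads=32   # illustrative head count, not taken from a model card
tp=4
echo "heads per GPU: $((heads / tp))"
```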

All right, here we go. Llama Benchy, which is one of his other projects too, by the way. This is better than llama-bench because it gives more realistic results, as it talks to the server over the network. Um, we're getting about the same results here. What? Let's run that again and take a look at nvtop on all the machines. Yeah, it's actually clustered, and it's running on all four of these. Well, we're getting about the same speed, folks, for token generation and for prompt processing. I thought this was going to be better, but I guess I didn't need to buy all those cables.

Explain this, Claude. So, compared to 50G before, we had faster prefill than we do now. Percent change: we lost 19% of performance. What the heck? Generation is faster, 7% faster, but to me that looks like a margin-of-error type of thing. And time to first token is slower. All right. "Generation improved 7%, which makes sense. Generation is communication-bound, with frequent small all-reduce ops that benefit from the lower latency of these link speeds. The prefill regression is unexpected." Yeah, you're telling me. And just to make sure we're still getting some benefits here, I reran this with just one Spark and got 23 tokens per second for generation there. And then with two Sparks, 35. Here's the full picture at 100G: one node, two nodes, and four nodes. Two nodes looks pretty nice at almost 8,000 tokens per second prefill. But as you can see, for token generation we're getting big improvements the more nodes we add. Now, to make sure that this is using RoCE, also known as RDMA over Converged Ethernet. That's when you have an acronym inside another acronym, RDMA of course being remote direct memory access, over Converged Ethernet. "Rocky." Basically, what that does is bypass the whole network stack and communicate directly between the memory spaces of these machines. And that's apparently what we're getting here, because Yuger, the creator of this clustering solution, suggested I use the NCCL debug flag to make sure that it's indeed working.
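That flag is NCCL_DEBUG; set to INFO, it makes NCCL print its transport selection at startup. The environment variables below are real NCCL knobs, but the interface name and the sample log line are illustrative assumptions:

```shell
export NCCL_DEBUG=INFO                  # verbose NCCL startup logging
export NCCL_SOCKET_IFNAME=enp1s0f0np0   # assumed interface name; adjust
# When RDMA is active, the startup log names the ibverbs transport.
# Illustrative sample of such a line, and a check for it:
sample="NCCL INFO NET/IB : Using [0]rocep1s0f0:1/RoCE"
echo "$sample" | grep -q "NET/IB" && echo "RDMA (RoCE) transport in use"
```

If the log only shows NET/Socket, traffic is falling back to plain TCP.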

Another acronym. Yeah, I know. NCCL is the Nvidia Collective Communications Library; NCCL is pronounced "nickel." It enables efficient scaling across multiple GPUs via PCIe, NVLink, InfiniBand, and TCP/IP. Yeah, I know. Fun stuff, right? I just want to plug it in and have it work. So, we've got both interfaces on each machine carrying 100G of traffic, and NCCL is getting the full 200G per node without needing IP addresses on the second interface. This is as fast as it goes, I guess. And now is a good time to mention the third most important thing when it comes to running LLMs across a cluster. Just to recap: if you're not running on a cluster, there are two things that are important. One is the GPU processing power. That's the

calculations, the number crunching that happens on the GPU to give you fast prefill, or PP, as we've seen it in the benchmarks. That's what takes your prompt and processes it: prompt processing. The second is decode. That's the token generation, and that's where memory bandwidth comes in handy. And finally, the third important thing matters only when you're clustering, and that's latency. See, if you connect these with a standard Ethernet plug, your latency is going to be an order of magnitude more than it would be if you're connecting with RoCE. I don't want to connect with Rocky. My jaw wouldn't survive it. All right, Alex.

Enough of those silly jokes. But here is one of the tests you can run, and this one is actually in that repository, to check latency. So, on Spark 2, I'm going to kick this off, so it's going to listen for connections. And on Spark 1, I'm going to run the InfiniBand write test, at least that's what I think it is, and I'm going to target Spark 2. Boom. All right. Not the best numbers, I've got to say. Not the best. Let's do that one more time. Okay, that's better: 3 microseconds. That's the latency going from one Spark, through the switch, into another Spark.
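The test being described looks like the ib_write_lat tool from the perftest package. A sketch; the device name and IP address are assumptions (list RDMA devices with `ibv_devices`):

```shell
# On Spark 2 (server side): wait for an RDMA connection
#   ib_write_lat -R -d rocep1s0f0
# On Spark 1 (client side): RDMA write-latency test against Spark 2
#   ib_write_lat -R -d rocep1s0f0 192.168.100.2
# (-R sets up the connection via rdma_cm, the usual path for RoCE)

# The numbers from the video: ~3 us through the switch, ~1 us less direct:
through_switch_ns=3000
direct_ns=2000
echo "switch overhead: $((through_switch_ns - direct_ns)) ns"
```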

This is basically the same thing that allowed Apple to have better clustering capability on their machines, which is what I cover in other videos. I think they have four or five microseconds of latency. But you have to wonder, though, whether that 3 microseconds is the result of introducing an extra step: the switch. So, let's take some of these connections out. I'm going to plug two Sparks directly into each other. This is the most direct way you can go. Run that same test, so that's Sparks 1 and 2. Boom. And yeah, when they're connected directly, I'm getting 1 microsecond less latency. That's a pretty big difference. And this would have the most effect on smaller models, which brings us to the next topic: bigger models. Now, I did reach out to Yuger, and he suggested that dense, larger models would perform the best across a cluster like this. So, I got this Qwen3 VL 32B Instruct, which is actually a model that's 66 GB on disk. This is the full BF16 version. Now, yeah, this has been recently replaced by Qwen 3.5, but that's fine. It's got vision. It's got everything in it. This is a dense model, which we still need to run once in a while. Sharding across multiple GPUs is the only way to run the biggest dense models at all; that would go for something like Llama 405B.

But even if they do fit, like in this case, clustering lets you increase context and concurrency and speed up token generation. And we should see really nice scaling, scaling slightly better than the Mac Studios, even. So here I've got the VL 32B running on one node, the first node. You can see that in the chart right here in nvtop. And we got 3.58 tokens per second. Not great. What happens when we do two nodes? Here are the first two nodes, Spark 1 and Spark 2. You can see the GPU usage on both of those. Almost 100% there. Well, 96, 95.

Wow, the scaling is actually pretty good here: 6.14 tokens per second token generation. Any guesses what's going to happen with four nodes? Here we go. Four nodes. Look at that memory usage, though: 63 GB per machine, even though we're splitting up across four nodes. We have to consider that the model is 66 GB, sure, but we also have all that cache, the context window, and that's why you see such high memory usage. And we got 11.36 tokens per second for generation here, and also quite a bit more for prompt processing. I'd say the scaling is actually pretty damn good here.
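The memory math roughly checks out. The split below is inferred from the numbers in the video, not measured separately:

```shell
model_gb=66           # BF16 weights on disk
nodes=4
used_per_node_gb=63   # what nvtop showed per machine
# Weights are sharded, so each node holds only a quarter of them;
# the rest of the 63 GB is KV cache plus runtime overhead:
weights_per_node=$(awk "BEGIN{printf \"%.1f\", $model_gb/$nodes}")
cache_per_node=$(awk "BEGIN{printf \"%.1f\", $used_per_node_gb - $model_gb/$nodes}")
echo "weights/node: ${weights_per_node} GB, cache+overhead/node: ${cache_per_node} GB"
```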

Okay, what about adding some more?

So, of course, these are the DGX Sparks from Nvidia. These two are Dell GB10s. They're basically the exact same thing, but a slightly different design; some people prefer it. This is the MSI EdgeXpert; I call it the EX 10. And this one on top is the Asus Ascent GX10. That would be one heck of a cluster.

Fun. Uh-oh, I sense a problem. But I've foreseen this problem coming up. Do you see the problem? The switch only has two 400G ports on it, and I'm using both of them already for four Sparks. So I would need to buy another one of these switches.

Or, while I was working on this, MikroTik released another switch. It says "new item, coming soon," but I already ordered it. Not sure what that was all about. By the way, they're very strict about this: you can't send this to China, Hong Kong, Russia, or Venezuela. But look at this switch. It's got four 400G ports. Let's see if I can get this working. This is also why I ordered more cables.

This is the most expensive video I've ever made. Now, when I hook these up, I can't promise that my cameras aren't going to die. Hello. Okay, you're back. Everything is hooked up now, but I had to crack the window open. All right. Last night I stayed up really late with this thing. No, not like that. Clusters take a long time to set up. First, I migrated the switch config from the CRS812 to the CRS804. As a side note, what's up with the two different colors? This one has a honeycomb pattern; this one has rectangles. Who designs these things?

And it's not just plug-and-play. Getting these switches set up is involved. There's a network configuration part, where I had to assign new QSFP IPs to all these machines and standardize netplans with large-packet support, MTU 9000, across all of these machines. Those are called jumbo frames, by the way. Finally, I had all these machines hooked up, getting 100G link speed.
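On Ubuntu-based systems like DGX OS, that netplan standardization looks roughly like the fragment below. The file name, interface name, and addresses are all assumptions; apply with `sudo netplan apply`:

```yaml
# /etc/netplan/90-cluster.yaml (sketch)
network:
  version: 2
  ethernets:
    enp1s0f0np0:          # assumed QSFP interface name
      mtu: 9000           # jumbo frames
      addresses: [192.168.100.1/24]
```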

And then I needed to set up the SSH mesh. Boom. Check it out. This is where each one of the machines can talk to all the other machines via SSH without passwords. This is something that's needed for cluster setup, and the more machines you have, the harder it gets. So, luckily, I had Claude help me out with this one. Check it out: we have like 64 connections. Well, if we don't want to include self-to-self, then it's 56. And they are all coming back successfully. Yes, all the Sparks are talking to each other over SSH. We're ready to do the actual test now. Oh yeah, I also had to copy all the models across using rsync, now that we have 100 gigs. So, yeah, I used that: bring up the cluster using the Qwen3 4B model as a test. Compacting conversation.

Now, of course, you can do all these things by just executing the script. I'm just feeling a little bit on the lazy side; I've been up for a while. All right, the cluster is up and running.

Here we go.

Uh, that was really, really fast. You can see from the nvtop chart right there, barely a scratch. Let's do that again. That was really fast. A couple of things I noticed. Prompt processing is way faster across this cluster, but token generation is not that much faster: 64 and 61 tokens per second for Qwen3 4B across the cluster of eight nodes is only about 10 tokens per second faster than four nodes. Prompt processing is much better, but token generation, not really. And this comes back to this being a tiny model. Yeah, it's BF16, but the whole thing is like 8 GB. Spreading it across eight nodes is kind of useless.

>> This is so silly.

>> What fun.

>> Yeah, teamwork.

>> Let's try a bigger model. All right, I've got Qwen3 VL 32B Instruct across the cluster. Here we go. Llama Benchy. Boom. And watch the nodes right there in nvtop. We're reaching 109 GB used on each machine, out of the 120 or so available. Again, that's with KV cache included. That didn't run very long either. And, oh, not seeing huge benefits here. Doing it again, because wow, that's a little disappointing. I mean, there is a speedup. We're getting 16.5 tokens per second for token generation. And remember, on one Spark we hit 3. On two Sparks we had about 6. On four Sparks we were close to 12. So, getting 16 here is not that huge of an advantage. I want to run that again just so that we get some temperatures here. And we're not seeing that much: I'm getting up to 50°C at the hottest point, but that's pretty much it. I can tell you it feels a lot hotter back here. And also, folks, sorry about all the noise. Those switches are not the quietest things, even though they're way quieter than data center switches.

Now, I know what you're really thinking here: what the heck is the point of all this if I'm not going to run a big model? After all, this is a terabyte of VRAM. Well, such a model just happened to land, and it's the Qwen 3.5 397-billion-parameter model, with 17 billion active. And this thing is pretty hot, as in it's doing really well in tests. How do you run it? How big is it? Oh, it's only 800 GB on disk. You can't even run it on one Mac Studio with 512 gigs of memory, but you can run it on this cluster. So, this is the full thing, not some 4-bit quant. And it took about 7 minutes for this model to shard across all eight Sparks. It probably took another three or so to build the CUDA graphs.

And finally, let's do the Benchy test, and go. There it goes. We can watch it over here: 112 gigabytes out of 119 per node. That's pretty nuts. But it did actually go on all the machines. It ran on all of them. And we're getting 24 tokens per second for such a huge model. It's a mixture of experts, so it's not running all the parameters at the same time. But 24 for a state-of-the-art, brand-new model that just landed, and a huge one that can't run on anything smaller than this? I'd say that's a win. You may let me know otherwise in the comments if you disagree or agree. Kimi K2 is another one. It's pretty big. It's not as big as Qwen 3.5, but it's still a hefty 600 GB.

This one took about 15 minutes to load. Let's run Benchy. Boom. And there it goes. What is this one? This one is taking up even more memory on each node: 115 GB. Ooh, not leaving much. Not leaving much. But vLLM is a beast. It takes all the available memory it can, because it wants to fill up and utilize as much of the system as possible and leave enough space for context. 13.35 tokens per second here. It's not the fastest thing in the world, but it runs it. You won't be able to run that on four Sparks; you need eight. Hope you enjoyed the video. This was the most expensive video I ever made, so I do appreciate a like and a comment and a subscription. It's free. All right. We're going to be doing more stuff like this. Thanks for watching. You probably want to see this video next. Also a good one. See you next time.
