NVIDIA didn't want me to do this
By Alex Ziskind
Summary
Topics Covered
- Clustering Accelerates with Faster Interconnects
- Cable Speed Mismatches Cripple Performance
- Latency Dominates in Multi-Node Clusters
- Dense Models Scale Best Across Nodes
- Terabyte VRAM Runs 800GB Frontier Models
Full Transcript
Well, I've got some good news and some bad news. I got a cluster of four of these working together. That's the good news. Why would you want to do that?
Well, each one of these DGX Sparks has 128 gigs of memory that can be used for large language models: training, inference, generating stuff. Stacking all these together gives me 512 GB of memory that I can use for these tasks. So, yeah, I can run some pretty large models. But not only that... oh, this is really hot. Also, this is using the ConnectX-7 interface from Nvidia, which is probably $1,000 to $1,500 on its own, adding quite a bit to the cost of these Sparks. But without those, we're not going to have the insanely fast connection. And they also allow us to do something else: tensor parallelism. What the heck is that?
Well, typically, if you're using a slow connection, like Ethernet with this kind of cable, you can still do clustering, but the more machines you add, the slower things get. With a faster connection that supports RDMA, remote direct memory access, it scales in such a way that things get faster the more machines you add. So, yeah, there's four here, and maybe I'll even do eight.
We'll see. Now, the bad news, this is really freaking hot. It's blowing air right on me. The exhaust is this way because if I have it go that way, it actually turns off my cameras.
This is a very quiet machine, but the air is super concentrated and it just blows out hot air in this direction.
That's not the only bad thing. See, I'm
kind of dumb at this.
And uh I bought a whole bunch of cables to be able to connect those QSFP connections, which I had no idea what the heck that meant before I started this little project a few months ago.
The same size connectors can be QSFP28 and QSFP56; the 56 ones are double the speed. There can also be double-density connectors, which is what I needed to connect four of these together, and I happen to have bought the wrong ones. So, for the last few months, I've been using QSFP28, but I hope to remedy that today. Now, Nvidia has some pretty good documentation that works pretty well, with recipes where you can take one of the Sparks and use it on its own, or you can get a little cable like this that just connects two Sparks together, one and two. When you do that, you double your memory to 256 GB, so you can run pretty large models already, like Qwen3 235B, and you can get that tensor parallelism going with these two as well. But no matter how much I asked them, they wouldn't show me how to connect more than two, and they wouldn't add that to the documentation. It wasn't until a community member came through that we can now get more than two connected together. By the way, I have a DGX Spark to give away to a lucky winner, courtesy of Nvidia; I'm not buying it. If you want to enter, there are going to be rules down below in the description. Also, make sure you leave a comment and subscribe to the channel, of course.
And this? We'll get to that. To get this set up, I had to buy this crazy MikroTik switch, which I saw on the ServeTheHome channel. The most expensive switch I ever bought: 1,300 bucks. But totally worth it if it lets me mess around with four Sparks. It's got two 400G connections on it, which means I can plug in a breakout cable like this into one port. These are very expensive, by the way; all these networking things start getting pretty pricey when you go with these kinds of connections. And literally, that was the first time I ever touched a QSFP cable. This is now going out to four Sparks so that they all have a mesh network. They can all talk to each other, just like this: Spark 1 is talking to all the other ones, Spark 2, Spark 3, and Spark 4. In other words, they can ping each other, and they can SSH to each other without passwords, which is something you need when you set up a cluster like this. But here's a problem.
What the heck is this? When I run ethtool from node to node, I get five gigabit per second. Wait, sorry: 50,000 megabits per second, which is 50 Gbit per second. And I'm capped by the cable: this is a QSFP28 cable on this side. An expensive little discovery. So, of course, I went to Amazon and got another breakout cable. This is really crinkly.
Let's see if just changing the cable is going to double our speed. By the way, to validate this whole thing, I'm not going to use a huge model; that comes later. I'm going to use a small model so I can easily download it to one machine and easily distribute it to the others. So, I went with my trusty Qwen3 4B. By the way, this is not the 4-bit quant; this is the full BF16 version. And the token generation is 51 tokens per second. But look at this: the PP 2048, that's the prompt processing, is 6,367. We're already seeing a huge boost in prompt processing just from having all these hooked up. It's also the fact that I'm using PP 2048, not the default, which is something I discuss in more detail in one of the member videos. Now, this is an el cheapo at $185.
Yeah, I'm not even kidding. That is cheap compared to some of the other ones that I've seen. But it's QSFP56 on each of the ends instead of 28. This is a monster.
Now, I thought plugging in an Ethernet cable was a satisfying click. This is an even more satisfying click.
All right, they're talking to each other. I have to look at the switch to make sure that things are working, because the Sparks don't have any kind of indication of whether they're on or whether the network link is working.
What? I'm still getting 50,000. Are you kidding me? This was supposed to double the speed. Well, even after a reboot, it's still showing the same thing. Now, I do have Claude involved in this whole business, and I just have to mention something on one machine and it goes and does everything on all the machines, which is way easier than doing things manually on each one. And Claude says that perhaps, even though the Amazon listing says QSFP56, which it most certainly does right here, 400G PAM4 to 4x100G, the breakout ends are actually QSFP28 connectors. Now, I know that sometimes LLMs hallucinate, and that could be happening right now, and sometimes Amazon sellers lie, and that could be happening right now. So, I don't know which one it is.
Unfortunately, the only thing left to do is buy another cable and try it. So, these days, I'm always flipping between models: GPT for research, Claude for coding, Nano Banana for image generation, Veo, Kling, and Runway for video. Six tabs, six bills, and counting. Enter ChatLLM Teams. One dashboard houses every top LLM and routes you to the right one: GPT Mini for ultra-fast answers, Claude Sonnet for coding, Gemini Pro for massive context. They recently added Gemini 3 and GPT-5.1 the moment they dropped. Create professional presentations with graphs, charts, and deep-research detailed content. Need human-sounding copy? Humanize rewrites text to defeat AI detectors. Need visuals? Pick frontier or open-source models: Nano Banana, Midjourney, Flux for images, Magnific upscaling, plus Veo, Wan, and Sora for video. All built in. You also get Abacus AI's deep agent to do pretty much anything: build full-stack apps, websites, and reports with just text prompts, and deploy them on the spot. They have Abacus AI Desktop, the brand-new coding editor and assistant that lets you vibe-code and build production-ready apps. And the kicker: it's just $10 a month, less than one premium model. Head over to chatllm.abacus.ai or click the link below to level up with ChatLLM Teams.

DGX Sparks are supposed to have 200 Gb NICs, and the cable that Nvidia actually includes when you buy one of these double packs of DGX Sparks is a 400G cable. I know that works. And when I go to FS, which is where I bought the previous set of cables, I can search for DGX Spark, and they actually have cables that are labeled for the Spark. Look, it says DGX Spark right there in the name. And, uh, yeah, a price to match. These are 400G broken out to two 200G QSFP56. So I ordered some more cables.
I'm not going to show you the price tag. I got four of these, basically the same kind of cable that FS has. Now, these are breakout cables from one to two, so I'm going to need two of them and use two ports on that switch. Protected, unprotected. Make sure you always wear protection on your Ethernet cables when you're doing dangerous but fun things like this.
Here we go. I'm going to try with the bottom two first, Sparks 1 and 2, to see if we're even getting anywhere. All right, let's see. Uh, I'm still getting 50. What the heck? Come on.
>> Many hours later.
>> This is a whole new thing that I wish I knew, but I don't. And I'm glad that I have Claude to help me out here, because this switch is a managed switch, and you can go into its interface and manage it. There are just a lot of settings in here that I don't know anything about. But it has this handy little terminal, and you can SSH into it, which means I can give Claude access to SSH into the switch and work its magic. And it found out that the port that I'm plugging into is hardcoded to 50G. I don't even know where to look for that. Ah, look at that. It should be configured to 100G. "I will plug the cable into port one on the switch. Make the adjustments." I love that. Is it going to work, though? That's a whole different story. Okay, it seems to have reconfigured it. Let's test this out.
100 Gb/s between one and two. So, let's plug in the other two Sparks. Yes, 100 Gb on each one of these ports. Now we can run our llama-benchy. So if you missed that, which I kind of did the first time also: there are two physical ports on each Spark. Each port is capable of 200 Gbit per second, and each port has two virtual interfaces, each capable of 100. So yeah, it's a little confusing, but that's how we can get a 200G cable in here. We just need to set each interface to 100 gigs, so that two interfaces make 200 gigs.
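As a quick sketch of the port arithmetic just described (the numbers are the ones stated here; the variable names are mine):

```python
# Port/interface arithmetic for one DGX Spark, as described above.
gbit_per_virtual_iface = 100   # each virtual interface is set to 100G
virtual_ifaces_per_port = 2    # each physical QSFP port exposes two of them
gbit_per_port = gbit_per_virtual_iface * virtual_ifaces_per_port

print(gbit_per_port)  # 200 -- matches the 200G cable plugged into each port
```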
And this is the solution on GitHub that makes everything possible. It's by this guy who hangs out on the Nvidia forums, called eugr. I don't know if I'm saying that right, but he's also helped me out a little bit to get this going. He created this repository, which he seems to be updating pretty often. Now, he's one of the first to figure out how to actually run vLLM on one of these without using a Docker container, so if you're interested in that, check out his repositories. This one is a Docker-based vLLM run, and it's specifically for clustering a bunch of them. He simplifies a lot of the workflow in getting this running. All right, here we go: llama-benchy, which is one of his other projects too, by the way. This is better than llama-bench because it gives more realistic results, as it talks to the server over the network.
Um, we're getting about the same results here. What? Let's run that again and take a look at nvtop on all the machines. Yeah, it's actually clustered, and it's running on all four of these. Well, we're getting about the same speed, folks, for token generation and for prompt processing. I thought this was going to be better. But I guess I didn't need to buy all those cables? Explain this, Claude. So, at 50G before, we had faster prefill than we do now; in percent terms, we lost 19% of performance. What the heck? Generation is faster, 7% faster, but to me that looks like a margin-of-error type of thing. And time to first token is slower. All right: "Generation improved 7%, which makes sense. Generation is communication-bound, with frequent small all-reduce ops that benefit from the lower latency of these links. The prefill regression is unexpected." Yeah, you're telling me. And just to make sure we're still getting some benefits here, I reran this with just one Spark and got 23 tokens per second for generation, and with two Sparks, 35. Here's the full picture at 100G: one node, two nodes, and four nodes. Two nodes looks pretty nice at almost 8,000 tokens per second prefill. But as you can see, for token generation we're getting big improvements the more nodes we add. Now, to make sure this is using RoCE, also known as RDMA over Converged Ethernet. That's when you have an acronym inside another acronym: RDMA, of course, being remote direct memory access, over Converged Ethernet. Pronounced "rocky." Basically, what that does is bypass the whole network stack and communicate directly between the memory spaces of these machines. And that's what we're getting here, apparently, because eugr, the creator of this clustering solution, suggested I use the NCCL debug flag to make sure that's indeed working. Another acronym, yeah, I know. NCCL is the Nvidia Collective Communications Library, pronounced "nickel." It enables efficient scaling across multiple GPUs via PCIe, NVLink, InfiniBand, and TCP/IP. Yeah, I know, fun stuff, right? I just want to plug it in and have it work. So, we've got both interfaces on each machine carrying 100G of traffic, and NCCL is getting the full 200G per node without needing IP addresses on the second interface.
This is as fast as it goes, I guess. And now is a good time to mention the third most important thing when it comes to running LLMs across a cluster. Just to recap: if you're not running on a cluster, there are two things that matter. One is GPU processing power: the calculations, the number crunching that happens on the GPU to give you fast prefill, or PP, as we've seen it in the benchmarks. That's what takes your prompt and processes it: prompt processing. The second is decode, the token generation, and that's where memory bandwidth comes in handy. And finally, the third important thing matters only when you're clustering, and that's latency. See, if you connect these with a standard Ethernet plug, your latency is going to be an order of magnitude higher than it would be connecting with RoCE. I don't want to connect with Rocky; my jaw wouldn't survive it. All right, Alex, enough of the silly jokes. But here is one of the tests you can run, and it's actually in that repository, to check latency. So on Spark 2, I'm going to kick this off, so it listens for connections. And on Spark 1, I'm going to do the InfiniBand write test, at least that's what I think it is, and target Spark 2. Boom. All right, not the best numbers, I've got to say. Not the best. Let's do that one more time. Okay, that's better: 3 microseconds. That's the latency going from one Spark, through the switch, into another Spark.
This is basically the same thing that allowed Apple to have better clustering capability on their machines, which I cover in other videos; I think they have four or five microseconds of latency. But you have to wonder whether that 3 microseconds is the result of introducing an extra step: the switch. So, let's take some of these connections out. I'm going to plug two Sparks directly into each other, the most direct way you can go, and run that same test. So that's Sparks 1 and 2. Boom. And yeah, when they're connected directly, I'm getting 1 microsecond less latency. That's a pretty big difference, and it would have the most effect on smaller models. Which brings us to the next topic: bigger models.
Now, I did reach out to eugr, and he suggested that dense, larger models would perform best across a cluster like this. So, I got this Qwen3 VL 32B Instruct, which is a model that's 66 GB on disk; this is the full BF16 version. Now, yeah, it's been recently replaced by Qwen 3.5, but that's fine. It's got vision; it's got everything in it. This is a dense model, which we still need to run once in a while. For models that don't fit on one machine at all, sharding across multiple GPUs is the only way to run them; that would go for something like Llama 405B. But even if they do fit, like in this case, clustering lets you increase context and concurrency and speed up token generation. And we should see really nice scaling, slightly better than the Mac Studios, even. So here I've got the VL 32B running on one node, the first node; you can see that in the nvtop chart right here. And we got 3.58 tokens per second. Not great. What happens when we do two nodes? Here are the first two nodes, Spark 1 and Spark 2. You can see the GPU usage on both of those, almost 100% there. Well, 96, 95. Wow, the scaling is actually pretty good here: 6.14 tokens per second token generation. Any guesses what's going to happen with four nodes? Here we go, four nodes. Look at that memory usage, though: 63 GB per machine, even though we're splitting across four nodes. We have to consider that the model is 66 GB, sure, but we also have all that cache, the context window, and that's why you see such high memory usage. And we got 11.36 tokens per second for generation here, and quite a bit more for prompt processing, too. I'd say the scaling is actually pretty damn good here. Okay, what about adding some more?
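To put a number on "pretty damn good," here's a quick sketch computing speedup and parallel efficiency from the generation rates measured above (the rates are from the video; the efficiency framing is my addition):

```python
# Scaling of Qwen3-VL-32B token generation across nodes (rates from the video).
tokens_per_sec = {1: 3.58, 2: 6.14, 4: 11.36}

for nodes, rate in tokens_per_sec.items():
    speedup = rate / tokens_per_sec[1]
    efficiency = speedup / nodes   # 1.0 would be perfect linear scaling
    print(f"{nodes} node(s): {rate:5.2f} tok/s  "
          f"speedup {speedup:.2f}x  efficiency {efficiency:.0%}")
```

That works out to roughly 86% efficiency at two nodes and 79% at four, which really is strong scaling for a dense model over a RoCE fabric.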
So, of course, these are the DGX Sparks from Nvidia. These two are Dell GB10s; they're basically the exact same thing, but a slightly different design, and some people prefer it. This is the MSI EdgeXpert; I call it the EX 10. And this one on top is the Asus Ascent GX10. That would be one heck of a cluster.
Fun. Uh-oh, I sense a problem. But I've foreseen this problem coming. Do you see it? The switch only has two 400G ports on it, and I'm using both of them already for four Sparks. So I would need to buy another one of these switches.
Or... while I was working on this, MikroTik released another switch. It says "new item, coming soon," but I already ordered it. Not sure what that was all about. By the way, they're very strict about this: you can't send this to China, Hong Kong, Russia, or Venezuela. But look at this switch. It's got four 400G ports. Let's see if I can get this working. This is also why I ordered more cables.
This is the most expensive video I've ever made. Now, when I hook these up, I can't promise that my cameras aren't going to die.
Hello. Okay, you're back. Everything is hooked up now, but I had to crack the window open. All right. Uh, last night I stayed up really late with this thing. No, not like that. Clusters take a long time to set up. First, I migrated the switch config from the CRS812 to the CRS804. As a side note, what's up with the two different colors? And this one has a honeycomb pattern; this one has rectangles. Who designs these things? And it's not just plug-and-play; getting these switches set up is involved. There's a network configuration part, where I had to assign new QSFP IPs to all these machines and standardize netplans with large-packet support, MTU 9000, across all of them.
>> Those are called jumbo frames, by the way. Finally, I had all these machines hooked up, getting 100 Gb link speed.
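As a side note, a common way to verify jumbo frames end to end (not shown in the video) is to ping with the largest payload that fits in one unfragmented packet; the header sizes below are the standard IPv4 and ICMP values:

```python
# With MTU 9000 (jumbo frames), the largest ICMP payload that fits in a
# single unfragmented IPv4 packet is the MTU minus the two headers.
mtu = 9000
ipv4_header = 20
icmp_header = 8
max_payload = mtu - ipv4_header - icmp_header

print(max_payload)  # 8972 -- e.g. `ping -M do -s 8972 <peer>` should not fragment
```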
And then I needed to set up SSH mesh.
Boom. Check it out. This is where each one of the machines can talk to all the other machines via SSH without passwords. This is something that's needed for cluster setup, and the more machines you have, the harder it gets. So, luckily, I had Claude help me out with this one. Check it out: we have like 64 connections. Well, if we don't want to include self-to-self, then it's 56, and they are all coming back successfully. Yes, all the Sparks are talking to each other over SSH; we're ready to do the actual test now. Oh yeah, I also had to copy all the models across using rsync, now that we have 100 gigs. So, yeah, I used that, and brought up the cluster using the Qwen3 4B model as a test.
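Those 64 and 56 figures are just pair-counting across eight nodes; a quick sketch:

```python
# A full SSH mesh over 8 nodes: every node tests a connection to every node.
nodes = 8
with_self = nodes * nodes            # 64 checks, including each node to itself
without_self = nodes * (nodes - 1)   # 56 distinct ordered pairs

print(with_self, without_self)  # 64 56
```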
Now, of course, you can do all these things by just executing the script; I'm just feeling a little bit on the lazy side. I've been up for a while. All right, the cluster is up and running.
Here we go.
Uh, that was really, really fast. You can see from the nvtop chart right there, barely a scratch. Let's do that again. That was really fast. A couple of things I noticed: prompt processing is way faster across this cluster, but token generation is not that much faster. 64 and 61 tokens per second for Qwen3 4B across the cluster of eight nodes is only about 10 tokens per second faster than four nodes. Prompt processing is much better, but token generation, not really. And this comes back to this being a tiny model. Yeah, it's BF16, but the whole thing is like 8 GB; spreading it across eight nodes is kind of useless.
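Putting the Qwen3 4B generation numbers from across the video side by side makes the flattening obvious; a quick sketch (rates as reported, rounded):

```python
# Qwen3 4B BF16 token generation vs. node count (rates from the video).
tokens_per_sec = {1: 23, 2: 35, 4: 51, 8: 61}

prev = None
for nodes, rate in tokens_per_sec.items():
    if prev is not None:
        gain = rate / tokens_per_sec[prev]
        print(f"{prev} -> {nodes} nodes: {gain:.2f}x from doubling")
    prev = nodes
```

Each doubling buys less: about 1.52x, then 1.46x, then only 1.20x. The model is simply too small for eight-way parallelism to pay off in decode.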
>> This is so silly.
>> What fun.
>> Yeah, teamwork.
>> Let's try a bigger model. All right, I've got Qwen3 VL 32B Instruct across the cluster. Here we go: llama-benchy. Boom. And watch the nodes right there in nvtop. We're reaching 109 GB used on each machine, out of the 120 or so available. Again, that's with KV cache included. That didn't run very long either. And, oh, we're not seeing huge benefits here. Doing it again, because, wow, that's a little disappointing. I mean, there is a speed-up; we're getting 16.5 tokens per second for token generation. And remember: on one Spark we hit 3, on two Sparks we had about 6, and on four Sparks we were close to 12. So, getting 16 here is not that huge of an advantage. I want to run that again just so we get some temperatures. And we're not seeing that much; I'm getting up to 50° at the hottest point, but that's pretty much it. I can tell you it feels a lot hotter back here. And also, folks, sorry about all the noise. Those switches are not the quietest things, even though they're way quieter than data center switches.
Now, I know what you're really thinking here: what the heck is the point of all this if I'm not going to run a big model? After all, this is a terabyte of VRAM. Well, such a model just happened to land, and it's the Qwen 3.5 397-billion-parameter model, with 17 billion active. And this thing is pretty hot, as in it's doing really well in tests. How do you run it? How big is it? Oh, it's only 800 GB on disk. You can't even run it on one Mac Studio with 512 gigs of memory, but you can run it on this cluster. So, this is the full thing, not some 4-bit quant. And it took about 7 minutes for this model to shard across all eight Sparks. It probably took another three or so to build the CUDA graphs.
And finally, let's do the benchy test, and go. There it goes; we can watch it over here: 112 gigabytes out of 119 per node. That's pretty nuts. But it did actually go on all the machines; it ran on all of them. And we're getting 24 tokens per second for such a huge model. It's mixture-of-experts, so it's not running all the parameters at the same time. But 24 for a state-of-the-art, brand-new model that just landed, is huge, and can't run on anything smaller than this? I'd say that's a win. You may let me know otherwise in the comments if you disagree or agree.
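The per-node memory usage lines up with simple sharding arithmetic; this is a back-of-the-envelope sketch, and the KV-cache/overhead split is my inference, not something measured in the video:

```python
# Even sharding of the ~800 GB Qwen 3.5 weights across 8 Sparks.
model_gb = 800
nodes = 8
weights_per_node = model_gb / nodes        # 100.0 GB of weights per node
observed_per_node = 112                    # GB actually in use, per nvtop
kv_cache_and_overhead = observed_per_node - weights_per_node

print(weights_per_node, kv_cache_and_overhead)  # 100.0 12.0
```

So roughly 12 GB per node is left for KV cache and runtime overhead, which is why context headroom gets tight.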
Kimi K2 is another one. It's pretty big; not as big as Qwen 3.5, but still a hefty 600 GB.
This one took about 15 minutes to load.
Let's run benchy. Boom. And there it goes.
What is this one? This one is taking up even more memory on each node: 115 GB. Ooh, not leaving much. Not leaving much. But vLLM is a beast; it takes all the available memory if it can, because it wants to fill up and utilize as much of the system as possible while leaving enough space for context. 13.35 tokens per second here. It's not the fastest thing in the world, but it runs it, and you won't be able to run that on four Sparks; you need eight. Hope you enjoyed the video. This was the most expensive video I ever made, so I do appreciate a like, a comment, and a subscription. It's free. All right, we're going to be doing more stuff like this. Thanks for watching. You probably want to see this video next. Also a good one. See you next time.