Same 128GB but cheaper
By Alex Ziskind
Summary
Topics Covered
- Thor Matches Spark at Fraction of Cost
- Memory Bandwidth Equalizes Performance
- Prefill Exposes Compute Differences
- Thor Excels in Power Efficiency
- Thor Built for Deterministic Latency
Full Transcript
The Jetson Thor is Nvidia's new developer kit, and the DGX Spark is for everybody, but what's not often talked about is the Thor has 128 gigs of RAM,
just like the Spark. That's a lot of RAM, and it's more than $1,000 cheaper than the Spark. So, why don't we just buy the Thor instead of the Spark? Well,
you could. I'm going to run some tests right now to see what it's capable of and talk about a couple of differences between those two machines and also Apple Silicon. The reason I picked these three is because the M4 Pro, the Thor platform, and the Spark all have 273 GB per second memory bandwidth. And I've mentioned many times before that memory bandwidth is really important for machine learning, but that doesn't tell the whole story. By the way, Nvidia sent me this Thor to do some testing. And
they're also giving away the Jetson Orin Nano Super signed by Jensen Huang. Rules
for the raffle are in the description.
The other two machines I bought. So these days I'm always flipping between models. GPT for research, Claude for coding, Nano Banana for image generation, Veo, Kling, and Runway for video. Six tabs, six bills, and counting. Enter ChatLLM Teams. One dashboard houses every top LLM and routes each prompt to the right one. GPT Mini for ultra fast answers, Claude Sonnet for coding, Gemini Pro for massive context. They recently added Gemini 3 and GPT 5.1 the moment they dropped. Create professional presentations with graphs, charts, and deep research detailed content. Need human sounding copy? Humanize rewrites text to defeat AI detectors. Need visuals? Pick frontier or open-source models. Nano Banana, Midjourney, Flux for images, Magnific upscaling, plus Veo, Wan, and Sora for video, all built in. You also get Abacus AI's Deep Agent to pretty much do anything. Build full stack apps, websites, reports with just text prompts and deploy them on the spot. They have Abacus AI Desktop, which is the brand new coding editor and assistant that lets you vibe code and build production-ready apps. And the kicker, it's just $10 a month, less than one premium model. Head over to chatlm.abacus.ai or click the link below to level up with ChatLLM Teams. When it comes to these kinds of machines, an important thing to consider is the amount of power they
use. Here I just have the terminal open, nvtop running on the Thor and Spark, and Activity Monitor running on the Mac. And here is the power usage. The Mac Mini M4 Pro is using about 8 watts, the Thor is using about 31 watts, and the Spark is using about 44 watts. That's just to kick things off. I'm going to run a couple of LLMs here and then do some
benchmarking so you get an idea of how these perform. There's a brand new model called Ministral. Haven't tested that on the channel before. So, let's start with that one. And I'm using Ollama, which is a really dead simple way to get started. I've created tutorials on the channel before on how to start with that and how to set it up. There's really not much to it. ollama run Ministral 3, and I'm going to give it the verbose flag.
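For reference, the numbers that the verbose flag prints, the prompt eval rate and the eval rate, can also be pulled from Ollama's local REST API and turned into tokens per second. A minimal sketch, assuming the default local server on port 11434 and a placeholder model tag rather than the exact tag used in the video:

```python
# Minimal sketch: pull the same counters that the verbose flag prints,
# straight from the local Ollama REST API, and turn them into tokens/sec.
# Assumes an Ollama server on the default port; the model tag is a placeholder.
import json
import urllib.request

payload = {
    "model": "ministral-3",  # placeholder tag, use whatever `ollama list` shows
    "prompt": "Design a scalable web application architecture for an e-commerce platform.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.loads(urllib.request.urlopen(req).read())

# Ollama reports durations in nanoseconds.
prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
decode_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"prompt eval rate: {prefill_tps:.1f} tok/s   eval rate: {decode_tps:.1f} tok/s")
```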
Now, different models behave very differently. That's why we have to test as many models as possible. I wish there was a way to keep track of all the different models, all the different versions of them, on all the different hardware that I'm testing, but it's kind of hard to do that. I'm working on something. Design a scalable web application architecture for an e-commerce platform, and so on and so on.
This is one of my medium prompts. It gives a really nice long output, a really detailed architecture. I have
different length prompts that I use.
They're all in my GitHub. I'll link to it down below. All right, let's go. And
there it goes.
That is really interesting that the Mac and the Spark are printing out pretty much the same exact result. They're a
little bit different. Now, this is going to be a very, very long result. So,
while it's happening, take a look at what's going on with uh the Spark here.
There is the GPU usage. We're going at about 95% usage. And over here on the Thor about 94 to 96% usage as well. But
the characteristic of the line is very different on these two. The Spark is very even. So it's constantly using 95% of that GPU. But the Thor is bouncing around a little bit, which tells me, just based on the graph, I don't know what's going on on the inside there, but it's telling me that things are loading and unloading at a rapid pace. So, it may not be as efficient as the Spark, for example, at this particular task, but
we'll see what happens when the result comes. Oh, Spark is done and so is the Mac Mini. You can see that usage is pretty even there as well. I didn't expect it to finish so soon because I wanted to get the power usage, but I'll get that in a moment. Here are the first results. 39 tokens per second for the generation. This is token generation. We'll get to prompt processing, the prefill, in a bit. We have 30.82 tokens per second on the Thor and 34.97 tokens per second on the Spark. So,
they're all kind of close to each other.
Uh, well, I guess the Mac Mini won this round and the Thor lost this round, with the Spark coming in right in the middle there. But the token generation speed is not the only story here. Look at the prompt eval rate. That's the prompt processing. This is the part of the process that's not related to memory bandwidth. This is pure computation. And in pure computation, the Mac Mini loses: 353 tokens per second here. The Thor has 1,036 tokens per second, three times faster. And the Spark, over 2,000 tokens per second. So we have very different characteristics on all three of these machines.
But the two Nvidia machines both have the latest Blackwell chips, which is where that prefill gets processed. So we have the latest Tensor cores in there and the latest CUDA cores. By the way, the Thor has 25% more than the previous generation, which was the AGX Orin, but the Spark definitely beats it out, at least in the prompt prefill stage, in the calculation, because it has a lot more Tensor cores and CUDA cores. But that's not the only difference. All three of these are ARM-based systems, but just because these two are Blackwell systems doesn't mean that they have the same exact chip. The
Thor has a 4 nanometer process, and that's the CPU part of the chip. Whereas the Spark has a 3 nanometer process chip for the CPU, and that's going to directly affect that CPU-to-GPU communication. So the Spark may be more capable of feeding the GPU part, the Blackwell part, than the Thor. But the Thor is not exactly designed for that particular scenario, as we'll see in a bit.
Now, this model is a brand new model, and this is a mixture of experts model. So it's not using all of its parameters that are available at the same time. It's only using a subset of the parameters. It's called a sparse model. If there's a dense model, it has very different characteristics.
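To make that sparse-versus-dense distinction concrete: token generation is largely limited by how many weight bytes have to be streamed from memory for each token, so a mixture of experts model that only activates a small slice of its parameters can decode much faster than a dense model of the same total size. A back-of-the-envelope sketch with illustrative, made-up parameter counts (not the actual Ministral 3 specs):

```python
# Back-of-the-envelope: decode speed is roughly bandwidth / active-weight bytes.
# The parameter counts below are illustrative placeholders, not measured specs.
BANDWIDTH_GB_S = 273  # M4 Pro, Thor, and Spark all quote this figure

def decode_ceiling(active_params_billions: float, bytes_per_param: float) -> float:
    """Upper-bound tokens/sec if every active weight is read once per token."""
    active_gb = active_params_billions * bytes_per_param
    return BANDWIDTH_GB_S / active_gb

dense = decode_ceiling(70, 0.5)   # dense model: all 70B params touched per token
sparse = decode_ceiling(5, 0.5)   # MoE: only ~5B active per token (made-up number)
print(f"dense 70B at 4-bit:        ~{dense:.1f} tok/s ceiling")
print(f"MoE with 5B active, 4-bit: ~{sparse:.1f} tok/s ceiling")
```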
First, let's run this again because I want to take a look at how much power is being used while all three of these are working. So there they are. The GPU is plugging away, doing its thing on all three of these. Let's take a look at what's happening with the power. So, we're at about 70 watts on the Mac Mini. We're at almost 90 watts on the Thor. And we're at 141 watts on the Spark, which is quite a bit higher. Almost two times higher power usage on the Spark. And that's all being used by that compute, that really powerful fast compute that's on there with more cores.
The Mac Mini is done. We got 38.95 tokens per second there. 32.92 tokens per second on the Spark, just a tiny bit slower, but the prompt eval rate went up here to 2,817 tokens per second, which is ridiculous. And we got 28 tokens per second on the Thor. A little bit slower again, but the prompt eval rate was 1,090. So, really good there.
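As a rough efficiency comparison using just the numbers above, you can divide generation speed by power draw to get tokens per joule. Keep in mind these are eyeballed whole-system wattages, not careful measurements, so treat this as a sketch:

```python
# Rough tokens-per-joule comparison from the readings quoted above.
# Wattages are approximate whole-system draws read off monitoring tools,
# so treat this as a sketch rather than a benchmark.
runs = {
    "Mac Mini M4 Pro": (38.95, 70),   # (tokens/sec, approx watts under load)
    "Jetson Thor":     (28.0, 90),
    "DGX Spark":       (32.92, 141),
}
for name, (tps, watts) in runs.items():
    print(f"{name}: {tps / watts:.2f} tokens per joule")
```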
there. Let's take a look at a dense model, which means it's going to be using all the parameters. So, this is going to be quite a bit slower, but dense models are really good for clustering, and I'm going to go into
more of that in later videos. For
example, with Sparks over here and Apple silicon machines. Stay tuned for that.
silicon machines. Stay tuned for that.
I'm going to run Llama 3.3 70 billion parameter model, which is a really chunky model. It's over 40 GB on disk,
chunky model. It's over 40 GB on disk, which means it's going to require all that memory and then some if you want to include some context. So, this little
Mac Mini is almost all but filled up with its 64 GB of memory there. These
machines have plenty more to go with 128 each. All right, here we go. Same
each. All right, here we go. Same
prompt. And we got a bigger model now.
Boom. Boom. Boom.
Look how much slower this thing is.
70 billion parameters. It's a lot of parameters. I mean, you got to give it to these little guys. They're small. Look how small they are. It's a 70 billion parameter model. Think about that. This is going to take a while. All
right, let's take a look at what's happening here. Um, we've got the GPU usage pretty steady on the Spark, 96%. We got a similar kind of thing going on the Thor, where the GPU usage is going up and down by a little bit, jumping up and down there. And the Mac Mini has pretty steady GPU usage, but interestingly, it's higher than it was when we were doing the Ministral 3 model, just by a little bit. So, it's using more of the GPU. Memory used on the Mac Mini: 56.87 GB out of the 64, but the memory pressure is nice and clean there.
No orange or red, so we're good. Power usage is a bit different here. We got more power usage on the Mac Mini, almost 80. We got more usage on the Thor, almost 100 there, and about the same or just a tiny bit more on the Spark with 153 watts. Watts is the unit we're talking about here, folks. I know
somebody was going to leave that comment. 153 what? Watts.
Oh, I'm hearing some sounds.
Who is it? Who is it?
The Mac Mini. Can we hear that?
It's kind of hard to tell in the video, but trust me, that's what's happening here. The Spark is an extremely quiet machine. Is that good or bad? I don't
know. There are other manufacturers, like the Dell GB10. That one has a little bit more cooling going on. I have a couple of those that I'm going to be testing soon, so stay tuned. But yeah, I never heard the Mac Mini make that much squealing before. It's not squealing, it's more like a blowing sound. I gotta say, Llama 3.3 70B was a good model, but it's getting a little bit long in the tooth compared to some of the smaller models, even, that are more modern. Look at this. This is supposed to be an architecture prompt. I wanted
architecture, but it's giving me code.
Not exactly what I wanted. These two are actually giving me architecture. There
is a little bit of code and pseudo code, but this one gave me straight up code.
It doesn't even know what technology I want to use. Okay, we've
got 5.43 tokens per second here on the Mac Mini.
Oh, interesting. 4.61 tokens per second on the Thor and 4.46 tokens per second on the Spark. A little bit of diminishing returns there on the Spark,
even though the prompt eval rate still was amazing on the Spark. More than two times faster, almost three times faster than anything else. So 283 tokens per
second on prompt eval, which is basically prefill. That's the first stage before we do the decode stage, and that's the token generation. If I didn't mention that earlier, my apologies. Prefill, that's the prompt evaluation, and that could include the whole context. So if you are having a chat session, for example, with one of these things, it's going to include your entire chat session. Send that in and process it, everything, including the system prompt too. So, you want that to be fast because sometimes the conversations can get pretty long.
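To put that in perspective, here's a quick sketch of how long the prefill alone would take on a long conversation, using the 70B prompt eval rates from this run and an assumed 8,000-token chat history. The context length is just an illustrative number:

```python
# How long does prefill alone take before the first output token appears?
# Prompt eval rates are the 70B numbers from this run; the 8,000-token
# context is an assumed, illustrative conversation length.
CONTEXT_TOKENS = 8_000
prefill_rates = {          # tokens/sec, prompt eval
    "DGX Spark": 283,
    "Jetson Thor": 103,
    "Mac Mini M4 Pro": 34,
}
for name, rate in prefill_rates.items():
    print(f"{name}: ~{CONTEXT_TOKENS / rate:.0f} s to chew through the context")
```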
So, good on you, Sparky. Uh, we got Thor with 103 tokens per second. Not too bad. Pretty terrible with the Mac Mini: 34 tokens per second.
But the M5 generation, hold on to your pants. We'll be taking a look at that shortly as well. The prompt processing on that has improved quite a bit. So, 4.46 tokens per second for the Spark. Let's verify these numbers from a different perspective.
different perspective. We're going to use a tool called Llama Bench, which comes with Llama CPP, and I got the exact same model that I'm going to be running, but this is the GGUF version.
And this gives us both the prefill and the token generation. By the way, I compiled these on the machines. For the
two Nvidia machines, I used CUDA compilation. That's why it's detecting
compilation. That's why it's detecting the CUDA devices for here, device zero, Nvidia Thor, and here we got device 0, Nvidia GB10. And GB10 is basically the
Nvidia GB10. And GB10 is basically the platform for Spark, for the Dell, for the ASUS machines. And since the creator of Llama CPP actually uses a Mac, you just run the basic build on the Mac and
it automatically compiles for it. Now,
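If you want to reproduce this kind of run, the shape of it is roughly the following. This is a sketch that assumes a compiled llama-bench binary on your PATH and a locally downloaded GGUF file, and the flags mirror the default PP512 and TG128 tests; double-check them against your llama.cpp build:

```python
# Sketch: drive llama-bench from Python and print its results table.
# Assumes a compiled llama-bench binary on PATH and a local GGUF file;
# the model path is a placeholder, and flags should be checked against your build.
import subprocess

result = subprocess.run(
    [
        "llama-bench",
        "-m", "llama-3.3-70b-instruct-q4_k_m.gguf",  # placeholder path
        "-p", "512",   # prompt-processing batch, i.e. the PP512 test
        "-n", "128",   # tokens to generate, i.e. the TG128 test
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```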
Now, take a look at this. On the Mac, llama-bench is actually utilizing more GPU than when we were running Ollama, just by a tiny bit. You can kind of visually see it. I don't have any concrete numbers here, but maybe we'll see some when we actually see the final results here. Ollama might have just a tiny bit of overhead with whatever else is running, the services that it's running. And whoa, look at this. So on the Spark, we got a flat line just like we did before. But here on the Thor, look at that up and down. It's basically shutting down, going to zero on the GPU usage, and going back up to 98%. So more GPU usage, but also less. It's really doing that up and down quite a bit. That was for prefill. That was when we were processing. Now for token generation, it's actually flat. Really interesting
results here for that same exact model.
The Llama 3.3 70 billion parameter, Q4_K_M by the way is the quantization. Same one I used with Ollama. And Q4 just means it's quantized down from the original, which was probably BF16 if I'm not mistaken, but you'll probably correct me in the comments, down to int4. So, integer 4, which means it's four times smaller, requiring less memory to run. Still, 42 GB is a pretty significant chunk, which means you won't be able to run it on even like a 5090 with just 32 gigs of VRAM.
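That 42 GB figure also explains why all three machines land in the same single-digit tokens-per-second range on this model. With 273 GB per second of memory bandwidth each, just streaming the weights once per generated token caps you at roughly 6.5 tokens per second before any other overhead. A quick sanity check, ignoring KV cache traffic and kernel overhead:

```python
# Bandwidth roofline sanity check for the Q4_K_M 70B model.
# Ignores KV-cache traffic and other overheads, so it's an optimistic ceiling.
BANDWIDTH_GB_S = 273   # M4 Pro, Thor, and Spark alike
MODEL_GB = 42          # Q4_K_M Llama 3.3 70B on disk

print(f"theoretical decode ceiling: ~{BANDWIDTH_GB_S / MODEL_GB:.1f} tok/s")
# Measured: ~5.4 (Mac), ~4.6 (Thor), ~4.5 (Spark) tok/s, in the right ballpark.
```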
Okay, the Spark gave us a result that's pretty much the same as what Ollama gave us. 4.56 tokens per second for TG128. TG128 just means token generation. Then we got about 283 tokens per second for PP512, which is prompt processing, or prefill. On the Thor, it's not done yet, so wait for that one. On the Mac Mini, 42, which is higher, for prompt processing 512, and 5.48 for token generation. The Thor has a slightly higher PP512 prompt processing, 130 tokens per second, and TG128 token generation at 4.4. So about the same there, but the Mac Mini is faster at
token generation in this case. This is a dense model. Let's take a look at a sparse model that's pretty popular: GPT-OSS. 20B? Let's just do the 120B, because why not? It's a mixture of experts model, so it should be much faster than Llama 70B.
But I am a little bit worried about the amount of memory that's available on the Mac Mini because this is a pretty big model. So let's see. Oh, failed to load the model on the Mac Mini. Oh, that's too bad. You can see where having 128 gigs of memory is a big plus here. The two Nvidia boxes have that and they're still running it. And here a pretty big difference between the Thor and the Spark is finally revealed. In this
particular architecture, for this particular model, we have more than two times faster speed for prompt processing on the Spark than the Thor, and almost two times faster token generation speed as well. So the Thor: 37.17 for token generation, 464 for prompt processing, while the Spark has 52.77 token generation, 977 for prompt processing. Pretty big difference there.
As you can see, the type of model and the architecture of the model that you're running matters a lot. But let's say you're running a dense model and you want to save some money. The Thor is not a bad option. 128 gigs. But generally, the Thor is meant for applications like automotive and robotics. It's supposed to have deterministic latency, which means it doesn't vary its performance as widely as the Spark does. I didn't test the variability between different things. I just showed you examples, but theoretically, that's what it's supposed to be good at. So, safety first, power conservative. We saw that it uses a lot less power than the Spark, and the Spark would be better for bursty kinds of applications and serving multiple users.
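One way to actually check that deterministic latency claim, which I didn't do here, is to stream tokens and look at the spread of the inter-token gaps rather than just the average throughput. A rough sketch, again assuming a local Ollama server and a placeholder model tag:

```python
# Sketch: measure inter-token latency jitter via Ollama's streaming API.
# Assumes a local Ollama server; the model tag is a placeholder.
import json
import statistics
import time
import urllib.request

payload = {
    "model": "ministral-3",  # placeholder tag
    "prompt": "Explain GPU memory bandwidth in one paragraph.",
    "stream": True,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

gaps, last = [], None
with urllib.request.urlopen(req) as stream:
    for line in stream:  # newline-delimited JSON, roughly one chunk per token
        chunk = json.loads(line)
        now = time.perf_counter()
        if last is not None and not chunk.get("done"):
            gaps.append(now - last)
        last = now

if gaps:
    gaps.sort()
    p50 = statistics.median(gaps)
    p99 = gaps[int(0.99 * (len(gaps) - 1))]
    print(f"median inter-token gap: {p50 * 1000:.1f} ms, p99: {p99 * 1000:.1f} ms")
```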
Now, we took a look at llama.cpp. On the CUDA platform, you have vLLM, which is really good at serving multiple users at the same time and doing concurrent tasks. I haven't shown it to you in this case, but I did make a video about that.
You can watch that right over here.
Thanks for watching and I'll see you in the next one.