
Same 128GB but cheaper

By Alex Ziskind

Summary

Topics Covered

  • Thor Matches Spark at Fraction of Cost
  • Memory Bandwidth Equalizes Performance
  • Prefill Exposes Compute Differences
  • Thor Excels in Power Efficiency
  • Thor Built for Deterministic Latency

Full Transcript

The Jetson Thor is Nvidia's new developer kit, and the DGX Spark is for everybody, but what's not often talked about is that the Thor has 128 gigs of RAM, just like the Spark. That's a lot of RAM, and it's more than $1,000 cheaper than the Spark. So, why don't we just buy the Thor instead of the Spark? Well, you could. I'm going to run some tests right now to see what it's capable of and talk about a couple of differences between those two machines and also Apple Silicon. The reason I picked these three is because the M4 Pro, the Thor platform, and the Spark all have 273 GB per second of memory bandwidth. And I've mentioned many times before that memory bandwidth is really important for machine learning, but that doesn't tell the whole story. By the way, Nvidia sent me this Thor to do some testing. And they're also giving away a Jetson Orin Nano Super signed by Jensen Huang. Rules for the raffle are in the description.

The other two machines I bought. So these days I'm always flipping between models: GPT for research, Claude for coding, Nano Banana for image generation, Veo, Kling, and Runway for video. Six tabs, six bills, and counting. Enter ChatLLM Teams. One dashboard houses every top LLM, and RouteLLM picks the right one: GPT Mini for ultra-fast answers, Claude Sonnet for coding, Gemini Pro for massive context. They recently added Gemini 3 and GPT 5.1 the moment they dropped. Create professional presentations with graphs, charts, and deep-research detailed content. Need human-sounding copy? Humanize rewrites text to defeat AI detectors. Need visuals? Pick frontier or open-source models: Nano Banana, Midjourney, and Flux for images, Magnific upscaling, plus Veo, Wan, and Sora for video, all built in. You also get Abacus.AI's Deep Agent to do pretty much anything: build full-stack apps, websites, and reports with just text prompts and deploy them on the spot. They have Abacus AI Desktop, which is the brand new coding editor and assistant that lets you vibe code and build production-ready apps. And the kicker: it's just $10 a month, less than one premium model. Head over to chatllm.abacus.ai or click the link below to level up with ChatLLM Teams.

When it comes to these kinds of machines, an important thing to consider is the amount of power they use.

Here I just have the terminal open, nvtop running on the Thor and the Spark, and Activity Monitor running on the Mac. And here is the power usage: the Mac Mini M4 Pro is using about 8 watts, the Thor is using about 31 watts, and the Spark is using about 44 watts. That's just to kick things off.
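
For reference, that monitoring setup can be reproduced with something like the commands below; a sketch, assuming nvtop from the standard Ubuntu repositories on the two Nvidia boxes, with powermetrics as a terminal alternative to the Activity Monitor view used on the Mac:

    # On the Thor and the Spark (both Ubuntu-based): live GPU utilization and power draw
    sudo apt install nvtop
    nvtop

    # On the Mac: Activity Monitor's GPU window, or sample GPU power from the terminal
    sudo powermetrics --samplers gpu_power -i 1000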

I'm going to run a couple of LLMs here and then do some benchmarking so you get an idea of how these perform. There's a brand new model called Ministral 3. I haven't tested that on the channel before, so let's start with that one. And I'm using Ollama, which is a really dead simple way to get started. I've created tutorials on the channel before on how to start with that and how to set it up. There's really not much to it: it's just ollama run with the model name, and I'm going to give it the verbose flag.
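
For anyone following along, the command pattern is just the one below; a minimal sketch, and note that the exact model tag in the Ollama library may differ from the name used in the video:

    # --verbose makes Ollama print prompt eval (prefill) and token generation rates after each response
    ollama run ministral-3 --verbose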

Now, different models behave very differently. That's why we have to test as many models as possible. I wish there was a way to keep track of all the different models, all the different versions of them, on all the different hardware that I'm testing, but it's kind of hard to do that. I'm working on something. The prompt is "Design a scalable web application architecture for an e-commerce platform," and so on and so on. This is one of my medium prompts. It gives a really nice, long output, a really detailed architecture. I have different length prompts that I use. They're all in my GitHub; I'll link to it down below. All right, let's go. And there it goes.

That is really interesting: the Mac and the Spark are printing out pretty much the same exact result. They're a little bit different. Now, this is going to be a very, very long result. So, while it's happening, take a look at what's going on with the Spark here. There is the GPU usage: we're going at about 95% usage. And over here on the Thor, about 94 to 96% usage as well. But the characteristic of the line is very different on these two. The Spark is very even, so it's constantly using 95% of that GPU. But the Thor is bouncing around a little bit, which tells me, just based on the graph (I don't know what's going on on the inside there), that things are loading and unloading at a rapid pace. So, it may not be as efficient as the Spark, for example, at this particular task, but we'll see what happens when the result comes. Oh, the Spark is done and so is the Mac Mini. You can see that usage is pretty even there as well. I didn't expect it to finish so soon, because I wanted to get the power usage, but I'll get that in a moment.

Here are the first results: 39 tokens per second for the generation on the Mac Mini. This is token generation; we'll get to prompt processing and prefill in a bit. We have 30.82 tokens per second on the Thor and 34.97 tokens per second on the Spark. So, they're all kind of close to each other. Well, I guess the Mac Mini won this round and the Thor lost this round, with the Spark coming in right in the middle there. But the token generation speed is not the only story here. Look at the prompt eval rate. That's the prompt processing. This is the part of the process that's not related to memory bandwidth; this is pure computation. And in pure computation, the Mac Mini loses: 353 tokens per second here. The Thor has 1,036 tokens per second, three times faster. And the Spark is at over 2,000 tokens per second. So we have very different characteristics on all three of these machines. But the two Nvidia machines both have the latest Blackwell chips, which is where that prefill gets processed. So we have the latest Tensor cores in there and the latest CUDA cores. By the way, the Thor has 25% more than the previous generation, which was the AGX Orin, but the Spark definitely beats it out, at least in the prompt prefill stage, in the calculation, because it has a lot more Tensor cores and CUDA cores. But that's not the only difference.

All three of these are ARM-based systems, but just because these two are Blackwell systems doesn't mean that they have the same exact chip. The Thor is on a 4-nanometer process, and that's the CPU part of the chip, whereas the Spark has a 3-nanometer process chip for the CPU, and that's going to directly affect that CPU-to-GPU communication. So the Spark may be more capable of feeding the GPU part, the Blackwell part, than the Thor. But the Thor is not exactly designed for that particular scenario, as we'll see in a bit. Now, this model is a brand new model, and it's a mixture of experts model. So it's not using all of the parameters that are available at the same time; it's only using a subset of the parameters. It's called a sparse model. A dense model has very different characteristics. First, let's run this again, because I want to take a look at how much power is being used while all three of these are working.

So there they are. The GPU is plugging away, doing its thing on all three of these. Let's take a look at what's happening with the power. So, we're at about 70 watts on the Mac Mini, almost 90 watts on the Thor, and 141 watts on the Spark, which is quite a bit higher, almost two times higher power usage on the Spark. And that's all being used by that compute, that really powerful, fast compute that's on there with more cores. The Mac Mini is done. We got 38.95 tokens per second there, and 32.92 tokens per second on the Spark, just a tiny bit slower, but the prompt eval rate went up here to 2,817 tokens per second, which is ridiculous. And we got 28 tokens per second on the Thor. A little bit slower again, but the prompt eval rate was 1,090. So, really good there.

Let's take a look at a dense model, which means it's going to be using all the parameters. So, this is going to be quite a bit slower, but dense models are really good for clustering, and I'm going to go into more of that in later videos, for example with Sparks over here and Apple Silicon machines. Stay tuned for that. I'm going to run the Llama 3.3 70-billion-parameter model, which is a really chunky model. It's over 40 GB on disk, which means it's going to require all that memory and then some if you want to include some context. So, this little Mac Mini is almost all but filled up with its 64 GB of memory there. These machines have plenty more to go, with 128 each.
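
The run itself follows the same Ollama pattern as before; a sketch assuming the standard library tag, which as far as I can tell pulls a Q4_K_M quant by default, matching the on-disk size mentioned here:

    # Pulls roughly 40+ GB of weights on first run, then reports eval rates thanks to --verbose
    ollama run llama3.3:70b --verbose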

All right, here we go. Same prompt, and we've got a bigger model now. Boom. Boom. Boom. Look how much slower this thing is. 70 billion parameters. It's a lot of parameters. I mean, you've got to give it to these little guys. They're small. Look how small they are. It's a 70-billion-parameter model. Think about that. This is going to take a while.

All right, let's take a look at what's happening here. The GPU usage is pretty steady on the Spark, 96%. We've got a similar kind of thing going on the Thor, where the GPU usage is going up and down by a little bit, jumping up and down there. And the Mac Mini has pretty steady GPU usage, but interestingly, it's higher than it was when we were doing the Ministral 3 model, just by a little bit. So, it's using more of the GPU. Memory used on the Mac Mini: 56.87 GB out of the 64, but the memory pressure is nice and clean there. No orange or red, so we're good. Power usage is a bit different here. We've got more power usage on the Mac Mini, almost 80; more usage on the Thor, almost 100 there; and about the same or just a tiny bit more on the Spark, with 153 watts. Watts is the unit we're talking about here, folks. I know somebody was going to leave that comment. What? What?

Oh, I'm hearing some sounds. Who is it? Who is it? The Mac Mini. Can we hear that? It's kind of hard to tell in the video, but trust me, that's what's happening here. The Spark is an extremely quiet machine. Is that good or bad? I don't know. There are other manufacturers, like the Dell GB10. That one has a little bit more cooling going on. I have a couple of those that I'm going to be testing soon, so stay tuned. But yeah, I never heard the Mac Mini make that much squealing before. It's not squealing, it's more like a blowing sound.

I've got to say, Llama 3.3 70B was a good model, but it's getting a little bit long in the tooth compared to even some of the smaller models that are more modern. Look at this. This is supposed to be an architecture prompt. I wanted architecture, but it's giving me code. Not exactly what I wanted. These two are actually giving me architecture. There is a little bit of code and pseudocode, but this one gave me straight-up code. It doesn't even know what technology I want to use.

Okay, we've got 5.43 tokens per second here on the Mac Mini. Oh, interesting: 4.61 tokens per second on the Thor and 4.46 tokens per second on the Spark. A little bit of diminishing returns there on the Spark, even though the prompt eval rate was still amazing on the Spark, more than two times faster, almost three times faster than anything else. So, 283 tokens per second on prompt eval, which is basically prefill. That's the first stage, before we do the decode stage, which is the token generation. If I didn't mention that earlier, my apologies. Prefill, that's the prompt evaluation, and that could include the whole context. So if you are having a chat session, for example, with one of these things, it's going to include your entire chat session. Send that in, process everything, including the system prompt too. So, you want that to be fast, because sometimes the conversations can get pretty long. So, good on you, Sparky. We've got the Thor with 103 tokens per second. Not too bad. Pretty terrible with the Mac Mini: 34 tokens per second. But the M5 generation, hold on to your pants. We'll be taking a look at that shortly as well. The prompt processing on that has improved quite a bit. So, 4.46 tokens per second for the Spark.

Let's verify these numbers from a different perspective. We're going to use a tool called llama-bench, which comes with llama.cpp, and I've got the exact same model that I'm going to be running, but this is the GGUF version. And this gives us both the prefill and the token generation. By the way, I compiled these on the machines. For the two Nvidia machines, I used CUDA compilation. That's why it's detecting the CUDA devices here: device 0, Nvidia Thor, and here we've got device 0, Nvidia GB10. And GB10 is basically the platform for the Spark, for the Dell, for the ASUS machines. And since the creator of llama.cpp actually uses a Mac, you just run the basic build on the Mac and it automatically compiles for it.
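
The build and benchmark steps look roughly like this; a sketch assuming a recent llama.cpp checkout, with the CUDA flag used only on the two Nvidia boxes (the plain build on the Mac picks up Metal on its own), and the GGUF filename is just a placeholder:

    # Build llama.cpp; add -DGGML_CUDA=ON on the Thor and the Spark, plain cmake on the Mac
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j

    # Default llama-bench run: 512-token prompt processing (pp512) and 128-token generation (tg128)
    ./build/bin/llama-bench -m Llama-3.3-70B-Instruct-Q4_K_M.gguf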

Now, take a look at this. On the Mac, llama-bench is actually utilizing more GPU than when we were running Ollama, just by a tiny bit. You can kind of visually see it. I don't have any concrete numbers here, but maybe we'll see some when we actually see the final results. Ollama might have just a tiny bit of overhead with whatever else is running, the services that it's running. And whoa, look at this. So on the Spark, we've got a flat line, just like we did before. But here on the Thor, look at that: up and down. It's basically shutting down, going to zero on the GPU usage, and going back up to 98%. So more GPU usage, but also less. It's really doing that up and down quite a bit. That was for prefill, when we were processing. Now, for token generation, it's actually flat. Really interesting results here for that same exact model.

The Llama 3.3 70-billion-parameter Q4_K_M, by the way, is the quantization, the same one I used with Ollama. And Q4 just means it's quantized down from the original, which was probably BF16 if I'm not mistaken, but you'll probably correct me in the comments, down to int4. So, integer 4, which means it's roughly four times smaller, requiring less memory to run. Still, 42 GB is a pretty significant chunk, which means you won't be able to run it on even a 5090 with just 32 gigs of VRAM.

Okay, the Spark gave us a result that's pretty much the same as what Ollama gave us: 4.56 tokens per second for TG128, and TG128 just means token generation. Then we got about 283 tokens per second for PP512, which is prompt processing, or prefill. On the Thor, it's not done yet, so wait for that one. On the Mac Mini, 42 for prompt processing (PP512), which is higher than before, and 5.48 for token generation. The Thor comes in slightly higher on PP512 prompt processing, 130 tokens per second, and TG128 token generation at 4.4. So, about the same there, but the Mac Mini is faster at token generation in this case.

This is a dense model. Let's take a look at a sparse model that's pretty popular: GPT-OSS 20B. Actually, let's just do the 120B, because why not? It's a mixture of experts model, so it should be much faster than Llama 70B. But I am a little bit worried about the amount of memory that's available on the Mac Mini, because this is a pretty big model. So let's see. Oh, it failed to load the model on the Mac Mini. Oh, that's too bad. You can see where having 128 gigs of memory is a big plus here. The two Nvidia boxes have that, and they're still running it. And here, a pretty big difference between the Thor and the Spark is finally revealed. In this particular architecture, for this particular model, we have more than two times faster prompt processing on the Spark than the Thor, and almost two times faster token generation speed as well. So the Thor: 37.17 for token generation, 464 for prompt processing, while the Spark has 52.77 for token generation and 977 for prompt processing. Pretty big difference there.

As you can see, the type of model and the architecture of the model that you're running matters a lot. But let's say you're running a dense model and you want to save some money. The Thor is not a bad option: 128 gigs. But generally, the Thor is meant for applications like automotive and robotics. It's supposed to have deterministic latency, which means it doesn't vary its performance as widely as the Spark does. I didn't test the variability between different things; I just showed you examples, but theoretically, that's what it's supposed to be good at. So, safety first and power conservative. We saw that it uses a lot less power than the Spark, and the Spark would be better for bursty kinds of applications and serving multiple users.

Now, we took a look at llama.cpp. On the CUDA platform, you also have vLLM, which is really good at serving multiple users at the same time and doing concurrent tasks. I haven't shown it in this case, but I did make a video about that. You can watch that right over here.
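
If you want to try the multi-user route yourself, serving an OpenAI-compatible endpoint with vLLM is typically a one-liner; the model name below is only an illustrative placeholder, and on a 128 GB box you would pick whatever actually fits:

    # Starts an OpenAI-compatible server on port 8000; clients can then send concurrent /v1/chat/completions requests
    vllm serve meta-llama/Llama-3.3-70B-Instruct --max-model-len 8192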

Thanks for watching and I'll see you in the next one.
