GPU and CPU Performance LLM Benchmark Comparison with Ollama
By TheDataDaddi
Summary
## Key takeaways

- **RTX 3090 doubles P100 tokens/second**: Holistically, the RTX 3090 outperforms with roughly double the token generation of the P100, which barely beats the P40, while the CPU cases are significantly slower. [34:31], [35:12]
- **P40 beats 3090 on single-GPU price/performance**: In the single-GPU case, the P40 provides a better price per token per second than the RTX 3090, making it the best value for money. [43:05], [43:39]
- **10 tokens/second usability threshold**: 10 tokens per second is the minimum for usable conversational LLM applications, as shown by visualizers; below 5 is not real-time. [47:07], [48:41]
- **RTX 3090 always hits 10+ tokens/second**: The RTX 3090 performs at or above 10 tokens per second across all tested models that fit fully in VRAM on single- or double-GPU setups. [50:44], [51:39]
- **CPUs unusable for LLMs**: CPU tests averaged below 3 tokens per second even for 8B models, making them unusable for real-time applications across all hardware environments. [28:45], [29:25]
- **70B costs 6x more for 3x quality**: Running the 70B Instruct model requires six times the cost of the 8B but delivers only three times the Hugging Face score and a negligible user-perceived quality gain. [01:04:43], [01:08:43]
Full Transcript
[Music] Alright guys, what's up, Data Daddi here again this evening. In today's video we're going to be jumping into a topic I've promised for a while now, and that is benchmarking different GPU and CPU performance with a lot of the open-source LLMs.

In a video prior to this I went over an add-on to Bench Daddy, the GPU benchmarking suite I've been working on. I updated it so that it can test, at scale, a lot of the different Llama flavors, or basically anything that's in the Ollama library, with a wide range of GPUs and with the CPU. So if you're interested in the actual code or you want to run this yourself, the link will be in the video description, not only to the GitHub but also to the video where you can find the code and all of that in more depth. That will form the basis of this discussion today. I'll also link the underlying spreadsheet, so you can go check that out if you're interested and you prefer the data in a more raw form.

For today, I'm pretty much just going to go through this Looker Studio presentation I put together. I think it's a great tool; I used it in the past for the last comparison with the basic test in the GPU suite, and I think it's a very underrated data visualization tool, so I highly recommend Looker Studio. In any case, we'll go ahead and hop into it. I'm going to put this in presentation mode; I'll need to jump around a little bit to some of these links just to explain things, but for now I'll put it in presentation mode so we can see this a little better.
The first thing I want to talk about is the hardware environment, and then I'll talk about the GPUs I used in the study as well; I think it's important to understand the context, and I'll try to explain why I did these things too. There are three different hardware environments. The first is my Supermicro 4028GR-TRT server, which is what we're looking at in the first category. The second is my desktop computer; I'll go over the specs in a second. And the third is my Mac M1. I've always heard so much about Macs being really good for llama.cpp that I wanted to try it out, especially since Ollama is pretty much just a wrapper around llama.cpp anyway. So I figured we could do not only a GPU comparison but also a CPU comparison, and I'm going to try to put all of that in the same video, so do bear with me, it is a little bit long, but I'll try to get through the information as quickly and efficiently as I can.
We'll start here. The specs, just to be explicit: for the Supermicro server we have the Xeon E5-2699 v4 as the CPU; the longer name is here, the manufacturer is Intel, it has a base clock of 2.2 GHz, each CPU has 22 physical cores, and there are two CPU sockets, so there are two CPUs in this particular machine. That works out to 88 threads, or 88 logical cores, in this case. I also wanted to include, for completeness, the underlying OS. This is the bare metal, the OS actually loaded onto the machine, and then I ran everything in Docker; that's why you see the bare-metal OS and then the Docker OS, which is Ubuntu 22.04, just letting you know that's the OS loaded inside the Docker container where the model itself is actually running. I wanted to make sure you could easily tell what the underlying OS is and what the Docker OS is, so you can make your own assumptions about what that might do to the performance. I know some people will say it's better to do this kind of thing on bare metal, but Docker is extremely convenient, and from my experience the performance degradation is only marginal, so I think this is fine, especially if you're keeping things consistent across all of your tests. For you at home, maybe you have different applications, so do keep this in mind.
The next thing is the RAM: it's DDR4, from SK hynix (I'm not sure how you pronounce that, to be honest); that's the manufacturer. The memory frequency, and this is what it's actually running at, not the rated capacity, is 1866 megatransfers per second, and the RAM size is just over 1.5 terabytes. The motherboard is a Supermicro X10DRG-O+-CPU. There are different motherboard configurations depending on how many GPUs, or what kind of GPU configuration, you want for this server, and this is the one that has eight vertically installed GPUs in the back of the server. The interconnect technology is PCIe 3.0. There are probably some things I'm missing, but these are the most relevant hardware environment attributes I thought to add.
Next is my desktop machine, and that is a Ryzen 9 5950X, so pretty performant there, probably more than I need for my desktop to be honest. This is an AMD CPU; its base clock is significantly higher at 3.4 GHz, there are 16 physical cores, only one CPU in this case, and 32 logical cores, or 32 threads. I'm running Windows 11 on the underlying machine itself, and then Ubuntu 22.04 in Docker again, so that's another nice difference there. This is DDR4 again, this time from Corsair, with a much higher memory frequency, and the RAM is significantly smaller. The motherboard is an MSI B550M PRO-VDH; it's actually part of another build I've done in the past, so if you're interested you can check that out in some of my earlier videos. And this one is PCIe 4.0, so it has faster PCIe; in theory it should be able to transfer data a bit faster, and we'll see if that actually comes into play when we get to the results. Then finally there's the Mac M1. I just wanted to do a comparison on Apple silicon.
I know the M1 is kind of old at this point, now that we have the M3 Pro, Max, and whatever else coming out, so this may not be as relevant, but I still wanted to include it as a comparison because of the unified memory. Anyway, it's Apple obviously; the base clock is pretty comparable to what I have in my desktop; there are eight fewer physical cores, and only one CPU. Interestingly enough, only with Apple, these are actually P-cores and E-cores from what I remember, and they only have one thread per core, so that was an interesting architectural difference I didn't realize. The OS in this case is Sonoma, version 14.6.1, and again I'm using Docker, so the underlying OS doesn't matter as much. The RAM type is LPDDR4, low-power DDR4, so a little bit different RAM setup; SK hynix is the RAM manufacturer, and the RAM frequency is quite a bit higher at 4266. Finally, all of this is integrated on the Apple chip, so there really isn't a traditional motherboard, so to speak; it's all built and soldered together, and the same goes for the interconnect technology. So that's what we're looking at in terms of hardware.
The next thing that's relevant is the GPU specs, or which GPUs we're actually using. If you've been watching my channel for any length of time you probably already know: it's the P40, the P100, and the RTX 3090. I'm hoping to add the Titan to this list pretty soon, and then I'll try to start including some other ones as well as I have the money to buy them. This row is the compute capability, basically the specific version, if you will, of each GPU. Certain GPUs have different iterations or releases with different computational abilities, so this tells you more specifically which P40, for example, you're talking about: 6.1, 6.0, and 8.6. And these are the prices. This is not the price I paid at the time I got them; this is the price, as of when I was putting this together last week, that you could pay on eBay for a refurbished or new unit, or whatever was available at that particular time. Basically, what this means to me is the price you can pay from a reliable seller: somebody with a clear business or a clear track record showing they're reliable, with good feedback, above something like 95% satisfaction on the transactions they've already made, and at least a couple hundred transactions. That's what I'm looking for when I say these are reputable sellers, places I would actually buy the GPUs from, so I'm trying to be as realistic as possible about what these prices actually are, and then obviously I choose the lowest one of all those sellers that I can find.
Next is the VRAM size: 24 GB for the P40, 16 GB for the P100, and 24 GB for the RTX 3090. The actual VRAM types are different as well: GDDR5, HBM2, and GDDR6X, so all different memory architectures across the different GPUs. There's quite a bit of difference in bandwidth too: the P40 has the lowest bandwidth, the P100 has about twice the bandwidth of the P40, and the 3090 has the highest by far. We'll see if this makes a difference later when we're looking at the results.
Excuse me, I just had a coughing fit; still getting over a little cold. Here we have the TDP, basically the top wattage you can expect from these GPUs: 250 W for both the P40 and the P100, and 350 W for the RTX 3090. NVLink could be relevant here, and in fact I would argue it probably is, based on the results you'll see later: these two do not have NVLink, and the RTX 3090 does. Just as a side note, I found this interesting: a lot of people have asked me about NVLink for these two GPUs over the course of my time on YouTube, and at least from what I've found, there is no NVLink option for the PCIe versions of these GPUs. Now, if you buy the SXM2 form factor, the ones that are square, have the square heat sinks, and look more like traditional chips, those, with that architecture and form factor, do support NVLink by nature. So if you do see these GPUs listed as having NVLink available, it's, from my understanding, only for the SXM2 form factor. Just wanted to clear that up; I know it's something only a few people will care about, but I thought I'd mention it while I had it in mind.
Next, let's look at theoretical performance, or better put, how these GPUs should stack up based on what we'd theoretically expect. The P40's FP16 performance is basically negligible: it has very good single-precision (FP32) performance, but it performs pretty badly in half precision (FP16) and double precision (FP64). In both of those cases it supposedly has poor performance and really only operates well in the FP32 case; traditionally it's positioned more as an inference GPU, although I would argue that's not necessarily true. The P100 is more of a generic GPU, and in fact it's the only GPU here with relatively decent double precision. It's slightly worse on single precision, but it handles the half-precision case much better. Then you have the RTX 3090, which has basically the same performance in half and single precision, and far and away the best performance in both. In double precision it's pretty much the same as the P40, but that's really not that relevant for AI and machine learning, to be honest with you; it's still relevant in some scientific computing cases, but for the most part your data types are going to be 32-bit or 16-bit, and these days even quantized, especially with LLMs, where quantization is huge for shrinking tensor sizes as much as possible.
OK, so now we've got those out of the way; those are the hardware environments and the GPUs we're going to be looking at in this study. The next most pertinent thing is the prompts. I wanted to come up with a well-rounded prompt suite, and the idea initially was to create five prompts that increase in difficulty, so prompt one would be the easiest and prompt five the most difficult, although I don't think it really came out that way in practice. I also probably could have done a better job with the response format, in terms of asking the models to format things differently, but overall I think this provides a pretty good, well-rounded list of prompts that ask for different things and encourage different outputs from these various models. I'll go through them quickly. The first is obviously the simplest: just "Explain what artificial intelligence is." This is a generic question that should require relatively few tokens from whatever model it's given to. The next is "Describe the process of photosynthesis in plants and how it supports the ecosystem by producing oxygen and glucose," so a slightly more complicated question that hopefully elicits a more complicated response from the model. The next is "Provide a detailed explanation of quantum entanglement, covering the EPR paradox, Bell's theorem, and how entanglement is utilized in quantum computing for teleportation of information." That's a more complex topic, it requires, in theory, more logic and reasoning, and the response should probably be longer; at least that's the idea.
Continuing with that idea, the next prompt is "Write a short story about a scientist who discovers a new form of energy but must face ethical dilemmas as governments and corporations fight over control of the technology. The story should include elements of suspense, character development, and moral conflict." So not only is there a pretty high level of logic and understanding involved, it's also asking for the creative side, and depending on whether it's an instruct model or a different kind of model, it may do better or worse at this particular task. This one is designed not only to produce a longer response but also to take things in a more creative direction. Finally, the last one is "List and explain the top 10 most important breakthroughs in artificial intelligence in the last decade, focusing on both advancements in hardware and algorithms." Since it asks for a top-10 list, it will probably produce a bulleted or otherwise hierarchical response, so again it tests the GPU in a slightly different way, even if it's probably a simpler prompt overall. That's really the key here, and if you want to go back and test with different prompts, the benchmark suite is built so you can easily integrate new ones; in fact, it's just a YAML file, so all you have to do is delete the ones you don't like and add the ones you do. It's pretty straightforward, and you could test hundreds if you wanted. All of that is available on GitHub if you'd like to do that, or if you don't like the prompts you see here. OK, next it's worth mentioning the test runs; this is kind of the experimental setup.
All tests were performed by averaging the test metrics across each prompt, for each model and GPU combination. The way that works is: you have a model and GPU combination, and a Docker instance is instantiated for it. The model is loaded into the Docker container, or, depending on what you set, it can live on a volume or in the container itself. The container only has access to the GPUs you set in the config, so if you want to test one GPU, you set that GPU ID in the config, and that model, or list of models, will be tested against that GPU. So you're testing model and GPU combinations, and each prompt is tested a number of times so you can get an average: you can set the test runs to three, five, ten, however many you want, and it will average across all of the runs for each prompt. That way you get an idea of how the model performed for this prompt, on this model and GPU combination; that's essentially the goal here.
For the GPU tests, I broke things into a number of smaller subtests so it was a little easier to handle. GPU test one is on hardware environment one, the Supermicro server where all of my GPUs are; basically, hardware environment one is where all the GPU tests occur, because that's just where I have them right now. The prompts column lists the prompts tested, and in this case it's the same for all of them, all five. Then there are the models for each test: for test one it's the 8 billion instruct with 4-bit quantization, the 8 billion instruct with 8-bit quantization, and the 8 billion instruct at full half precision, tested with one RTX 3090, one P40, and one P100. So we're testing each of the single GPUs with all of the models that will actually fit inside the GPU. You may be asking why I didn't just test all of the models: well, I wanted to test a good smattering of models that would fit all the way inside the GPU's VRAM, because I don't think it's a fair test to split between the CPU and GPU, which Ollama will do, but that's not truly testing the performance of the GPU at that point. That's why I set this test up the way I did.
The next test follows a similar narrative: still hardware environment one, the same prompts, the same basic models, but this time we're testing two GPUs, so we can load larger, more intense models and split them across both GPUs. Here we've included the 70 billion instruct models: the 2-bit quantized Q2_K, plus the K_S and K_M style quants. These are all ways of keeping things quantized while retaining more information; I probably should have done a little more research here, to be honest, and maybe I'll make a follow-up video, but my understanding is that these are ways to keep more information at the 2-bit quantization level, kind of intermediary steps between that and the baseline Q3 (from what I've read, the S and M suffixes mark small and medium variants, with the larger variants keeping more of the important tensors at higher precision). They give you a little better quality without going all the way up to the next full bit of quantization, from Q2 to Q3 or Q4 or something like that. I did a little preliminary research but have forgotten the details at this point, so if you know, please drop it in the comments, or link me to some papers or a good explanation; I think that would be awesome. Honestly, for our purposes I don't think it matters that much; we're just trying to get an overall idea of how these GPUs work with different models, so I tried to include some of these odd ones I don't understand that well so the test can be as comprehensive as possible. So anyway, we have a 26 GB one, a 31 GB one, and a 43 GB one, the 70 billion at Q2, Q3, and Q4. I only had 48 GB of VRAM to work with and couldn't go up to the next model size, so that was as big as we could go with the two-GPU setups. That was test two: test one is one GPU and test two is two GPUs, which makes a little bit of sense.
Then for test three, again on hardware environment one with the same prompts, but this time, notice there's a comma here, it's two 3090s, two P40s, and two P100s, all of them at once. What I wanted to see here is how mixing different GPUs would work: where is the bottleneck, is there a bottleneck between the P40, P100, and RTX 3090, do they play nicely together, can you even run this kind of setup? This is more of an experimental setup to understand how performance looks when you have a mismatch of GPUs and you're splitting across so many of them, so I just wanted to test it and see how it would go. Another thing to note is that I went all the way up to a model that doesn't fit on the GPUs, so in the last case we're fully utilizing all six GPUs and also pushing some of the model onto CPU RAM. This should cover all the cases, so we can see how performance looks in that scenario as well.
OK, so those are the first three tests, the three GPU tests, and then we also had three CPU tests. These just test each of the hardware environments independently, no GPUs anymore, with the same five prompts. For hardware environment one, the Supermicro server, we tested the 8 billion instruct 4-bit quantized model, the 8-bit quantized model, and the half-precision 8 billion model as well. You may be thinking, OK, you didn't really test that many models for the CPU, why not? You'll see later in the results that the tokens per second was just so low I didn't think it was worth my time to keep testing larger models; at this point we were already looking at something like three tokens a second, which is so slow it's, in my opinion, unusable. I was trying to gauge performance without wasting time, and beyond that I think you would have asymptotically approached one or two tokens a second. CPU test two is hardware environment two, which I believe was my desktop setup; let me go back one more, yeah, the desktop setup. Again, I only tested two models, because after the first one it was so slow that it was diminishing returns at that point. And then test three, same thing, on the Mac. I was actually quite surprised: I thought the performance would be surprisingly good here, and it was probably the worst overall. I know it is the M1, and I only had, I think, 16 gigs of unified memory, which isn't much, but I expected performance to be better, especially with how much people brag on Apple silicon for machine learning. I'm not a hater, I like Apple products a lot, I think they always do a good job, and it may just be the fact that I have an old laptop; the M1 is kind of old now, and maybe that's why the performance was less than expected.
But in any case, that's the experimental setup, and now we'll look at the test metrics. I do apologize, I know we're 30 minutes in and I'm still describing the setup; I'll try to make sure there's a link in the video description to jump straight to the results if you don't care about any of this, and I'll try to insert something at the beginning of the video to let you know, because I know some people just want to see the results, and that's fine. The test metrics we're looking at are: load duration, how long it actually takes to load the model onto the GPU or into CPU RAM, whatever the case is; prompt evaluation, how long it takes to process the input prompt; response evaluation, how long the model actually spends generating the response; and total duration, which is just 1 + 2 + 3 from above, the summation of all of those. One thing I forgot to add here is that there's also a little bit of overhead, so the total duration may not exactly match the sum of the load, prompt, and response times; if it doesn't match, that's just the overhead. Tokens per second is calculated by dividing the number of tokens generated during the response by the response duration itself; the durations are actually output in nanoseconds, so they're divided and then multiplied by 10^9 to put things in per-second form. In any case, you get the idea. I took this exactly from Ollama's documentation, so that's how they recommend doing it, and these are all metrics the model already outputs; I basically just made it into a scalable testing solution and I'm providing all of this analysis, so I'm not reinventing the wheel or anything here, I'm just trying to make it digestible.
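To make the tokens-per-second math concrete, here's a minimal sketch of that calculation against Ollama's HTTP API. The endpoint and the `load_duration` / `prompt_eval_duration` / `eval_count` / `eval_duration` fields (all durations in nanoseconds) are part of Ollama's documented `/api/generate` response; the model tag and prompt are placeholders, and this is not the Bench Daddy code itself.

```python
import requests

# Minimal sketch: run one prompt against a locally running Ollama instance
# and derive the same metrics discussed above from the documented response fields.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b-instruct-q4_0",  # placeholder tag; use whatever model you're testing
        "prompt": "Explain what artificial intelligence is.",
        "stream": False,
    },
).json()

load_s = resp["load_duration"] / 1e9                 # model load time, seconds
prompt_eval_s = resp["prompt_eval_duration"] / 1e9   # input prompt processing time
response_s = resp["eval_duration"] / 1e9             # response generation time
tokens_per_second = resp["eval_count"] / resp["eval_duration"] * 1e9

print(f"load={load_s:.2f}s prompt_eval={prompt_eval_s:.2f}s "
      f"response={response_s:.2f}s tok/s={tokens_per_second:.1f}")
```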
The next thing is that at the end I try to do a quality analysis. It's naive at best, but I wanted to look at a couple of things. There's the model score: this is the Hugging Face Open LLM Leaderboard, and it's the average score for a particular model. You can go take a look at it, and I'll explain it later when it's more relevant, so it's fresher in your mind when we're looking at it. Then I do the same thing with the Arena score from the Chatbot Arena LLM leaderboard. That's basically a head-to-head competition between LLMs: the community does pairwise comparisons, is model A better than model B, like a voting system, and they have a leaderboard based on that. I feel like that's relevant because it's user-driven, which model is best in practice, rather than benchmark-driven, so I think both of these metrics are useful for different reasons. Anyway, those are all the things I use to try to evaluate and explain the results.
The next thing is the overall, macro-level view. I tried to keep things as broad as possible, then we'll get more granular after this, and then we'll look at quality. Sorry about that, guys, coughing again. What I'm going to do here is explain this chart, and there are a couple of subcases I think are interesting to look at. The first and foremost way these GPUs should be assessed, I think, is raw performance, and the two most important metrics in my mind are tokens per second and total duration: how much token generation are you getting, and how long is it actually taking? It shouldn't surprise anybody that the RTX 3090, and this is holistically by the way, we'll look at other cases as well, outperforms: it has roughly double the token generation of the P100. The P100 just barely beats the P40, the CPU cases are significantly lower, and then, as you can imagine, the case with all the GPUs together is the worst-case scenario, not only because the GPUs are mismatched, with different bandwidths and all these other things going on, and data has to move through more GPUs, but also because there's one case in there where the models can't fit entirely on the GPUs and have to use CPU RAM as well. I'm sure that dragged the average down significantly. Another thing to note is that the RTX 3090 numbers cover both the single-GPU and double-GPU cases: it's the average over all the prompts, all the models, all the combinations, for both of those cases. So it's a pretty good overview if you just want a top-level view, but I'm going to throw in some caveats.
The first one is that while I was putting this together, I realized that in one of the cases one of the models failed and has zero values, so for the sake of removing outliers we're going to take that out of the equation; I believe it's the Q3_K_S, this one right here. If we take that away, the picture looks a lot more normal, and this is much more what I expected, because in the other view you'll notice the time curve went up even though the tokens per second stayed flat, or almost flat, for the P40 in that case. Anyway, when you remove that outlier you can see much more clearly what things should look like, and the P40 actually does a better job overall, from my empirical testing, than the P100, which is surprising, because the theoretical specs say the P100 should probably do a little better, especially for the LLM cases, since it has better half precision. That's just another reason I don't think you can take theoretical, non-tested specs with any kind of certainty, because these results don't match what we should be seeing if the theoretical specs were the whole story. Of course, it could be that I'm doing something wrong, or that there's something weird in my setup, but this is my second set of tests now and it seems to say the same thing, so I don't think that's the case. OK, now that we've taken out the outlier, let's look at some other interesting cases.
Let's look at only one GPU, so we'll deselect the rest. What we find is that in the one-GPU case, the performance gap between the 3090 and the P40 increases quite a bit, and the gap between the P40 and the P100 increases quite a bit as well. It's interesting, although it actually makes sense: when everything's happening on one GPU and there's no data transfer, the raw power of the 3090 is able to outperform the older GPUs by even more. Now, if we look at just the two-GPU case, we're pretty much flat; there's some slight improvement from the P40 and the P100, but not a whole lot, and there's also another huge gap in those cases. So overall, it doesn't really matter how you slice it, at least as far as this is concerned: whether it's two GPUs or one, the RTX 3090 significantly outperforms the P40. But if you're talking about the P40 versus the P100, the bandwidth difference can really matter between those two GPUs going from the one- to the two-GPU case. Another caveat here is that the 3090s have NVLink, so if they didn't, you might see these two cases come in lower. That's not something I've explicitly tested; it's on my list of things to do, to test NVLink explicitly and show exactly what kind of performance gain you get with it versus without it.
OK, the next thing here. Let me reset everything back to normal; we'll keep that model excluded because it's outlying data anyway. So now we're moving to price per token, or rather price per token per second, keeping things holistic, and this is probably the most important metric. I wanted to go over it second because a lot of people just care about raw performance, so I wanted to give those people the performance summary first, but I do think that price per token per second is probably the most important metric, because objectively, yes, the 3090 is the best in terms of performance, and you can see that, it's pretty obvious, but it is not the best in terms of price per performance in every case, meaning what you're actually paying for that performance.
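Just to spell out the metric, here's a tiny sketch of how a price-per-token-per-second figure can be computed. The function name and the example numbers are purely illustrative placeholders, not values from the spreadsheet.

```python
def price_per_token_per_second(gpu_price_usd: float, avg_tokens_per_sec: float) -> float:
    """Dollars paid per 1 token/s of sustained generation throughput (lower is better)."""
    return gpu_price_usd / avg_tokens_per_sec

# Illustrative placeholder values only; the real prices and averaged tokens/s
# come from the linked spreadsheet, not from this example.
print(price_per_token_per_second(gpu_price_usd=300.0, avg_tokens_per_sec=20.0))  # -> 15.0
```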
Overall, the 3090 is still better than the P40, significantly better than the P100, and way better than all of the GPUs put together; and of course that's expected, the mismatched GPUs drive the price way up relative to the performance, and I also tested one case there where, like I said, the models load onto CPU RAM as well, which inherently slows things down. So I don't feel like that's a very relevant data point either; I just wanted to put it in to show you how drastically different things can be when you're using a bunch of mismatched GPUs and also spilling onto the CPU: the performance really goes down and the price goes up. Now, that said, that setup does allow you to run models you wouldn't be able to run otherwise, just in terms of VRAM and loading larger models, so if you have the GPUs available and you want to run those models, sometimes it makes sense even if the performance is bad. I just wanted to highlight that as well. But for now we'll take that out; I'll remove the six-GPU case and we'll just look at the P100, P40, and 3090, which makes for a much nicer graph.
OK, so now let's look at the one-GPU and two-GPU cases, and this is where I was saying that in some cases it's better and in some cases it's not. If you look at the one-GPU case, the P40 is actually better than the RTX 3090 in terms of price per token per second. So if you want to get the best value for your money, this is saying the P40 is better than the RTX 3090. Objectively speaking, I'll debate whether it's worth buying or not when we get to the next chart, but the results show that from a pure price-per-token-per-second standpoint, the P40 beats both the RTX 3090 and the P100 in the single-GPU case. Now, if we go back and look at two GPUs, that's obviously no longer true, and in fact we see a significant improvement for the RTX 3090, and again this is probably due to the NVLink setup, at least that's my hypothesis. That could be another factor: we're using a different interconnect technology here, and that's what's driving the larger performance increase in the two-GPU case. So to summarize, guys: I just wanted to show you that, yes, the RTX 3090 is the best, which is kind of obvious, but this is also designed to put numbers to it, so if you want to know how much token generation you can actually get from the 3090, this is a pretty good place to start, and we'll look at specific cases and particular models in the next part of this video. The other question is which option is the best use of your money, which is something I always try to keep in mind, because it's important to me: I have limited resources and limited capital, so I want to know how best to use them. All right, so now let's take a little more granular approach.
I wanted to do three more analyses (there we go, "analyses"). First, I wanted to look at things model-wise: basically, from a tokens-per-second standpoint, each model against each GPU, plus the other cases, the CPU and the all-GPUs setup. Next, I wanted to look at prompts: for example, does prompt difficulty actually have an impact on token generation, on tokens per second? And finally, does the hardware environment, where these computations are actually being run, have an impact on the load duration, the prompt evaluation duration, and the response evaluation duration, the time it actually takes to do this processing? Does the hardware environment impact that a lot?
We'll start with the model-wise analysis. Before we do anything, let me take out this ratio here, because we don't need it; it's something I included for the next analysis later, so this is just to make things easier to see. And before we get too much further into this analysis, I want to pause and talk about this black line, because it's very important. This black line is, to me, the point at which a model is usable. What I mean is, from a tokens-per-second standpoint, is the GPU-and-model combination, on the hardware it's running on, fast enough that you could actually use it in production without wanting to rip your hair out? I did a lot of testing and kept asking, OK, how can we say a GPU, or a GPU-and-model combination, is worth buying and using; where do you draw the line in the sand, I guess, is what I was trying to figure out. The idea I came to, and I actually love this, is a tokens-per-second visualizer. I was going to build one, but I did some looking and somebody had posted one on Reddit, so I used that. Basically, you just enter a speed and it shows you what the generation looks like at that rate. The reason I have the benchmark line set at 10 is because this is what 10 tokens a second looks like, and I consider that about as slow as I'm comfortable going. It's still, objectively, kind of slow, but this is where I drew my line: if I can get a system to run an LLM at 10 tokens a second, I'm comfortable with that, I'm happy with that, because for small things you could actually have a chat almost in real time at something like this. And of course I'll link this website in the video description, because I think it's great.
OK, so that's 10. Let's refresh the page, and just to give you a comparison, I think 80, or maybe 90, was the maximum we saw today, so we'll say 90: that's what 90 looks like, and that's extremely fast, closer to the generation speed you'd see from ChatGPT or some of the other paid services. Then, by comparison, if we look at five tokens a second: some people might argue this is doable, and while I think it's probably fine for some things, to me you're dropping below the threshold where it's actually conversational; it's no longer what I would consider real-time enough to hold a conversation. This would have to be something you use, or maybe build into, an API where you're not actively communicating through a GUI. If you're building an application, that may be where something like five tokens a second is still relevant; it would still be slow, and you'd probably have to adjust timeouts for large responses, but it would be doable as long as it doesn't need to happen in real time. Anyway, I just wanted to explain why I drew my line where I did, and why I think 10 tokens a second is a tangible lower bound, shall we say, for what is usable on these GPUs.
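If you want a rough feel for these rates without the website, here's a tiny terminal sketch that just prints placeholder words at a fixed tokens-per-second rate. It is not the visualizer linked in the description, and printing whole words is only an approximation of real token streaming.

```python
import sys
import time

def simulate(tokens_per_second: float, n_tokens: int = 60) -> None:
    """Print placeholder 'tokens' at a fixed rate to get a feel for generation speed."""
    delay = 1.0 / tokens_per_second
    for _ in range(n_tokens):
        sys.stdout.write("token ")   # stand-in for a generated token
        sys.stdout.flush()
        time.sleep(delay)
    print()

simulate(10)  # roughly the 10 tok/s usability threshold discussed above
simulate(5)   # noticeably below conversational speed
```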
So, testing against that bound, let me exclude this model here as well, just to keep everything consistent, since we took the outlier out of the last chart. What we're seeing is that the RTX 3090, in every case it was tested, performed at or above our minimum requirement. To me that's a check for any of these models: basically, anything that fits fully in its VRAM, in either the single- or double-GPU case, is totally fine to run, and you can use it for conversational applications. Where you probably get better bang for your buck with the P40 is if you're going to use it as a single GPU and run up to the full 8 billion instruct Llama model; that's where I'd draw the line. If that's your goal and that's all you need, then the P40 is probably a better buy, to be honest with you.
Then, as we can see here, in the CPU case, none of the CPU setups I tested did well enough, at least on average, for me to even recommend using the CPU. I was kind of disappointed; I was hoping to see better performance out of llama.cpp, to be honest. I was hoping it would make it realistic to just use a CPU, and I think in some cases you still can: if, like I said, you want to use it behind an API, or you don't need it for real-time chat interfacing, it'll probably be fine. And then the all-GPUs scenario is pretty bad; it basically starts where the CPU cases end, and in most cases it falls behind as well. So if you're going to run a lot of GPUs and you want it to be performant, you're probably going to need something with high bandwidth like the 3090, or something a bit more robust than the P40 and P100. Again, this is a mismatch of everything, and there were cases where I ran things that were too big, and in that situation you get almost zero performance, which drags the overall average down quite a bit; but even for the other models it's still not good enough to run things the way I think you would want. In any case, guys, that's the analysis here, and you can dig into it more deeply if you'd like: you can select only one GPU, or two GPUs, and so on. Note that not all the tests overlap, because I didn't want to waste time, and sometimes I couldn't go to the next model size without spilling onto the CPU, so I tried to test the models where it made sense. Unfortunately, that means there are gaps where things don't overlap quite as well. Let's see... yeah, I think that pretty much covers this one.
So let's move on to the prompt analysis. What I was expecting to see here is that, as the prompts increase in difficulty, there would be a slight upward trend, but you don't really see much of anything. This is tokens per second, by the way, and this blue chart here is the total duration, the average total duration, for each of the prompts. What you see is that each prompt, on average, takes more or less time to get through, which is basically attributable to how many tokens are being generated. From a pure token-generation standpoint this one is the most complex, and it has a slightly... I don't think that's the lowest, let's see, that's 18.7 and that's 18.75, OK, so yes, this is actually the lowest tokens-per-second case. But just because the most tokens are generated here and it takes the longest doesn't necessarily mean there's a one-to-one, linear decrease in tokens per second versus, say, the second-longest case. What I'm trying to say is that it doesn't look like the prompt has much to do with the token generation rate; maybe it makes a very small impact, and maybe that's just because the prompts I tested weren't that representative of all the different cases there could be. From a prompt-analysis standpoint you could probably do a lot more here and go a lot more in depth, but for the sake of this study, and keeping things reasonably short, I would say that overall the prompt is not something that really impacts tokens per second all that much, at least according to these results. The next thing, and the last analysis I did, is looking at things from a hardware standpoint.
I wanted to understand whether the hardware environment actually plays a role in the time it takes to generate these responses, and if so, where most of the time is spent; that was really what I wanted to find out. And there isn't much here to tell us which hardware environment is better: in relative terms (this chart is percent of total time), there's really not much difference between any of them. But one thing is overwhelming: the majority of the time is spent in response evaluation, actually generating the response. The loading is negligible, and the prompt evaluation is, in a lot of cases, also pretty much negligible. In fact, let's see: the prompt evaluation is 8.2% here, and these others are 0.23, 0.27, and, sorry, 0.29, and this one I can't even get my cursor on, honestly. Anyway, these are almost exactly the same; I thought maybe one would be a little better than another, but they're so small it's pretty much negligible. The load time is essentially zero, comparatively speaking, and the prompt evaluation is almost negligible compared to the response. So the overall summary, the important takeaway, is that your time is being spent in response generation, it's pretty much hardware agnostic, and the hardware doesn't really make that much of a difference in that regard.
All right, the last thing here, guys, is that I wanted to do a model cost versus quality comparison, or quality versus cost, I should say. Like I said, I use two different publicly available metrics here; I was trying to get a hard measure of how good a model is, and then what you're paying for that goodness, basically. I used the Open LLM Leaderboard, which you can find, like all of these links, in the video description; it runs a bunch of benchmarks and takes the average, so you can compare these different models apples to apples in kind of an academic way. I took that average, and that's what this number is here, for the 8 billion at half precision and the 70 billion instruct at half precision; those are the only ones actually listed, or at least that I could find on the leaderboard, so that's why they're the only two here, which is fine, it gives us a quick, easy way to compare. The other thing is the Chatbot Arena LLM leaderboard: it lets the community put two different models side by side, give them the same prompt, and choose which one actually performs better, and I took the Arena score from that as another benchmark. So we have the academic one from Hugging Face, and then this tells us, in practice, how much better one model is versus the other, and how noticeable that is to the average person.
The Arena score is a little bit convoluted, but this page explains how it works. It may not be ideal for an apples-to-apples comparison like this one; it's probably better suited to generating a ranking, but I think for the purposes of this study it should be fine. Basically, it's an Elo-style score that quantifies LLM performance; it uses the Elo rating system and it's also based on a Bradley-Terry (BT) model. I don't want to spend too much time on this because it's really not that relevant here, but if you're curious how it's calculated, that's all described here, and they link the paper as well, so you can go read it if you want; it's 29 pages, and I personally just didn't have time to go through it. Anyway, I just wanted to show you where the score comes from if you're curious. For the purposes of our study we're going to compare the scores directly, even though that may not be quite right; it may not be scaled correctly for this kind of comparison, but we're going to do it anyway. I just wanted to acknowledge that in case anybody was also thinking it.
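For reference, and this is just the standard textbook form rather than the leaderboard's exact implementation, the two rating schemes mentioned above boil down to simple pairwise win-probability formulas:

```latex
% Standard Bradley-Terry and Elo forms (general definitions, not the
% Chatbot Arena implementation details):
% Bradley-Terry: probability that model i beats model j, given strengths \theta
P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}
% Elo: expected score for model i with ratings R_i, R_j
E_i = \frac{1}{1 + 10^{(R_j - R_i)/400}}
```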
So let me explain what we're trying to get at here. What I did is say: OK, here we have these two ways to assess model quality, the Hugging Face score and the LM Arena chatbot score, and then what is the minimum cost to actually run each model, i.e., let's buy a GPU we can put it on. For the 8 billion, that works out to one single P40. You could probably use the P100 if you really wanted to, but I don't know that the model would fit all the way onto the GPU, because it's exactly 16 gigs and the GPU itself has exactly 16 gigs as well; you could try it, and it might be a little bit cheaper, but I just chose the one I knew would work. Then I did the same thing here: this is the equivalent of, I think, six P40s, yes, I just did the calculation, six P40s, which is the bare minimum you would need to be able to run this model, and that would cost you about $2,000. So anyway, this column is the minimum cost to run.
The next thing is the scores: the Hugging Face score for the 8 billion and for the 70 billion, and then the LLM Arena score for the 70 billion and the 8 billion. And then these are cost-adjusted: basically, how much are you paying for a one-point increase in score? In this case you're paying $24 for a one-point increase in the Hugging Face score, and here you'll pay $48; similarly, for a one-point increase in the LM Arena score, you're paying 29 cents here and $1.61 here. What I'm getting at with all of this is the ratios.
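Putting the relationship the video walks through into a couple of lines (using the approximate figures quoted in the transcript, so these are rounded rather than exact):

```latex
\frac{\text{cost per point}_{70B}}{\text{cost per point}_{8B}}
  = \frac{C_{70B}/S_{70B}}{C_{8B}/S_{8B}}
  = \frac{\text{cost ratio}}{\text{quality ratio}}
% Hugging Face score: cost ratio ~6x, quality ratio ~3x    -> ~2x the cost per point
% Arena score:        cost ratio ~6x, quality ratio ~1.06x -> ~5.6x the cost per point
%                     (the video quotes 5.64x)
```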
Just from a minimum-cost-to-run perspective, you're going to pay six times more to run the 70 billion instruct model than you would if you just stuck with the 8 billion. So, for paying six times more, what are you actually getting in terms of quality? If we look at the cost per Hugging Face score point, or really, let's look at the score itself first: you're getting roughly three times the quality, in terms of academic benchmarking from Hugging Face, with the 70 billion instruct versus the 8 billion. So you're paying six times more but only getting three times the quality. Maybe that's worth it; it depends on your use case, on what you're trying to do, and on what level of quality you actually need. Is that sixfold increase in cost worth the threefold increase in quality, if that benchmark is in fact indicative of quality? The next interesting thing is the ratio standpoint: how much are you actually paying for one score point, and how much more expensive is one score point on one model versus the other? It's actually twice as expensive. A better way to say it is that a point of quality on the 70 billion is twice as expensive overall as on the 8 billion, and you could get basically the same number by just dividing those two cost-per-point ratios. What this boils down to is that you're paying six times the price for something where you're only getting three times the benefit. My point is that the cost you pay doesn't scale linearly with quality, with the amount of VRAM, or with the compute power; you're always going to pay a lot more, almost exponentially more, for minor gains. That's what I'm trying to get at here.
You can see that even more starkly, probably, with the LM Arena score: there's an almost negligible difference between the score people actually give the 8 billion model and the score people give the 70 billion model. What that says is that, in terms of what people can actually tell, or how people actually use these models, the 8 billion is almost just as good as the 70 billion. In that sense you're paying six times as much for a 1.06x increase in model quality, in terms of what people actually feel when they're using it, and that's basically what this ratio is: you're paying 5.64 times more for essentially the same quality. So this is just a metric to compare apples to apples in terms of quality, and what it's saying is that you're paying twice as much per point of quality in terms of the academic benchmark, and almost six times as much per point of quality in terms of the tangible benefit people actually perceive, based on the Chatbot Arena benchmark.
So anyway, my point, guys, is that I'm not trying to dissuade anyone from picking one GPU versus another. What I am saying is that I think people often get enamored with the new models, the better performance, all these other things; it's new and shiny and bright, and you forget there's a cost associated with all of it. The newest and biggest and fastest is not always necessarily better when you factor in cost; of course, cost aside, it's always better, you always want the best performance, but when you start factoring in cost there will always be trade-offs. That's part of the reason I started this channel: I think it's so important to keep costs in mind when you're making these kinds of decisions, because (a) resources are limited, and (b) it really depends on your use case, and if you're smart about the way you choose things, you can maximize your money and get the most benefit for whatever project you're trying to pursue. So anyway, guys, that's a wrap for this one. I know it's already been really long, and I'm sorry; I've been trying to keep my videos shorter, but sometimes there's just a lot of information, and today was one of those days where there was a lot to present. I'll try to make it as digestible as possible in the description so you can jump around and don't have to listen to me drone on about stuff you don't want to hear. But anyway, guys, I really appreciate it, I hope you enjoyed this video as much as I enjoyed making it, and I will see you in the next one. All right guys, a brief reminder: if you enjoyed the content, please consider giving me a like and a subscribe so I can continue to grow and produce better and better content for you. If you really enjoyed the content, you might consider buying me a coffee, and the link for how to do that will be in the video description below. If nothing else, please just give me some feedback in the comments and let me know how I'm doing, if anything's unclear, or if there's anything I can improve on. Thank you again, guys, and have a great rest of your day. Bye-bye.