How to Benchmark LLMs Using LM Evaluation Harness - Multi-GPU, Apple MPS Support
By Uygar Kurt
Summary
Topics Covered
- Eval Harness Emerges as LLM Standard
- MPS Demands Float16 Over BFloat16
- Merge Accelerate and Parallelize for Max GPU Use
Full Transcript
Hey guys, today I'm going to show you how you can use the evaluation harness by EleutherAI. First of all, I left my equipment at home and I'm on vacation right now, so the voice quality may not be that good. Sorry for that. But as always, I will continue to provide you with high-quality content.

So first, why the evaluation harness? There has been a lot of debate about how we should evaluate LLMs on benchmarks, what to use, what the universal way is, and so on. There's also an interesting blog post, "What's going on with the Open LLM Leaderboard?", which I suggest you read as well; I will put it in the description. Overall, the reason I chose to show you the evaluation harness by EleutherAI is that it has pretty much become a standard for evaluating LLMs, both in industry and in academia, so it is highly likely that you will encounter this library or results produced with it.

I will show you how you can use it to evaluate your LLMs: first on a MacBook using MPS, the Apple Metal accelerator, then on a single Nvidia device, and finally the different ways you can evaluate your LLMs on multiple Nvidia devices. I will put the GitHub link in the description down below, and everything I do here will go on GitHub as well. Before we start, don't forget to like this video, subscribe to my channel, and let me know in the comment section what you want to see next. Now let's go.
Let's start evaluating LLMs with the evaluation harness on a Mac, using MPS, the Apple Metal accelerator. We will start by installing lm-evaluation-harness; the procedure is the same whether we are on Mac, Linux, or anything else. We copy the commands under the install section of the README. Now I'm in Visual Studio Code on my MacBook, so I'm going to just paste them in.
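The install section boils down to a few shell commands; this is a minimal sketch of what gets pasted, and the --depth 1 flag is only there to keep the clone small:

```bash
# Clone the EleutherAI evaluation harness and install it in editable mode
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```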
Since I already installed it, it won't install again, but in your case it will. Let's press enter and wait for it. Okay, the process is complete, and as you can see we are inside the lm-evaluation-harness repository; if we execute ls, we see all the files of this repository.
Since the library is installed, you have two options for running the evaluations. You can execute everything in the terminal, since these are just shell commands, but for the sake of consistency I will do everything inside a Jupyter notebook. So I'm inside my Jupyter notebook now.
The first command you will see is the one that lists the tasks: it prints every task the evaluation harness supports. If you run it, you get a very long list; as you can see, VS Code even truncates it. If you like, you can take a look through them.
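In recent versions of the harness this is exposed through the lm_eval entry point; roughly (the leading ! is only needed inside a notebook):

```bash
# List every benchmark/task name the harness knows about
!lm_eval --tasks list
```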
Now let's move on to the actual evaluation part. We start with an exclamation mark: since we are executing terminal commands inside the notebook, we need it, but if you are running directly from the terminal you can omit it. The way we call the evaluation harness is lm_eval, and we will use a couple of flags here.

First is --model. We will use a Hugging Face model, so we pass hf here. The second one is --model_args, where we feed in the Hugging Face path to our model. In this case it is under HuggingFaceTB; I will use SmolLM 135M Instruct. You can use whatever you want, but since this model is small, it is fast and fits into memory. Also, if you give a local path here as the argument, it will load your local model instead.

Next is the data type. Usually we use bfloat16, but since we are on a MacBook, if you try bfloat16 with MPS you will get an error: MPS doesn't support bfloat16, but it does support float16. So we set dtype=float16 in the model arguments.

Then we give some tasks. In this case we will use two popular ones, winogrande and lambada_openai. The way this works is that you pass the --tasks flag with the task names from the task list we printed above, comma-separated, and the evaluation harness will evaluate all of them.

One of the important flags, especially on a Mac, is --device. We are using the MPS accelerator, so we pass mps as the device. The batch size of course depends on your machine; in my case, 256 worked just fine. And we can specify the output path with the --output_path flag; here I want the results to be written to the out folder. So these are the flags I used for this run.
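Putting those flags together, the notebook cell looks roughly like this (the repo id is my reading of the model named in the video, and I reduce the batch size in a moment, so adjust it to your machine):

```bash
# Evaluate SmolLM-135M-Instruct on winogrande and lambada_openai using Apple's MPS backend
!lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM-135M-Instruct,dtype=float16 \
    --tasks winogrande,lambada_openai \
    --device mps \
    --batch_size 256 \
    --output_path out
```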
If you want to see all of the supported arguments, it's simple: go back to the repository, scroll down, and under the user guide there is a page detailing the full list of them. Click through and you will see everything: the tasks and task lists, arguments that control generation such as top-p and top-k, the batch size, the device (we used MPS), the output path, and a bunch of other arguments that I suggest you check out; they may be useful to you. I will put a link to that page in the description as well. So, let's come back here and execute this.
Actually, I will decrease the batch size to 32, because I just ran this and it was overwhelming for my system with the recording and all the other tabs open; all of my recordings crashed and I had to start over. So I'm decreasing it to 32. You should decide according to your own system. Let's run this again. We can see it is using the MPS device, the Apple accelerator, which is good.

Now let's see if we can track the GPU usage. If you open Activity Monitor, go to the Window menu, and toggle on GPU History, it shows you this chart of GPU usage. The spike on the far left was from my initial run, where all of my recordings crashed. It went down after I terminated everything, and now it is climbing again and the GPU is being used at full capacity. This is very similar to nvidia-smi: you can open it and track your GPU usage here. Now let's just wait for this to complete.

The execution is complete. Let's check the GPU history again: during that peak we did the evaluation, and now the GPU usage has dropped back down to normal.
Now, what you get as output is this: it lists the tasks. For lambada_openai the metrics are accuracy and perplexity; for winogrande we have the accuracy metric, along with the values and the standard errors.

Let's also look at the output files. We set the output path to out, so let's go there. As you can see, it created a new folder named after the model we used, following the same Hugging Face path (HuggingFaceTB), and inside it we have the results.json file. This JSON file contains the results for lambada_openai: the perplexity and its standard error, plus the accuracy and its standard error. The same goes for winogrande: accuracy and its standard error. You also get the configuration for each task. For lambada_openai we have our data type and our model; the same for winogrande, which is a multiple-choice task, again with the data type. Then there are the task versions and some other configuration details, such as the number of parameters (it is a 135-million-parameter model, so this is correct), the random seeds, the date, some environment info, some template information, some tokenizer information, and the evaluation time.
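If you want to poke at the same numbers from the shell, you can just list what was written under the output path; the exact file layout can vary between harness versions, but in this run it is a folder named after the model with the results JSON inside:

```bash
# Show everything the harness wrote under --output_path;
# the results JSON holds per-task metrics (acc, perplexity, stderr) plus the run configuration
ls -R out
```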
So this is basically how this library works and how you use it on an Apple device.
Now let's switch to working with an Nvidia GPU, and then multiple Nvidia GPUs. I'm inside my Kaggle notebook. The first thing I'm going to do is open up this panel and select two T4 GPUs as the accelerator. I'll show you how to do it with a single GPU first, but there will be cases where you have multiple GPUs and want to take advantage of them, or where your model is so big that it won't fit on a single GPU, so you would like to spread it across several. We will go over them all one by one.

The first thing we will do is install the evaluation harness, the same way as before: run the git clone, and once that is done, copy the remaining commands, open up the terminal, paste them, and hit enter. Now it is installed. We can close the console and restart the notebook: do Restart and clear outputs, and everything should be set.
Now let's start by doing the exact same evaluation we did on the MacBook, with the only difference being that we use a single Nvidia GPU. Let's paste it. As you can see, everything is the same; the only difference is that the device is cuda:0, which means it will run the evaluation on GPU 0.
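Concretely, the only change from the Mac command is the --device flag; a sketch of what gets pasted here, under the same model, tasks, and batch size:

```bash
# Same evaluation as before, but on the first CUDA device instead of MPS
!lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM-135M-Instruct,dtype=float16 \
    --tasks winogrande,lambada_openai \
    --device cuda:0 \
    --batch_size 32 \
    --output_path out
```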
So let's run this. We can also view the GPU usage from here: this is GPU 0 and this is GPU 1, and GPU 0 is being utilized while GPU 1 is not, which is what we expected. We get the results in the same format as before. There were some warnings initially, but you can just overlook those. Let's also look at the files being saved: under out, we again have the resulting JSON files. I won't go over them since we did that in the Mac part; if you are curious what these JSON files look like, just go back to the end of the Mac section of this video.

So that's the same thing we did with MPS, now done with a single Nvidia GPU. But we have two GPUs, so let's use both of them.
One of the ways to do that is described on the GitHub page of the evaluation harness: if you scroll down in the user guide, there is a multi-GPU evaluation section, and we will mostly be following that; you can also check it out yourself to make use of the two GPUs we have. One thing we can do is load a full copy of the model onto each GPU and feed them data separately. For that we will use accelerate. What we do is simple: we just add accelerate launch -m right before the lm_eval command, and we don't specify any device. With this, the model is loaded onto each GPU separately.
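This is the data-parallel option from the harness's multi-GPU guide; a sketch of the command with the same small model (I switch the task to hellaswag here, for the reason I explain next):

```bash
# Launch one evaluation process per GPU; each process holds a full copy of the model
!accelerate launch -m lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM-135M-Instruct \
    --tasks hellaswag \
    --batch_size 32 \
    --output_path out
```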
So let's run this, and let's also open the GPU utilization panel to track what's going on. As you can see, we are utilizing both of the GPUs. Note that as the task I chose hellaswag this time, because it is a larger dataset and the evaluation takes longer, which lets me actually show you the GPU usage of this approach. And here we have the results; again, you don't have to worry about these warnings.
Now let's move on to another scenario, where you have a big model, for example this 13B Llama 2 chat model. This is a model that won't fit on a single GPU. We can actually verify that by running the evaluation: we expect it to give us a CUDA out-of-memory error, and indeed we get an error; looking at it, it is the CUDA out-of-memory error we expected. The way to overcome this and evaluate a big model that doesn't fit on a single GPU is to split the model across multiple GPUs. How we do that is as follows: everything is the same, we execute lm_eval, and in the model arguments we pass the keyword parallelize=True. That is the key, and we don't set any device. There are also arguments for pinning the model to specific GPUs, but since we only have two GPUs anyway, I didn't set any of those.
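A sketch of this model-parallel variant; the repo id, task, and batch size here are illustrative placeholders, since the video doesn't spell them out (Llama-2-13b-chat-hf is the usual Hub id for the 13B Llama 2 chat model):

```bash
# parallelize=True shards one copy of the model across all visible GPUs; note there is no --device
# (model id, task, and batch size below are placeholders, not taken from the video)
!lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-13b-chat-hf,parallelize=True \
    --tasks hellaswag \
    --batch_size 8 \
    --output_path out
```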
We can just run this and see that it actually does the evaluation while utilizing multiple GPUs: we are able to fit this model onto this machine by splitting it across the two GPUs, GPU 0 and GPU 1. One thing you may notice, though, is that the utilization on one GPU is at 100% while the other sits at 0%, even though the model is split between the two. My take on this is that even though the model is split across two GPUs, the data still flows through one GPU at a time; as you can see, the load just moved back to GPU 0, and it keeps fluctuating.

The solution is to merge the two approaches: the one we did in the beginning with accelerate, where we parallelize over the data, and this one, where we split a single model across multiple GPUs. I will kill the current run because it would take a long time; if you like, you can keep yours going. To merge the two approaches, we literally merge the commands: we use accelerate launch -m as in our regular command, and additionally we add the parallelize=True flag. With this, we split the model across multiple GPUs and we utilize all GPUs at the same time.
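Merging the two, the launch looks roughly like this (same placeholder repo id, task, and batch size as above):

```bash
# Data parallelism via accelerate, plus parallelize=True to shard the model across GPUs
!accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-13b-chat-hf,parallelize=True \
    --tasks hellaswag \
    --batch_size 8 \
    --output_path out
```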
Let's run this and again view the GPU utilization. Now it has started the evaluation; looking at the GPU consumption, you can see the model is split across the two GPUs, we were able to load it, and the utilization on both of them is at full. This is the optimal way to utilize your GPUs. Again, this will take a while for the evaluation to complete, so I will just kill this job; if you want to see the results, you can keep it running.
And that's all I wanted to show you. I will upload all of these notebooks to GitHub and put the link in the description. To summarize: we used the evaluation harness on a MacBook with the MPS accelerator, then we switched to an Nvidia GPU, we ran a small model on multiple GPUs, and then we took a large model that doesn't fit on a single GPU, split it across multiple GPUs, and saw how to utilize every GPU to its full potential. This is not a full guide to the evaluation harness library, so as I suggested, go check out the additional arguments, actually read the documentation, and look at the additional settings you can pass. If you're in industry, it is quite possible you are using NeMo models, and you can check out how to evaluate those as well. So, that was all from me today. I hope my voice wasn't too bad, since I don't have my equipment with me. Thank you for listening. Don't forget to like the video, comment what you want to see next, subscribe to my channel, and see you in another video. Bye-bye.