How to Benchmark LLMs Using LM Evaluation Harness - Multi-GPU, Apple MPS Support
By Uygar Kurt
Summary
Topics Covered
- Eval Harness Emerges as LLM Standard
- MPS Demands Float16 Over BFloat16
- Merge Accelerate and Parallelize for Max GPU Use
Full Transcript
Hey guys, today I'm going to show you how you can use the evaluation harness by EleutherAI. First of all, I left my equipment at home and I'm on vacation right now, so the voice quality may not be that good. Sorry for that. But as always, I will continue to provide you with high-quality content.

So first, why the evaluation harness? There has been a lot of debate about how we should evaluate LLMs on benchmarks, what to use, what the universal way is, and so on. There's also an interesting blog post, "What's going on with the Open LLM Leaderboard?", which I suggest you read as well; I will put it in the description. Overall, the reason I chose to show you the evaluation harness by EleutherAI is that it has pretty much become a standard for evaluating LLMs, both in industry and in academia, so it is highly likely that you will encounter this library or results produced with it.

I will show you how you can use it to evaluate your LLMs: first on a MacBook using MPS, the Apple Metal accelerator, then on a single Nvidia device, and finally the different ways you can evaluate your LLMs on multiple Nvidia devices. I will put the GitHub link in the description down below, and everything I do here will go on GitHub as well. Before we start, don't forget to like this video, subscribe to my channel, and let me know in the comment section what you want to see next. Now let's go.
Let's start evaluating LLMs with the evaluation harness on a Mac, using MPS, the Apple Metal accelerator. We will start by installing lm-evaluation-harness; the procedure is the same whether we are on Mac, Linux, or anything else. We copy the commands under the install section of the README. Now I'm in Visual Studio Code on my MacBook, so I'm going to just paste them in.
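The install section boils down to a few shell commands; this is a minimal sketch of what gets pasted, and the --depth 1 flag is only there to keep the clone small:

```bash
# Clone the EleutherAI evaluation harness and install it in editable mode
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```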
Since I already installed it, it won't install again, but in your case it will. Let's press enter and wait for it. Okay, the process is complete, and as you can see we are inside the lm-evaluation-harness repository; if we execute ls, we see all the files of this repository.
Since the library is installed, you have two options for running the evaluations. You can execute everything in the terminal, since these are just shell commands, but for the sake of consistency I will do everything inside a Jupyter notebook. So I'm inside my Jupyter notebook now.
The first command you will see is the one that lists the tasks: it prints every task the evaluation harness supports. If you run it, you get a very long list; as you can see, VS Code even truncates it. If you like, you can take a look through them.
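In recent versions of the harness this is exposed through the lm_eval entry point; roughly (the leading ! is only needed inside a notebook):

```bash
# List every benchmark/task name the harness knows about
!lm_eval --tasks list
```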
Now let's move on to the actual evaluation part. We start with an exclamation mark: since we are executing terminal commands inside the notebook, we need it, but if you are running directly from the terminal you can omit it. The way we call the evaluation harness is lm_eval, and we will use a couple of flags here.

First is --model. We will use a Hugging Face model, so we pass hf here. The second one is --model_args, where we feed in the Hugging Face path to our model. In this case it is under HuggingFaceTB; I will use SmolLM 135M Instruct. You can use whatever you want, but since this model is small, it is fast and fits into memory. Also, if you give a local path here as the argument, it will load your local model instead.

Next is the data type. Usually we use bfloat16, but since we are on a MacBook, if you try bfloat16 with MPS you will get an error: MPS doesn't support bfloat16, but it does support float16. So we set dtype=float16 in the model arguments.

Then we give some tasks. In this case we will use two popular ones, winogrande and lambada_openai. The way this works is that you pass the --tasks flag with the task names from the task list we printed above, comma-separated, and the evaluation harness will evaluate all of them.

One of the important flags, especially on a Mac, is --device. We are using the MPS accelerator, so we pass mps as the device. The batch size of course depends on your machine; in my case, 256 worked just fine. And we can specify the output path with the --output_path flag; here I want the results to be written to the out folder. So these are the flags I used for this run.
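Putting those flags together, the notebook cell looks roughly like this (the repo id is my reading of the model named in the video, and I reduce the batch size in a moment, so adjust it to your machine):

```bash
# Evaluate SmolLM-135M-Instruct on winogrande and lambada_openai using Apple's MPS backend
!lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM-135M-Instruct,dtype=float16 \
    --tasks winogrande,lambada_openai \
    --device mps \
    --batch_size 256 \
    --output_path out
```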
If you want to see all of the supported arguments, it's simple: go back to the repository, scroll down, and under the user guide there is a page detailing the full list of them. Click through and you will see everything: the tasks and task lists, arguments that control generation such as top-p and top-k, the batch size, the device (we used MPS), the output path, and a bunch of other arguments that I suggest you check out; they may be useful to you. I will put a link to that page in the description as well. So, let's come back here and execute this.
Actually, I will decrease the batch size to 32, because I just ran this and it was overwhelming for my system with the recording and all the other tabs open; all of my recordings crashed and I had to start over. So I'm decreasing it to 32. You should decide according to your own system. Let's run this again. We can see it is using the MPS device, the Apple accelerator, which is good.

Now let's see if we can track the GPU usage. If you open Activity Monitor, go to the Window menu, and toggle on GPU History, it shows you this chart of GPU usage. The spike on the far left was from my initial run, where all of my recordings crashed. It went down after I terminated everything, and now it is climbing again and the GPU is being used at full capacity. This is very similar to nvidia-smi: you can open it and track your GPU usage here. Now let's just wait for this to complete.

The execution is complete. Let's check the GPU history again: during that peak we did the evaluation, and now the GPU usage has dropped back down to normal.
Now, what you get as output is this: it lists the tasks. For lambada_openai the metrics are accuracy and perplexity; for winogrande we have the accuracy metric, along with the values and the standard errors.

Let's also look at the output files. We set the output path to out, so let's go there. As you can see, it created a new folder named after the model we used, following the same Hugging Face path (HuggingFaceTB), and inside it we have the results.json file. This JSON file contains the results for lambada_openai: the perplexity and its standard error, plus the accuracy and its standard error. The same goes for winogrande: accuracy and its standard error. You also get the configuration for each task. For lambada_openai we have our data type and our model; the same for winogrande, which is a multiple-choice task, again with the data type. Then there are the task versions and some other configuration details, such as the number of parameters (it is a 135-million-parameter model, so this is correct), the random seeds, the date, some environment info, some template information, some tokenizer information, and the evaluation time.
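If you want to poke at the same numbers from the shell, you can just list what was written under the output path; the exact file layout can vary between harness versions, but in this run it is a folder named after the model with the results JSON inside:

```bash
# Show everything the harness wrote under --output_path;
# the results JSON holds per-task metrics (acc, perplexity, stderr) plus the run configuration
ls -R out
```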
So this is basically how this library works and how you use it on an Apple device.
Now let's switch to working with an Nvidia GPU, and then multiple Nvidia GPUs. I'm inside my Kaggle notebook. The first thing I'm going to do is open up this panel and select two T4 GPUs as the accelerator. I'll show you how to do it with a single GPU first, but there will be cases where you have multiple GPUs and want to take advantage of them, or where your model is so big that it won't fit on a single GPU, so you would like to spread it across several. We will go over them all one by one.

The first thing we will do is install the evaluation harness, the same way as before: run the git clone, and once that is done, copy the remaining commands, open up the terminal, paste them, and hit enter. Now it is installed. We can close the console and restart the notebook: do Restart and clear outputs, and everything should be set.
Now let's start by doing the exact same evaluation we did on the MacBook, with the only difference being that we use a single Nvidia GPU. Let's paste it. As you can see, everything is the same; the only difference is that the device is cuda:0, which means it will run the evaluation on GPU 0.
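Concretely, the only change from the Mac command is the --device flag; a sketch of what gets pasted here, under the same model, tasks, and batch size:

```bash
# Same evaluation as before, but on the first CUDA device instead of MPS
!lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM-135M-Instruct,dtype=float16 \
    --tasks winogrande,lambada_openai \
    --device cuda:0 \
    --batch_size 32 \
    --output_path out
```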
So let's run this. We can also view the GPU usage from here: this is GPU 0 and this is GPU 1, and GPU 0 is being utilized while GPU 1 is not, which is what we expected. We get the results in the same format as before. There were some warnings initially, but you can just overlook those. Let's also look at the files being saved: under out, we again have the resulting JSON files. I won't go over them since we did that in the Mac part; if you are curious what these JSON files look like, just go back to the end of the Mac section of this video.

So that's the same thing we did with MPS, now done with a single Nvidia GPU. But we have two GPUs, so let's use both of them.
One of the ways to do that is described on the GitHub page of the evaluation harness: if you scroll down in the user guide, there is a multi-GPU evaluation section, and we will mostly be following that; you can also check it out yourself to make use of the two GPUs we have. One thing we can do is load a full copy of the model onto each GPU and feed them data separately. For that we will use accelerate. What we do is simple: we just add accelerate launch -m right before the lm_eval command, and we don't specify any device. With this, the model is loaded onto each GPU separately.
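This is the data-parallel option from the harness's multi-GPU guide; a sketch of the command with the same small model (I switch the task to hellaswag here, for the reason I explain next):

```bash
# Launch one evaluation process per GPU; each process holds a full copy of the model
!accelerate launch -m lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM-135M-Instruct \
    --tasks hellaswag \
    --batch_size 32 \
    --output_path out
```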
So let's run this, and let's also open the GPU utilization panel to track what's going on. As you can see, we are utilizing both of the GPUs. Note that as the task I chose hellaswag this time, because it is a larger dataset and the evaluation takes longer, which lets me actually show you the GPU usage of this approach. And here we have the results; again, you don't have to worry about these warnings.
Now let's move on to another scenario, where you have a big model, for example this 13B Llama 2 chat model. This is a model that won't fit on a single GPU. We can actually verify that by running the evaluation: we expect it to give us a CUDA out-of-memory error, and indeed we get an error; looking at it, it is the CUDA out-of-memory error we expected. The way to overcome this and evaluate a big model that doesn't fit on a single GPU is to split the model across multiple GPUs. How we do that is as follows: everything is the same, we execute lm_eval, and in the model arguments we pass the keyword parallelize=True. That is the key, and we don't set any device. There are also arguments for pinning the model to specific GPUs, but since we only have two GPUs anyway, I didn't set any of those.
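A sketch of this model-parallel variant; the repo id, task, and batch size here are illustrative placeholders, since the video doesn't spell them out (Llama-2-13b-chat-hf is the usual Hub id for the 13B Llama 2 chat model):

```bash
# parallelize=True shards one copy of the model across all visible GPUs; note there is no --device
# (model id, task, and batch size below are placeholders, not taken from the video)
!lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-13b-chat-hf,parallelize=True \
    --tasks hellaswag \
    --batch_size 8 \
    --output_path out
```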
We can just run this and see that it actually does the evaluation while utilizing multiple GPUs: we are able to fit this model onto this machine by splitting it across the two GPUs, GPU 0 and GPU 1. One thing you may notice, though, is that the utilization on one GPU is at 100% while the other sits at 0%, even though the model is split between the two. My take on this is that even though the model is split across two GPUs, the data still flows through one GPU at a time; as you can see, the load just moved back to GPU 0, and it keeps fluctuating.

The solution is to merge the two approaches: the one we did in the beginning with accelerate, where we parallelize over the data, and this one, where we split a single model across multiple GPUs. I will kill the current run because it would take a long time; if you like, you can keep yours going. To merge the two approaches, we literally merge the commands: we use accelerate launch -m as in our regular command, and additionally we add the parallelize=True flag. With this, we split the model across multiple GPUs and we utilize all GPUs at the same time.
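Merging the two, the launch looks roughly like this (same placeholder repo id, task, and batch size as above):

```bash
# Data parallelism via accelerate, plus parallelize=True to shard the model across GPUs
!accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-13b-chat-hf,parallelize=True \
    --tasks hellaswag \
    --batch_size 8 \
    --output_path out
```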
Let's run this and again view the GPU utilization. Now it has started the evaluation; looking at the GPU consumption, you can see the model is split across the two GPUs, we were able to load it, and the utilization on both of them is at full. This is the optimal way to utilize your GPUs. Again, this will take a while for the evaluation to complete, so I will just kill this job; if you want to see the results, you can keep it running.
And that's all I wanted to show you. I will upload all of these notebooks to GitHub and put the link in the description. To summarize: we used the evaluation harness on a MacBook with the MPS accelerator, then we switched to an Nvidia GPU, we ran a small model on multiple GPUs, and then we took a large model that doesn't fit on a single GPU, split it across multiple GPUs, and saw how to utilize every GPU to its full potential. This is not a full guide to the evaluation harness library, so as I suggested, go check out the additional arguments, actually read the documentation, and look at the additional settings you can pass. If you're in industry, it is quite possible you are using NeMo models, and you can check out how to evaluate those as well. So, that was all from me today. I hope my voice wasn't too bad, since I don't have my equipment with me. Thank you for listening. Don't forget to like the video, comment what you want to see next, subscribe to my channel, and see you in another video. Bye-bye.