
Intro to LLM Evaluation w/ OpenAI Evals [Walk-Thru]

By Manny Bernabe

Summary

Topics Covered

  • Enterprises Battle Post-Deployment AI Hurdles
  • LangChain's Four-Step LLM Evaluation Framework
  • Tailor Evaluations to Specific LLM Tasks
  • GPT-4o Outperforms GPT-3.5 Turbo by Two Points on Sentiment

Full Transcript

2025 is shaping up to be quite the year for LLMs (large language models). Until now it's been mostly AI-first companies and startups leading the charge with LLM adoption; however, we're beginning to see larger enterprises and traditional businesses get on board. The versatility of these tools is undeniable, and they have the potential to revolutionize nearly every industry. That said,

one of the biggest hurdles in deploying AI products, especially within larger enterprises, isn't the technology itself; the tech often demos very nicely, showcasing some impressive capabilities. Instead, the challenges lie in supporting the implementation of this tech, especially post-deployment, and ensuring that the system consistently delivers on value. This is where monitoring and evaluation become critical. Enterprises need peace of mind that their AI system will operate reliably, even at 3:00 a.m., handling thousands and thousands of customer inquiries. They also need to know that these tasks are being performed accurately and ethically, and that they align with evolving

regulatory requirements. In this video, I'd like to walk you through a four-step framework that will help us orient our thinking about the evaluation of LLMs, and then we're going to walk through a fairly straightforward case study that will make things a little more concrete. First, let's cover some of the core concepts in the process of evaluating LLMs. This framework is one that I borrowed from Lance Martin over at LangChain; they have a great series on evaluations, which I'll link to below. There are essentially four main components that we have to consider. Number one is the dataset: the list of items or interactions that we are evaluating. Think of this as all of the runs that your LLM performs in production, or examples of engagements an LLM might have with a user. Number two is the evaluator: the mechanism we're going to use to determine whether a run was successful or not. It acts as the benchmark, or standard, for assessing the LLM's performance. Third is the task: the specific objective that we want to accomplish with the LLM. Think of this as the particular function or job the LLM is expected to perform during the evaluation. And lastly, we have the results analysis. This final step involves bringing everything together to analyze the outcomes of our evaluation; here we determine whether the LLM has performed as expected and evaluate the success of the test.

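As a minimal sketch, the four components above can be wired together in a few lines of Python. The `classify()` function here is a hypothetical stand-in for the real LLM call, so the whole loop stays runnable offline:

```python
def classify(review: str) -> str:
    """Task: hypothetical stand-in for the LLM call that labels a review."""
    return "1" if "great" in review.lower() else "0"

def string_check(output: str, expected: str) -> bool:
    """Evaluator: exact string match against the ground-truth label."""
    return output.strip() == expected.strip()

# Dataset: items with input text and a human-made ground-truth label.
dataset = [
    {"text": "A great, moving film.", "label": "1"},
    {"text": "Dull and far too long.", "label": "0"},
]

# Results analysis: run the task on every item and aggregate pass/fail.
results = [string_check(classify(item["text"]), item["label"]) for item in dataset]
pass_rate = sum(results) / len(results)
print(f"pass rate: {pass_rate:.0%}")
```

Swapping any one piece — a different dataset, a different evaluator, a different task prompt or model — gives you a different evaluation, which is exactly why the framework separates them.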
One of the cool things about LLMs is that you can do so much with them, from sentiment analysis to labeling to translation, but that also introduces some complexity when it comes to evaluating them: there are a lot of different options to choose from for each of these components. The key takeaway here is that evaluating LLMs is not one-size-fits-all; instead, you need to carefully tailor your approach, diving deeper into each component to find the right configuration for your specific LLM, goal, and workflow. Let's talk a little bit about tooling. There are a couple

of good options available in the marketplace; if you look at any good LLM market map and focus in on evaluation and observability, you'll find plenty of great choices. These are tools that I want to explore in future videos, so stay tuned for that. But for this particular example, I'm going to keep it simple and leverage OpenAI's developer dashboard evaluation tool. It's a low-code, straightforward, and simple way to get started with these evaluations, and it's going to serve as a jumping-off point for this type of process. Let's talk a little more about the example that we're going to walk

through in this video. I've pulled a movie review dataset from Hugging Face and subset it down to 50 examples. We have movie reviews alongside an indication of whether each review is positive or negative: if it's positive, it's labeled as a 1; if it's negative, it's labeled as a 0. This serves as our ground truth. These labels were made by humans; the curators of the dataset have already done that work for us. Our task here is to evaluate how well an LLM can match these labels; in other words, does the LLM agree with the human label for whether the review is positive or negative? I'll include a repo link below so you can access the dataset and all the prompts that we're going to be using in this example.

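The upload format can be sketched as a tiny CSV with a text column and a ground-truth label column. The two reviews below are invented stand-ins, not rows from the actual dataset:

```python
import csv
import io

# Two invented rows in the same shape as the movie-review dataset:
# a "text" column and a human ground-truth "label" (1 = positive, 0 = negative).
rows = [
    {"text": "An absolute delight from start to finish.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```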
through the evaluation we're going to head over to platform. openen

ai.com and you can click here on dashboard and head over to evaluations before we start our evaluations just a high level

overview of what we're going to do. Number one, we're going to upload our data, check it out, and make sure it's okay. Number two, we're going to set the evaluation to generate responses; this is where we want an LLM to actually generate a response for each of the items we're evaluating. It sounds a little funky, but I'll explain it in just a bit. Third, we're going to set up our test criteria; this is how we're going to measure whether or not a run is successful. Then we're going to review our initial evaluation run, and as a last step we're going to compare different models, for example GPT-3.5 Turbo versus GPT-4o. So let's jump into it. Again, we're here in Evaluations under Dashboard; we're going

to select Create a new evaluation, and we want to upload a file. I've already uploaded this data (again, you can access it from the GitHub repo), so I'm just going to bring that in and take a look. What you can see is that we have a review of a movie and then we have the label: again, if it's 1 it's positive, if it's 0 it's negative. You can look through some of these as well. I've only uploaded 50 examples, but the dataset is considerably larger if you'd like to use more. Now we're going to go to the next step, and here is where we're going to have the evaluation generate responses for each one of these rows. If you head back over to the repo, I have the system prompts there, so you can grab one, copy it, and drop it here in the system prompt. We're also going to have a message prompt as well, and

that's going to be "Analyze this review"; I'll walk through this in just a bit. Okay, so the system prompt is: you are an expert in analyzing the sentiment of movie reviews; if the movie review is positive, you output a 1; if not, you output a 0; and you only output 1 or 0 for the responses, nothing else. That's the system prompt. The user prompt for each one of the runs is "Analyze this review," and then we drop in the text — the review from each one of these rows. The cool thing about this is that it's dynamic: with curly brackets, you can dynamically pull in any one of these variables. In this instance we want the text, but we're going to be referencing the label a little bit later, so let's just leave that there.

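The curly-bracket templating can be sketched as simple string substitution. The `{{item.text}}` placeholder mirrors the dashboard's item-variable style, but the `render` helper below is a hypothetical stand-in, not OpenAI's implementation:

```python
system_prompt = (
    "You are an expert in analyzing the sentiment of movie reviews. "
    "If the movie review is positive, you output a 1; if not, you output a 0. "
    "Only output 1 or 0 for the responses, nothing else."
)
user_template = "Analyze this review: {{item.text}}"

def render(template: str, item: dict) -> str:
    # Replace each {{item.key}} placeholder with the row's value.
    for key, value in item.items():
        template = template.replace("{{item.%s}}" % key, str(value))
    return template

row = {"text": "A wonderful, heartfelt film.", "label": 1}
print(render(user_template, row))
```

Each evaluated row gets the same system prompt plus its own rendered user message.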
Next, I get to pick the model that I want to generate with. In this instance, I want to see if I can get away with using a cheaper, faster model, so let's just go with GPT-3.5 Turbo and click Next. The next thing we're going to do is set up our test criteria, and this

is how we're going to gauge whether or not the output from the model is correct. In this particular instance, we're simply going to do a string check: we're going to see if the response from the model matches the ground truth. So if GPT-3.5 returns 1 and our ground truth is 1, that's a success. We do that with a string check, but there are a lot of different options here, and you can also combine them in the same run, which is really interesting. For this particular example, we're just going to look at the string check: check if the response — the sample output text returned by the model — equals the ground truth, and our ground truth, remember, is in the item label. So it's going to check whether those two match, and I can add that. I've set that up, and as I mentioned, you can add other criteria here as well; we won't do that for this particular example, but it's nice that the option exists.

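A string-check criterion is easy to sketch. The operation names below ("eq", "contains") are illustrative choices for this sketch, not OpenAI's exact identifiers:

```python
def check(response: str, reference: str, op: str = "eq") -> bool:
    """Hedged sketch of a string-check test criterion."""
    response, reference = response.strip(), reference.strip()
    if op == "eq":
        # Exact match: model output must equal the ground-truth label.
        return response == reference
    if op == "contains":
        # Looser check: ground truth must appear somewhere in the output.
        return reference in response
    raise ValueError(f"unknown operation: {op}")

print(check("1", "1"))                       # exact match passes
print(check("Label: 1", "1", op="contains"))  # substring match passes
```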
So we're going to go on to Next. Now we can name this, so we could say "movie review"; let's add that. I'm going to do a quick review to make sure that the dataset is there and the variables are

good, I've got the right model, I've got the right test criteria, and I'm good to go. One thing that I encourage you to do is to test the evaluation. This will run the evaluation on the first 10 rows, and it's a quick way to make sure that you have everything properly configured, because if you submit a 1,000-row evaluation and something is messed up, you don't want to find out after those thousand runs. So you can check that, and we can see that we have some outputs from the model here: the output says 0 but I have a 1, so that's a fail, but all these other ones are pass, pass, pass, fail. For the first 10, it's a good check. The other cool reason to do this is that it allows OpenAI to evaluate how costly it's going to be for them to run this evaluation; if it looks reasonable, they give you an option to share the results from this evaluation with OpenAI and make it free. So I'm going to click on that. Pretty much what you're doing is sharing your data, the analysis, and the runs with them, which helps them make their models better. This isn't sensitive

information; it's just for educational purposes, so I'll do it for free. Otherwise, you would have to calculate your inference cost for all of these runs: take the input tokens and the output tokens, go to the pricing page, see what the price is for that particular model, and multiply by the number of rows in your evaluation dataset. This probably wouldn't be that costly, but still, if you have a big run that you want to evaluate and they give you this option and it's not sensitive data, I'd say take it.

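That back-of-the-envelope cost calculation looks like this. The per-token prices and token counts below are placeholder guesses, not real OpenAI pricing — check the pricing page for actual numbers:

```python
# Hypothetical per-token prices, NOT real OpenAI pricing.
PRICE_PER_TOKEN_INPUT = 0.50 / 1_000_000   # USD per input token
PRICE_PER_TOKEN_OUTPUT = 1.50 / 1_000_000  # USD per output token

rows = 50
input_tokens_per_row = 300   # rough guess: system prompt + review text
output_tokens_per_row = 1    # the model only emits "1" or "0"

# Cost = rows * (input tokens * input price + output tokens * output price)
cost = rows * (
    input_tokens_per_row * PRICE_PER_TOKEN_INPUT
    + output_tokens_per_row * PRICE_PER_TOKEN_OUTPUT
)
print(f"estimated cost: ${cost:.4f}")
```

For a 50-row run like this one, the estimate comes out to a fraction of a cent; the calculation only starts to matter at thousands of rows.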
Okay, so now we're going to run it. The cool thing here is that you can just set it and forget it: they'll send you an email when the evaluation is done. This isn't a lot of rows, so it should be pretty fast, and immediately I can see the output from this particular run. For GPT-3.5 Turbo, the pass rate here is 94%; in 47 out of the 50 cases, GPT-3.5

output the same label as my ground truth. If I want to inspect that a little further, I can go in, check out some of the outputs from the assistant, and see whether or not they match my ground truth. If I just want to look at the failed items, I can do that as well, and I can see the three that failed: in one instance the ground truth was negative but my assistant said positive, and in the other two the ground truth was positive but my assistant said 0, or negative. These are all failures.

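That inspection step is simple to sketch as results analysis: split the three failures above into false positives and false negatives (the rows here encode just the label pairs from the run, with the review texts omitted):

```python
# The three failed items from the run: ground truth vs. model prediction.
failures = [
    {"truth": 0, "predicted": 1},  # ground truth negative, assistant said positive
    {"truth": 1, "predicted": 0},  # ground truth positive, assistant said negative
    {"truth": 1, "predicted": 0},  # ground truth positive, assistant said negative
]

false_positives = sum(1 for f in failures if f["truth"] == 0 and f["predicted"] == 1)
false_negatives = sum(1 for f in failures if f["truth"] == 1 and f["predicted"] == 0)
print(f"false positives: {false_positives}, false negatives: {false_negatives}")
```

Whether the model skews toward false positives or false negatives can tell you which direction to nudge the prompt.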
Let's say that 94% isn't good enough and I need to get this as high as possible. How can I check whether another model is better? What kind of lift do I get by using a more sophisticated, and also more costly, model? You can go here and click on Add run, and you can change a couple of things. You can change the prompt — maybe I can get the score higher by modifying it a little, both on the system and the user side — but for this particular instance, I want to see if a more sophisticated model makes any difference. So I'm going to go with GPT-4o, OpenAI's flagship model, leave everything else the same, and click Run. It's going to take some time, but again, because this is a small dataset, it should run relatively quickly. As you can see, GPT-4o scored 96%, with 48 out of 50 passing — one more than GPT-3.5 Turbo. That gives you information: maybe it's worth it to get this type of lift, or maybe GPT-3.5 Turbo is good enough.

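The comparison itself is just arithmetic over the two runs' pass rates:

```python
# Pass rates from the two runs above: 47/50 and 48/50.
pass_rates = {"gpt-3.5-turbo": 47 / 50, "gpt-4o": 48 / 50}

# Lift from upgrading models: two percentage points on this dataset.
lift = pass_rates["gpt-4o"] - pass_rates["gpt-3.5-turbo"]
print(f"lift: {lift:.0%}")
```

With only 50 rows, a one-item difference moves the pass rate by two points, so a gap this small may not survive a larger sample — another reason to rerun on more data before committing to the pricier model.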
You can do multiple different runs to find that out, and you can test other models as well. Of course, you can also export this data so you can look at it in your own system and dig into it a little deeper if you want. All right — go check out OpenAI's evaluation tool. I think it's a really easy way to get started with evaluations, and as I mentioned,

there are some other evaluation tools that I want to test out; let me know which tools and use cases would be interesting to you. More importantly, just make sure that evaluations are a key part of your consideration. If you're doing anything with LLMs, you have to be proactive about it and make sure you have a systematic approach to evaluating your LLMs on a continual basis. It's going to help you avoid a lot of headaches and also make your product better and better as more models become available. So evaluations have to be a key part of your overall workflow. I hope you enjoyed this video; if you have any questions or comments, drop them below. You can also find me on LinkedIn and X — I would love to hear from you. Cheers.
