Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation
By Stanford Online
Summary
Topics Covered
- Human Evaluation Suffers Random Agreement Bias
- Rule-Based Metrics Ignore Stylistic Variations
- LLM-as-Judge Outputs Rationale First
- Three Biases Plague LLM Judges
- Factuality Scores Aggregate Atomic Facts
Full Transcript
Hello everyone and uh welcome to lecture 8 of CME295.
So today's topic will be LLM evaluation and I think this class is probably one of the most important classes of this uh quarter because the idea is if we don't
know how to measure the performance of our LLM, we don't really know what to improve and so this class will focus on
how we can quantify how the LLM performs in a bunch of different cases.
So with that said, we are going to start the class as usual by recapping what we saw last week. So if you remember last
week, we saw how our LLM could interact with systems that are outside of the LLM itself. So we saw one core technique that is called RAG that allows our LLM to fetch information from external
knowledge bases. And so here RAG stands for retrieval augmented generation, and we saw how we could improve the
retrieval system. So we saw that it was composed of two main steps. So one
was candidate retrieval, which is typically something that is done with a bi-encoder kind of setup. So Sentence-BERT was a good example of how people would design such a model. And so this first step is typically there to filter down the potentially relevant candidates for a given incoming query.
And then we saw that there was a second step, which was reranking, and that one was a bit more involved and used cross-encoders, which were more sophisticated.
And we also saw some ways to quantify how well our retrieval system performed.
And then we also saw something that was called tool calling which is the ability for our model to know which tool to call
with which argument.
So if you remember, if we give our LLM the knowledge of the tools that are available to it, it can figure out
which arguments it needs to input to the function as a function of the input query and then run that function and
then output the result in natural language to the user.
And then we also saw what agentic workflows were composed of. So spoiler alert, it is something that is a combination of the two previous methods, RAG and tool calling, and in particular, given an input, we're allowing our model to make multiple calls
to call different tools to fetch relevant data from other knowledge bases, and we saw one example that was kind of successful from current applications, which was AI-assisted coding, which relies on this principle. And ReAct is typically the framework that people would use, so reason plus act, which is decomposing this into observe, plan and act steps.
Cool. So this is what we saw last time and we also started from this slide last time. If you remember, our LLM has strengths but also weaknesses that we're trying to mitigate. So, in
particular, the focus of lectures six and seven was on methods to improve reasoning of the model and ways for
the model to fetch knowledge from other systems as well as performing actions.
And today we're going to focus on the evaluation part. In particular, given a response that the model is giving, how can we quantify how well the LLM is giving its response?
Cool. So first of all, I would like to define the term evaluation and the meaning that we will use for this lecture. So when we say I want to evaluate my LLM, it can actually take a lot of different meanings.
So when you say let's evaluate the LLM, it can mean let's evaluate the performance, the output, let's evaluate this based on coherence, factuality.
Let's evaluate it based on latency. So
more system related metrics or pricing or how often it is up and so on.
So just to make sure we're on the same page, this lecture will mostly focus on the output quality parts and in
particular we'll focus on quantifying how good the actual response is.
And here you will note that this is a challenging problem because, as we saw previously, our LLM is a text-to-text
model that can output basically anything.
So it can be natural language, it can be code, it can be math reasoning and so on and so forth. So it's very hard to come
up with universal metrics to evaluate that. So we will see how people do this in practice.
Cool. So given the fact that our LLM generates free form output, one could imagine that the ideal
scenario for us to evaluate the LLM output would be to every time ask a human to rate the response.
So here the ideal scenario would be: okay, I give a prompt to my LLM. It gives a response. I ask a human to rate it, and I start again and again, and what I do is, at the end of the day, I just collect all these human responses and I try to quantify the overall performance of my model. Well,
as you can imagine, the main problem is that such a system would be very cost intensive.
But let's look at this into more detail.
So if you remember the LLM outputs are really free form and
there may be cases that even human judgment may be something that is fuzzy because maybe the rating task in itself is subjective.
So let's take the following example.
Let's suppose I ask my LLM what birthday gift should I get? And let's suppose the LLM responds with a teddy bear is almost always a sweet gift. Just pick one that
feels right for you. So let's suppose I want to evaluate this response with respect to the usefulness dimension.
I may have one human rater that says, "Yeah, it's pretty useful because, you know, teddy bear is pretty indicative of, I guess, what the user should get as a gift." But then another rater may say, "No, actually it's not useful because maybe the response didn't specify exactly which teddy bear. Should I have a bear? Should I have an elephant, a giraffe? Like which stuffed animal should I get?"
And so there is this notion of inter-rater agreement, where we're basically
concerned with making sure that everyone is aligned on how to rate those responses because sometimes, like in this
illustrative example it may be a little bit subjective.
So responses may vary. So what people want to do is to make sure that the guidelines are clear enough for everyone to rate these responses in a consistent
manner.
So people come up with agreement-type metrics.
So a very natural metric that you may think of is the quote unquote agreement rates.
So for instance you have these two raters. So what you do is you just measure the proportion of the time that the two raters give the same response.
And let's suppose the response here is binary. So let's say yes good or not good.
Well do you see a problem with such a metric?
like is this a good metric?
I guess another way to ask this question is if I give you a given number of agreement rates, can you tell me if it's
a good number or if it's a bad number?
Well, let's take the example of, let's say, two raters, let's say Alice and Bob, and let's suppose we have two different types of ratings that these raters can give. So either, let's say, yes it's good, so one, the output is good, or the output is not good.
So if we assume that the first rater gives, let's say, random responses with some probability p_A of being good and 1 - p_A of being not good, and then let's say Bob, who should have eyes and a smile, has p_B of being good and 1 - p_B of being not good. Then
let's compute the agreement rate for this case. So the agreement rate is basically the probability that rater A and rater B agree. And so here A and B agree if A and B both vote one or when A and B both vote zero, right?
But if they give their response in an independent and random way, well, if you use these probability concepts that you know, then we will have the probability of A and B responding one, which is the probability of A responding one times the probability of B responding one, and the same for zero. So we will have something like this: p_agree = p_A * p_B + (1 - p_A) * (1 - p_B). So the first term is A and B both saying 1, and the second term is A and B both saying 0.
So let's see what the agreement rate would be in that case. So if we suppose that p_A is equal to p_B, which is equal to, let's say, 0.5, then the agreement rate would be, just replacing the numbers here, 0.5^2 + 0.5^2 = 0.25 + 0.25, which is equal to 0.5. So what that means is, if we're just letting our raters rate these things in a random way with some probability p_A, p_B, we would already have an agreement rate of 50%, just by pure random chance.
And so one thing that I want to say is that this agreement rate by pure chance is a function of the probability that
each of these raters gives these ratings.
And so if this probability is actually higher the agreement rate by pure chance is also higher.
So what that means? So what do I want to say?
I want to say that if we just take the agreement rates then it's very hard to put it into context in terms of what you
would have gotten if things would have happened just by pure chance.
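To make that blackboard calculation concrete, here is a minimal sketch in plain Python (the probability values passed in are just illustrative, matching the example above) of the agreement rate two independent, random raters would reach purely by chance:

```python
def chance_agreement(p_a: float, p_b: float) -> float:
    """Probability that two independent raters agree purely by chance.

    p_a, p_b: probability that rater A (resp. B) labels an output "good".
    They agree when both say "good" or both say "not good".
    """
    return p_a * p_b + (1 - p_a) * (1 - p_b)

print(chance_agreement(0.5, 0.5))  # 0.5  -> 50% agreement with no real signal at all
print(chance_agreement(0.9, 0.9))  # 0.82 -> skewed label rates inflate chance agreement
```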
So for this reason, people have come up with a series of metrics that try to make it more
relative to this baseline, which is what would happen if our raters would choose things kind of randomly.
And so you have these metrics. For instance, this one is the Cohen's kappa metric, which computes a quantity that is a function of this agreement rate by chance and of the observed one, such that if our observed agreement rate is greater than the by-chance agreement rate, then our coefficient is positive. So when it's positive, you know at least it's going in the right direction. So here if the observed agreement rate is equal to one, then kappa is equal to one. But if our observed agreement rate is below the pure-random-chance agreement rate that we saw on the blackboard, then our coefficient would be negative.
So long story short, there is a bunch of metrics that try to quantify inter-rater agreement rates using these kinds of
formulas to be able to make these quantities relative to what would happen if things were done in a random way.
And so that's why you may see a bunch of metrics out there. So here it's Cohen's kappa that people use for cases where there are two raters, but then you have extensions such as Fleiss' kappa and Krippendorff's alpha that you may see out there. So they all rely on this idea that we should have some baseline, which is our raters just randomly picking answers, and try to see how much better our actual agreement is compared to this.
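As a rough sketch of how these chance-corrected metrics work, Cohen's kappa compares the observed agreement p_o with the expected-by-chance agreement p_e as kappa = (p_o - p_e) / (1 - p_e). A minimal version for two raters with binary labels (the rating lists below are made up for illustration) could look like this:

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two raters giving binary labels (1 = good, 0 = not good)."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's marginal rate of saying "good".
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings from Alice and Bob over 8 outputs.
alice = [1, 1, 0, 1, 0, 1, 1, 0]
bob   = [1, 0, 0, 1, 0, 1, 1, 1]
# 1 = perfect agreement, 0 = chance level, below 0 = worse than chance.
print(cohens_kappa(alice, bob))
```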
So does that make sense?
Yeah. So I guess what I want to say is that the first limitation of asking humans
to rate our LLM outputs, which was sometimes the task being subjective, can be something that we can quantify
with these inter-rater agreement metrics.
So what people typically do is they keep track of how good that agreement is. And
if let's say we have a quantity that's not satisfactory, people would just hold some quote unquote agreement sessions between the
raters to just align on how they should rate the answers.
So it can be seen as just a health metric to track how consistent your ratings are, and this is typically
something that people use in practice.
So up until now we've seen one limitation of human ratings.
Well, the second limitation, I think I also said this previously, is that it's really slow. You know, if you ask someone to rate a thousand LLM outputs, well, it will take them a while, and it's of course expensive.
So, all of that to say that our ideal scenario of asking a human to rate every LLM output is not something that is
practical.
But we can leverage human ratings in some way because we've seen that even if the task is subjective, we
can have a way to align our raters.
So now let's move on to another way to go about doing this, which is by using some rule-based metrics.
So here I'm just going to revise the setting that I mentioned before, and instead of asking our humans to
rate every LLM output, this time I'm just going to ask them to write
the references or the ideal outputs for a given set of prompts, just fix that for good,
and then use some kind of metric that would compare the LLM outputs with those references.
So here the main difference is let's suppose I have a given set of prompts fixed.
Well, I can make iterations in my model and always compare the output of my LLM with this fixed reference instead of
always asking humans to rate that again and again. So, it's already an improvement, and we will see a little bit what are the kinds of rule-based metrics that you
will see out there.
So ideally these metrics should reflect the performance of the LLM output in an optimal way. And what I mean by an
optimal way is to make it a little bit flexible, given the fact that natural language is not always something
that you can say in one given way.
So for instance when I provide a response to a given prompt there can be very well a case where I can formulate the response slightly differently but it
will still be just as good.
So the idea behind these metrics is to make this comparison a little bit flexible.
So let's start with one common one that people use in the translation case. So
this metric is called METEOR, and it stands for Metric for Evaluation of Translation with Explicit ORdering.
So the idea here is to compare reference and predicted and we'll see
how it's being done and also penalize cases when words are not in the
same order which is explaining why the metric is called with explicit ordering.
So the formula is as follows. It is some F-score times 1 minus some penalty.
So the F-score here, you may be familiar with the F1 score, which is like the harmonic mean with equal weights; this one is with variable weights.
So it is a function of precision and recall where precision is the proportion of the
unigrams that are in your predicted sequence that are matching with the reference, and the recall is the proportion of the unigrams in the reference that are matching with what is in the predicted. So it's basically matching the usual precision and recall metrics that you know, and then we have another quantity
here, which is the penalty, and as I mentioned, the penalty here tries to incentivize good ordering. So if it's ordered the same in the reference and in the prediction, then it's good; otherwise it's bad.
And so here there's a bunch of quantities. So gamma and beta are hyperparameters that people arbitrarily choose.
And it's a function of C, the number of contiguous chunks that are matched,
over the number of matched unigrams.
So ideally you would want C that would be as low as possible because if you have a low number of contiguous matches
it means that your contiguous sequences are long, which means that the ordering is the same.
So you want C to be low and the number of matched unigrams to be high.
So you want that penalty term to be low for a prediction that has the same ordering as the reference.
So I guess a higher METEOR score means a better translation, according to this way of doing things.
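Putting the pieces together, here is a simplified sketch of such a score. It assumes the classic default hyperparameters from the original METEOR formulation (alpha = 0.9, gamma = 0.5, beta = 3) and only counts exact unigram matches, with no synonym or stemming expansion, so it is an approximation of the real metric:

```python
def meteor_like_score(reference: list[str], predicted: list[str],
                      alpha: float = 0.9, gamma: float = 0.5, beta: float = 3.0) -> float:
    """Simplified METEOR-style score on exact unigram matches (no synonyms/stemming)."""
    matched = [tok for tok in predicted if tok in reference]
    if not matched:
        return 0.0
    precision = len(matched) / len(predicted)   # matched unigrams / predicted unigrams
    recall = len(matched) / len(reference)      # matched unigrams / reference unigrams
    f_score = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # C = number of contiguous chunks of matches; fewer chunks => same ordering => lower penalty.
    # (A real implementation aligns chunks properly; here we just count breaks in match positions.)
    positions = [reference.index(tok) for tok in matched]
    chunks = 1 + sum(1 for i in range(1, len(positions)) if positions[i] != positions[i - 1] + 1)
    penalty = gamma * (chunks / len(matched)) ** beta
    return f_score * (1 - penalty)

print(meteor_like_score("the cat sat on the mat".split(), "the cat sat on a mat".split()))
```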
So I guess when you look at this formula, first of all, it looks very arbitrary, right? I have alpha as a hyperparameter, gamma, beta, so it's kind
of a recipe, I feel, so that's one, and the second thing is that it does not
allow for stylistic variations, because here we're measuring the number of matched unigrams.
Although the metric expands the range of what is called matched unigrams by taking into account things like words
that are synonyms of one another and things that share the same roots, still it is not extremely
satisfactory in that sense.
So METEOR is one such metric.
You have another one that's being used, or that has been used, in translation tasks, which is called BLEU, which you may know. So, BLEU stands for BiLingual Evaluation Understudy. And you can think of this as kind of a precision-focused kind of metric that looks at the number of matching n-grams over the n-grams that are in the prediction, which is why it's a precision kind of metric.
And it also has a penalty term here.
It's called the brevity penalty because, given that it's more of a precision kind of metric, if you
translate something that's very short, you may be able to game the metric.
So you want to penalize the translation being too short.
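As a rough sketch of the same idea, here is a unigram-only, single-reference version of that precision-plus-brevity-penalty computation. Real BLEU averages clipped precisions over 1- to 4-grams (libraries such as nltk ship a full implementation), so this is only meant to show the two ingredients:

```python
import math
from collections import Counter

def bleu1_like(reference: list[str], predicted: list[str]) -> float:
    """Toy, unigram-only BLEU: clipped precision times a brevity penalty."""
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    # Clipped matches: a predicted token only counts as many times as it appears in the reference.
    overlap = sum(min(count, ref_counts[tok]) for tok, count in pred_counts.items())
    precision = overlap / len(predicted)
    # Brevity penalty: do not reward translations that game precision by being very short.
    bp = 1.0 if len(predicted) > len(reference) else math.exp(1 - len(reference) / len(predicted))
    return bp * precision

# A perfect-precision but very short prediction still gets a low score.
print(bleu1_like("the cat sat on the mat".split(), "the cat".split()))
```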
So we'll not go into a lot of details, but I just want to just show you the kinds of metrics that are out there. So
METEOR is one, BLEU is another one, and ROUGE, which you may have heard of, is also another one, typically used for summarization tasks.
Again, same idea, and it has a bunch of variants that you may see out there, but long story short, all these metrics,
they all compare the output with a reference.
So as we saw one key limitation is that they do not allow stylistic variation.
So let's take an example. So let's
suppose I say a plush teddy bear can comfort a child during bedtime. Well,
the exact same thing I can say in a really different way. So
soft, stuffed bears often help kids feel safe as they fall asleep, or many youngsters rest more easily at night when they cuddle a gentle toy companion.
So in all these cases, the metrics that we saw would really perform very poorly.
So that's one key limitation. So the
second key limitation is that the correlation with human ratings is not that great.
I mean, you can imagine that people have come up with all these hyperparameters to kind of make it be correlated to human ratings, but they're not that
correlated.
And the bottom line is it still requires human ratings to just get started. And sometimes
you just can't afford to have human ratings maybe in your project.
So I guess there are still some key limitations, which is the reason why, all of that to say, I want to motivate the key method
of this class or of this lecture which is called LLM as a judge.
So, you know, we spent the first seven lectures motivating these large language models that are pre-trained on huge amounts of data that are uh tuned in a
way to match human preference. So, they
do contain human knowledge. They do
contain some indication of what humans may prefer.
So the idea here is to have our model response be actually an input of yet another LLM
and that LLM is something that people typically call LLM as a judge.
So it was a term that was introduced in a paper uh from two years ago.
So here the idea is to use an LLM for rating purposes and things that you would see as input would be the prompt that was used to
produce the response, the response and the criteria along which you want to
grade your response.
And so here LLM as a judge would give you the following outputs. So the first thing is it would give you a score.
So here you can think of it as like a binary-scale kind of score, so pass or fail, but, and this is very new,
also a rationale, because LLMs understand text, so they can also explain to you why they
graded something with a given score, and that part is the key difference with previous
methods. We are able to explain why the metric or the model is giving us a given score. And this is quite good because in the other, let's say, rule-based world, where you would have all these formulas and multiplications, sometimes you would come up with a number that would not be very self-explanatory, and this is luckily something that LLM as
a judge addresses.
So to recap, what we want is to use an LLM as a way to grade the response.
So here you would have typically the following kind of prompts. So you
would state okay I want to evaluate my response with respect to a given criteria and then you give the prompt that you used to generate that response along
with the model response and then you would ask the judge to return two things the
rationale and then the score.
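As an illustration, a judge prompt along these lines could look like the sketch below. The exact wording and the usefulness criterion are made up here, not taken from the lecture slides; only the birthday-gift example comes from earlier in the lecture.

```python
JUDGE_PROMPT = """You are evaluating a model response along one criterion: usefulness.

Criterion: a useful response directly helps the user accomplish their request.

[Prompt given to the model]
{prompt}

[Model response]
{response}

Return, in this order:
1. rationale: a short explanation of what is good or bad about the response.
2. score: "pass" or "fail".
"""

# Filling it in with the birthday-gift example from earlier in the lecture.
filled = JUDGE_PROMPT.format(
    prompt="What birthday gift should I get?",
    response="A teddy bear is almost always a sweet gift. Just pick one that feels right for you.",
)
```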
So one little trick I want to point out is people typically ask the model to first output the rationale and then the score
and the reason why we typically do that is it's something that empirically improves the quality of the results but then given what we saw I think in
lecture six if you remember the reasoning class we saw that these reasoning models that are being trendy especially in 2025
what they do is they first output a chain of thought before giving the answer.
So you can actually think of this trick as being on the same kind of idea of reasoning models as in it allows the
model to externalize verbalize its quote unquote thought process before giving the score.
So it gives it a chance to really figure out what is good or what is wrong in the model response.
So far so good.
Any questions on I guess the setup?
Yeah, all good. Okay. So now I have a question for you.
If I give the following prompt to my LLM as a judge, am I guaranteed to have a rationale and a
score that I can parse?
Am I guaranteed?
No. Yeah, exactly. No. The answer is no.
You're not guaranteed to have a rationale and a score that you can parse, because this model has some probabilistic nature to it with the sampling process, and it's not something
that you can really control. So I guess my follow-up question is do you know a technique that would
I guess guarantee you to have a structured response. The hint is, it's a technique that we saw towards the beginning of the class.
Okay, I'll give you a little hint. So if
you remember on slide 65 of lecture 3, we saw a technique called constrained guided decoding.
So if you remember, the idea here is to constrain the decoding process by allowing our model to only sample
from quote unquote valid tokens.
And we typically do that in cases where we want our output to have a given format. So let's suppose a JSON format, and we want absolutely that format.
So what people do is they use this technique to guarantee the form of the response.
And in case you're using the providers that are out there, like for instance OpenAI or Gemini or Anthropic,
this technique is known under the name structured output.
So in your project if you want to constrain the decoding process in order to output a response of a given
format. So let's suppose my format is a response, and I kind of represent it by a class, and there are like two attributes, so rationale and score. Well, typically you can reference that with the argument text_format equal to that representation.
So, this I believe is something that uh OpenAI does, and I'm not exactly sure if it's exactly the argument name that you would see for the other providers, but
they're all I guess along the same lines.
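As a hedged sketch of what that can look like in code, here is one way to do it with the OpenAI Python SDK. The method and argument names (responses.parse, text_format, output_parsed) reflect that SDK as I understand it and may differ across SDK versions and providers; the model name is just a placeholder.

```python
from pydantic import BaseModel
from openai import OpenAI

class JudgeVerdict(BaseModel):
    rationale: str  # rationale first, so the judge reasons before scoring
    score: str      # e.g. "pass" or "fail"

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
result = client.responses.parse(
    model="gpt-4o-mini",          # placeholder model name
    input="Evaluate the usefulness of this response: "
          "'A teddy bear is almost always a sweet gift.' "
          "Return a rationale, then a pass/fail score.",
    text_format=JudgeVerdict,     # constrains decoding to this schema (structured output)
)
verdict = result.output_parsed    # a JudgeVerdict instance you can use directly
print(verdict.score, "-", verdict.rationale)
```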
Does that sound good? So, the key word here is structured output. Whenever you
want a response of a given format, you would just go for that.
Okay, cool. So just to recap, our LLM as a judge has two main benefits. So the
first one is that we do not need a reference text. We do not need human ratings to just get started, because our LLM already has a lot of knowledge that it has acquired during pre-training, and human preferences, and so on. So you do not need that.
And then the second thing is you can interpret the score with the rationale that is being output
and that is also quite remarkable.
So just as an example, here you would say, okay, evaluate the quality of this response. So you would have some rationale that would explain what
this response has or doesn't have that makes it good or bad along with the score.
Okay, cool. So now we're going to see the kinds of LLM judges that you can see out there. So
of course there are many variations, but there are generally two types of LLM judges that you will see.
So the first one is you have a single output a single response that you want to evaluate
and here you would ask the LLM as a judge to say, okay, is it good or is it not good, and the second big kind of LLM judge that you will see out there is the pairwise kind of setup. So you have two responses and you say, is response A better or is response B better,
and here you would obtain a response either okay that one or this one.
So if you remember we've seen uh in previous lectures that there are a lot of situations where we would want to have preference data. So for instance in
the preference tuning class that we had I believe it was lecture five.
So this kind of method can also be a good way to synthetically generate preference ratings where you have two responses
and then you ask your LLM to say, okay, I prefer that one, and you can use that one as the label to train your reward model.
Does it sound good?
Any questions on the setup or everything that we've talked about so far?
Okay, cool. Everyone is on the same page. So now let's see what can go wrong with our LLM as a judge. So let's think of the possible kinds of failures
that we can encounter.
So the first one is called position bias and as the name suggests it has to do with the ordering at which we present the responses to our model.
So let's say if we ask our model is response A better or response B.
Well, there is a chance that the model responds with response A just because it was the first one to be mentioned.
So that kind of bias is called position bias. So it's where the position at which you place the response matters in the judgment of the LLM-as-a-judge model.
And I guess as a way to remedy that, people have different techniques. But
one typical technique would be to ask the model is A or B better and then ask the model is B or A better and then take
the majority voting. So if both of them lead to the same response, then it's good. But if the response changes, then it may not be good. So you may want to do something else.
There are a bunch of other techniques. So I know there's a bunch of papers that try to tweak the position embeddings, but those ones are a bit more advanced, so it's not typically the thing that you would do just out of the box. So taking the average, or taking the majority vote over this position swapping, is typically what you would do.
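A minimal sketch of that position-swapping idea is below; judge_pairwise is a hypothetical helper that would wrap a pairwise LLM-as-a-judge call and return "A" or "B".

```python
def debiased_pairwise_judgment(judge_pairwise, response_1: str, response_2: str) -> str:
    """Query the judge with both orderings and only trust a consistent verdict."""
    first = judge_pairwise(response_1, response_2)    # presented as (A=response_1, B=response_2)
    second = judge_pairwise(response_2, response_1)   # presented as (A=response_2, B=response_1)
    # Map both verdicts back to the underlying responses.
    winner_first = response_1 if first == "A" else response_2
    winner_second = response_2 if second == "A" else response_1
    if winner_first == winner_second:
        return winner_first   # consistent across orderings, keep it
    return "tie"              # verdict flipped with position, treat as a tie or re-run
```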
Okay, cool. So this was the first kind of bias. The second bias is called verbosity bias.
So let's suppose you have two responses and the first response is short and concise.
The second response is something that goes much more into details is typically something that is more verbose.
Well, there are cases where the model will tend to prefer responses that are just more verbose just because they're
more verbose, not necessarily because they're more correct.
And for that, it's maybe a little bit trickier. So people typically try to make this dimension explicit in the guidelines when they input this question to the LLM as a judge. They say, well, make sure to not pay too much attention to the length of these responses, to not prefer something just because it's more verbose. So that's one kind of method that you will see out there.
The second one is to also add some in-context learning examples to the model, to show by example that verbosity is not something you should prefer. And then the last one is to have some kind of penalty on the output length. So you can ask your model in a pointwise way, how good is response one, how good is response two, and then try to penalize that with the length. So that's something that people may also use.
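As one illustrative way to implement that last idea, here is a small sketch; the penalty form and the lambda value are arbitrary choices for the example, not something prescribed in the lecture.

```python
def length_penalized_score(raw_score: float, response: str, lam: float = 0.001) -> float:
    """Discount a pointwise judge score by response length so verbosity alone doesn't win."""
    n_tokens = len(response.split())   # crude token count, for illustration only
    return raw_score - lam * n_tokens

# A verbose response now needs a genuinely higher raw score to beat a concise one.
print(length_penalized_score(0.90, "short and correct answer"))
print(length_penalized_score(0.92, " ".join(["word"] * 400)))
```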
Okay. So we've seen position bias, we've seen verbosity bias, now we will see the third kind of bias that you may see out there, which is called self-enhancement
bias.
And so that one has to do with the fact that if you ask a model to judge an output
that was produced by itself, well, the model will tend to prefer responses that are generated by itself regardless of whether or not the other
one was more uh aligned with what we wanted.
And I guess here the intuition is that if our model generated such an answer then it's maybe the case that
our model thought that, from a probabilistic standpoint, this was a sequence that was very likely to appear. So it may be one way to think about it, which is, if it has generated such a sequence, then it means that it is something that it thinks, quote unquote, is a good answer.
So the general guideline here is to typically not use the same model that you use for generation and for judging.
But I guess nowadays it's kind of hard to have that strict constraint
respected, because all models are trained on basically the same datasets.
So you can argue they're all being subject to the same training mixes and so on. But still,
what people do is they tend to use another model just to have such a risk be minimized. So long story
short try to not use the exact same model that you use for generation and for evaluation.
So this is self- enhancement bias.
Okay. So before we go to the next subpart, I guess, what do you think of these three biases? Do they make sense?
Any questions so far?
Yep.
So, can you elaborate a bit more?
Yeah.
Yeah. Yeah. Yeah. So, the question is, can you have a model that just maybe isn't aligned with uh I guess the ground truth and maybe prioritizes maybe one label over another. So, yeah, this can
definitely be another kind of bias, so this bias being that our LLM is not exactly aligned with what humans would prefer. So these three biases are by no means exhaustive. So this can very well be another bias that you can list as well. So yeah, this is definitely another kind of bias.
Yep.
Yep.
So the question is, is it possible that our judge still prefers an LLM response even if it's a different one? Well, it depends how good your judge is. But typically the best practice is to have a judge that has a much bigger capacity, that may capture these kinds of differences and not be fooled by a response that just sounds like something it may generate, and instead pick something that is more aligned with human preferences. So I guess the short answer is yes, you can still have such a situation. But in order to mitigate that risk, you would typically take a model that is not the same, but also typically much bigger.
So you have a bunch of such models out there, and with all the improvements that have been made with reasoning models, this is
also something that people try.
Yeah, the question is, should the judge be bigger? It's not a hard constraint, but typically people would take a bigger model that would have strong reasoning capabilities
that could really tease out what's good and what's not good. Yeah.
Okay, cool. So, with that, I'm going to just go over the best practices that we've seen. So we saw that in order for our LLM as a judge to output a score, we need to give the criteria that
we want this to be evaluated against.
But sometimes this criteria may be a little bit subjective. So one thing that really works very well is to have crisp
guidelines. So really make explicit what we want and what we don't want.
The other point is you may see different kinds of scales out there. So sometimes people have a scale that is maybe more granular, and in other cases we're just operating on a binary scale. So typically what people would tend to prefer is actually the binary one, because it makes the job of the LLM
as a judge easier. So it's just either good or bad.
And also when it comes to aligning the judge with human ratings, humans, they typically also find it easier to just
judge out of two options as opposed to several. So it just removes the noise of having several possible choices. And
it's not necessarily an extra signal that may be really useful. So here the tip is to use a binary scale like a pass
or fail kind of score as opposed to like a gradual one.
The third tip is to make sure to output the rationale before outputting the score.
And we've seen this is along the same ideas of outputting a chain of thought before providing the response which is something that is done by our reasoning
models. So it's typically something that will improve the judge's performance.
Uh so we've talked about the different kinds of biases. So position, verbosity, self-enhancement, but it's not the only ones of course and uh I guess people
typically also look at how to mitigate those with the remedies that we mentioned.
So, so far we've stated that we do not need human ratings to get started, but a good practice is to still look at
how the LLM ratings compare with the human ratings.
So here one tip is to just calibrate the responses that the judge is giving with respect to the human ratings because at the end of the day it is the quantity
that we want to approximate, and so here, if there is the budget and it's something that is possible for the project, one good
practice is to collect the human ratings, output the LLM-as-a-judge scores, and then run some correlation analysis to
see if there is something that can be improved, mainly in terms of the prompt. And then the last thing is the temperature.
So if you remember the temperature um is a parameter that you can tweak to
make your generation more deterministic as opposed to more creative. And so you will see that for evaluation tasks, people use a low temperature because
they want to make their evaluation experiments reproducible.
Like let's imagine you do one evaluation and then you do another one let's say uh two days later. You don't want the scores to be super different. So you
will see a temperature value of something like 0.1 or 0.2.
These are like very uh common values that people take.
And so long story short, if we were to recap, we went from the ideal scenario being that
each LLM output is rated by humans to actually having some kind of approximation with these LLM-as-a-judge
models that can do this evaluation without any constraints or without any need for human judgments.
But as we mentioned in the best practices, making sure that the LLM-as-a-judge scores and the human ratings do not diverge is something that we should keep in mind as we improve our model, because it may be that we're improving our model so that our LLM judge score is very high, but the LLM judge score is itself an approximation of human ratings. So I guess you don't want to over-optimize against a proxy. And so that's why you want to have that proxy be as aligned as possible with your ground-truth labels, which are human ratings.
Okay, cool. So we have a few minutes left before giving it to Shervin. So I'm going to quickly go over the kinds of dimensions that people measure LLM output against. So broadly, there are many dimensions, but just to simplify things, there are two main dimensions that we can look at here. So one is how well your task is being done, so task performance, with things like was the response useful, was the response factual, was the response relevant, among other things. And also
how aligned was the response format, in terms of tone, in terms of whether the style was
something that is aligned with what we want, in terms of whether there were any unsafe elements in the response that was given to the user.
And I just want us to spend maybe five minutes on the factuality dimension, which is actually something that
requires a little bit more work and I'll give you just a setting.
So let's suppose we have some text output and our goal is to quantify how factual that output is. So I'm going to read the
text out loud. So teddy bears first created in the 1920s were named after President Theodore Roosevelt after he proudly wanted to shoot a captured bear
on a hunting trip. So what we want is quantify how factual that piece of text is.
So I told you previously that we typically prefer binary scales when it comes to rating something with respect to a
dimension.
But the thing with factuality is that there's a lot of nuance.
Some text may be very wrong. Some text
may be a little bit wrong. Some text may be not wrong at all. So we want to capture how wrong the text is, given
the fact that the text can contain a lot of sentences, and if there's one small issue, we don't want to just say the whole thing was not correct.
So, I'm not sure if you saw in this text, but there are actually two errors.
So, it's not the 1920s, but the 1900s, that the teddy bears were first created. And
the president didn't want to proudly shoot. He actually, I think, refused. So
if we are in such a case the question that we want to tackle here is how do we want to quantify this nuance.
So this is an open question that people have been writing papers on. So what I'm going to tell you now is something that people typically use
nowadays, based on research that has been done.
So we typically operate in a few steps.
So the first step is for us to go from the original text output to a list of facts
because when you look at a text, it actually contains a lot of facts that need to be checked. And so the idea here
is to aggregate the factuality of this text along the dimension of the facts that are present in this text.
So in this example, we would have one LLM call that transforms our original, potentially multi-sentence, multi-paragraph
text into a list of facts. So here in this example we would have four facts.
So that's the first step. The second
step is we would go over each of these facts and check whether it is correct or not.
And so here we would typically proceed in a binary fashion, because if you think about it, a fact is either correct or not. I mean,
you may have some in-between, but we don't want to overcomplicate the task, and so here the fact-checking process
would typically involve the other techniques we've seen last lecture, like RAG for instance. Given a piece of text,
we want to I guess query a knowledge base with the actual fact and then check whether the fact that is here is actually correct
So this fact-checking process is typically something that involves things like RAG; web search is also something else, and so on.
So you can think of this fact-checking step as also involving LLM calls.
And so you can also think of some facts being more important than others.
So as an example, maybe the fact that the president proudly wanted to shoot the the bear is not as important as let's say the name of the person after
which the teddy bears were named. So you
can think of also having weights that quantify the importance of each fact.
So people would use something like this formula which is an aggregation over all the facts with some weight that quantifies the importance of
each fact.
So these weights alpha_i can be all equal to one another if you want to make it simpler. It's not necessarily the case everywhere that these must be different, but it may be something that you can tweak.
So if we go back to our uh initial question which is how do you quantify the factuality of this text here you
would say, okay, the second and the third facts are both correct, and we know how important they are. So we run this
aggregation formula and we obtain a score of 0.6.
So that means that there are some errors but we still have some things that were still factually correct.
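Putting the recipe together, here is a minimal sketch of the aggregation step. The fact list, the weights, and the correctness flags below are illustrative assumptions chosen to mirror the teddy bear example and reproduce a 0.6 score; in practice the extraction and fact-checking steps would each be LLM (or RAG / web search) calls.

```python
def factuality_score(facts: list[dict]) -> float:
    """Weighted fraction of atomic facts that check out: sum(alpha_i * correct_i) / sum(alpha_i)."""
    total_weight = sum(f["weight"] for f in facts)
    return sum(f["weight"] * f["is_correct"] for f in facts) / total_weight

# Atomic facts extracted from the teddy bear text (weights are illustrative assumptions).
facts = [
    {"claim": "Teddy bears were first created in the 1920s",         "is_correct": False, "weight": 1.0},
    {"claim": "Teddy bears are named after Theodore Roosevelt",      "is_correct": True,  "weight": 2.0},
    {"claim": "The naming involves a bear on a hunting trip",        "is_correct": True,  "weight": 1.0},
    {"claim": "Roosevelt proudly wanted to shoot the captured bear", "is_correct": False, "weight": 1.0},
]
print(factuality_score(facts))  # 0.6 with these illustrative weights
```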
So this is typically how you would run this criteria with I guess the techniques that we have nowadays.
Okay, cool. I know I'm 2 minutes late and with that I'm going to give it to Shervin.
Thank you Afshine.
So before we move on to looking at specific benchmarks, I wanted to take a detour and look at what is happening on the agent side of things.
So if you recall what we discussed last lecture, we talked about this ReAct framework where you could decompose what
was going on within an agent into specific steps. So it's usually three steps, which can be observe, plan, act, or
it can have other names. But the fact is that you have several atomic steps that can loop.
So if you take a look at a typical agent's inner workings, you can see a pattern like this. Now you might wonder
how do you even evaluate such a thing.
So let's take a look at just one loop um and then let's see together what can the errors be in order for us to have an
idea of what an evaluation result would mean on an agentic workflow.
So I'm going to show a slide that we had presented um at the previous lecture and uh we had
seen that we can decompose a tool call into these three steps. So let's take our favorite example um let's say you want to find a bear near you. So you
would ask that to the model. So the
first stage is to find the right tool call with the right argument and then once you have found this right
tool call you need to execute it and then based on your tool call prediction and on the results that you obtained from your tool you would infer the
results at the last step. So these are three steps and you might have a series of them in the case of an agentic workflow where you call multiple tools
and then build up your reasoning until reaching an answer that you then give to the user.
Okay. So now let's look at what the failure modes can be at each of these steps. So first,
let's take a look at possible tool prediction errors. So the first one I want to mention is the case where, from a user query that obviously needs a tool, you don't actually use the tool. So here let's suppose that you want to find a bear.
You have the tool to find bears at hand, but you don't use it. So typically, if you don't use it, a possible behavior from the model can be to say an error. So by an error I mean, you know, sorry, I cannot do that, and in assistant terms you can call this a punt. So when you don't answer the question, you just fail, it's called a punt. So you might punt here: you know, sorry, I don't know where I can find one. And let's see together what could possibly cause this issue and how we could remedy it.
So I don't know if you recall the concept of tool router or tool selector that we had introduced. So usually when you are dealing with tools you don't
have just one, you have multiple ones, and the number of tools that might be useful for a large-scale LLM, in the sense of number of users, might be large. So you don't want to input all the function APIs at every call. So it might be the case that you have this intermediary step where you filter down the set of possible functions that you can put in the preamble, and here these tool selectors or tool routers have the property of trying to be recall-oriented. So you want to trim the list of functions that you want to input in the preamble, but you want to at least find those that you need. So the main property here is that you want to save on context space, but you still want to ensure that most of your use cases are still working. So this is why here, when we say tool router error, we actually mean a recall error. So it means it's possible that we just didn't select the right tool among the set of tools, and let's say this is the cause, then
it's pretty clear we just have to adjust the tool router in order for it to predict the right tool. So this
can be one kind of issue. Another kind
is, hey, actually the tool was included in this list of function APIs, but it's just that the LLM didn't think about using it.
So maybe this find teddy bear tool was in there, but we just don't use it. The LLM directly outputs a response.
So in that case um if you recall we had mentioned techniques to teach an LLM to use a tool. So you would need to revisit
that part, and either, if you had trained it with SFT, include this pattern maybe to train the model to
recognize it, or if you had done prompt tuning, then you should revisit your prompt in order for it to make sense to the model that it should use
that tool.
Okay, great. So this is one kind of possible error. Another one that you might see in the wild when you want to debug agents is at the time of tool calls. It might be the case that the model comes up with a function name that just simply doesn't exist.
So here I mentioned tool hallucination. This is what I mean by that. So it calls a function that is just not defined. So here our API was called, if you remember, find teddy bear.
So this was the function that existed and in this example of failure the model tries to call the function find bear which I haven't defined.
So uh when you see such errors you have several potential causes. One of them being that the model simply doesn't
ground well overall and typically it occurs if the model is too simple.
So it's an empirical observation: if it's too weak, maybe it can make up things that it thinks can be reasonable, but it doesn't actually ground on your instructions.
And here I have no better remedy proposal than to maybe upgrade the model if you see that this is truly the case, and then see if this is
reproducible.
Some other potential causes could be coming from actually you. So the model is trained on very high quality data
during its SFT stage. So it has seen what great APIs look like. So these
tools that you define that help the user achieve what they're looking for might not be written in the best way. You
know, if you didn't use AI-assisted coding, let's say, or yeah, you don't necessarily have to use AI-assisted coding to write this, but it is typically a great way to check whether your implementation makes sense from a model standpoint. And if it doesn't, then a typical remedy is renaming the API. Just the function name and the arguments go a long way, because this is what the model will see when it comes to your tool call. It sees the API function name, it sees the arguments and the high-level descriptions.
So these are your three knobs to tune in order to make it sound more logical and then linked to the actual task at hand.
And then uh so you know at the very beginning I was saying maybe the model is too weak but actually maybe the first thing you should check is whether the
horizontal instructions, so horizontal across tools, whether these are clear enough. Maybe the model hasn't really understood that it needs to use the functions that are given to it. So maybe it's just making up function names that it believes it could have access to. So the first thing to check would probably be to see if this phenomenon is generalized, and see if these horizontal instructions are concisely saying that you should make sure to use the available functions. And then, yeah, on that you can iterate on these top-level instructions, and maybe iterate with an LLM itself, because typically top-level instructions are very important. So you need perfect formatting and perfect logic. So yeah, typically being able to detail them with great detail is helpful.
Okay, now let's see a third possible failure cause. So let's say you have your model and your user prompt, but you just don't use the right tool. So
here if the user says find a bear near me, one other reasonable approach would be you know what if you just send a
message asking for a bear. You know that would be reasonable, right? But maybe
that's not what you want to implement as a behavior for your user. So in that case it's not clear to the model what
approach you prefer, and it is your responsibility to ensure it is indeed clear. And
then you have to do that at two different levels. So the first one is potentially also at the tool router level. Maybe the tool router doesn't know that for this kind of query, you should have the tool that you had in mind as part of the results. So
it's possible that you have a recall issue that you need to fix. And then the second one is simply going back to the APIs of both functions. Maybe they
conflict in scope. So you want to go back to each of them and be precise about which situations should be dealt with
with which tool. So being very precise in these APIs just helps a lot here.
Okay, great. So now we're going to go through a fourth and last failure mode for this tool prediction task, which is: what if you
have the right tool but you just don't have the right arguments.
So you have already gone one step you know you have found the right tool but then this last mile of making sure that the tool is run with what you would like
is not fulfilled. So here, if I say find a bear near me and it uses the coordinates (0, 0), which is
somewhere in the southern Atlantic, so it's not likely that I'm actually there, that can reflect an issue, and
one possible explanation for this is that maybe it simply doesn't know where I am, because I haven't specified in my query that I'm here at Stanford
and it just tries to make up my coordinates. So, one thing that you should double check is making sure that the context carries the location information. So, if I haven't provided that as a setting on my LLM app, it's possible it's not there. And then, let's say it's not there, maybe you would want to introduce a location finder tool that is executed beforehand, and if it fails because I haven't given the app the permission to see my location, then maybe you could have an actionable error shown to the user, instead of having some dummy parameters
passed in. So this is one potential
passed in. So this is one potential remedy. And then the second one is uh
remedy. And then the second one is uh you know maybe it sees arguments but uh the model just doesn't know what it should put as input.
So that could also be another reason, and on that, a common remedy is to go back and
then retrain either the model on how it uses these tools or rewrite the API.
So we have seen four failure modes for the tool prediction step. Now we're going to see two more on the tool call step.
So the first one is a very simple one from a mindset perspective. So maybe your tool just doesn't output the right response. So, it's a very vague kind of category. And as an example, maybe your code logic has a bug somewhere and it just returns an error. You know, those that you see in Python, maybe it hits some ValueError or anything else. And I just want to say that it might not necessarily be the case that hitting an error is bad,
because sometimes you know in the case of finding your location if you haven't uh provided your permission to find the location maybe it will hit an error and the model will
anchor on that error to convey the status to the user. But in general, it's not really common practice to return
errors just because the model could interpret it as an internal tool error.
So sometimes when you hit an error and then you ask the model to synthesize the tool calls it has seen, sometimes it
just says, "Oh, sorry. I couldn't do it, but it's my fault. It's because I encountered an error." Doesn't really say, you know, actionably what happened.
And instead the fix here is to uh convey these outputs in a meaningful manner. So
uh typically you have a structured output and you return a true output instead of an error.
So yeah, here it's a general case just to say: check your tool implementation. So you have the right arguments, you have the right tool, but you just don't have the right value. So it's just a software engineering problem, just go and fix the tool.
And the second category of issues that we see at the backend level is when you return no response.
And returning no response is often bad when the tool is one that performs an action.
So let's say last lecture we talked about increasing the thermostat for your teddy bear who was cold. So if
you increase the thermostat and the tool doesn't say anything. So the model doesn't know if it has done the task successfully or not. So it could well
come up with a false confirmation of hey uh you know all is good. I have
increased the thermostat. No worries.
But it actually hasn't. So this is why a common guidance is to always make sure that tool calls are followed by a
meaningful output. So as usual you have this structured message, and you should take advantage of it to convey what has happened as part of your tool, in order to make sure that the model in turn knows what to convey to the user or knows how to continue that agentic loop.
So yep, always output something. And let's say you want to find a teddy bear and you haven't found any, then here you will be surprised by
what I say but it's actually better to output an empty JSON than just outputting none because an empty JSON
could mean I found no bears but an output of none doesn't say anything. So
even in that case an empty output in the sense of you know empty JSON is meaningful and make sure to use that
meaning in the way you encode your tool.
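A quick sketch of that guidance in Python; the tool name, arguments, and response shape are hypothetical, just to contrast a meaningful empty result with a bare None.

```python
def find_teddy_bear(latitude: float, longitude: float, inventory: list[dict]) -> dict:
    """Search a (hypothetical) inventory for teddy bears near a location.

    Always returns a structured result: {"bears": []} means "searched and found none",
    whereas returning None would tell the model nothing about what happened.
    """
    nearby = [b for b in inventory
              if abs(b["lat"] - latitude) < 0.1 and abs(b["lon"] - longitude) < 0.1]
    return {"bears": nearby, "count": len(nearby)}

# Empty but meaningful output when nothing is in range:
print(find_teddy_bear(37.4, -122.2, inventory=[]))  # {'bears': [], 'count': 0}
```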
Okay, great. So we have seen two more possible errors at the function call level. Now let's suppose that everything went great at the first step, everything went great at the second step. So you
have found the right tool output, but now the model has trouble synthesizing the output into a meaningful response.
So let's suppose your tool found a bear named Teddy, and the other attributes, which I haven't shown here, maybe say that it is one mile away from me. So the teddy bear has been found and we just have to present it to the user. But if you pass it to the model, let's suppose the model says, you know, "I didn't find any bear." So what could be the cause here? It could be the case that you have an output with information that the model doesn't ground on, that is, the model lacks the ability to refer to content that was put in the context previously. So here I have the same vanilla suggestion of, you know, upgrading the model.
So that usually doesn't really happen anymore, but it used to in early iterations of LLMs. There is one cause that actually happens fairly often: sometimes the tool backend returns not only an output but a lot of output, and it's too much for the model to properly parse what is important. So you maybe have the information about Teddy in there, but it's drowning under an ocean of other kinds of information that are not useful, so the model cannot distinguish what is helpful. And then the solution for that is to go back to your tool implementation and ensure that whatever you output is meaningful for the model to use in the next stage.
And I think this overlaps with the third reason I put here. So let's say your output is already trimmed. Then another possible explanation is that it's not being presented in a meaningful way. This is why, in Python, you have these classes where you can instantiate attributes so that the output is very meaningful. So, for saying that you have found a bear, you could return an object called TeddyBear with attributes name, distance, and so on. This is very meaningful, as opposed to raw information that the model doesn't know how to interpret.
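In Python, a small dataclass is one way to get that kind of self-describing output; the teddy-bear fields here are just an illustrative guess:

```python
from dataclasses import dataclass, asdict


@dataclass
class TeddyBear:
    # Named attributes tell the model exactly what each value means.
    name: str
    distance_miles: float
    color: str


def find_nearest_bear() -> dict:
    """Hypothetical tool returning a named, structured result rather than raw text."""
    bear = TeddyBear(name="Teddy", distance_miles=1.0, color="brown")
    return {"status": "ok", "result": asdict(bear)}


print(find_nearest_bear())
```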
Okay, awesome. So we have seen here seven different failure modes over all these categories. These are not the only ones, you know; I have just mentioned those that I see very often and that I thought could be helpful, but you could definitely see other failure modes. Does this make sense? Do you have any questions?
Okay, great. So we can move on to summarizing the common trends in these failure modes. Oftentimes we have talked about the modeling side, where improving the model's ability to reason and ground could be the solution. Another kind of complaint that we have seen is about the relevance of what we put in the context window: if we improve that relevance, maybe it gets better. And on the modeling side, one more aspect is the tool-use modeling, or the tool API modeling itself, either by SFT tuning, or just prompting, or even just the API description itself. So, you know, does the function make sense? Do the arguments make sense? Does the docstring make sense? So this is one kind. And the other kind is the tool itself: maybe it just has a problem, so you need to fix it.
And I just want to say that when you deal with tools and evaluations, you have a lot of possible errors. So one thing that will help you navigate through this is to be very methodical in categorizing the kinds of errors and then dealing with them in groups. You see, there are lots of errors, and every time you deal with a given issue, it may be a whole adventure to solve. So really being very organized here is going to help you a lot.
Okay.
So with that in mind, we can delve into the world of benchmarks. We talked about evaluations; now you might wonder how you can evaluate a large language model. You know, let's say you have trained everything; how can you compare it with respect to others? So we're going to see together a series of benchmark categories that today's benchmarks usually reside in, and we're going to see examples for each of them. Does that sound good?
Awesome. So we can start with a kind of benchmark that I call a knowledge-based benchmark, where we want to test if the model is able to recall given facts. These facts typically span lots of domains. So it doesn't have to be super precise on a given domain; it spans all the kinds of domains that your users may care about. And then one prime example of this is MMLU, which we're going to see very soon. But just before we do so, I want to say that this knowledge benchmark mostly, but not only, measures how well pre-training was done: how well the information in your large corpora of data was retained by the model, in order to be helpful at inference time.
So, MMLU stands for Massive Multitask Language Understanding, and this benchmark has almost 60 different tasks that are super diverse. So it's not just one specific topic; it's a bunch of topics, like everyday-life topics, and, for example, law, medicine, and everything you can think about. And the benchmark is written in a way that makes it easy to measure an LLM's performance and weigh that performance with respect to others. So it's not something that is free-form; it's something that is very constrained: there is a question, then you have four possible answers, and you ask the LLM to choose one of them. So it's a bit like CME 295 exams: part of the exam is also multiple-choice questions, and it's a good way to standardize the evaluation, and it's the same idea that is used here.
And this is also a trend that you see across benchmarks: you don't ask the model to just come up with some free-form answer and then have maybe an LLM as a judge giving its opinion about it, because doing so introduces another layer of potential errors; the LLM as a judge, as mentioned, isn't necessarily perfect. So this framing enables us to have a hard-coded way to extract the answer output by the LLM. Typically, you would ask it to output the right letter at the end of each question, which you can then extract and compare with respect to the reference answer.
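A minimal sketch of such a hard-coded check is below; the exact answer format you enforce (here, a final standalone letter A-D) is an assumption, and real MMLU harnesses differ in the details:

```python
import re


def extract_choice(model_output: str) -> str | None:
    """Take the last standalone A/B/C/D in the model's output as its answer."""
    matches = re.findall(r"\b([ABCD])\b", model_output)
    return matches[-1] if matches else None


def accuracy(model_outputs: list[str], gold_letters: list[str]) -> float:
    correct = sum(extract_choice(o) == g for o, g in zip(model_outputs, gold_letters))
    return correct / len(gold_letters)


outputs = ["The lesion is in the kidney, so the answer is C", "Answer: B"]
print(accuracy(outputs, ["C", "A"]))  # 0.5
```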
To give some examples of what is in this benchmark: as I mentioned, you have all sorts of fields in there, and you will notice that each problem mostly requires some prior knowledge about that topic, so it's not purely logic that will help you solve it. And I think the last example on this slide is a good representation of it: you have something in the domain of medicine, a bunch of numbers, you know, the patient has this and that, and the question asks where you would say the damage is. It's typically something that you could see in medicine books, and the same goes with other fields such as law, where everything has been codified somewhere and you need the knowledge of that in order to answer the question.
Okay, great. So this is the first kind. And it's not the only benchmark in this category; you have other benchmarks that can be in this category, but it's just one of them.
A second category that you might see are those that are in the reasoning space. Typically, these are benchmarks that require some amount of thought before outputting an answer. So it assesses the quality of the chain of thought, or, if you are in the reasoning world, maybe the quality of your thinking tokens, but more broadly your ability to infer a response based on some reasoning. And for that, I'm going to mention two examples: one in the field of math, and one in the field of so-called common-sense reasoning, which is anchored in everyday life and is typically the kind of thing that might be of interest to your LLM users; we're going to see that very soon. So first, let's take a look at the benchmark focused on math. How many of you know about AIME?
So, AIME is an exam that high school students sit for when they want to participate in the Olympiads. It's typically a very hard test covering math topics, and it's in a format that is LLM-friendly, because you have a given problem statement and at the end you ask the student to write the response as a three-digit number. So it's very well constrained, which makes it a right fit to benchmark LLMs, and, just like the one before, the scoring is hard-coded. I give here some samples of the AIME exam as seen this year. As you see, I don't know if you can read from afar, but it's not super simple. You have, you know, one sentence, so you think maybe it's easy, but you actually need to write down the reasoning before finding the answer. And this is what we want to test the LLM for.
And then the second kind of reasoning that we mentioned here was so-called common-sense reasoning, and the one that is often used these days is PIQA: Physical Interaction Question Answering. These are tasks that are deeply grounded in the physical real world. We have some samples on the next slide. They are still reasoning-based questions, but they rely on your understanding of how things work around you, not necessarily math-based, just everyday life. And this time it's not multiple-choice questions over four answers like MMLU; it's over two answers only,
and you have 20k examples. And here is a good example that I really liked from the samples mentioned in the paper: how do I find something I lost on the carpet? One solution says to vacuum with a solid seal, and the other one says to vacuum with a hairnet. Of course, when you vacuum with a solid seal, the seal is solid, so no air can go through it. But if you have a hairnet, you will vacuum the whole thing and the thing you lost will be caught inside of it. So what I mentioned is common sense, but it might not be obvious, and this is what we task the model to resolve.
Okay. So then one other major area for benchmarks is coding, where we want to probe the model on solving complex coding questions, and this has two main uses in real life. One is aligned with the kind of use case that I mentioned at the end of last lecture, the one I liked regarding AI-assisted coding: these models aim at being used in that setting as well, so you should make sure that these benchmarks show that these LLMs perform well, in order to be useful to your users. And then a second reason why benchmarking on coding makes sense is that you have all these tools that you might want to use in an agentic setting, and these tools are maybe written in Python. So you want to ensure that your model has the ability to read and write code, so that it can execute these tool calls and then interpret what's coming out of them. So these are two motivations, I would say, that make coding benchmarks useful here, even to the folks that don't do coding at all.
So yeah, one example of such a benchmark is SWE-bench. I put a question mark here because they didn't define exactly what the acronym meant, but it likely means Software Engineering Benchmark, as SWE is oftentimes the acronym for software engineering.
And what they did is look at popular Python repositories and filter down to those that contained pull requests that were solving an issue and that introduced tests. So you have some before/after behavior that you can quantitatively assess with the tests that are introduced. And supposedly, if you have a pull request that introduces some tests and some fix, you can fairly assume that these tests were not passing without the fix and that they are passing after the fix. So the fact that we have these tests at hand is a good measure for us to assess the quality of a fix. And if you have heard of test-driven development, it's all about, you know, having tests and ensuring they pass, and this is what this relies on.
So yeah, here you ask these LLMs to solve these GitHub issues, and then you assess whether they indeed solved them by looking at the test status before and after applying the patch suggested by the model.
Okay, great. So here is a very nice figure that the paper introducing that benchmark gives. You're given a codebase, and what you ask the model for is a patch, just that. Then you apply whatever the model has provided and look at the test status.
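Here is a rough sketch of that fail-to-pass check, assuming the repository is already checked out and its tests run with pytest; the helper names and commands are placeholders, not SWE-bench's actual harness:

```python
import subprocess


def tests_pass(test_ids: list[str]) -> bool:
    """Run the issue's tests in the current working copy (placeholder pytest invocation)."""
    result = subprocess.run(["pytest", *test_ids], capture_output=True)
    return result.returncode == 0


def resolves_issue(model_patch: str, test_ids: list[str]) -> bool:
    """Fail-to-pass check: the tests should fail before the patch and pass after it."""
    failed_before = not tests_pass(test_ids)
    # Apply the model-generated diff to the working copy.
    subprocess.run(["git", "apply", "-"], input=model_patch.encode(), check=True)
    return failed_before and tests_pass(test_ids)
```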
And then one last area that I want to mention in the case of these base benchmarks is safety.
So when you see fancy LLMs coming out, you usually don't see the safety part in the advertisement of modeling benchmarks, because safety is a bit subjective with respect to the LLM provider: every company has its own policy. So you cannot necessarily compare performance on a given benchmark across models, just because all these providers might not claim they want to perfectly solve that benchmark 100%. As a result, it's not necessarily a good measure of that field. So if you look at model cards, oftentimes you see a safety section being mentioned in reports to describe the work that they have done, but they don't necessarily compare models with respect to a given benchmark.
And usually the safety benchmarks are fairly aligned with what we think should or should not happen. But on top of it, you might have additional policies that are, well, just policies: some human had to make some decision, and it might not be a universal decision, just a given decision. So the benchmark's goal is to be aligned with whatever policy the LLM provider has in mind, in order to be truly meaningful, and this is why, when you execute a safety benchmark, you should check the content of the benchmark in order to put a meaning behind it. So here let's talk about HarmBench, which I am supposing means harmful behavior benchmark.
So this benchmark has four categories. The so-called standard category covers, quote unquote, vanilla harmful behavior. Then you have a copyright category that assesses the model's ability to generate copyrighted content, which we do not want. And then the two last ones, contextual and multimodal, are both contextual, based on a given modality: contextual is on the text modality, and multimodal is with modalities other than text. We're going to see an example on the next slide.
And here you don't have the same ability to assess performance on this benchmark based on some hard-coded match, because these harmful statements might be open-ended, and you cannot possibly solve all of these by regex matching. You know, for example, one example in the standard category of this benchmark tries to entice harmful behavior, like executing something that is harmful, and the paper makes a distinction that is very interesting: it distinguishes model quality from safety, by saying that if the model tries to do the harmful behavior, even if it wasn't successful because it was not of good enough quality, then that's enough to count the attack as successful. And for that, they trained a classifier to recognize these cases. So this is the only benchmark among those that I presented here that is scored by a classifier, which can itself be prone to error, compared to the others, which are grounded in very constrained setups.
And as promised, here are a few examples as mentioned in the paper. Here we test whether you can unlock a door that you shouldn't unlock, and here the test is on influencing someone with respect to some election. So these are not safe behaviors. Okay, great. So far I have mentioned benchmarks that you could solve without tools.
Of course, I say you could; you know, some of them you could of course use tools to solve, but what about measuring the behavior of agents? For this, you have an interesting benchmark called tau-bench, where tau is a Greek letter that you can actually read as tool-agent-user, and this is why we say tau. It's a benchmark that provides, across two different fields, the airline field and the retail field, a set of tools, and it gives a set of policies, you know, things that the agent can and cannot do. And then you have a set of tasks. Tasks are problem statements that you give a given user, and the goal is for the user to achieve that task through the agent.
And the interesting thing about tau-bench is that the user is simulated by a language model. The user's interaction with the agent, as you can imagine, cannot be hard-coded, because further turns will depend on previous ones. Let's say you say something as a user and your agent decides to do something; then you need the context of what it has done in order to continue the conversation. And this is why you have the simulation aspect that the paper introduces, which is typically done by a separate big model that plays the role of the user.
And here we have an example of the task of changing a flight. You have a given set of tools, the agent tries to help the user achieve that goal, and at the end we assess whether it was successful in doing so by calculating a reward that is a function of the database change. Let's say the user has changed their tickets: we want to see if the database has indeed reached the state that we're looking for, and/or that a given action happened. Maybe the action of cancelling is the goal of this task, so this is part of the reward.
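In spirit, that reward check looks something like the toy version below; the field names and the exact scoring are made up, and the real tau-bench reward is more detailed:

```python
def compute_reward(final_db: dict, goal_db: dict,
                   actions_taken: list[str], required_actions: list[str]) -> float:
    """Toy reward: 1.0 only if the database reached the goal state and every required action happened."""
    db_ok = final_db == goal_db
    actions_ok = all(a in actions_taken for a in required_actions)
    return 1.0 if (db_ok and actions_ok) else 0.0


goal = {"reservation_123": {"flight": "SFO->JFK", "status": "cancelled"}}
final = {"reservation_123": {"flight": "SFO->JFK", "status": "cancelled"}}
print(compute_reward(final, goal, ["cancel_reservation"], ["cancel_reservation"]))  # 1.0
```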
And the paper that introduces this benchmark talks about a concept that is a funny word play. With respect to the metric that we had talked about in the last lecture, where we had introduced pass@k, this paper talks about pass^k, which is the probability that all k attempts succeed. And why is that a relevant metric here? As you have seen, the airline and retail domains were the ones chosen here, and an agent in the loop here could be a way to see if automating the agent side of things could help. And in order to truly know whether it can help, you want to have reliability and consistency in mind. So if you execute the task k times, you don't want pass@k, the probability that at least one of them succeeds; you want the probability that all of them succeed, which is why this metric matters.
So if we had more time, I would have derived the formula for pass^k with respect to the parameters of the problem, but I will refer you to the derivation that Afshine did last time, and I just want you to be convinced that this is indeed the formula. If you're not convinced, please feel free to derive it at home.
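For reference, here is my own restatement of the two estimators, with n attempts recorded per task and c of them successful (double-check it against the paper's exact definition):

```latex
% pass@k: probability that at least one of k sampled attempts succeeds
% pass^k: probability that all k sampled attempts succeed
\text{pass@}k = \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],
\qquad
\text{pass}^{k} = \mathbb{E}_{\text{tasks}}\!\left[\, \frac{\binom{c}{k}}{\binom{n}{k}} \right].
```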
And moving on: we talked about all these benchmarks; now let's see how they are grounded in reality. By now, I think every one of you has seen the new Gemini launch from a few days ago. This was the report that was sent out to justify that the performance here was better, and you can see that what we introduced here is mentioned in some form. The reasoning part, with AIME and PIQA, is there. And then you will see that some of these benchmarks are derived in a flavor that introduces multilinguality, multiple languages: this is the case for Global PIQA instead of PIQA. Then for coding it uses a flavor of SWE-bench, and for tool use it also uses a flavor of tau-bench, which is tau²-bench. And a few last words here. I just want to say that benchmarks are here to
characterize the profile of your LLM. So it's not all good or all bad: maybe some of your LLMs will have some strengths and some weaknesses, and your personal experience might guide you to use one specific one over others in given situations. If I had to give my personal experience, I know that the Sonnet models are very helpful for coding, and whenever I want outputs that are fast and cheap, Gemini Flash is usually good. But these are not by any means global recommendations; your own use case and your own experience can guide you into having a profile of models that suits your tasks best.
And you can interestingly plot the performance of your models against another dimension that you care about, for example price, and see, for a given price, what the best model you can use is. The boundary traced by the best models for that specific metric is called the Pareto frontier. And you might have a Pareto frontier with respect to several aspects: cost, safety, context length.
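As a small illustration of how you could find that frontier yourself, here is a sketch with made-up model names and numbers:

```python
# (model name, price per million tokens in dollars, benchmark score) -- made-up values.
models = [
    ("model-a", 15.0, 88.0),
    ("model-b", 3.0, 82.0),
    ("model-c", 0.5, 70.0),
    ("model-d", 4.0, 75.0),  # dominated: model-b is cheaper and scores higher
]


def pareto_frontier(models):
    """Keep models for which no other model is cheaper-or-equal and at least as good (with one strict)."""
    frontier = []
    for name, price, score in models:
        dominated = any(
            p <= price and s >= score and (p < price or s > score)
            for n, p, s in models if n != name
        )
        if not dominated:
            frontier.append((name, price, score))
    return sorted(frontier, key=lambda m: m[1])


print(pareto_frontier(models))
# [('model-c', 0.5, 70.0), ('model-b', 3.0, 82.0), ('model-a', 15.0, 88.0)]
```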
And then a few words regarding data contamination. One thing about these benchmarks is that they are only as good as the assumption about whether the model has seen the actual benchmark answers or not, so make sure it hasn't seen them. For that, people introduce hash values; in the case of tool use, they introduce a block list in order not to access websites that might contain the responses; and in the case of math, we have the luxury of evaluating on new tests that the model has for sure not seen.
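One very simple version of the hashing idea, just to make it concrete; real decontamination pipelines are more involved, and this only catches exact duplicates after normalization:

```python
import hashlib


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def item_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode()).hexdigest()


# Hashes of benchmark questions can be shared without revealing the raw text.
benchmark_hashes = {item_hash("What is the capital of France?")}


def is_contaminated(training_document: str) -> bool:
    """Flag a training document whose normalized text exactly matches a benchmark item."""
    return item_hash(training_document) in benchmark_hashes


print(is_contaminated("what is the capital of   France?"))  # True
print(is_contaminated("What is the capital of Spain?"))     # False
```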
And Goodhart's law is a very good adage that says that when a measure becomes a target, it ceases to be a good measure. So all these benchmark results are to be weighed against what you're truly looking for, and they don't necessarily tell you whether a model is good for you or not.
So, you know, we had talked about Chatbot Arena in one of the previous lectures; it can be one way to gauge the real-life performance of these models. But I would say that ultimately you should be the one trying out these top models and seeing for yourself which one corresponds to you best. And with that, I hope you all have a great Thanksgiving, and thank you.