Complete Beginner's Course on AI Evaluations in 50 Minutes (2025) | Aman Khan
By Peter Yang
Summary
## Key takeaways

- **LLM hallucinations necessitate AI evaluations.** Because LLMs are prone to hallucinate, AI evaluations are crucial for ensuring the quality and reliability of AI systems and preventing negative impacts on companies and brands. [01:32], [02:32]
- **Four types of AI evaluations exist.** The four primary types of AI evaluations are code-based, human, LLM-as-a-judge, and user evaluations, each serving a distinct purpose in assessing AI performance. [02:45], [04:41]
- **PMs must be involved in human evaluations.** Product managers should not completely outsource human evaluations; they need to be hands-on in the spreadsheets to ensure the AI product's success and align it with the desired end-user experience. [05:07], [05:32]
- **Use Anthropic's console for prompt generation.** Anthropic's console, specifically the Workbench tool, is a valuable starting point for generating effective prompts using best practices and input variables. [08:09], [08:35]
- **Spreadsheets are key for building golden datasets.** Spreadsheets are essential for creating golden datasets by defining evaluation criteria, scoring responses, and identifying areas for prompt improvement. [15:13], [17:38]
- **LLM judges need alignment with human labels.** To trust LLM-as-a-judge evaluations, compare their outputs against human-labeled data and calculate a match rate to verify reliability. [38:18], [39:43]
Topics Covered
- Even LLM creators say you must check their models.
- There are only four types of AI evaluations.
- The AI PM's job is in the spreadsheet.
- Your AI judge must align with human judgment.
Full Transcript
The CPOs of these companies are telling you evals are really important. You should probably think, what are AI evals exactly? I think there are really just four types of evals.
And we can define what's good or bad for any of these. So all you're doing here is basically building out a golden dataset in the first place. So we're going to hit generate and let's see what the agent comes up with. Okay, so it starts giving me a prompt. You want the LLM as a judge to align with your human labels and the judgment that you're giving it. Let's get super practical. There are a lot of eval videos out there about frameworks and blah blah blah. I just want to actually walk through a real example. All right, let's look at one.
Okay, welcome everyone. My guest today is Aman, head of product at Arize. Today, Aman and I are going to show you exactly how experienced PMs like ourselves run AI evaluations using a real example. We're going to walk through all the steps, including defining the evaluation rubric, creating our human-labeled golden dataset, running LLM-as-a-judge evals, and more. So welcome, Aman.
Yeah, thanks for having me back, man.
It's awesome to be back, and it feels
like a lot has changed since we last
talked about this, so I'm pretty excited
to dive in.
All right, so let's do a quick recap and then we'll dive into our example. What are AI evals exactly, and why are they the most important skill for PMs to build?
So I think first off, we know this to be true: LLMs hallucinate. We've all experienced this in our day-to-day life using LLM-based products or trying to build around this for our company.
But you don't have to take my word for it. Don't take Peter's word for it. The CPOs of these companies are telling you that you should probably have evals when you're actually building AI products. If the people selling you the LLMs in the first place are telling you evals are really important, so that when LLMs do hallucinate they don't negatively impact your company or your brand, then you should probably think, okay, how should I use evals to actually build products that matter? So I think that's a really important component: don't just take our word for it. The actual CPOs of these companies, the product leaders, are telling you evals are the emerging skill, because LLMs hallucinate and evals measure the quality of your AI system in the first place.
Yeah. It's like the people selling these models are telling you to check the model's answers. So it's probably true. Yeah.
Exactly. Yeah. I think there are really just four types of evals deployed, all the way from development through to production with AI.
The first of those four types is code-based evals: binary pass/fail checks, like checking for the presence of a string in a message. For example, if you're an airline like United and a user asks, "How do I book a flight with Delta?", you probably don't want to respond with an answer about how to book a flight with Delta if you're United. Just checking simple things like that is level zero or level one of code-based evals. They get a lot more complicated, but the beauty is that because of code generation today, it's actually become a lot easier to create code-based evals.
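As a rough sketch of what a level-zero code-based eval like this might look like, here is a minimal pass/fail check for competitor mentions. The function name and brand list are illustrative assumptions for the airline example, not anything from an actual product.

```python
# Minimal sketch of a code-based (pass/fail) eval: flag responses that mention
# a competitor brand. The function name and brand list are hypothetical.
COMPETITOR_BRANDS = ["delta", "american airlines", "southwest"]

def passes_competitor_check(response: str) -> bool:
    """Return True if the response avoids mentioning any competitor brand."""
    text = response.lower()
    return not any(brand in text for brand in COMPETITOR_BRANDS)

print(passes_competitor_check("Here's how to book a flight with Delta."))    # False
print(passes_competitor_check("Happy to help you book your United flight!"))  # True
```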
Then you have human evals, which is the PMs or the subject matter experts doing a thumbs up or thumbs down on an individual back-and-forth. That's really: was the answer good? Did it meet the criteria? We're going to go deeper on this, but that's having a human in the loop.
Then there's LLM as a judge, which is probably the most prolific example of scaled-up evaluations. That's having an LLM system act as if a human were labeling the data. It's similar to training an agent or an LLM to respond like a human by giving it a label.
And then there's the user eval, which I think of as the closest thing to a business metric, and I'm curious what you think here too. This is how you collect data from the real world. It could be an actual user interacting with your application, or a downstream customer giving you feedback on how their experience went.
So that's it. These are generally the eval types that come up pretty much 100% of the time.
Yeah, I totally agree. And having done some of this, I think the human in the loop is really important. There's no magic formula for this stuff. You have to care about how this stuff is going and really have attention to detail.
Yeah, exactly. And that's when people ask, "Well, where does the PM fit in here?" The PM's job is to have judgment on what that end product experience should be. Being in the details on the human evals is really what determines whether your product succeeds or fails.
Yeah. I never found it useful to just completely outsource the human evals to contractors or labelers. The PM has to be in the spreadsheet doing the work themselves too, you know?
Totally, totally. I can give you an example. Even when I was working on self-driving cars way back in the day, we would look at data almost every day, or at least once a week, and go in and label: should the car have done that or not? It's the same thing with these agents. Just look at your data and pull your team together to make a determination of what's good or bad.
Yep. There's a lot of spreadsheets involved, as we'll see soon from the demo.
Yeah. So let's get super practical now. There are a lot of eval videos out there about frameworks and blah blah blah. I just want to actually walk through a real example, right?
So for example, I love to run, even though I run pretty slowly. Let's assume we're building a customer support agent for On running shoes. On is a European brand that's gained a lot of popularity recently. And this product actually exists: if you go to the On website, there is a customer support agent. Do you want to quickly show it?
Yeah, let's do it. Let me pull it up.
All right. There it is. That's great.
I think somewhere on the bottom they have a customer support thing.
Oh yeah, here it is.
All right. So this is not a real human, right? This is a support bot. So maybe ask it about the return policy or something.
Yeah. How do I return my On Cloud running shoes?
So while we're waiting, there are a couple of things to notice as we look at the responses. It responds with emojis. It asks follow-up questions if it needs more information. And then it suggests what I can go do next. These are generated suggestions of actions that the agent probably has access to, shown to the user: return an order, or ask more about the policy. So it has some branching logic to figure out the right next step, and now it's giving me more information on the policy.
Okay, great. So before we start doing the evals, we have to write a prompt, right? To actually get our customer support agent working. Can you show us how to write this prompt and what kind of information we need to give it to make it great?
Yeah, good question. So let's pull this up. Anthropic has a really great product for this. If you go to Anthropic, they have this console, and the console helps you get started from zero to one with writing a prompt in the first place, so it's a really good starting point. We're going to build this prompt in Anthropic's console, using a tool called Workbench, which is really cool. It helps you get started with building an agent, and a core part of building an agent is the initial prompt you give it, with the context to help it perform the task. Rather than trying to do this all from scratch, we can use this generate function, which creates a prompt using best practices right off the bat. So what I can do is say: design a prompt for a customer support bot that handles customer interactions for On running shoes. It's pretty forgiving; you don't have to be super descriptive, but it's good to give it a couple of reference points. So you'll take an input variable, which is maybe going to be the user's question.
Mhm.
Product information, and it should respond... oh, and then the policy information as well. So there are kind of three pieces of information here. There's the user question. You're going to take input variables, which are the user question, the product information, and maybe the policy, which is the return guidelines. So it knows about On, it has the user's question, and it has policy information like return guidelines, and it should respond with a helpful answer.
How does this look to you Peter?
That looks great. Yeah.
Awesome. So we're going to hit generate and see what the agent comes up with. Okay, so it starts giving me a prompt, and it's actually putting in the variables for me here. That's what prompt variables look like, and it's giving examples too. The curly braces represent variables, and it even calls out that the variables are placeholder values that make your prompt flexible and reusable. So we can hit continue, and let's say we want to use this to iterate on, or to actually start building out this customer support agent. A pretty good first pass, and it even puts in an example of a back-and-forth the agent should handle. Cool.
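To make the curly-brace variables concrete, a templated prompt in this style might look roughly like the sketch below. The wording and variable names are illustrative, not the actual prompt Workbench generated.

```python
# Illustrative prompt template with placeholder variables, in the spirit of
# the generated prompt described above. Wording and variable names are made up.
PROMPT_TEMPLATE = """You are a friendly customer support agent for On running shoes.

Product catalog:
{product_info}

Return and exchange policy:
{policy_info}

Answer the customer's question helpfully and accurately.

Customer question: {user_question}"""

filled_prompt = PROMPT_TEMPLATE.format(
    product_info="Cloudmonster: max-cushion road shoe ...",
    policy_info="Returns accepted within 30 days of purchase ...",
    user_question="How do I return my Cloudmonster shoes?",
)
print(filled_prompt)
```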
Yeah. So you have to actually add the product info and the policy info, right?
Yeah, exactly. The product info and policy info are going to come directly from the On website. So let's pull that in and run an example. I'm going to hit run, and it's going to ask me to provide this information, the product info and the policy info. I have the On website up on my end, so I'm just going to copy over a ton of the context from the On website itself. Let's copy the return policy. Okay, here we go. That's the policy info, copied over from the On website.
Then there's the product info, which also comes from the On catalog. Let me copy that over too.
And I think we just got Claude to look up the On website and extract the shoe prices, descriptions, and names. I think that's how we did it.
Yeah, exactly. That's a great point. You can get started with this by just asking ChatGPT, "Get me the policy info for On," or copy-paste it in and it'll figure it out and give you a little bit of context to get started with. And for the product info, in your case you went to the On website, copy-pasted that into Claude, and basically said, "Give me my product info to get started with." So that's a good starting point. And then, what's a problem you hit recently with your On running shoes, Peter?
Let's say I bought the Cloudmonster shoe. It's been like two months, but now I want to return it. I don't like it anymore.
Okay, so now you want to return it. Yeah.
So now we'll hit run, and this is what the response looks like from the LLM on the fly. We're using Claude Sonnet, one of the latest models, and you can adjust any parameters you want. Then you get this initial response. It says, "I understand you'd like to return the Cloudmonsters you purchased two months ago. Unfortunately, I have to let you know that our return policy is 30 days." So you've actually exceeded the return window, right? But what do you think of this response as an initial take?
I mean, it looks policy compliant, and it seems pretty helpful. I kind of wish it was a little more concise, but overall it seems good.
Yeah. So as an initial first pass, it's not bad as a response. But as a PM, you're going to be nitpicking and wanting to make sure this thing is actually good enough to deploy. You can't just look at one example and think, okay, is this deployable? Can you hit deploy and have all of your support bot questions go through this one bot instead of humans? So what do you want to do in this case? I think it's great to copy this example, and why don't we build a spreadsheet of what's good or bad and some of the criteria for good and bad to get started with. As a PM you're probably familiar with building things in spreadsheets, and I think spreadsheets are the ultimate product to use for evaluating LLMs in the first place.
And you mentioned a couple of things, Peter, as we were chatting. One is product knowledge: does it know about the product in the first place? Another is whether it's following the rules. This is a very common one we see users in the space checking: is the agent actually following the rules we've given it? And then, what's the tone? We can define what's good or bad for any of these. You can go in and change the definitions of the criteria for good, bad, and average. It's just meant to be something you can give to your team members and say, grade the output as good, bad, or average.
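For readers following along, the rubric being described boils down to three criteria, each graded good / average / bad. A sketch of how you might jot that down follows; the definitions and the example row are paraphrased for illustration, not the exact wording from their spreadsheet.

```python
# Sketch of the grading rubric as data. Definitions are paraphrased examples,
# not the exact criteria used in the episode's spreadsheet.
RUBRIC = {
    "product_knowledge": "Does the answer correctly reflect On's products?",
    "policy_compliance": "Does the answer follow the return policy it was given?",
    "tone": "Is the answer upbeat, friendly, and reasonably concise?",
}
LABELS = ["good", "average", "bad"]

# One graded row of the golden dataset might then look like:
example_row = {
    "question": "I bought the Cloudmonster two months ago. Can I return it?",
    "ai_answer": "Our return policy is 30 days, so this purchase is outside the window...",
    "labels": {"product_knowledge": "good", "policy_compliance": "good", "tone": "average"},
    "notes": "Correct per policy, but wordy.",
}
```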
Yeah. And you talk to your team to come up with this, but you can also work with AI to come up with it, right? I think the rubric is the next step after coming up with the prompt.
That's right. So once you have the prompt in place and you start getting some data, you can start building a golden dataset. It's the combination of taking that initial prompt you built with the context, taking a look and asking, okay, does this feel right? Now I want to get more granular across my metrics. And the metrics are: it should know about the company you're building it for, it should follow the rules, and it should respond in the way you want it to, just to get started with. And so we've got a couple here.
Let's go ahead and copy this one back in. This is actually a pretty similar question. Maybe we can see what the output is for this one, so I'll adjust the variables. You can just change the user question, and I'm going to say, "I bought the Cloudmonster 3 weeks ago." Generate the output.
Interesting. So it says it's been 3 weeks, you're still within our return window. Okay, let's copy this back to the spreadsheet, and that's going to be the AI answer.
Yeah.
So all you're doing here is basically building out a golden dataset in the first place. And now we can grade this across the same rubric we had before, starting with product knowledge. It seems to know about the Cloudmonster, so I'd say that's good, probably. What do you think?
Yeah, from a product knowledge perspective it's good. But this is mostly a policy question.
Yeah, it's mostly a policy question. Okay: "I have some good news and important information for you. Since you purchased your Cloudmonster directly from On and it's been 3 weeks, you're still within our return window." But you lost the box.
Interesting.
And it says, "I'd recommend contacting our customer support team."
Yeah. To me, it seems like the policy should actually cover what happens if you lose the box or if the shoe is damaged, right? It should cover that. I don't think that information is in the policy.
Yeah. Interesting. So would you say the eval here is unknown, bad, good, or...?
I think the eval here is maybe good based on the existing policy, but the policy itself needs to be improved. So maybe we put that as a note. This is how you improve a product, right?
Yeah. We need a better policy.
Yeah.
A policy to handle the box being thrown away. For what it's worth, if we were working together on this and I'm a PM on your team, I would probably push back a little bit and say, well, is it in the policy or not? Are we actually taking a close look at the policy? Or, because it's not in the policy, the agent is now saying "I recommend contacting our customer support team." Should we instead just have said, "Sorry, that's not covered in our policy," and been more direct? Because what you're actually doing is pushing the blame over to another team to go handle this, and it's just going to create more work. But it's a borderline one, right? It's probably a good one to debate on your team, and it's a healthy debate to have: how should this agent have responded in the first place?
Yeah, I think that's a good point, because maybe it should just default to saying it's not in the policy, since there could be all kinds of crazy edge cases with the shoes and the boxes.
Yeah. And then if it's not in the policy, how should the agent respond in that case? Maybe that's a rule that should be in the agent rather than in the policy, like "recommend a next action." So that's a good product debate to have.
I also noticed that it doesn't actually tell me how to contact a support agent.
Exactly. And if you look at the policy from On, I think it does have support information that you should be able to access.
Oh, really? Okay. Then the policy compliance is not good, then.
Yeah, so it's probably bad. And on top of that, the tone is fine, but if we want this to be a customer support bot, it's a little bit formal. If you look at how we were interacting with the On bot before, it's a little more upbeat and cheerful.
Yep.
This kind of feels formal. I'd probably say this is average to bad. But it's a healthy debate, because if you have multiple people labeling the data, you get better labels overall. So we're already doing evals: we're looking at the data, we're evaluating it, we're debating back and forth what the label should be, and it's based on that that you're going to start building your LLM as a judge.
And I want to bring up the point that this stuff is not super sexy. You're in a Google Sheet. But this is probably one of the most important things to get right with your team, because if what goes into this sucks, then the rest of your evals will be terrible.
Totally. It's the most important step before you start trying to build anything complicated on top, like LLM as a judge or code-based evals. Just look at the data and debate: are these the right metrics for us to look at? Do we have the eval criteria in place? Do we know how to evaluate this or not? And we just start with five rows of data, just to get started.
And I think you actually filled out the rest of it, right?
Yeah, we'll fill this out really quick. Okay, so here we have all of the responses laid out from the initial questions. This is basically just generating data to get started with, to start grading. And I went ahead and started grading some of the data as well. So in this case we've got good product knowledge, policy, and tone.
Tone is kind of bad in places. But what's interesting is that as I was going through, I saw a couple of examples that are kind of interesting. Let's take a look at this one really quick: "I just placed an order 45 minutes ago, but need to change the delivery address. Is that possible?" And it says, "Unfortunately, I have some news for you. Because it's been 45 minutes, you've passed the 60-minute window we provide for order cancellations." Fun fact: LLMs are really bad at math. It seems to not realize that 45 minutes is not past the 60-minute window. And this is actual real-world data.
And so I gave that one a bad, right? That means the LLM is taking the policy and just making stuff up.
Yeah, that's not good.
So that one's one I want to go take a look at with my team and go deeper on.
Yeah. And what I like to do is write little notes next to the bad ones. In reality we're going to be doing more than five, right? And what I do at the end is use the LLM to summarize all my notes, like, "Hey, here are the top three things we need to fix with the prompt or the product."
Yeah, exactly. That's a good example of where the notes help you figure out, "I need to go take a closer look here."
Yeah. So maybe the prompt update here is something like a "very important: please check your math" line or something.
Yeah, maybe. What's kind of cool, as we get deeper into this, is that LLMs are really good at suggesting prompt improvements too. So Peter, the fact that you're putting notes and labels on top of your data means you can take that data, use it back in your Anthropic console or wherever you're doing development, and prompt it again to improve on the data you gave it. That's self-improving agents in a way, or creating a self-improving agent.
Cool. So now we have our human-labeled dataset. Let's talk a little about this, actually, because it's kind of important. When you're just starting to build this product, how many examples should you have? Do you have any rough guidelines?
It usually depends a lot on the type of use case you're building for. If it's something in a highly regulated environment, you need a lot of data to feel confident. It's really a statistics question: how much data do you need to feel confident in your evals in the first place? If you're just getting started with a proof of concept, I think there are a couple of ways to phase your product development. If you're going to do internal product development first and just launch to a small subset of users or test it yourself, you can get started with at least five, and usually around 10 or more examples, for the type of testing we're doing, which is to see whether this thing is even worth investing in further with our team and the data we have, or whether we need to invest in better data or better tooling upstream. So start with 10.
If you're building for production and you want more confidence, that's when you might need maybe 100 examples or more, depending on the industry. So that's a general rule of thumb: for an internal test, start with around 10 rows and just take a look at the data; if you want to scale up, get closer to about 100 rows.
Yeah, I think that's a good benchmark, because in the beginning you're still making a lot of updates to the prompt and the product, right? You don't want to have to do 100 manual evals each time; that would take forever.
You know, yeah, you're really there's a
few dimensions as PMs. We think about
things in this way, but you're you're
actually trying to optimize for speed of
iteration and confidence in result. And
these are kind of two sort of orthogonal
dimensions. And the more data you have
the more confidence you have, you are.
But the longer it takes for you to
iterate, the fewer data, less data you
have, the faster you can iterate, but
you're not going to feel as confident.
So, you need to kind of define what that
looks like for your team. get started.
Okay. Um so why don't we actually uh
simulate kind of update the prompt real
quick?
Yeah.
If you go back to the Anthropic console... let's just say the answers are too long, man. It's too much text to read. How do we make the tone more friendly?
Yeah. So Anthropic has a really cool feature here that lets Claude optimize the prompt, and you can just say, "Make it sound more friendly and less formal." This will impact your tone. It's actually going to take a look at your data based on the example you have.
One interesting thing here is that it's doing a ton of work. It's building a Mermaid graph to define what "friendly" should even be. So it's writing code, because I think it's a really strong coding agent, and then that's used in the reasoning to go back and iterate on the prompt. It's a multi-step prompt iteration flow.
In my opinion, it can be a little bit of overkill, especially since the main addition it made is just to say, "Your role is to provide helpful and accurate information," and then, "Here are the steps to follow. Format your response with: always do this, always do this." It's again super formal rules it's putting in place. Do you need all of these rules? Is this just adding more to your prompt? It's not clear; I'm not really sure how much this will improve my prompt at scale. We can run one example just to see and take a look at it.
Mhm.
It still doesn't sound super friendly. In fact, it almost got longer. And that's the interesting thing about these LLMs: you need real-world data and a little bit of taste and nuance to iterate on the prompts, and to know what data to feed in. So this is an example where having human labels and evals might actually give you a better optimization once you start looking at those results.
Yeah. And I guess you can also include a few few-shot examples of what a good response looks like, right?
Exactly. There are other things you can do as well. This tool gets you started with prompt creation, but it writes the prompt as a user prompt; you might want to start with a system prompt. So there's a lot of nuance here. But to get started, you actually need real-world data. And I think this is one where, if I'm iterating on it, it's not clear to me that this was better. Honestly, I think it might even be worse.
Yeah. But this is actually core to an AI PM's job, right? You literally just iterate on the prompt, run the evals, iterate on the prompt again, maybe 100 times, and hope it gets better.
Exactly. And ideally you have some way to do this at scale with more data and iterate with more confidence as well. But I totally agree, getting started, this is the reality. Just to recap: you start with an initial prompt, you feed in data, you put the data in a spreadsheet, and once you have the data in a spreadsheet you label it, you figure out which dimension you want to improve on, you go back, you iterate on the prompt, and you get a new result. That's what the journey looks like.
Yeah. And maybe you also add another criterion or something.
You add another criterion.
Another way to look at this is that once you've started to get a good feel for this type of data, you might want a tool to help you run these evals at scale and feel more confident in the eval results. That's actually what we're building; this is what I work on. It's a platform called Arize that helps you iterate on your prompts using data, and it's another good starting point, another tool you can think about in your toolkit. You can upload an example CSV, so I went ahead and downloaded the data we have from the Google Sheet, and I can upload it and use it as a way to get started. So let me upload that here.
And we'll just call it "Peter's On"... or let's call it "On golden dataset."
Okay. Our very rigorous five-example golden dataset.
But you know what? Five is better than nothing. That's the best part: it's so much better to start with some data that you can then iterate on top of. That's really the whole point, just start somewhere. Now we've moved our CSV over to Arize, where we're going to upload it as a dataset. What's really cool is that it's the same data you saw in the CSV already, but the beauty is that we can take this dataset and iterate on it with our evals set up in a user-friendly environment. This is the same prompt I had in my spreadsheet, and these are the same variables. And if I look at this, every single row I put into the spreadsheet is loaded in, so I can replay multiple examples. You know what we were doing before, Peter? We were looking at one example at a time.
Yeah.
And trying to figure out, okay, is this good or bad? Here, I can put in hundreds or even thousands of examples and replay through my prompt what the new output is.
On top of that, we can add evaluators that match what we were looking at in the spreadsheet, which is our criteria. I actually just copied that in as the prompt. I said, take a look at the output from the playground, which we're going to run, and the job of this is to build an LLM-as-a-judge type of system. So this is the eval prompt: it takes the criteria we had in the spreadsheet and turns it into a prompt for an LLM to give us a label. I'm saying, here's the output, here's the question, here's the product information and the policies; give me a score or label of good, bad, or average based on that criteria.
Yep.
And we've got three evals here, the same three dimensions we were looking at before: product knowledge, policy compliance, and tone. It's really just encoding the same information we had from before.
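To show the shape of the pattern being described, here is a rough sketch of an LLM-as-a-judge eval for the tone criterion: hand the judge the question, context, and output, and ask for a label plus an explanation. The prompt wording and model choice are illustrative assumptions, not the actual Arize eval template.

```python
# Sketch of an LLM-as-a-judge eval for tone. The judge prompt wording and
# model choice are illustrative assumptions, not the platform's template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a customer support response for tone.

Customer question: {question}
Policy info: {policy_info}
Agent response: {response}

Label the tone as one of: good, average, bad.
A good response is friendly, upbeat, and concise.
Reply with the label on the first line and a one-sentence explanation on the second."""

def judge_tone(question: str, policy_info: str, response: str) -> tuple[str, str]:
    """Ask the judge model for a tone label and a short explanation."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, policy_info=policy_info, response=response)}],
    )
    label, _, explanation = completion.choices[0].message.content.partition("\n")
    return label.strip().lower(), explanation.strip()
```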
Yeah, it's the table that we had, right? The criteria table.
Same table. So let's turn those on here, X out of this, and now I can replay it. I can also try out a different model and see if that gets me a better result. Which one do you want to try?
You want to try GPT-5? Yeah.
All right, so let's see. Instead of Claude, now we're comparing it to GPT, and we're going to regenerate the same results and see if this is better or worse than what we had initially.
Is it the same five questions, or is it generating new ones?
Same five questions that we just ran through. I can even show the question. GPT-5 might be a little slower because I think a lot of people are probably hitting it today, but let's see.
While it's running, we can take a look at one of the questions. The question is, "I'm training for my first marathon. I need something with maximum cushioning. What shoe would you recommend?" And this is the answer: "Hi there. Big congrats! For maximum cushioning, I recommend the Cloudmonster," plus a couple of great alternatives, and it also gives me more shoes with the product information. It's pretty good.
Yeah, it's pretty good.
And what's interesting is that it's actually pulling that product information and doing a pretty good job. What else is cool is that I was running those evals on the text in the first place, so I actually have the compliance, product knowledge, and tone information as eval rows. I can go take a look at that in a more granular way: the evals I just generated are actually just new columns here.
Okay. And now the AI is doing the evals.
The AI is doing the evals.
What else is really powerful here is that when you're doing AI evals using LLM as a judge, I highly recommend, especially when you're getting started, having the LLM create an explanation as well. Why that's useful is that it's a lot like having a human give you notes on why the LLM is giving you a label. So you can see here, this is the explanation for the policy compliance eval, and it says the response adheres perfectly to the company's policy. This is new information that you can use to judge whether the LLM as a judge is giving you a good eval or not. Do you agree with the explanation it's giving you?
Yeah.
Does that make sense? Has it actually rated anything bad?
No, it gave everything good.
What about some of the other criteria?
Yeah. So that's the thing. Let's take a closer look at some of this. Personally, I think this might come down to a difference between how we've defined the eval in the first place and what our human judgment looks like. If I look at the response here, it says, "Oh, I'm really sorry your Cloudboom Strikes wore out. That's really frustrating. Here are some recommendations." It's super long, a really verbose answer it's giving to the user. So what we can do is actually put human labels on some of this.
Then we can compare against the human labels. From this, in the same platform, I can create human labels or just upload them as a CSV the same way we did before, and then we can use those human labels to compare against the LLM-as-a-judge labels. That gives us a good sense of whether this result is good or bad based on the human label, and whether it matches the LLM as a judge, which then tells us how good the LLM as a judge is. Because right now it's just giving everything "good," do we even trust this result or not? We kind of need human labels to verify and align the LLM as a judge.
Yeah, my takeaway from this is that you want the LLM as a judge to align with your human labels and the judgment you're giving it.
So this is a toy example, right? But in a real example, for LLM as a judge you probably have at least 100-plus examples, or even more. As a human, I don't want to review all 100 examples one by one. Should I look for the examples that are bad, or how do I look at this stuff?
Yeah, I think it's helpful to start with at least a few, like five or 10, to get started. Remember how we were talking about designing a prompt for the agent and labeling that data in a spreadsheet? That's the human eval on your system prompt or your agent prompt. You need to take a similar approach with providing labels against the eval labels for your LLM as a judge. When you start with five or 10 examples, you can start figuring out, do I trust the LLM as a judge or not? Then you can go run it on 100 examples and see, okay, what's the score on 100? Let me go look at the bad ones. Because right now I don't even trust the result on these five; it's just saying everything is good. So I need to label these with a human label and run the LLM as a judge on them again to see the match rate. That's actually a type of eval called match rate, which matches the human judgment against the LLM as a judge so you can see how much they line up.
We can run those really quick and see.
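The match rate they run next is just agreement between the human labels and the judge's labels, something like this sketch (the label values below are made-up examples):

```python
# Match rate: the fraction of rows where the LLM judge's label agrees with the
# human label. The example label values are made up.
def match_rate(human_labels: list[str], judge_labels: list[str]) -> float:
    """Return the fraction of rows where the two label lists agree."""
    assert len(human_labels) == len(judge_labels) and human_labels
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

human = ["bad", "good", "bad", "good", "average"]  # labels from the spreadsheet
judge = ["good", "good", "good", "good", "good"]   # the judge called everything good
print(f"Tone match rate: {match_rate(human, judge):.0%}")  # 40%: time to fix the judge prompt
```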
But do you have the human labels there, or...?
Yeah, I have the original human labels that I'm going to compare. But we can also just go ahead and label this in the platform, if you want to do that.
Yeah. Let's look at one example.
All right, let's look at one. So let me go back here. You can actually create labeling criteria here.
Uh-huh.
So I could call it "Peter's eval." I'm just going to assign it to you, Peter; you're in the account. And we'll say, okay, look at all of the outputs and give me your judgment. We're going to put in the same annotations we had before, and make them categorical. So you have to do a little bit of setup here, but we'll do product knowledge, with good, average, and bad.
Okay.
Yeah. So we can test this one to start with and basically ask: does it have good or bad product knowledge? Actually, I kind of like the tone one too. Maybe we can do that really quick.
Yeah, let's do it quick.
So we'll do tone, with good, average, and bad. Let me call it "tone for On." That's going to create the labeling queue, and we're going to send these examples to a labeling queue there.
Okay. So now it's going to compare the human labels with the LLM labels.
Yeah, well, this part is actually just doing the human labeling on the output. We're going to label the data again here. So is this tone good? I think we're saying the tone is kind of bad because it's kind of long, right?
Yeah, long response.
And then, okay, product knowledge. We can take a look at a few of these and say, okay, is this good or bad? In this case it's kind of hard to say from here: "You're reaching out about the Cloudmonsters. You're within our 30-day window." I mean, that's good. So we'll just go in and do that really quick as well.
Yeah. I mean, they're pretty long.
Okay.
So, we got a few of these now.
Okay. Good.
So now we can do that step where we have human labels and we compare the human label against the eval label. We're going to do the tone match really quick. I'm going to say: take a look at this column, which we just named from our eval, the tone annotation label, and compare it to the eval for tone. What it's going to do is give us a match or no-match score on that output.
And we'll do the same thing here for product knowledge. We didn't end up doing policy, so we'll just run those two. Give it a second, then refresh the page.
All right, so we have our score here. This is the match score for the experiment we just ran: does the tone match? It looks like the product knowledge matches 100% of the time, but the tone only matches once against our human labels. So this is an example of where we probably have some room to improve our eval based on the human labels we had.
Makes sense.
Yeah.
Yeah. So now you basically have both the prompt for your actual product and a bunch of prompts for evals. Given this result, you should go off and improve the tone eval prompt, right?
Exactly. You should go improve the tone eval prompt to be stricter about how long the response is. Maybe you put criteria in there like, "Don't call it a good response if it's really long." Then you can take this information and go improve your system prompt, back in the Anthropic console or in Arize as well. You can take that same system prompt, create another experiment here, pull in the same prompt, tell it not to respond with something super long, and test it against that eval once we've done that.
Yeah.
All right, man, this is great. This is super practical. So let's recap the whole process for our viewers. We're trying to build an On customer support agent for shoes. The first step is to write the prompt and gather the relevant information, the policy and the shoe information, into the prompt. The second step was to create the eval criteria, like the three different buckets. And the third step was to do manual evals with a bunch of example questions and score the answers in your spreadsheet, and then you and I can debate it.
Yeah, exactly. Start with a basic prompt.
Yeah.
Exactly. This is basically all of the steps live in the process.
Yeah. Okay.
So let's walk through this: start with a basic prompt, then categorize and define failures. Yeah, so that's basically the eval criteria, and then you do the manual labels.
Yeah.
And the manual labeling step is pretty involved, right? You probably spend a good amount of time on that step, just iterating on the prompt and iterating on the criteria. There's probably a lot of work there.
Totally. I think this is "manually label and debate," and then there's a ton when it gets into the actual iterating. There's probably a step in between, which is iterating on the actual data.
Yeah. That's a loop that goes back and forth, maybe 20 times.
Yeah, exactly. It just goes back and forth a bunch to start, at least initially, and then you can start thinking about LLM-as-a-judge evals. Start with 10 rows, and the goal is really to align with your human labels.
Yeah. But before you go to prod, or even run a test, you probably have at least 100, right? 100 examples.
Yeah, exactly. That's what this would look like: align with human labels, and once you feel good about the 10 examples, do the same process on 100.
Yeah. Okay.
Then, and we haven't really covered it yet, you should probably launch this thing as an A/B test first, maybe to 10% of traffic or something, and then see if actual users complain or not, right?
Yeah. An A/B test, maybe 1% of traffic or something, depending on the type of company you have. Usually you start internal, plus maybe 1%.
I think people dogfooding their own product helps a ton here too, so you can get a feel for whether it's good or bad.
Yep.
And then you actually get real-world data, and that's what gives you a good sense of how good the system is overall.
Yeah. You know, I've thought about making the actual users be the labelers, like if they thumbs-down something or say something is bad, or letting them score the policy compliance. It saves the team a bunch of work. But I haven't really seen that pulled off. Have you seen it pulled off?
You know what's interesting about that? Having your customers actually label your data will be a useful signal for you, for sure. This is where customer labels come in, and you should take a look at that data. The problem is, let's say you have an example where you're frustrated with your On support bot, Peter, that you now have to interact with, and you give it a thumbs down when it says you're not eligible for a refund. What does that actually mean? Does it mean your support agent did something poorly? Or are you just mad at the support agent, so you gave it a thumbs down? How do you interpret that signal? So there are a few questions I recommend folks ask their teams and debate, like: what happens when the eval is good but the business metric goes down? How do you break that tie, if the thumbs down comes from a user but the system looks good? That's a really important question to ask, because this world is non-deterministic, your LLM is going to produce different answers for different inputs, and you don't always know whether you should trust your user's label either. So that's a really good example of where you have to go back and look at the data again.
Got it. Yeah, that makes sense.
All right, man. Well, this has been super great. I'm glad we were able to do this live. As people can see, it's kind of messy; you have to go back and forth a lot and figure stuff out. But this is what evals look like in practice.
Yeah. Thanks so much for having me on. This was a ton of fun. And yeah, it is super messy to get started with. It's kind of funny, people are always looking for a silver bullet, but I do think that PMs understanding how this process works is going to be really important for us to live in a world where AI products actually work and don't suck in the real world. So thanks for having me on so we can talk about this and build real-world AI products.
Yeah, man. And where can people find more of your stuff?
You can find me on LinkedIn and X, and you can also just go to amman.ai, where you'll find all the different links to subscribe. I'd love to chat with folks who are building real-world evals or real-world AI systems.
Awesome, man. It's always a pleasure chatting with you, my friend.
Yeah, likewise. Thanks again. Good to see you.
All right. Bye.