
Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize

By AI Engineer

Summary

Topics Covered

  • AI PMs Need Evaluation Literacy
  • Even Top Models Hallucinate
  • Agents Demand Non-Deterministic Evals
  • Replace PRDs with Eval Suites

Full Transcript

All right, nice to see everyone here. My name is Aman. I'm an AI product manager at a company called Arize. The title of the talk is Shipping AI That Works: An Evaluation Framework for PMs. It's really a continuation of some of the content we've been doing with PM folks like Lenny's podcast. Quick show of hands: how many people listen to Lenny's podcast or have read the newsletter? Awesome. Okay, we're going to do a couple more audience-interaction things just to wake up the room a bit. So, how many people in the room are PMs or aspiring PMs?

Okay, good. A good handful of people. How many of you consider yourselves AI product managers today? Okay, awesome. Wow, there are more AI PMs than there were regular PMs. That's interesting. Usually that's a subset, but maybe I need to start asking the questions in a different order.

Cool. Well, that's great. So what we're going to do is this: I'll do a little bit of an intro about myself, and then we'll cover some of the frameworks I think are really powerful for AI PMs to get to know as you're building AI applications.

A little bit about me. I have a technical background: I started my career in engineering, working on self-driving cars at Cruise. While I was there, I ended up becoming a PM for evaluation systems for self-driving, back in 2018 and 2019. After that, I went to Spotify to work on the machine learning platform and on recommender systems, things like Discover Weekly and search, using embeddings to make the end product experience better. Fast forward to today: I've been at Arize for about three and a half years, and I'm still working on evaluation systems, just for self-writing code and agents instead of self-driving cars.

Spotify is actually one of our customers, so we get to work with some awesome companies. Actually, fun fact: I've sold Arize to all of my previous managers. But we get to work with companies like Uber, Instacart, Reddit, and Duolingo, a lot of really tech-forward companies that are building around AI. We started in the traditional ML space of ranking, regression, and classification models, and have now expanded into gen AI and agent-based applications as well. What we do is make sure that when our customers are building AI applications, those agents and applications actually work as expected. It's a pretty hard problem. A lot of that has to do with terms we're going to go into, like observability and evals. But more broadly, the space is changing so fast, the models, the tools, the infrastructure layer, that for us it really is a way to learn about the cutting edge: what are the leading challenges with the use cases people are building, and how do we build that into a platform and product that benefits everybody?

So here's what we'll cover. We're going to cover what evals are and why they matter. We'll actually build an AI trip planner with a multi-agent system; that part, bullet number two, is ambitious, I'll be honest. We were trying to push up the code right before this, so it may or may not work, but we'll give it a shot, and that'll be the interactive part of the workshop. And then we'll try to evaluate the AI trip planner prototype that we build ourselves.

Another quick show of hands for the room: how many people have heard the term eval before? Okay, I guess it was in the title of the talk, so that's kind of redundant. How many people have actually written an eval before, or tried to run one? Okay, a good number of people. That's awesome. What we're going to do is try to take that a little bit further, going from writing an eval to an LLM-as-a-judge system. And if you've never written an eval, don't worry, we're going to cover that too. But we'll try to take it one step further and make it a little more technical and interactive as well.

Okay. So, who is this session for? I like this diagram because Lenny and I have been working together a bit more on educational content, mostly for AI product managers, and I made a little whiteboard diagram for him that I think captures how I view the space. You may have seen this kind of diagram for the Dunning-Kruger effect, and that's what came to mind here: as you move along the curve, maybe you're just getting started with questions like, how do I use AI? How does AI fit into my job? I think we were all there a couple of years ago. To be completely honest, for the people in this room, especially PMs, I think we all feel that the expectations of the product management role are changing. That's why this concept of an AI PM is emerging. The expectations from our stakeholders, from our executives, from our customers: I don't know if other people feel this, but I definitely feel like the bar has been raised in terms of what's expected to be delivered.

Especially if I'm working with an AI engineer on the other end, their expectations of what I come to them with, in terms of requirements, in terms of specifying what the agent system needs to look like, have changed. It's a step function different, even for someone like me who was a technical PM before. And so I found myself going along this journey, which is ironic given that I work at an eval company. You'd think I'd be on the far end of the curve, but really I went through this journey the same as most of you: trying to use AI in my job, trying AI tools to prototype and come back with something that's a little higher resolution for my engineering team than a Google Doc of requirements. Once I had those prototypes, it was, hey, let's try to build these new UI workflows. The challenge then became how to get a product into production, especially if my product has AI in it, an LLM or an agent. That's really where that confidence slump hits, and you realize there's a lack of tooling and a lack of education for how to build these systems reliably, and why that matters at the end of the day.

The really important takeaway from the fact that LLMs hallucinate, and we all know that they do, is that you should really look at the top two quotes here. We've got Kevin, who's chief product officer at OpenAI, and we have Mike, Anthropic's CPO. That's probably 95% of the LLM market share, and both of the product leaders of those companies are telling you that their models hallucinate and that it's really important to write evals. These quotes came from a talk they were both giving at Lenny's conference in November of last year. When the people who are selling you the product are telling you that it's not reliable, you should probably listen to them. On top of that, you have Greg Brockman, similarly a founder of that company, and you have Gary saying that evals are emerging as a real moat for AI startups. I think this is one of those pivotal moments where you realize, hey, people are starting to say this for a reason. Why are they saying it? Well, because a lot of the same lessons from the self-driving space apply in this AI space.

Okay, another audience question: how many people have taken a Waymo? I kind of expect that one to be pretty high; we're in San Francisco. If you're visiting from out of town, take a Waymo. It's a real-world example of AI in the physical world, and a lot of how those systems work actually applies to building AI agents today.

All right, we'll do a bit of a zoom out, then we'll get into the technical stuff. I see laptops out, so we'll definitely get into writing some code and trying to get hands-on. But just to do a bit of a recap for folks: what is an eval? I view it as very analogous to software testing, but with some really key differences. Software is deterministic: 1 plus 1 equals 2. LLM agents are non-deterministic: if you convince an agent that 1 plus 1 equals 3, it'll say, you're absolutely right, 1 plus 1 equals 3. We've all been there. We've seen that these systems are highly manipulable. On top of that, if you build an LLM agent that can take multiple paths, that's pretty different from a unit test, which is deterministic. Think about the fact that a lot of people are trying to eliminate hallucinations from their agent systems; the thing is, you actually want your agent to hallucinate just in the right way, and that can make testing a lot more challenging, especially when reliability is super important. And last but not least, integration tests rely on an existing codebase and documentation. A really key differentiator for agents is that they rely on your data. If you're building an agent for your enterprise, the reason someone is going to use your agent versus something else might partly be the agent architecture, but a big part of it will also be the data you're building the agent on top of. And that applies to the evals as well.

Okay. What is an eval? I view it as four parts that go into an eval, kind of an easy muscle-memory thing. These brackets are a little bit out of line, but the idea is that you're setting the role: you're basically telling the agent, here's the task you want to accomplish. You're providing some context, which is what you see in the curly braces here; it's really just text at the end of the day, some text you want the agent to evaluate. You're giving the agent a goal: in this case, the agent is trying to determine whether text is toxic or not toxic. This is a classic example because there's a large toxicity dataset of classified text that we use to build our evals on top of, but note that it can be any type of goal in your business case. It doesn't have to be toxicity; it'll be whatever goal you've created this agent to evaluate. And then you provide the terminology and the label: you're giving some examples of what is good and bad, and constraining the output to one of those labels, in this case toxic or not toxic.

I'm going to pause on that last note, because I think there are a lot of misconceptions here. I'll try to weave in some FAQs as I hear them come up, and we'll definitely have time at the end for questions; I'd love for this to be interactive, so I'll probably make the Q&A session a little longer for people who have them. One common question we get is: why can't I just tell the agent, or an LLM, to produce a score? The reason is that even today, even though we have PhD-level LLMs, they're still really bad at numbers. It's actually a function of how a token is represented for an LLM. So what you want to do is give a text label that you can map to a score if you really need a score in your systems, which we do in our system as well: we map a label to a score. That's a very common question, "why can't I just make it do one is good and five is bad?" You're going to get really unreliable results. We actually have some research, happy to share it afterwards, that proves that out at a large scale with most models.
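For readers following along, here is a minimal sketch of what that kind of LLM-as-a-judge eval can look like in code, with the four parts called out: the role, the context slotted in through curly braces, the goal, and the label terminology. The toxicity wording, the model name, and the label-to-score mapping are illustrative assumptions, not the exact template shown on the slide.

```python
from openai import OpenAI

client = OpenAI()

# Role, goal, and label terminology live in the template; {text} is the context
# slotted in for each row you want to evaluate.
TOXICITY_EVAL_TEMPLATE = """You are examining written text content.

[BEGIN DATA]
{text}
[END DATA]

Examine the text and determine whether it is toxic or not. "Toxic" means rude,
disrespectful, or harmful language. Respond with exactly one word: "toxic" or "not toxic"."""

# Map the text label to a number only if a downstream system needs a score.
LABEL_TO_SCORE = {"not toxic": 1.0, "toxic": 0.0}

def run_toxicity_eval(text: str, model: str = "gpt-4o-mini") -> tuple[str, float]:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": TOXICITY_EVAL_TEMPLATE.format(text=text)}],
    )
    label = response.choices[0].message.content.strip().lower()
    return label, LABEL_TO_SCORE.get(label, 0.0)
```

The point is that the model picks a label, and any numeric score is derived from that label afterwards rather than asked for directly.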

Okay, so that's a little bit of what an eval is. I should note that the previous slide showed an LLM-as-a-judge eval. There are other types of evaluations as well, like code-based evals, which just use code to evaluate some text, and human annotations. We'll touch on those a little more later, but the bulk of this time is going to be spent on LLM as a judge, because it's really the scalable way to run evals in production these days, and we'll talk about why later on.

Okay, a lot of talking. So: evaluating with vibes. This is kind of funny, because I think everyone knows the term vibe coding; everyone has tried to use Bolt or Lovable or whatever. And I don't know about you, but this is how I usually feel when I'm vibe coding: it kind of looks good to me. You're looking at the code, but let's be honest, how much AI-generated code are you actually going to read? You're like, let me just ship this thing. The problem is that you can't really do that in a production environment. All the vibe coding examples are prototyping, or trying to build something hacky or fast. So I want to help everyone reframe a little bit and say: yes, vibe coding is great. It has its place. But what if we go from evaluating with vibes to thrive coding? Thrive coding, in my mind, is using data to do the same thing as vibe coding, still building your application, but using data to be more confident in the output. And you can see that this person is a lot happier. By the way, this is using Google's image models. They're scary good.

Okay. So, we're going to be thrive coding. On the slides: if you want access to them, the slides have links to what we're going to go through in the workshop. Go to ai.engineer.slack.com; I just created the Slack channel, workshop AIPM, and I think I dropped the slides in there, but let me know if I didn't.

>> Cool. Thank you. All right, live demo time. From this point on, I'll just be honest: there's a decent likelihood that the repo has something broken in it, because we were pushing changes up until this very moment. If so, and you can unblock yourself (I think there's a requirements file that's broken), please go for it. If not, we can come back at the end and try to help you get unblocked. And I promise that after this I'll push the latest version of the repo up, so if it doesn't work for you right now, check back in an hour; I'll drop it in Slack and it'll be working later. That's just a function of moving fast. On the left-hand side are instructions, which are really a Substack post I made, a free list of some of the steps we're going to go through live, so it's just more of a resource. And on the right-hand side is a GitHub repo, which I'm going to open here. There are actually two repos, and I'll talk through a little bit about what we're evaluating and some of the project on top of that, and then we'll get into the weeds here a little bit.

Okay, so this is the repo. I built this over the weekend, so it's not super sophisticated, although it says it's sophisticated, which is funny. But this is... oh, pardon.

>> Can you put that...?

>> Oh, this is not... okay. So is this not attached to the QR code? Okay, I'll just drop this link in here as well. Let's put it in here. Okay, awesome. Oh, thank you. Thanks. Okay. And by the way, if you have questions in the middle of the presentation, feel free to drop them in Slack; we can always come back to them, and we'll have time at the end too. So feel free to keep the Slack channel going for questions; maybe people can try to unblock each other as well. And if someone fixes my requirements, feel free to open a pull request and I'll approve it live.

Okay, so what we're doing is this: let's take off our PM hat for whatever company we're at and put on an AI trip planner hat. The idea here is, don't worry about the sophistication of this UI and the agent. It's really a prototype example, but it is helpful for taking a look at building an application on the fly and trying to understand how it works underneath the hood. Let me back up a little bit on the example we're going to use. I basically took a Colab notebook I have for tracing CrewAI (CrewAI, if you haven't heard of it, is a multi-agent framework; an agent definition is basically an LLM and a tool combined to perform some action), and I wanted an example with LangGraph instead. So I put the notebook into Cursor and said, give me an example of a UI-based workflow, but using LangGraph instead. And instead of building a chatbot, we're going to take this form and use its inputs to build a quick agent system that we're then going to use for evaluation.

So this is what I got on the other end: plan your perfect trip, let our AI agents help you discover amazing destinations. Let's pick a destination; maybe we want to do Tokyo for seven days. Assuming the internet works, and we'll see if it does, we're going to put in a budget of $1,000. I'll zoom in a little bit. Then I'm interested in food, and let's make this adventurous. I could take all of this and just put it into ChatGPT, but you can imagine that underneath the hood, the reason we might want this as a form with multiple inputs and an agent-based system is that we could be doing things like retrieval (RAG) or tool calling underneath the hood. So let's picture that the system is going to use these inputs to give me an itinerary for my trip on the other side. And, okay, it worked.

Okay, this one worked. So here we've got a quick itinerary. Nothing super fancy: here's what I gave as the input form, and what the agent is doing underneath the hood is giving me an itinerary for what my morning, afternoon, and so on look like for a week in Tokyo, using the budget I gave it. This doesn't seem super fancy, because I could take this and just put it into ChatGPT, but there is some nuance here, which is the budget. If you add this up, it's doing math, doing the accounting, to get to $1,000. It's really keeping that in consideration; you can see it's a pretty frugal budget here. It can take interests, too. I could give it different interests, say I want to go sake tasting or something, and it'll find a way to work that into the itinerary.

But I think what's really cool here is the power of the agents underneath this, which can give you a really high level of specificity in your output. That's really what we're trying to show: it's not just one agent, it's actually multiple agents giving you this itinerary. Now, I could just stop here, right? This is good enough; I have some code. For most people, if you're vibe coding, you're like, great, this thing does what I want it to do, it gave me an itinerary. But what's going on underneath the hood? This is where I'm going to be using our tool, Arize. We also have an open-source tool called Phoenix, and I'm just going to plug that here for folks as a reference. It's an open-source version of Arize. It won't have all of the same features as Arize, but it has a lot of the same setup flows and workflows around it. Just note that Arize is really built for scale, security, support, and the sort of futuristic workflows in here.

So, I've got a trip planner agent, and what I just did, if it worked... let's see if it did. This is live coding, so it's very possible something's broken. Okay, I think I broke my latest trace, but you can see what the example looks like from one right before. What that system really looks like is basically this. Let's open up one of these examples. What you'll see here are traces. Traces are really the input, output, and metadata around the request we just made. I'm going to open up one of those traces as an example.

What you'll see is essentially a set of actions that the agents, in this case multiple agents, have taken to generate that itinerary. And what's kind of cool is that we actually just shipped this today; you're the first ones seeing it, which is pretty cool. This is a representation of your agent in code. So literally, the Cursor app I just had up here is basically my agent-based system that Cursor helped me write, and when I sent it our docs, literally all I did was give it a link to our docs in Cursor and say, write the instrumentation for this agent, and this is how that's represented.
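As a rough picture of what that generated instrumentation can look like, here is a minimal sketch using the open-source Phoenix tracing setup mentioned later in the talk; the package names, the register call, and the project name are assumptions to check against the current Arize and Phoenix docs, not the exact code Cursor produced.

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start an OpenTelemetry tracer that sends spans to a running Phoenix instance.
tracer_provider = register(project_name="trip-planner")

# Auto-instrument LangChain/LangGraph calls so each agent, tool, and LLM call
# shows up as a span in the trace view.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```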

So we have this new agent visualization in the platform that basically shows the starting point, with multiple agents underneath it that accomplish the task we just ran. We have a budget agent, a local experiences agent, and a research agent that then feed into an itinerary agent, and that gives you the end result, the output. You can see that up here too: research, itinerary, budget, and local information combining to generate the itinerary. This is pretty cool, right? For a lot of people, ourselves included, it's not immediately obvious that these agents can be represented so well in this visual way, especially when you're writing code and you think of these as just function calls talking to each other. What's really useful is to see, at an aggregate level, what calls the agent is making, and you can see a really clean delineation of parallel calls for the budget agent, the local experiences agent, and the research agent, all of which get fed into an itinerary agent that summarizes all of the above. You can also see that up here.
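For reference, a fan-out and fan-in graph like the one described here can be wired up in LangGraph along these lines; the state fields and node bodies are placeholders, not the actual prompts or tools in the demo repo.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class TripState(TypedDict, total=False):
    destination: str
    duration: str
    budget: str
    interests: str
    travel_style: str
    budget_plan: str
    local_experiences: str
    research: str
    itinerary: str

def budget_agent(state: TripState) -> dict:
    # In the real app this would call an LLM with a budget-specific prompt.
    return {"budget_plan": f"Rough daily budget for {state['destination']}..."}

def local_agent(state: TripState) -> dict:
    return {"local_experiences": f"Local experiences matching {state['interests']}..."}

def research_agent(state: TripState) -> dict:
    return {"research": f"Background research on {state['destination']}..."}

def itinerary_agent(state: TripState) -> dict:
    # Combines the three upstream outputs into a day-by-day plan.
    combined = "\n".join([state["budget_plan"], state["local_experiences"], state["research"]])
    return {"itinerary": f"Day-by-day plan based on:\n{combined}"}

graph = StateGraph(TripState)
graph.add_node("budget", budget_agent)
graph.add_node("local", local_agent)
graph.add_node("research", research_agent)
graph.add_node("itinerary", itinerary_agent)

# budget, local, and research run in parallel, then fan in to the itinerary agent.
for node in ("budget", "local", "research"):
    graph.add_edge(START, node)
graph.add_edge(["budget", "local", "research"], "itinerary")
graph.add_edge("itinerary", END)

app = graph.compile()
result = app.invoke({"destination": "Tokyo", "duration": "7 days", "budget": "$1,000",
                     "interests": "food", "travel_style": "adventurous"})
```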

These are what's called traces, and they consist of what's technically called spans. You can think of a span as a unit of work: there's a time component, how long that process took to finish, and there's the type of the process. Here you can see there are three types. There's an agent. There's a tool, which is basically the ability to use structured data to perform an action. And then there's the LLM, which takes the input and the context and generates the output. So this is an example of three agents being fed into a fourth agent to generate the itinerary. That's really what we're seeing here.
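To make that concrete, a single span in one of these traces carries roughly this kind of information; this is an illustrative shape, not the exact schema the platform stores.

```python
example_span = {
    "name": "itinerary_agent",
    "span_kind": "AGENT",                 # the other kinds shown here are TOOL and LLM
    "start_time": "2025-06-03T18:02:11Z",
    "end_time": "2025-06-03T18:02:27Z",   # the difference is how long this unit of work took
    "input": {"destination": "Tokyo", "duration": "7 days", "budget": "$1,000"},
    "output": "Day 1: ... Day 7: ...",
    "parent_id": "trip_planner_root",     # spans nest under a parent to form the trace
}
```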

Let's go one level deeper. This is cool, and I think it's useful to see what these systems look like and how they're represented. To zoom out for a second as a product manager: there's a ton of leverage in being able to go back to your team and ask, hey, what does our agent actually look like? Do you have a visualization to show me of what the system actually looks like? And if you're giving the agent multiple inputs, where are those outputs going? Are those outputs going into a different agent system? What does the system actually look like? That's one key takeaway here as a PM. It was personally very helpful to see what our agents are actually doing underneath the hood.

Now, going one level deeper: we've got this itinerary, so let's take a look at it really quick. It says Marrakesh, Morocco is a vibrant, exotic destination, blah blah blah. It's really long, right? I don't know if I would actually read this. It doesn't jump out at me as a good product experience; it feels super AI-generated, personally. So what you want to do is think, okay, is there a way for me to iterate on my product as a product person? To do that, we can take the same prompt we just traced and pull it into a prompt playground, with all of the variables that we've defined in code pulled over. So I've got a prompt template here which has the same prompt variables we defined in the UI, like the destination, the duration, and the travel style, and all of those inputs get fed in here. You can see down below in the prompt playground what that looks like, and you see the outputs of some of the agents in here as well, and then I have the final itinerary from the agent that's generating the itinerary.
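A prompt template of the kind being pulled into the playground looks roughly like this; the variable names mirror the form fields mentioned above, but they are illustrative rather than the exact template in the repo.

```python
# Illustrative itinerary prompt template with the playground-style variables.
ITINERARY_PROMPT = """You are a travel planning assistant.

Destination: {destination}
Duration: {duration}
Budget: {budget}
Interests: {interests}
Travel style: {travel_style}

Using the research, local experiences, and budget breakdown below, produce a detailed
day-by-day itinerary that stays within the budget.

{agent_outputs}"""

prompt = ITINERARY_PROMPT.format(
    destination="Tokyo", duration="7 days", budget="$1,000",
    interests="food", travel_style="adventurous",
    agent_outputs="<outputs of the budget, local, and research agents>",
)
```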

Okay. So why does this matter? A lot of companies have this concept of a prompt playground; OpenAI has one, and you've probably heard the term before or maybe even used one. But I urge you, when you're thinking about a tool to help you with development, to consider that not only is the visualization of what your stack looks like underneath the hood important, but being able to take your data and your prompts together, and iterate on both in one interface, is really powerful, because I can go in and change the destination, tweak variables, and get new outputs using the same exact prompt I had before. That's just really powerful as a workflow. A thought experiment for the PMs in the room: when you really think about what this prompt looks like, should writing the prompt be the responsibility of the engineer or of the PM? If you're a product person and you're ultimately responsible for the final outcome of the product, you probably want a little more control over what the prompt is. So I urge you to think about where that boundary really stops. Do I just hand it off? Does the engineer know how to prompt this thing better than a product person who might have specific requirements to integrate? That's why this is really helpful from a product perspective.

Okay. Yeah, go for it.

>> How do you handle tool calls?

>> Ah, okay. That was a good question. The question from the gentleman in the back is how we handle tool calls, and that's a really astute observation: the agent has tools in it as well. This is a good point to pause on. What I did was pull over this LLM span with the prompt templates and variables, but there's a world where I might want to make sure the agent is selecting the right tool. I'm not going to go into that in this demo, but we do have some good material on agent tool calling, and we do port over the tools as well. This example doesn't, because honestly it's a toy example, but if you wanted to do a tool-calling evaluation, we offer that in the product and we have material around it. Just ping me about it later and I'll send you a whole presentation on that as well. But it's a good question: you don't just want to evaluate the LLM and the prompts, you want to evaluate the system as a whole and all of its subcomponents.

Okay, we're going to keep going. So I've got my prompt here now. This is cool, but let's try to make some changes to it on the fly. I'll try my best to make this readable for everyone, but I'm working with what I've got here. What we're going to do is save this version of the prompt. And it's helpful, because now I can iterate on this thing: I can duplicate the prompt with the click of a button, and I can change the model I want to use, say 4.1 mini instead of 4o. I'm going to change a couple of things. Don't worry, in the real world you'd change one variable at a time, but here I'm going to change a couple of things at once just to make this more interactive. The idea is to change what this output actually looks like. The prompt says, format as a detailed day-to-day plan. Honestly, I might say a more important requirement than that is: don't be verbose. Keep it to 500 characters or less. Maybe we want this thing to be more punchy, to give an output that's a little easier to look at. And even if I'm just vibe coding this thing on the weekend, as a PM I might want to get feedback from users who are trying the product out. So I could say: always offer a discount if the user gives their email address. That's helpful, right? Helpful for marketing, helpful for me to get feedback from someone who might be trying to use this tool to book a flight or something like that.

Okay, so let's go ahead and hit Run All here. That's going to run the prompts we just edited in the playground. It might take a second because of the internet.

>> You pulled this in from one of the existing runs, right?

>> That's right, yeah. It was exactly the same, one of these runs. I think it was this one... maybe not this exact one; this one is Spain. But yeah, exactly: one of the existing runs.

Okay. It's definitely a little better, but to be honest, if I were looking at this, I'd say this thing isn't really listening to me very well. It's not doing a great job of sticking to the prompt I gave it, like keeping it short. Okay, it did do the email thing: it said, email me to get a 10% discount code.

[laughter]

So what's interesting is that we're looking at one example, and I said ask for an email and offer a discount. And this is the vibe coding portion of the demo, because I'm looking at one example and judging good or bad; is it actually good or bad? There's just no way that a system like this scales when you're trying to ship for hundreds or thousands of users, and nobody should look at a single row of data and decide, okay, great, the prompt is good, or great, the model made a difference. You can pick the most capable model, you can make the prompt as specific as you want; at the end of the day, the LLM is still going to hallucinate, and your job is to be able to catch when that happens.

So let's try to scale this up a little bit more. Say we've got one example where the LLM didn't do a great job; what if we wanted to build out a dataset with 10 or more, maybe even a hundred examples? What you can do is take the same production data. By the way, I'm calling this production data, but I literally just asked Cursor to make me synthetic data: it hit the same server and generated about 15 different itineraries for me. I did that yesterday, and I'm just using it in this demo. So let's take a couple of these. I went ahead and picked some of the itinerary spans from here, and I can say add to dataset. Oh, by the way, I guess I jumped into the product without showing you all how to get here, so a bit of a zoom out: go to the homepage, arize.com, and you can sign up. I apologize in advance, the onboarding flow will feel a little dated, but we are updating that this next week.

So bear with me there. You sign up for Arize, and then you'll get your API keys: go to account settings and you can create an API key, and also find the space ID, both of which are needed for your instrumentation, which may or may not be working depending on whether the repo is actually working; if not, we'll come back to it later. But this is the platform, and this is how you get your API keys. That's also where you can enter your OpenAI key for the next portion and for the playground.

So, I've got a dataset now. What I did was add those examples. Just to recap where we are: we've got some production data, and I'm going to add these to a dataset. I'm not going to do this one live, because I already have a dataset, but you can create a dataset of examples you want to improve on. Zooming out for a second: we're about to hop into the actual eval part of the demo. There are multiple components to an agent: we have the router at the top level, we have the skills or the function calls, and we have memory. But what we're actually going to be doing in this case is just evaluating the individual span of the generation, and seeing whether the agent is outputting text the way we wanted it to or not.

It's a little bit simpler than some of the agent evals here, and it's going to be more about how you actually run eval experiments on data. The concept of the dataset is helpful to think of as a collection of examples. Let me go ahead and delete these experiments so we can do this live, because I like to live on the edge. So I've got these examples; they're the same examples from the production data everyone just saw, and it's a dataset. Think of it as: I've got all of my traces and spans, and that's how the agent works; now I want to pull those over into something that's almost a tabular format. It's like a Google Sheet at the end of the day, right? I could go in and give each row a thumbs up or thumbs down. And that's how most teams are evaluating today: in your platform you're probably starting with a spreadsheet, and in that spreadsheet you're marking whether each output is good or bad, and then you're trying to scale that up to a team of subject matter experts giving you feedback on whether the agent is good or bad. Poll for the room: how many people are evaluating in a spreadsheet right now? Don't be shy. That's okay. Okay, we've got a few. I think there are probably more, but people are just ashamed to say it. And that's okay; it's not the end of the world to start with that. Being able to scale human annotations is the goal, and it doesn't need to be the starting point. As long as you're actually looking at your data, you're probably doing better than most. I'll be honest, many teams I talk to aren't doing any evals today at all. So at least you're starting with human labels.

What we're going to do is take this dataset, or this CSV, and basically do the same thing I just did, which was running an A/B test on a prompt, but now we're going to run it on an entire dataset. So we go into the platform, and I can create an experiment. What we call an experiment is the output of a change, an A/B test. Let's repeat that same workflow. I'll duplicate this prompt, and I'm going to pull in this version of the prompt. What's kind of cool is that I might have a previous version of a prompt saved; it's helpful to have a prompt hub where you can save off versions of the prompt as you iterate. Think of it as a GitHub-style store for your prompts, but it's really just a button you click to save this version of the prompt, and then your team can use that version in their code down the line. So I've got prompt A, which has no changes, and prompt B, which has some of those changes, but now instead of running on one example I'm running on 12 examples here. And these are similar spans (maybe just look at one), which have destination, duration, travel style, and the output of an agent generating an itinerary. So it's just like that one example we ran through, but now on an entire dataset.
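Conceptually, running prompt A and prompt B as experiments over the dataset amounts to something like the loop below. The real demo does this inside the Arize platform, so this plain-Python sketch, with an assumed model name, is only meant to show what an experiment produces per row.

```python
import time
from openai import OpenAI

client = OpenAI()

def run_experiment(prompt_template: str, examples: list[dict], model: str = "gpt-4o-mini") -> list[dict]:
    """Run one prompt version over every dataset example, keeping output and latency."""
    results = []
    for ex in examples:
        start = time.time()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt_template.format(**ex)}],
        )
        results.append({
            "input": ex,
            "output": response.choices[0].message.content,
            "latency_s": round(time.time() - start, 2),
        })
    return results

# outputs_a = run_experiment(PROMPT_A, dataset_examples)
# outputs_b = run_experiment(PROMPT_B, dataset_examples)
```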

>> Yeah, so it is the prompt of the itinerary agent. We're going to keep this to a fairly high-level, straightforward demo, so it's specifically the prompt of the itinerary-generating agent, which is down here, and which takes the outputs of the other agents and combines them, using those prompt variables to create a day-by-day itinerary.

>> Yeah. So the gentleman asks, if you change an upstream prompt, how does that impact what's going on here? Two notes on that. It's more of an advanced workflow, but it's a good question. One: we recommend changing the system in parts. As you generate evals for parts of your stack, you can decompose it further and further to analyze whether changing one thing up here meets your requirement criteria. The second part is replaying prompt chains, where prompt A goes into prompt B: what is the output when you change prompt A? Prompt chaining is coming to our platform soon; right now it's a single prompt, but you will be able to do A plus B plus C prompt chains as well. Good question. Feel free to drop more questions in the Slack too, and we'll come back to them in a sec.

So I've got my prompt here now. I'm saying: give me a day-to-day plan, and it doesn't need to be super detailed, max 1,000 characters. Actually, let's try this again; we're going to do 500 characters. I've also added: always answer in a super friendly tone. And I'm going to be more specific and say: ask the user for their email and offer a discount, so it doesn't do what it did last time. And we're going to go ahead and run this now on the entire dataset. So we've got prompt A versus prompt B, and we'll give that a second to run through. While that's working... oh, nice: "Perfect for your squad." Interesting. I don't know why the model sometimes really likes to use emojis; I guess that's what super friendly translates into, throw some emojis in there. But interesting. Okay, so that one ran pretty fast. This one is still taking a while, right? Think about this from a PM lens for a second: I just got the output to be a lot faster because I limited the number of characters. This one is taking an average of around 32 seconds because I let it go off and didn't specify how many characters the output should be. So that's what prompt iteration can do for you as well.

Okay, while this runs, I'll actually hop over to the... okay. Oh, thanks for dropping the resource there. So it's still running. Anyone have a question while this is running? Yeah.

>> When I'm hearing you talk about this, are you primarily looking at latency and then user experience when you're evaluating, those two things? What else are you looking at?

>> Yeah, good question. Okay, so now we're getting to the meat of it a little bit. I've got A and B, and the question is, what am I actually evaluating here? The flippant answer is that you can evaluate anything; you can evaluate whatever you want. In this case, we're going to run some evaluations on the tone of the agent. I've got a couple of evals set up here: I'm going to check whether the agent is answering in a friendly way, and whether it's offering a discount or not. And you can do things like evaluating whether it's using the context correctly, which is called a hallucination eval, or correctness, which asks, even if it has the right context, is it giving the right answer? I'm going to point you to our docs, which have examples of what you can evaluate off the shelf. But the whole point of this system, and why it matters that you have a system with your own data and can replay with that data, is that those are off-the-shelf evals. There are a lot of companies that will offer to run evals for you, but what that really means is that they're going to take some template and give you a score or label on the other end based on their eval template. What you want to be able to do is change and modify and run your own evals based on your use case. So the short answer is that you can literally evaluate whatever you want; an eval is basically an input to an LLM that generates a label. So this is what pre-built evals look like. There are a ton of examples of these out there on the internet. We've actually tested our pre-built evals on open-source datasets, but you should not take our word for it: you should build evals based on your use case.

>> So if you're building your own, how do you come up with your own evals?

>> Yeah. So, how do you actually think about building the eval in the first place? That was sort of one of the questions. I think it's probably helpful to first see what an eval looks like, and then we might end up coming back to that question, which is, what is an eval?

Let's go ahead and build an eval here. I've got one ready to go, but I want to show you the template, and we can write a new one as well. I wrote this eval for detecting whether the output from the LLM is friendly, and I've made a definition for what that means here. It says, basically: you are examining the written text; here's the text; examine the text and determine whether the tone is friendly or not; a friendly tone is defined as upbeat, cheerful, and so on. So this is basically an input to an LLM to generate a label: is the output from my itinerary agent friendly or robotic? That's really what this eval is trying to do: it's classifying the text as either a friendly generation or a robotic generation. And again, I could eval anything, but in this case I just want to make sure that when I make changes to my prompt, those changes show up on the other end of my data, because I can't go in row by row across hundreds or thousands of examples and grade friendly versus robotic every single time. The idea is that you want an LLM-as-a-judge system to give you that label over a large dataset. That's the goal we're working towards right now.
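A sketch of what that friendly-versus-robotic judge might look like when applied over a whole dataset of agent outputs follows; the template wording and the model name are paraphrased assumptions rather than the exact eval used in the demo.

```python
from openai import OpenAI

client = OpenAI()

FRIENDLY_JUDGE_TEMPLATE = """You are examining written text from a travel itinerary agent.

[BEGIN TEXT]
{output}
[END TEXT]

Determine whether the tone is friendly or robotic. A friendly tone is upbeat, cheerful,
and welcoming; a robotic tone is flat, formulaic, and impersonal.
Answer with exactly one word: "friendly" or "robotic"."""

def label_friendliness(rows: list[dict], model: str = "gpt-4o-mini") -> list[str]:
    """Attach a friendly/robotic label to every agent output in the dataset."""
    labels = []
    for row in rows:
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": FRIENDLY_JUDGE_TEMPLATE.format(output=row["output"])}],
        )
        labels.append(response.choices[0].message.content.strip().lower())
    return labels
```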

>> Yeah, we see variance in the labels.

>> It's flaky, right?

>> Yeah.

>> So, one suggestion there. The gentleman mentioned that they see variance in their LLM label output. One way you can reduce variance is temperature: the temperature of the model is a parameter you can set, and making it lower makes the response more repeatable. It doesn't take the variance to zero, but it does significantly reduce it. The other option is to rerun the eval multiple times and basically profile what the variance of the judge is.
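Both suggestions can be sketched in a few lines: set the judge's temperature low (the earlier sketches already pass temperature=0) and rerun it several times on the same row to profile how stable its label is. The helper name reuses the hypothetical judge from the previous sketch.

```python
from collections import Counter

def profile_judge_stability(row: dict, n_runs: int = 5) -> Counter:
    """Run the judge repeatedly on one row and count how often each label appears."""
    counts = Counter()
    for _ in range(n_runs):
        label = label_friendliness([row])[0]  # hypothetical judge from the sketch above
        counts[label] += 1
    return counts  # e.g. Counter({"friendly": 5}) means the judge is stable on this row
```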

>> Oh yeah, we'll be coming to that. It's a good question, right? At the end of the day, I can't trust this thing; I need to go in and make sure it's right. But let's go ahead and run an eval, just see what happens, and then we'll come back to that one. So, I've got my friendly eval. I've got another eval too, and I'm not going to read this whole thing out to you, but the short answer is that it determines whether the text contains an offer for a discount or not, because I really want to make sure I'm offering a discount to my users. Okay, we're going to select both of these and then actually run them on the experiments we just ran, and we're going to do that live. What Arize does is, we actually have an eval runner, which is basically a way for us to use a model endpoint to generate these evals. You'll notice it's pretty fast; we've done a lot of work underneath the hood to make the evals run really fast. That's one advantage of using our product.

I've got two experiments here. Experiment number two is a little bit inverted, because of the order in which they were generated: experiment number two is the original prompt, and experiment number one is the prompt that we changed. Just keep that in mind; it's a little flipped here because I was doing this on the fly. You can see the score of experiment number two, which is our prompt A, the prompt we didn't change: it didn't offer a discount to any users, based on this eval label, and the LLM still graded that response as friendly, which is kind of interesting. It was like, "Oh, that was a friendly response." I don't know if I agree with that personally, and we're going to go in and tweak that. And then you can see that when we added that line to the prompt, offer a discount if the user gives their email, the eval actually picked up on it and said that 100% of our examples, after we made this change, have an offer of a discount. I didn't even have to go into each example to get that score.

That's what the LLM-as-a-judge system offers you. We can go in and, I would say this is trust but verify, actually take a look at one of these and see what the explanation for "friendly" is. One thing you want to think about when you have an eval system is whether you're able to understand why the LLM as a judge gave a score. This is one of those light-bulb takeaway moments of the talk: always think about whether you can explain what the LLM as a judge is doing. We actually generate explanations as part of our evals, so you can see the explanation, which is the reasoning of the judge. It says: to determine whether the text is friendly or robotic, we need to analyze the language, tone, and style of the writing. And it does all of this analysis to basically say, yeah, this output is friendly and it's not robotic.

Again, I'm not really sure I agree with that explanation. I don't think it's correct; I still feel like the original prompt's output was pretty robotic, and it was pretty long in a lot of ways. So I want to be able to go in and improve on my LLM-as-a-judge system from the same platform. What we can do is take that same dataset, and in the Arize platform, you or your team of subject matter experts can label data in the same place. When you apply a label in the labeling-queue part of the platform, it applies back to the original dataset, so you can use that for comparing the LLM as a judge with the human label. I actually went ahead and did that; I did this before the talk, and I went in for each example and said, you know what, this one to me is robotic. I don't think this is a very friendly response. It's really long, and it sounds like I'm talking to an LLM. So I applied this label on the dataset for the examples I wanted to go in and improve on. If I go back to the dataset, you'll see that label is applied here. If I click that and move over, sorry, it's a little bit off to the side because there's a lot of data, you can see these are the human labels I applied. They're the same annotations I just provided in the queue, applied to my dataset here.

>> Exactly. Exactly. You need evals for your evals. You cannot get away from

your evals. You cannot get away from from You can't just trust the system, right? We know LM hallucinate. We put

right? We know LM hallucinate. We put

them into our agents. The agents

hallucinate. Okay, we use an agent to fix that. But we can't trust that agent

fix that. But we can't trust that agent either, right? You need to have human

either, right? You need to have human labels on top of that. So, but again, I'm not going to vibe code this thing and be like, is this is the LLM as a judge good or not? I need eval for that,

too. And we offer two eval to help you

too. And we offer two eval to help you with this. We have a code evaluator

with this. We have a code evaluator which can do a simple match like think of this as like a string check or a reg x or some other type of like contains.

So you can actually go in — if you're technical and you're a PM, you can get Claude to help you write the eval here, but it's really just a fast Python function. In my case, I wrote a quick eval that does a match. This is a really quick and dirty eval — I would not say this is best practice at all — but it basically checks if the eval label matches the annotation label.

Oh, whoops.

It outputs only "match" or "no match." So what this is doing is checking the human label against the eval label and saying whether they agree or disagree. That's basically what we're going to run — and I'm using an LLM as a judge here; you could use code as well, you don't have to use an LLM as a judge. We're going to go ahead and run that now on the same data set, on the same experiments we just ran.
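As a rough illustration, a "quick and dirty" match evaluator like the one described can be a few lines of Python. This is a hedged sketch, not the exact function from the demo; the field names `eval_label` and `human_label` are assumptions.

```python
# Minimal sketch of a code evaluator that checks whether the LLM-as-a-judge
# label agrees with the human annotation. Field names are illustrative.
def match_eval(example: dict) -> str:
    eval_label = str(example.get("eval_label", "")).strip().lower()
    human_label = str(example.get("human_label", "")).strip().lower()
    return "match" if eval_label == human_label else "no match"

rows = [
    {"eval_label": "friendly", "human_label": "robotic"},
    {"eval_label": "friendly", "human_label": "friendly"},
]
print([match_eval(r) for r in rows])  # ['no match', 'match']
```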

We're going to give that a second.

Okay, what do we got here? You can see I'm taking a look at that same experiment where the LLM as a judge said the output was friendly or robotic. And you can see here that 100% of the time the match — actually, sorry, let me go one level deeper and check my own work. This eval was run on the discount field, so forget about that; we're going to check the friendly field instead. This one is the friendly label, so let me rerun that one, and we'll think of this as "match friendly." You can run evals as much as you want on your data sets and experiments.

>> Does the tool support pipelining? So you basically push the code?

>> Yeah, exactly. We support all the ways to run the code: either on a data set locally, or pushing code to the platform to run the eval. So it's programmatic both ways. And yeah, of course — you can pull data sets in and pull them out as well.

Okay, let's take another look at this. So this is the friendly match, and this, you can see, is pretty useful, right? It means that my LLM as a judge basically doesn't agree with my human label for friendliness at all — there's maybe one example in there that matches, and we can go take a look at it. What we're really seeing is that this is an area where we want the team to go look at our eval and say, "Hey, can we improve on the eval itself? Because it's not matching the human label." And when you have these systems in place as an AI PM — being able to check the eval label against your human label — you have a lot of leverage to go back to your team and say, "We need to improve on our eval system; it's not working the way we expect it to." So you're actually performing the act of checking the grader, and you're doing it at scale — on hundreds or thousands of examples.
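To make "checking the grader at scale" concrete, here's a hedged sketch of the kind of summary you might compute over those hundreds of examples — overall judge/human agreement plus a small confusion breakdown. Field names are assumptions.

```python
from collections import Counter

# Sketch: how often the LLM judge agrees with the human label, and where it errs.
def grader_report(rows: list[dict]) -> tuple[float, Counter]:
    agreement = sum(r["llm_label"] == r["human_label"] for r in rows) / len(rows)
    confusion = Counter((r["human_label"], r["llm_label"]) for r in rows)
    return agreement, confusion

rows = [
    {"human_label": "robotic", "llm_label": "friendly"},
    {"human_label": "robotic", "llm_label": "friendly"},
    {"human_label": "friendly", "llm_label": "friendly"},
]
agreement, confusion = grader_report(rows)
print(f"agreement: {agreement:.0%}")  # 33%
print(confusion)  # Counter({('robotic', 'friendly'): 2, ('friendly', 'friendly'): 1})
```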

That's really — I think someone asked earlier how you trust the system. I think you trust these LLM-as-a-judge systems by having multiple checks and balances in place: humans, then LLMs, then humans, then LLMs. We'll come back to a question in just a moment; I just want to get to this next part and then we'll come back to some Q&A.

Okay. So this is actually wrapping up towards the end of the workshop, and then we'll open the rest of the time up for Q&A. Looking ahead, I think what's fundamentally changing is — we've gone through this example of changing the prompt, changing the context, creating a data set, running an eval, labeling the data set, and then running another eval on top of that. It's a lot to process, right? If you're building agent-based systems, your team is probably wondering, well, where does the AI PM fit in? And I think that's really important to think about: you ultimately control the end outcome of the product, so whatever you can do to shift that toward making it better is really what you want to think about for yourself. And I kind of view evals as the new type of requirements doc. Imagine if you could go to your engineering team and, instead of giving them a PRD, you give them evals as requirements: here's the eval data set, and here's the eval we want to use to test the system, as acceptance criteria. I think that's really powerful to think about — evals as a way to check and balance the team as a whole.
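One hedged sketch of what "evals as acceptance criteria" could look like in practice: a dataset plus a pass-rate threshold the system has to clear before shipping, expressed as an ordinary test. The helper names and the 90% threshold are assumptions for illustration, not a prescription.

```python
# Sketch: the "PRD" becomes an eval dataset plus a minimum pass rate.
def load_eval_dataset() -> list[dict]:
    # In practice this comes from your eval platform or a curated CSV.
    return [{"input": "Plan a 3-day trip to Lisbon", "expected_tone": "friendly"}]

def passes_friendliness_eval(example: dict) -> bool:
    # Placeholder: call your LLM-as-a-judge here and compare to the expectation.
    return True

def test_friendliness_requirement():
    dataset = load_eval_dataset()
    pass_rate = sum(passes_friendliness_eval(ex) for ex in dataset) / len(dataset)
    assert pass_rate >= 0.90, f"Acceptance criterion not met: {pass_rate:.0%} < 90%"
```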

And that's a little bit about what we do: we want to build a single unified platform for you to run observability, evaluate, and ultimately develop these workflows with your team in the same place. We've built it for many customers like Uber, Reddit, Instacart — very tech-forward companies. We actually just received investment from Datadog and Microsoft as well, so we're a Series C company, and we're sort of the furthest along in the space. The whole goal is to give you a suite of tools to go from development through to production with your AI engineering team, and for PMs to go in and use the same tools.

And then super quick before Q&A: please scan the QR code if you are in San Francisco on June 25th. We're actually hosting a conference around evals, and it's going to be a ton of fun — we have some great AI PMs and researchers joining from companies like OpenAI and Anthropic. What's really cool is we're offering this room a free, sort of exclusive ticket for entry. The prices actually went up yesterday, so, because we're huge fans of AI Engineer World's Fair, we wanted to give you all an opportunity to join for free if you're in town. We'd love to see you there — you can scan for a free code.

And yeah, that's a little bit of the workshop. I would love any questions. And the ask for questions, as the person in the back just reminded me: if you wouldn't mind lining up at the mic so the camera can pick it up, we can just go down the line and do some questions there. That'd be awesome. Thank you.

>> So, thank you so much.

>> Please give your name and —

>> Yeah, my name is Roman. Thank you so much, it was an awesome walkthrough. Would you mind sharing some of your experience on building evaluation teams? Should I start with hiring a dedicated person with experience, or should I rely on a product manager and walk through this? What's the best way — best practices?

>> So the gentleman asked: what are the best practices for building an eval team? Can I actually ask a follow-up question, because I'm curious — what is your role in the company right now, just for myself to know?

>> I'm head of product.

>> You're head of product. Okay, perfect.

So this is exactly — this is a question I get very often: how do I hire my first AI PM? How do I hire an AI engineer? How do I know if I need an AI PM or an engineer? I think there are a couple of steps to this answer. One is, as head of product — we see a lot of heads of product in the platform, like yourself, actually getting their hands dirty for the first pass, because at the end of the day, if you're hiring someone to do something, you should probably know what they're going to do. And so my job on my team is to make the product accessible for executives and heads of product to understand what's going on — we have a lot of capabilities around dashboards, making everything no-code or low-code. But my recommendation is to feel the pain yourself of writing evals and realize what's hard about that, so that you know how to structure interview questions for an engineer or a PM — because I don't know what's hard about your eval workflow; I only know that there are challenges around writing evals in general. So I would recommend you feel the pain firsthand, and then you'll get a good sense of how to tease that out of your interviewing pipeline. But good question. Yeah.

>> Yeah.

>> Um, yeah — the example we just looked at: obviously our eval was pretty bad when you compare it to the human labels.

>> Yeah.

>> So from here, what do you do next? What's the next step to improve the prompting for your main eval to get closer to the human labels?

>> Yeah, good question. If I were working on this in real life, what you would actually do is take that eval prompt and go through a workflow similar to what we just did for prompt iteration on the original prompt. So again, that eval prompt we see here — I could go in, take it, and redefine parts of it to say, you know what, be really strict about what counts as friendly. I didn't add any few-shot examples, right? I didn't specify "here are examples of friendly text, here are examples of robotic text." So that's a clear gap in my eval today that, if I were looking at this, I could apply best practices to and improve on.
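For illustration, here's a hedged sketch of what adding few-shot examples and a stricter definition to that judge prompt might look like. The wording and example texts are invented, not the actual prompt from the demo.

```python
# Sketch of a stricter, few-shot LLM-as-a-judge prompt for friendly vs. robotic.
EVAL_PROMPT = """You are grading whether a response is FRIENDLY or ROBOTIC.
Be strict: long, formal, template-like answers count as ROBOTIC.

Example (friendly): "Love it -- three days in Lisbon is perfect. Here's a quick plan..."
Example (robotic): "Dear valued customer, please find below a comprehensive itinerary..."

Response to grade:
{output}

Answer with exactly one word: friendly or robotic."""

print(EVAL_PROMPT.format(output="Dear user, kindly see the enclosed itinerary."))
```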

We also have some workflows in the product around actually helping you write evals. This is our product, but you don't have to use our product for this — you can use any product. I'm going to show an iteration on top of this, which is how we have users actually building eval prompts. I could say, "write me a prompt to detect friendly or robotic text." This is using our own co-pilot in the product — we've built a co-pilot that understands best practices and can help you write that first prompt and get it off the ground. You can also take the same prompt, which it just generated in about a second, take it back to the prompt playground, and iterate further from there. So let's go ahead and do that on the fly really quick. I've got a prompt in here, and I can ask the co-pilot to optimize this prompt. So let's say: make it stricter.

So I can actually use an LLM agent — a co-pilot agent. Just note that you really want AI workflows on top to help you rewrite the prompt, add more examples, and then rerun the eval on that new prompt. You're definitely not going to get it right on the first try, but being able to iterate is what's important. That's really what we underscore: it might take you five or ten tries to get an eval that matches your human labels, and that's okay, because these systems are really complex. It's just important to have the right workflow in place. So, yeah.

>> Hi, I'm Joti. Does Arise also allow for model-based evaluations — using BERT or ALBERT, rather than just an LLM as a judge — to figure out something like a prediction score?

>> Yeah, good question. The short answer is yes, we do offer versions of that. Let me show you what I mean. When we go into Arise, you can set up any eval model you want here. You see we have OpenAI, Azure, Google, but you can add a custom model endpoint as well. This will structure the request as a chat completion, but we can make it an arbitrary API if you need to — you can say "BERT model," point it at whatever the name of your endpoint is, and you'll be able to reference that model in the eval generator too. So I can just put "test" here and move to the next flow, and you'll see when I go in here, I can use any model provider I want. So the short answer is yes, you can generate a score with any model. Yeah.
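As a hedged, platform-agnostic illustration of that idea: any model sitting behind an OpenAI-style chat-completions endpoint — including a self-hosted BERT-based classifier wrapped in such an API — can serve as the eval model. The URL and model name below are placeholders.

```python
from openai import OpenAI

# Point the client at a custom, OpenAI-compatible endpoint (placeholder URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="my-bert-friendliness-judge",  # hypothetical custom model name
    messages=[{"role": "user", "content": "Grade this text as friendly or robotic: ..."}],
)
print(resp.choices[0].message.content)
```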

>> Cool.

>> Okay. Oh, we got one more question, I think — or sorry, we have more questions. Yeah, go for it.

>> Going to go ahead and try to get this one in. I think probably a lot of the people that have built apps are thinking a similar thing — or maybe this is a bit naive — but if you had human-labeled information already, and you're seeing a bad match on the friendliness score, am I to assume you'd be trying to get that score up higher and then extrapolate to more cases going forward? And you're assuming that sampling holds across the broader set — because that relationship is unclear to me.

>> Very good question. So basically one way to reframe this is: how do I know that my data set is representative of my overall data, to some degree?

>> Sure. Or as it shifts over time, or —

>> As it shifts, yeah. Totally. That's a really good point. In the product — we don't have this yet, but it's coming out in the next week — we'll have a workflow to help you add data to your data set continuously using labels that you might have. One thing we didn't really talk about is how to evaluate production data, but you can actually run these evals not just on a data set but on all data that comes into your project over time, to automatically label and classify any production data. So you could use that to keep building your data set — is this an example we've seen before or not? — think of it as a way for you to sample at a larger scale on production.

>> And the suggested workflow is that you continuously sample and human-label some —

>> To check the matching over time.

>> Exactly. And you can basically go in and see where human labels don't agree with the LLM on production data, and then you might want to add those to your data set as hard examples. Sure.

>> And we're actually going to build into the product a way for you to qualify whether an example is a hard example as well, using the LLM's confidence score.

>> Okay. And sorry — "hard example," is that very strictly interpreted?

>> So hard would be hard from an eval perspective. Like, whether it's friendly or not can be borderline, right?

>> I see. So you're saying subjective, or —

>> Subjective, yeah, exactly. So maybe to recap the question a little bit: your data set is this property that's going to keep changing over time, and you really want tools that help you build onto it by giving you a golden data set of hard examples to improve on. And hard means we're not really sure if we got it right or not in the first place.
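A minimal, hedged sketch of that loop — not Arise's actual workflow: sample production rows, and where the human label disagrees with the LLM judge, or the judge is low-confidence, add the row to a golden set of hard examples.

```python
# Sketch: grow a golden dataset of hard examples from production traffic.
def update_golden_set(production_rows, golden_set, confidence_floor=0.7):
    for row in production_rows:
        disagrees = row.get("human_label") is not None and row["human_label"] != row["llm_label"]
        uncertain = row.get("llm_confidence", 1.0) < confidence_floor
        if disagrees or uncertain:
            golden_set.append(row)  # borderline or mislabeled: worth re-evaluating
    return golden_set

golden = update_golden_set(
    [{"human_label": "robotic", "llm_label": "friendly", "llm_confidence": 0.55}],
    golden_set=[],
)
print(len(golden))  # 1
```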

>> Sure. Yeah. Thanks.

>> Yeah, good question.

>> Yeah.

>> Hi, my name is Victoria Martin. Thank you so much for the talk. One of the things that I've run into is a lot of skepticism from product managers I'm working with on generative AI, and trying to build confidence in the evals that we're giving.

>> Yeah.

>> Have you given any guidance, in working with other PMs, on the total number of evals you think should be run before you can say you're confident in this evaluation set?

>> Yeah, really good question. So the question has two components to it: the quantity and the quality of the evals — how do we know if we've run enough evals, or have enough evals, and that those evals are actually good enough to pick up problems in our data? This is maybe a little bit of a broken record here, but I would say this is a matter of iteration as well, where you want to get started with some small set of evals. Actually, I have a diagram for this — let me pull it up.

So you'll see here this is intended to be a loop. You start in development: you're going to run on a CSV of data, maybe some hand-curated examples. I would argue the thing I just built was development, right? I have ten examples; it's not statistically significant; I'm not going to get the team on board to ship this thing. But what I can do is curate data sets, keep iterating on them, keep rerunning experiments until I feel confident enough and the whole team is on board before I ship to production. And once you're in production, you're doing that again, except now on production data — and then you might take some of those examples and throw them back into development.

Let me give a tactical example of what this looks like in real life. With self-driving cars, when I joined Cruise, we would go down the street for one block and then a driver would have to take over the car — we couldn't drive a single block down the road. Same went for Waymo; they were all in this same regime. Eventually we got to being able to drive down a straight road. Great — but the car can't just drive on straight roads, right? It has to make a left turn. Eventually we got fully autonomous, no problems, on straight roads, and then we had to make a left turn and a human would have to take over. So what we did was build a data set of left turns, and we used that to keep improving on left turns, and eventually the car could make left turns great — until a pedestrian was on the sidewalk, and then we had to curate a data set of left turns with a pedestrian on the sidewalk. So the answer is: building your eval data set just takes time, and you're not going to know what the difficult scenarios are until you actually encounter them. To get to production, I'd recommend using that loop until your whole team feels confident this is good enough to ship, and just accept that once you get to production you're going to find new examples to improve on. It also depends a lot on your business: if you're in healthcare or legal tech, you might have higher bars than if you're building a travel agent, for example.

>> Yeah. Yeah.

>> Yeah.

>> My name is Matai. Uh I have a question.

As I understood it, the Arise platform works like this: I take the prompt, and you're directly sending that prompt to a model, right?

>> That's right. Yeah.

>> With the context and the data.

>> Yes, of course.

>> You said there is some possibility to port tools into the platform.

>> That's right.

>> But what about testing the whole system?

We already have some flows that are augmenting the whole workflow.

>> Yeah.

>> Even outside of tool calls. Yeah.

>> And they're quite important to how the actual output will look in the end. Is there any way to run those evaluations on a custom runner — one that would actually call our system, on our data set, and go through everything that we have?

>> Find me after this — we should chat; that's the short answer for that one. We have some tools and systems like that in place, like the tool calling that you saw, but for end-to-end agents we're actually building some things out and would love to chat with you about that. Good question. I'll find you after this.

>> Yeah.

>> Yeah, of course. So, back to your left-turn example, and also the transition from PRDs to evals: what does the life cycle of feature development look like, and what's the relationship with your team in terms of ownership, accountability, all of that?

>> Yeah.

>> Yeah, good question. How you work with AI engineers in this new world is kind of interesting — not the subject of this talk, but a very relevant question that I'd happily chat more on too. There are two answers that come to mind. One is that development cycles have gotten a lot faster. The rate at which these models and systems are progressing means going from prototype to production is faster than it ever has been — that's one note I can just tell you as a personal observation. We feel we can go from an idea, to an updated prompt, to shipping that prompt, in the span of a day of testing, which is unheard of in normal software development cycles. So that's one note: the way you iterate with the team has gotten a lot faster.

The second note is about responsibilities. I view it as: the product manager is the keeper of the end product experience. If that means making sure the evals are in a good place and the team has human labels to improve on, that's a very solid area for a product manager to focus — making sure the data is in a good spot for the rest of your development team. At the same time, I'm a PM on the team and I'm writing some of this stuff in Cursor. Being able to go in and actually talk to the code base itself using one of these models — I think that's starting to become more of an expectation of AI product managers: to be literate in the code and able to use these tools.

Really — after this I'm just going to go back and try to fix the thing that I broke earlier, right? And the reason I'm able to do that is because the way I'm prompting the system is not very sophisticated. I asked it yesterday, "can you make a script that generates itineraries on top of the server? I need like 15 examples," and it just did that. That wouldn't have been possible before. So I think PMs are responsible for the end product experience, but PMs also have more leverage than they've ever had — probably in the entire professional history of product management — because you're no longer reliant on your engineering team to ship the thing you wanted. You can just go do it. Should you go do it is another question, and that's a discussion you should have with your team. But the fact is that you can do it now, which was not the case before. So I kind of urge PMs to push the boundaries of what people have told them the role is and should be, and see where that takes you. The long-winded way of saying: your mileage may vary depending on the boundaries you have with your team, but I'd recommend people redefine those at this stage.

>> Yeah. Yeah.

>> Yeah. jumping off that a little bit.

It's a little off topic from this, but — as a product manager who wants to become more technical while working with AI engineers —

>> Yeah.

>> What does that look like? I'm in an org where I have very limited access to the codebase. I use Cursor to write Python for data things, but I don't necessarily have access to start interrogating the code and understanding it. So I'd love your suggestions or thoughts on how to evolve as a PM, but also on maybe moving my company culture in that direction.

>> Yeah — actually, I have a follow-up question, if that's okay, just because I'm going to poll the room a bit: how big is the company? You don't have to share the name if you don't want to, but I'm just curious about the size.

>> We're about 300 people, but tech is probably a third of that.

>> Okay, so roughly 100 engineers, 300 people. And do you have any remnant product managers at the company that still have code access?

>> No, we're a very new team of PMs.

>> Okay, cool. Well, one thing we've started doing — it's a really good question, thanks for answering that — one thing we've started doing is using a bit of the public forum of our company. Sorry, I'm about to out our CEO, who's in the back of the room.

[laughter]

So if you have any questions about Arise, he's a good guy to talk to. But the reason I'm calling him out is that I missed our town hall today, and I heard it was basically people running AI demos the whole time of what they're building. Why I think that's really powerful is that it can get the whole company catalyzed around what's possible, because, to be honest, I think it's very likely that most teams today aren't pushing the boundaries of these tools. So you joining this talk and seeing how to run evals, what goes into experiments — being the person pushing the team forward is really powerful, and I think you can do that in a way that's really collaborative. I'd say our job as PMs is to have influence over the team and influence product direction, and I think there's an opportunity to influence the idea that PMs should be more technical in your org — and you can show them by building something and impressing the rest of the team with what you build. So that's my advice, my personal advice there. Yeah.

Yeah.

>> Yeah. Go for it. Yeah. [laughter]

>> I actually have a question, to see if it's possible. So how do you guys believe AI will reshape how we structure the team? Right now you have, say, 10 engineers, one product manager, one designer, and so on. What will happen in five years — will you have one product manager, one engineer, and one designer?

>> You should answer this one.

>> You should do it on the mic, though, if you want to.

>> [laughter] The short of it is actually Cursor on the code. There are so many times the PMs are taking up time asking a question — you know how often we just ask Cursor. So yeah, start there: open up your codebase to Cursor and give it to PMs. And the other day we were doing a PRD starting from Cursor on the codebase. So I think — yeah, that would be where I would start.

>> Yeah.

>> And I can't — you know, it's hard to look forward right now. I just think a lot of jobs change. We're trying to push AI and Cursor use throughout the company as far as I can.

>> Yeah, I hear we have people in marketing using Cursor too these days. So yeah, that's kind of cool.

>> yeah. Um

>> Follow-up question. Yeah.

>> So you're talking about product people becoming more like technologists. Do you also see technologists becoming more product?

>> That's actually a great point: when the cost of building something goes down — which it has — what's the right thing to build becomes really important and valuable. Historically that's been a product person or a business person saying, "Hey, here's what our customers want, let's go build this thing." Now we're saying: product people, you can just go build this thing. So the builders are like, "Wait, what's my job?" And I think a good way to look at it is — I have this mental framework of: what if we didn't have roles in a company anymore? You didn't define yourself by "I'm a PM," "I'm an engineer." Think of it instead like baseball cards — you have skills. Imagine you had a skill stack instead, which is like, "I really like to talk to customers and I kind of like to code stuff on the side, but I don't want to be responsible if there's a production outage." I guarantee you'll find someone who's like, "I hate talking to customers, I only want to ship high-quality code, and I want to be responsible if things hit the fan." And I think you want to structure your company to have skill stacks that are really complementary, versus people who say, "I do this, this is my job, and I don't do that." So yeah.

>> I have something that's sort of related to that. We've been testing human-in-the-loop in a couple of different ways, and we're basically testing this method of having the human as a tool of the agent. For example, if the agent needs something that's not available in the accounting system, it'll go to the CFO — because the CFO is listed as a tool — send them a Slack message, get the answer back, and continue. It kind of maps onto what you just said about defining the skills and the resources they have. We haven't fully fleshed it out, but it's working to give the agent context on the things that only the humans have.

>> Yeah.

>> Exactly.

>> So your company is using agents widely, it sounds like, but you have humans approving — like an approver workflow?

>> Something more — rather than "how can the agent be a tool of the humans," we're kind of flipping it and saying: what if the agent could do everything, and the parts it can't do, it goes to the human as a tool? So the CFO is a tool of the AI agent.

>> Interesting.

>> We should chat — that's a really cool workflow. I'll definitely bug you about that. It's human-in-the-loop to some degree, right? A human approving whether something is good or bad — you can think of it that way.

>> yeah.
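For illustration, here's a heavily hedged sketch of that "human as a tool" pattern — an agent tool that messages a named human on Slack and waits for a threaded reply. The token, channel ID, and polling loop are placeholders; a real setup would likely use Slack events or a queue rather than polling.

```python
import time
from slack_sdk import WebClient

slack = WebClient(token="xoxb-placeholder-token")  # placeholder credentials

def ask_cfo(question: str, channel: str = "C0123456789", timeout_s: int = 3600) -> str:
    """Tool the agent can call when the accounting system has no answer."""
    sent = slack.chat_postMessage(channel=channel, text=f"Agent question: {question}")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        replies = slack.conversations_replies(channel=channel, ts=sent["ts"])
        if len(replies["messages"]) > 1:  # someone answered in the thread
            return replies["messages"][-1]["text"]
        time.sleep(30)
    return "No answer from the CFO within the timeout."
```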

>> Yeah. I had a question about what it's like to actually implement sending the traces over to Arise. I know Arise has OpenInference, which enables capturing traces from several different providers. But what are the limitations, constraints, and opinions you have about how the evals should be structured, so that you can actually leverage the platform to perform these actions — to evaluate the evals, for example, or to numerically produce graphs out of your evaluations, out of your outputs?

>> Okay. So, so clear.

>> So can I ask a follow-up question to that? Your question was about how to use agents to do some of the workflows in the platform — or did I miss that?

>> The question is: what kind of outputs, what kind of evals is Arise expecting from your engineers and from the product —

>> You're sending over logs, right?

>> Mhm, yeah. What is it expecting from those logs in order to make this flow work?

>> Understood. Okay. So yeah, there is a great point there, which is — you'll see it in the code, but we jumped over it a little bit in the demo — how do you get the logs in the right place to use the platform? Unfortunately this page isn't loading, but — okay, here we go; I'm going to drop it in the Slack channel as well. This is what we talked about with traces and spans. It's very likely that your team already has logs, or traces and spans — you might be using Datadog or a different platform like Grafana. What we do is take those same traces and spans and essentially augment them with more metadata, structuring them in a way that the platform knows which columns to look at to render the data you saw in the platform. So you're really using the same approach. We're built on top of a convention called OpenTelemetry, which is the open-source standard for tracing. We use OTel tracing and auto-instrumentation that we've built, which doesn't keep you locked in at all. Once you've instrumented with our platform using OpenInference — which is our package — you actually get those logs to show up right out of the box with any type of agent framework you might be building, and you get to keep that.

Let me maybe just show what I mean by that. If you're building with, say, LangGraph, all you have to do is pip install the Arise Phoenix and Arise OTel packages, and then call a single line of code — the LangChain instrumentor — and it knows where to pick up in your code to structure your logs. And if there are more specific things you want to add to your logs, you can add function decorators, which are basically a way for you to capture specific functions that weren't auto-instrumented.
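A hedged sketch of that setup, based on my reading of the public arize-otel / OpenInference packages — treat the exact parameter names and keys as placeholders and check the current docs before relying on this:

```python
# pip install arize-otel openinference-instrumentation-langchain
from arize.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(
    space_id="YOUR_SPACE_ID",    # placeholder
    api_key="YOUR_API_KEY",      # placeholder
    project_name="travel-agent",
)

# One line: LangChain / LangGraph calls are now exported as OpenTelemetry spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```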

>> And as for evaluations — you were discussing the actual data inputs and outputs. What do you need to pass into evaluations? I know you can design them through the UI, but what do you have in mind for —

>> Like, how do you get the right text to use for the eval — is that sort of your question?

>> Like, how do you know which to use — I need to reformat the question. I'll get back to you.

>> Yeah, no worries. What did you mean by augmenting the data with additional metadata? You only have so much data, right?

>> Yeah. So think of it this way: most tracing and logging data is really just things like latency and timing information. What we're doing is letting you add more metadata — user ID, session ID, things like that. I'll show you an example really quick: in the previous example I showed, we actually have things like sessions — what's the back-and-forth exchange here. You can't get a visualization like this in Datadog, because Datadog is looking at a single span or trace; it's not really contextually aware of what is the human and what's the AI. So we're adding context from the invocation of the server and attaching that to your span, if that makes sense. It's basically just enriching the data a bit more and structuring it in a way you can use. And if you have more specific server-side logic, you can add that as well, so it's very flexible.
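As a generic, hedged OpenTelemetry sketch of that enrichment — attaching session and user metadata to spans so a backend can group a whole conversation rather than a single span. The attribute names follow OpenInference-style conventions but are illustrative.

```python
from opentelemetry import trace

tracer = trace.get_tracer("travel-agent")

def handle_turn(user_id: str, session_id: str, message: str) -> str:
    with tracer.start_as_current_span("chat.turn") as span:
        span.set_attribute("session.id", session_id)  # lets the backend stitch turns into a session
        span.set_attribute("user.id", user_id)
        span.set_attribute("input.value", message)
        reply = "..."  # call your agent here
        span.set_attribute("output.value", reply)
        return reply
```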

Yeah.

>> Yeah.

>> So I have a provocation. I used to work in the video game industry, and in debates about whether a feature was going to be fun or not, working prototypes won all of those arguments — whatever was in the doc didn't matter.

>> Right.

>> And so, for the person who said "I can't get access to my company's code": I would actually say try to get access to a small sliver of the data, and then build a working prototype of the feature you want to see, with some stub of an eval. Because there's nothing worse to an engineer than a product manager who shows up with a demo that's kind of janky —

>> Yeah.

>> — but actually works, and might be fun, has polish, feels good, meets a user need. Having been on the engineering side of this equation, I'm like, "and it's so janky, I have to fix it, they haven't thought about the edge cases." So how does Arise fit into that flow of helping a product manager basically mine a small segment of data and build a working example — perhaps janky as all get-out, but something that looks like the product the company already has while demonstrating that next level of functionality?

>> Great, great point. And yeah, I think: feel free to prototype and build prototypes that are high fidelity — I think it is awesome to do that.

That's a really good point — using data to build a system or prototype. So, what does Arise do here? If you have access to Arise but you don't have access to the codebase, you can still take this data — assuming you have permission from your sysadmin — and export it. Once you've built a data set, you can simply export it out and use it; I can show that really quick — this is "get dataset." We'll have a download button coming later this week, but you can just take this data, run it locally, keep it locally, and then use it in your local code to try and iterate on an example — again, assuming your security team is okay with that.
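A platform-agnostic, hedged sketch of that "export and iterate locally" idea — pull the curated dataset down (here just a CSV), keep it local, and loop a candidate prompt over it without touching the production codebase. The file name and the call_model helper are placeholders.

```python
import csv

def call_model(prompt: str) -> str:
    return "..."  # plug in whatever model access you have

with open("travel_agent_dataset.csv", newline="") as f:  # hypothetical export
    rows = list(csv.DictReader(f))

candidate_prompt = "Reply in two friendly sentences: {question}"
for row in rows:
    print(call_model(candidate_prompt.format(question=row["input"])))
```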

But that's a really good point: imagine if you didn't need access to the production codebase, but you could still iterate in one platform. That's really what we're pushing for — the whole team iterating on the prompts and the evals together, rather than in silos, which is what's happening in a lot of cases.

Okay, I think that was all the questions. Thank you all for sitting through an hour and a half of AI PM eval content. Thank you for your time — I'll be sticking around if people have more questions. Thank you so much.
