If You Don’t Understand AI Evals, Don’t Build AI
By Aakash Gupta
Summary
Topics Covered
- Vibe Checks Scale Until Production
- Evals Are Modern PRDs
- Evals Needed When Distant From Users
- Eval Has Three Parts: Data, Task, Scores
- Failing Evals Drive Future Innovation
Full Transcript
Evals are one of the most important skills for building effective AI products.
>> The failure and success of AI products is driven by how good the evals they write are, how well they use them, and of course how much they improve them.
>> One of the top eval companies, used by Replit, Vercel, and Airtable, is Braintrust. I think all the top AI companies understand that building a really good feedback loop from what their users are doing in production all the way through to the evals that they can run offline is really, really important.
>> Ankur Goyal is the founder and CEO of Braintrust, which just announced its Series B round at an $800 million valuation.
>> This tweet blew up. This is literally affecting people's jobs, who are product managers and heads of product. How should they be dealing with this controversy?
>> I think vibe checks are evals. I actually think this tweet is... I think one of the most important things is to have evals that fail. If you only have evals that succeed, then you don't know what problems there are.
>> Yeah. My brain immediately went to, well, let's improve the system prompt.
>> So, why should anyone care about evals to begin with?
Before we go any further, do me a favor and check that you are subscribed on YouTube and following on Apple and Spotify podcasts. And if you want to get access to amazing AI tools, check out my bundle, where if you become an annual subscriber to my newsletter, you get a full year free of the paid plans of Mobbin, Arize, Relay, Dovetail, Linear, Magic Patterns, Deep Sky, Reforge Build, Descript, and Speechify. So be sure to check that out at bundle.aakashg.com. And now into today's episode.
Ankur, welcome to the podcast.
>> Thank you so much for having me. I'm
really excited to be here.
>> When I think of experts in the eval space, you have to be right at the very top of the list. But some people, they just rely on vibe checks. I've had some product leaders on this podcast who have
created amazing AI features that have helped their company bag the next $500 million or $1 billion valuation just on vibes. So why
should anyone care about evals to begin with?
>> You know, I actually think vibe checks are a form of eval. And uh there's this really popular Paul Graham essay that um I think is very true. It's do things that don't scale. And vibe checks are
like the you know do things that don't scale analog for evals. When you do a vibe check, you are using your AI product and then using a scoring function which is your brain to try to
intuit whether the result is good or bad. And if it's not very good, then you might tweak the prompt or you might try a different model or adjust how your agent is architected, whatever it may be, and then try again. What happens is
that once your product gets into production, more people start using it, you have more subject matter experts and engineers at your company that are actually contributing to its quality.
then the vibe check version of an eval stops scaling and you need a little bit more software and process and tooling to help you um execute at a higher scale
and with more predictable performance, and that's where, you know, what we normally think of as evals start to come in. But I actually think it's a whole journey, and vibe checking is great; it's just one type of eval. You know, I think one of the really new things about AI development is there's this kind of magical ether that we have to
deal with, which is an LLM. And an LLM, you know, not unlike a person that you might hire or work with, is somewhat unpredictable. You don't know, if your app isn't working, whether it's because the LLM inherently doesn't understand your task or maybe you haven't prompted or built around it well. I remember
a couple years ago, right around when Braintrust started, um, a lot of my really smart friends who were LLM skeptics would say like, "Hey, this LLM doesn't understand C++ or it doesn't
understand my, you know, specific task even though it works for demos." I think nowadays people have mostly moved past that, but it illustrates the idea that it's hard to know where the
responsibility lies. Um, and I actually think that what the most clever, successful AI builders have proven, you know, Manus is a really good recent example of this, is that the alpha in
building a good AI product is kind of understanding that LLMs are imperfect yet very capable and figuring out how to work your way around that and make the
most of what LLMs are able to do today with an eye towards what they can do in the future. And that is really where evals come into play. They are a good way for you to turn, you know, the imperfection of LLMs from being kind of
a mystery or a burden into a really fun and engaging product and engineering challenge that you can actually overcome. Um, I think a lot of people are starting to recognize this. You
know, especially as uh software and models and agents are changing constantly. One of the things I come back to is that an eval is a relatively durable thing that you can invest in. So let's say that you're working on a new product area and uh you, you know, use the latest agent framework and use uh Opus 4.5, which is the cool
model right now all of that might change and you know you and I were just joking about this a few minutes ago like all of that might change in in a couple weeks or a couple months but if you um invest
in evals and and by that I mean you do a good job of understanding what your users are actually trying to do with the product and then you encode that as data
and scorers and um and eval flow then even as the models and agents and and everything change you've you've actually set yourself up to continue iterating and build on an investment that you
make. Um, and so I think that the companies that have started doing that effectively, they're actually building true differentiation. If you believe that the way that you've wired together your agent today is your differentiator, you're actually highly likely to fail, because that's probably going to change in a couple months. On the
other hand, if you build really good evals, then you've built something that has a little bit more durability to it.
I've been preaching this message to everybody, which is like the harness around your LLM, everything from the memory to the evals, that is actually your more durable moat because the model
underneath that continues to evolve. One
of the interesting things you have here on this slide is you have the quotes from Mike Krieger and Kevin Weil, and I think it's super notable because they're the product leaders at these companies.
What do you see as the role of the product manager in defining the eval?
You're running one of the most used eval platforms out there. Are product
managers the main user of it or how do they interact with whoever the main user actually is?
>> Oh yeah, I think this is honestly not something I anticipated when we started and uh just a little bit of backstory. I
know there's some controversy about evals and coding products and stuff, and I actually think it relates to this point. So I'd love to talk about that too. But what we've seen is that uh if you're building an AI product, you are now able to solve problems that software couldn't really solve before. And the
people who really, really drive that level of uh creative thinking and software application, outside of the sort of four walls of what software did before, are product managers. And evals are core to product managers' ability to do that. I actually have sort of shifted my thinking. I think of evals as kind of like the natural evolution of a PRD. So
if you look at a PRD in 2015, it's an unstructured document that is a spec that is meant to communicate how you should build something and what maybe
the success criteria are for the product working. Fast forward to now, I think the modern PRD is an eval. And um it's actually something that an engineering team who maybe doesn't know everything about the product or the problem that
they're solving can use to quantify how well the software that they're building is able to solve the problem. And I
think that actually means product managers are able to be a lot more effective because they go from providing kind of a qualitative spec that no one really follows and it's always kind of
annoying to reconcile the PRD with the actual product, into something that's very quantifiable. You can look at an eval and say, does this uh piece of software fit the eval or not? And oftentimes it will fit the eval and the product will still suck. And that means that it's actually on the product manager to then go and improve the evals. And I think that's an area of leverage that product managers actually didn't have before.
>> Yeah, I would love to talk about this coding controversy that you referred to.
This tweet blew up. I actually had somebody ping me about this like the day it happened because they were like, "My boss was telling me about this because I've been championing evals in my own company, but Claude Code is not using
evals. Have we been doing it all wrong?" So this is literally affecting people's jobs, who are product managers and heads of product. How should they be dealing with this controversy?
>> Yeah, for sure. I mean, I think a lot of coding tools also don't have product managers. And the reason is that the software engineers who work on the coding tools have relatively good intuition about what other software engineers want to do. And I actually
think the same principle applies here.
As I mentioned earlier, I think vibe checks are evals. So I just think this... I love swyx. He's a good friend. But I actually think this tweet is factually incorrect. Like, the fact that other people, you know, Boris and other folks at Anthropic, are using Claude Code and likely providing feedback about whether the model or Claude Code itself is solving their problem, that is a form of eval. Uh sure, they don't have it necessarily as a quantifiable process.
Maybe they do by now. We we we don't know or I certainly don't know. Or maybe
they don't use a tool or whatever and follow, you know, a what someone might think of as an eval. But I think they are doing eval. If someone is trying out the product and they're providing feedback and then they're incorporating that feedback into iterating on the
product, which I think they are, to me that's doing evals. Now, why are they able to get away without a structured process that is somewhat multidisciplinary, with product managers and engineers? I think it's likely because the engineers are solving problems for other engineers and they're doing it at a company that's training the models that are also able to solve that problem. So, it's totally verticalized and you don't really need any third-party intuition to solve the problem. If you go into another domain, like an AI company that is applying an LLM to solve healthcare problems, I think you're in a totally different world cuz they're probably not making
the LLM themselves. They probably have great software engineers who are passionate about healthcare but are not necessarily healthcare subject matter experts. And then of course there are product managers who are able to bridge from what engineers are working on to what, you know, patients or doctors, whoever the end user is, um is actually experiencing. My parents are both doctors, so I have a little bit of a soft spot for this use case. When I hang out with them and talk to them, I have almost no idea what they're talking about. Right? They're using very specific jargon. They're talking about, you know, medical issues that are obviously very important and maybe can be assisted by software, but I just don't have the intuition for that. And
so evals become a mechanism for uh product managers in this scenario to help glue together the unknowns of how an end user might actually interact with
a piece of software into something that is tangible that an engineering and product team can use to iterate on and improve the quality of their product.
Today's episode is brought to you by the experimentation platform Kameleoon. Nine out of 10 companies that see themselves as industry leaders and expect to grow this year say experimentation is critical to their business. But most companies still fail at it. Why? Because most experiments require too much developer involvement. Kameleoon handles experimentation differently. It enables product and growth teams to create and test prototypes in minutes. With prompt-based experimentation, you describe what you want. Kameleoon builds a variation of your web page, lets you target a cohort of users, choose KPIs, and runs the experiment for you. Prompt-based experimentation makes what used to take days of developer time turn into minutes. Try prompt-based experimentation on your own web apps. Visit kameleoon.com/prompt to join the wait list. That's kameleoon.com/prompt.
AI is writing code faster than ever, but can your testing keep up? Testkube is the Kubernetes-native platform that scales testing at the pace of AI-accelerated development. One dashboard, all your tools, full oversight. Run functional and load tests in minutes, not hours, across any framework, any environment. No vendor lock-in, no bottlenecks, just confidence that your AI-driven releases are tested, reliable, and ready to ship. Testkube: scale testing for the AI era. See more at testkube.io/aakash. That's testkube.io/aakash.
>> You nailed it, in my opinion, which is that if you're not the end user, it becomes more and more important, and the more distance you have from that end user, like in a healthcare setting, the more and more important it is to create the evals. I think also one thing that probably Claude Code benefits from is that Anthropic in their post-training is using a bunch of evals around coding. So even if Claude Code doesn't have formalized evals, we know Anthropic does.
>> Right. I think distance is the perfect way to think about it. If you imagine uh Anthropic, which is just an amazing organization, bubbling with talent, you have the people training the models, you
know, pre-training, post-training, building the harness, building the product, the UI, which is Claude Code for the harness, and the end users, all inside of, you know, one set of four
walls. And so the efficiency with which they're able to circulate feedback is very high, and therefore it may not need additional process or whatever to help facilitate uh the feedback. As
any of those points of distance start to increase, you actually need a little bit more structure. Like, one of the big use cases for Braintrust has actually been helping our customers collect evals that they can share with labs, so that labs can do a better job of implementing support for their use case. Uh so they, you know, they need some ledger to be able to capture that information, otherwise how are they going to communicate it?
>> Makes sense. And you mentioned Braintrust. I wanted to ask you, how big is Braintrust today? What can you share, whether that's users, revenue, valuation?
>> Yeah, Braintrust is about 100 people. We have many hundreds of customers and many tens of thousands of organizations using the product. We actually have a pretty generous free plan, um which we intend to make even more generous over time. If you're a product manager or an engineer and you're working on a hobby project, we want you to be able to use Braintrust without having to really think about it. And growth has just been
absurd. Nowadays, people are running about 10 times as many evals as they were this time last year. Today, people log about twice as much data per day as they did the entire first year of Braintrust being in existence. And it's just been incredible. I think what we've seen is that everything is growing in AI. Every individual LLM call is getting bigger. People are creating larger prompts. They're putting more context into their prompts. There are more LLM calls in every uh request that comes through because people are building agents, and agents are doing research and
interacting more frequently with users and doing much richer work. And then AI products are actually achieving real product market fit. And so the number of requests is also growing very rapidly.
And if you multiply those three things together, you get this incredible explosion of interesting data that we see flowing through Braintrust.
>> Wow. So, as of I believe October 2024, so a year and change ago, it was reported that you were valued at $150 million in that fund raise. Can you give us a sense of what the scale of growth
has been since then?
>> Yeah, I think we have been very fortunate. We were uh cash flow positive for a very long time, and so uh we've been able to utilize capital actually very, very effectively. I think uh I can't share the very specific numbers now, but if you look at our revenue metrics and growth metrics, we are more than an order of magnitude in growth uh on, you know, literally every axis, and if you look at just consumption growth, it's multiple orders of magnitude of growth. So, um, it's been a pretty wild, I don't know what it is, 15 months since then.
>> That's crazy. And I think it's a testament to companies that you have as customers, like Vercel and Replit and Airtable, being so keen on evals. Why are all the hottest companies so focused on evals?
>> You know, I think when we started Braintrust, we wanted to partner with entrepreneurs and uh builders who had companies that had pre-existing
product market fit and were earnestly investing in AI. Um, I'll just highlight Brian from Zapier, for example. Zapier
was our first customer. Brian is the CTO. He's been working on Zapier for a long time. And uh when I met him, he basically introduced himself as a full-time AI engineer. Now, this guy's like super successful. He probably doesn't have to work, but I haven't seen anyone nerd out about AI as much as Brian does. And the reason that we wanted to partner with these companies is that we knew that they would only build and ship products that met a
certain level of quality, and they would hold themselves to a rigorous product market fit bar, but they were very uh earnestly adopting AI. And that has very much turned out to be true with all of the companies on this list and, you know, most of the companies that we work with. And I think if you consider that these companies have pre-existing product market fit, so they have to do things at some level of scale, they can't simply rely on vibe checks. Although they do a lot of vibe checking, as everyone should, they can't simply rely on it. They have enough product market fit to actually drive real scale. And then they have products, like if Ramp doesn't work, it's very bad. You know, they don't really have the leeway to screw things up. And so the standard for the quality of the products that they're shipping is very high. You kind of mix those ingredients together and it's very, very obvious from first principles that you need to run evals and take
observability very seriously to implement a good product. And so honestly, it's been no surprise to me. I built Braintrust as an internal tool when I led the AI team at Figma. And, you know, intuitively I've known for a long time just how critical evals are to being able to execute product well, and we've very much seen that play out with these companies, who I think are very much on the leading edge of
building great AI products.
>> So I want to get a little bit more tactical for everybody. You have on there the stat, which I think is pretty crazy: 12.8 experiments per day. What exactly are those, tangibly? Like, what are people doing that they are running this many experiments per day? I remember 10 years ago, we were talking about 2015 PRDs, for instance, we might run 4.8 experiments total in a quarter, let alone just on our evals.
>> Yeah. Um, a property of AI is that experimentation, which used to be something that you would only run in production, um, is now something that you can do offline as well, and that is uh actually one of the
things that I think contributes to so much rapid evolution of AI products uh you're absolutely right like if you had this non-deterministic problem that you had to solve then you might have to AB
test it, and doing an A/B test is a very, very high-fidelity but very expensive way to get feedback about whether a non-deterministic thing works or not. In
AI, because you're able to do evals and actually um iterate offline, you can do those experiments just on your laptop.
In fact, in a few minutes, we're going to run some experiments with um a prompt and an MCP server and try and and improve some stuff. And I don't know if we'll run 12.8, but we're certainly going to run more than one experiment
and and iterate, you know, just just live.
>> What are the steps that we need to go through in order to define an experiment like this?
>> So uh this is straight from our docs. Uh an eval consists of three things, and I think this is a very helpful framework because it allows you to simplify what might otherwise be kind of an overly complex or infinitely complex topic. But an eval is literally three things. Data, which is a set of inputs. So we're going to play with the Linear MCP in a moment. Uh an example of a piece of data could be, how many tasks do I have assigned to me? That could be the input question. And then optionally, you might have a ground truth answer, like 12. You might not, which is totally fine. Actually, we're not going to have ground truth in the eval that we run, but if you do, you might be able to use it. The next part is a task.
A task is something that takes an input and then generates an output. And a task could be as simple as a single LLM call.
Like, you could just take the question and paste it as a message into GPT-5 Nano and then get a response. Or it could be as complex as an agent. It might do some
research or call an MCP server. It might
call other LLMs. It might call APIs or vector databases, whatever it may be. At
the end of the day though, it's going to produce some kind of output and that's the thing that you evaluate. And then
the last thing is scores. Scores take
the data. So they they know about the input, they know about maybe the expected output, they take the output of the task, and then their job is to produce a number between zero and one. I
think it's actually really really important you normalize things between a fixed range uh zero and one. And the
reason that's important is that it forces you to make everything comparable. So no matter what, like a week from now or a month from now, when you run a new eval, you'll be able to produce a score that's within the same range. And uh when you do that, that means that you'll be able to compare how the thing that you did today performs against the thing that you do tomorrow. Uh, so it's kind of this forcing function for you to really simplify how you want to assess whether the thing is working or not. And then once you are creative and you figure out how to do that, you have a really nice artifact that allows you to continue testing and evaluating things.
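To make the data / task / scores structure concrete, here is a minimal sketch in Python. It is not the Braintrust SDK or any real library; the Example class, the stubbed call_model, the scorer, and the run_eval loop are illustrative stand-ins for whatever your stack actually uses, following the pattern Ankur describes (optional ground truth, a task that produces an output, scorers that return a number between zero and one).

```python
# Minimal sketch of an eval: data, a task, and scorers that return 0..1.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Example:
    input: str                       # e.g. "How many tasks do I have assigned to me?"
    expected: Optional[str] = None   # ground truth is optional

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g. GPT-5 Nano through your provider's SDK).
    return "You have 12 tasks assigned to you."

def task(question: str) -> str:
    # The thing being evaluated: a single LLM call here, but it could be a whole agent.
    system = "You are a helpful assistant who answers questions from Linear."
    return call_model(f"{system}\n\n{question}")

def exact_match(input: str, output: str, expected: Optional[str]) -> float:
    # Example scorer: only meaningful when ground truth exists; always returns 0..1.
    if expected is None:
        return 0.0
    return 1.0 if expected.strip().lower() in output.lower() else 0.0

def run_eval(data: list[Example], task: Callable[[str], str],
             scorers: list[Callable[[str, str, Optional[str]], float]]) -> float:
    # Average score across examples and scorers: one comparable number per run.
    per_example = []
    for ex in data:
        output = task(ex.input)
        per_example.append(sum(s(ex.input, output, ex.expected) for s in scorers) / len(scorers))
    return sum(per_example) / len(per_example)

if __name__ == "__main__":
    data = [Example("How many tasks do I have assigned to me?", expected="12")]
    print(run_eval(data, task, [exact_match]))   # 1.0 with the stub above
```

Because every score is normalized to the same 0-to-1 range, the number produced by today's run can be compared directly against the number produced after you swap the model or the prompt tomorrow.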
>> Okay, so data, task, scores. I think I got it. Let's see it in action.
>> Awesome. Uh so we are going to create an eval entirely from scratch. There's no
pre-written prompts. There's no
pre-written data set. There's no
pre-written scoring functions. This is
going to be 100% live. Expect some fun uh nuances along the way. And uh let's have some fun. And and by the way, I actually haven't done this demo before.
So Aakash, if you have ideas or feedback about how we can evaluate this together, I'm all yours.
>> All right. For those who don't know, if you're just in the Jira ecosystem, Linear is a competitor to Jira. So it's your task management tool.
It's where you're putting down, hey, these are all the things our engineers are going to build. Um, we use Linear.
Uh, it's been a fantastic piece of software for us. And Linear also uses Braintrust. So, uh, they're a good friend of ours, and I think they have a really nice MCP server, which is, uh, super cool. Um, and we're going to use it actually as part of this. So, let's
say that we're building a tool that allows us to ask questions, uh, about our task workload and understand, you know, what what work do we have to do?
So, let's just write a really simple system prompt: you are a helpful assistant who answers questions from Linear. Okay. And let's create a data set. And instead of creating it from scratch, let's just use Opus to help us create the data set. So it's going to look at it. It's going to know that we're working on something related to Linear and it's going to generate some test data.
>> Okay. And for those of you who are wondering, you just said MCP. What is that? So, Model Context Protocol. Well, it's just the standard definition. Basically, it's like the API that LLMs can use. So, it's allowing the tool we're looking at, Braintrust, to get access to the data inside of Linear. And we mentioned Brian. Brian is the one who did the MCP primer podcast episode nearly a year ago on this podcast itself. So, if you want more, you can check that out.
>> That's awesome.
>> But it looks like we've got the initial test data from Opus. This is not the one from the MCP yet. This data has no MCP connection yet.
>> And actually, I don't love this data.
So, this is asking questions about what Linear is. Let's try to improve it. So, I don't really... we know what Linear is. We're trying to build a bot that helps us ask questions about the workload. So, let's actually tweak it. Actually, I want questions about my Linear project. For example, what tasks are assigned to me?
>> So creating stronger test data in this case, making it more about tasks and kinds of tasks instead of just the high level it was before.
>> Okay, great. Last but not least, remove the expected answers since we don't know them.
Models still love to hallucinate [laughter] even Opus 4.5.
Okay, great. So now it's going to create this data. Of course, I can always edit it. So let's see, like, we don't do sprints at Braintrust. So let's say, like, how many tasks need to be triaged.
>> And now what we can do is just hit run.
So it's going to use GPT-5 Nano, which is one of my favorite models. It's super cheap and relatively fast. And let's see what it comes up with. Okay, so this doesn't seem like a great answer. What tasks are assigned to me? "Happy to help with Linear. What would you like me to do?" Let's see. Are there any overdue tasks? "I can help with questions about Linear usage." Um, okay. Well, what we just did is a vibe check. And that means that we looked at some of these questions. We looked at the answers.
Aakash, feel free to disagree. I think
these answers are pretty bad.
>> Yeah.
>> Um, so now before we actually try to improve them, what I'd like to do is be able to quantify that. And that is where scoring comes in. The benefit of quantifying it is that we're of course
going to vibe check the improved results as well. But the artifact that we'll produce by actually running these evals is something that our team could continue to use, so that as we add more data, as we evolve the prompts, we'll have a quantitative signal about whether we're improving the thing that we're trying to improve. So now let's go back to Loop. Uh, by the way, Loop is the agent that's built into Braintrust and it works kind of like Claude Code or Cursor. It has tools that are plugged into all of the nooks and crannies of our product, and so it can interact with data and prompts and run evals and stuff for you. So anyway, we have these tasks and we know they kind of suck. Let's just see if we can write a scoring function using Loop so we don't have to create it from scratch. So
>> these answers aren't great. They are
vague and introductory. Can you create a scoring function that makes sure that, A, the answers actually answer the question, and B, if they cite any information or include any facts about tasks, they cite a source.
>> Okay. And while this is coming up, here in the lore of the podcast, the prior few eval episodes that we had from Hamel Husain, Shreya Shankar, and Aman Khan, they all warned against numerical scores.
They said that we need to go for more of like a binary yes no. Here we're going for a score there. Can you talk to us about that?
>> Yeah, I think the simple way to think about it is that jumping into scores like 0.2 uh or 0.4 four before you have
really justified the need to do that is not a good idea. And in fact, even though we are going to create numbers here, we're actually only going to create scores that fit a specific set of
values. Uh, let's see what the model came up with here.
>> So it it only has three options.
>> Um and if we look at what the definition of B is, it's partial. So it's saying it's missing citations for tasks but it has some sort of answer. We can change
that, like we can say that, hey, actually I don't like the fact that it's doing that. I don't want to give you any partial credit in that case.
>> Um so I think it's important not to over complicate your scores and I think if you're creating LLM based scores you shouldn't ask the LLM to generate a number because that's not very clear.
It's useful to have clear criteria, but I actually disagree that every score needs to be binary. I don't think there's any real justification behind that. In fact, I worked with the OpenAI team about a year ago and published a research cookbook, that I can uh send a link to you to, that walks through somewhat scientifically what is a good thing to do and what is a bad thing to
do and why. And so that might be helpful reading if anyone wants to go, you know, one level deeper.
>> So here we've gone with categorical and that should be all right.
>> Yeah. Yeah. And you can always change it, right? Like, if you think it's getting too complicated, you can simplify it. If it's not complicated enough, you can make it more complicated.
>> Got it?
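For readers who want to see the categorical scoring idea in code, here is a hedged sketch: the judge model only picks a letter from a fixed rubric, and ordinary code maps that letter onto a fixed value, rather than asking the LLM to emit a raw number. judge_llm is a hypothetical stand-in for a grader-model call, not a real API, and the rubric paraphrases the one generated in the demo.

```python
# Sketch of a categorical LLM-as-judge scorer with a fixed set of values.
RUBRIC = """Grade the answer to a Linear question:
A: answers the question AND cites a source for any task facts
B: answers the question but is missing citations
C: does not answer the question
Reply with a single letter."""

GRADE_TO_SCORE = {"A": 1.0, "B": 0.5, "C": 0.0}  # set "B": 0.0 to remove partial credit

def judge_llm(prompt: str) -> str:
    # Stand-in for a call to a grader model.
    return "C"

def answer_quality(input: str, output: str, expected=None) -> float:
    # Ask for a letter, then map it to a number in code.
    grade = judge_llm(f"{RUBRIC}\n\nQuestion: {input}\nAnswer: {output}").strip()[:1].upper()
    return GRADE_TO_SCORE.get(grade, 0.0)
```

Keeping the letter-to-number mapping in code makes it trivial to tighten or loosen the criteria later, which is exactly the kind of change made to this scorer later in the demo.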
>> So now let's run it with the scorer and see how it does. We'd hope that the score is quite low. If the score is high, that means that our vibe checks are not aligned properly with the
scorer that we created.
>> Okay, good. So it's zero everywhere. Um, if you look at one of these, you can actually see why. So we
have the model actually tell you the rationale and so it labeled it as option C and it says uh it doesn't answer the question and it doesn't provide any claims or specific facts.
>> Love it. It's accurate.
>> So now let's have a little bit of fun and actually connect this thing to the Linear MCP. Um, it's pretty easy to do that. You just click MCP, and then Linear has a nice HTTP-based MCP. So, we can just put the URL in and it will actually authenticate to my Linear account and give me a bunch of tools. Models can get somewhat overwhelmed by having a lot of tools. So, I'm going to remove some of these just to keep it simple. We can
always add them back or tweak this later.
>> That's an interesting insight around just selecting the tools that you need so that it doesn't accidentally choose the wrong tool.
>> Today's podcast is brought to you by Pendo, the leading software experience management platform. McKinsey found that 78% of companies are using GenAI, but just as many have reported no bottom-line improvements. So, how do you know if your AI agents are actually working? Are they giving users the wrong answers, creating more work instead of less, improving retention, or hurting it? When your software data and AI data are disconnected, you can't answer these questions. But when you bring all your usage data together in one place, you can see what users do before, during, and after they use AI, showing you when agents work, how they help you grow, and what to prioritize on your roadmap. Pendo Agent Analytics is the only solution built to do this for product teams. Start measuring your AI's performance with Agent Analytics at pendo.io/aakash. That's pendo.io/aakash.
Here's the dirty secret about prototyping. You spend two weeks building a prototype. You validate your assumptions. Engineering loves the direction. Then what happens? You throw the whole thing away. Bolt changes this completely. When you prototype in Bolt, you're not building a throwaway mockup. You're building real front-end code that integrates with your existing design system. So when you hand it to engineering, they don't throw it away. They ship on top of what you've built. I use Bolt every single day. I host my PM job cohort on it. And honestly, I'm up till 2:00 a.m. some days just vibing in the tool, having fun, and building. That's when you know a product is good. When you're using it past midnight, not because you need to, but because you want to. Check out Bolt at bolt.new/aakash. That's bolt.new/aakash. Link in the show notes. I hope
you're enjoying today's episode. Are you
interested in becoming an AI product manager, making hundreds of thousands of dollars more, joining OpenAI or Anthropic? Then you might want to do a course that I've taken myself, the AI PM certificate run by OpenAI product leader Miqdad Jaffer. If you use my code and my link, you get a special discount on this course. It is a course that I highly recommend. We have done a lot of collaborations together on things like AI product strategy. So check out our newsletter articles if you want to see the quality of the type of thinking you'll get. One of my frequent collaborators, Pawel Huryn, is the build labs leader. So, you're going to live-build an AI product with Pawel's feedback if you take this AI PM certificate. So, be sure to check that out. Be sure to use my code and my link in order to get a special discount. And now, back into today's episode.
>> And by the way, you can always do that later and actually evaluate that. So,
one of the things that you could do here is actually keep all of them enabled, run it, and then um maybe you don't get great performance. You could duplicate the prompt again and then try disabling some of the tools and see if you get better performance.
>> Mhm. Makes sense.
>> Okay, great. So, we'll save that and let's try running it again.
>> It's fun how fast you can iterate here.
And so, this might be like an example of those 13 experiments like they're constantly improving what they're working on and you just get the results so quickly.
>> Exactly. Every time I click run, actually, it is essentially running an experiment. Okay, great. So, it didn't actually do that. Well, welcome to AI. Um, so let's see what happened here. It said, "Are there any overdue tasks?" And this model said, "I'm ready to help with Linear tasks." But it doesn't actually do anything. It just
says what it can do. And it doesn't really solve the problem.
>> Wow.
>> Now, there's a few things we can do. One
thing we could do is we could try a different model. So, we could say maybe let's try GPT-5 or GPT-5 Mini and see if we get better performance.
>> Another thing we could do is we could try to improve the system prompt. So we
could say don't ask clarifying questions please just use the tools and figure it out.
>> Let's try it actually. And you know a third thing we could do is we could go and actually edit the questions. Maybe
the questions are not great and uh maybe if we made them a little bit more specific we'd get better results. Um and
then of course the fourth thing we can do is edit the scoring function. But I
agree. I think my vibe check on the score, which was zero, is consistent with what the score actually was. So I I wouldn't advise that we do that.
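The four levers just listed (try a different model, improve the system prompt, fix the questions, edit the scorer) can be pictured as a small grid of experiments, where every cell is one run of the same eval. The sketch below reuses Example, call_model, run_eval, and answer_quality from the earlier sketches; make_task and the model identifiers are illustrative assumptions, not real API calls.

```python
# Sketch: each (prompt, model) combination is one "experiment" over the same data.
def make_task(model: str, system_prompt: str):
    def task(question: str) -> str:
        # Stand-in for a chat-completions call against `model`.
        return call_model(f"[{model}] {system_prompt}\n\n{question}")
    return task

data = [
    Example("What tasks are assigned to me?"),
    Example("Are there any overdue tasks?"),
    Example("How many tasks need to be triaged?"),
]

prompts = {
    "baseline": "You are a helpful assistant who answers questions from Linear.",
    "no-clarify": ("You are a helpful assistant who answers questions from Linear. "
                   "Don't ask clarifying questions; use the tools and figure it out."),
}

results = {}
for prompt_name, system_prompt in prompts.items():
    for model in ["gpt-5-nano", "gpt-5-mini"]:
        results[(prompt_name, model)] = run_eval(
            data, make_task(model, system_prompt), scorers=[answer_quality])

# Because every cell is in the same 0..1 range, ranking variants is just a sort.
for variant, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(variant, round(score, 2))
```

That comparability across cells is what makes running a dozen small experiments a day cheap, compared with the one-at-a-time A/B tests discussed earlier.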
>> Yeah. My brain immediately went to, well, let's improve the system prompt.
Maybe let's add a few few shot examples of how to run it. Maybe let's specify the tools, but we didn't go to that level quite yet.
>> Exactly. And and as you can see, it actually didn't solve all of the problems yet, and OpenAI returned an error for one of them, but it seems like this one is actually pretty good.
>> So, let's take a look. Here's a quick digest of the 20 issues assigned to you.
Uh, so it's actually talking to linear.
And then if we go and look at the score here, it says it doesn't include a citation. So, it just mentioned that it got the digest, but it did answer the question. So it gave us partial credit.
>> Yeah. And it did a pretty good job citing its sources. So
>> So yeah, maybe that means we should improve the scoring function.
>> Yeah.
>> Yeah.
>> So it looks like we probably want to iterate both on our system prompting our scoring function.
>> Exactly. And by the way, I'm hand-doing this just for the purpose of showing you that. But one of the things that I really love about Loop is that I can say things like, I think the scoring function is too harsh. If the response contains any references to BRA tasks, then it has cited its sources.
>> So, this will go update the scoring function. And could we do the same for the system prompt? Could we say, right now we're still failing on four out of the five. So, can we add a few few-shot examples and specify which MCP tools in Linear to use?
>> Absolutely. So here you can see it's edited the criteria for the the scorer.
We can hit accept and it will update it.
Here's the updated one with the new criteria.
>> Nice.
>> And we can also say scores still. Oh
yes, please. It's offering to actually run the task and see how it scores.
>> Oh, yep. So we can do one change at a time and check it first.
>> Yeah, let's just have it do that.
>> Yep. Now that I'm back to becoming a coder thanks to vibe coding tools after 16 years being away, I'm like, one step at a time. All right.
>> Yes, please. Improve.
>> Yep. The prompt. Yeah. But not just with citations right?
>> Yeah.
>> Okay. And if you wanted, you didn't need to use AI to do this. I think like a lot of people, this might be a process also that maybe the PM isn't necessarily controlling at this point. Improving the
system prompt, it might be something that an AI engineer or an engineer involved with it is, but usually the PMs are pretty involved in the scoring function. Going back to your point around evals.
>> Yes. And by the way, I want to edit this slightly. We should actually get rid of this prompt so it just focuses on this one, if that's okay.
>> Yeah.
>> Okay. So, I'm going to remove this one uh so that it just focuses on the MCP oriented prompt.
>> Yeah. And then I'm going to say please rerun the eval and then improve the prompt.
>> All right.
>> So now we can, you know, go make a cup of coffee or relax for a little bit. Um,
it's kind of fun like watching a movie to see it actually do its thing and try to improve the prompt. And of course, all of this is versioned in Brain Trust.
So if it updates the prompt or it updates the score or something in a way that doesn't make a whole lot of sense, you can always revert back.
>> Yes, I think the AI is right. Instead of
using all the tools, a lot of times it's just listing back what's available. So
now it looks like, okay, it's going to go ahead and... There we go. It's telling us it's creating a system prompt kind of like how we were describing.
>> Exactly. Yeah. You can see those examples are right here.
>> Yeah.
>> Let's try it.
>> And let's just let it keep ripping away.
>> Yeah. You know, there was this kind of watershed moment we saw with Claude 3.7, where it was the first time we saw that a model was able to look at its own work and improve. I think prior to that, the metaphor that I use is it was kind of like a dog looking at itself in the mirror. It didn't really know, you know, whether it was a virtual representation of itself. Sometimes models would assume the identity of the model that they were or the prompt that they were working on. But when Claude 3.7 came out, things started to change. And that's actually when we shipped Loop. We had our own eval for this problem leading up to uh shipping Loop, and the eval performed terribly. And then finally Claude 3.7 came out and it was a huge jump, and we realized this product idea might actually work, and it allowed us to really be uh aggressive about that.
>> That's really really interesting. So you
should be thinking about what the future state products are going to be. Create the evals, then watch the models. Once the
models hit the right set of quality, you can go ahead and release it.
>> Absolutely. I think one of the most important things is to have evals that fail. If you only have evals that succeed, then you don't know what problems uh there are. And that means that you either don't have a clear understanding of what problems your users are hitting or you don't have a clear understanding of what is impossible today. And I think it's very, very important to have both. If you have evals that are failing, then when a new model comes out, the first thing you should do is just rerun those evals. And
you'll be surprised that every time a new model comes out, something interesting is going to happen. I heard
like uh some people who are running a coding tool, Gemini 3 Flash was somehow performing better on like a lot of coding benchmarks than Gemini 3 Pro, but it was hallucinating more. Oh yeah, these are like these nuances where you need to have a full eval testing suite to really understand which metrics it's improving versus which it's hurting.
>> For sure. And I think as with any benchmark, up does not necessarily mean good. Up just means that something interesting happened. And I think, more often than uh not, when you see something interesting happen in a benchmark, including an improvement, it means that the benchmark itself is broken. But you should not necessarily hypothesize whether a benchmark is broken until you're able to reproduce it with some real data. So I am a big
believer in doing really dumb seemingly obvious things like just autogenerating silly questions about linear tasks or whatever it is and then running stuff
and confronting you know the actual generated outputs with your intuition and using that moment as the opportunity to improve things as opposed to spending a month creating a perfect golden data
set that you think represents the problem that you're trying to solve and doing all this other prep work. I think
you should just jump in and then start iterating.
>> So that's a real case for, don't silo your Braintrust licenses and user accounts to the AI engineers. Make sure
that the PMs maybe even the right go to market domain experts who really understand it have access to the tool.
>> Yeah. I mean, uh, I actually don't remember, maybe three or six months ago, we sort of realized that Braintrust should not be constrained to the AI engineering team and we removed user-based pricing. So there's no user-based pricing. It's just based on how many evals you run and how much data you log. You should just not worry about that.
>> Mhm.
>> Looks like this thing is cooking and it's made some serious progress. I've
been watching along as we've been talking and it's solved some problems like telling the model to use the tool.
It's also solved the problem of the model asking for clarification. So I
think in chat-based use cases, models are post-trained to ask for clarifying questions. In the context of this demo, we're not giving it the opportunity to do that. We're just hoping that it generates a response from one question.
And so it's really important that we tell it not to do that.
>> Yeah, it's very cool. Well, I think it like started with a partial score, then it moved, it said, "Okay, let me iterate on the system prompt again to get a full score." So, it's really working through the problems.
>> Um, and yeah, I mean, this is evaluation. I think a few things I'd highlight are that we touched all three parts of the workflow. We worked on the data set. We iterated it a little bit. From here, uh, you might add more examples to it. You might tweak the ones that you have. You could use Loop to help you think about more examples to add. A second thing that we did is we actually worked on the task function itself. So we wrote a prompt, we picked a model, we changed the MCP tools that were available to it. We could do more work there. Like, I think maybe switching to a better model might help us consistently get a better score. Oh wow, it looks like we're now at 0.75 across the board. Uh, which is a huge improvement from where we were before.
>> Yeah.
>> And then the third thing that we did is we actually iterated. We created and then iterated on a scoring function. So
we made an initial one. You pointed out that the scoring function was being a little bit too nitpicky. So even though the response was citing the specific issues, it wasn't really giving it credit for that because it didn't have a
link or something like that. So we also improved and iterated on the scoring function to better represent what our vibe check or in this case your vibe check was indicating was a little bit off about how it was working. And I
think that process that we just ran is very, very representative of how people do evals.
>> So what is the distinction between offline and online evals, and when should people be doing which?
>> Yeah. So one of the cool things about the work that we did is we created a scorer and even though we're using it in this playground, this isn't the only place that you could use the scorer. So
if we go into the scorer list in Braintrust, you'll see that we have the scorer right here, and we can actually run it on real live logs and deploy it into production. So let's say we take this app that we built and we start using it. Every time we ask a question, it will actually run the scorer online.
In fact, we can do that right now. If we go back to the playground, we can save this prompt. Oh,
it's right here.
>> And I'm loving this prompt. You can see the tool usage patterns.
>> Great.
>> And and again, uh you're a product manager, so I think you probably correct me if I'm wrong, but you see this and you get some PRD vibes, right?
[laughter] >> And and that's what I mean. Like this is a much more quantifiable version of thinking about what a product should be.
And it's really fun, I think, to actually be able to take uh product intuition and quantify it and turn it into something really tangible. So if we go, we can actually take that scorer that we created and run it online. So we have Linear answer quality. We can run it on uh every LLM span and we'll run it on 100%.
>> We'll give it a name.
>> Um, and then it's super easy to actually test these things out in Braintrust. So we had the prompt that we created here. There's a little built-in chat interface. So I can say, like, um, what tasks are assigned to me, and you can see it's calling this tool and it's going to generate an answer.
>> And what makes this online is it's accessing the real data or what when people talk about that distinction what should they understand?
>> Yeah. So what's happening here is, we'll go to it in a minute, but every time I use a prompt in Braintrust, or whatever my app is, I'm going to be generating
real live logs uh of my production application and online evals are um taking these scorers that we build and running them on your real live user logs
and I think that's helpful for two reasons the first is that it gives you insight into how well the same eval functions that you're using to test
things offline are actually translating into real world performance. So let's
say that offline we are able to achieve a score of 0.75 which is not bad and then we run the same score online and we consistently see the result is 0.3. That
means that maybe it's not actually working as well in the real world as we think it's working in our little simulation environment. And then the second thing is that it becomes a really good flywheel for you to find uh examples that are worth including in your offline eval. So when you see that the score is 0.3, then you can actually filter down to the examples that are not performing very well, and then grab them and add them into that same data set that we were using to assess things.
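As a rough sketch of that flywheel, assuming production traffic can be exported as simple input/output records, the same scorer used offline can be applied to the live logs, the online average compared against the offline number, and low scorers folded back into the dataset. The records below are made up, and Example and answer_quality come from the earlier sketches.

```python
# Sketch: score real logs with the offline scorer, then harvest the failures.
production_logs = [
    {"input": "What tasks are assigned to me?",
     "output": "Happy to help with Linear! What would you like me to do?"},
    {"input": "Are there any overdue tasks?",
     "output": "You have 3 overdue tasks: BRA-12, BRA-19, BRA-27."},
]

for log in production_logs:
    log["score"] = answer_quality(log["input"], log["output"])

online_avg = sum(l["score"] for l in production_logs) / len(production_logs)
print(f"online average: {online_avg:.2f}")   # compare against the offline 0.75

# Low-scoring real traffic becomes new offline test cases.
new_offline_examples = [Example(l["input"]) for l in production_logs if l["score"] < 0.5]
```

If the online number sits well below the offline one, that is the signal Ankur describes: the offline dataset is missing the kinds of questions real users actually ask.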
Wow, I have a lot of tasks assigned to me. I'm going to have to do some coding work after this. Um, so if we go to our logs page, you'll see that right here, this is exactly the chat thing that we had and then the eval running. And it looks like we didn't do a... oh, it looks like we actually in the end did a good job. So I set it up to evaluate every step. Maybe you only want to evaluate the last step,
>> but it looks like at the end it actually scored pretty well.
>> Nice. So to summarize for people, your offline eval is based on that golden data set and you can continue to improve that golden data set when you see a discrepancy between your performance on
your offline and your online evals. You say, "Okay, everything we failed online, let's potentially add that back in, or those are candidates to add back in, to your golden data set for what's running offline." Did I get that right?
>> Exactly. You can actually do that directly in our UI. So you just find examples that you think are interesting and then you can add them to the data set.
>> Okay, very cool. So, how do you maintain trust in your eval system so people don't bypass it when they're shipping new features?
>> Yeah, I mean, I think that the best teams don't think of evals as a gate.
They think of it as a core part of their iterative loop of actually improving things. And I think that the best workflow looks like looking at real production examples. Um, in fact, some of our customers have kind of like a ritual where every morning in standup, they'll look at some examples from the previous day's usage of their product.
And then what they'll do is they'll reconcile what they see with those examples with what their evals uh have.
So, let's say that the scores are very low for, let's just use this Linear example, questions related to our UI.
It's like, huh, maybe we don't have that many questions related to UI tasks in our eval data set. So what they'll do is find these novel patterns that have
emerged from their logs and then add them to the data set and maybe you do that in the morning and then what they'll do is they'll grind that day and actually try to improve the eval performance on the things that they
noticed and that becomes a really helpful way to prioritize what you should actually work on and what it means to actually succeed on a particular endeavor like hey it clearly
looks like we're not doing well on questions related to UIs let's bring in a bunch of those tasks, add them to our data sets, reproduce that problem in our evals, and then go and iterate on it until we're able to produce a better
result. And I think that's the best way to think about evals. If instead, which a lot of people do, and I try to discourage uh folks from thinking about it this way, what you might do is: I think there's a problem, let me edit my prompt to try to fix the problem and play with it on three examples, and then, okay, it seems like it's better now, let me go run a full eval run and see if I can ship this thing. I think you're not going to be as efficient, because you're not thinking about the broader problem, which is represented in the data set, uh while you're actually making those iterations
in the first place.
>> Amazing. Couldn't agree more. This, you guys, was a less-than-one-hour masterclass in evals. There is much, much more out there where you can go deeper. If people want to go deeper, Ankur, where should they be going?
>> Well, you can reach out to us. www.braintrust.dev is our website. Um, you can email me at ankur@braintrust.dev or reach out to us on um X or Discord.
We also have a user conference coming up in February called Trace. So if you go to braintrust.dev/trace, you can see information about signing up. It's a zero-bullshit, practitioner-led conference. So, a bunch of talks from people from companies like Dropbox, Ramp, uh, Notion, um, other folks that we talked about earlier, who are just going to talk about how they're solving these problems and would
love to meet you.
>> All right, guys. For my money, in 2026, whether you're building an AI feature right now or not, every PM should be learning this skill. I hope we got you excited enough to go out there and try this out, maybe with a free Braintrust account or something else, whatever platform you are using. Get out there, start iterating. You saw how fun it was. You saw how I was jumping in on how I wanted to do more of the system prompt. I think you'll feel that same excitement once you get your hands into a tool like this. So, I hope we removed that barrier to entry for you guys. And
we'll see you in the next episode.
>> Thanks for having me.
>> I hope you enjoyed that episode. If you
could take a moment to double check that you have followed on Apple and Spotify podcasts, subscribed on YouTube, left a rating or review on Apple or Spotify, and commented on YouTube, all these things will help the algorithm
distribute the show to more and more people. As we distribute the show to more people, we can grow the show, improve the quality of the content and the production to get you better insights to stay ahead in your career.
Finally, do check out my bundle at bundle.aakashg.com to get access to nine AI products for an entire year for free. This includes Dovetail, Mobbin, Linear, Reforge Build, Descript, and many other amazing tools that will help you as an AI product manager or builder succeed. I'll see you in the next episode.