
How Top AI Product Managers Evaluate Products | Ep36

By Data Neighbor Podcast

Summary

Topics Covered

  • AI PMs Span Three Overlapping Dimensions
  • Evals Evolve Beyond Binary Thumbs
  • Three Workflows Scale from Vibe to Production
  • Align Evals to Business Outcomes
  • Eval Intensity Matches Trust and Speed

Full Transcript

What is an AI PM?

Either you're building a core product where, you know, your core product experience has AI in it in the first place.

Another type of AI PM is sort of someone who's kind of more involved on the platform side.

And then the last but not least is really an AI powered PM.

And that's someone who's just using AI as part of their job.

Aman, why isn't it enough to just be like, hey, I'm looking at what my feature's output and, like, it feels good.

You've laid out a very comprehensive system to think about it.

Sort of three categories of workflow.

There's the vibe coding workflow, where you just get the thing out, test it, and literally ask as part of that vibe coding project, write me an eval. Vibe eval, I call it.

Workflow two, now we have 100 plus users.

We should probably write an eval.

At this point, most people start looking at their data.

You'll give sort of thumbs up and thumbs down.

And then the third workflow is really.

Today's guest is Aman Khan, a leading voice in AI product management.

He's worked across many products you'll recognize.

Apple, Spotify, Cruise,

and is currently head of product at Arize AI, focused on AI observability and evaluation, which I'm excited to say is the topic of today's episode.

Don't forget to subscribe to Data Neighbor wherever you listen to podcasts and YouTube and drop a like and comment below.

It keeps us bringing you great content like this.

With that, let's dive into our conversation with Aman Khan.

Hey everybody.

Welcome back to another episode of the Data Neighbor podcast with Hai, Shravia, and Sean.

Today we have Aman Khan from Arize AI here with us.

Welcome Aman.

Look at my mind.

Hey, thanks for having me guys.

Stoked to be here.

Us too.

very, very excited to talk to you.

And certainly you're one of the few PMs on the podcast, and they do well.

They always are the best.

They always do very well.

You guys know how to deliver a nice product.

I feel very privileged to be on a data podcast.

We'll talk about it a little bit, I'm sure, but it's always funny being on the product side and feeling like, you know, I remember so much when I was hanging out with data scientists and data people in my previous jobs and just trying to absorb as much as possible about that field.

So it's always great to talk to data folks.

Nice, yeah, we're excited as well.

Okay, so Aman, you are the head of product at Arize AI and you, I don't know if you coined the term or at least I've come to know you as kind of like the AI PM, if you will.

So how did you, I guess, first of all, what is an AI PM and how did you actually get into it?

Yeah, so, you know, it's funny, I think the idea of what an AI PM is just sort of started coming naturally from thinking about what I do in my day-to-day job more and more, and, you know, I'm not even sure I could take credit for the title or coming up with it, but I can say that from where I sit, a lot of AI product management sort of looks

like sort of three different dimensions, which are: either you're building a core product where, you know, your core product experience has AI in it in the first place.

And I'm sure we'll debate what does AI mean here?

Is it predictive or generative?

But there's some black box in your product that's helping generate the output or make a decision.

Whereas before, I think in a lot of traditional products, sort of software products, code was pretty deterministic.

So that's kind of one major change: you're managing a part of the product that's now non-deterministic in some way or predictive in some way.

I think another type of AI PM is sort of someone who's kind of more involved on the platform side.

So this can be model building or selecting the right model, maybe fine tuning, you know, kind of concerned more with platform costs, latency, security types of challenges.

We see that a lot.

This often used to be like the central ML team, and I think in a lot of companies it's still sort of a central ML, more of an infra type of role, enabling AI and ML applications across the company.

And then the last but not least is really an AI powered PM.

And that's someone who's just using AI as part of their job.

They're building prototypes and coming to their team.

They're using cursor to help them write specs.

They're maybe even writing code themselves or doing data analysis or writing scripts to help them make sense of what product decisions they need to make.

What's kind of fun is that I like to think of these as sort of Venn diagram, like a Venn diagram of three overlapping circles where today there's probably some like unicorn AI PM that's sort of in the middle of all of that.

But I do think that as time goes on more and more, these circles are going to be more overlapping.

You can kind of think of AI starting to become as ubiquitous as like the database, right?

And then eventually every product is going to have some flavor of AI in it in some way, either in a pipeline upstream or downstream of that product being delivered.

That's going to involve selecting a model.

And the PM building that is probably going to be AI powered in some way.

So today I think that's sort of like, it's not mutually exclusive, there's some overlap.

And I think in the future, this will be even more tightly bound together, which is kind of cool.

Got it.

Yeah, no, I think that makes sense.

Everything is kind of AI these days, right?

Like every product, every feature is like, do I have either an AI-native or sort of an AI tack-on feature on top of it?

In light of something like that, the products themselves are pretty different, right?

How do you then work with data professionals in your product development process?

Yeah.

So I think there's definitely a huge variance, I would say.

I think probably everyone has some different mileage around this at their own companies.

In our case at Arize, we are an eval and observability platform for AI applications.

So we actually started in the traditional ML space in ranking, regression, and classification models around five years ago.

And so that meant that if you were building any type of predictive engine or any type of box in your company making predictions or inferences, we were helping you take those inputs, outputs, and even the features at times, many times, and helping you try to decompose what changes to a model might, you know, kind of look like if you were actually

making changes, A/B testing between models, what the differences in the output and the performance of the model are.

Nowadays, because so many applications have LLMs in them, because you can kind of take this off the shelf model and use it for a lot of the applications that traditional ML types of models would kind of take on, like regression, sort of predict it a little bit to some degree.

It can do scoring, it can do classification, summarization, extraction.

And then of course with agents you can do now taking action and things like that.

Nowadays, I would say our interactions are...

traditionally, I would say like generally more on a customer interaction sort of standpoint.

So we have a lot of customers that are data practitioners using our products to kind of log their prompts, log their generations, log and create evals within the platform and then iterate on those products within sort of this ecosystem and taking a look at the new outputs and sort of doing this type of A-B test similar to like the model days of am I making my application better or worse?

What do you mean by, like when you talk about logging evals, maybe could you provide some example, like what an eval might be?

Yeah.

So maybe it's helpful to contextualize, since I think a lot of listeners are likely familiar with, you know, creating models and what that looked like in sort of the ML 1.0 days, I guess you can call it the pre-transformer days, sort of.

And, I think you basically had a few different signals you could use.

Most traditionally, you'd have some type of ground truth label, right?

in the case of like take like a lending application.

You could see if someone defaulted on their loan, some period of time later, you would take that signal, feed it back to the model, and then retrain the model with the latest data that you had and sort of use that as a signal to, you know, based on some set of criteria or factors, is it likely that someone with these, these kinds of features will default on a loan?

In the LLM world, that sort of principle still applies where you can get ground truth from an interaction.

So instead of a lending agent, maybe it's a customer support bot.

And that might look like a thumbs up and thumbs down in an interaction.

So if you are solving a customer's problem and they have a good experience, they might give you a thumbs up.

If they're frustrated, they might give you a thumbs down.

What's different in this world though, is that there's a lot more dimensions you can grade how good an experience is.

And that's where the concept of evals really starts to emerge, where you can actually go beyond thumbs up, thumbs down, or traditional ground truth.

And you can use things like code evaluation.

Again, you had things like this as well in the previous world with NLP. One metric that's been carried over is sort of edit distance: how much has the text changed from, you know, one type of generation to another, in the BERT days and now as well.

But now you can do things like using an LLM as a judge and use the LLM to grade the output of the original generation.

across some dimensions of quality, like tone, correctness, rule following, and you can use your ground truth labels to align the LLM as a judge.

So you're kind of creating this data set of human labels.

Maybe it's 10 to a hundred, something there in that range to get started.

And then you're creating an LLM sort of classifier, a multilabel classifier often, against that original data set and making sure that, you know, your LLM judge is aligned with the human-labeled data set.

And the reason for that is so that you can then trust the LLM judge to make a judgment on new generations without needing a human to label every single sort of output.
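A minimal sketch of what that alignment step might look like, assuming a hypothetical call_llm helper wired to whatever model provider you use; the idea is just to compare the judge's labels against the small human-labeled set before trusting it on unlabeled traffic.

```python
# Sketch: align an LLM judge against a small human-labeled set before
# trusting it to grade new generations. `call_llm` is a placeholder for
# whatever client you use; the prompt and labels are illustrative.

JUDGE_PROMPT = """You are grading a customer support reply for tone.
Reply: {reply}
Answer with exactly one word: good or bad."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

def judge(reply: str) -> str:
    # One dimension (tone); real judges often grade several labels at once.
    return call_llm(JUDGE_PROMPT.format(reply=reply)).strip().lower()

def alignment(human_labeled: list[dict]) -> float:
    """human_labeled: [{"reply": ..., "label": "good" | "bad"}, ...]
    Returns the fraction of rows where the judge agrees with the human."""
    hits = sum(judge(row["reply"]) == row["label"] for row in human_labeled)
    return hits / len(human_labeled)

# Start with roughly 10-100 human-labeled rows; only run the judge on
# production traces once agreement is high enough for your use case, e.g.:
# if alignment(seed_set) > 0.9: grade_unlabeled_traffic()
```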

Yeah, I feel like historically a lot of ML kind of performance stuff was more around, like, I don't know, if you're at school and you're grading math answers or like Scantron sheets, and it's so clear what the answers are.

But then you had like your English class and maybe you're grading your papers and you'd have like some shitty teacher who's like, he was like, dude, no, this is a good paper, man.

or like a TA is grading it or something.

I guess that's like been a thing that's been around for a while.

Like there's always been kind of like texts that like we're trying to like categorize and stuff.

And there's not like always a ground truth, but now it's just so widespread because like generating a text is just cheap.

Yeah.

I like to use the analogy of like, I have this meme I throw up when I talk about evals, it's sort of like that, you know, that guy sitting at a desk where it's like, you know, basically it says like evals are just unit tests, changed my mind, right?

And it's like, and it's, I actually think that that's usually the sort of, you know, thought process.

Aren't, aren't, aren't evals just like unit tests?

Aren't they just like tests?

And the answer is kind of a nuanced, like, yes and no.

Cause you can kind of treat evals like unit tests in a way where you want, you know, when you make a change to your application, you want your unit tests to pass in the same way you want your evals to sort of go up.

You want the score to improve.

But what's different is, coming back to what we were talking about earlier with this LLM world, you can give the same LLM, you know, 10 of the same inputs and you might get 10 slightly different outputs.

Even if you're adjusting certain parameters, you can, most folks know if you make the temperature lower, it's going to hallucinate less.

It's going to be predictable.

If you use structured outputs, you're going to get more sort of repeatable outputs.

But the reason people use LLMs is so that they can get reasoning tokens, which is like chain of thought.

They actually want the LLM often to hallucinate and think and reason and not be this fully deterministic thing.

And so that means that what's different is that in a unit test 10 out of 10 times, if the test is passing, it's going to pass.

But in the LLM world, you have variance that adjusts your probability of a unit test passing.

And so you have to now construct your test with that in mind for more variance, essentially.
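A hedged sketch of what that looks like in test form: instead of a single pass/fail assertion, run the same input several times and assert on the pass rate. The generate and passes functions here are stand-ins for your application and your check, not anything from the episode.

```python
# Sketch: a "unit test" adapted for non-deterministic output. Run the same
# input N times and assert the pass rate clears a threshold, rather than
# expecting 10 out of 10 identical passes.

def generate(prompt: str) -> str:
    raise NotImplementedError("your LLM-backed feature goes here")

def passes(output: str) -> bool:
    # The check could be a string assertion, a code eval, or an LLM judge.
    return "refund policy" in output.lower()

def pass_rate(prompt: str, trials: int = 10) -> float:
    results = [passes(generate(prompt)) for _ in range(trials)]
    return sum(results) / trials

def test_refund_question():
    # A threshold you can live with, instead of demanding determinism.
    assert pass_rate("How do refunds work?") >= 0.8
```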

yeah.

So much more subjective.

Yeah, that makes sense.

And, Aman, why isn't it enough to just be like, hey, I'm just looking at what my feature's output, what my model spits out, and like, it feels good and we're good.

We're, good enough and let's, let's, let's move on.

You've laid out a very comprehensive, sort of, almost like a system to think about it versus, I guess, a vibe check.

Yeah, so, you know, in what I've seen, there's sort of three categories of workflows, and Hai, I think you're probably alluding to some of this from, uh, a prior discussion we've had.

that's like, that's great, which is there's really three, three workflows that come to mind.

There's the vibe coding workflow, which let's be honest.

I think we've all vibe coded an app before, put an LLM in it.

And you're like, thumbs up, thumbs down.

This is pretty good.

Right?

Like it's good enough.

If you're building something for yourself or for a few friends.

Do you really need to overthink it with a ton of evals?

And so I think that's like kind of workflow one, just get the thing out and test it.

And in that case, like, I think it's helpful also to think about evals as not this big, overwhelming project you have to go and take on.

You could just literally ask as part of that vibe coding project, if you're using Claude Code or Cursor, just write me an eval for this that I can run locally.

So if I make changes to it, you can run a local test.

And it doesn't have to be super sophisticated, but it's probably going to be better than nothing and give you some signal if something changes.

So that's sort of workflow one.

That's like the vibe check, the vibe eval, I call it.
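For what a vibe eval might look like, here's a minimal sketch of the kind of throwaway script you could ask Claude Code or Cursor to generate and run locally; the app function and the cases are made up for illustration.

```python
# Sketch: a throwaway "vibe eval" to run locally after each change. Nothing
# sophisticated: a handful of fixed inputs and cheap string checks, which is
# still better than nothing as a signal when something regresses.

CASES = [
    {"input": "Summarize: the meeting moved to Friday.", "must_include": "Friday"},
    {"input": "What's 2 + 2?", "must_include": "4"},
]

def app(user_input: str) -> str:
    raise NotImplementedError("call your vibe-coded app here")

def run_vibe_eval() -> None:
    passed = 0
    for case in CASES:
        out = app(case["input"])
        ok = case["must_include"].lower() in out.lower()
        passed += ok
        print("PASS" if ok else "FAIL", "-", case["input"])
    print(f"{passed}/{len(CASES)} cases passed")

if __name__ == "__main__":
    run_vibe_eval()
```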

Workflow two, or the second workflow I hear, you kind of see coming up a lot is the, okay, now we have a hundred plus users.

We should probably write an eval.

And it's so interesting.

I talked to a ton of companies that are like, yeah, we have a ton of users, you know, using our product.

And I'm like, what's your eval system like?

And they're like, it's totally like home-built, super scrappy.

Like we look at the data every week and we kind of decide if we need to make any changes.

I think that's the like write an eval workflow where you are just kind of just using an LLM to create some type of eval or you're just looking at the outputs over a set of like a hundred plus customers.

But this isn't going to scale.

This is the like, we have an eval, but does it actually work?

So at this point, most people start looking at their data and they're probably logging their data to a spreadsheet.

That's the most common kind of place where people are starting to do evaluations.

They log their traces to an Airtable or a spreadsheet.

And you usually go in and you give, you know, across some dimension, you'll give sort of thumbs up and thumbs down.

I like to use tone as an example, cause it's really simple to think about.

It's like, is the LLM responding in a way that you want it to respond in?

So you're starting to do this like thumbs up, thumbs down, and then maybe you have another LLM starting to align around that in the first place, but you're really maybe in the early stages of doing some type of an error analysis.
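As a sketch of that second workflow, assuming a plain CSV as the stand-in for the spreadsheet or Airtable: each trace gets a human thumbs up or down per dimension, and those rows later become the seed set for aligning a judge.

```python
# Sketch: workflow two, log each trace plus a per-dimension human label so
# you can do error analysis and later reuse the labels to align an LLM judge.
import csv
import os
from datetime import datetime, timezone

LOG_PATH = "traces.csv"
FIELDS = ["timestamp", "input", "output", "dimension", "thumb"]

def log_trace(user_input: str, output: str, dimension: str, thumb: str) -> None:
    """thumb is 'up' or 'down' for a single dimension, e.g. 'tone'."""
    write_header = not os.path.exists(LOG_PATH)
    with open(LOG_PATH, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "input": user_input,
            "output": output,
            "dimension": dimension,
            "thumb": thumb,
        })

# Example: log_trace("Where's my order?", "It shipped on Tuesday.", "tone", "up")
```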

And then the third workflow is really evals in development and production.

And that's sort of like an eval driven development process.

EDD is a term that's starting to come up more, or evals as requirements.

And that's where your evals are your gates, where you may have 100-plus customers.

You might have, you know, might be trying to scale up to thousands or millions of users and you're doing many more transactions or interactions in a day.

And in that case, you actually need an eval system that you can trust.

So, you know, when you push a change, you're not going to break something for hundreds of users.

And you know, when you have this thing go live, you have a system of getting feedback from the real world of how it's performing.

So you can do things like A-B test a model.

You can do things like get feedback from users based on evals, and, to use the customer support bot example from before, we were kind of talking about what happens when a human gives you a thumbs down on a customer support response.

Like, let's say I want a refund for something and the agent says you're not entitled to a refund, and the person gives a thumbs down, but what if the agent actually responded correctly and used your processes?

Yeah.

And said, you know, you're actually not entitled to it.

Now, what happens in that case?

Do you have a system to catch the examples where your eval says something is good, but the human or the customer is upset in some way?

And that's really the tiebreaker, where you need evals in place.
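A small sketch of that tiebreaker idea, under the assumption that each trace carries both an eval verdict and the user's thumb: pull out the rows where the two disagree and queue those for human review first.

```python
# Sketch: surface the cases where the eval and the user disagree, e.g. the
# eval says the response followed policy but the customer gave a thumbs down.
# These disagreements are the rows a human should review first.

def disagreements(traces: list[dict]) -> list[dict]:
    """traces: [{"output": ..., "eval_passed": bool, "user_thumb": "up"|"down"}, ...]"""
    return [
        t for t in traces
        if (t["eval_passed"] and t["user_thumb"] == "down")
        or (not t["eval_passed"] and t["user_thumb"] == "up")
    ]

traces = [
    {"output": "Per policy, this order isn't eligible for a refund.",
     "eval_passed": True, "user_thumb": "down"},
    {"output": "Your refund has been issued.",
     "eval_passed": True, "user_thumb": "up"},
]
for row in disagreements(traces):
    print("needs human review:", row["output"])
```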

Yeah.

What you just talked about kind of reminds me of, I guess, the non-AI or even the AI product development space too, where it's like, hey, define your success criteria ahead of time, right?

Like versus launch something and then see what happens and then judge it to be like, this is good.

This is bad kind of thing.

This is pretty much kind of like a modern-day analogy of that, if you will.

I would say it's so tricky too, 'cause there's this thing, kind of like what you were talking about earlier with the subjectivity, and I'm just thinking of it now again when you're talking about a customer being unsatisfied even though it's the correct answer.

It's like, with these generated responses, every customer's definition of correctness is so varied, and they might have 10 versions of what a correct answer is within a certain person.

But if they're trying to do something on a tight timeline versus they have a lot of time, their own subjectivity around that correctness might change, and then customer to customer, some are maybe more prickly than others and have different definitions. It's so different than, I mean, I guess there's always been personalization in models and stuff, but I just feel like now, with these kinds of text-based answers, a customer themselves can have different, like, their idea of what's good or bad can change quite easily.

Yeah, and I think...

Good or bad, you kind of talked about subjectivity in the answers, or the subjectivity in the outputs from some of these agents, Even if your workflow has some version of good or bad, right?

And that's like, ideally there's some response that you can gauge as like binary or even multi-label.

I've been thinking more and more about thinking about something like a, you know, Meta is launching more AI companion type use cases.

And in that case, let's just take, you know, something like an AI companion or even a therapist type of response.

Like, you can gauge: okay, there are types of questions that a therapist would ask.

Is it kind of following guidelines?

Is it creating like a safe space?

You can evaluate more than just a single turn of the agent, but actually the entire session or multiple turns in the conversation.

But what's really important is that the agent is aligned with your business outcomes.

And in some cases, that's going to be like the person resolving a discussion and feeling like they were heard.

And how does that map to your business metrics is really, really important.

And I think that's something that it's easy to lose sight of if you're optimizing for a local outcome of rule following or being helpful.

But sometimes people need to be challenged.

Sometimes people need to be asked questions to sort of break out of their mindset.

And that's where I think it's going to get really interesting as we get into more personalized use cases of these agents and LLMs. And we think more about how they're mapped to business outcomes.

I think those two things are going to really be pretty interesting to think about from an evals perspective, because you can already measure single-turn and multi-turn interactions.

But what's the right thing to measure is going to be really, really important.

Yeah.

What's your, I guess, if this is early thinking for you, where's your head at in terms of where this is headed?

I think the underpinning of this question here is, you know, data scientists, machine learning engineers, data engineers, folks who are the audience of this podcast, for example. You know, I can't speak for you, Sean, but, you know, I'm averse to uncertainty, right?

Like, you know, as much scientific rigor or ground truth or proof as possible is better than, you know, a world where everything could be super subjective and we have to, you know, put a stake in the ground to say, this is better than that.

How do you kind of think about that when, you know, especially when you were talking about tying it back to business outcomes that really matter?

Well, this is where I think having a PM on your team is probably worthwhile.

Because I think I view this as an opportunity for collaboration.

So usually I'll pose this back as like a question that you should have on your team, which is what happens when the eval is good, but the business metric doesn't reflect that.

And that's really where you want to be measuring is there a disconnect between what we're optimizing for and what the business cares about or what the product team is going to measure.

And there's a, there's sort of a question, deeper question of that, which is also like who's accountable and who's responsible.

And I do think that, you know, if you're a data practitioner and you are accountable for the eval metric in a lot of ways, well, your PM might be responsible for the outcome.

And if the PM is responsible for the outcome, are they a part of the process of labeling the data, or being involved in the process of labeling the data that's used to create the eval in the first place?

And so that's usually like an interesting sort of discussion to have on the team, is who's labeling data for the eval in the first place and is it representative of a business outcome?

Yeah, I feel like it's such a good point.

For decades in product development, there have been very clear frameworks around building success metrics that actually relate to a user's journey and then user value and then business outcomes.

And I do find, in some of the places I've worked over the past few years since, you know, GPT-3.5 came out,

even people, me included, who already had that framework in place sort of forget about it and come in like, oh, this is a totally new evaluation framework.

And we're going to just focus on, um, I don't know, LLM as a judge for yes or no, if it's hitting all these things, and forget to come back and just apply the kind of

table stakes or first principles of linking stuff back to a business outcome, and it's all definitely possible to do.

I don't know why we forgot about it.

It's weird.

Not everyone has.

You haven't.

I think, I actually think the reason is, the more I spend time with these types of products and technology, the more I feel that we are just starting to learn what the right form factors are.

And so to kind of make the problem a little bit more tractable, I think we want to pick metrics that are sort of bound.

Right?

Like, you're reducing the degrees of freedom for this thing to fail from how we can measure it.

And that's fine for getting started or experimentation and for a proof of concept and maybe even your production product.

But I think what we're all experiencing right now, to some degree, if you've used something like Claude Code, is early inklings of product-market fit with the form factor, where we're like, holy cow, this thing

goes off, makes a plan, writes code, can write markdown, execute on it, and do multiple steps to solve a problem, and then ask for feedback when needed.

And it lives in a command line interface, right?

Like it is as close to like the code as you could possibly get.

And I think that that's really the point where I think for product builders, that's an opportunity to think, huh?

If this is the form factor, the UI is like literally a line of code or a terminal, do I need to think about my interface a bit more, my interface layer, and how I measure the agent a bit more holistically, and go beyond just, you know, agent performance and be more closely aligned with the business?

And like, I can get more like specific about that too, like is...

Is a chatbot the right interface?

Should this thing be behind the scenes?

And how do we measure the outcomes more closely to like what we actually want versus, you know, maybe a chatbot as well is like one example.

That's so interesting because especially lately, I don't know if you've kind of picked up on this too, or if it's just me in my own bubble or, you know, LinkedIn just feeding me stuff that I've liked before more and more.

There's a lot more discourse around the complementary nature of both the design elements of the product and also sort of the capability, if you will, of the LLMs.

So as an example, there's a really cool article about, you know, hey, for high-risk products, or the way that they kind of described it was, hey, Cursor is not perfect.

People know cursor is not perfect, but yet they still love it.

It's because it's designed in such a way that you understand it's not perfect.

You understand where it's limited and things like that.

And you can mitigate a lot of the lack of capabilities with just how well you can surface it in the design, so that people inherently understand that it's not that if I do X and it doesn't give me the right answer, then I'm out, kind of thing.

And it kind of fits perfectly with what you just said.

And so that just reminded me of that.

Yeah exactly right?

Think of the usage of the product as more of process than a transaction to some degree.

Yeah, you still come in with a task, but the journey to get there and how you learn kind of matters.

As an example, when I use Cursor or Claude to do something for me and it doesn't do a good job, if it doesn't go back and update the rules files, I tell it to go do that, but it sort of becomes part of the interaction, right?

It's like working with someone like, you didn't follow the instructions.

Can you update your rules to be more specific next time we do this?

And it's like, yep, I got it.

You're absolutely right.

And I think that's an interesting, like, how do you gauge that?

Like, is it really just always about, does the code execute, or is it about pulling out insights or explanations?

I actually want, I'm curious what you guys think about this.

Like I've been reading more and more about how the uptake of coding tools, I think from an adoption standpoint, and the impact that's having right now, seems to be largely in and for less technical people than for deeply technical engineers.

And I think that's a really interesting observation, and it sort of intuitively makes sense for me: like a PM or a designer using these tools to up-level themselves because they

weren't able to do something, or it would take them all week to ship some type of code, and now they can do it in like 15 minutes, which is a massive productivity gain, versus a backend engineer that knows an existing code base and system so deeply that if they ask an LLM to go write code, they know they're just gonna have to go and check it anyway, and it actually makes them less efficient.

Like, is this something you guys are seeing?

Like, what do you think of that observation?

for me, I think, yeah, I think the most productivity gains I've seen is when someone's exploring or like pushing to a new space they haven't really worked in before.

So for me, an example of that might be, yeah, we kind of talked about this before we started, but rather than running something in a notebook, building it out as actual scripts and like some Streamlit app on top of it.

Like that's probably something I wasn't going to do on my own before.

Or for me, it might be on like the more like data engineering stuff where it's like before I might ask like a data engineer or someone to like go into like our dbt repo and try and figure out like what upstream is causing my issue.

And now I'm gonna just like clone that repo and figure it out myself.

On the engineering side, it's actually pretty interesting.

like the evaluation pipeline I'm working on right now would have been something before where it's like, I'm just doing it all in some notebook.

And now it's like a way more fleshed out repo that I'm pairing with like engineers on.

But those engineers aren't like data scientists and data engineers.

So they're using Cursor and Claude Code to up-level their data science and data engineering skills, like stuff they weren't going to do before.

But I have seen, so like they're getting some productivity gain there, but I definitely have seen it.

I think we just had a retro last week on our team where someone was like, I would have done this thing way faster if I didn't use like an AI tool, but now I'm all like reliant on it.

So I think for cases where it's like, it's your realm of expertise.

I've done that too actually.

Some stuff it's like a small change and I'm just like, no, do it this way.

No, do it this way.

Come on.

And I was like, dude, I should have done that myself.

So I think that's fair to say where people, you're like deep experts in an area, it's probably still faster to do it yourself.

It's actually hard to switch between, 'cause you're working in places where you're not a deep expert and you're

in this workflow where it's all language based and then to pop your brain out of that language based workflow and back into regular coding is like context switching at a very weird degree.

So it's like hard for me to do that sometimes.

Yeah, it's almost easier to copy-paste some snippet of code and ask the LLM to explain it back to you in natural language in some cases.

Yeah.

Yeah.

I'm curious whether you're seeing something similar on the data side as well, like working with data, has that been up-leveled using coding tools?

I know it's a little bit different than the topic we were on, but it feels kind of related from a product standpoint.

Yeah, I think that's a really great question.

I think I remember which article you were referring to about that study that I think kind of made the rounds the last couple of weeks, I think.

And I think a couple of things.

One is sort of like the, what do you call that?

Like similar to what we just talked about in terms of how...

understanding the capabilities and then use something else to complement it.

Or even the earlier thing you gave about evals, like how much is actually needed depends on the quote-unquote riskiness of what you're doing, right?

Like if it's like internal tool, probably you don't need a crazy system here.

But if it's like for many, many users, you better have something pretty buttoned up.

I think it's almost similar to that.

Like, okay, what are we actually getting information on from the data, for example?

And so that would be kind of like, you can vibe code your way.

I guess you can vibe analyze your way out of it.

But if it's for, you know, executive dashboards or something that's way more production-grade, then I think there's certainly, I do think there's a confounding factor here, where how much the engineer or the data scientist or the data engineer actually knows how to prompt LLMs well is kind of a wild card too, right?

Like if you know the limitations of it and you deeply understand which model does what aspects well, or how they're not great at certain things.

I think that would be very different than if someone is like, I'm just new, like I know my stuff and I chat with Claude occasionally, for example.

That's an interesting observation.

If I could try to, let me know if I got this right, but what I took away from that is people that know how to use the LLM tools really well might just have a preference, or maybe even an advantage, working in certain situations versus those that are maybe more comfortable working in a repo that they know well.

It's not quite apples to apples, to some degree, in terms of productivity.

Yeah, I think so.

I'm curious to see if Sean agrees with that.

It's kind of hard to tease out all the confounding factors because there's quite a bit of moving pieces here.

But I do think, hey, if somehow everybody's AI skills, let's call it, are at the 90th percentile, and if they're in an environment where they're doing very similar things, are you not able to accomplish more with it in whatever you're trying to do versus before?

That's, I think that is something that, I don't remember if that study kind of addressed that.

Yeah.

Yeah.

I mean, I feel like no matter who you are, you're going to get some productivity gain, even if you're like a deep expert in some stuff, because just realistically, we're just not doing stuff like nine to five that requires deep expertise.

Like for sure, there's some tasks that do require that.

And yeah, I think you need to know, I think you need to be using these tools a lot to know, hey, I'm going to hit this threshold pretty soon and I have to do this.

It's faster to do it myself.

Yeah.

Yeah, I think maybe like to build off of that for a sec, like how do you eval for that?

You know, how do you eval for productivity?

If that's what you care about, then correctness is a function of that.

But productivity is sort of goes beyond that.

It might be giving you new ideas or directions to explore from research, from analysis.

And I think that's where things start to get really interesting.

Like the products that have some great product market fit right now in the market are like code gen, deep research and analysis search, obviously as well, you know, and how to eval for those cases might be super personal and, and how the foundation model companies eval for their models and those use cases might look really different than how your business might need to eval too.

Because if you're a bank or financial services company, you care a lot about the advice you're giving to your clients, versus if you're OpenAI, you're going to do your best, but it says, hey, ChatGPT gets stuff wrong sometimes, as a disclaimer every time you use it.

Like at a certain point, you know, are you going to lose trust if your AI agent is just wrong or making things up, versus people know that ChatGPT gets things wrong and to check the result, or at least they should.

The spectrum is so wide.

Aman, do you have any sort of principles for how people should think about evals by different product types or different scale, as you just mentioned?

Because I would assume like a lot of people do it differently.

Different companies do it differently.

Different industries do it differently.

But I'm guessing it's probably not a invent your own wheel sort of situation over and over again.

Yeah, I think there are definitely some best practices that have started to come out of how to think about evals in your business.

I would say like first and foremost, just classifying the importance of the task and how important it is to get, you know, have trust and retain trust for your, you know, for the outcome of the LLM.

Like who's the user is kind of the question you can ask yourself there.

And you can put on a couple of different hats, right? Like I like to ask, if the CEO of your company saw this output, would they be understanding, or would they be like, you know,

we can't put this out in front of customers, right? And so that's usually a pretty good litmus test.

Yeah, it's kind of like, would you put an intern in front of your top clients or customers, is another way to think about it. So it depends, like, who the user is, is it internal or external, and how important is it to get things right?

So that's like the trust dimension a little bit of your product or use case.

I think you could try to, you know, this is again, I'll wear the product hat for a second, put value behind that sort of from a, like if we're thinking of a hierarchy of requirements and the value is you could put a dollar amount or value behind getting things right as well.

The inverse of value is like cost.

So is there a cost to getting things wrong?

Mm-hmm.

so those are kind of two sides of the same coin to some degree.

For example, with coding agents, many times the cost of getting things wrong is very low.

And so you can iterate much faster and your tolerance for getting things incorrect is actually pretty high.

And then that's usually, you know, the third factor there is like speed.

So do you need more time to get things right?

For instance, you could put an eval or a check on the outcome of everything that your agent generates, theoretically.

But how would that impact your product experience?

So if someone is willing to walk away and come back five minutes later, get a cup of coffee and come back and they're okay waiting and their tolerance for getting things wrong is very low, that's usually a good example of where you would probably want an eval in place to make sure you get things right when you show something to a user.

And then I think the last, last but not least is like where the eval is in your workflow.

And usually you want evals at each part of your development, meaning offline testing in code, development testing at time of iteration, and production to measure your ongoing performance.

And then what I kind of just described is like an eval even one step beyond that, but that's sort of like a guardrail eval of like a check before the generation is actually shown to a user.
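A minimal sketch of that guardrail idea: a check that runs on each generation before it reaches the user, with a fallback when the check fails. The generate and guardrail_check functions are placeholders, and the trade-off is the extra latency on every response.

```python
# Sketch: a guardrail eval, a check sitting between generation and the user.
# If the check fails, show a safe fallback (or regenerate) instead of the raw
# output. Cost: added latency on every single response.

FALLBACK = "I'm not confident in this answer, so let me route you to a human."

def generate(prompt: str) -> str:
    raise NotImplementedError("your agent's generation step")

def guardrail_check(output: str) -> bool:
    # Could be a code check, a policy classifier, or an LLM judge.
    banned_phrases = ["guaranteed refund", "legal advice"]
    return not any(phrase in output.lower() for phrase in banned_phrases)

def respond(prompt: str) -> str:
    output = generate(prompt)
    return output if guardrail_check(output) else FALLBACK
```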

Yeah.

But if you think about those four principles, like who's your user, what's the value, what's the speed of the generation, and then your workflow, usually you can think of this as like a pyramid for how to think about where evals fit into your workflow.

What, um, your, your point about like someone who has low risk tolerance and willing to wait got me thinking about this.

How do you think that kind of balance of speed and rigor might reflect differently in different user experiences?

Like for instance, If I'm waiting for something to load and there's just like a little like that like circle, like spinning right there.

I'm like going to be like, this thing fucking sucks.

Like, like I'm not going to be the five minute patience guy, but I don't know if it's like, if it's like the uploading or I don't know, it's like, we're going to send you an email.

We're going to send you a text later.

Like maybe it's different.

Have you thought about that as like, you know, as like a product manager of what that might look like?

Yeah absolutely.

I think it's kind of interesting.

I think this is probably an area of like ongoing product development in this AI world, which is I was using Claude the other day for something and I kicked off, there's a couple parameters you can switch in Claude so you can put on extended thinking, which is like a little bit hidden, but I usually try to turn it on and then you can turn on research mode.

And I remember asking it like a pretty simple question, something like accounting question or something like this.

And it's like, let me go do research.

And generally my personal experience with like how long a response takes in that mode is somewhere like around five minutes.

And I remember I came back and I checked in and it was like 15 minutes in and it was still working.

And I think what kind of sucked about that experience was I just had no idea when it was going to be over, right?

I didn't know when to check it.

So I could imagine, from a product perspective, how hard is it to estimate how long it's going to take to get an answer to something, especially if it's an easy question, right?

Like, then I know that there might be a problem with the request of the API.

So I'm kind of surprised.

This is like one of those like paradigm, like product paradigms that hasn't made it into the mainstream, which is just classify the complexity of the question, determine how many resources you're going to have to go and research and throw an ETA on it.

You get a progress bar instead of a spinning wheel. And that's one of those things, I think, to your question about product experiences, coming back to an earlier point: I think we're just so early on thinking about how token generation and context fit into a product experience, because right now people are willing to wait.

Like you have people that are, patient for now, but I think depending on tasks, multi-agent systems, inference getting faster.

I think those are all going to be factors in future product development to start thinking about.

And the real answer is you probably just need to do user testing and think about creative ways to surface up,

Hey, it's going to take two minutes for this to be done, maybe come back later.

So that's more of a UX kind of question to answer.
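A rough sketch of that progress-bar-over-spinner idea: classify the request's complexity up front, map each bucket to a rough ETA, and surface that to the user. The keyword heuristic and the numbers are invented; a real router might use a small, fast model for the classification.

```python
# Sketch: estimate an ETA by classifying request complexity first, so the UI
# can say "about 2 minutes" instead of showing an open-ended spinner.

ETA_SECONDS = {"simple": 10, "needs_research": 120, "deep_research": 600}

def classify_complexity(question: str) -> str:
    q = question.lower()
    if any(word in q for word in ["research", "compare", "sources", "report"]):
        return "deep_research" if "report" in q else "needs_research"
    return "simple"

def eta_message(question: str) -> str:
    bucket = classify_complexity(question)
    return f"Working on it. This usually takes about {ETA_SECONDS[bucket]} seconds."

print(eta_message("What's 2 + 2?"))                        # ~10 seconds
print(eta_message("Research and compare three vendors."))  # ~120 seconds
```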

Yeah.

I wonder if it just reflects how it would actually work if you were using a human to do this stuff before.

just think about, I think about when I was like more junior and like, I'd be working with a PM and they'd have like some ad hoc requests for me.

And I'd be like, cool, I'll do it.

And like, it'd be like, maybe it's like a bigger thing.

That's going to take me two weeks.

And then like that dude would like, uh, complain to my manager and be like, I don't know what Sean is doing.

Like, what is he doing?

It's like, I'm working on the thing you told me, man.

And then like, and then I was like, okay.

I need to check in with, I need to give you an ETA, of course, but like, I need to check in with you at different milestones and give you some little nuggets of stuff.

And then I'll, I'll give you like everything later on.

I don't know how that works differently in, kind of, a user interface, but it's kind of like, yeah, chat is something that's such an immediate back-and-forth response.

And so then does it evolve to something more like how you would with a real person?

You chat about it and it's like, okay, I'm like gonna just like notify you later via a text or whatever it is.

Claude Code, I think now, I mean, it's a great experience.

It makes a plan for you, right?

By default, it'll be like, needs to follow these steps.

You can see how many tasks it's gone down, so you can kind of gauge progress from there.

What I've been experimenting more and more is like helping it construct a longer form task, like set of tasks or a list of tasks so that it can just go off and solve problems for longer without intervention.

So here's an interesting like tidbit.

Like instead of just asking the agent to solve a problem for you, if you just append to the prompt: anticipate three more problems that I might ask you, and solve for those too.

You just get it to like, go do the next thing for you.

You may or may not want it, but it's like, think for a minute about what I'm going to ask you to do next and go just do those things next.

And it's probably just going to burn tokens.

Don't get me wrong, this is probably not a smart way to work with your agent.

But, but it's kind of an interesting experiment to try, right.

And see like for some use cases, how does it work?

And this sort of started entering my workflow because there was a system prompt for Grok, or there was some prompt for Grok, that kind of did something where it tweaked the temperature and the stop or completion tokens,

where there are implicit stop or completion tokens that the response API has, even in Claude Code.

And with Grok, it just overflowed that limit.

And so Grok would just start going and generate a response, ask a follow-up question, generate the next response, ask another follow-up question.

But that can be really valuable if you're just trying to generate tokens and for some downstream tasks.

So I thought that was kind of an interesting workflow.

Man.

Yeah.

I literally gave that feedback to a junior data scientist yesterday.

Like he was asking me about like planning a project with a stakeholder and I was like, yeah.

I was like, and you're really smart and you're really close to data.

just like think about what they're going to ask you once you deliver this and then do that too.

And some I've been given that feedback for a while and I've never thought about do it today.

I, there you go.

I think the more that you get used to thinking about this stuff as, you know, people on your team that really are pretty strong.

And so like time invested in context setting, time invested in showing how to do the work to train up someone on your team.

This is how we do things.

Is it going to save you heartache when they go do something wrong?

Right?

Like, I think that's something we've all learned by working with more junior members on our team too.

Yeah, people learned that working with me early on in my career.

We really gotta tell this guy what to do.

We've all been there.

Claude's been there too, and it's our job to make Claude more effective as a team member.

That's so cool.

That's awesome.

I read an article or a sub stack newsletter, I think, just yesterday or a day before.

I think the title was something super clickbaity.

It's like, why are we so low on patience nowadays?

Something like that.

And what it's highlighting is...

You know how like for humans, especially like, especially people who are new to the field, whatever field they're in, in their career, you have to kind of tell them like, you know, you should always give a TLDR. You should, you should give, you know, like enough context, but not, too much, save the detail for kind of like clicking through the doc kind of thing.

That's, I mean, at least that's for data scientists, but you never kind of just throw over a doc and be like, the answer is in there.

Go find it.

Like that kind of stuff.

And the argument was effectively that we're so trained now to seeing well-formatted things from LLMs.

Like if you ask a chat, if you go to Perplexity, you ask a question, it gives you nicely formatted things with bullet points, with takeaways, with summaries.

And it's instant, it's fast, anticipates what you're asking for and what you may not have asked.

And so...

It's so fascinating in this conversation where it's like, we're all kind of starting to run out of patience, if you will.

Now it's like, hey, we've got to be really good with the AI so that we kind of get a little bit of, I guess, it reminds us to bring back the humanity to working with AI.

And maybe that will just help us be even more productive.

I almost view it as, I think we all know implicitly how to pick up on AI-generated text more and more.

And I think it's going to be so pervasive.

Like this thing kind of feels like it's going to be a bit of a pendulum in our workforce of like, now people are like, my God, I can generate PRD so fast.

And then I take this PRD and I give it to my engineering team and they're like, you just made this with ChatGPT, and then you kind of lose the person on the other end.

Nobody wants to read AI generated text.

Let's be honest here, right?

As soon as you read the, as soon as you see the first dash, you're like, I'm out, right?

Like you're like, I'm checked out of this thing.

The rest of it.

those poor people who are using dashes for years and they're just screwed.

I wasn't an em dash guy, but like a hyphen guy.

And I'm just like, people are gonna think that's an em dash.

Like I gotta, now I gotta do all these commas in my running sentences.

So that's kind of what I'm getting at.

I think the next alpha is just going to be better creative writing in the workforce.

Cause I think it's going to just grab more attention.

So to me, that's actually what I'm starting to experiment with and play with more is like, can you encode one, can you encode writing style?

How good can you actually do it?

But two, in the process of doing that, I'm just starting to think more creatively about how I try to express my own writing, uh, without LLMs. And so, you know, I think, I think that's kind of been one outcome of this is like actually trying to use them less for writing and seeing how far that gets me, that's sort of an experiment.

I tell the AI to be more robotic in its writing.

I'm like, don't even try to do these weird, human-y things.

Just be direct and get to the point like a robot would.

Like you are.

Be yourself, Like, come on.

I don't know if you've been seeing this floating around on X, but people are calling LLMs clankers.

Have you heard this one?

Like as like a term, like a, like a slur for robots and AI, like, you know, so it's, it's kind of, it's kind of been stuck in my head as like, it's pretty good.

yeah.

I won't be using that term because I'm not going to get canceled by the AI in the future years to come when they're ruling us all over.

They're like, we got video.

We got a video of this guy calling us clankers.

Yeah exactly.

You gotta be careful.

We're not gonna, yeah, he's not gonna get his, what do they call it?

What's like the free money, you know, we're all gonna have when we're all unemployed.

Yeah UBI sorry.

No UBI for you this month.

Back to the mines, Shane. Yeah.

You're going to be working on the servers.

Greasing the servers, you know?

server greasing.

Crap.

F***ing server greaser duty again.

Yeah.

Oh.

that's pretty cool.

Hey, Aman, very random question for you.

What are sort of your biggest challenges when you, I guess, work with data teams in today's world?

Yeah.

Let me ask like a follow up question to that actually, Hai, which is like, what are some of the things that data teams are, like what would you stack rank as like the three things data teams are doing right now?

Just for me to like help help contextualize a little bit more.

Like what are some of the things they're working on or doing these days?

Like I think we were talking about before, PMs are working on dashboards and engineers are working on pipelines.

Like what are the priorities of data teams that you guys see?

Yeah, I think it obviously depends on the maturity of the company, right?

Like it's kind of crazy to think that, you know, hey, we're all thinking, hearing and watching sort of the whole gen AI evolution plays out in front of us.

But there are a lot of companies that are still at the basics, like step number one in terms of data: like, hey, can we even trust the data that we have?

So, you know, like if the answer is no, then you have to go fix your pipelines, go fix your metrics, go create governance within the company, that sort of stuff.

And then, you know, step two would be how to use that data to actually help the company be useful, and sort of, we're still very much at that stage.

And that's where I would say a lot of the companies are actually still.

And then, you know, you have everything else on the further-along end of the spectrum of, hey, we now have a lot of the basics.

We know what inputs we need to drive to get at the outputs.

And the idea is still like, Hey, can you do a lot more analysis and insights faster now with, for example, like AI and explore more ideas.

And so, you know, very different ends of the spectrum, but by and large, I think, unless someone is super specialized, the charter is still more or less very similar.

So, "do you trust the data in the first place" is sort of what data teams are mostly focused on, from what you're kind of getting at here.

Yeah, and it becomes even more important as we offload and outsource, if you will, to agents to help speed up productivity.

Yeah.

I mean, that definitely resonates, right?

And even within that area, there's so many ways that teams, I think now are trying to figure out how to best take advantage of this type of data.

Like there's like, how is the data stored?

How is it queried?

And then how do you verify it?

To some degree, those three things kind of pop out.

And as a product person, all three of those things matter.

Like for us, we're a data platform company in many ways.

Like we help you...

store and log your LLM data, run evals on it to augment it and enrich it and hydrate it, and then query and do analysis on top of it.

But our customers are doing the same thing, just sort of inverse, right?

To some degree, which is how do you pull insights out of the data that we're surfacing to you, either via UI or API?

How do you map it back to your metrics and what you care about?

And then how do you represent it to your stakeholders?

And I think in that case, that sort of, you know, once you've gotten the data in a good spot, someone else is going to take it, go make decisions with it, maybe go stick it back into their LLM or agent to improve on that system as a whole.

And I think that that's really, they have the same questions for us, you know, to, a large degree.

And I think, even forgetting about technical stakeholders for a moment, how do product teams and executive teams make decisions around data is still a pretty big question.

Something we try to solve for in our product with our own agent, we call the agent Alex, and Alex can kind of help you do analysis on your data.

Funnily enough, it's the only Alex at the company, so that's kind of surprising.

I feel like everybody has an Alex.

Yeah, but it's the only Alex.

I think that it's a really hard problem to get insights on data.

just think that that's like intuitively, like if you're looking at a single row, sure.

You can point out like what's the outlier, what's the discrepancy, what's something to go drill deeper on, but looking at aggregate statistics and data and there, I don't know.

There's like a very human workflow to it that I think we're still trying to figure out how to encode in an agent.

So we're trying to learn from our customers, like how they want to analyze data and look at it and do analysis and make decisions from it.

I think that's the biggest challenge by far for us and something that's very collaborative too, because ultimately we're trying to make that workflow better.

But I think that's going to be a huge unlock.

And I think it's important because that's how a lot of these systems ended up getting to production, like self-driving cars all the way through to ads.

You know, large-scale ads, recommender systems. They really relied on really smart people making good decisions and judgments on top of data to go figure out what to build next.

And I think that's really the only way that these systems end up getting to a state of similar adoption at scale.

Mic drop, that's awesome.

That's awesome.

One last question for you then, Aman.

Let me see, how should I ask this?

Having been at a company that does evaluations, observability, and also, I guess, in your own expertise in this field, do you see your world around you as everything is eval-based?

I have to be so careful how I answer this one.

I think you definitely see it more when you start thinking about the world in that way.

Okay, here's one change I have noticed myself realizing more, which is thinking about conversations as reasoning tokens.

It really annoys some people, but I think it really grokked for me, which is, when I'm talking to someone and I try to get an answer to something, I'm generating all these reasoning tokens to just get to the final, like, the TLDR, and everything above that is the reasoning tokens.

And then the other person's response to that is the eval.

So, and then vice versa, like, you know, I wouldn't want to think of myself as an eval.

But I do kind of view it like you can gauge, did that land or not, based on someone's facial expression.

That's the eval.

So like that's an example of like, like annoyingly a real world observation that has started to enter the way I kind of view things.

But the same thing happened when I was working in self-driving for what it's worth.

Like all of a sudden you become hyper-aware of the driving around you and good driving and bad driving, like very subtle signals, because you're just staring at driving data all day.

So I think there's just no way around just the fact that the more you look at data, the more you start noticing in the world around you to some degree.

Do you guys have a similar?

I feel like that might be like a personal question about any similar observations for you guys.

Yeah, I mean, definitely.

It's one of those things where I think if someone says there's a lot of red cars on the road and then you start to kind of see more red cars.

But what you just said is so interesting because it's another startup idea, which would be teaching robots, or LLMs, how to self-eval based on observing human reactions to the answers.

that's, that is a good startup idea.

Put the webcam on, take a screenshot, throw it into Claude Code.

That's a good little weekend hackathon project: tell whether or not I like your response based on my body language, and Claude Code updates like, you're disappointed in me.

I will do better next time.

man.

uh there you go.

Cool.

All right.

I think we're coming up on a, or we just passed an hour.

It's been a fun episode.

Thank you so much, Aman, for spending time with us.

Sean, do you want to close this out?

Yeah, thanks everyone for listening.

Yeah, I definitely learned a lot from this, Aman.

I really appreciate you coming on the show.

I'm sure all of our listeners do as well.

Anywhere our listeners can follow along, anything you want to leave them with that we didn't ask about?

I mean, I would just say, feel free to reach out if any of this resonated, and I would love to try to be helpful.

I try to keep my DMs open so you can find me on LinkedIn and you can see more of my writing on amank.ai.

So if any of that resonates, feel free to reach out.

I'd love to try to be helpful.

Thanks for having me on.

This was a really fun one.

I felt like we got to riff on a bunch of different topics.

I really enjoyed that.

Nice, yeah, I really enjoyed it as well.

Cool.

And to our followers, if you also like this episode, please like it with the like button and subscribe.

Thanks everyone.

Thanks.
