Run A/B Tests That Actually Work for AI Features with AI — Data Neighbor Live
By Data Neighbor Podcast
Summary
Topics Covered
- AI Features Leak Users from Treatment
- Confident AI Failures Mask as Success
- AI Noise Demands Larger Sample Sizes
- Predefine Segments to Avoid P-Hacking
- Users Evolve Queries Over Time
Full Transcript
So today, I'm Sean. Thanks for joining. I teach AI evaluations and product analytics, AI analytics, at Maven. Today we're going to talk about how to design experiments for AI features. I'd say a lot of the stuff is going to be exactly the same. There's some stuff you should do before you go into a live experiment, but for the most part, if you know how to run A/B tests (treatment, control, primary metric, ship decisions), if you've done this for traditional features, then you know maybe 80 to 85% of the playbook here. What I'm going to show you today is how a few specific decisions change slightly with AI features, but can really throw off the decision you make when you look at your results. So it's not the full playbook for A/B testing; I'm assuming folks know some of that. Some of it looks the same on the surface, but for AI features, beneath the surface, some things change around the distributions of the data you're working with.
Maybe just to get a vibe of the room, quick check in the chat. Drop a one if you've run A/B tests before, but not on AI features: you've run them, or your team's run them, or your company's run them. Two if you've shipped AI features but haven't really run A/B tests on them. And three if you're running A/B tests on AI features, and maybe some things look off sometimes. Three. Nice. Two. Two. One. Two.
Cool. All right. So we'll basically get to three today. Good, everyone's at least a one. I guess zero would be if you don't know what an A/B test is. Not going to get into that today, but there are many people who teach A/B testing, many great YouTube videos, and you can just ask an LLM how that works. Okay, let's jump into it.
We'll start with a potential use case. Say you have a team building an AI data analyst for their internal stakeholders, maybe for their product and engineering orgs. You can feed this AI analyst a natural language question, like "what does revenue look like today," and it outputs SQL, it outputs charts, and it has a narrative around the data it's outputting as well. It sits on top of your company's analytics database, so it has access to data around your users, events, revenue, sessions. Maybe there are 147 tables in this database. I'm giving you the details because they'll matter in a little bit.
Let's say this SQL agent has two steps in how it retrieves the answer when someone inputs a question. First, it finds the top 20 candidate tables from those 147 tables in your database that could answer the question, or that would need to be joined together to answer it. Second, it picks the best five of those to use as context. Those tables, plus some example queries, go into the OpenAI API for SQL generation. That result gets charted and narrated. So you can see there are multiple steps between the input of a question and the output to a user, multiple points where an AI is interacting with the data and making decisions, and each of those steps can leak users who were supposed to be exposed to the experience.
In this case, let's say we want to run a test. We have 1,500 daily active users in our product who use this internal tool. The tool already exists, but we want to make a V2: it's going to have a more sophisticated ranking of those top 20 tables it pulls from, and more schema context, more context around the data to pull from. And say we're back in the day on GPT-3.5, and it's going to use GPT-4o now.
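The two retrieval steps, and the silent fallback behavior that causes exposure leakage later in this talk, can be sketched roughly like this. The relevance scoring is a toy stand-in (a real system might use embeddings), and the `generate_sql`/`fallback_sql` callables are hypothetical placeholders for the model calls, not a real API:

```python
def relevance_score(question: str, table: str) -> int:
    # Toy relevance: how many question words appear in the table name.
    return sum(w in table.lower() for w in question.lower().split())

def rank_candidate_tables(question, tables, k=20):
    # Step 1: top-k candidate tables out of all 147.
    return sorted(tables, key=lambda t: relevance_score(question, t), reverse=True)[:k]

def pick_context_tables(question, candidates, k=5):
    # Step 2: the best five candidates go into the prompt as context.
    return rank_candidate_tables(question, candidates, k=k)

def answer(question, tables, generate_sql, fallback_sql):
    """Any failure in the V2 path silently falls back to V1: the user still
    counts as 'treatment' in the assignment logs, but never saw V2."""
    try:
        candidates = rank_candidate_tables(question, tables)
        context = pick_context_tables(question, candidates)
        return generate_sql(question, context), "v2"
    except Exception:
        return fallback_sql(question), "v1"
```

The point of the sketch is the `except` branch: that is where treatment-assigned users quietly end up with the control experience.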
So they run a standard A/B test on users. They split them randomly in half, run it for two weeks, and they have a primary metric around whether the SQL returned the right answer, or returned an answer at all. The results come back and they say: okay, we got a 3% increase, but our p-value (the p-value meaning, is that increase real or is it due to some randomness) is 0.15. You can imagine maybe this p-value is even at 0.5, something much higher. For this to be a real, statistically significant difference, it has to come in under the industry standard of 0.05, and it's way above that, so perhaps it's not significant. And it's expensive to run these more complicated models. It's expensive to have larger context windows, because you're hitting the API with more tokens. Maybe the queries take longer, too. So the team says: well, it didn't significantly increase whether it returns the right answer, and it's costing us a lot more money, so we're just going to shelve it.
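For intuition, the significance call they made can be reproduced with a plain two-proportion z-test. The success counts below are invented to roughly match the scenario (about 743 users per arm, a roughly 3-point lift):

```python
import math

def two_proportion_pvalue(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal, via erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# ~60.0% vs ~63.0% success with 743 users per arm: p lands well above 0.05.
p = two_proportion_pvalue(446, 743, 468, 743)
```

With these numbers p comes out around 0.2, so a team reading only this one number would call the result flat, exactly as in the story.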
And if you think about it, a lot of resources probably went into this. It could be months of engineering work, maybe. Who knows? And decisions get made about shelving something, about whether it's ever rolled out or seen by anyone else, based off one number. And that one number, especially in AI systems (in any system, really), may not be an accurate reflection of what's going on.
So in this case, when we look at the people who actually received the new experience: of those 743 people we thought were going to receive the V2, we find out that because of a bunch of leakage through all the different stages where the AI interacts with our query and our data, only about 23%, 167 people, actually received the experience. Everyone else got some sort of fallback to the V1 experience at different stages. And this is something that can happen without anyone really knowing about it if the correct guardrails aren't in place. This kind of leakage exists for any complex feature, not just AI: payment systems, search ranking, email delivery. It's not a new problem. What's new is the scale of people who are now building features with these kinds of multi-step opportunities for leakage. Putting AI in a product just isn't that hard anymore, and there's a lot of stuff going on under the hood. It's not as simple as when teams say, "Oh, I'm changing the button from red to blue," and we know 99% of the people in our treatment group are going to see that blue button. In this case, we can lose a bunch of people along the way.
So let me show you a potential scenario where this could happen. Say you have a funnel where different users can leak along the way; this is basically your pipeline from input to output. In this case we have 743 users assigned to V2, half of that original 1,500 population. Maybe 156 people never open the tool during that window, or maybe they switched to writing SQL directly instead. That's an exposure problem right there; now you're down to 587 users. Maybe 190 of those queries hit a cache. This will be true of many products, but especially data querying products: a lot of the query answers are stored in a cache so you're not hitting your data warehouse again. It makes it faster, it makes it cheaper. So maybe the system is actually serving stale V1 results, but nobody knows that. Maybe 129 people hit a timeout on one of the V2 retrieval steps, so it falls back to the faster V1 model, but doesn't inform them of that. Now you're down to 268. And then, very similarly, maybe there are some syntax errors or permission errors with the V2, and those cause a fallback to V1 as well.
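As a sanity check, the funnel can be tallied in a few lines. The 101 lost to syntax and permission errors is inferred here, to reconcile the 268 remaining users down to the 167 who were actually exposed:

```python
assigned = 743  # half of the 1,500 daily active users
losses = {
    "never opened the tool": 156,
    "served a stale cached V1 answer": 190,
    "V2 retrieval timeout, silent V1 fallback": 129,
    "V2 syntax/permission errors, V1 fallback": 101,  # inferred: 268 - 167
}

exposed = assigned - sum(losses.values())  # 167 users actually saw V2
exposure_rate = exposed / assigned         # roughly 22-23% of assignment
```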
So, has anyone ever shipped a feature, or an AI feature, where you later discovered most users weren't actually even getting the new experience? Yeah. The sample size gets pretty screwed up. This can happen in traditional stuff too, but now, when we say "let's create this agentic framework," there's so much that can happen along the way. Sample size is going to be the name of the game in this presentation; there will be even more things that affect it.
Say you account for all that and just look at those 167 people who got the new experience, instead of the whole ~750. You find the real effect for that small group of people who actually got the experience is four to five times larger, maybe plus 14% instead of 3%. And then you might conclude the feature works, and just look at that. But the catch, like JY said in the chat here, is that knowing the real effect is bigger doesn't make it significant. You have a much smaller sample size. Going into this, the team probably did a power calculation where they said, "Hey, we need a sample size of 760," and now they only got 167. So it tells you why the test came back flat: you have leakage. But it doesn't necessarily tell you that you have a statistically significant result yet. You have to actually go back and rerun this test.
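To see why 167 exposed users can't rescue the test, here's a rough power calculation (normal approximation; the 60% baseline success rate is assumed for illustration). When only a fraction of assigned users is exposed, the intent-to-treat effect shrinks by that fraction, and the required sample size blows up with the square of the dilution:

```python
import math

def n_per_arm(p_base, lift, z_alpha=1.96, z_power=0.84):
    """Approximate users per arm to detect an absolute lift in a
    proportion at alpha=0.05 (two-sided) with 80% power."""
    p2 = p_base + lift
    var = p_base * (1 - p_base) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * var / lift ** 2)

full_exposure_n = n_per_arm(0.60, 0.14)       # if everyone assigned saw V2
exposure = 167 / 743                          # the leaky funnel's exposure rate
diluted_n = n_per_arm(0.60, 0.14 * exposure)  # measured ITT effect shrinks to ~3 pts
```

With full exposure, a 14-point lift needs only a couple hundred users per arm; at ~22% exposure the measured lift is about 3 points and the requirement grows roughly by 1/exposure-squared, around twenty-fold.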
The good thing with this is that all of it can be fixed. You can fix the cache, you can handle timeouts; you can do whatever you can to push the fraction of people who are actually exposed higher and higher within that funnel. I'm just reading the chat here: "Ran some A/B tests and never got stat sig because the sample included people who never got exposed to the feature." Yeah. The story is totally different when you actually limit to people who were exposed. But if you don't account for all of that up front, you don't just get to filter down afterward and see what the result is. You pretty much have to rerun it all over again. So this is something that's pretty important early on with AI features, before you run an experiment: understand every single possible point of leakage, and either calculate your sample size lower in the pipeline, where you know leakage has stopped, or do whatever you can to fix the leakage earlier.
The other thing here is the metrics. The second problem: the team measured this metric, SQL success rate, but that could be defined many different ways, and in a case like this it's really easy to measure the wrong thing. This goes back to setting up proper AI eval metrics. A lot of that will happen offline, before you even get into a production experiment environment, but all of the eval metrics you create for offline evaluation have to continue through to online evaluation as well. You'll also want to evaluate user value on top of that. If you measure "SQL success rate" as "does the SQL compile and run without error," but users actually care about something else, like "does it return the right answer," then that metric isn't actually helpful. So in this case, say they had 200 reference questions with known correct answers maintained by the finance team, and when they checked V2 against those ground truth answers, 5% of them were incorrect. They had confidently wrong answers: the SQL executes, the chart renders, the narrative sounds authoritative, but the numbers are wrong by more than 10%, and the user can't really tell without going through and looking at the SQL themselves.
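A confidently-wrong check against a reference set can be sketched directly. The question names, reference numbers, and the 10% tolerance below are illustrative placeholders, not from any real system:

```python
def confidently_wrong(answer_value: float, reference: float, tol: float = 0.10) -> bool:
    """The SQL ran, the chart rendered, the narrative reads fine,
    but the number is off by more than the tolerance vs. reference."""
    return abs(answer_value - reference) > tol * abs(reference)

# Hypothetical reference answers maintained by the finance team.
reference_answers = {"q1_revenue": 1_200_000.0, "weekly_signups": 4_310.0}
v2_answers = {"q1_revenue": 1_190_000.0, "weekly_signups": 5_020.0}

flagged = [q for q, ref in reference_answers.items()
           if confidently_wrong(v2_answers[q], ref)]
```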
So this is a failure mode that's genuinely different about AI: it looks like success when it isn't. And you can end up in situations where this is out in the wild, and then two weeks later, after a full ramp of this new V2, a VP or someone presents some quarterly revenue to the board and it's totally the wrong number.
I don't know, has anyone run into this? I guess this is more of an analytics thing than an experimentation thing: someone had a metric that looked fine on the surface when something was totally wrong underneath, and it got reported out. This can happen in regular dashboards, or just in queries your analyst pulls for you. That same problem is going to happen with AI no matter what product you're building. So all those offline eval metrics have to be pulled through to your online experiment as well.
So I mean, for a traditional product, this team probably ran a pretty well-designed experiment. It may have been fine. They probably had a 50/50 split, a reasonable sample size, some metric, a two-week window, which is probably long enough for that. But for AI features there are changes within each part of these decisions, both in designing your experiment and then in reading the results later on. They didn't get this wrong because they were sloppy with experiment design. They got it wrong because they weren't checking off a few additional boxes when working with AI features. And a lot of that has to do with sample size, and also with manually reading the results, or having some way of scaling up manual reading of results. So right now I'll go through four decisions you want to watch out for: size it, measure it, guard it, and read the results. For each one I'll show you what they assumed, what's actually true, and what to do differently.
So, "size it." We talked about this a lot already. When you change a button color, everyone sees the same button. When you change an AI model, everyone gets a different output. As we said before, with everything leaking out of that funnel, and having to account for that, the fraction that gets the treatment shrinks, which means you'll need more and more users in your sample size if you don't fix those parts of your funnel. Step one: the effect gets diluted. If only a quarter of users get the treatment, your measured effect shrinks to about a quarter of the real effect. The issue is that detecting smaller effects doesn't just need proportionately more users; it needs disproportionately more. Think of it like trying to hear a quiet sound in a noisy room: if the sound is half as loud, you don't just need to listen twice as hard, you have to listen four times as hard. That's kind of how power works when you're getting your sample size. So if you don't fix those leakage effects, you just have to have far more people at the top of your funnel.
There are many ways the sample gets diluted. Leakage is one of them, but there are other multipliers too. One: the AI side is just noisier. When I say noisier, I mean there's much more variance in the experience the users are getting. When you create a red versus blue button test, everyone in treatment gets the blue button. But with an AI feature's output, I could ask the same query as someone else in treatment and we can get totally different responses, due to the probabilistic nature of the model. So you have much wider variance in the actual experience, which further dilutes the result you're measuring. The other part of this is that you have to segment everything. The AI feature could perform really well on one slice of your data, use cases, or users, and really poorly on another. An example might be: on really simple queries the V2 does great, but on complex ones it does terribly. Which means you now have to segment your data, and for each of those segments you have to have a large enough sample size to compare the treatment segment of simple queries to the control segment of simple queries. And all these things multiply together; they're not just additive. You can end up in situations where you need something like ten times as many users as you think you do.
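The multiplication is easy to underestimate. A toy version with made-up rates: if exposure halves the measured effect, required sample size quadruples, and analyzing a segment that is only 40% of traffic cuts the qualifying users again:

```python
base_n = 760           # per-arm n from the original power calc in the example
exposure_rate = 0.50   # fraction of assigned users who actually get V2
segment_share = 0.40   # share of traffic in the segment you care about

# Effect dilution scales required n by 1/exposure^2 (the measured effect
# shrinks), and a segment analysis scales it by 1/segment_share (fewer
# users qualify for the slice).
needed = base_n / exposure_rate ** 2 / segment_share  # ~7,600: 10x the plan
```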
"Measure it." When you test an AI feature, your primary question is "did the AI give the right answer," which isn't directly observable. We talked about this team using SQL success rate, but there are several levels they should have looked at: Is it syntactically valid? Did it execute? Did it return non-empty results? Did it get the actual correct answer versus some reference dataset we have? (Yes, I will; just saw the chat again. I'll share a recording of this either today or tomorrow, and it'll auto-email out to you.) So, four different metrics. Defining the metric precisely is step zero, and you're going to have a lot more success metrics and guardrail metrics going into this than you would in a normal experiment.
Once you define a metric and measure it, that measurement itself has noise. One of the hardest things here is that "returned a correct answer" check at the bottom. Say you use LLM-as-a-judge, a second model that grades the first, to decide whether your answer is correct. A lot of people use these because they scale; it's a lot easier to have an LLM judge a bunch of experiment results than a human. But say it's only right 80% of the time: you have some human annotators, and about 80% of the time the LLM judge agrees with them. That sounds really good at face value. But the problem is, imagine you're grading exams with a pen that randomly flips some answers, marking right ones wrong and wrong ones right, 20% of the time. Both the treatment group and the control group are getting graded with this same noisy pen, so you're not only getting noise in each group you're comparing; the mistakes really wash out the real difference between the groups. So now you have variance not only in the experience people are getting, but also variance, some untrustworthiness, in the actual metric itself.
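The noisy-pen effect is easy to simulate. The true success rates below are made up; the 80% judge accuracy is the number from the example. A judge that agrees with humans 80% of the time attenuates a true lift by a factor of (2 x 0.80 - 1) = 0.6, so a real 14-point difference measures as roughly 8.4 points:

```python
import random

random.seed(0)

def judge(is_correct: bool, accuracy: float = 0.80) -> bool:
    # LLM-as-a-judge that agrees with the true label 80% of the time.
    return is_correct if random.random() < accuracy else not is_correct

n = 5_000
control = [judge(random.random() < 0.60) for _ in range(n)]    # true rate 60%
treatment = [judge(random.random() < 0.74) for _ in range(n)]  # true rate 74%

# True lift is 14 points; the noisy judge shrinks what you measure
# toward (2 * accuracy - 1) * 14 = 8.4 points.
measured_lift = sum(treatment) / n - sum(control) / n
```

A smaller measured lift feeds straight back into the sizing problem: the attenuated effect needs a disproportionately larger sample to detect.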
So the fix here: my rule of thumb is to use automatic checks in code alongside LLM judges wherever possible. Those first three checks, syntactically valid, executes without error, not an empty result, are all things that can be checked in code. I've seen people have an LLM-as-a-judge run those things, because they're in the mindset of "the LLM judge can check if the SQL returns the correct answer, so let's also set it up to check if it executes without an error, or if it's syntactically valid." Stay away from that. Any metric you can write in code, with actual deterministic checks, is what you're going to want to form metrics around. You already have non-deterministic outputs you're dealing with, and you're going to have at least one non-deterministic metric from the LLM judge, so try to reduce the others as much as possible. And then the signal: if you see those metrics moving in different directions, that's a bit of a red flag that you should do a more manual look into what's going on.
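A sketch of that deterministic layer, using SQLite as a stand-in warehouse. One caveat of this toy version: SQLite resolves table names while planning, so a missing table fails the syntax rung too, whereas a real implementation might separate parsing from execution:

```python
import sqlite3

def deterministic_checks(sql: str, conn: sqlite3.Connection) -> dict:
    """The first three rungs of the metric ladder, checked in code
    with no LLM judge involved."""
    out = {"valid_syntax": False, "executes": False, "non_empty": False}
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)  # parse/plan without running
        out["valid_syntax"] = True
        rows = conn.execute(sql).fetchall()
        out["executes"] = True
        out["non_empty"] = len(rows) > 0
    except sqlite3.Error:
        pass
    return out
```

The fourth rung, "is the answer actually correct," is the one that still needs a reference set or a judge; everything above it stays deterministic.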
I think one of the cooler things I've seen when we've run experiments, though, is the pairing of offline and online evals. There are a couple of schools of thought right now. A lot of people say: skip straight to experiments, skip the offline stuff, because you won't know what impact the AI is having on your users unless you get it in front of users. And I definitely believe that a lot as well. But before going into that, setting up your offline eval metrics can really set you up for success in creating great online eval metrics, and it can also eventually help you predict what those online eval metrics will do without having to put a risky, untested product in front of users. What I mean by that: offline testing is what tells you a lot about quality, like "was the SQL correct," and offline testing is where a lot of the annotation happens. Experiments are the only place you can get user value metrics, stuff like task completion, or time to insight (which might be a good one for this data analyst example), or adoption. They're fundamentally different. You can have perfect SQL accuracy and still have terrible user outcomes, because the user asked the wrong question, or the chart was confusing, or the insight was buried in the narrative. One of the greatest strategic payoffs of running online experiments while also having a strong offline evaluation culture, which I don't see a lot of people doing yet, is that over many experiments you can start to see how those offline quality variables relate to real user metrics. You can effectively identify those relationships over time and build a sort of calibration: "Hey, when our offline correctness improves by five points, we can predict that if we get this in front of users, their task completion may increase by one to three points." It's a bit of a flywheel. It takes maybe 5, 10, 15 experiments, depending on how successful those experiments are at capturing the variance of user behavior. But once you have it, you can start to move much faster, because you have a lot more confidence in what your offline eval metrics will mean in an online environment. You still have to run the experiment eventually, but I think you can spend a little more time iterating offline, which is a lot faster: making changes and running your offline eval suite takes minutes or seconds, or a day, rather than putting something out in front of users and waiting two to four weeks for feedback.
"Guard it," the third one here. Traditionally there are guardrails around things like latency, how fast something is going, or the crash rate of a site. There are usually two or three guardrail metrics. With AI features, there are a lot more guardrail metrics going on. I just read in the chat: "How do I define task completion?" That's going to be use-case specific to whatever your product is. It's basically your in-product representation of what the user is trying to accomplish outside the product. If I was trying to buy something on Amazon, they'd probably have task completion be something like "Sean added something to his cart" or "Sean checked out his cart." I got onto the Amazon site, went around and looked at a bunch of stuff, then I put something in a cart, so I didn't just browse and leave, and then I checked out. Task completion is the outcome achieved, though I think you could have sub-outcomes toward a broader outcome. So it could be steps, but a step that means something with respect to user value. If I go on Amazon and I'm just sitting on some page looking at something, that's not really my task completed, even though I did get into a product. Even putting it in my cart probably isn't; task completion is probably checkout, if that makes sense.
So yeah, there are a lot more guardrails with AI features. Latency is definitely a big one. Cost per query: your cost per query can go up with how many tokens you're using, and also with the cost per token of a given model. There could be safety violations, PII violations, hallucination rate, confidently-wrong rate, data leakage rate; any of those could be a guardrail. So there are going to be several more metrics across quality, cost, safety, and operations. A lot of those will be identified when you're building your offline eval suite and cataloguing failure modes, but I think sometimes people overlook the cost and latency aspect of this. Basically, when you have more guardrails, there are just more chances of blocking the ability to ship a product confidently, since there's a tradeoff and a catch between each guardrail and your success metrics. Quality improvements can break other guardrails. You could improve the quality of your output, improve the correctness of your SQL, but that better retrieval means more API calls, which means higher cost, and a better model can mean more latency. You're testing a bundle, where improvements on one dimension cause regressions on another.
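One way to encode that "bundle" reading is an explicit ship gate: the success metric has to move, and no guardrail may breach its budget. The metric names and budget values here are illustrative:

```python
# Illustrative budgets; a real team would set these per product.
guardrail_budgets = {
    "p95_latency_s": 8.0,
    "cost_per_query_usd": 0.05,
    "confidently_wrong_rate": 0.02,
}

def ship_decision(success_lift: float, guardrails: dict) -> bool:
    """Ship only if the success metric improved AND no guardrail
    breached its budget."""
    breaches = [k for k, v in guardrails.items() if v > guardrail_budgets[k]]
    return success_lift > 0 and not breaches

# V2: quality is up 14 points, but p95 latency breached its budget.
v2_results = {"p95_latency_s": 9.3, "cost_per_query_usd": 0.04,
              "confidently_wrong_rate": 0.01}
decision = ship_decision(0.14, v2_results)
```

Writing the budgets down like this forces the tradeoff conversation before the experiment, instead of during the results review.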
And this happens to a degree in traditional software as well, but I think it's a way bigger impact with AI, especially since you have to spend a lot of time testing not only the AI feature you're building yourself, but which of the frontier models you're using, what version of Gemini or Sonnet or Opus or whatever you're leveraging, and mapping out those trade-offs between cost, latency, and quality. Speed, cost, and quality form a much broader spectrum with AI features, with a lot more opportunities for one to cancel out the others when you're experimenting. And a lot of this stuff, you're not necessarily going to see the true tradeoff in your offline evals, because your offline evals aren't necessarily capturing how the user is actually going to leverage the product. You could be testing on a ground truth suite offline: yeah, it's pretty fast on this, it's pretty cheap, the quality is high, let's ship it. And then a user could be using it in a way you had no idea they were going to. Maybe some huge context they input, or really, really long chats if it's a chat product, that just takes your costs through the roof.
And then one other note about guardrails: make sure you have actual measurable metrics around them. You'll probably get these when you're building your offline suite. If a team has a guardrail around not exceeding some percentage of confidently wrong answers, they now need to build some mechanism to flag when an answer is confidently wrong. Just spot-checking offline is okay; you don't have a big sample size there. But once you're in a production, online experimentation environment, you have to figure out some way to do it at scale, whether that's LLM-as-a-judge or some of those code checks I mentioned before. Something like a data analyst product is a bit easier, because you can do things like: hey, let's run these numbers a few different ways and see if they tie out, and flag it if the number it reported doesn't tie out. But it's going to be different for each product.
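The tie-out idea can be sketched as recomputing the reported number a few independent ways (say, daily sums vs. per-user sums vs. raw event sums) and flagging disagreement beyond a tolerance. The revenue figures and 1% tolerance below are hypothetical:

```python
def ties_out(reported: float, cross_checks: list, tol: float = 0.01) -> bool:
    """Check that independently recomputed values agree with the
    reported number within a relative tolerance."""
    return all(abs(reported - v) <= tol * abs(reported) for v in cross_checks)

# Hypothetical: the AI analyst reported ~1.18M revenue, but two
# independent aggregations of the warehouse both land near 1.20M.
needs_review = not ties_out(1_180_000, [1_201_500, 1_199_800])
```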
All right. And then decision four is actually spending more time reading the results and seeing what's going on in the output, beyond just the single-number average. Looking at the entire distribution of results and segmenting it is extremely important with AI features, because you can have a feature that rolls out and works for 95% of use cases but not for the other 5%, and that 5% can be really important. Maybe it's the most complex things people even use the AI for in the first place, and they don't care about the other 95%. Maybe it's things that really break trust if they're wrong. So if it's really good at, I don't know, some internal product query for a specific team, but really bad at finance queries whose numbers are going to be reported out to investors and the board, that's probably not going to ship, right? In that case, you could have an average that says this thing improved by 6%, but if it fails on a certain really important cohort of users or segment of use cases, the right decision isn't "ship it." It's "ship it for simple queries but hold it for complex queries," and having some definition around that. This does happen sometimes in traditional product A/B testing too, but a lot of the time people just roll things out for everyone. With AI features, I think there are a lot more targeted rollouts: okay, we're able to lock it in for this segment of use cases, we can roll out for that, but we still need to go back to the drawing board for the rest. Maybe it can only roll out for 30% of use cases and a lot of work still needs to be done on the other 70%. So I think there are a lot more targeted, subsequent, incremental rollouts, rather than shipping to everyone at one time after an A/B test. The real practical implication of this, though, is to decide what those segments are before you run the experiment.
If you instead run your experiment, slice and dice by every possible thing, see where it gets things wrong and where it doesn't, and then just roll out for what it got right, you're going to have that sample-size problem again. It's also an issue of p-hacking, which means that about 5% of the time your test will say you have a statistically significant result when it actually isn't one. So if you define those segments and use cases beforehand, you can set up your experiment in a way where you'll have a large enough sample size to address them.
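One way to make "define the segments beforehand" concrete is a sketch like the following. This is not the speaker's exact method: it assumes a conversion-style success metric, uses a normal-approximation two-proportion test, and splits alpha across the pre-registered segments (a Bonferroni correction) so that slicing doesn't inflate false positives. The segment names and counts are invented.

```python
# Sketch: pre-register segments BEFORE the experiment, then test each one
# at a Bonferroni-corrected alpha so post-hoc slicing can't manufacture a win.
import math

PREREGISTERED_SEGMENTS = ["simple_query", "complex_query", "finance_query"]

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided p-value for a difference in success rates (normal approx.)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return math.erfc(abs(z) / math.sqrt(2))

def segment_decisions(results, alpha=0.05):
    """results: {segment: (control_successes, control_n, treat_successes, treat_n)}.
    Only the segments declared up front are tested; alpha is split across them."""
    corrected = alpha / len(PREREGISTERED_SEGMENTS)
    decisions = {}
    for seg in PREREGISTERED_SEGMENTS:
        x1, n1, x2, n2 = results[seg]
        pval = two_prop_pvalue(x1, n1, x2, n2)
        lift = x2 / n2 - x1 / n1
        decisions[seg] = "ship" if pval < corrected and lift > 0 else "hold"
    return decisions

# Invented counts: treatment clearly helps simple queries, hurts complex
# ones slightly, and finance is an underpowered wash.
results = {
    "simple_query":  (400, 1000, 480, 1000),
    "complex_query": (300, 500, 290, 500),
    "finance_query": (100, 200, 105, 200),
}
print(segment_decisions(results))
# {'simple_query': 'ship', 'complex_query': 'hold', 'finance_query': 'hold'}
```

The per-segment "ship"/"hold" output mirrors the targeted-rollout decision described above, rather than one global ship call off the average.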
Yeah, I feel like it could get quite extensive, Andrew, in terms of a dashboard. This is probably one of those things that's less of a dashboard and more of someone doing a deep-dive analysis on really big experiments. But yeah, it could be pretty extensive for sure. And as you go through experiments over time, and also with the offline evals, you're not going to get all the right segments in the beginning. For some segments we choose, we'll realize we didn't even need to segment by that: it works pretty much all the time, these are more similar than we thought. Some segments we're going to miss. So over time there will probably be key segments that come out as more prioritized, that live in whatever experiment-results dashboard you always look at, and then some sub-segments you only look at when they're specific to a very specific change you're going to make, where you're like, "Oh, I think this might not work on, I don't know, this set of users."
Yeah, it's mainly custom to the feature at this point, right now. So at the company I work at, we don't have a big experimentation platform. We run a lot of this stuff in Jupyter notebooks, experiment by experiment.
Um, the other thing that I think is a little non-intuitive, but makes sense once you think about it, is running experiments for longer than you think. And it's probably not for the reason you think. So,
>> You've probably heard the industry is still figuring out how to experiment?
>> Yeah, I think the industry is definitely still trying to figure this out with AI evaluation.
Yeah, we do a lot of causal inference ahead of time, rather than experiments right away, for big changes. The product I work on is in the legal AI sector, and our users are prosumers: they're lawyers. They know their domain, and if we give them bad output from the AI, they're going to see it very clearly and we're going to lose trust with them. There are also a lot of privacy, security, and confidentiality implications around that product. So a lot of the testing and analysis we do is actually through causal inference, because we're timid about breaking that trust. We also do a lot of beta testing with a small group of hand-raisers who know they're in kind of an experimental mode. But I'd say for sure people are still figuring it out.
Um, okay, what was I talking about? Oh yeah, running experiments longer than you think. So you've probably heard of novelty effects. This is where a user gets the new treatment, explores the new experience of the product, tests its limits, and you see an engagement spike, but then it settles down once they're like, "Ah, whatever, this isn't that much different from the last thing." That's a very real thing that happens in most products: a big spike in your impact that then tapers off a bit. That's why folks usually want to run their experiments for a few weeks, at least two is the usual rule of thumb for a lot of people, depending on the seasonality of how people use your product. The other reason you want to run long experiments is to get a higher sample size: the longer you run it, the more observations you have. The new thing with AI, the reason I think you want to run them longer, is a deeper issue: users will actually change how they interact with the feature. In the case of an AI data analyst, they'll change what they're actually asking, the inputs, not just how often. So it's not just that we run it for two weeks, capture all this production data on the input types people ask, and say, hey, it worked great. After those two weeks, the inputs people put in could evolve into a totally different set of inputs that our AI analyst has never seen before, and we don't know how it's going to react until it sees those inputs, since it's all non-deterministic.
So say this v2 nails simple queries and users learn to trust it. What do they do next? They're going to start with the simple stuff, right? Then they're probably going to start asking harder and harder questions: things that require more joins, more complex logic, more edge cases, more analytical thinking. So maybe before, your mix was 80% simple queries and 20% difficult ones. Maybe that totally shifts to 50/50. So as the AI progresses out in the product, it's being evaluated on harder and harder problems, or just different problems than it was at launch. I think a two-week test catches the short-term effect, but misses how users will naturally adapt to your AI feature once it's rolled out, and how that feature is going to respond to those new adaptations.
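That 80/20-to-50/50 shift can be monitored with a simple drift check on the input mix. This is a sketch with made-up counts, not anything from the talk: it compares each week's simple/complex query counts against the launch-time mix using a chi-square statistic, where 3.84 is the alpha = 0.05 critical value at one degree of freedom.

```python
# Sketch: watch whether the input mix is drifting after launch, so you know
# the AI is now being evaluated on different problems than at launch.

def chi_square_stat(observed, expected):
    """Pearson goodness-of-fit statistic against expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def mix_has_drifted(week_counts, launch_mix, crit=3.84):
    """week_counts: [n_simple, n_complex] for the week.
    launch_mix: proportions at launch, e.g. [0.8, 0.2].
    crit: chi-square critical value (alpha=0.05, df=1)."""
    n = sum(week_counts)
    expected = [p * n for p in launch_mix]
    return chi_square_stat(week_counts, expected) > crit

# Week 1 matches the launch mix; by week 6 users ask far more complex queries.
print(mix_has_drifted([790, 210], [0.8, 0.2]))  # False
print(mix_has_drifted([520, 480], [0.8, 0.2]))  # True
```

A drift flag like this doesn't tell you quality dropped, only that the feature is now seeing a different question distribution, which is the cue to re-check your segment metrics.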
So I don't really have a good rule of thumb for how long to run an experiment. But I'm definitely in the camp of, man, we should probably have a 5% holdout group for a couple of months at all times with any change, just in case, and just to understand. It might be the first experiment you run: run it for a couple months, or have that holdout, and try to get a feel for when the change in your success metric, around AI quality or user success, tapers out. I think it's going to be different from product to product. Maybe there'll be some products where users behave the same with it no matter how long it's out. But at least for this case with a data analyst, I could easily see it. I'm playing with one right now, and I started with very simple questions, and after it got those right, I'm like, let's see how far I can go. I think humans are naturally like that. We want to offload as much of this work as possible.
Um, okay. We only have 10 minutes left and I want to answer some questions, so I'm going to skip this exercise, but I'll do the summary at least. So, the experiment design decisions that I think change slightly with AI. Size it with the real math: not everyone gets the treatment, the AI side is a lot noisier in terms of output, and you're going to need to segment to understand how it affects your users across many more parts of the distribution, because the average doesn't really work anymore. All of that multiplies, so your sample size is probably going to be larger. Measurement matters: define your metrics up front with offline evals. They're going to be unique to your use case; carry them through to your online evals. That's a bit different from traditional A/B testing, where you probably have the same user-value metric measured across many experiments; these are going to change feature to feature, or change to change within a feature. Build guardrails: there are going to be more of these, so try to account for them as much as possible. And then the fourth: check segments, and run longer. Averages are going to hide everything, so plan your segments before the experiment. And I think experiments are probably going to want to run for four weeks or longer, or at least have some holdout. I don't have a rule of thumb about that yet, but that's what I'm seeing, at least in terms of how users behave differently over time when rolling out AI products.
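As a rough illustration of the "size it with the real math" point, here is a back-of-the-envelope sketch with entirely invented numbers: start from the textbook per-arm sample size, then inflate it because only a fraction of users trigger the AI feature, the outcome is noisier, and you need power inside your smallest pre-defined segment.

```python
# Sketch: how partial exposure, AI noise, and segmentation multiply the
# traffic an AI-feature experiment needs. All parameters are illustrative.
import math

def per_arm_n(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Classic per-arm sample size to detect a mean shift `delta` with
    ~80% power at two-sided alpha=0.05, given outcome std `sigma`."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

base = per_arm_n(sigma=1.0, delta=0.1)   # a traditional feature
noisy = per_arm_n(sigma=1.8, delta=0.1)  # AI outputs: higher variance
exposure = 0.4                           # only 40% of users invoke the AI
smallest_segment = 0.15                  # smallest pre-defined segment share

# Users who never trigger the feature, plus needing power inside the
# smallest segment, both inflate the per-arm traffic requirement.
required = math.ceil(noisy / exposure / smallest_segment)
print(base, noisy, required)
```

With these made-up inputs the requirement grows by well over an order of magnitude versus the naive calculation, which is the practical reason the experiments need more traffic or more time.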
Um, if you want to learn more, I have a free email course, a 30-day email course. It's the bite-size version of our live, in-person course. You can check that out at bit.ly slash free eval. I'll actually put our website in the chat right now. If you go to AI analystlab.ai, most of these links will be in there. So if you want to learn over email but you're not looking to get into a full course, check out that 30-day email list. The full course is on aevel.ai. It's a six-week course; we cover the entire end-to-end pipeline. You can check that out at aieal.ai. The whole syllabus will be there. If you have questions, feel free to DM me on LinkedIn or email me at sea ael.ai and I can tell you more about it. And yeah, 25% off if you put in this promo code: e2.
That's it. We got a little less than 10 minutes. Anyone have questions?
Hey Sean, thanks for running through all of this. I appreciate how you're making me think another level deeper about running these tests. It's been really interesting.
>> Nice. Yeah, no problem.
>> I'm curious: if you do, like, a one-month delay in terms of running your split test before it goes out. Traditional split tests, that makes sense, you know, any time you run a split test there's some time element before you go general audience. But for some features, if you do a three-month delay before going out to everybody, you could also just lose product-market fit, which is something Lovable talks about: every three months they have to reinvent themselves, and they're the fastest-growing company in the world. So I'm just curious how you're thinking about this trade-off between waiting until it works for everybody and then releasing, like the traditional model, versus, if we don't get it out there, we could regress product-market fit in some use cases, or at least be losing potential adoption.
>> Yeah, it's a really good question, because like you said, the pace of everything is just changing so rapidly right now. Um,
I think it depends a lot on the product, and a lot on the users and their feelings about leveraging AI. For me, in my day job, like I mentioned, we're working with lawyers, and it depends on the lawyer, but with lawyers, banking, finance, some stuff in health, there's a lot more skepticism about AI because there's a lot more risk when things go wrong. If we screw up some medical record, or screw up some contract for a lawyer, that totally breaks trust with that customer and they're gone. So that probably weighs a little heavier than product-market fit for them. But if there's a user base of, you know, AI evangelists, like a product like Lovable where people love it and they're in there because they just want to see what this thing can do, then I think there's a lot more wiggle room. In that case, I don't have a scale of, based on this risk, run it for this long; based on that risk, run it for that long. My recommendation would be: if you're low risk, you can run it shorter, but maybe have some holdout group like I mentioned, just to understand whether there's some weird long-term quality regression with the users. Because if you do see that, your competitors are probably experiencing it too. So you could roll it out 50/50 for two weeks, then pull it back to 95/5 and see how that 5% runs for another two months. If there's no weird regression in the quality metrics, then yeah, you could just keep running two-week tests. But if there is, I don't know, my angle might be publicizing that: "Hey, our competitors are doing XYZ. How do they think about it after two months? We tried that, and we see that after two months, quality degrades as things hit some complexity." That might be one angle to take, but I honestly haven't thought about that as much for my day job, because for us, I think the trust piece is a lot more important.
>> Does that answer your question?
>> Kind of. I think I'm hearing you say: first you're looking at product-solution fit, because it actually has to do the job. Then, in a higher-risk industry, working with lawyers, health, etc., you need to confirm that it keeps working, and to the extent that it's working, the best case is you work with a small pilot group first before it goes out; that at least gives you tighter feedback loops. And then, for that extended window before you keep going wider, is something breaking what you're looking for? Like, how long do we want to wait to see if something breaks, and if nothing breaks, then we release it? Or is it, we identified all of these issues, and it's just how long it takes us to fix them? I'm sure there's a little bit of both, but
>> Yeah, I think it's kind of the former: having some small held-out group that hasn't been released the new thing, just to see, for your specific product, whether there's evidence that users change their behavior over time after the rollout, to the point where something will break that you need to account for later on. If there is, then next time you're developing a feature offline, you might be able to identify: hey, we know that last time we rolled out this feature, there were these longer-term negative impacts when people stress-tested it at the edges of its capability, so we need to stress-test the edges of the capability a lot more in the initial offline tests. That might be one way to protect against it. But early on, when you're experimenting, just trying to have some holdout group so you can make that measurement is pretty helpful. Otherwise, it's really hard to tease out after you've had something out for two months. If you see quality metrics going down, it's like, oh, is it because of this, or one of the other dozen things that got released in the past two months?
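The holdout comparison described here can be sketched as a simple weekly check (all scores below are made up): track the same quality metric for the released group and the never-exposed holdout, and flag a sustained widening gap. Because both groups see every other release, a gap that opens up points at user adaptation to this feature rather than the other dozen changes.

```python
# Sketch: long-running 5% holdout as a control for slow quality regressions.

def weekly_gap(released, holdout):
    """Per-week difference in a quality metric (released minus holdout)."""
    return [r - h for r, h in zip(released, holdout)]

def regression_detected(gaps, drop_threshold=-0.03, weeks=2):
    """Flag a sustained drop: the gap stays below `drop_threshold`
    for `weeks` consecutive weeks."""
    run = 0
    for g in gaps:
        run = run + 1 if g < drop_threshold else 0
        if run >= weeks:
            return True
    return False

# Made-up weekly quality scores: released users push harder queries over
# time, while the small never-exposed holdout stays on the old experience.
released = [0.82, 0.81, 0.78, 0.74, 0.71]
holdout  = [0.80, 0.80, 0.79, 0.80, 0.79]
print(regression_detected(weekly_gap(released, holdout)))  # True
```

A two-week test over just the first two entries would have looked fine; the sustained-drop rule only fires once users have adapted.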
Cool.
I appreciate it.
>> Yeah. Yeah, no problem.
What else? We got a couple more minutes. Greg, I like those guitars in the background.
Those are sweet.
>> Never fails to get a response.
>> Yeah, I bet that's why they're there, huh? Conversation starter.
>> Not really. It's a It's a side effect.
>> Nice.
Um, I see in the chat: "If you would do it all over again, how would you approach setting up experiment systems and frameworks from zero?" Yeah, something we set up that we didn't set up initially. Actually, this is kind of interesting. We ran a bunch of offline evals and were pretty confident in them, and then we were releasing a product to our users. But as we've talked about throughout this, going from offline to online can be a drastically different experience in terms of what happens with quality. So there was still a lot of fear from us: do we really want to release this to a bunch of people? Even when we beta tested it with some hand-raisers, it was hard to really get the vibe from a few people, especially people who are raising their hand to give feedback. Maybe they're very positive people, or maybe they're raising their hand because they want to give negative feedback or something. That's biased feedback that's hard to parse out. What we ended up doing was creating another thing we called shadow mode, or background mode. Basically, you serve whatever your current v1 is, and all the users see that on production data, and then in tandem, in real time, you run the v2 in the background, but you don't actually surface any of it to the user. You mimic whatever is happening in the product with the v1 model as if it were the v2, then send all of that data to your data warehouse, and you can basically run your experiment on that. It's not going to be as good; you're not going to get the user-value, task-success stuff from it, right, because users never actually see it, so it can't affect their task success. But it will get you a much more realistic view of all the online product metrics around quality, or some of those codebase checks. So what would I do if I started over: we didn't think about that until we were ready to roll it out, and then we were like, "Oh, okay, let's build this thing out," and that took us, you know, a few more weeks to build. That's something I'd think about building out early.
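A minimal sketch of that shadow-mode setup, with hypothetical stand-ins (`call_v1`, `call_v2`, `log_to_warehouse`) for the model calls and the warehouse sink: the user only ever sees v1's answer, while v2 runs on the same live input and both outputs get logged for offline comparison.

```python
def handle_query(user_query, call_v1, call_v2, log_to_warehouse):
    """Serve v1 to the user; run v2 on the same live input in shadow mode.

    In production the shadow call would run off the critical path
    (a queue or background worker) so it can't add user-facing latency."""
    v1_answer = call_v1(user_query)   # user-facing path, unchanged
    v2_answer = call_v2(user_query)   # shadow path: same input, never shown
    log_to_warehouse({"query": user_query, "v1": v1_answer, "v2": v2_answer})
    return v1_answer                  # only v1 ever reaches the user

# Toy stand-ins for the model calls and the warehouse sink.
logged = []
answer = handle_query(
    "revenue last quarter?",
    lambda q: "v1: $1.2M",
    lambda q: "v2: $1.2M (with breakdown)",
    logged.append,
)
print(answer)  # the user sees only the v1 answer
```

As noted in the talk, this gives you realistic online quality metrics on production inputs, but not user-value or task-success metrics, since the v2 output is never shown.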
Cool. All right, I gotta jump to another meeting, but yeah, no problem. I'll send out the recording and the deck tonight or tomorrow; I'll email it to everyone here, and then we'll put it on our YouTube and stuff, too. And yeah, feel free to connect with me on LinkedIn. Check out that website, AI analystlab.ai, if you want to learn more about some of the other free workshops and other stuff we're working on. We're working on a pretty cool open-source project right now, this AI analyst project in Claude Code. We're probably going to share that out next week for anyone else to play with. So, yeah, stay up to date with that: follow me on LinkedIn or something. But thanks for the questions, and thanks for hanging out for an hour. Talk to y'all later, hopefully.