Experiment Lab: Validate Your A/B Tests with AI
By Product Direction
Summary
Topics Covered
- AI Power Analysis Prevents Doomed Tests
- Check SRM Before Trusting Results
- Segments Reveal Simpson Paradox
- Mobile Wins Hide Desktop Losses
Full Transcript
How many times have you run an A/B test only to end up with an inconclusive result? Most tests fail before they even start because no one checked if there was enough traffic. Today, I will show you how AI tells you in minutes whether your test will work before you run it, what to do when you get the results, and how to go beyond the aggregate number so you don't make the wrong call.
All that with two live demos. I'm Nacho, and every week I share how to use AI in product management practice. Today we will talk about why running experiments in product should be straightforward: you have a hypothesis, you run an A/B test, you look at the results, and you decide. But if you have tried doing this with rigor, you know that's not how it goes. Usually, before running the test, you need to know whether it's even viable with the traffic you have in your product, how many users you will need, how many weeks it will take, and, if you don't have enough traffic, what your alternatives are. This is called power analysis, and traditionally a data scientist or a specialized calculator handled it. Most PMs simply didn't do it and ran tests that never had a chance of reaching significance.
Then, when you have the results, you check the dashboard, see that the variant won, roll it out, and move on, without verifying that the split held true, without segmenting, without asking whether the aggregate number is telling you the whole story or hiding something. Ron Kohavi, who led experimentation at Microsoft and Airbnb, documented that only 10 to 20% of A/B tests show a significant positive impact, roughly one in eight. So the post-test analysis needs to be deep, we need to go beyond the headline number, and the pre-test design needs to be rigorous as well, or we are burning weeks on tests that will go nowhere. What we need is a framework that covers the two critical moments of any experiment: the before and the after. And the good news is that with AI, you now have an experimentation expert available 24/7 that helps you with both.
So the first step, before running the test, is to ask AI to evaluate whether it makes sense. This is power analysis, and now you can do it in two minutes with a prompt. You give it the parameters: your metric, the current baseline, the effect you expect to detect, your weekly traffic. The AI will give you a viability table: how many users you need per variant, how many weeks it will take, and, if it doesn't work out, what your alternatives are. Switch to a proxy metric higher up in the funnel, make a more aggressive change, consider a rollout.
Think of it like checking whether you have enough fuel before a road trip. Without that calculation, you might end up stranded halfway through a test that was never going to reach significance.
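If you want to sanity-check that math yourself, the same calculation is a few lines of Python. This is a minimal sketch using statsmodels, with illustrative numbers borrowed from the demo later in this video (a 45% baseline, relative lifts of 5 to 15%, 120 signups per week); swap in your own metric, baseline, and traffic:

```python
# Minimal power-analysis sketch (illustrative numbers, not the gem's output).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.45          # current 30-day retention
weekly_traffic = 120     # signups per week, split across two variants
alpha, power = 0.05, 0.80

for relative_lift in (0.05, 0.10, 0.15):
    target = baseline * (1 + relative_lift)
    effect = proportion_effectsize(baseline, target)
    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
    )
    weeks = 2 * n_per_variant / weekly_traffic
    print(f"{relative_lift:.0%} lift: ~{n_per_variant:,.0f} users/variant, ~{weeks:,.0f} weeks")
```

The exact figures depend on your assumptions (one- versus two-sided test, equal split, the power you target), but the shape of the answer is the same as the viability table the gem produces: small lifts on low traffic mean months or years, not weeks.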
The second step, when you have results, is to not stop at the headline number. Ask AI for a deeper analysis. The first thing it checks is SRM, sample ratio mismatch: if you designed a 50/50 split and there is a large imbalance, something broke in the assignment and the results are unreliable. Most teams do not check this, and it can be a major cause of frustration or incorrect decisions.
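The check itself is cheap: a chi-square test comparing the observed counts per variant against the counts your design should have produced. A minimal sketch, assuming a 50/50 design and made-up counts:

```python
# Minimal SRM check: did the observed split deviate from the designed 50/50?
from scipy.stats import chisquare

n_control, n_treatment = 1012, 988          # assumed counts for illustration
total = n_control + n_treatment
stat, p_value = chisquare([n_control, n_treatment], f_exp=[total / 2, total / 2])

# A very small p-value (commonly < 0.001 for SRM) means the assignment is
# probably broken and the results should not be trusted.
print(f"chi-square p-value: {p_value:.4f}")
```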
Then comes the most important part, which is the segment analysis. AI analyzes each available variable separately, platform, user type, geography, whatever is useful for your product, looking for whether any segment contradicts the aggregate result. This is critical because sometimes the total says winner, but when we look at a specific segment it may be losing. It's called Simpson's paradox, and it's more common than you would think.
That leads to the third step, the most nuanced one. The right decision is often not yes or no; it's yes for this segment, in this context. If mobile wins and desktop loses, the answer isn't a general rollout, it's a partial rollout, and AI gives you a recommendation: go, no-go, or partial rollout, with a justification for each segment.
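The segment breakdown is just as easy to script if you want to verify what the AI reports. This is a minimal sketch, assuming a results table like the CSV shown later in the demo; the file name and the `group`, `platform`, and `retained_30d` column names are illustrative:

```python
# Minimal segment analysis: compare treatment vs. control per segment
# to surface results that contradict the aggregate (Simpson's paradox).
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("ab_results.csv")  # assumed columns: group, platform, retained_30d

def lift_and_p(frame: pd.DataFrame) -> pd.Series:
    counts = frame.groupby("group")["retained_30d"].agg(["sum", "count"])
    control, treatment = counts.loc["control"], counts.loc["treatment"]
    _, p = proportions_ztest([treatment["sum"], control["sum"]],
                             [treatment["count"], control["count"]])
    lift = treatment["sum"] / treatment["count"] - control["sum"] / control["count"]
    return pd.Series({"lift_pp": 100 * lift, "p_value": p})

print(lift_and_p(df))                            # aggregate result
print(df.groupby("platform").apply(lift_and_p))  # per-segment results
```

If the aggregate row says winner while one of the platform rows shows a significant negative lift, you are looking at exactly the Simpson's paradox situation the demo surfaces below.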
So let's see how this looks in practice. Okay, we will create in Gemini this gem, the experiment analyst. And what is this about? The intention of the experiment analyst is to have something that can run continuously, giving us feedback as input when we are preparing a test and as output once we have executed one, so we get a much more rigorous analysis of what's going on. The role is an experimentation statistician specializing in A/B testing for digital products, with two modes of operation: mode one is pre-test, when I describe an experiment, and mode two is post-test, when I share a CSV with experimental results.
Then come more detailed instructions for each mode. In mode one, pre-test, when I give it test parameters like the metric, the baseline, and the expected effect, it responds with a sample size table for different lifts, like 5% or 10%, with users per variant and the weeks needed according to the traffic I have. That also means it will tell me about viability: can I run this test? If not, what can I do about it? And the risks that may invalidate the test. It's a very simple prompt, but very useful to reuse again and again when I'm running different tests.
The second mode, as we said, is for when we are analyzing results, and here is a very important part: a few checks we expect PMs to do, or that PMs should do and maybe don't do regularly, because it's tempting to just look at the lift and ship it or not depending on the result. There are nuances we need to check: the sample ratio mismatch, whether people were actually distributed according to the design; the aggregate result and its statistical significance, which is the most obvious output; then a segmented analysis by each available variable in the data set, which we will see in a second; and Simpson's paradox detection, whether any segment contradicts the aggregate result. With all that, a go / no-go decision or a partial rollout depending on the result.
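The full prompt is in the article linked in the description, but based on what's described here, the gem's instructions boil down to something like this sketch (the wording is illustrative, not the exact prompt):

```text
Role: You are an experimentation statistician specializing in A/B testing
for digital products. You have two modes of operation.

Mode 1 (pre-test), when I describe an experiment:
- Given the metric, baseline, expected effect, and weekly traffic, return a
  sample size table for different lifts (e.g. 5%, 10%, 15%): users per
  variant and weeks needed at my traffic.
- Give a viability verdict: can I run this test? If not, what are my
  alternatives (proxy metric higher in the funnel, more aggressive change)?
- List the risks that could invalidate the test.

Mode 2 (post-test), when I share a CSV of results:
- Check for sample ratio mismatch against the designed split.
- Report the aggregate result and its statistical significance.
- Run a segmented analysis for every available variable in the data set.
- Flag Simpson's paradox: any segment that contradicts the aggregate.
- End with a go / no-go / partial rollout recommendation, justified per segment.
```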
As you can see, this is a very simple gem with very simple instructions, but you will see how powerful it is. When I'm running tests day in, day out, I can use this without having to think again about the instructions I need to write to analyze each test. So let's see this in practice. We will first try mode one.
What I'm trying to do is: I want to run an A/B test, help me evaluate whether it's viable. I am working on Product Growth OS, professional development for PMs, the product I'm building. I have a new onboarding with an app download prompt. This is coming from a past demo; if you haven't seen that video, refer back to it, but basically we detected that users who download the application have a much higher retention rate than those who don't. So we want to compare that new prompt during onboarding with the current onboarding that doesn't have it. The primary metric is 30-day retention, we have a baseline of 45% retention, and we estimate a net lift of between 10 and 15% based on observational correlations. Mobile users return at 33% versus 32% for desktop. Traffic is 120 signups per week and we want to split it 50/50. So let's see the analysis.
As you can see here, it gave me a sample size and duration table to detect these lifts with a standard statistical power of 80% and a significance level of 5%, which are the typical A/B testing standards. To detect a lift of 5%, so going from 45% to about 47%, you need this many users per variant, which comes to roughly 30,000 users, which at my traffic would be 256 weeks, multiple years to get this result. If I wanted to detect 10%, I would need 73 weeks, so over a year. If I wanted to detect 15%, it would still be around half a year. So the viability assessment says: this is not viable. It will take over a year to detect a 10% lift, and even if you believe in 15%, which is a very aggressive estimate, you will need 28 weeks, which is too long for the startup cycle the company is currently in.
So what can we do in terms of strategies? We can move up the funnel: instead of using 30-day retention, use an event higher in the funnel, for example whether they download the app. Those events have higher conversion rates, so we need a smaller sample size. The same goes for a more aggressive test: if you believe the app is the silver bullet, instead of prompting the user, make the onboarding mobile-first, so we have a much larger expected effect and need a smaller sample size to detect it. Then there are other options, like Bayesian testing; I don't want to go into all the recommendations, but you can see it already gives me a very strong verdict and what I can do about it. The last part is the risks to validity, how valid the result will be.
For example, we can have selection bias: maybe the app download correlation is a fallacy. Maybe the users who are more active are the ones who have the application, because the most active users, the power users, are the ones who download it. So if I incentivize other users to download the application, it will have no effect, because I was biased by only analyzing these power users.
Also, we have the novelty effect: users may download the app because it's new, without actually engaging more over 30 days. There are other risks too; I don't want to go through all of them, but you can see it gives a very strong analysis from a very simple gem and a small prompt describing what I'm trying to do. Let's now move on to analyzing results. I want to share the CSV that we have with the results.
As you can see here, it's a very simple CSV: user ID, group, platform, user type. The group is treatment or control, the platform is mobile or desktop, and the user type is new or returning. Then we have whether they converted, the 7-day retention, and the 30-day retention. This is a very simple result set; we could have more variables, more information about the user, for example geography, plan type, or other parameters, whatever you want to use. But let's actually run the analysis. I attach the CSV and just tell it: here are the results of my A/B test.
Analyze them in mode two. It's the same product; the test is a new onboarding with notifications and an app download prompt, the duration is four weeks, and my primary metric is retention at 30 days. Let's see what it tells me. What fascinates me is that it writes code to extract information and do the calculations. That's how cheap coding is today: it creates code just for this session, and deletes it afterwards, just to analyze my data. All right, now I have the results.
The post-test evaluation: first of all, the sample ratio mismatch check passed, because control has a bit over 1,000 users and treatment is also close to 1,000; there is no significant deviation from the 50/50 split. But you can see I already have a critical warning: among desktop users, control had more users than treatment, and for mobile it's the opposite, control has fewer users than treatment. So even though the overall split is fine, when I look at this level of detail the imbalance is statistically significant. The treatment group is heavily overindexed on mobile users.
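This is worth checking yourself too, because the overall split can pass SRM while the composition inside a segment is skewed. A minimal sketch, assuming the same illustrative CSV and column names as before, is a chi-square test on the group-by-platform contingency table:

```python
# Did treatment and control end up with the same platform mix?
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("ab_results.csv")  # assumed columns: group, platform
table = pd.crosstab(df["group"], df["platform"])
chi2, p_value, _, _ = chi2_contingency(table)

print(table)
# A small p-value here means one variant is over-indexed on a platform,
# so the aggregate lift is confounded even if the overall 50/50 split held.
print(f"group x platform chi-square p-value: {p_value:.4f}")
```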
With that, what is the result? Retention went from almost 45% to almost 49%, a lift of about 9%, which is not statistically significant at 95% confidence. So on the surface this looks like a directional win that just misses significance. But that's misleading, because the lift is partly artificial: the treatment group contains more mobile users, who naturally retain better. This is a very interesting analysis just from handing it the CSV, much more powerful than what some of the out-of-the-box tools will give you.
And if we go to the segment analysis, the true story emerges when we break it down by platform. For desktop users we have a negative lift, and it is statistically significant: a real effect, and a negative one. For mobile users we have the opposite, a positive effect which is statistically significant. So while the aggregate 9% is not significant, if we isolate the users on each platform, the tests are significant and are telling opposite stories. If we roll this out for desktop, the retention rate will drop. If we roll this out for mobile, the retention rate will grow.
So here it says: we detected Simpson's paradox, because the aggregate shows a 9% lift but we have, for example, an 18% negative effect on desktop. The recommendation is no-go for a full rollout; the action plan should be a partial rollout, mobile only, because this feature is a massive win for mobile. And if you think about it, it almost makes sense: if I'm on desktop, asking me to download an app is friction, and I won't do it because it means switching devices. Instead, if I'm on mobile browsing the web and you give me a click to install, that may be a high-conversion action.
Okay, as you saw, in less than a minute we discovered that what looked like a positive result was actually hiding two opposite stories by platform. If we had done a general rollout, we would have hurt retention by seven points for the 43% of users who are on desktop. This kind of segmented analysis is exactly what AI makes effortless.
If you want to go deeper and practice this with real data, that's exactly what we do in Product Direction every week: live coaching with me, 10 modules of expert content, an AI module, and a community of product managers helping each other out. The link is in the description.
And if you're finding this useful, also hit subscribe for more content like this one. But let's recap what we have covered. Before running a test, ask AI to evaluate whether it's viable with your traffic: a power analysis in two minutes, no more tests doomed to fail from the design stage. After the test, don't stop at the aggregate number: ask for a segment analysis, SRM verification, and Simpson's paradox detection. The right decision is often not yes or no; it's yes for this segment, in this context. I wrote an article with the full experiment analyst prompt, the before-and-after framework in detail, and the classic traps AI helps you avoid. Link in the description.
And if you want my help applying this to your product, check out the Product Direction community; that's where we do this. Every week we meet live, you bring your real challenges, and I give you direct feedback. Also, there is a 14-day guarantee, so if it's not for you, no problem.
And if you subscribe, in the next video we are making a different kind of leap: we go from validating individual hypotheses to building a system that scales your judgment, your own AI coach. So stay tuned and see you in the next one.