Experiment Lab: Validate Your A/B Tests with AI
By Product Direction
Summary
Topics Covered
- AI Power Analysis Prevents Doomed Tests
- Check SRM Before Trusting Results
- Segments Reveal Simpson Paradox
- Mobile Wins Hide Desktop Losses
Full Transcript
How many times have you run an A/B test only to end up with an inconclusive result? Most tests fail before they even start because no one checked if there was enough traffic. Today, I will show you how AI tells you in minutes whether your test will work before you run it, what to do when you get the results, and how to go beyond the aggregate number so you don't make the wrong call.
All that with two live demos. I'm Nacho, and every week I share how to use AI in product management practice. Today we will talk about why running experiments in product should be straightforward: you have a hypothesis, you run an A/B test, you look at the results, and you decide. But if you have tried doing this with rigor, you know that's not how it goes. Usually, before running the test, you need to know whether it's even viable with the traffic you have in your product, how many users you will need, how many weeks it will take, and, if you don't have enough traffic, what your alternatives are. This is called power analysis, and traditionally a data scientist or a specialized calculator handled it. Most PMs simply didn't do it and ran tests that never had a chance of reaching significance.
Then, when you have the results, you check the dashboard, see that the variant won, roll it out, and move on, without verifying that the split held true, without segmenting, without asking whether the aggregate number is telling you the whole story or hiding something. Ron Kohavi, who led experimentation at Microsoft and Airbnb, documented that only 10 to 20% of A/B tests show a significant positive impact, roughly one in eight. So the post-test analysis needs to be deep, we need to go beyond the headline number, and the pre-test design needs to be rigorous as well, or we are burning weeks on tests that will go nowhere. What we need is a framework that covers the two critical moments of any experiment: the before and the after. And the good news is that with AI, you now have an experimentation expert available 24/7 that helps you with both.
So the first step, before running the test, is to ask AI to evaluate whether it makes sense. This is power analysis, and now you can do it in two minutes with a prompt. You give it the parameters: your metric, the current baseline, the effect you expect to detect, your weekly traffic. The AI will give you a viability table: how many users you need per variant, how many weeks it will take, and, if it doesn't work out, what your alternatives are. Switch to a proxy metric higher up in the funnel, make a more aggressive change, consider a rollout.
Think of it like checking whether you have enough fuel before a road trip. Without that calculation, you might end up stranded halfway through a test that was never going to reach significance.
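If you want to sanity-check that math yourself, the same calculation is a few lines of Python. This is a minimal sketch using statsmodels, with illustrative numbers borrowed from the demo later in this video (a 45% baseline, relative lifts of 5 to 15%, 120 signups per week); swap in your own metric, baseline, and traffic:

```python
# Minimal power-analysis sketch (illustrative numbers, not the gem's output).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.45          # current 30-day retention
weekly_traffic = 120     # signups per week, split across two variants
alpha, power = 0.05, 0.80

for relative_lift in (0.05, 0.10, 0.15):
    target = baseline * (1 + relative_lift)
    effect = proportion_effectsize(baseline, target)
    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
    )
    weeks = 2 * n_per_variant / weekly_traffic
    print(f"{relative_lift:.0%} lift: ~{n_per_variant:,.0f} users/variant, ~{weeks:,.0f} weeks")
```

The exact figures depend on your assumptions (one- versus two-sided test, equal split, the power you target), but the shape of the answer is the same as the viability table the gem produces: small lifts on low traffic mean months or years, not weeks.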
The second step, when you have results, is to not stop at the headline number. Ask AI for a deeper analysis. The first thing it checks is SRM, sample ratio mismatch: if you designed a 50/50 split and there is a large imbalance, something broke in the assignment and the results are unreliable. Most teams do not check this, and it can be a major cause of frustration or incorrect decisions.
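The check itself is cheap: a chi-square test comparing the observed counts per variant against the counts your design should have produced. A minimal sketch, assuming a 50/50 design and made-up counts:

```python
# Minimal SRM check: did the observed split deviate from the designed 50/50?
from scipy.stats import chisquare

n_control, n_treatment = 1012, 988          # assumed counts for illustration
total = n_control + n_treatment
stat, p_value = chisquare([n_control, n_treatment], f_exp=[total / 2, total / 2])

# A very small p-value (commonly < 0.001 for SRM) means the assignment is
# probably broken and the results should not be trusted.
print(f"chi-square p-value: {p_value:.4f}")
```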
Then comes the most important part, which is the segment analysis. AI analyzes each available variable separately, platform, user type, geography, whatever is useful for your product, looking for whether any segment contradicts the aggregate result. This is critical because sometimes the total says winner, but when we look at a specific segment it may be losing. It's called Simpson's paradox, and it's more common than you would think.
That leads to the third step, the most nuanced one. The right decision is often not yes or no; it's yes for this segment, in this context. If mobile wins and desktop loses, the answer isn't a general rollout, it's a partial rollout, and AI gives you a recommendation: go, no-go, or partial rollout, with a justification for each segment.
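The segment breakdown is just as easy to script if you want to verify what the AI reports. This is a minimal sketch, assuming a results table like the CSV shown later in the demo; the file name and the `group`, `platform`, and `retained_30d` column names are illustrative:

```python
# Minimal segment analysis: compare treatment vs. control per segment
# to surface results that contradict the aggregate (Simpson's paradox).
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("ab_results.csv")  # assumed columns: group, platform, retained_30d

def lift_and_p(frame: pd.DataFrame) -> pd.Series:
    counts = frame.groupby("group")["retained_30d"].agg(["sum", "count"])
    control, treatment = counts.loc["control"], counts.loc["treatment"]
    _, p = proportions_ztest([treatment["sum"], control["sum"]],
                             [treatment["count"], control["count"]])
    lift = treatment["sum"] / treatment["count"] - control["sum"] / control["count"]
    return pd.Series({"lift_pp": 100 * lift, "p_value": p})

print(lift_and_p(df))                            # aggregate result
print(df.groupby("platform").apply(lift_and_p))  # per-segment results
```

If the aggregate row says winner while one of the platform rows shows a significant negative lift, you are looking at exactly the Simpson's paradox situation the demo surfaces below.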
So let's see how this looks in practice. Okay, we will create in Gemini this gem, the experiment analyst. And what is this about? The intention of the experiment analyst is to have something that can run continuously, giving us feedback as input when we are preparing a test and as output once we have executed one, so we get a much more rigorous analysis of what's going on. The role is an experimentation statistician specializing in A/B testing for digital products, with two modes of operation: mode one is pre-test, when I describe an experiment, and mode two is post-test, when I share a CSV with experimental results.
Then come more detailed instructions for each mode. In mode one, pre-test, when I give it test parameters like the metric, the baseline, and the expected effect, it responds with a sample size table for different lifts, like 5% or 10%, with users per variant and the weeks needed according to the traffic I have. That also means it will tell me about viability: can I run this test? If not, what can I do about it? And the risks that may invalidate the test. It's a very simple prompt, but very useful to reuse again and again when I'm running different tests.
The second mode, as we said, is for when we are analyzing results, and here is a very important part: a few checks we expect PMs to do, or that PMs should do and maybe don't do regularly, because it's tempting to just look at the lift and ship it or not depending on the result. There are nuances we need to check: the sample ratio mismatch, whether people were actually distributed according to the design; the aggregate result and its statistical significance, which is the most obvious output; then a segmented analysis by each available variable in the data set, which we will see in a second; and Simpson's paradox detection, whether any segment contradicts the aggregate result. With all that, a go / no-go decision or a partial rollout depending on the result.
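The full prompt is in the article linked in the description, but based on what's described here, the gem's instructions boil down to something like this sketch (the wording is illustrative, not the exact prompt):

```text
Role: You are an experimentation statistician specializing in A/B testing
for digital products. You have two modes of operation.

Mode 1 (pre-test), when I describe an experiment:
- Given the metric, baseline, expected effect, and weekly traffic, return a
  sample size table for different lifts (e.g. 5%, 10%, 15%): users per
  variant and weeks needed at my traffic.
- Give a viability verdict: can I run this test? If not, what are my
  alternatives (proxy metric higher in the funnel, more aggressive change)?
- List the risks that could invalidate the test.

Mode 2 (post-test), when I share a CSV of results:
- Check for sample ratio mismatch against the designed split.
- Report the aggregate result and its statistical significance.
- Run a segmented analysis for every available variable in the data set.
- Flag Simpson's paradox: any segment that contradicts the aggregate.
- End with a go / no-go / partial rollout recommendation, justified per segment.
```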
As you can see, this is a very simple gem with very simple instructions, but you will see how powerful it is. When I'm running tests day in, day out, I can use this without having to think again about the instructions I need to write to analyze each test. So let's see this in practice. We will first try mode one.
What I'm trying to do is: I want to run an A/B test, help me evaluate whether it's viable. I am working on Product Growth OS, professional development for PMs, the product I'm building. I have a new onboarding with an app download prompt. This is coming from a past demo; if you haven't seen that video, refer back to it, but basically we detected that users who download the application have a much higher retention rate than those who don't. So we want to compare that new prompt during onboarding with the current onboarding that doesn't have it. The primary metric is 30-day retention, we have a baseline of 45% retention, and we estimate a net lift of between 10 and 15% based on observational correlations. Mobile users return at 33% versus 32% for desktop. Traffic is 120 signups per week and we want to split it 50/50. So let's see the analysis.
As you can see here, it gave me a sample size and duration table to detect these lifts with a standard statistical power of 80% and a significance level of 5%, which are the typical A/B testing standards. To detect a lift of 5%, so going from 45% to about 47%, you need this many users per variant, which comes to roughly 30,000 users, which at my traffic would be 256 weeks, multiple years to get this result. If I wanted to detect 10%, I would need 73 weeks, so over a year. If I wanted to detect 15%, it would still be around half a year. So the viability assessment says: this is not viable. It will take over a year to detect a 10% lift, and even if you believe in 15%, which is a very aggressive estimate, you will need 28 weeks, which is too long for the startup cycle the company is currently in.
So what can we do in terms of strategies? We can move up the funnel: instead of using 30-day retention, use an event higher in the funnel, for example whether they download the app. Those events have higher conversion rates, so we need a smaller sample size. The same goes for a more aggressive test: if you believe the app is the silver bullet, instead of prompting the user, make the onboarding mobile-first, so we have a much larger expected effect and need a smaller sample size to detect it. Then there are other options, like Bayesian testing; I don't want to go into all the recommendations, but you can see it already gives me a very strong verdict and what I can do about it. The last part is the risks to validity, how valid the result will be.
For example, we can have selection bias: maybe the app download correlation is a fallacy. Maybe the users who are more active are the ones who have the application, because the most active users, the power users, are the ones who download it. So if I incentivize other users to download the application, it will have no effect, because I was biased by only analyzing these power users.
Also, we have the novelty effect: users may download the app because it's new, without actually engaging more over 30 days. There are other risks too; I don't want to go through all of them, but you can see it gives a very strong analysis from a very simple gem and a small prompt describing what I'm trying to do. Let's now move on to analyzing results. I want to share the CSV that we have with the results.
As you can see here, it's a very simple CSV: user ID, group, platform, user type. The group is treatment or control, the platform is mobile or desktop, and the user type is new or returning. Then we have whether they converted, the 7-day retention, and the 30-day retention. This is a very simple result set; we could have more variables, more information about the user, for example geography, plan type, or other parameters, whatever you want to use. But let's actually run the analysis. I attach the CSV and just tell it: here are the results of my A/B test.
Analyze them in mode two. It's the same product; the test is a new onboarding with notifications and an app download prompt, the duration is four weeks, and my primary metric is retention at 30 days. Let's see what it tells me. What fascinates me is that it writes code to extract information and do the calculations. That's how cheap coding is today: it creates code just for this session, and deletes it afterwards, just to analyze my data. All right, now I have the results.
The post-test evaluation: first of all, the sample ratio mismatch check passed, because control has a bit over 1,000 users and treatment is also close to 1,000; there is no significant deviation from the 50/50 split. But you can see I already have a critical warning: among desktop users, control had more users than treatment, and for mobile it's the opposite, control has fewer users than treatment. So even though the overall split is fine, when I look at this level of detail the imbalance is statistically significant. The treatment group is heavily overindexed on mobile users.
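This is worth checking yourself too, because the overall split can pass SRM while the composition inside a segment is skewed. A minimal sketch, assuming the same illustrative CSV and column names as before, is a chi-square test on the group-by-platform contingency table:

```python
# Did treatment and control end up with the same platform mix?
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("ab_results.csv")  # assumed columns: group, platform
table = pd.crosstab(df["group"], df["platform"])
chi2, p_value, _, _ = chi2_contingency(table)

print(table)
# A small p-value here means one variant is over-indexed on a platform,
# so the aggregate lift is confounded even if the overall 50/50 split held.
print(f"group x platform chi-square p-value: {p_value:.4f}")
```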
With that, what is the result? Retention went from almost 45% to almost 49%, a lift of about 9%, which is not statistically significant at 95% confidence. So on the surface this looks like a directional win that just misses significance. But that's misleading, because the lift is partly artificial: the treatment group contains more mobile users, who naturally retain better. This is a very interesting analysis just from handing it the CSV, much more powerful than what some of the out-of-the-box tools will give you.
And if we go to the segment analysis, the true story emerges when we break it down by platform. For desktop users we have a negative lift, and it is statistically significant: a real effect, and a negative one. For mobile users we have the opposite, a positive effect which is statistically significant. So while the aggregate 9% is not significant, if we isolate the users on each platform, the tests are significant and are telling opposite stories. If we roll this out for desktop, the retention rate will drop. If we roll this out for mobile, the retention rate will grow.
So here it says: we detected Simpson's paradox, because the aggregate shows a 9% lift but we have, for example, an 18% negative effect on desktop. The recommendation is no-go for a full rollout; the action plan should be a partial rollout, mobile only, because this feature is a massive win for mobile. And if you think about it, it almost makes sense: if I'm on desktop, asking me to download an app is friction, and I won't do it because it means switching devices. Instead, if I'm on mobile browsing the web and you give me a click to install, that may be a high-conversion action.
Okay, as you saw, in less than a minute we discovered that what looked like a positive result was actually hiding two opposite stories by platform. If we had done a general rollout, we would have hurt retention by seven points for the 43% of users who are on desktop. This kind of segmented analysis is exactly what AI makes effortless.
If you want to go deeper and practice this with real data, that's exactly what we do in Product Direction every week: live coaching with me, 10 modules of expert content, an AI module, and a community of product managers helping each other out. The link is in the description.
And if you're finding this useful, also hit subscribe for more content like this one. But let's recap what we have covered. Before running a test, ask AI to evaluate whether it's viable with your traffic: a power analysis in two minutes, no more tests doomed to fail from the design stage. After the test, don't stop at the aggregate number: ask for a segment analysis, SRM verification, and Simpson's paradox detection. The right decision is often not yes or no; it's yes for this segment, in this context. I wrote an article with the full experiment analyst prompt, the before-and-after framework in detail, and the classic traps AI helps you avoid. Link in the description.
And if you want my help applying this to your product, check out the Product Direction community; that's where we do this. Every week we meet live, you bring your real challenges, and I give you direct feedback. Also, there is a 14-day guarantee, so if it's not for you, no problem.
And if you subscribe, in the next video we are making a different kind of leap: we go from validating individual hypotheses to building a system that scales your judgment, your own AI coach. So stay tuned and see you in the next one.