A/B Testing in Data Science Interviews by a Google Data Scientist | DataInterview
By DataInterview
Summary
Topics Covered
- Always Start with Business Context
- Success Metrics Need Four Qualities
- Never Peek at P-Values Mid-Experiment
- Launch Decisions Beyond Statistical Significance
Full Transcript
If you are preparing for a data science interview, A/B testing is a must-know concept. Whether it's for Google, Meta, Uber, and so forth, A/B testing is a very popular topic in interview questions, because data scientists in those companies use A/B tests to figure out whether a change they observe on those platforms is due to random chance or due to the actual change that they implemented. So what we are going to do in this video is walk through an A/B test based on a real-life example, and along the way I'll pepper in some hints that can be helpful for acing interview questions on A/B testing. Hey everyone, I'm Dan, the founder of datainterview.com and an ex-Google and ex-PayPal data scientist. In this video we're going to do a deep dive on A/B testing based on a real-life example. I'll walk through the procedure of setting up an A/B test, and I'll talk about a couple of things that you should definitely mention whenever you're walking through an A/B testing case. Now let's get started. The first thing to realize is that when you are walking through the A/B testing procedure,
there are essentially seven steps to consider. The first step is understanding the problem statement: this is where you make sense of the case problem by asking clarifying questions to the interviewer, and figure out what the success metric and the user journey are; we'll do a deep dive on this topic in a second. The second step is defining your hypothesis test: you state your null and alternative hypotheses, and you set parameter values for your experiment such as the significance level and statistical power. The third step is designing the experiment itself: this is where you talk about the randomization unit, which user type you're going to target for the experiment, and various other things you need to consider in the design. The next step is running the experiment itself, where you think about the instrumentation required to collect the data and analyze the results. Once you've collected the data, the next thing to do, even before you interpret the results and decide whether to launch, is sanity checks, or validity checks, because if your experiment design was flawed, or some bias crept into the data collection itself, then you have flawed results and you might end up making a poor decision. So running sanity checks before you even think about interpretation and the launch decision is crucial. Once you have done the sanity checks, the next step is to interpret the results in terms of the lift you saw, the p-value, and the confidence interval. And lastly, now that you have the statistical results along with the business context, you make a decision about whether you're going to launch the change or not.
Now, with all of the steps covered, let's do a deep dive on an actual case problem. Suppose that an online clothing store called Fashion Web Store wants to test a new ranking algorithm to provide products more relevant to customers. How would you design an experiment? To tackle this problem, we first want to understand the nature of the product. We know that this is an e-commerce store that sells goods such as clothes, shoes, bags, and other merchandise. And what this store uses is a product recommendation system: an algorithm where, once the user searches some keyword, let's say "clothing" for instance, it generates results with products that could be relevant for the customer based on, say, their profile, transaction history, and so forth. What we want to test in this example is whether changing the recommendation system provides more relevant products to users, thereby boosting the revenue of this e-commerce store. So we have a general sense of what this problem involves. One thing I want to mention: I've worked with a number of clients on A/B testing interview questions, and one thing I've noticed is that clients will often skip this problem-statement part and jump right into the statistical methodology, proposing the experiment design. You don't want to engage the interview question that way. You always want to start with the business context and then segue over to the statistical methodology. Now, as
you're fleshing out the business goal of this problem, one thing you want to do is clarify the user journey, or the product experience, that you're trying to change. This e-commerce store has the following user funnel: a user visits, meaning they land on the landing page; then they search for an item, which produces results based on the recommendation algorithm; then the user browses a couple of items; then eventually they might click an item; and then they purchase it, which is the ultimate success we seek to achieve through the change. Thinking about the user journey is important because later down the road, when you're thinking about the success metric and the target user population, you want to think about at what stage you consider a user to be a participant in the experiment, and this is something we're going to talk about very soon. One pro tip I definitely want to mention: when you have an A/B testing interview round, whether it's for Meta or Google, go through the platform's core product and core features and create an outline of the user journey. Once you have done this, it's going to be really helpful when you are in the actual interview setting and you're asked to design an A/B test based on that particular product. Once you have established the user journey, the next thing to do is define the success metric: what is it that you need
to move in order to be confident that the change you're applying is actually better for the platform overall? When you think about the success metric, you want to think about a few qualities. The first is: is it measurable? Is it the type of user behavior you can actually collect through the instrumentation on the platform? The next quality is: is your metric attributable? Can you establish a clear linkage that the cause, the treatment you applied to the platform, led to the effect, the change in the metric? The next quality is: is your metric sensitive? You want a metric that can serve as evidence that there is a genuine difference in the user experience between the old algorithm and the new algorithm. Statistically speaking, you want to find a metric that has low variability. A bad metric in this case would be, say, time spent on the website: the total time spent on the website for a given user might have high variability, so you cannot clearly tell whether there is an actual difference in how users engage with the store given the underlying change to the ranking algorithm. The fourth quality is timeliness: A/B experiments need to be quick, since experimentation is a very iterative process for improving a product rapidly, so you want to ensure that what you're measuring is timely. You don't want to wait weeks or months to observe a user behavior before making the change, because that's a very costly way of running an experiment. So think about what short-term behavior can serve as a proxy for the long-term desired behavior. Based on these four qualities, ultimately the success metric
that we want to use for this case is revenue per day per user. Now, once you establish the problem statement clearly, the next thing to do is set up your hypothesis test. This is where you state your null hypothesis and alternative hypothesis. The null hypothesis in this case is that the average revenue per day per user between the baseline and the variant ranking algorithms is the same; the alternative hypothesis is that the average revenue per day per user between the baseline and the variant ranking algorithms is different. Once you've stated the hypotheses, the next thing to do is set the significance level, your alpha. The significance level is the decision threshold: if the probability of observing a particular result under the null hypothesis is very low, it is deemed statistically significant. In this example, we'll set the significance level at 0.05, the usual value in online experiments. The next value to set is the statistical power, usually 0.80, which means an 80% probability of detecting an effect given that the alternative hypothesis is true. And lastly, you want to set your practical significance, the minimum detectable effect (MDE); typically, for a large online platform with millions of users, the MDE is a 1% lift. Once you've set up the hypothesis test, the next thing to do is design the experiment itself.
The first thing to consider as you design the experiment is the randomization unit. In this case, we're going to randomize at the user level: each user is randomly assigned to the control group or the treatment group. Once you have defined the randomization unit, you need to think about which population of users you want to target, and this relates to the user funnel I talked about earlier: a user visits, searches, browses, clicks an item, and eventually purchases. So at what stage do you allow a user to participate in the experiment? In this case we want to target users who have actually started searching, because this is where the algorithm kicks in and they're actually exposed to the treatment condition, either the old algorithm or the new algorithm. The third thing to define is your sample size. The general rule of thumb is the following formula: n ≈ 16σ²/δ², where σ² is the variance of the metric and δ is the difference in the key metric between treatment and control that you want to detect. This formula assumes a significance level of 0.05 and statistical power of 0.80. Once you have determined your sample size, the next step is to determine the duration of your experiment.
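As a quick sketch, the rule of thumb can be computed directly (the numbers below are hypothetical, not from the case):

```python
def sample_size_per_group(variance, delta):
    """Rule-of-thumb sample size per group: n ~= 16 * variance / delta^2.

    Assumes a two-sided test at significance level 0.05 with power 0.80.
    variance: variance of the metric (e.g., revenue per day per user)
    delta: minimum difference between treatment and control you want to detect
    """
    return 16 * variance / delta ** 2

# Hypothetical inputs: metric standard deviation of $10, detecting a $1.10 lift
n = sample_size_per_group(variance=10 ** 2, delta=1.10)
print(round(n))  # about 1322 users per group
```

Note how the required sample size grows with the metric's variance and shrinks quadratically as the detectable difference gets larger, which is exactly why a low-variability, sensitive metric matters.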
The typical duration is one to two weeks. You don't want to run the experiment for less than one week, because you want to account for the day-of-week effect: there could be underlying differences in how users engage with the website on weekdays versus weekends. Once you have designed the experiment, the next step is to run
the experiment, and this is where you use instrumentation and experimentation platforms to collect the data and track your results. Now, it is very important that while the experiment is running, you do not peek at the p-value, meaning you don't make any launch decision while the experiment hasn't completed yet. The reason is that when you're peeking, especially at low sample sizes, there is a lot of variability in where the lift lands, so you might falsely conclude that there is an underlying difference when there isn't one. So once you have determined the experiment duration from your statistical power and sample size, you have to wait it out; otherwise you increase the chance that you falsely reject the null hypothesis when it is actually true.
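To see why peeking is dangerous, here is a small illustrative simulation (my own sketch, not from the video): we run many A/A experiments where there is no true difference, check the p-value of a two-sample z-test at several interim points, and stop at the first "significant" result. The false positive rate ends up well above the nominal 5%.

```python
import random
from statistics import NormalDist

random.seed(7)

def peeking_false_positive_rate(n_experiments=1000, n_per_group=1000, n_peeks=10):
    """Fraction of A/A tests (no true effect) declared significant when the
    p-value is checked at several interim points instead of only at the end."""
    norm = NormalDist()
    rejections = 0
    for _ in range(n_experiments):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        for peek in range(1, n_peeks + 1):
            k = peek * n_per_group // n_peeks  # users collected so far per group
            mean_diff = sum(a[:k]) / k - sum(b[:k]) / k
            se = (2 / k) ** 0.5  # standard error; both groups have unit variance
            p_value = 2 * (1 - norm.cdf(abs(mean_diff) / se))
            if p_value < 0.05:  # stop early and "declare a winner"
                rejections += 1
                break
    return rejections / n_experiments

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

Checking only once at the planned end of the experiment would bring the rejection rate back to roughly 5%.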
After you have run the experiment, the next thing to do is perform validity checks. This is where you conduct sanity checks, starting with the instrumentation: are there any bugs or glitches that could affect the experiment results? Another potential issue to look at is external factors: maybe you ran the experiment during a holiday, or when a competitor launched something important, or during general economic conditions like COVID or a recession. When some external disruption happens while you run an experiment, it can impact your results, so ideally you want to run the experiment in a period that avoids events like these. The next thing to check is selection bias. You assume that the underlying distributions of the control and treatment groups, before they're exposed to the treatment condition, are homogeneous, and one way to confirm that the distributions are the same is to run an A/A test. The next thing to check is sample ratio mismatch. If you're randomly assigning users to the control or treatment group, then out of all the participants in the experiment, 50% should be in control and 50% in treatment. But there are cases where, because of flaws in the randomization, the ratio is not 50/50; it might be 49/51. To check whether this could pose an issue later down the road, you use a chi-squared test to verify that the split between the two groups is sound.
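A minimal sample-ratio-mismatch check might look like this (the counts are hypothetical; in practice you could use `scipy.stats.chisquare` instead of computing the statistic by hand):

```python
def srm_chi_squared(control_n, treatment_n):
    """Chi-squared statistic for an expected 50/50 split (1 degree of freedom)."""
    total = control_n + treatment_n
    expected = total / 2  # expected count in each group under a 50/50 split
    return ((control_n - expected) ** 2 / expected
            + (treatment_n - expected) ** 2 / expected)

# Hypothetical counts: a 49%/51% split out of 100,000 users
stat = srm_chi_squared(49_000, 51_000)
print(stat)  # 40.0 -- far above the ~3.84 critical value at alpha = 0.05,
             # so this split would signal a sample ratio mismatch
```

A seemingly small 49/51 imbalance is wildly unlikely at this scale, which is why SRM checks catch broken randomization that eyeballing the counts would miss.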
The next item to check is the novelty effect. If you made a change to the website, users might react simply because of the novelty of being exposed to something new. One way to detect a novelty effect is to look at the success metric by user segment: compare new visitors versus returning visitors, and if you see a difference between the two, there is likely a novelty effect present, so you may want to analyze the experiment segmented by the new-visitor group versus the returning-visitor group. Once you've conducted the validity checks and there's no issue with the experiment, you can interpret the results. When you
interpret the results, you want to look at the direction of the success metric: is the lift negative or positive? You want to consider the p-value, because it helps you establish whether the lift you saw is statistically significant or not, and you also want to consider the confidence interval. So, based on the experiment we ran for this example, what we see is that the average revenue per day per user in the control group is $25.00, whereas in the treatment group it's $26.10. This produces the following lift: in absolute terms, the difference is $1.10; in relative terms, it's an increase of 4.4%. The lift we saw is statistically significant, because the p-value of 0.01 is less than the significance level of 0.05, and we also see that the confidence interval at that significance level is between a 3.4% and 5.4% lift. So the initial interpretation is the following: there is statistical significance to reject the null hypothesis and conclude that the average revenue per day per user between the baseline and the variant ranking algorithms is different.
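The lift numbers above follow directly from the group means (the $25.00 and $26.10 figures are from the example; the rest is just arithmetic):

```python
control_mean, treatment_mean = 25.00, 26.10  # average revenue per day per user

absolute_lift = treatment_mean - control_mean
relative_lift = absolute_lift / control_mean

print(f"absolute lift: ${absolute_lift:.2f}")  # absolute lift: $1.10
print(f"relative lift: {relative_lift:.1%}")   # relative lift: 4.4%
```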
With this result, we can now consider whether we want to launch or not. When you think about whether to launch, there are three factors to consider. The first factor is metric trade-offs: you might have a case where the success metric improved but the guardrail metrics, or the secondary metrics, declined, so you have to think about the pros and cons of launching given that the guardrail metrics might have declined. The next factor is the cost of launching: if the cost of building this out, rolling it out to all users, and maintaining the change is high, then maybe this isn't something you actually want to launch. The last factor is the risk of a false positive, your type I error. For instance, if you falsely conclude that there is an effect when there isn't, and you make the change, it might have negative consequences for users: you might end up providing a poor experience, users might churn, and ultimately that is going to negatively affect the bottom line of the product. So these are three important business factors to consider. Now you want to go back to the interpretation of the results you got from the experiment, and along
with the business context and the statistical results, ultimately decide whether you're going to launch or not. So what we're going to do is look at a few example cases, where we'll look at possible ranges of the lift along with the confidence interval, and think about what a sound decision would be based on the results. In the first case, the lift is positive but still less than the practical significance, and both the lower bound and the upper bound of the confidence interval are below the practical significance, which in this case is a positive 1%. Here you might want to change the algorithm or scrap the idea altogether. In the second case, the lift and both bounds of the confidence interval are practically significant, so this provides strong support that we should launch. In the third case, the entire interval is less than zero, so it's in negative territory; in this case you want to consider iterating on the idea or scrapping it altogether. In the next example, the expected lift is positive, but the bounds of the interval span negative and positive territory, and the interval is very wide. What is important to note is that the upper bound of the interval is practically significant, it's greater than 1%, so there is still some likelihood that you might see a lift with practical significance. What you want to consider in this case is rerunning the experiment with increased statistical power, which will improve the precision of the lift estimate. And in the last case, the expected lift is practically significant, but the lower bound is not practically significant while still in positive territory; the best thing to do here is to rerun the experiment with increased statistical power, just to be absolutely sure that the underlying change is practically significant.
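The decision heuristics above can be summarized in a small sketch (the 1% MDE threshold is from the example; the function and its action strings are my own paraphrase of the cases):

```python
def launch_decision(ci_lower, ci_upper, mde=0.01):
    """Map a confidence interval on the relative lift to a recommended action,
    given a practical significance (minimum detectable effect) threshold."""
    if ci_upper < 0:
        return "negative lift: iterate on the idea or scrap it"
    if ci_upper < mde:
        return "below practical significance: change the algorithm or scrap it"
    if ci_lower >= mde:
        return "practically significant: launch"
    return "inconclusive: rerun with increased statistical power"

# The example's result: a 3.4%-5.4% lift interval clears the 1% threshold
print(launch_decision(0.034, 0.054))  # practically significant: launch
```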
So based on all of these considerations, the business context along with the various statistical outcomes, ultimately the decision we want to make is to launch this new algorithm as a way to provide more relevant product recommendations to users, thereby improving revenue overall. So there you have it: this is the end-to-end process of how to walk through an A/B test and how to address an A/B testing interview question. I hope you found this video really helpful. If you need any help in terms of mock interviews, coaching, or courses that come with A/B testing material, various business case problems, and community access, definitely check out datainterview.com. And if you have any questions along the way, feel free to drop a comment down below or send me an email at dan@datainterview.com. I'll see you in the next video.