A/B Testing in Data Science Interviews by a Google Data Scientist | DataInterview
By DataInterview
Summary
Topics Covered
- Always Start with Business Context
- Success Metrics Need Four Qualities
- Never Peek at P-Values Mid-Experiment
- Launch Decisions Beyond Statistical Significance
Full Transcript
If you are preparing for a data science interview, A/B testing is a must-know concept. Whether it's for Google, Meta, Uber, and so forth, A/B testing is a very popular topic in interview questions, because data scientists in those companies use A/B tests to figure out whether a change they observe on those platforms is due to random chance or due to the actual change that they implemented. So what we are going to do in this video is walk through an A/B test based on a real-life example, and along the way I'll pepper in some hints that can be helpful for acing interview questions on A/B testing. Hey everyone, I'm Dan, the founder of datainterview.com and an ex-Google and ex-PayPal data scientist. In this video we're going to do a deep dive on A/B testing based on a real-life example. I'll walk through the procedure of setting up an A/B test, and I'll talk about a couple of things that you should definitely mention whenever you're walking through an A/B testing case. Now let's get started. The first thing to realize is that when you are walking through the A/B testing procedure,
there are essentially seven steps to consider. The first step is understanding the problem statement: this is where you make sense of the case problem by asking clarifying questions to the interviewer, and figure out what the success metric and the user journey are; we'll do a deep dive on this topic in a second. The second step is defining your hypothesis test: you state your null and alternative hypotheses, and you set parameter values for your experiment such as the significance level and statistical power. The third step is designing the experiment itself: this is where you talk about the randomization unit, which user type you're going to target for the experiment, and various other things you need to consider in the design. The next step is running the experiment itself, where you think about the instrumentation required to collect the data and analyze the results. Once you've collected the data, the next thing to do, even before you interpret the results and decide whether to launch, is sanity checks, or validity checks, because if your experiment design was flawed, or some bias crept into the data collection itself, then you have flawed results and you might end up making a poor decision. So running sanity checks before you even think about interpretation and the launch decision is crucial. Once you have done the sanity checks, the next step is to interpret the results in terms of the lift you saw, the p-value, and the confidence interval. And lastly, now that you have the statistical results along with the business context, you make a decision about whether you're going to launch the change or not.
Now, with all of the steps covered, let's do a deep dive on an actual case problem. Suppose that an online clothing store called Fashion Web Store wants to test a new ranking algorithm to provide products more relevant to customers. How would you design an experiment? To tackle this problem, we first want to understand the nature of the product. We know that this is an e-commerce store that sells goods such as clothes, shoes, bags, and other merchandise. And what this store uses is a product recommendation system: an algorithm where, once the user searches some keyword, let's say "clothing" for instance, it generates results with products that could be relevant for the customer based on, say, their profile, transaction history, and so forth. What we want to test in this example is whether changing the recommendation system provides more relevant products to users, thereby boosting the revenue of this e-commerce store. So we have a general sense of what this problem involves. One thing I want to mention: I've worked with a number of clients on A/B testing interview questions, and one thing I've noticed is that clients will often skip this problem-statement part and jump right into the statistical methodology, proposing the experiment design. You don't want to engage the interview question that way. You always want to start with the business context and then segue over to the statistical methodology. Now, as
you're fleshing out the business goal of this problem, one thing you want to do is clarify the user journey, or the product experience, that you're trying to change. This e-commerce store has the following user funnel: a user visits, meaning they land on the landing page; then they search for an item, which produces results based on the recommendation algorithm; then the user browses a couple of items; then eventually they might click an item; and then they purchase it, which is the ultimate success we seek to achieve through the change. Thinking about the user journey is important because later down the road, when you're thinking about the success metric and the target user population, you want to think about at what stage you consider a user to be a participant in the experiment, and this is something we're going to talk about very soon. One pro tip I definitely want to mention: when you have an A/B testing interview round, whether it's for Meta or Google, go through the platform's core product and core features and create an outline of the user journey. Once you have done this, it's going to be really helpful when you are in the actual interview setting and you're asked to design an A/B test based on that particular product. Once you have established the user journey, the next thing to do is define the success metric: what is it that you need
to move in order to be confident that the change you're applying is actually better for the platform overall? When you think about the success metric, you want to think about a few qualities. The first is: is it measurable? Is it the type of user behavior you can actually collect through the instrumentation on the platform? The next quality is: is your metric attributable? Can you establish a clear linkage that the cause, the treatment you applied to the platform, led to the effect, the change in the metric? The next quality is: is your metric sensitive? You want a metric that can serve as evidence that there is a genuine difference in the user experience between the old algorithm and the new algorithm. Statistically speaking, you want to find a metric that has low variability. A bad metric in this case would be, say, time spent on the website: the total time spent on the website for a given user might have high variability, so you cannot clearly tell whether there is an actual difference in how users engage with the store given the underlying change to the ranking algorithm. The fourth quality is timeliness: A/B experiments need to be quick, since experimentation is a very iterative process for improving a product rapidly, so you want to ensure that what you're measuring is timely. You don't want to wait weeks or months to observe a user behavior before making the change, because that's a very costly way of running an experiment. So think about what short-term behavior can serve as a proxy for the long-term desired behavior. Based on these four qualities, ultimately the success metric
that we want to use for this case is revenue per day per user. Now, once you establish the problem statement clearly, the next thing to do is set up your hypothesis test. This is where you state your null hypothesis and alternative hypothesis. The null hypothesis in this case is that the average revenue per day per user between the baseline and the variant ranking algorithms is the same; the alternative hypothesis is that the average revenue per day per user between the baseline and the variant ranking algorithms is different. Once you've stated the hypotheses, the next thing to do is set the significance level, your alpha. The significance level is the decision threshold: if the probability of observing a particular result under the null hypothesis is very low, it is deemed statistically significant. In this example, we'll set the significance level at 0.05, the usual value in online experiments. The next value to set is the statistical power, usually 0.80, which means an 80% probability of detecting an effect given that the alternative hypothesis is true. And lastly, you want to set your practical significance, the minimum detectable effect (MDE); typically, for a large online platform with millions of users, the MDE is a 1% lift. Once you've set up the hypothesis test, the next thing to do is design the experiment itself.
The first thing to consider as you design the experiment is the randomization unit. In this case, we're going to randomize at the user level: each user is randomly assigned to the control group or the treatment group. Once you have defined the randomization unit, you need to think about which population of users you want to target, and this relates to the user funnel I talked about earlier: a user visits, searches, browses, clicks an item, and eventually purchases. So at what stage do you allow a user to participate in the experiment? In this case we want to target users who have actually started searching, because this is where the algorithm kicks in and they're actually exposed to the treatment condition, either the old algorithm or the new algorithm. The third thing to define is your sample size. The general rule of thumb is the following formula: n ≈ 16σ²/δ², where σ² is the variance of the metric and δ is the difference in the key metric between treatment and control that you want to detect. This formula assumes a significance level of 0.05 and statistical power of 0.80. Once you have determined your sample size, the next step is to determine the duration of your experiment.
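As a quick sketch, the rule of thumb can be computed directly (the numbers below are hypothetical, not from the case):

```python
def sample_size_per_group(variance, delta):
    """Rule-of-thumb sample size per group: n ~= 16 * variance / delta^2.

    Assumes a two-sided test at significance level 0.05 with power 0.80.
    variance: variance of the metric (e.g., revenue per day per user)
    delta: minimum difference between treatment and control you want to detect
    """
    return 16 * variance / delta ** 2

# Hypothetical inputs: metric standard deviation of $10, detecting a $1.10 lift
n = sample_size_per_group(variance=10 ** 2, delta=1.10)
print(round(n))  # about 1322 users per group
```

Note how the required sample size grows with the metric's variance and shrinks quadratically as the detectable difference gets larger, which is exactly why a low-variability, sensitive metric matters.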
The typical duration is one to two weeks. You don't want to run the experiment for less than one week, because you want to account for the day-of-week effect: there could be underlying differences in how users engage with the website on weekdays versus weekends. Once you have designed the experiment, the next step is to run
the experiment, and this is where you use instrumentation and experimentation platforms to collect the data and track your results. Now, it is very important that while the experiment is running, you do not peek at the p-value, meaning you don't make any launch decision while the experiment hasn't completed yet. The reason is that when you're peeking, especially at low sample sizes, there is a lot of variability in where the lift lands, so you might falsely conclude that there is an underlying difference when there isn't one. So once you have determined the experiment duration from your statistical power and sample size, you have to wait it out; otherwise you increase the chance that you falsely reject the null hypothesis when it is actually true.
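To see why peeking is dangerous, here is a small illustrative simulation (my own sketch, not from the video): we run many A/A experiments where there is no true difference, check the p-value of a two-sample z-test at several interim points, and stop at the first "significant" result. The false positive rate ends up well above the nominal 5%.

```python
import random
from statistics import NormalDist

random.seed(7)

def peeking_false_positive_rate(n_experiments=1000, n_per_group=1000, n_peeks=10):
    """Fraction of A/A tests (no true effect) declared significant when the
    p-value is checked at several interim points instead of only at the end."""
    norm = NormalDist()
    rejections = 0
    for _ in range(n_experiments):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        for peek in range(1, n_peeks + 1):
            k = peek * n_per_group // n_peeks  # users collected so far per group
            mean_diff = sum(a[:k]) / k - sum(b[:k]) / k
            se = (2 / k) ** 0.5  # standard error; both groups have unit variance
            p_value = 2 * (1 - norm.cdf(abs(mean_diff) / se))
            if p_value < 0.05:  # stop early and "declare a winner"
                rejections += 1
                break
    return rejections / n_experiments

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

Checking only once at the planned end of the experiment would bring the rejection rate back to roughly 5%.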
After you have run the experiment, the next thing to do is perform validity checks. This is where you conduct sanity checks, starting with the instrumentation: are there any bugs or glitches that could affect the experiment results? Another potential issue to look at is external factors: maybe you ran the experiment during a holiday, or when a competitor launched something important, or during general economic conditions like COVID or a recession. When some external disruption happens while you run an experiment, it can impact your results, so ideally you want to run the experiment in a period that avoids events like these. The next thing to check is selection bias. You assume that the underlying distributions of the control and treatment groups, before they're exposed to the treatment condition, are homogeneous, and one way to confirm that the distributions are the same is to run an A/A test. The next thing to check is sample ratio mismatch. If you're randomly assigning users to the control or treatment group, then out of all the participants in the experiment, 50% should be in control and 50% in treatment. But there are cases where, because of flaws in the randomization, the ratio is not 50/50; it might be 49/51. To check whether this could pose an issue later down the road, you use a chi-squared test to verify that the split between the two groups is sound.
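A minimal sample-ratio-mismatch check might look like this (the counts are hypothetical; in practice you could use `scipy.stats.chisquare` instead of computing the statistic by hand):

```python
def srm_chi_squared(control_n, treatment_n):
    """Chi-squared statistic for an expected 50/50 split (1 degree of freedom)."""
    total = control_n + treatment_n
    expected = total / 2  # expected count in each group under a 50/50 split
    return ((control_n - expected) ** 2 / expected
            + (treatment_n - expected) ** 2 / expected)

# Hypothetical counts: a 49%/51% split out of 100,000 users
stat = srm_chi_squared(49_000, 51_000)
print(stat)  # 40.0 -- far above the ~3.84 critical value at alpha = 0.05,
             # so this split would signal a sample ratio mismatch
```

A seemingly small 49/51 imbalance is wildly unlikely at this scale, which is why SRM checks catch broken randomization that eyeballing the counts would miss.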
The next item to check is the novelty effect. If you made a change to the website, users might react simply because of the novelty of being exposed to something new. One way to detect a novelty effect is to look at the success metric by user segment: compare new visitors versus returning visitors, and if you see a difference between the two, there is likely a novelty effect present, so you may want to analyze the experiment segmented by the new-visitor group versus the returning-visitor group. Once you've conducted the validity checks and there's no issue with the experiment, you can interpret the results. When you
interpret the results, you want to look at the direction of the success metric: is the lift negative or positive? You want to consider the p-value, because it helps you establish whether the lift you saw is statistically significant or not, and you also want to consider the confidence interval. So, based on the experiment we ran for this example, what we see is that the average revenue per day per user in the control group is $25.00, whereas in the treatment group it's $26.10. This produces the following lift: in absolute terms, the difference is $1.10; in relative terms, it's an increase of 4.4%. The lift we saw is statistically significant, because the p-value of 0.01 is less than the significance level of 0.05, and we also see that the confidence interval at that significance level is between a 3.4% and 5.4% lift. So the initial interpretation is the following: there is statistical significance to reject the null hypothesis and conclude that the average revenue per day per user between the baseline and the variant ranking algorithms is different.
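The lift numbers above follow directly from the group means (the $25.00 and $26.10 figures are from the example; the rest is just arithmetic):

```python
control_mean, treatment_mean = 25.00, 26.10  # average revenue per day per user

absolute_lift = treatment_mean - control_mean
relative_lift = absolute_lift / control_mean

print(f"absolute lift: ${absolute_lift:.2f}")  # absolute lift: $1.10
print(f"relative lift: {relative_lift:.1%}")   # relative lift: 4.4%
```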
With this result, we can now consider whether we want to launch or not. When you think about whether to launch, there are three factors to consider. The first factor is metric trade-offs: you might have a case where the success metric improved but the guardrail metrics, or the secondary metrics, declined, so you have to think about the pros and cons of launching given that the guardrail metrics might have declined. The next factor is the cost of launching: if the cost of building this out, rolling it out to all users, and maintaining the change is high, then maybe this isn't something you actually want to launch. The last factor is the risk of a false positive, your type I error. For instance, if you falsely conclude that there is an effect when there isn't, and you make the change, it might have negative consequences for users: you might end up providing a poor experience, users might churn, and ultimately that is going to negatively affect the bottom line of the product. So these are three important business factors to consider. Now you want to go back to the interpretation of the results you got from the experiment, and along
with the business context and the statistical results, ultimately decide whether you're going to launch or not. So what we're going to do is look at a few example cases, where we'll look at possible ranges of the lift along with the confidence interval, and think about what a sound decision would be based on the results. In the first case, the lift is positive but still less than the practical significance, and both the lower bound and the upper bound of the confidence interval are below the practical significance, which in this case is a positive 1%. Here you might want to change the algorithm or scrap the idea altogether. In the second case, the lift and both bounds of the confidence interval are practically significant, so this provides strong support that we should launch. In the third case, the entire interval is less than zero, so it's in negative territory; in this case you want to consider iterating on the idea or scrapping it altogether. In the next example, the expected lift is positive, but the bounds of the interval span negative and positive territory, and the interval is very wide. What is important to note is that the upper bound of the interval is practically significant, it's greater than 1%, so there is still some likelihood that you might see a lift with practical significance. What you want to consider in this case is rerunning the experiment with increased statistical power, which will improve the precision of the lift estimate. And in the last case, the expected lift is practically significant, but the lower bound is not practically significant while still in positive territory; the best thing to do here is to rerun the experiment with increased statistical power, just to be absolutely sure that the underlying change is practically significant.
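The decision heuristics above can be summarized in a small sketch (the 1% MDE threshold is from the example; the function and its action strings are my own paraphrase of the cases):

```python
def launch_decision(ci_lower, ci_upper, mde=0.01):
    """Map a confidence interval on the relative lift to a recommended action,
    given a practical significance (minimum detectable effect) threshold."""
    if ci_upper < 0:
        return "negative lift: iterate on the idea or scrap it"
    if ci_upper < mde:
        return "below practical significance: change the algorithm or scrap it"
    if ci_lower >= mde:
        return "practically significant: launch"
    return "inconclusive: rerun with increased statistical power"

# The example's result: a 3.4%-5.4% lift interval clears the 1% threshold
print(launch_decision(0.034, 0.054))  # practically significant: launch
```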
So based on all of these considerations, the business context along with the various statistical outcomes, ultimately the decision we want to make is to launch this new algorithm as a way to provide more relevant product recommendations to users, thereby improving revenue overall. So there you have it: this is the end-to-end process of how to walk through an A/B test and how to address an A/B testing interview question. I hope you found this video really helpful. If you need any help in terms of mock interviews, coaching, or courses that come with A/B testing material, various business case problems, and community access, definitely check out datainterview.com. And if you have any questions along the way, feel free to drop a comment down below or send me an email at dan@datainterview.com. I'll see you in the next video.