
Stanford Seminar: Peeking at A/B Tests - Why It Matters and What to Do About It

By Stanford Online

Summary

Topics Covered

  • Peeking at P-values can inflate false positives to roughly 60%
  • Always-valid P-values support continuous monitoring
  • Adaptive decision boundaries scale like sqrt(log n / n)
  • Adaptive tests stop faster than fixed-horizon tests
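The sqrt(log n / n) boundary in the third bullet can be illustrated with a small simulation (a sketch under simplifying assumptions: Gaussian data, an ad hoc constant of 2.0 on the boundary, and peeking that starts at n = 100; this shows the shape intuition, not Optimizely's actual calibration):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 10_000   # maximum number of observations
SIMS = 300   # simulated null (A/A) experiments
MIN_N = 100  # first point at which we peek

# Under the null, the running z-statistic is a scaled random walk:
# z_n = S_n / sqrt(n), with S_n a sum of standard normal increments.
S = np.cumsum(rng.standard_normal((SIMS, N)), axis=1)
n = np.arange(1, N + 1)
z = np.abs(S) / np.sqrt(n)

fixed_cut = 1.96                      # pointwise 5% two-sided cutoff
adapt_cut = 2.0 * np.sqrt(np.log(n))  # grows like sqrt(log n) in z-units,
                                      # i.e. sqrt(log n / n) on the mean scale

ever_fixed = float(np.mean(np.any(z[:, MIN_N:] > fixed_cut, axis=1)))
ever_adapt = float(np.mean(np.any(z[:, MIN_N:] > adapt_cut[MIN_N:], axis=1)))
print(f"null paths ever crossing fixed 1.96 cutoff: {ever_fixed:.2f}")
print(f"null paths ever crossing adaptive boundary: {ever_adapt:.2f}")
```

With continuous peeking, the fixed 1.96 cutoff is crossed under the null far more often than 5% of the time, while a boundary that widens like sqrt(log n) in z-units keeps the ever-crossing rate small.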

Full Transcript

great so I'm really excited to be speaking here uh it's my first time speaking to an HCI audience of any kind so I hope you'll forgive me if I don't follow the norms that need to be followed um so a couple comments on the

work I'm going to talk about it is sort of work that has a pretty strongly statistical nature but I prepared the talk from first principles um the talk is actually in three parts there's going to be sort

of an introduction to kind of what we did and then um I'm going to go to the board actually which is an amazing thing I hope that everybody can be okay with that so I'm going to

actually use the board to give you some intuition for why our idea works and then um I'll go back and and show you some more slides on on some of the results and and visuals of the platform so what is the talk about the talk is

actually about some work that I did with a platform called Optimizely and so I want to be clear that this is work that I actually did as a technical adviser to Optimizely so you know in particular you

know just by by way of full disclosure I'm still an adviser to them and you know and this is work that I did essentially as a consultant for them um

the idea was that um we wanted to try to think about how people look at the information that you report after a randomized experiment called an AB test um in sort of technology

platforms and and how do they process the information that they get back okay so let's start um by first just talking about what is AB testing so um this is just to make sure I set context for everyone hopefully those of you that have sort of been around the tech

industry would have seen something like this before but the idea is that AB testing is you know in Silicon Valley we have to invent new names for old things all the time so AB testing is a new name for an old thing called randomized

controlled trial okay and randomized controlled trials are sort of the gold standard of experimentation um the basic idea is that you have two versions of something you want to compare and the

way you compare them is by taking the the subject population in this case it's say the users of your website and subjecting half of them to you know one of the variations and half of them to

the other variation so here maybe you want to compare using a black Buy Now button against a red Buy Now button and so you randomize and send half of your users here and half of your users here

and then um ask yourself you know some questions like which one has a higher conversion rate okay so AB testing is basically the process of running experiments and you know the only reason

that phrase matters is that that's the phrase that's used in the tech industry otherwise this should really be a talk that's called randomized controlled trials now one of the themes of the talk is

going to be that um running AB tests in in sort of the tech industry changes how we think about randomized experimentation okay and I want to I want to get to that um but for now I

just want to make sure that you know we set context on what we're talking about it's basically that the prototypical type of experiment is going to be something like that okay AB tests are used for everything

um you know so they're used to test out algorithms like search algorithms or Uber and Lyft will test out pricing algorithms using AB testing certainly used for changes to design um in more sort of interesting ways they'll

often be used to test out even things like logistics so like on the back end for you know Amazon or uh companies like that how they handle uh delivery okay so let's start 100 years

ago um well before HCI and what we're going to think about is the origins of AB testing and kind of a remarkable fact is that most of the statistics that

are displayed in AB testing dashboards today in tech companies um actually rely on things that were done 100 years ago um so farmer Ron and farmer Ron here is Ron Fisher who's sort of the father of

of the way we think about experimental design um he's you know a very well-known statistician so um what does he want to do so he wants to test two fertilizers okay and literally this example about

fertilizers and crop yields really was the original motivation for thinking about experimental design sorry sorry Michael I should have said that before you ran

up so okay so what does he do so the way you are taught if you take a Stats 101 class or you read an experimental design book the way you're taught to do this experiment is

to apply what's called fixed sample size testing okay and I'm going to walk you through how it works um and again just to reemphasize right I am doing things from first principles so my

apologies if this is sort of old hat to many of you um but hopefully you know for those of you for whom it's not this just sets context correctly so the first thing you need to do is you need to commit to your sample size which is how

many plots are you going to use to compare these fertilizers okay um obviously the more plots you use the more likely you are to get kind of an honest assessment of whether one is

better than the other so better higher sample size will be better okay then what do you do you wait for a crop cycle and you collect your data at the end so you know you use one fertilizer in some subset of the plots another fertilizer

and the other subset of the plots and then you wait until the end and see you know um see what you get and here's the key part the way that you analyze the

data is by literally running the following hypothesis test which is that I hypothesize what would have happened if there really was no difference between

the two fertilizers right so I had a collection of of of plots of land where I used one fertilizer another one where I used the other and what I first asked myself is suppose there really was no difference between these two right what

do I think would happen in that case why do I run that hypothesis test I compare it against what I actually received and now the question I ask is well how

unlikely is it that if there really was no difference that I would see what I got in in my experiment okay that's what's called the P value so the P value is literally a measure of false

positives because it's basically saying that you know if there really is no difference and you believe that you you look at the data and what you conclude

is that the data is really unlikely um if if there had been no difference you could still be wrong there still might have been no difference and you know it's just that rare thing that happens where even

though there was no difference you actually see a big difference right so that that that event where you think there's a difference and there wasn't one is a false positive right so this P the P value is a way of capturing the

false positive probability and the role that it plays in decision-making is that you set for yourself a tolerance level for false positives all right that's often called the size of the test what you

decide for yourself well what level of false positives are you willing to tolerate you know in scientific publishing we typically want P values that are smaller than you know 0.1% um in industry you'll see people

make decisions where the cutoffs are like 5% or 10% you know in some cases false positives matter a lot in some cases they matter little right and that's kind of a measure of risk that you're willing to tolerate you should really think of that in the same way

that you would think about binary classification so in binary classification you're trading off false positives and false negatives the trade-off is that as you become more conservative here you're also less likely to detect things that are real

okay um and so you know that's essentially a measure that trades off for you are you actually going to detect real effects against against kind of making mistakes and claiming that there's a difference when there really

wasn't one right and so what farmer Ron does then is he cuts off the P value at 5% and if it's less he says the results are significant does this make sense I want to stop here for a second and ask

if there's any questions on what farmer Ron is doing so again the key the key idea behind farmer Ron's approach is this idea of theorizing about what might have happened if there was no difference

and then asking well How likely or unlikely is it that I would see what I got um in the experiment that I ran okay right okay so this is very classical um I should add by the way

that that process of theorizing about what you might have received if there was no difference that's typically what's referred to as frequentist hypothesis testing so when you hear the word frequentist it refers exactly to that kind of theorizing

about what other things might have happened um okay so there's an important point about this approach it's not just like one way to think about things it's actually in some sense an optimal way to

think about things all right and I won't quantify in what sense it's optimal but informally what happens is that if you do this right then under you know under some assumptions lots of stuff being

brushed under the rug there um this approach actually allows you to optimally trade off false positives and detection of true positives right so in particular like you know when you work in binary classification when you're

building a classifier you you can ask yourself like you look at the the the ROC curve and you're looking for curves that trade off false positives and false negatives as well as possible right so one way to think about this approach as

a binary classification approach where your goal is when there really is a difference to detect it and when there's really no difference to not to not say there's a difference right so you're trading off false positives and false negatives this

approach actually achieves an optimal tradeoff there okay in an appropriate sense so um that's one of the reasons why it survived for 100 years

it has this optimality property despite what I would call a relatively simple user interface the thing about this rule is that you don't actually need to

be a statistician to apply that rule if somebody else runs the experiment they compute the P value they throw it up on a dashboard all you have to do is decide what level of false positives are you willing to tolerate and that's not an

unreasonable decision to be able to make right to ask yourself well how bad would it be if we made this call wrong on some things it would be quite bad on other things not so bad and I can tailor that 5% accordingly but other than that

that's the only thing I need and then my user interface is really simple just cut the P value off at 5% okay so so this is like this is kind of this is I I think

one of one of the reasons why um it survived for so long is because it's optimal despite a simple user interface um I think one other thing I want to highlight about this is that this also has a really nice organizational benefit

in companies because in companies you'll have many different people processing the same dashboard of results right and one nice thing about this is that a single number the P value summarizes the

experiment and each person can then bring their own desired sort of risk tolerance to the table which is this false positive probability and cut the P value off and draw their own inferences

um so that's that's sort of another nice feature of this is that you've got a kind of portable description of the experiment that can be shared across many people that might have different you know desired false positive probabilities that they actually care

about, managers versus data scientists for example okay any questions on this I'll pause for a second
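The fixed sample size recipe described above can be sketched in a few lines (a sketch assuming the usual normal approximation for a two-proportion test; the function name and the example conversion counts are invented for illustration):

```python
import math

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Fixed-horizon two-sided z-test for a difference in conversion
    rates, in the spirit of farmer Ron's fixed sample size test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # common rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal tail, via the complementary
    # error function: p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Commit to the sample size FIRST, then test exactly once at the end.
p = two_proportion_pvalue(120, 1000, 150, 1000)
# p is about 0.0496 here, just under the 5% cutoff
decision = "significant" if p < 0.05 else "not significant"
```

The simple user interface discussed above is exactly the last line: everything about the experiment is compressed into `p`, and each reader only has to pick their own cutoff.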

here all right yeah oh good okay so now let's flash forward 100 years and I think this is actually going to say 2016 because the slides are

a little bit out of date but that's fine so in 1916 data was expensive and slow practitioners were trained statisticians and this is a really important point I think one of the points I want to emphasize is that 100 years ago if

you're using this methodology you could be pretty confident that the person you're showing the P value to was also a statistician or a scientist that they understood where the P value came from okay and then they understood how to process it they understood how to think

about it the way I just told you on the previous slide um this is Ronald Fisher the statistician that I mentioned but in 2016 we live in a really different environment as far as experimentation is

concerned um data is not expensive and slow you don't have to wait for crops to come in after a full season you get data in real time right and not only do you get the data in real time it can be processed and visualized and displayed

to the user in real time and not only is this happening but at the same time the people that are looking at these dashboards I mean could be anyone it could be you I mean many of

you are going to get jobs where at some point in the next 5 years you're going to look at a dashboard that contains the output of an experiment okay and that that is a huge change from where we were 100 years ago we're not

we're not necessarily expecting that every person that's going to look at the output of an experiment is someone who is a trained statistician anymore and I guess one of the things we want to understand is like what are the consequences of that

shift um so let's ask how are you different from farmer Ron right so here's how you're different and in the background there is a data center so the first thing is that if you're a data scientist staring at this dashboard then

time is money so you want results as quickly as possible okay so there's this trade-off too you see the thing is when the data only arrived at the end of a crop season yeah you might want results earlier but you don't really have a

choice you have to wait till the crops grow but that's not a problem in a tech platform because now the data is coming in continuously and it's more reasonable

to ask yourself to sort of stop the test as soon as possible one of my favorite questions in industry is how long do I have to wait to get statistical significance because the most commonly asked question of

a product manager is you know just how long are they going to have to wait until the test is done okay um so that's one and as a result what you end up doing is that you've got this dashboard

up that's tracking the experiment for you and I'll show you an example in a second so what do you do you just every day when you walk into work you pop it up because every day you're wondering is this the day that I can stop the experiment right like do I have enough

now have I seen enough all right that seems reasonable right you're living in a in a real-time data environment so it seems reasonable and what you do is you rely on that dashboard to tell you when your results

are significant okay so um the thing farmer Ron is doing is what's called fixed sample size testing a crucial feature is that farmer Ron committed

to the sample size that he was going to use in advance in particular the time of the experiment he's going to use in advance and the number of samples he's going to use in advance what you're going to do though is adjust the test length in real time which really means you're adjusting the number of

observations you're getting in real time right you're going to look at the data to decide how long this experiment is going to be which is not what the textbook tells you to do it's not what stats 101 advises you to do all right

and you're kind of in uncharted waters at that point and so what you do is you adjust the test length in real time based on the data coming in and notice I one of the reasons I set the slide up this way is I want to point out you're doing it for good reasons because you

want results as quickly as possible so it's not that you don't know your statistics necessarily that might be part of the reason but the main reason is just that you care about getting results as quickly as possible so that's

why you adjust in real time okay so here's a sample of what these dashboards look like um this is what the Optimizely dashboard looked like just before we made the changeover to the new methods that we use I want to point out that the

methods that we use don't change the dashboard a key feature of the methods was actually to leave the dashboard the same what we're going to do is change the numbers that are in the dashboard all right but almost any company that you go to is going to have dashboards

like this and I want to just quickly run through what the components are first of all I want to point out that the biggest numbers in the dashboard are how long the test has been running which really the reason that's there is to act as

like this running timer that you know every time you look at it reminds you that time is money and that you want to stop as quickly as you can all right um

then over here what we have is basically a matrix where the rows are different variations of say the web page that you're going to compare against the control web page right so each of these

gets compared back to the control and then in the columns we have different metrics that we might care about views for example a click on a picture and then each of these boxes is telling

us whether on that corresponding metric that that particular variation is better or worse than the original okay um and so you know what people are looking for and and the green and the red are the platform signaling statistical

significance to you so what people are looking for is to make you know is to wait as little as possible to be able to get these things to the green and red you know answers that they want and I

guess the thing I love most about this dashboard is actually this phrase right here I don't know if you can read it it says how long should I run my test okay and the reason I love that is because

farmer Ron from the textbook knew that that question needed to be asked before the dashboard got created like that was the question you were supposed to answer before you ran the experiment I love the fact that this is a question that's

being asked while you're running the experiment it's on the results page not on the project design page okay which is a complete inversion of how experimental design was supposed to work so we're actually encouraging you

to think in real time about how long to run the test okay so here's a thought experiment suppose that um a hundred of you are are now in this position of

running experiments there's about I don't know probably 50 or 60 of you in the room but suppose there's 100 of you and you know um you're going to run experiments like this and you're going to follow a pretty simple rule you're going to stare at that dashboard and

you're just going to wait until the first time the P value is less than a fixed cutoff say 5% okay um but the experiments that you're running are interesting in an important way

they're not actual AB tests they're what are called AA tests where both of the variations are actually the same thing and you know that right so this would be the case where like both buttons were black and why are you doing this I'll

come back to that in a second well let's suppose you do that where both variations are the same okay so the P value is being cut off at 5% right now remember 5% is the false positive

probability okay so what are we expecting will happen if we follow you know a correct statistical procedure what we would expect is that if we use a

false positive threshold of 5% any time we declare that there's actually a difference in an AA test that's a false positive because there is no difference so how

many of these should we declare a difference in probably about five maybe a little more maybe a little less there's some Randomness but about five right that's what we should get so my question is how many actually find a

significant result and stop early with this with this rule of stopping the first time you see something less than 5% and let's suppose that the sample size is up to a maximum of 10,000

visitors what's the number that you get yes so it is a number higher than five that's right otherwise this talk would end really quickly actually um so

so what is that any guesses as to how big it's going to be yeah everyone okay that's really high but it's lower than that so it's not everybody it's not quite that

bad um I mean actually ironically if that was true right I mean as you know a classifier that's always wrong is as useful as a classifier that's always right so if we ended up in that position we might actually have a pretty good

rule just do the opposite of whatever the test says um but but no so that's it's not going to be that bad um any other any other guesses how many people think it's going to be above 10% let's just we'll do we'll run an

ascending auction here how many people think it's going to be above 20% above 30% above

40% above 50% okay so it's above 50% all right so the the number of people who stop early and declare there's a difference is

actually over half just think about that for a second all right now here's this procedure that I declared to be optimal and great and 100 years old and used repeatedly and and all that and just a small amount of misuse which is this

like monitoring of the test continuously and then stopping early is enough to completely destroy the usefulness of this process like this is worse than not experimenting in my opinion right like I think you'd be better off

applying your own knowledge to your business instead of running experiments in this case I've actually said that to people that if this is what you're going to do then just don't bother AB testing
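The inflation described here is easy to reproduce in simulation (a sketch: the 50% baseline conversion rate, the choice to start peeking at 100 visitors, and the simulation count are arbitrary, and the exact inflated rate depends on how early and how often you peek):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10_000    # maximum visitors per variation
SIMS = 200    # number of simulated A/A tests
Z_CUT = 1.96  # two-sided z cutoff corresponding to p < 0.05
MIN_N = 100   # first sample size at which we peek

def aa_test(rng):
    """One A/A test: both variations convert at an identical 50%,
    so any 'significant' difference is a false positive."""
    a = rng.random(N) < 0.5
    b = rng.random(N) < 0.5
    n = np.arange(1, N + 1)
    s1, s2 = np.cumsum(a), np.cumsum(b)
    pooled = (s1 + s2) / (2 * n)
    se = np.sqrt(np.clip(pooled * (1 - pooled), 1e-12, None) * 2 / n)
    z = np.abs(s1 / n - s2 / n) / se
    peeked = bool(np.any(z[MIN_N:] > Z_CUT))  # stop at first crossing
    fixed = bool(z[-1] > Z_CUT)               # look only once, at the end
    return peeked, fixed

results = [aa_test(rng) for _ in range(SIMS)]
peek_rate = sum(p for p, _ in results) / SIMS
fixed_rate = sum(f for _, f in results) / SIMS
print(f"false positives with peeking:     {peek_rate:.2f}")  # far above 0.05
print(f"false positives at fixed horizon: {fixed_rate:.2f}")  # near 0.05
```

Stopping at the first crossing declares a difference in a large fraction of these null experiments, while testing once at the committed horizon stays near the nominal 5%.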

right okay so um the point is that in AB testing peeking and it's not just the peeking obviously right if you just look it's not like that changes anything the problem is the peeking and then deciding

whether or not to continue the experiment on the basis of what you see can dramatically inflate false positives right and I want to try to help you develop some intuition for why this happens it's not obvious at all that

this should happen and I want to I want to give you some intuition for that before I do that let's just dig in a little into those numbers how bad can it be so what I'm plotting here is the the false positive probability that you

would obtain if you follow this rule of stopping the first time the P value is below a cutoff okay does it make sense everyone like I'm watching the experiment unfold wait for the first time the P value is

below a fixed cutoff so I'm plotting three curves here for alpha being 0.01, 0.05, or 0.10 so 1% 5% 10% and you know this is the curve we were just

talking about a second ago the 5% curve so at 10,000 observations if you allow the experiments in principle to run out to that far then 60% of them are going

to end up with early stopping okay about 60% you know and then even at 0.01 where you're expecting 1% false positives you have over a 20-fold inflation in the

number of false positives right um okay so let me first talk a little bit about why this is happening right and again if this is something that you're familiar with I apologize

but I think it'll be worth kind of just spending a couple minutes on it so here's a graph of one minus the P value what's sometimes called the statistical significance or the confidence level okay in some of these dashboards which

means here what you should be thinking is when the graph is high then the P value is low right and in particular that red line there is like the P equals

5% cutoff so above that is like 95% significance so this is like the 0.95 line right here okay and what you're

watching is the trajectory of one minus the P value over time now the rule of stop the first time the P value goes below 5% is equivalent to stop the

first time this graph goes above the red line this is an actual graph that was in the Optimizely dashboard a version of it is still there and so that would mean you would stop right here

okay um and even if you're a little bit conservative you would have an opportunity to stop again down here potentially like maybe you want to wait a little longer now this is an AA test okay so this is a test where in fact there's no difference between the two

things right and you can see that even this is crossing twice above that line so let's first try to understand why this causes an inflation of false

positives here's the basic issue what fixed sample size testing says is that I commit in advance to how many visitors I'm going to wait for and I guess the thing I want to point out to you here is

that the x-axis is real time but for a second just pretend it's visitors all right it's not that big of a difference actually so let's suppose that the commitment I make in advance is I tell myself

I'm going to run the test out to here now the statement that I have a 5% false positive probability okay the validity of that statement comes from the fact

that if I ran 100 AA tests from here all the way out to here and waited until this point at most 5% of them would be above the red

line right does that make sense so a key condition in that is I'm only looking at this point in time that makes no claims about what goes on in the middle all right so now

suppose I take those 100 tests think of each of them as one of these trajectories laid out in front of you and you just walk your way down the line and in each one wait for the first time it actually crosses the red line so you're just cherry picking right in

every single one you allow yourself to stop the first time it crosses the red line the point is that although only 5% of them are going to be above the red line if you had waited to here in fact

60 of them are above the red line at some point in this period of time all right that's the basic problem so the problem is favorable selection of the sample path on the basis of what

it's doing before you hit the terminal point all right um and I think so coming back to why AA tests are interesting the reason I found this example so interesting is that Optimizely had data on

AA tests and the reason they had it is because customers would run AA tests as a way to check the platform so basically like okay let's say you sign up for an AB testing service something that's going to help you run experiments and

you're like I don't know if this is actually going to do what I need what's the first diagnostic test you could run run an experiment where you know what the answer is supposed to be that the two things are the same so naturally when you run an experiment and the two

things are supposed to be the same and you get significant results what do you do you call the customer support team and you're like wait a second I ran this experiment the two things are the same what's going on and I got to say I

really feel bad for the customer support team that there's no easy answer to this question like one answer is that the customer is wrong right because we gave you the right statistics you just use

them incorrectly and that's not something you're going to tell someone on a customer support call and the other obvious answer is we'll look into it for you which is ultimately what we did but there's not a lot to look into I mean in

the end this is supposed to happen this fraction of the time and there's not really a good answer to that right so um you know I think I already said all this so basically like the key point is

that if you wait long enough there's a high chance of an eventually inconclusive result looking significant along the way yeah so just echoing this in maybe terms that many of us have

experienced you have uh the Golden State Warriors playing the Lakers or something there's the question of who wins the game and maybe the Golden State Warriors win 95% of the time but what this is saying is is there any point in any of these games in which the Lakers were

temporarily leading and that happens a ton of the time that's right exactly yeah that's a really good way to think about it basically that's right there's a huge difference between asking whether the P value is ever going to look significant along the way and whether it

actually looks significant at that fixed point there's nothing about the design of farmer Ron's method that's allowed to let you look at it continuously and know that you would still have validity yeah oh sorry oh sorry Michael you were going to

say something this is maybe very fast but let's say we had picked one and we picked you know halfway yeah and you waited there you would still get the correct performance at that point right that

like maybe things are always moving around but just so happens that only five of them would end up above the yeah for any fixed point in time you're okay and actually you can get more sophisticated than that you can ask questions like suppose I only use two

points in time so I look at the first and decide based on that whether to continue or not and then look at the the next if I continued and you know those are adjustments that have been made in things like clinical trials where you'll

stage the clinical trial and then you'll run part of it look to see what happened and make a go no go decision on the next stage and then it's easy to kind of back out from that what are the right stopping rules to use and things like

that um and at some level actually the methodology we use is really like a smooth version of the stuff that's done there so James you had a question so does the platform give the person designing

an experiment like a way to do a power analysis or something to make an estimate to say this is how many users or how long you should run the test yeah so that link that I pointed out on

the dashboard that how long should I run my test link is actually a power calculator it's basically a sample size calculator and I think again the consideration there is that that calculator is supposed to be run before

you run the experiment now you raised an interesting point so when I talk to people in industry about this like one common response is to do the following which is to run the test and then while you're running the test

you apply the following two-stage rule the first thing you do is you wait to see is the P value below 0.05 and okay if it's not you definitely don't stop but let's suppose it is now you apply a secondary diagnostic and the secondary

diagnostic is to take what you've observed the current effect that you're observing the current difference and run this sample size calculation using the current difference as the difference you're trying to detect and and then

that'll spit out how many samples you're supposed to wait for and as long as you've waited at least that many samples then you stop the test and it turns out so that's what's called a post hoc power

calculation that's has exactly the same issue it's going to lead to exactly the same inflation kind of inflation but is even showing the value wrong from the

perspective of how the statistics were designed absolutely it shouldn't be showing it it could show you yeah this is 10% better that's better but it should be saying you need to do this many more people

yeah actually okay approximately and we're not going to say what the P value is you reach there or some stopping condition that's a great point so hold that back for a sec I'll come back to that in a minute yeah would it be valid

for someone uh looking at this graph to ask themselves what percentage of the time am I above that P value line and use that to get an idea of the significance of these

results yeah um I mean definitely so I'll talk to you a little bit as we go along um I'm hoping to be able to spend at least 10 minutes just to give you some of a peek inside the methodology and um roughly speaking I mean that's

the idea like if you want to make it possible for people to continuously monitor you have to understand something about how frequently these kinds of fluctuations can occur and then you have to control for that so I think at at at

at some technical level I think what you're describing is roughly the approach that has to be taken and I'll try to I'll try to formalize it why don't you stick the customer support into the program um so to anticipate the

question you know what I mean yeah okay so so all right since I I love this audience there's so many suggestions of how to solve this problem right um and so let me let me first frame the problem correctly because this is the problem we

tried to solve and then we'll talk about some potential Solutions so um what would the user like first of all I want to just be precise about this because again I want to emphasize that they're not doing something entirely stupid they actually have a good reason for looking

at the tests early and the reason the thing the user would really like is they would like to be able to choose when to stop adaptively, adaptively meaning as they're watching the data and the reason is because they really want to trade off the sample size against detection of

true positives if it feels to them like you're not going to likely detect something here they would love to be able to stop sooner because there's an opportunity cost to not doing other stuff not running other tests with your

site right um and you know what they'd love is then at that point to use this simple rule that if the P value is small enough the result is declared significant so fixed sample size testing is really nice because it optimally

balances false positive and true positives but it's not adaptive on the other hand adaptive testing while meeting the first goal is something which massively inflates false positives if we use the same P values right and so

when I was confronted with this the first thing I did was talk to my colleagues here and I asked okay what should we do so the first suggestion was you know uh in a safe for kids version

abbreviated as the RTFM approach and that was um well basically like and I guess at some level this is what you know one version of putting support into the platform would be you know to just tell people here's how you run a

fixed Horizon experiment a more extreme version would be what James suggested we actually don't even allow them to look at the P value until the Horizon right and there's definitely some companies that run their tests this way where it's not

even possible you're not allowed to look at the P value or reject or anything like that before some fixed Horizon run length that you've committed to okay so that's definitely one solution and this is a convenient solution because it

allows us to basically say you know our methodology was right you just don't know what you're doing as the user and therefore like the right fix is to fix the user okay um now I mean I'm in an HCI seminar so I hope you can guess what

the next thing is that I'm going to say which is that you know I I think it's very hard it's very hard to Envision a scenario where you change the user so our approach was to start by presuming that there had to be some reason why they're doing what they're doing and

like I said they have this opportunity cost to waiting too long and instead the goal is to commit to this as a constraint right I mean it's not stupid of them the data is coming in in real time so if you just think about it from

the perspective of like the range of things you could possibly do shouldn't you only be able to do better if you're taking advantage of the data as it's coming in right like it's a weird constraint to impose that I refuse to

look at the data as it's coming in so the question isn't so much like to read the textbook necessarily but rather can we allow them to be adaptive and still give reasonable inference and reasonable control okay so we actually committed to

this as a constraint and then said can we design a procedure that does a good job relative to that constraint okay and so the approach basically that we took was to change the P value so we kept everything else the same but we changed

the notion of significance and how do we change it and this is what I'll dig into a little bit with you to show you what we did we allow users to continuously monitor the test so that's the commitment to that constraint they can be

adaptive and then um we report what we call an always valid P value so this is something where no matter how the user chooses to stop the test there will be no more than a 5% false positive probability and in a moment I'll walk

you through like I said on the board kind of the basic idea behind how that works um the method that we used I won't be able to talk about all the technical results related to this but I'm happy to dig into this more with anyone here I'm

at Stanford obviously so uh you can catch me anytime um but the kind of the key results behind our approach are that we're able to provide a nearly optimal balance between test length and detection of true positives while

controlling false positives the way we said here and allowing the user to be adaptive so sort of the way we think about it is that we're trying to recover something which is giving kind of a good balance between sample size and

detection to the user while controlling false positive probability and most importantly allowing them to be adaptive all through the same type of user interface that they're used to which is this like P value being reported in the dashboard okay I think there's a longer

conversation about whether P values are a good idea for this kind of stuff and I think I'm happy to have those conversations as well as as those of you at lunch know but um but at least let me tell you a little bit about how this

works right um so I'm going to actually pause here on the slides and move to the board but before I do let me just ask if there's any questions or more comments I and I guess probably the biggest question is like yeah so what

did you do right and I'll I'll try to explain that a little bit okay all right so could you raise the screen um and then turn the lights back

on do I wave Michael oh there there you Michael said wave and the screen would magically go up awesome thank you and if there's lights here that'd be

great to put them back on the board oh can you raise it up you can turn the screen off entirely actually and just raise it up so I can use the board see I think I can start here but

it would be nicer if the screen was gone thank you let me just pull this out okay so um so here's the basic idea

all right I'm going to look at a very very simple example for a second so this requires a little bit of suspension of disbelief on your part but hopefully you'll bear with me all right which is I

want you to imagine that I'm giving you a collection of values um say x0 X1 X2 these are all independent values that

come from a normal distribution all right that has a mean let's call the mean M and a variance of one okay um and the only thing I'm going

to assume is that my goal is to use the data to figure out what the mean was so like just intuitively right that like thinking about the law of large numbers and things like that what I'd expect of course is that if I take the sample average of these things that's going to

look about like the mean and the more of them I have the closer it's going to be to the mean right so that's how I would guess what the mean is the natural thing to do is to stare at the sample average the more of them I get the closer I know it's getting to the mean

I actually know a bit more than that right we have the central limit theorem the central limit theorem basically tells us that not only is this thing going to look like the mean eventually but if you ask what's the variance of the sample average or like what's the

standard error this is hopefully you know something that you you might have seen before the standard error is proportional to one over the square root of the number of samples that I have so that's like the spread in my estimate
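The 1/√n behavior of the standard error described here is easy to check with a quick simulation; a minimal sketch (the function name and constants are illustrative, not from the talk):

```python
import math
import random

# Estimate the standard error of the sample mean of n draws from N(m, 1)
# by simulating many independent sample means and taking their spread.
def empirical_se(n, reps=4000, m=0.0, seed=0):
    rng = random.Random(seed)
    means = [sum(rng.gauss(m, 1.0) for _ in range(n)) / n for _ in range(reps)]
    mu = sum(means) / reps
    return math.sqrt(sum((x - mu) ** 2 for x in means) / reps)

# The spread shrinks like 1/sqrt(n): quadrupling n halves the standard error.
for n in (25, 100, 400):
    print(n, round(empirical_se(n), 3))
```

With these settings the printed spreads come out near 1/√25 = 0.2, 1/√100 = 0.1, and 1/√400 = 0.05, matching the 1/√n rule.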

okay so the way I think about getting a good estimate is the way I pick my sample size n is so that 1/√n is about the tolerance that I'm willing to

accept in my estimate of the mean does that make sense okay all right so now let's let's just do all our analysis in Sample mean world so what I'm going to do is I'm going to plot for you kind of

what's happening over time uh let let's make that n sorry so it's a number of samples that you have okay and what you're going to do on this

graph is visualize a sample path that depicts the sample mean okay now what I'm going to do is I'm going to I'm going to ask myself that the equivalent of testing whether there's a difference

between the two A and B cells in this simple setup is going to be to try to figure out whether m is different from zero or not okay so um if m is not

different from zero which m is actually zero but I declare it to be different from zero that's a false positive my goal is to control those if m is different from zero and I fail

to note that then that's a false negative and I want to control that too I want to do a good job on both sides okay so when you're thinking about what the sample mean is going to be doing over my samples if m is

actually zero you should visualize something which is swinging all over the place but where the range of the swings is about 1/√n okay and then as n goes to infinity it's going to converge

into zero if the sample mean is different from zero I'm going to get the same kinds of swings but away from zero and eventually it'll converge into what the right value is so here's the problem

the problem is that when I claim that I will stop the first time that the P value is below .05 like using the classical fixed Horizon P value that is

equivalent to stopping the first time that the sample mean crosses a boundary that looks like it's proportional to 1/√n so I'm just going to draw 1/√n

here okay and so this boundary looks like a constant divided by √n and you know the same thing this is minus the same

constant divided by √n so when you're watching your sample mean, the equivalent of waiting for that A/A test to go above the red line is waiting for the first time your sample

mean crosses these dash lines and saying I'll stop now okay so the problem is that there's a result in probability that tells us that this line is going to get crossed infinitely often by the

sample mean okay not just once but infinitely under the assumption that the true mean was zero right it's obviously going to get crossed if the true mean is non zero right because if the true mean is non

zero eventually I converge to the true mean that's definitely going to cross one of these two dash lines right whether it's negative or positive but if the true mean is zero it's not obvious that it's going to cross these two dash

lines right the the limit will actually lie on this line okay but it turns out that there's a result in probability that says you're actually going to cross this infinitely often so you're guaranteed to make a mistake if you were

to wait indefinitely so you know that 50% or 60% that we were talking about when I did the survey that would be 100% if I allowed you to wait as long as you wanted you'd be guaranteed to cross okay

and again you would be guaranteed to cross again and again so even waiting for multiple crosses would not save you so what you have to do is you have to fundamentally just be a lot more

conservative in this boundary so 1/√n is not good enough in fact the result says something stronger it's not just that 1/√n will be crossed infinitely often it's that √(log log n / n) is

going to get crossed infinitely often right which is which is a boundary that lives above this one so in fact the first boundary where I can guarantee that I only cross with some fixed

probability that's not one is something that looks like √(log n / n) okay and so if I could draw the picture in to make it

more visual basically what I mean is there's some other curve that looks like this and what these Define sorry that shouldn't tilt back down it goes to

zero so these curves also have the property that they eventually converge down to zero, √(log n / n) goes to zero as n goes to infinity, but they have the property that

if the truth is that m is zero and you watch the sample mean the probability that it crosses one of these two black lines the outside black lines is actually strictly less than one and by

tuning the constant in front of this so there's another constant let me call it constant-tilde times that so tuning this constant which basically involves moving this in and

out a little bit I can actually adjust what that probability is now what is that probability that's my false positive probability right so what we do is we basically pick the constant here so that the probability that you ever

cross the black lines is bounded at exactly what your desired false positive control is okay so that's the thing that gets you the false positive control is

to shift the boundary out the reason this is the thing we use like why don't we just use something where the boundary is flat like just like that right that would ensure that at least I would have false positive control the reason we

don't do that is because then if if the truth is different from zero I'll never cross that boundary or there's a good chance I won't in that case I won't detect real effects right so this

boundary is still nice because it still goes to zero as n goes to Infinity which means if the truth is different from zero remember the sample mean is going to going to center around that different

the truth that's different from zero will eventually cross this black boundary so I'm guaranteed to cross if the truth is different from zero and I can control how frequently I cross if the truth is equal to zero yeah doesn't

this assume that I know the standard deviation so if I don't know the standard deviation that I'm sampling from like I don't know exactly how wide out these things yeah I think there's a lot of really good questions there so there's there's some assumptions about

what the variance is in here that are important so for a given test for example I don't know how much variation there will be I think I think that's right so um I I guess so there are a couple different things I

have to say about that I think if you were to translate from this to the AB testing framework right one is the issue that the variance might not be known the other is of course the issue that this is not even an A/B test it's just an

a test so I need two cells and then I have to ask well how does this change when I'm actually tracking the difference in conversion rates of two groups um I guess in the interest of time let me just say that that's where

some of the technical work goes to actually generalize this approach into a setting where those are some of the things we're dealing with um the particular issue of variance is actually not the most uh the most difficult one to deal with there there are a couple

other issues that prove to be more problematic um Let me let me defer that for a moment so I can I can kind of keep keep uh just make sure I get through a few few other key points I wanted to make so I know that this sounds a bit

technical right so let me Zoom it out again and and kind of summarize what the key idea is here so you give me um and move these two boards up so we're going

to on the on the next two boards I'm going to write in a slightly less technical way what the basic ideas behind what we do all right okay so the idea is that

first the user picks I mean they don't have to do this they can do this on their own but the user let's say user has a

desired false positive probability Target Alpha all right and then what we do is we say okay the rule stop the first

time the sample mean crosses constant times √(log n / n) can guarantee

Alpha if constant is chosen right okay and then if the

truth is different from zero this rule is guaranteed to detect eventually
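A sketch of this summarized rule in code (the constant C = 3.0 is purely illustrative; in the actual method the constant is tuned so that the crossing probability under a zero mean equals the target alpha):

```python
import math
import random

# Stop the first time |sample mean| crosses C * sqrt(log n / n); a crossing
# is a declaration that m != 0. C here is an arbitrary illustrative value.
def adaptive_stop(xs, C=3.0):
    total = 0.0
    for n, x in enumerate(xs, start=1):
        total += x
        # skip n = 1 where log(1) = 0 makes the boundary degenerate
        if n > 1 and abs(total / n) > C * math.sqrt(math.log(n) / n):
            return n  # stopped: declared significant at sample n
    return None  # never crossed: no detection

# Under a true effect (m = 0.5), the rule detects, i.e. eventually stops.
rng = random.Random(1)
print(adaptive_stop(rng.gauss(0.5, 1.0) for _ in range(5000)))
```

The stopping time shrinks as the true effect grows, which is the adaptivity the user wanted.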

okay um so there are a couple more points I want to make um so this is kind of a summary non-technical summary of what I set up here and there are a couple more points I want to make so the first one is how do we get p values out

of this okay the way we get p values is that we think about we we give this time a name okay so let me call this

time T Alpha so T Alpha is exactly this time here you fix Alpha and then it gives you what's the first time you would stop um what's the first time that this boundary gets crossed okay with the

right constant being chosen based on Alpha so the rule we give is that the P value at any given n is the lowest Alpha such that t Alpha

was before n you've got a family of these right now right so for every Alpha there's one t Alpha and some of them have not happened yet some of them you know given the data

you have so far you wouldn't stop so you and and of course they're nested right you're going to become more conservative when Alpha gets smaller so what you can do is you can say all right given where

I am right now what is the smallest Alpha at which I would have stopped by now okay and that's the P value that's

the P value we display in practice you just calculate this out from .01 to .99 and just show it in practice it actually turns out to be much simpler than that there's a very simple like static

update that does it you don't have to compute all the T Alphas separately you just compute one number and and it's enough yeah um so I think like as I mentioned I had to suppress some of the

technical details so it turns out that maybe if I say one sentence about it and so if you want to tune out for a second the next the next 30 seconds or so are going to be slightly technical um the

actual way we run the test is we compute um a version of a likelihood ratio statistic between the alternative and the null and then um what you're looking for is just uh how

large the likelihood ratio statistic is relative to one over Alpha and so you only need to track that one number the likelihood ratio statistic and then the smallest Alpha is actually just you know the is related only to that so you don't

need to compute all of them separately yeah so that's the rough summary okay so I've got a bunch of these so the P value definition is the

smallest Alpha such that T Alpha so this is the P value at time n is the smallest Alpha such that t Alpha is less than or equal to n and

that's the P value um this has two nice properties so one is now imagine that somebody stops the first time the P value Falls below Alpha what does that amount to stopping the first time the P

value Falls below Alpha is stopping at T Alpha by definition okay and as we said t Alpha is chosen so that you've guaranteed false positive prob Alpha so this

actually coming back to your way of thinking about things of controlling these additional fluctuations that's exactly how it's happening we first work out what is the range of these fluctuations and then we set T Alpha up

to make sure we've controlled them and then now um when if if the user stops the first time this crosses Alpha it's it's exactly the same as controlling the fluctuations correctly um that's one

point the other point though that's nice is that it turns out that's not the only rule for which you would get false positive probability control no matter when the user chooses to stop in an Adaptive way their false positive

probability would be would be bounded by Alpha and like that's an additional few lines of proof from here um you know that that I'll skip for now but that's kind of a nice thing about these P values so these are this is the nature

of the always valid P values that we use um I want to make one comment about the methodology like the the actual the procedural test that we use this this likelihood ratio statistic that I was mentioning to Michael it's something

called the mixture sequential probability ratio test and it's not our idea it's something that actually has been in the literature for a long time um so Herb Robbins who was a statistician here and passed away and David Siegmund

who's still a statistician here um developed kind of the foundations of it and then Tze Lai who's also a statistician here kind of did a lot of work to build out some of the theory around it and the statistic that we use these T

Alphas that we construct I guess picking the right constants here comes from their calculations some of the stuff that they did okay all right so I want to show you a few more slides that kind of give you some sense of how this works
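For completeness, in the simple known-variance normal setup sketched on the board the mixture SPRT statistic has a closed form, and the always-valid p-value is just a running minimum of 1 over that likelihood ratio. A sketch (the mixing variance tau2 is a design choice I've set to 1; this is an illustration, not Optimizely's production code):

```python
import math

# Always-valid p-values from the mixture SPRT for X_i ~ N(m, 1), testing
# m = 0 against a N(0, tau2) mixture over alternatives. The p-value is the
# running minimum of 1 / likelihood ratio, so it can only decrease.
def always_valid_pvalues(xs, tau2=1.0):
    s, p, out = 0.0, 1.0, []
    for n, x in enumerate(xs, start=1):
        s += x  # running sum of observations
        lr = math.sqrt(1.0 / (1.0 + n * tau2)) * math.exp(
            tau2 * s * s / (2.0 * (1.0 + n * tau2))
        )
        p = min(p, 1.0 / lr)
        out.append(p)
    return out

# A strong effect drives the p-value below .05; pure zeros never do.
print(always_valid_pvalues([1.0] * 100)[-1] < 0.05)   # True
print(always_valid_pvalues([0.0] * 100)[-1])          # 1.0
```

Stopping the first time this p-value falls below alpha is the same as stopping the first time the likelihood ratio exceeds 1/alpha, matching the one-number update described above.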

okay and what's good about it um so let me let me wrap up with that um okay

[Music] okay this is almost perfect except for the fact that the screen's not there yet

let see try reloading one more time all right there we go uh oh second

let try some more time okay so last thing I want to talk to you about is oh sorry go ahead so T Alpha is the the first time

that so say you fix Alpha what that determines is the constant in front of the √(log n / n) so T Alpha is the first time at which you cross those two solid black lines with that particular

constant in front of the √(log n / n) um so for each Alpha you get kind of one of these pairs of black lines and of course the more conservative you are the further out the black lines go okay

so as Alpha gets smaller the black lines go out more and so as Alpha gets smaller the T Alphas get longer it takes you longer to hit one of those two black

lines and uh Alpha is your false positive control yeah the 5% from our examples okay all right so let me wrap up with one last comment so the

test looks good because it's giving you two things it's giving you a control false positive probabilities and it's giving you detection eventually so um the one problem with that is that it

might give you detection eventually but what if it takes forever to get there right and so one thing we wanted to do is we wanted to compare How We Do relative to a test where you you know

you've committed to the fixed Horizon run length in advance okay and so here's kind of how I would think about that you have to give up something in return for continuous monitoring of the test um even though you've got the data coming

in continuously now you're now protecting yourself against a much wider range of things the user could potentially do okay and so um here's the basic question if the effect size is known

in advance meaning the actual gap between A and B is known in advance in principle you should be able to do only better by continuously monitoring right because you've just expanded your feasible region of things you could do

with that knowledge if you knew what the Gap is supposed to be then you know you you know you can you can perfectly set up a rule to stop so that you do better in practice the problem is that you don't know the effect size in advance
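The fixed-horizon baseline being compared against here is a standard sample-size calculation; a textbook normal-approximation version (my sketch, not necessarily the exact calculator the platform uses) also makes the 1/Δ² dependence mentioned at the end of the talk explicit:

```python
import math
from statistics import NormalDist

# Per-cell sample size for a two-sided fixed-horizon z-test detecting a mean
# difference delta between two cells with per-observation sd sigma, at false
# positive rate alpha and false negative rate beta (textbook formula).
def fixed_horizon_n(delta, sigma=1.0, alpha=0.05, beta=0.2):
    z = NormalDist().inv_cdf
    n = 2.0 * (sigma * (z(1 - alpha / 2) + z(1 - beta)) / delta) ** 2
    return math.ceil(n)

# Run length scales like 1/delta^2: guessing the effect 2x too small
# quadruples the required sample size.
print(fixed_horizon_n(0.10))  # 1570
print(fixed_horizon_n(0.05))  # 6280
```

Plugging in the guessed effect size gives the denominator of the ratio plotted in the figure; getting that guess wrong is exactly what the adaptive test protects against.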

and the test that we designed does not assume any knowledge of the effect size and so what we want to do is we want to compare How We Do against the fixed Horizon test using data from optimizely so this is kind of a graph that takes some time to process so let me walk you

through it so first of all um there's four tiers that when we launched this at Optimizely, bronze silver gold and platinum, roughly ordered in terms of the number of customers they would have in experiments that they ran actually what

you were doing is paying for the amount of visitors running through Optimizely that you would use per month so what am I

plotting here so I'm plotting um a ratio of how long our test takes against how long a fixed Horizon test takes now

for what is that I use um a sample size calculator to determine a sample size under the following assumptions that I know what the effect

size is that I have a particular false positive probability I'm trying to hit and I have a particular false negative probability I'm trying to hit once I fix those three things that determines the sample size on a fixed

Horizon test I need to be able to meet those constraints okay so the first thing we do is we imagine we we took about 10,000 historical experiments from optimizely that had run long enough that

you could kind of treat the final effect as like the true effect size and we use that in the sample size calculator to work out how long you would run a fixed Horizon test that's the denominator the

numerator is that we used our testing procedure on top of the same data and asked okay how when would you stop when's the first time the P value would cross the the cut the cut off so the first graph I want to point you to is

the solid black lines which don't look particularly good so these ratios remember it's a ratio of ours to to to fixed Horizon testing it looks like it can range as high as three so that be

like a 3X inflation in the Run length of the test okay which sounds terrible but that 3x inflation presumes that you perfectly knew in advance what the actual effect was going to be right now

suppose that when you calculated a sample size assuming something about what the effect is going to be but you got the effect wrong so you were off by 30% or 50% which is not unreasonable if you're looking for like a 1% conversion

rate difference being off by oops oh I forgot to turn caffeine on um being off by 50% on a 1% conversion rate difference would basically um

would amount to having you know a true effect size which is a 0.5% difference instead of a 1% difference so that's not unreasonable in practice and now you can see like let's look at the

the dash line or the dotted line here that's a case where you're off by 50% on the effect size estimate and now run lengths are typically much shorter than the fixed Horizon run length so this diagram for example most of the weight of that

histogram is less than one so our ratio is less than one we we stop faster than the fixed Horizon test and the red line is really the one that I would draw most of your attention to which basically says now suppose that the effect size

comes from a prior which is the distribution of effects we've seen across optimizely and like compare the fixed Horizon testing for that relative to the fixed Horizon testing Sorry relative to the Adaptive sample size

testing that we do and what you find is that in general you're stopping earlier in particular the mean of all these distributions is less than one okay so on average you stop faster sometimes you still stop later but on average you stop

faster and here's the reason for this I'll leave you with this which is that um if you're wrong about the effect, detecting an effect of size Delta takes you a run length that's

about 1/Delta² and the reason is that this thing I told you that there's a 1/√n uncertainty in your estimate right so if you're trying to detect something that looks like Delta and you have a

1/√n uncertainty you need about 1/Delta² samples to resolve that uncertainty so the penalty for guessing wrong about this should be a capital Delta sorry the penalty for guessing

wrong about this is really high being off by a little bit like an effect size that's 2x too small actually takes 4x too long so it's a huge huge mistake when you're even slightly off about the

effect and our test because it's adaptive automatically kind of adapts to that right um so I'm going to stop there there's a bunch of stuff on multiple testing that I didn't plan to cover today but I just want to leave with this

note which is what I really love about this kind of work is the user is like a fundamental component of the modeling um so and there's a bunch of other things I'm thinking about that are kind of like this where I really think we have a lot

of work to do on the methodological side to develop ways of, you know, presenting statistics to users that can actually lead to good decision-making within the context of what feels

reasonable and plausible to them um and you know I think you guys can think about similar examples of this all over data science as it's currently practiced you know I found it fascinating because you've got these like

relatively sophisticated tools being applied and just a small amount of misuse can completely destroy their validity so you know that's not the only case where that's happening and I think I think it's incumbent on all of us both

you know in HCI and in kind of on more on the statistical side of things to do a better job of helping people make good decisions from the data they have so I'll stop there thank

[Applause] you question yeah go for it I guess I'll leave this up yeah let's jump over to the humanities and sciences and social science

side of campus where everyone's going on about the replication crisis in Psychology right can

you or on blogs um the equivalent can we fix that through something like what you're doing like as you're describing this I'm remembering when I was doing undergrad research you know I would talk to

friends who were also spending the summer doing like psych research and they'd say like oh man we almost had it statistically significant but then it like swung away so we're going to get some so now we need to up our n like in theory everyone does a power analysis

and it's like okay so we need 100 people and they do that but in practice if it's like .06 you keep running you keep running yeah and like it seems like basically exactly the same problem

and I'm wondering if you know we should just not be doing this just for AB testing online but instead do this for like my k ERS and everything else that

we sort of like now we assume not exactly the worst of human of human beings but that like we're you know we're sort of really yeah I so I actually have a few different comments I want to make about what you just asked the first one is you might have heard

the term P value hacking and it's often brought up in in the same sentence with replication crisis and it's basically that like P value hacking means P values mean one thing but people will pull all

kinds of tricks to try to get significant results right and um and so the first thing I wanted to point out is that you know I I'm I'm completely comfortable with the criticism of some

of the criticism of P values but one thing I worry about is that it it's putting all the burden of of failure on the statistic instead of something about the user model right and so that's one

thing is that I I think like embedded in your question is that part of the problem is is how people interact with P values and what they use them for and and then how we interpret the consequences of that interaction so

second comment I'll make is that in in in you know if you start applying this more broadly I think there's some balance of James's kind of rules approach where you like fix boundaries that people can't go past and

flexibility and so like I think in you know in Psychology and some of the journals anyway you actually have to pre-file what your run length is going to be what your experiment is going to look like what are the confirmatory variables that you're exploring then you

can do other stuff if you want but all that stuff is set in stone now and you can only manipulate it up to a certain point um and I guess the last point I'll make is that this is more just to make sure that all of you are curious enough

to think about these things. This is one example of how looking at data coming in from an experiment can go wrong. There are two others that I think everybody should know about, and those of you that took 226 (I see Sam in the audience, so he knows this): one is post-selection inference. An analog of that would be what you just brought up, Michael, where I first look at the results of the experiment and then use what I learned to figure out which hypothesis test I'm going to report. That's a major cause of the replication crisis; it's a form of p-value hacking, and it's related to what we just talked about here. The other one is multiple hypothesis testing, which, as I noted, I didn't get a chance to talk about. That's basically that when you run 100 hypothesis tests, with lots of variations and lots of metrics, you should expect five of them to be significant just by chance, because you had a 5% false positive cutoff. And that's not useful inference, because now you're staring at a dashboard and your eyes are drawn to the green and the red. So what you need is something that helps protect people against the human tendency to focus on the stuff that's significant, in a world in which it's already known that some of those things are false positives.

With Optimizely, let's say you took 10,000 tests there. How do they go to sleep at night knowing that a huge number of those are false positives?

No, so in addition to what I talked about, we actually implemented

control for multiple hypothesis testing. So if you run 10,000 hypothesis tests, your false discovery rate, which is the number of significant results that are actually false divided by the total number of significant results, would still be controlled. And it's not controlled across all of Optimizely; it's controlled across each customer's experiment.

So should we apply FDR control across all of Optimizely? That's an interesting question, and I think it's a longer conversation. It would probably lead to making no decisions whatsoever, but that is one way to control false positives: just don't make any decisions at all.

Yeah. All right, should I stop there? All right, let's thank our speaker. [Applause]
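The false discovery rate control mentioned in the answer isn't spelled out in the talk. A standard way to control FDR across many simultaneous tests is the Benjamini-Hochberg step-up procedure; here is a minimal sketch of that procedure (an illustration only, not Optimizely's actual implementation, which the talk does not detail):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean mask of "discoveries" such that the expected
    fraction of false discoveries among them is at most q.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest k (1-indexed) with p_(k) <= (k / m) * q.
    thresholds = (np.arange(1, m + 1) / m) * q
    below = sorted_p <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # 0-indexed cutoff
        discoveries[order[:k + 1]] = True  # reject all tests up to k
    return discoveries

# A lone p-value of 0.04 clears a naive 0.05 cutoff, but among
# several null-looking results the procedure makes no discoveries:
print(benjamini_hochberg([0.04, 0.5, 0.6, 0.7]).sum())  # prints 0
```

This is the "green and red dashboard" problem from the answer above: with a raw 5% cutoff per test, 100 true nulls yield about five spuriously significant results, whereas the step-up threshold tightens as more tests are run, keeping the fraction of false positives among the reported discoveries controlled at `q`.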
