1 - A Brief Introduction to Causal Inference (Course Preview)
By Brady Neal - Causal Inference
Summary
Topics Covered
- Causal Inference Is About Intervention Effects
- Why Correlation Does Not Imply Causation
- The Paradox That Flip-Flops Your Data
- Randomization Is Magical
- Why Some Causal Effects Are Forever Unmeasurable
Full Transcript
Hi everyone, welcome to "A Brief Introduction to Causal Inference." In this talk I'll give you a preview of the first few weeks of the course, focusing mainly on motivation and intuition. This talk isn't meant to give you a complete understanding of every topic I touch on; I'll cover enough topics to give you a good idea of the basics of causal inference, but a complete understanding will have to wait for the first few weeks of the course. You might have some machine learning topics in mind, like out-of-distribution generalization, and wonder how causal inference relates to them. Those won't appear in this talk, because they aren't among the basics of causal inference, but they will appear later in the course.

All right, with that, let's get started. What is causal inference? Causal inference is mainly about inferring the effect of one thing on another: the effect of any treatment, policy, or intervention. Some examples: you might want to infer the effect of a treatment on a disease. Or say you want to reduce emissions and have several different climate change policies for doing so; you'd want to pick the policy that is most effective, the one that causes the largest reduction in emissions. Similarly, say you notice a rise in bad mental health outcomes and think social media might be one of the important causes; you could do a causal analysis to see how important social media is in contributing to the problem, maybe what percentage of it. And there are many more examples. In general, whenever you have some X whose effect on some Y you want to talk about, that's what causal inference is for.

Here is the outline. First, a motivating example: Simpson's paradox. You might have heard of it before, and it turns out that the causal structure of the problem is absolutely essential to resolving it. Then we'll talk about why correlation does not imply causation; you've probably heard that before too, but hopefully I'll give you a bit more understanding of why it's the case and what important implications it has. In the last part of the talk, we'll cover observational studies: settings where you are simply given data and can't run any experiments, which is where a lot of causal inference research takes place these days.

All right, let's get into our motivating example, Simpson's paradox. Say that, in a purely hypothetical scenario,
there is a new disease, COVID-27, and there are two treatments for it, A and B, which we'll code as 0 and 1. It's your job to decide which treatment to choose for your country, and the only thing you care about is minimizing deaths: which treatment will cause the fewest people to die, or help the most people live? One important fact about these two treatments, which we'll use throughout this example, is that treatment B is much scarcer than treatment A. Your data come from the doctors in your country: they administer treatments and collect data on what happens afterward. Besides the treatment, you also have data on each patient's condition, mild or severe, which we'll also code as 0 and 1. And finally there's the outcome Y: each patient ends up either alive or dead. We're only looking at binary variables here, but a general theme in causal inference is that the analysis extends from binary variables to, say, continuous variables or multiple outcomes. Here is what your data look like at the treatment level. Among people who were
given treatment A, 16% died: 240 out of the 1,500 people who got treatment A. Among people who got treatment B, 19% died. Just looking at this, treatment A seems to be doing a bit better than treatment B; three percentage points fewer people die. But something interesting happens when you subgroup the data by condition. Among patients with mild condition, 15% of those given treatment A die, compared to only 10% of those given treatment B, so treatment B actually looks better in the mild subgroup. The same thing happens with patients in severe condition: only 20% of those who had severe condition and received treatment B died, whereas a larger 30% of those who had severe condition and received treatment A died. So how did these numbers flip? In some sense there's a paradox: if you look at the total population and ignore the subgroups, treatment A has the lower mortality rate and looks better; but when you look at each subgroup, treatment B looks better in both. This is Simpson's paradox.
It turns out the numbers work out just fine. The 16%, for instance, is 240/1,500, where the 240 is the sum of the numerators of the treatment-A mild group and the treatment-A severe group, and the 1,500 is the sum of the denominators: 1,400 plus 100. A more illuminating way to write these calculations is as weighted averages of the subgroup rates: the 0.15 here is the 15%, the 0.3 is the 30%, and so on. In the treatment-A calculation, the 0.15 gets a much larger weight than the 0.3, simply because most of the people who received treatment A had mild condition: 1,400 out of 1,500. So the 16% you see for treatment A in the total population largely comes from the big weight placed on the 15% among the mild-condition patients. In contrast, for treatment B a much bigger weight is placed on the severe group, because 500 out of the 550 people who received treatment B had severe condition; the 19% you see for treatment B in the total population largely comes from the 20% in its severe group. Simpson's paradox largely comes from this unequal weighting: the treatment-A patients mostly had mild condition, the treatment-B patients mostly had severe condition, and people with severe condition are simply more likely to die than people with mild condition. That's why the numbers flip.
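This weighted-average arithmetic can be checked in a few lines of Python; the cell counts (deaths and group sizes per treatment/condition cell) are the ones from the table in the talk:

```python
# Cell counts from the talk's table: (deaths, group size) per
# (treatment, condition) cell.
cells = {
    ("A", "mild"):   (210, 1400),   # 15% mortality
    ("A", "severe"): (30,   100),   # 30% mortality
    ("B", "mild"):   (5,     50),   # 10% mortality
    ("B", "severe"): (100,  500),   # 20% mortality
}

for t in ("A", "B"):
    deaths = sum(d for (tt, _), (d, _) in cells.items() if tt == t)
    total = sum(c for (tt, _), (_, c) in cells.items() if tt == t)
    # The total rate is a weighted average of the subgroup rates, with
    # weights given by this treatment group's own condition mix.
    weighted = sum(
        (d / c) * (c / total)
        for (tt, _), (d, c) in cells.items() if tt == t
    )
    print(t, round(deaths / total, 3), round(weighted, 3))
# A: 0.16 both ways; B: 0.191 both ways.
```

Treatment A's total rate leans on its mild subgroup (weight 1,400/1,500), while treatment B's leans on its severe subgroup (weight 500/550): exactly the unequal weighting described above.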
But the question still remains: which treatment should you choose? Hold that question in your head for a bit and see if you can come up with your own answer. The spoiler, as I'll show over the next few slides, is that the answer largely depends on the causal structure of the problem.

In scenario 1, the causal graph is: condition is a cause of treatment, and treatment and condition are both causes of the outcome. The important edge is that condition is a cause of treatment. In this scenario, treatment B is generally the better choice, and I'll give a specific example to illustrate the intuition. Say a doctor sees someone come in with mild condition. For most such people, the doctor might decide to assign treatment A, wanting to save the scarcer treatment B for people with severe condition, who are more likely to die. That's why, among people with mild condition, 1,400 out of 1,450 were assigned treatment A. Similarly, if someone comes in with severe condition, the doctor might be more likely to prescribe treatment B, reasoning that for this patient the scarce treatment is worth it. That's why, among people with severe condition, 500 out of 600 were assigned treatment B. So why is treatment B preferred in this scenario? Because the large 19% mortality rate among treatment-B patients arises mainly from treatment B being disproportionately assigned to people with severe condition, who have a higher chance of dying, while treatment A is disproportionately assigned to people with mild condition, who have a lower chance of dying. The correct numbers to look at are therefore the subgroup numbers, the ones in the mild and severe columns, and that's why treatment B should be preferred in this scenario,
when condition is a cause of treatment. Now, in scenario 2, the main conceptual difference is that treatment is a cause of condition; everything else in the causal graph is the same, with treatment and condition still causes of the outcome Y. In this causal graph, treatment A is actually preferred, and again I'll give an example to illustrate the intuition. Say you're prescribed treatment B. Because it's rather scarce, you might have to wait a long time before you can actually take it, and while you're waiting your condition could worsen: you might come in with mild condition and deteriorate to a severe one. That's why, among people prescribed treatment B, 500 out of 550 had severe condition; in this story, many of them transitioned from mild to severe condition (a different story from the one in the previous scenario). Now say you're assigned treatment A instead. Because treatment A is abundant, unlike treatment B, you don't have to wait at all, so if you come in with mild condition you'll probably still have mild condition when you actually take the treatment. That's why, of the 1,500 people assigned treatment A, 1,400 had mild condition; of the 100 with severe condition, probably none transitioned from mild while waiting, and most simply came in severe. The reason we prefer treatment A in this setting is that the treatment itself affects your condition: being given treatment B causes your condition to worsen, which in turn affects your probability of dying. There is an effect flowing through your condition that we have to take into account, and the way to take it into account is to look at the total-population numbers. The thing to keep in mind is that treatment B is partly bad in this scenario precisely because it causes you to have a worse condition, so we would prefer treatment A.

That concludes the motivating example. A quick recap of Simpson's paradox: we prefer treatment B when condition is a cause of treatment, and we prefer treatment A when treatment is a cause of condition.
So you have to decide which treatment to give your whole country. It's an important decision that will determine the lives of many people, and it hinges crucially on the causal structure of the problem.

Okay, with that, let's get into "correlation does not imply causation." You've probably heard this many times before; for machine learning people, the analogue is "prediction does not imply causation." Here's the example: say you're looking at data on people who sleep with their shoes on and people who wake up with headaches, and it turns out that most people who sleep with their shoes on wake up with headaches, while most people who don't sleep with their shoes on don't. The two are strongly correlated, strongly associated. You might think, "I probably shouldn't sleep with my shoes on, because I don't want to wake up with a headache"; that is, you might think it's a cause. But what if your data also show that most of the people who went to sleep with their shoes on had been drinking pretty heavily the night before, and those same people were the ones waking up with headaches? Then you might think the only reason we see this association is that there's a common cause: drinking the night before.

There are two ways to resolve in your head why shoe-sleeping is so strongly correlated with waking up with a headache. The first: consider the two groups of people, the ones who went to sleep with their shoes on (the shoe-sleepers) and the ones who went to sleep without them (the non-shoe-sleepers). These two groups differ in a very key way: almost everyone in the shoe-sleeping group drank the night before, and almost everyone in the non-shoe-sleeping group did not. Think of the fraction of each group that drank the night before: it's a completely different number. That explains why we can't look at these two groups and deduce a causal effect from them; the groups are not comparable. You would want the groups to be the same in every way except for the treatment, whether or not they went to sleep with shoes on. The other way to see it is confounding: because of the common cause, drinking confounds the effect of shoe-sleeping on waking up with a headache. Graphically, you should visualize a confounding association running between shoe-sleeping and waking up with a headache; that's in contrast to causal association, which would flow along the directed path from shoe-sleeping to headache. The association we observe, the total association, is a mixture of these causal and confounding associations.

Correlation is just one type of association. "Association" here is a synonym for statistical dependence; "correlation" is technically a measure of linear statistical dependence, but people frequently use it to mean statistical dependence in general. To avoid confusion, we'll just use the word "association" rather than "correlation." Let that sink in: if you measure correlation, or any measure of association, you're looking at a mixture of causal and confounding association, and that additional confounding association is why correlation does not imply causation. So, many of us have learned that correlation does not imply causation.
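To make the "mixture of associations" concrete, here is a small simulation of the shoe-sleeping story. All the probabilities are invented for illustration, and shoes are given exactly zero causal effect on headaches; the strong association comes entirely through the common cause:

```python
# Drinking is a common cause of both sleeping with shoes on and waking
# up with a headache; shoes have NO effect on headaches. (All the
# probabilities here are invented for illustration.)
import random

random.seed(0)
shoes, headaches = [], []
for _ in range(100_000):
    drank = random.random() < 0.5
    # Drinkers usually pass out with their shoes on.
    shoe = random.random() < (0.9 if drank else 0.1)
    # Headaches depend ONLY on drinking, never on shoes.
    headache = random.random() < (0.8 if drank else 0.1)
    shoes.append(shoe)
    headaches.append(headache)

p_h_shoes = sum(h for s, h in zip(shoes, headaches) if s) / sum(shoes)
p_h_none = sum(h for s, h in zip(shoes, headaches) if not s) / (len(shoes) - sum(shoes))
print(p_h_shoes, p_h_none)  # large gap: strong association, zero causal effect
```

The two conditional frequencies differ dramatically even though, by construction, putting shoes on someone would change nothing: that gap is pure confounding association.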
But that doesn't stop us from using the heuristic all the time: "correlation equals causation" is actually a cognitive bias. In the example where shoe-sleeping is associated with headaches, you could replace shoe-sleeping with anything, call it a star: just anything associated with your headache. That star could come from a variety of places. One is the availability heuristic, another cognitive bias. What does the availability heuristic say? Roughly, that the star that comes into your mind is whatever is most readily available in your mind. For example, if you read yesterday that caffeine is associated with headaches, you might think, "It's because I drank a cup of coffee earlier today, that's why I have a headache," even if two years ago you read an article saying caffeine is not associated with headaches; that older article just isn't nearly as available in your mind. So say you want to explain why you have a headache, and how to avoid one in the future. We might do this by coming up with some star via the availability heuristic (or via motivated reasoning, which I'll get to shortly), and then, given that the star is associated with our headache, concluding that it explains the headache, using the correlation-equals-causation cognitive bias.

Motivated reasoning is when we have some worldview and come up with reasoning to justify it. An example of motivated reasoning in this case: say I don't enjoy spending time with my in-laws. I might be motivated to attribute my headache to the time I spent with my in-laws earlier that day: "I got a headache because I spent time with my in-laws, so I probably shouldn't hang out with them in the future." It gives me a reason not to do what I didn't want to do anyway. So, as a recap: we use correlation-equals-causation as a cognitive bias all the time; we want to explain something, we notice something else associated with it, and we conclude causation.

Here's a real-data example: the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in look pretty well associated over time. Does that mean Nicolas Cage drove people to drown themselves? Or that Nicolas Cage found out people were drowning and made movies to convince them not to? It's probably neither: the two are correlated just by chance, and neither is really the cause of the other.

So I've just told you that correlation does not imply causation; you've probably
heard that several times before. But what does imply causation? If we're doing causal inference, we have to answer this question, and that's what this section is about: how can we know that one thing causes another, and how can we estimate how much of an effect it has? To do that, I'll introduce a new concept, potential outcomes, which is unique to causal inference; you wouldn't have seen it in regular statistics. I'll start with the motivation.

Keep in mind the context: we're inferring the effect of some treatment on some outcome. Say you have a headache, and you know that if you were to take a pill, your headache would go away, and you also know that if you were not to take the pill, you would still have your headache. If you know these two things, you can probably say there is a causal effect of the pill on your headache: the pill makes your headache go away. But what if, were you not to take the pill, your headache would still go away? That is, your headache goes away regardless of whether or not you take the pill. Then you might say the pill has no causal effect; maybe it's just a sugar pill. That's the intuition behind potential outcomes, and we'll now get a bit more precise with specific notation. We'll use do(T = 1) to denote taking the pill and do(T = 0) to denote not taking it. The outcome you would observe if you were to take the pill, and the outcome you would observe if you were not to, get simpler notation: Y_i(1) is the potential outcome if you were to take the pill, and Y_i(0) is the potential outcome if you were not to. Then we can define the causal effect as just the difference between these two potential outcomes: the causal effect of taking the pill on your headache is Y_i(1) - Y_i(0). It's worth taking some time to make sure that sinks in, because potential outcomes are a new concept, and this is notation you won't have seen before if you haven't seen causal inference.

However, there is a fundamental problem here. Say Y_i(1) = 1, meaning that if you were to take the pill your headache would go away, and Y_i(0) = 0, meaning that if you were not to take the pill you would still have your headache; so here 1 means the headache goes away and 0 means it stays. The fundamental problem is this: if you don't take the pill, you can't observe what would have happened had you taken it. We call that a counterfactual, and because you can't observe it, you can't compute this causal effect. Similarly, if you do take the pill, you can't observe what would have happened had you not, and again you can't compute the causal effect, because you only ever have access to one of the two terms in the difference. This is known as the fundamental problem of causal inference.
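A toy "god's-eye view" can make the fundamental problem vivid. In this invented sketch we generate both potential outcomes for each unit, even though a real dataset would only ever contain the observed one:

```python
# Toy "god's-eye view" of potential outcomes (all values invented).
# For each unit we generate BOTH potential outcomes, then reveal only
# the one matching the treatment actually taken; the other is the
# counterfactual, which real data never contains.
import random

random.seed(1)
for i in range(8):
    y0 = random.randint(0, 1)        # Y_i(0): outcome if untreated
    y1 = random.randint(0, 1)        # Y_i(1): outcome if treated
    t = random.randint(0, 1)         # treatment actually taken
    ite = y1 - y0                    # individual treatment effect
    observed = y1 if t else y0       # all we would see in a dataset
    hidden = y0 if t else y1         # the counterfactual
    print(f"unit {i}: T={t}, observed Y={observed}, hidden {hidden}, ITE={ite}")
```

Every ITE in this simulated table needs the hidden column, which is exactly what real data withholds.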
We'll call the difference we've been looking at the individual treatment effect (ITE), since it's defined for a specific individual i. How can we get around the fundamental problem? Maybe we can take an average. If we average over i, linearity of expectation gives the difference of expected potential outcomes, E[Y(1)] - E[Y(0)]. You would then like this difference to equal the difference of conditional expectations, E[Y | T = 1] - E[Y | T = 0]. Unfortunately, in general it does not, and that's because correlation does not imply causation: with confounding association present, the conditional expectations are just a measure of association. The first quantity is causal; the second is not, being a mixture of confounding association and causal association. When we have confounding, the two quantities are not equal, and we can't just look at the difference between conditional expectations.

Well, what if there were no arrow from C to T, so that C were not a cause of treatment? Then we wouldn't have any confounding association, which would be fantastic. And it turns out that's exactly what randomized controlled trials do. In a randomized controlled trial, an experimenter randomizes subjects into a treatment group or a control group, choosing each subject's group by, say, a coin flip: some random number generator. That means your treatment T is determined only by the coin flip; it can't have any causal parents, because it's completely random.
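The coin-flip assignment just described can be sketched in a simulation (all the probabilities are invented): a background variable still affects the outcome, but because it can no longer affect treatment, the plain difference in group means recovers the true effect:

```python
# RCT sketch (all probabilities invented): C affects the outcome, but
# treatment T is a pure coin flip, so C cannot be a cause of T and
# there is no confounding.
import random

random.seed(2)
true_effect = -0.2                   # treatment lowers P(Y=1) by 0.2
y_treated, y_control = [], []
for _ in range(200_000):
    c = random.random() < 0.5        # a background variable, e.g. drinking
    t = random.random() < 0.5        # coin flip: nothing causes T
    base = 0.6 if c else 0.3         # C still shifts the outcome
    y = random.random() < base + (true_effect if t else 0.0)
    (y_treated if t else y_control).append(y)

est = sum(y_treated) / len(y_treated) - sum(y_control) / len(y_control)
print(round(est, 2))                 # close to -0.2, the average treatment
                                     # effect, from a plain difference in means
```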
Another way of viewing this is that randomization makes the treatment groups comparable. Think back to the shoe-sleeping example: the problem there was that the shoe-sleepers were not comparable to the non-shoe-sleepers, because most of the shoe-sleepers had been drinking the night before and most of the non-shoe-sleepers had not. But suppose you randomized whether each person wears shoes to bed: say you went into their rooms, flipped a coin for each drunk person to decide whether to take their shoes off, and flipped a coin for each sober person to decide whether to sneakily put shoes on them while they slept. That would distribute the drunk people evenly across the two groups, those sleeping with shoes on and those without, making the groups comparable. Comparable groups are good because then you can get causal effects, and that's what this equality says: when there is no confounding, the average treatment effect (ATE) equals the difference in conditional expectations. At the top of the slide we have that, in general, the two are not equal; at the bottom, that when there is no confounding, such as in a randomized controlled trial, they are equal. Randomization is sort of magical in the sense that when you can run an experiment, causal inference is easy. And this holds for any other variable, too: here we just have C, but if there are other variables we don't observe, randomization also takes care of them. They can't be causes of T, because T has no causal parents; it's determined purely by a coin flip. So that's one answer to "what does imply causation": a randomized experiment is one way of
getting causation. But that's not where a lot of research is done, and we'll see why. In observational studies, you're simply given a dataset: you didn't gather it, someone else did and handed it to you, and that someone probably wasn't running an experiment. Ideally, C would not be a cause of treatment, so there would be no confounding; but in observational studies you're going to have some confounding pretty much all of the time, and the problem is that you can't always randomize treatment, which is what would give you that ideal.

There are several reasons you might not be able to randomize. It could be unethical: say you want to measure the effect of smoking on lung cancer; it would be unethical to randomly assign people to smoke. It could be infeasible: say you want to measure the causal effect of capitalism versus communism on GDP. Because the economic system is assigned at the country level, you'd have to be able to assign whole countries to economic systems, essentially be dictator of the world. That doesn't seem strictly impossible (I could imagine someone being that powerful), but it certainly seems infeasible. And then some things are just impossible: say you want to measure the effect of someone's DNA at birth on their probability of getting breast cancer. In the future you might be able to change their DNA at the time of the study, but absent a time machine you'll never be able to change their DNA at birth, and there's everything their DNA could have caused between birth and now; so randomizing DNA at birth is actually impossible. And one reason not on this slide: observational studies are just more convenient. Randomized experiments are really expensive, and some of us, in computer science say, don't even know how to run them; going through ethics boards and all that doesn't sound fun. So observational studies are really important, because we can't always randomize treatment and they're generally more convenient. But that leaves us with a really natural question:
how do we measure causal effects in observational studies? The solution is what you might expect: adjust for, or control for, confounders. But what does that mean? I'll use W for the confounders here. If W is a sufficient adjustment set (I'll say a bit more about what that means on the next slide), then we have this equation: the expected potential outcome under treatment t, in the subpopulation where W = w, equals E[Y | T = t, W = w]. The notation in the middle, E[Y | do(T = t), W = w], uses the do-operator you saw before and means essentially the same thing as the potential-outcomes quantity. The important point is that when we condition on W, we get something with no causal quantities in it: just a regular conditional expectation of Y given T and W. Why? Because in the picture down here, the confounding association is blocked when you condition on W. The shading denotes conditioning, and W is C in this case; the next slide will have W as a set of variables that's a bit more complex.

But this quantity is still conditioned on W. What if you just want the average potential outcome, E[Y(t)]? It turns out the solution is to marginalize out W: take an expectation over W, now treated as a random variable, of the quantity above. That gives you the marginal average potential outcome; there's also the do-notation version, E[Y | do(T = t)], which is what we'll use on this slide.

So we gave one example where W is C, and C is a sufficient adjustment set there. On this slide I'll use shaded nodes to mark examples of sufficient adjustment sets; sometimes there will be multiple nodes, and shaded still means conditioned on, which is what we're doing to W in this formula.
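Here is a sketch of that adjustment on simulated data (the data-generating probabilities are invented for illustration): condition on W, marginalize it out with its marginal distribution, and compare with the naive difference in means:

```python
# Sketch of the adjustment formula on invented simulated data:
#   E[Y | do(T=t)] = sum_w E[Y | T=t, W=w] * P(W=w).
# W confounds T and Y, so the naive difference in means is biased;
# the adjusted difference recovers the true effect.
import random
from collections import defaultdict

random.seed(3)
true_effect = 0.1
cell = defaultdict(lambda: [0, 0])       # (t, w) -> [sum of Y, count]
w_count = [0, 0]
for _ in range(400_000):
    w = int(random.random() < 0.5)
    t = int(random.random() < (0.8 if w else 0.2))   # W causes T: confounding
    p = (0.5 if w else 0.2) + (true_effect if t else 0.0)
    y = int(random.random() < p)
    cell[(t, w)][0] += y
    cell[(t, w)][1] += 1
    w_count[w] += 1

n = sum(w_count)

def adjusted(t):
    # Condition on W, then marginalize it out with its marginal P(W=w).
    return sum(cell[(t, w)][0] / cell[(t, w)][1] * w_count[w] / n for w in (0, 1))

def naive(t):
    # Plain conditional mean E[Y | T=t], ignoring W.
    return sum(cell[(t, w)][0] for w in (0, 1)) / sum(cell[(t, w)][1] for w in (0, 1))

print(round(adjusted(1) - adjusted(0), 2))   # close to 0.1, the true effect
print(round(naive(1) - naive(0), 2))         # much larger: confounded
```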
What if you have a graph like this, with multiple paths along which confounding association flows? One way to block them is to control for C and W2. There are other ways, though: you could also control for C and W1, which blocks the association even sooner. Either way, this isolates the causal association, just as if you had removed the edges into T, i.e., randomized T. You could even additionally control for W3; it makes no difference here. And as a final point, which I won't go into in detail here: if you have a V-like structure such as this one, you don't need to control for Z2, and in fact you don't want to control for Z2. We'll see this more in the course; for now, it's just a quick warning. Now that we've gone through how to get these causal quantities using this
adjustment formula, we can go back to the COVID-27 example and calculate them. I've copied the formula up here, and we'll look at the scenario where the causal graph has condition as a cause of treatment (scenario 1; we could also consider the scenario where treatment is a cause of condition, but that's not the one we'll look at). The sufficient adjustment set here is C: we condition on C and then marginalize over it. Because C is a discrete variable taking only two values, the outer expectation over C becomes a sum over c of P(C = c) times the inner expectation.

Now we can add the causal column, and this column makes it clear that treatment B is preferable to treatment A in this scenario. Under each column I list what it is: the causal column is E[Y | do(t)], the causal quantity, and the other quantities are non-causal (you can treat the subgroup ones as causal, but they don't carry do-operators). With the column in place, let's go through the calculation; the numbers are listed here, but I'll walk you through it. First we want E[Y | T = t, C = c], where t corresponds to the row and c to the column: that's just the number in the corresponding box, 15% and 10% for the mild column, i.e., the 0.15 and the 0.1. Then we multiply by the marginal probability of the condition. Add up the total number of people in the population, 1,400 + 100 + 500 + 50 = 2,050, and put the number of people with the given condition in the numerator: for mild condition, 1,400 + 50 = 1,450, giving the weight 1,450/2,050. Notice this weight is the same for both treatment A and treatment B. For the severe condition it's the same procedure: E[Y | T = t, C = c] is 0.3 and 0.2, the denominator is still the 2,050 people, and the numerator is 100 + 500 = 600. That's how we get these results. Now let's compare them to the numbers we got when we naively took the total column.
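Both columns can be computed in a few lines from the table's cell counts. The only difference between the naive and causal numbers is whether the subgroup rates are weighted by each treatment group's own condition mix or by the population-wide condition mix:

```python
# Cell counts from the talk's table: (deaths, group size) per
# (treatment, condition) cell.
cells = {
    ("A", "mild"):   (210, 1400), ("A", "severe"): (30,  100),
    ("B", "mild"):   (5,    50),  ("B", "severe"): (100, 500),
}
n = sum(size for _, size in cells.values())       # 2,050 people in total
p_cond = {
    "mild":   (1400 + 50) / n,                    # 1450/2050, same for A and B
    "severe": (100 + 500) / n,                    # 600/2050, same for A and B
}

for t in ("A", "B"):
    # Adjustment formula: weight each subgroup rate by the marginal P(C=c).
    causal = sum(
        (cells[(t, c)][0] / cells[(t, c)][1]) * p_cond[c]
        for c in ("mild", "severe")
    )
    # Naive total-column rate: weight by this treatment's own condition mix.
    naive = sum(cells[(t, c)][0] for c in p_cond) / sum(cells[(t, c)][1] for c in p_cond)
    print(t, round(causal, 3), round(naive, 3))
# A: causal 0.194, naive 0.16; B: causal 0.129, naive 0.191.
# The causal column flips the conclusion: treatment B is preferable.
```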
The way we get the 16% and 19% in the total column is by taking the same conditional expectations used in the causal column (those are all the numbers in parentheses) but giving them different weights. This is why, at the beginning, I wrote the total-column calculation in terms of weighted averages. The difference is that in the naive calculation the mild group gets a completely different weight under treatment A than under treatment B, whereas in the causal calculation the mild group gets exactly the same weight under both treatments; the same goes for the severe group. In the naive calculation that produces the total column, a much larger weight goes to the severe group under treatment B than under treatment A. That's why the naive numbers are bad; that's why they're wrong
when condition is a cause of treatment. I encourage you to go back and look at these numbers; staring at them a bit, or working them out yourself, can give you some useful intuition, at least for Simpson's paradox. We'll go into this in more detail in the first few weeks of the course, probably around week two. All right, with that: welcome to causal inference! If you want to join the course mailing list, go to causalcourse.com.