1 - A Brief Introduction to Causal Inference (Course Preview)
By Brady Neal - Causal Inference
Summary
Topics Covered
- Causal Inference Is About Intervention Effects
- Why Correlation Does Not Imply Causation
- The Paradox That Flip-Flops Your Data
- Randomization Is Magical
- Why Some Causal Effects Are Forever Unmeasurable
Full Transcript
Hi everyone, welcome to "A Brief Introduction to Causal Inference." In this talk I'll give you a preview of the first few weeks of the course, focusing mainly on motivation and intuition. This talk isn't meant to give you a complete understanding of every topic I touch on; I'll cover enough topics to give you a good idea of the basics of causal inference, but a complete understanding will have to wait for the first few weeks of the course. You might have some machine learning topics in mind, like out-of-distribution generalization, and wonder how causal inference relates to them. Those won't appear in this talk, because they aren't among the basics of causal inference, but they will appear later in the course.

All right, with that, let's get started. What is causal inference? Causal inference is mainly about inferring the effect of one thing on another: the effect of any treatment, policy, or intervention. Some examples: you might want to infer the effect of a treatment on a disease. Or say you want to reduce emissions and have several different climate change policies for doing so; you'd want to pick the policy that is most effective, the one that causes the largest reduction in emissions. Similarly, say you notice a rise in bad mental health outcomes and think social media might be one of the important causes; you could do a causal analysis to see how important social media is in contributing to the problem, maybe what percentage of it. And there are many more examples. In general, whenever you have some X whose effect on some Y you want to talk about, that's what causal inference is for.

Here is the outline. First, a motivating example: Simpson's paradox. You might have heard of it before, and it turns out that the causal structure of the problem is absolutely essential to resolving it. Then we'll talk about why correlation does not imply causation; you've probably heard that before too, but hopefully I'll give you a bit more understanding of why it's the case and what important implications it has. In the last part of the talk, we'll cover observational studies: settings where you are simply given data and can't run any experiments, which is where a lot of causal inference research takes place these days.

All right, let's get into our motivating example, Simpson's paradox. Say that, in a purely hypothetical scenario,
there is a new disease, COVID-27, and there are two treatments for it, A and B, which we'll code as 0 and 1. It's your job to decide which treatment to choose for your country, and the only thing you care about is minimizing deaths: which treatment will cause the fewest people to die, or help the most people live? One important fact about these two treatments, which we'll use throughout this example, is that treatment B is much scarcer than treatment A. Your data come from the doctors in your country: they administer treatments and collect data on what happens afterward. Besides the treatment, you also have data on each patient's condition, mild or severe, which we'll also code as 0 and 1. And finally there's the outcome Y: each patient ends up either alive or dead. We're only looking at binary variables here, but a general theme in causal inference is that the analysis extends from binary variables to, say, continuous variables or multiple outcomes. Here is what your data look like at the treatment level. Among people who were
given treatment A, 16% died: 240 out of the 1,500 people who got treatment A. Among people who got treatment B, 19% died. Just looking at this, treatment A seems to be doing a bit better than treatment B; three percentage points fewer people die. But something interesting happens when you subgroup the data by condition. Among patients with mild condition, 15% of those given treatment A die, compared to only 10% of those given treatment B, so treatment B actually looks better in the mild subgroup. The same thing happens with patients in severe condition: only 20% of those who had severe condition and received treatment B died, whereas a larger 30% of those who had severe condition and received treatment A died. So how did these numbers flip? In some sense there's a paradox: if you look at the total population and ignore the subgroups, treatment A has the lower mortality rate and looks better; but when you look at each subgroup, treatment B looks better in both. This is Simpson's paradox.
It turns out the numbers work out just fine. The 16%, for instance, is 240/1,500, where the 240 is the sum of the numerators of the treatment-A mild group and the treatment-A severe group, and the 1,500 is the sum of the denominators: 1,400 plus 100. A more illuminating way to write these calculations is as weighted averages of the subgroup rates: the 0.15 here is the 15%, the 0.3 is the 30%, and so on. In the treatment-A calculation, the 0.15 gets a much larger weight than the 0.3, simply because most of the people who received treatment A had mild condition: 1,400 out of 1,500. So the 16% you see for treatment A in the total population largely comes from the big weight placed on the 15% among the mild-condition patients. In contrast, for treatment B a much bigger weight is placed on the severe group, because 500 out of the 550 people who received treatment B had severe condition; the 19% you see for treatment B in the total population largely comes from the 20% in its severe group. Simpson's paradox largely comes from this unequal weighting: the treatment-A patients mostly had mild condition, the treatment-B patients mostly had severe condition, and people with severe condition are simply more likely to die than people with mild condition. That's why the numbers flip.
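This weighted-average arithmetic can be checked in a few lines of Python; the cell counts (deaths and group sizes per treatment/condition cell) are the ones from the table in the talk:

```python
# Cell counts from the talk's table: (deaths, group size) per
# (treatment, condition) cell.
cells = {
    ("A", "mild"):   (210, 1400),   # 15% mortality
    ("A", "severe"): (30,   100),   # 30% mortality
    ("B", "mild"):   (5,     50),   # 10% mortality
    ("B", "severe"): (100,  500),   # 20% mortality
}

for t in ("A", "B"):
    deaths = sum(d for (tt, _), (d, _) in cells.items() if tt == t)
    total = sum(c for (tt, _), (_, c) in cells.items() if tt == t)
    # The total rate is a weighted average of the subgroup rates, with
    # weights given by this treatment group's own condition mix.
    weighted = sum(
        (d / c) * (c / total)
        for (tt, _), (d, c) in cells.items() if tt == t
    )
    print(t, round(deaths / total, 3), round(weighted, 3))
# A: 0.16 both ways; B: 0.191 both ways.
```

Treatment A's total rate leans on its mild subgroup (weight 1,400/1,500), while treatment B's leans on its severe subgroup (weight 500/550): exactly the unequal weighting described above.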
But the question still remains: which treatment should you choose? Hold that question in your head for a bit and see if you can come up with your own answer. The spoiler, as I'll show over the next few slides, is that the answer largely depends on the causal structure of the problem.

In scenario 1, the causal graph is: condition is a cause of treatment, and treatment and condition are both causes of the outcome. The important edge is that condition is a cause of treatment. In this scenario, treatment B is generally the better choice, and I'll give a specific example to illustrate the intuition. Say a doctor sees someone come in with mild condition. For most such people, the doctor might decide to assign treatment A, wanting to save the scarcer treatment B for people with severe condition, who are more likely to die. That's why, among people with mild condition, 1,400 out of 1,450 were assigned treatment A. Similarly, if someone comes in with severe condition, the doctor might be more likely to prescribe treatment B, reasoning that for this patient the scarce treatment is worth it. That's why, among people with severe condition, 500 out of 600 were assigned treatment B. So why is treatment B preferred in this scenario? Because the large 19% mortality rate among treatment-B patients arises mainly from treatment B being disproportionately assigned to people with severe condition, who have a higher chance of dying, while treatment A is disproportionately assigned to people with mild condition, who have a lower chance of dying. The correct numbers to look at are therefore the subgroup numbers, the ones in the mild and severe columns, and that's why treatment B should be preferred in this scenario,
when condition is a cause of treatment. Now, in scenario 2, the main conceptual difference is that treatment is a cause of condition; everything else in the causal graph is the same, with treatment and condition still causes of the outcome Y. In this causal graph, treatment A is actually preferred, and again I'll give an example to illustrate the intuition. Say you're prescribed treatment B. Because it's rather scarce, you might have to wait a long time before you can actually take it, and while you're waiting your condition could worsen: you might come in with mild condition and deteriorate to a severe one. That's why, among people prescribed treatment B, 500 out of 550 had severe condition; in this story, many of them transitioned from mild to severe condition (a different story from the one in the previous scenario). Now say you're assigned treatment A instead. Because treatment A is abundant, unlike treatment B, you don't have to wait at all, so if you come in with mild condition you'll probably still have mild condition when you actually take the treatment. That's why, of the 1,500 people assigned treatment A, 1,400 had mild condition; of the 100 with severe condition, probably none transitioned from mild while waiting, and most simply came in severe. The reason we prefer treatment A in this setting is that the treatment itself affects your condition: being given treatment B causes your condition to worsen, which in turn affects your probability of dying. There is an effect flowing through your condition that we have to take into account, and the way to take it into account is to look at the total-population numbers. The thing to keep in mind is that treatment B is partly bad in this scenario precisely because it causes you to have a worse condition, so we would prefer treatment A.

That concludes the motivating example. A quick recap of Simpson's paradox: we prefer treatment B when condition is a cause of treatment, and we prefer treatment A when treatment is a cause of condition.
So you have to decide which treatment to give your whole country. It's an important decision that will determine the lives of many people, and it hinges crucially on the causal structure of the problem.

Okay, with that, let's get into "correlation does not imply causation." You've probably heard this many times before; for machine learning people, the analogue is "prediction does not imply causation." Here's the example: say you're looking at data on people who sleep with their shoes on and people who wake up with headaches, and it turns out that most people who sleep with their shoes on wake up with headaches, while most people who don't sleep with their shoes on don't. The two are strongly correlated, strongly associated. You might think, "I probably shouldn't sleep with my shoes on, because I don't want to wake up with a headache"; that is, you might think it's a cause. But what if your data also show that most of the people who went to sleep with their shoes on had been drinking pretty heavily the night before, and those same people were the ones waking up with headaches? Then you might think the only reason we see this association is that there's a common cause: drinking the night before.

There are two ways to resolve in your head why shoe-sleeping is so strongly correlated with waking up with a headache. The first: consider the two groups of people, the ones who went to sleep with their shoes on (the shoe-sleepers) and the ones who went to sleep without them (the non-shoe-sleepers). These two groups differ in a very key way: almost everyone in the shoe-sleeping group drank the night before, and almost everyone in the non-shoe-sleeping group did not. Think of the fraction of each group that drank the night before: it's a completely different number. That explains why we can't look at these two groups and deduce a causal effect from them; the groups are not comparable. You would want the groups to be the same in every way except for the treatment, whether or not they went to sleep with shoes on. The other way to see it is confounding: because of the common cause, drinking confounds the effect of shoe-sleeping on waking up with a headache. Graphically, you should visualize a confounding association running between shoe-sleeping and waking up with a headache; that's in contrast to causal association, which would flow along the directed path from shoe-sleeping to headache. The association we observe, the total association, is a mixture of these causal and confounding associations.

Correlation is just one type of association. "Association" here is a synonym for statistical dependence; "correlation" is technically a measure of linear statistical dependence, but people frequently use it to mean statistical dependence in general. To avoid confusion, we'll just use the word "association" rather than "correlation." Let that sink in: if you measure correlation, or any measure of association, you're looking at a mixture of causal and confounding association, and that additional confounding association is why correlation does not imply causation. So, many of us have learned that correlation does not imply causation.
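To make the "mixture of associations" concrete, here is a small simulation of the shoe-sleeping story. All the probabilities are invented for illustration, and shoes are given exactly zero causal effect on headaches; the strong association comes entirely through the common cause:

```python
# Drinking is a common cause of both sleeping with shoes on and waking
# up with a headache; shoes have NO effect on headaches. (All the
# probabilities here are invented for illustration.)
import random

random.seed(0)
shoes, headaches = [], []
for _ in range(100_000):
    drank = random.random() < 0.5
    # Drinkers usually pass out with their shoes on.
    shoe = random.random() < (0.9 if drank else 0.1)
    # Headaches depend ONLY on drinking, never on shoes.
    headache = random.random() < (0.8 if drank else 0.1)
    shoes.append(shoe)
    headaches.append(headache)

p_h_shoes = sum(h for s, h in zip(shoes, headaches) if s) / sum(shoes)
p_h_none = sum(h for s, h in zip(shoes, headaches) if not s) / (len(shoes) - sum(shoes))
print(p_h_shoes, p_h_none)  # large gap: strong association, zero causal effect
```

The two conditional frequencies differ dramatically even though, by construction, putting shoes on someone would change nothing: that gap is pure confounding association.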
But that doesn't stop us from using the heuristic all the time: "correlation equals causation" is actually a cognitive bias. In the example where shoe-sleeping is associated with headaches, you could replace shoe-sleeping with anything, call it a star: just anything associated with your headache. That star could come from a variety of places. One is the availability heuristic, another cognitive bias. What does the availability heuristic say? Roughly, that the star that comes into your mind is whatever is most readily available in your mind. For example, if you read yesterday that caffeine is associated with headaches, you might think, "It's because I drank a cup of coffee earlier today, that's why I have a headache," even if two years ago you read an article saying caffeine is not associated with headaches; that older article just isn't nearly as available in your mind. So say you want to explain why you have a headache, and how to avoid one in the future. We might do this by coming up with some star via the availability heuristic (or via motivated reasoning, which I'll get to shortly), and then, given that the star is associated with our headache, concluding that it explains the headache, using the correlation-equals-causation cognitive bias.

Motivated reasoning is when we have some worldview and come up with reasoning to justify it. An example of motivated reasoning in this case: say I don't enjoy spending time with my in-laws. I might be motivated to attribute my headache to the time I spent with my in-laws earlier that day: "I got a headache because I spent time with my in-laws, so I probably shouldn't hang out with them in the future." It gives me a reason not to do what I didn't want to do anyway. So, as a recap: we use correlation-equals-causation as a cognitive bias all the time; we want to explain something, we notice something else associated with it, and we conclude causation.

Here's a real-data example: the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in look pretty well associated over time. Does that mean Nicolas Cage drove people to drown themselves? Or that Nicolas Cage found out people were drowning and made movies to convince them not to? It's probably neither: the two are correlated just by chance, and neither is really the cause of the other.

So I've just told you that correlation does not imply causation; you've probably
heard that several times before. But what does imply causation? If we're doing causal inference, we have to answer this question, and that's what this section is about: how can we know that one thing causes another, and how can we estimate how much of an effect it has? To do that, I'll introduce a new concept, potential outcomes, which is unique to causal inference; you wouldn't have seen it in regular statistics. I'll start with the motivation.

Keep in mind the context: we're inferring the effect of some treatment on some outcome. Say you have a headache, and you know that if you were to take a pill, your headache would go away, and you also know that if you were not to take the pill, you would still have your headache. If you know these two things, you can probably say there is a causal effect of the pill on your headache: the pill makes your headache go away. But what if, were you not to take the pill, your headache would still go away? That is, your headache goes away regardless of whether or not you take the pill. Then you might say the pill has no causal effect; maybe it's just a sugar pill. That's the intuition behind potential outcomes, and we'll now get a bit more precise with specific notation. We'll use do(T = 1) to denote taking the pill and do(T = 0) to denote not taking it. The outcome you would observe if you were to take the pill, and the outcome you would observe if you were not to, get simpler notation: Y_i(1) is the potential outcome if you were to take the pill, and Y_i(0) is the potential outcome if you were not to. Then we can define the causal effect as just the difference between these two potential outcomes: the causal effect of taking the pill on your headache is Y_i(1) - Y_i(0). It's worth taking some time to make sure that sinks in, because potential outcomes are a new concept, and this is notation you won't have seen before if you haven't seen causal inference.

However, there is a fundamental problem here. Say Y_i(1) = 1, meaning that if you were to take the pill your headache would go away, and Y_i(0) = 0, meaning that if you were not to take the pill you would still have your headache; so here 1 means the headache goes away and 0 means it stays. The fundamental problem is this: if you don't take the pill, you can't observe what would have happened had you taken it. We call that a counterfactual, and because you can't observe it, you can't compute this causal effect. Similarly, if you do take the pill, you can't observe what would have happened had you not, and again you can't compute the causal effect, because you only ever have access to one of the two terms in the difference. This is known as the fundamental problem of causal inference.
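A toy "god's-eye view" can make the fundamental problem vivid. In this invented sketch we generate both potential outcomes for each unit, even though a real dataset would only ever contain the observed one:

```python
# Toy "god's-eye view" of potential outcomes (all values invented).
# For each unit we generate BOTH potential outcomes, then reveal only
# the one matching the treatment actually taken; the other is the
# counterfactual, which real data never contains.
import random

random.seed(1)
for i in range(8):
    y0 = random.randint(0, 1)        # Y_i(0): outcome if untreated
    y1 = random.randint(0, 1)        # Y_i(1): outcome if treated
    t = random.randint(0, 1)         # treatment actually taken
    ite = y1 - y0                    # individual treatment effect
    observed = y1 if t else y0       # all we would see in a dataset
    hidden = y0 if t else y1         # the counterfactual
    print(f"unit {i}: T={t}, observed Y={observed}, hidden {hidden}, ITE={ite}")
```

Every ITE in this simulated table needs the hidden column, which is exactly what real data withholds.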
We'll call the difference we've been looking at the individual treatment effect (ITE), since it's defined for a specific individual i. How can we get around the fundamental problem? Maybe we can take an average. If we average over i, linearity of expectation gives the difference of expected potential outcomes, E[Y(1)] - E[Y(0)]. You would then like this difference to equal the difference of conditional expectations, E[Y | T = 1] - E[Y | T = 0]. Unfortunately, in general it does not, and that's because correlation does not imply causation: with confounding association present, the conditional expectations are just a measure of association. The first quantity is causal; the second is not, being a mixture of confounding association and causal association. When we have confounding, the two quantities are not equal, and we can't just look at the difference between conditional expectations.

Well, what if there were no arrow from C to T, so that C were not a cause of treatment? Then we wouldn't have any confounding association, which would be fantastic. And it turns out that's exactly what randomized controlled trials do. In a randomized controlled trial, an experimenter randomizes subjects into a treatment group or a control group, choosing each subject's group by, say, a coin flip: some random number generator. That means your treatment T is determined only by the coin flip; it can't have any causal parents, because it's completely random.
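The coin-flip assignment just described can be sketched in a simulation (all the probabilities are invented): a background variable still affects the outcome, but because it can no longer affect treatment, the plain difference in group means recovers the true effect:

```python
# RCT sketch (all probabilities invented): C affects the outcome, but
# treatment T is a pure coin flip, so C cannot be a cause of T and
# there is no confounding.
import random

random.seed(2)
true_effect = -0.2                   # treatment lowers P(Y=1) by 0.2
y_treated, y_control = [], []
for _ in range(200_000):
    c = random.random() < 0.5        # a background variable, e.g. drinking
    t = random.random() < 0.5        # coin flip: nothing causes T
    base = 0.6 if c else 0.3         # C still shifts the outcome
    y = random.random() < base + (true_effect if t else 0.0)
    (y_treated if t else y_control).append(y)

est = sum(y_treated) / len(y_treated) - sum(y_control) / len(y_control)
print(round(est, 2))                 # close to -0.2, the average treatment
                                     # effect, from a plain difference in means
```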
Another way of viewing this is that randomization makes the treatment groups comparable. Think back to the shoe-sleeping example: the problem there was that the shoe-sleepers were not comparable to the non-shoe-sleepers, because most of the shoe-sleepers had been drinking the night before and most of the non-shoe-sleepers had not. But suppose you randomized whether each person wears shoes to bed: say you went into their rooms, flipped a coin for each drunk person to decide whether to take their shoes off, and flipped a coin for each sober person to decide whether to sneakily put shoes on them while they slept. That would distribute the drunk people evenly across the two groups, those sleeping with shoes on and those without, making the groups comparable. Comparable groups are good because then you can get causal effects, and that's what this equality says: when there is no confounding, the average treatment effect (ATE) equals the difference in conditional expectations. At the top of the slide we have that, in general, the two are not equal; at the bottom, that when there is no confounding, such as in a randomized controlled trial, they are equal. Randomization is sort of magical in the sense that when you can run an experiment, causal inference is easy. And this holds for any other variable, too: here we just have C, but if there are other variables we don't observe, randomization also takes care of them. They can't be causes of T, because T has no causal parents; it's determined purely by a coin flip. So that's one answer to "what does imply causation": a randomized experiment is one way of
getting causation. But that's not where a lot of research is done, and we'll see why. In observational studies, you're simply given a dataset: you didn't gather it, someone else did and handed it to you, and that someone probably wasn't running an experiment. Ideally, C would not be a cause of treatment, so there would be no confounding; but in observational studies you're going to have some confounding pretty much all of the time, and the problem is that you can't always randomize treatment, which is what would give you that ideal.

There are several reasons you might not be able to randomize. It could be unethical: say you want to measure the effect of smoking on lung cancer; it would be unethical to randomly assign people to smoke. It could be infeasible: say you want to measure the causal effect of capitalism versus communism on GDP. Because the economic system is assigned at the country level, you'd have to be able to assign whole countries to economic systems, essentially be dictator of the world. That doesn't seem strictly impossible (I could imagine someone being that powerful), but it certainly seems infeasible. And then some things are just impossible: say you want to measure the effect of someone's DNA at birth on their probability of getting breast cancer. In the future you might be able to change their DNA at the time of the study, but absent a time machine you'll never be able to change their DNA at birth, and there's everything their DNA could have caused between birth and now; so randomizing DNA at birth is actually impossible. And one reason not on this slide: observational studies are just more convenient. Randomized experiments are really expensive, and some of us, in computer science say, don't even know how to run them; going through ethics boards and all that doesn't sound fun. So observational studies are really important, because we can't always randomize treatment and they're generally more convenient. But that leaves us with a really natural question:
how do we measure causal effects in observational studies? The solution is what you might expect: adjust for, or control for, confounders. But what does that mean? I'll use W for the confounders here. If W is a sufficient adjustment set (I'll say a bit more about what that means on the next slide), then we have this equation: the expected potential outcome under treatment t, in the subpopulation where W = w, equals E[Y | T = t, W = w]. The notation in the middle, E[Y | do(T = t), W = w], uses the do-operator you saw before and means essentially the same thing as the potential-outcomes quantity. The important point is that when we condition on W, we get something with no causal quantities in it: just a regular conditional expectation of Y given T and W. Why? Because in the picture down here, the confounding association is blocked when you condition on W. The shading denotes conditioning, and W is C in this case; the next slide will have W as a set of variables that's a bit more complex.

But this quantity is still conditioned on W. What if you just want the average potential outcome, E[Y(t)]? It turns out the solution is to marginalize out W: take an expectation over W, now treated as a random variable, of the quantity above. That gives you the marginal average potential outcome; there's also the do-notation version, E[Y | do(T = t)], which is what we'll use on this slide.

So we gave one example where W is C, and C is a sufficient adjustment set there. On this slide I'll use shaded nodes to mark examples of sufficient adjustment sets; sometimes there will be multiple nodes, and shaded still means conditioned on, which is what we're doing to W in this formula.
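Here is a sketch of that adjustment on simulated data (the data-generating probabilities are invented for illustration): condition on W, marginalize it out with its marginal distribution, and compare with the naive difference in means:

```python
# Sketch of the adjustment formula on invented simulated data:
#   E[Y | do(T=t)] = sum_w E[Y | T=t, W=w] * P(W=w).
# W confounds T and Y, so the naive difference in means is biased;
# the adjusted difference recovers the true effect.
import random
from collections import defaultdict

random.seed(3)
true_effect = 0.1
cell = defaultdict(lambda: [0, 0])       # (t, w) -> [sum of Y, count]
w_count = [0, 0]
for _ in range(400_000):
    w = int(random.random() < 0.5)
    t = int(random.random() < (0.8 if w else 0.2))   # W causes T: confounding
    p = (0.5 if w else 0.2) + (true_effect if t else 0.0)
    y = int(random.random() < p)
    cell[(t, w)][0] += y
    cell[(t, w)][1] += 1
    w_count[w] += 1

n = sum(w_count)

def adjusted(t):
    # Condition on W, then marginalize it out with its marginal P(W=w).
    return sum(cell[(t, w)][0] / cell[(t, w)][1] * w_count[w] / n for w in (0, 1))

def naive(t):
    # Plain conditional mean E[Y | T=t], ignoring W.
    return sum(cell[(t, w)][0] for w in (0, 1)) / sum(cell[(t, w)][1] for w in (0, 1))

print(round(adjusted(1) - adjusted(0), 2))   # close to 0.1, the true effect
print(round(naive(1) - naive(0), 2))         # much larger: confounded
```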
What if you have a graph like this, with multiple paths along which confounding association flows? One way to block them is to control for C and W2. There are other ways, though: you could also control for C and W1, which blocks the association even sooner. Either way, this isolates the causal association, just as if you had removed the edges into T, i.e., randomized T. You could even additionally control for W3; it makes no difference here. And as a final point, which I won't go into in detail here: if you have a V-like structure such as this one, you don't need to control for Z2, and in fact you don't want to control for Z2. We'll see this more in the course; for now, it's just a quick warning. Now that we've gone through how to get these causal quantities using this
adjustment formula, we can go back to the COVID-27 example and calculate them. I've copied the formula up here, and we'll look at the scenario where the causal graph has condition as a cause of treatment (scenario 1; we could also consider the scenario where treatment is a cause of condition, but that's not the one we'll look at). The sufficient adjustment set here is C: we condition on C and then marginalize over it. Because C is a discrete variable taking only two values, the outer expectation over C becomes a sum over c of P(C = c) times the inner expectation.

Now we can add the causal column, and this column makes it clear that treatment B is preferable to treatment A in this scenario. Under each column I list what it is: the causal column is E[Y | do(t)], the causal quantity, and the other quantities are non-causal (you can treat the subgroup ones as causal, but they don't carry do-operators). With the column in place, let's go through the calculation; the numbers are listed here, but I'll walk you through it. First we want E[Y | T = t, C = c], where t corresponds to the row and c to the column: that's just the number in the corresponding box, 15% and 10% for the mild column, i.e., the 0.15 and the 0.1. Then we multiply by the marginal probability of the condition. Add up the total number of people in the population, 1,400 + 100 + 500 + 50 = 2,050, and put the number of people with the given condition in the numerator: for mild condition, 1,400 + 50 = 1,450, giving the weight 1,450/2,050. Notice this weight is the same for both treatment A and treatment B. For the severe condition it's the same procedure: E[Y | T = t, C = c] is 0.3 and 0.2, the denominator is still the 2,050 people, and the numerator is 100 + 500 = 600. That's how we get these results. Now let's compare them to the numbers we got when we naively took the total column.
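Both columns can be computed in a few lines from the table's cell counts. The only difference between the naive and causal numbers is whether the subgroup rates are weighted by each treatment group's own condition mix or by the population-wide condition mix:

```python
# Cell counts from the talk's table: (deaths, group size) per
# (treatment, condition) cell.
cells = {
    ("A", "mild"):   (210, 1400), ("A", "severe"): (30,  100),
    ("B", "mild"):   (5,    50),  ("B", "severe"): (100, 500),
}
n = sum(size for _, size in cells.values())       # 2,050 people in total
p_cond = {
    "mild":   (1400 + 50) / n,                    # 1450/2050, same for A and B
    "severe": (100 + 500) / n,                    # 600/2050, same for A and B
}

for t in ("A", "B"):
    # Adjustment formula: weight each subgroup rate by the marginal P(C=c).
    causal = sum(
        (cells[(t, c)][0] / cells[(t, c)][1]) * p_cond[c]
        for c in ("mild", "severe")
    )
    # Naive total-column rate: weight by this treatment's own condition mix.
    naive = sum(cells[(t, c)][0] for c in p_cond) / sum(cells[(t, c)][1] for c in p_cond)
    print(t, round(causal, 3), round(naive, 3))
# A: causal 0.194, naive 0.16; B: causal 0.129, naive 0.191.
# The causal column flips the conclusion: treatment B is preferable.
```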
The way we get the 16% and 19% in the total column is by taking the same conditional expectations used in the causal column (those are all the numbers in parentheses) but giving them different weights. This is why, at the beginning, I wrote the total-column calculation in terms of weighted averages. The difference is that in the naive calculation the mild group gets a completely different weight under treatment A than under treatment B, whereas in the causal calculation the mild group gets exactly the same weight under both treatments; the same goes for the severe group. In the naive calculation that produces the total column, a much larger weight goes to the severe group under treatment B than under treatment A. That's why the naive numbers are bad; that's why they're wrong
when condition is a cause of treatment. I encourage you to go back and look at these numbers; staring at them a bit, or working them out yourself, can give you some useful intuition, at least for Simpson's paradox. We'll go into this in more detail in the first few weeks of the course, probably around week two. All right, with that: welcome to causal inference! If you want to join the course mailing list, go to causalcourse.com.