Elizabeth Silver — Causality and Causal Discovery
By Machine Learning and AI Meetup
Summary
Topics Covered
- Causality is fundamental to cognition
- Why study causal relationships at all
- Causal dependence is asymmetric
- How constraint-based causal discovery works
- Use causal learning because there are no better options
Full Transcript
I guess we should start with the joke. People are still getting beers, but I can probably do the joke while the beers happen. Anyway, this one is from my Twitter feed, so if you follow me, sorry, but I don't have that much new material. Here we go: this is the enterprise buzzword translator. When you hear someone in an enterprise saying "business analytics", what they mean is a spreadsheet. When you hear them say "process automation", what they mean is a spreadsheet. When they say "predictive model", you betcha, it's a spreadsheet. And when they say "big data", they mean a spreadsheet that's too big to email. And of course "machine learning" is linear regression, but done in a spreadsheet. Thanks, thanks. That's probably the best laugh I've gotten from one of my MLAI jokes, so I feel pretty good about that. So I think the next thing we'll be doing is the news. Normally I have more of a spiel, but today I just don't. The news is a section where you can call out whatever you like and I will write it down, and someone from the audience will surely demonstrate. What's making you happy and interested this week? Yeah, I love it.
That is really, actually, super cool. Yes: so Silverpond has an elephant detection model running in the wild, catching poachers. Considering that's an application of artificial intelligence which isn't, you know, neoliberal hypernormalisation, that's really great. And where can people find it? [inaudible] Okay. Yes, right, okay. What else is going on in the Melbourne or international AI scene? I'll start naming people if I don't hear something soon, so consider it your moral duty to say something. Oh, thank you. Even to install it you need to have a Google Cloud account? Zing, thank you. Has anyone given it a go? I saw it, but I didn't actually try it. Russ, what was your experience? [inaudible audience comment about credit-related data] Yeah, well, you know, you're sitting there trying to invent the next generation of artificial intelligence, and what you really need is the latitudes and longitudes of toilets in Portland. Excellent. So, any projects that people are working on? Even if it's just something you just started doing, or just thought about doing, you should probably say it; no one else is going to do it for you. I never know how long I should give people, how awkward they need to feel, before someone says something just to put us out of our misery. But anyway, I'm pretty good at this; I'm used to it now. All right.
Suzie, and then Alex. [Suzie] Yeah, I'm looking to develop an algorithm that comes up with unbiased results. An example is when you use Google or other search engines: what comes up, more often than not, is formulated results. [partly inaudible] She's looking for assistance and guidance once the project gets going. So if there's anyone here who is really keen to get that on its way, look into that project. If you're interested in a really cool project, shoot me a message and I'll put you in contact. Okay, all right. [Alex] [partly inaudible] We'll sort out the details here; the magic of editing. Alex, I knew you had something. Yeah, I was going to buy time for people with one bit about that project. Anyway, next week is a cloud summit, so I have a good prepared joke about this one: if you're a Chinese national, you are allowed to attend the summit, but you can't go to all of the talks. Nothing? Nothing? Come on, come on.
[Laughter] They come with DLSS, including supersampling; you get really good [inaudible]. Yeah, that's actually the tensor cores thing, which is really cool, because you used to have to get enterprise-grade cards to get the tensor cores. I saw some people doing tests, and on deep learning flops that card is going to blow all of the other ones out of the water, at least right now. Okay, well, let's have one more thing, why not; this is one of the best editions of the news we've ever done, so I'm touched and genuinely impressed. Who wants to be the last one? Oh, by the way, they work for Accenture or IBM. Okay, well, here we go.
So our next speaker, Elizabeth Silver, is someone I have been talking to about doing a talk for a while; I think this has been six months in the making. I was really excited to have it. To give a little bit of background: a lot of the talks we've been doing recently have been a little bit off the mainstream deep learning path, just because there are enough people talking about CNNs out there that you can get your fill on YouTube. So I've tried to highlight people with important and interesting ideas that I think will be incredibly influential, but that aren't yet universally appreciated. This talk is about causality, and I think it's one of the most important problems facing machine learning today. There was a famous experiment, done I think a little bit before they tightened up the ethics on animal experimentation, with two kittens. One kitten was in a cart attached to the other, and would just be passively dragged around by the active kitten; the cart was facing backwards, so one kitten would look one way and the active kitten the other. The active kitten could move around the world as it wished, while the passive kitten was restrained, so it could not interact with the world. And even though the passive kitten was exposed to a rich and continuous stream of sense data and movement, it did not develop properly. I think what this really says is that, at least for animals, causality matters. The fact that you can interact with the world, and that it's your actions that have consequences, is an extraordinarily fundamental part of cognition. A lot of the focus on simple probability and Bayesianism can tell you a lot about what things happen together, but the problem is that it isn't just what things happen together that matters; it's what happens when you do something, and without that you will be misled. So this is a talk that I was really excited for you to hear, and I'm really excited to hear it too. Please welcome Elizabeth Silver.

Okay, so for those of you who were here last month: Ross said that he was going to give an Icelandic saga of a talk. This is going to be more of a Wikipedia article, in the sense that it will be a really high-level overview of material that really deserves a deeper treatment, the quality will be variable, and none of it is my own work. Okay, I'm not using the clicker. So, one ground rule: because
this talk is about causation, there's probably at least one person who's just waiting for the moment to say "correlation is not causation". If that's you, I'm going to ask you to hold that question till the end and see if you still have it. I hope that I'll give you some more precise reasons to be skeptical, because I promise you, I'm also skeptical. So, why study causal relationships? Machine learning is really good at prediction, but you don't need to know about causation in order to predict things well, so why study it at all? Firstly, we need to know about causal relationships in order to predict the results of interventions: if we're going to intervene and change something about the world, we can't use prior data to predict what we're going to see next. Secondly, the environment that we're studying may change over time, and we would like to know about relationships that are going to be robust to changes in the environment. And thirdly, sometimes we don't just care about predictions; sometimes we have more of a scientific interest in how the data is generated. We want to know why it happens the way it does, or how. So
in those cases, we want to know about the causal mechanisms producing our data. So, firstly, I'm going to tell you about the inputs and outputs of causal discovery methods; the outputs are probably going to be new to some of you. The input data is very familiar: just tabular data, with random variables in the columns and observations, or cases, in the rows. But the output representation is going to be of a different form. It's not going to be predictions of some random variable; it has to represent causal relationships between those variables. So that representation has to fit with how we think causal relationships work, and it's a difficult problem finding a good representation. One natural way to represent causal relationships is with structural equation models. In this example I have used completely general structural equations: I'm not assuming that they're linear functions, they could be any function. If a variable is caused by some other variable, then that variable will be an argument to its structural equation; each variable is a function of its causes, plus some variable-specific error term. Note that I've written the assignment sign instead of the equals sign, and this captures that the equality there is not just accidental: this is how we think those values are assigned, by some process. But for the rest of the talk I'm not really going to talk about structural equations; instead I will use graphical models as a
shorthand for structural equations. So here is an example of that. In this graphical model, for every variable that was an argument to one of the equations on the previous slide, there is now an edge into the corresponding variable. For example, lung cancer is a function of smoking and asbestos, so there are edges from smoking to lung cancer and from asbestos to lung cancer. The nodes, the random variables, come straight out of the column names in our data set, so we already know what they are. What we're trying to learn is the set of edges; that's the learning problem. Normally we assume that these models have no cycles, so I'm going to use the term DAG to mean directed acyclic graph; it sounds funnier in Australia than in America, where I learned about this stuff. The other note on terminology: directed graphs often use kinship metaphors, so I might say, for example, that smoking is a parent of lung cancer, or that lung cancer is a child of smoking, to represent those relationships. Those edges we're going to interpret in
terms of interventions. What that means is that if I were to intervene and, say, force someone to smoke (that's what this do operator represents: you're actually intervening, not just observing whether they smoke or not), you would see a change in the probability that they get lung cancer. Things to notice about this. Firstly, it is relative to some setting of the other variables; it doesn't have to be all settings, there just has to be at least one setting where you see this difference-making, so that you'd have an edge. Secondly, it's probabilistic. This represents our intuition about causality, that it doesn't have to be deterministic: we know that not everyone who smokes gets lung cancer, but smoking still causes lung cancer. Thirdly (whatever number I'm up to), this shows the difference between a probabilistic dependence and a causal dependence. There's a probabilistic dependence between having nicotine stains on your teeth and having lung cancer: if someone has nicotine-stained teeth, they're more likely to have lung cancer. But if I intervene and paint their teeth yellow, I'm not going to see a change in the probability of them having lung cancer. So this is the whole "correlation is not causation" thing. And fourthly, probabilistic dependence is symmetric: having yellow teeth is correlated with smoking, and smoking is also correlated with yellow teeth. But causal dependence is asymmetric: if I force someone to smoke, I increase the probability that they have nicotine stains on their teeth, but if I paint their teeth, I don't make them start smoking. So there's an asymmetry there. So this
representation covers a lot of our intuitions about what causal relationships are supposed to look like. Another thing to notice is that the true set of edges, which, remember, is what we're trying to learn, will depend on the set of variables we're looking at. For example, if I'm looking at the variables years of education and income, there should be an edge from education to income. But if I expand the set of variables I'm interested in to include skills, then maybe the only effect of education on income is by increasing skills, and so in the expanded model you don't need a direct edge from education to income. So, for example, if like me you go and do a PhD in philosophy, it doesn't increase your skills, so it has no effect on income.
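The structural-equation reading of a graph like this can be sketched in code. This is a minimal simulation added for illustration, not from the talk's slides: the graph is the smoking example, and every probability below is a made-up number. Each variable is assigned (the ":=" idea) as a function of its parents plus independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Exogenous variables: no parents, just noise.
smoking = rng.random(n) < 0.3
asbestos = rng.random(n) < 0.1

# Each remaining variable is assigned as a function of its parents
# plus its own noise term (the structural-equation ":=").
lung_cancer = rng.random(n) < (0.01 + 0.10 * smoking + 0.10 * asbestos)
yellow_teeth = rng.random(n) < (0.05 + 0.60 * smoking)

# Yellow teeth and lung cancer share the parent "smoking", so they are
# probabilistically dependent even though neither causes the other.
p_cancer_given_stains = lung_cancer[yellow_teeth].mean()
p_cancer = lung_cancer.mean()
print(p_cancer_given_stains > p_cancer)
```

The point is only the direction of the dependence, not the magnitudes: conditioning on stained teeth raises the estimated cancer rate purely through the common parent.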
[Applause] Another thing this representation covers is that it allows us to represent interventions. If we're going to represent an intervention, we add one node to the model, and one edge into whatever variable we are intervening on. So here I'm going to intervene on yellow teeth by painting them, and so I have the paint node. If I'm just randomizing what colour I paint people's teeth, that means that smoking no longer has an effect on whether people have yellow teeth or not, so that breaks the edge from smoking to yellow teeth. This kind of intervention is called a hard, or surgical, intervention, because it breaks that edge. You can also do what are called soft interventions, where you change the probability of a node but allow other things to affect it. This is common, for example, in economics: we can experiment on people's income by giving them more money, but it's really hard to get ethics approval to take away the money they already have. In those cases you don't break that edge, but you can put a new edge in. Smoking, here, we would think of as a confounder of the relationship between having yellow teeth and having lung cancer, because it's a common cause; that's how we represent confounders in this framework. And this illustrates how randomized controlled trials, which do hard interventions, break the influence of those confounders. Even if you are doing a soft intervention, you might not learn the effect of the yellow teeth variable on lung cancer, but you still have this intervention variable that you've added, which is exogenous: it has no confounders going into it. So that's how interventions break the influence of confounders.
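That edge-breaking can be checked by a quick simulation (again with my own illustrative numbers, not the talk's). Under observation, yellow teeth and lung cancer are dependent through the confounder smoking; under a hard intervention that randomizes teeth colour, the dependence disappears.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

smoking = rng.random(n) < 0.3
lung_cancer = rng.random(n) < (0.02 + 0.10 * smoking)

# Observational regime: smoking -> yellow teeth (confounded with cancer).
yellow_obs = rng.random(n) < (0.05 + 0.60 * smoking)

# Hard intervention do(yellow_teeth := paint): the paint node replaces
# smoking as the only parent, deleting the edge smoking -> yellow teeth.
yellow_do = rng.random(n) < 0.5

def risk_difference(exposed, outcome):
    """P(outcome | exposed) - P(outcome | not exposed)."""
    return outcome[exposed].mean() - outcome[~exposed].mean()

print(risk_difference(yellow_obs, lung_cancer))  # clearly positive: confounding
print(risk_difference(yellow_do, lung_cancer))   # approximately zero
```

Randomization makes the intervention variable exogenous, which is exactly why the second risk difference collapses to noise.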
Okay, so I'm going to mention this now and come back to it in more detail in a bit, but the output that we get from causal discovery algorithms can include some uncertainty. If you were to give data from this model to a bunch of causal inference algorithms, they would probably return something like this model down below, where they've identified that there are these three edges, and they've oriented two out of the three, but they can't tell which direction this third edge is going in. So this is learning as much as we can about the true structure, and representing the remaining uncertainty accurately. That's going to be our goal: not to learn everything, but to be clear about what we know and what we don't. There are kind of two steps in causal structure learning. Firstly, you want to learn what the true set of edges is; and then secondly, once you know that, once you know what the arguments to each structural equation are, you can fit each equation using standard statistical estimation techniques. The first part of the problem is harder, and I think more interesting, and it's also the one that I know more about, so that's all I'm going to talk about; I'm not going to talk about step two. Another way to think about step one is avoiding model misspecification: we all know that if we regress on the wrong set of variables, we'll get the wrong edge coefficients. Okay, so here is a high-level overview of what we're doing. In regular statistical inference,
we have probability theory, which tells us how any probability distribution would generate data; and then, when we observe data, we use statistical inference to go back to the generating distribution. We're doing a similar thing here with causal learning. We have the causal axioms, which tell us how each causal structure would constrain the kinds of probability distributions that could be generated by that structure; and then, when we see constraints on the probability distribution, we use those to do causal inference back to the set of structures that could have generated that distribution. That's the high-level picture. There are lots of different kinds of constraints, of which I am mostly going to talk about conditional independencies, because conditional independencies are the only kind of constraint that holds in all parametric families. There are a bunch of other interesting kinds of constraints that I might describe briefly. We also get constraints from background knowledge about the sampling conditions (if I know I've done an intervention, I know that I've broken some edges), and some of the constraints might come from our assumptions, about acyclicity and so on.
So, conditional independence constraints: what are they? I'm going to start with the Markov condition. The Markov condition follows from properties of structural equation models; it's pretty general. The local version of the Markov condition says that, for any variable X, if you condition on its direct parents in the graph (anything that has an edge into X), then all of its non-descendants become independent of X: the parents screen off all of its ancestors. And what we can infer from Markov is this: Markov implies that if there's no edge in the model between X and Y, there will be some conditional independence. That means that if we see conditional dependence no matter what we condition on, so that X and Y are dependent on each other whatever we condition on, we infer that there has to be an edge between them. That's the backwards inference: the Markov condition means that when we see a dependence, we can infer an edge. When does this fail? This fails if we have some unobserved confounders that we have not included in our model. For example, if I just ignore smoking and look only at yellow teeth and lung cancer: because neither of them causes the other, the true causal model has no edges in it, it's just an empty graph, but I will see a dependence between the two of them. So Markov would say: you've seen a dependence, you should be able to infer an edge. But that is incorrect in this case, because we have omitted that confounder, smoking. This is a violation of what we call causal sufficiency, which just says that we haven't omitted any common cause of any two variables in the model. Now, you can include a variable without observing it. If you haven't observed it, you maybe can't learn much about the structure, but you won't make a mistake where you think you know the structure even though you don't. So I don't worry
too much about violations of Markov. The second condition, though: Markov only lets us infer that edges exist, but we already know that the complete graph, where everything causes everything, can fit any probability distribution. To learn a sparse model, we need some criterion that allows us to exclude edges. For that we turn to the faithfulness condition, which is exactly the converse of the Markov condition. The faithfulness condition says that if there is an edge in the model, then those variables are going to be dependent no matter what you condition on; and that means that when you see some conditional independence, you can infer that there's no edge between them. So it's exactly the converse of Markov: it says that all the independencies in the distribution are those that Markov entails, and there are no extra ones. So when does faithfulness fail? Faithfulness can fail when we have causal pathways that exactly cancel out, or distributions that are set up exactly right so that dependencies vanish. In this example, say you're interested in the effect of birth control on blood clots, thrombosis; you're looking for the direct effect. But you know that birth control also reduces the chance of pregnancy, and pregnancy increases the risk of blood clots, so you include pregnancy in the model. If this were a linear model with these edge coefficients, then if alpha times beta equals gamma, those two pathways exactly cancel out, and you will see that birth control is independent of thrombosis, which is unlikely. The argument for faithfulness usually goes like this: that cancellation isn't really going to happen, because the set of distributions with these exactly cancelling paths is a surface in a higher-dimensional space of distributions. It has Lebesgue measure zero, so there is probability zero that things will land exactly on that surface. But there are cases where we expect faithfulness to fail, specifically in self-regulatory systems. The thermostat in your home ensures that the indoor temperature will be independent of the outdoor temperature, even though the heat outside affects your home: it is precisely set up to make that dependence vanish. In biology, for example, there are a lot of cases where we expect the distribution to be unfaithful; homeostasis is like your body's own thermostat. So people worry more about unfaithfulness than about Markov failing. So now I'm going to give
you an intuitive example, just to see how we can orient edges, because the standard "correlation is not causation" argument usually says we can't tell whether A causes B or B causes A. Sometimes you can. Say we have this toy model where we're modelling a car, and we include the petrol tank, the battery, and whether the car starts or not. You would think that the petrol and the battery charge are independent of each other: neither of them has anything to do with the other. But as soon as you learn that the car did not start, and you know that you have plenty of petrol, what can you infer about the battery? It's probably dead. So this example shows that when you have two parents of a common child in your graph (I'll keep using the laser pointer, even though it keeps wandering off), if there's no causal connection between the parents, you will see that they are independent; but as soon as you condition on the child, they become dependent. This is called collider bias, or selection bias. And it's a case where, if we see this pattern of unconditional independence but conditional dependence, we can fit this structure: we can orient these edges inwards, because this is the only structure that gives you that pattern.
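The car example is easy to reproduce numerically; here is a sketch with made-up failure rates. Marginally the two causes are uncorrelated, but conditioning on the collider ("the car didn't start") makes them strongly negatively dependent.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

petrol = (rng.random(n) < 0.9).astype(float)    # tank has fuel
battery = (rng.random(n) < 0.9).astype(float)   # battery charged, independent of petrol
starts = (petrol * battery).astype(bool)        # car starts iff both are fine

# Marginally, the two parents are (near) uncorrelated...
marginal = np.corrcoef(petrol, battery)[0, 1]

# ...but among cars that failed to start, knowing the tank is full
# makes a dead battery almost certain: collider / selection bias.
failed = ~starts
conditional = np.corrcoef(petrol[failed], battery[failed])[0, 1]

print(round(marginal, 3), round(conditional, 3))
```

The conditional correlation comes out strongly negative because, among the failures, "full tank" and "charged battery" can almost never both hold.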
Yes, so the upside-down T symbol means "is independent of": here, gas tank is independent of battery. Where it's got a line through it, it means "not independent of": gas tank is dependent on the battery, conditional on the car starting being zero. And on dependence versus correlation: correlation is a subset of the kinds of dependence, it's linear dependence. I will tend to say "dependence" precisely because sometimes we have no correlation even though we do have dependence, and that's used by some of these algorithms. But if hearing the word "dependence" is confusing, just hear
it as correlation. So: if we have the true model A causes B causes C, this will give us exactly one conditional independence: if you condition on B, you make A and C independent. But that same independence is implied by two other models: C causes B causes A, and B causes both A and C. So if we see that one independence, and everything else is dependent, all we can infer is that there's no edge between A and C; we can't orient the other two edges. We can represent this, what's called a Markov equivalence class, containing three different models, using an object called a complete partially directed acyclic graph, or CPDAG. It has two edges that are left unoriented, because each of those edges goes one way in some of the models in the class and the other way in others. But in this second Markov equivalence class, again there's just one conditional independence, and that's A is independent of C conditional on nothing. This is the only model in its Markov equivalence class, so the CPDAG that represents that equivalence class orients all the edges. We'd really like to be in this situation, where we know more about the model; but how much we know about the model depends on the way the world is, and we have no control over that. So we want to accurately represent our uncertainty. Right, okay. So I've given you some intuitive
examples, and I want to be clear that there is a recipe for doing this. The d-separation criterion tells us all of the conditional independencies that are implied by a graph: you start with a graph, and if two things are d-separated, then they should be conditionally independent in the probability distribution. D-separation lets you move between a graph-separation criterion and a probabilistic independence. I don't really need to go through all the conditions; I just want to be clear that they exist, that this work has been done, and continue with my high-level Wikipedia-style overview. So this powers the constraint-based algorithms. The idea of constraint-based search: you start with some data, you feed it to a statistical inference engine, which tells you all of the conditional independencies that hold in the distribution your data comes from; that gives you the constraints. Then you feed those constraints to a causal discovery algorithm, which will tell you a Markov equivalence class; it should tell you exactly the Markov equivalence class that implies that same set of constraints. Notice that the statistical inference engine is separate from the causal discovery algorithm, so in this sense constraint-based search is nonparametric: you can give it any nonparametric conditional independence test you like. So that's nice, that's
general. But it's still really hard, and the reason it's hard is that there are a lot of different models we're searching through. Think about a directed acyclic graph: if we have n nodes, there are n choose 2 pairs of nodes, which is order n squared, and every pair of nodes can either have an edge between it or no edge, so there are at least 2 to the (n choose 2) models. If we consider the orientation of each edge, it could go forward or backward or be absent, so there are at most 3 to the (n choose 2) models. Some of them will be ruled out because they're cyclic, but still, the search space grows super-exponentially in the number of nodes. So this is a very difficult discrete search problem, and the advances in causal discovery are methods that make that search process efficient.
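To make that growth concrete, here is the back-of-envelope count from the talk: each of the C(n, 2) node pairs is forward, backward, or absent, giving at most 3^C(n, 2) graphs (an overcount, since the cyclic ones should still be excluded).

```python
from math import comb

# Upper bound on the number of structures over n variables:
# each unordered pair is ->, <-, or no edge.
for n in range(2, 8):
    print(n, 3 ** comb(n, 2))
# Already at n = 7 there are over 10 billion candidate structures.
```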
The pictures make it look so easy, but yes, there's a lot of computationally intense stuff happening here. So I'll go through the PC algorithm, which is relatively simple. It is named after Peter Spirtes and Clark Glymour, that is, Peter and Clark; if you're wondering why they didn't use their last names, it's because they'd already invented the Spirtes-Glymour-Scheines (SGS) algorithm, which is like PC, only less efficient. They do invent other algorithms, and they name them other strange things; PC, unfortunately, is quite difficult to Google. All right, the idea behind PC. Let's say we're starting with the true graph down the
bottom, where A and B both cause C, and C causes D. PC starts with the complete graph: everything is connected to everything else. The first step is to test the zero-order conditional independencies, and we find one: A is independent of B, conditional on nothing. By faithfulness we're allowed to remove that edge, but by Markov we have to keep all the other edges in there; so we remove that first edge. Then we move on to the first-order conditional independencies, and we find two: A is independent of D given C, because C screens it off, and B is independent of D given C, again because C screens it off. Everything else is dependent, so again faithfulness lets us remove those two edges. Now we have what's called the skeleton of the graph: it has all the right adjacencies. Oh yeah, another thing that makes PC efficient, which doesn't really come across in this small example, is that the set of variables included in the conditioning set is limited, again by the Markov condition: all the screening off that you need to do, you can do by conditioning on the parents of each node. We don't know which variables are the parents, but we know they have to be a subset of the neighbours, which is everything connected to a node regardless of direction; so we only condition on sets drawn from the neighbours. As the graph gets sparser, the number of tests that you have to do decreases, and we're starting with the low-order conditional independencies, which are more robust.
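The skeleton phase just described can be sketched in a few lines of Python. This is my own simplified sketch, not the full PC algorithm: `indep` stands in for whatever conditional independence test or oracle you supply, and here I hand-code the d-separation facts of the A → C ← B, C → D example.

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """Skeleton phase of PC (simplified sketch).

    `indep(x, y, S)` answers "is x independent of y given the set S?"
    via a statistical test or, as below, a d-separation oracle."""
    adj = {v: set(nodes) - {v} for v in nodes}
    order = 0
    # Keep testing while some node still has enough neighbours to condition on.
    while any(len(adj[x] - {y}) >= order for x in nodes for y in adj[x]):
        for x in nodes:
            for y in list(adj[x]):
                # Only condition on current neighbours of x: parents suffice
                # by Markov, and parents are a subset of the neighbours.
                for S in combinations(adj[x] - {y}, order):
                    if indep(x, y, set(S)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        order += 1
    return adj

# Hand-coded independence facts for the true graph A -> C <- B, C -> D.
def oracle(x, y, S):
    pair = frozenset({x, y})
    if pair == frozenset({"A", "B"}):
        return not (S & {"C", "D"})   # conditioning on the collider opens A-B
    if pair in (frozenset({"A", "D"}), frozenset({"B", "D"})):
        return "C" in S               # C screens D off from A and from B
    return False                      # adjacent pairs are never independent

skeleton = pc_skeleton(list("ABCD"), oracle)
print(skeleton)  # A-C, B-C, C-D: the correct adjacencies
```

Note how the zero-order test removes A-B and the first-order tests remove A-D and B-D, exactly mirroring the walkthrough above.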
The next step, since we see no more conditional independencies, is to orient the v-structures. You take every triple of nodes where two of them are connected to the third, and you see whether you get a dependence between the two endpoint nodes when you condition on that third node. We find one v-structure: A and B can be oriented into C, because of this dependence, A being dependent on B conditional on C; it's just like the car example. So we have oriented two edges, and we have a v-structure and an undirected edge. If we just left that edge undirected, that would imply that in the Markov equivalence class there are some models where D points into C; but if that were the case, then we would have found another v-structure, and we didn't. So we can orient the last edge away from the v-structure, and then PC is done; it exits, having found the true graph, in this small example that I picked specifically because it was easy. The benefit of PC is that it's very efficient, especially on sparse models; if you have a really dense graph, where everything's connected to everything else, you will have to do all of the conditional independence tests just to learn that it's that dense. The downside of PC is that it can propagate errors: if it makes a mistake early on, everything else is going to be screwed up. And it does really rely on having a good conditional independence test. Okay, so that's an example of constraint-based search. Another thing
that's done is score-based search. In score-based search, for each Markov equivalence class of models, you pick one model within that class and fit it to the data; if it were a linear model, you'd learn the edge coefficients. Once you've fitted it, you can score the fit, using for example the BIC score, and once you've done that for all the models you're interested in, you can just take the argmax: the model that gives you the best score. So this is more of a Bayesian approach. Again, it suffers from the problem that the search space is huge: you'd have to score a lot of models. One approach that people take is to artificially ignore most of the search space; a common way of doing this is just to say, I really think that no node is going to have more than four parents, and that will dramatically limit your search space if you have a lot of nodes.
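For concreteness, here is what scoring one node's candidate parent set with BIC looks like under a linear-Gaussian assumption. This is my own sketch (constant terms are handled loosely, and the penalty counts only the intercept and the edge coefficients); the data is simulated so that c's true parents are {a, b}, and that parent set should score best.

```python
import numpy as np

def bic(y, parents, n):
    """BIC for one node given a matrix of candidate parents (higher is better).
    Linear-Gaussian sketch: fit by least squares, penalise parameter count."""
    X = np.c_[np.ones(n), parents] if parents.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = (resid ** 2).mean()
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * X.shape[1] * np.log(n)

rng = np.random.default_rng(3)
n = 5_000
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.5 * rng.normal(size=n)   # true structural equation for c

score_none = bic(c, np.empty((n, 0)), n)
score_a = bic(c, a[:, None], n)
score_ab = bic(c, np.c_[a, b], n)
print(score_ab > score_a > score_none)
```

The log-likelihood gain from including a true parent swamps the complexity penalty, which is why a consistent score can drive the greedy search described next.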
another option which I'm going to describe is to take a greedy approach where you don't score all the models but you move through the space of models
trying to find a better scoring one so if you have a consistent scoring
criterion and it's been shown that BIC is consistent you'll find that okay if we have these three models the true model is G star and then we have an augmented model G plus where we've added an edge and removed an independence and a
diminished model G minus where we've removed an edge and created an independence we will find that the score of the Augmented model sorry the score
of the true model is always greater than that of the Augmented model and the Augmented model is always better than the diminished model it's always best to
account for all the dependencies that you have even if you have some extra ones and it's even better again to not
have any extra ones at least according to the BIC score so the really nice thing is a beautiful proof by Max Chickering which shows that these score differences hold not just for the whole model but they also
apply to little sub graphs so we can move through the space of models by adding and removing individual edges and
then if we have infinite data the consistent score will give us the true model so this is the idea behind greedy
equivalence search you start with the empty model unlike PC which starts with a complete model and then in the forward phase while there's still some valid edge addition that improves the score you add that edge and just keep doing that till the score stops improving and then you start on the backward phase while there's some edge that you can validly remove and still improve the score then
you remove it GES at each stage finds the Markov equivalence class of whatever model it's gotten to so it moves through a space of Markov
equivalence classes which means the search space is smaller but more complicated so for example if you
add the first edge between two nodes the Markov equivalence class of that model is going to have an undirected edge so you start by adding a directed edge and then in the next step you unorient it and then you go to add another edge so
because the score of an augmented model is always better it's always going to be worth it to add extra dependencies to account for the ones in your data the
forward phase is always going to account for all of your dependencies and then if you've accidentally added some extra ones the backward phase will get rid of
them this is why we only need two phases if we have infinite data so you might wonder why I'm talking about this infinite-data consistency result because we never have infinite data and so I'm emphasizing the theoretical results
because real finite sample validation is really hard for causal models this is kind of the soft underbelly of
causal discovery how do we validate our
results yes eventually we'll get there so if we're doing regular statistical
learning we can validate on the same kind of data that we use to train the model but for causal inference we have
to test on the outcome of interventions which we don't necessarily have that's why we're doing causal discovery so it's
really hard to find a good validation set for causal discovery and the space of potential experiments is really large
you could intervene on any one of the variables in your model or any subset of them so the power set of your variables and you could decide to set those values
any way you like so it's a huge space of potential experiments the model that you're learning is going to have
implications for all of them so the alternatives if you can't do that kind
of validation are you know to get theoretical results and you can validate on simulated data you create a generative model you know what the true causal structure is generate some synthetic data from it and
then see if you can take that data and learn the generative model so this is what is standardly done in causal discovery papers and it's a pretty good
approach but it doesn't necessarily assuage the really bad worries we have about this approach because it's quite
difficult to generate synthetic data where the assumptions of the causal algorithms will be violated in a
realistic way and we know that almost every case in reality is going to violate some of these assumptions whether causal sufficiency or acyclicity or
faithfulness but we don't know how badly and what we want to know is how robust the algorithms are to those violations
of the assumptions you can also pick a known causal relationship that you know from background knowledge but that's kind of a shorthand way of saying we've
done the experiment before so I'm gonna give you a couple of examples where validation was done really well because
it can be done so this research group Stekhoven et al were developing a new causal discovery algorithm but they validated it on this plant Arabidopsis thaliana looking for genes that regulate flowering time in Arabidopsis so they had microarrays giving them the gene
expression of all genes in the genome and that's thousands of variables and
also for each of those plants what time did that plant flower so they used a variant of PC called PC-stable which I forgot to describe earlier and subsampling to find edges that were stable on various subsamples of the data and they had an extra gadget called IDA for estimating
the model and they finally got to their 25 top genes and of those 25 5 of them were already known to influence flowering time so that's pretty good
finding a needle in a haystack but they went further and tried to test the other 20 of those 20 there were 13 where there were commercially available mutant
strains someone had already you know mutated that gene and you could just buy the mutant seeds so they grew the plants and of those 13 strains nine of them
produced enough viable plants to measure flowering time accurately and of those nine four of them had
measurably different flowering time so from just running an algorithm on gene
expression they were able to really help prioritize experiments to find new regulators of flowering time so if you're a biologist this is really gonna
save you a lot of time you could be testing any of those tens of thousands of genes in the genome but this one
experiment found four good regulators so another validation set is actually a Kaggle challenge from 2012 or 2013 on cause and effect pairs hence the little chicken and egg picture so you'll note that there's no conditional independence that allows you to distinguish the direction of
causation between just two variables without any other context but I mentioned that there are a bunch of other constraints that you can use in causal discovery and that's what the
winning entries in this challenge used so the training data there were about 4,000 training pairs but most of them
were semi-artificial they mixed together some real variables to create artificial causal relationships but they did have
hundreds of real pairs where just from background knowledge we know the direction of causation and this is an example so you have a fifty percent chance of getting it right would anyone like to guess whether A causes B or B causes A in this example
this isn't chickens but I'll just tell you so B in this case is
altitude and A is temperature so you might think that we can cool a mountain down if we build it higher but we probably can't make it taller by cooling it down right so on the interventionist theory you know B causes A but A doesn't cause B and if you had been
using some of the causal discovery algorithms that use other constraints you might have noticed that this looks
like it's a non-invertible function that's been inverted it looks like there's a bit of a curve there and that is one of the signs of a causal relationship that's been fitted the wrong way so the winning entry in this challenge had an accuracy
of about 0.8 rather than the 0.5 that we'd expect from chance actually I think they had some cases with a confounder so you'd expect the accuracy to be even
lower by chance so pretty impressive and if you want a large-scale evaluation of your causal discovery algorithm that
data is still there so I just wanted to give you an intuition for LiNGAM so LiNGAM stands for linear non-Gaussian acyclic models and the idea behind it is if I set up a linear but non-Gaussian model so X has some uniform noise and Y has its own uniform noise plus some multiple of X if I try to fit Y on X which is the true direction of causation I find that
the residuals from that regression have no association with X but if I fit the reverse model if I regress X on Y the residuals from that regression will be associated with Y they won't be correlated the linear correlation has been removed by the regression so here's a case of dependence without correlation
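That asymmetry is easy to simulate; here is a minimal sketch (the coefficient 0.9 and the magnitude-correlation dependence proxy are my own choices, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# linear non-Gaussian model: X causes Y, both with uniform noise
x = rng.uniform(-1, 1, size=n)
y = 0.9 * x + rng.uniform(-1, 1, size=n)

def fit_residuals(target, predictor):
    # ordinary least squares of target on predictor, returning residuals
    coef = np.polyfit(predictor, target, 1)
    return target - np.polyval(coef, predictor)

def dependence(resid, predictor):
    # crude nonlinear-dependence proxy: correlate the magnitudes
    # (the plain linear correlation is ~0 in both directions)
    return abs(np.corrcoef(np.abs(resid), np.abs(predictor))[0, 1])

forward = dependence(fit_residuals(y, x), x)  # causal direction
reverse = dependence(fit_residuals(x, y), y)  # anti-causal direction
print(forward < reverse)  # residual dependence shows up only the wrong way round
```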
but you can see there's a dependence because there's no data in this little corner here but there is data
along here so depending on the value of x you either will see samples of Y in
that range or not so that's the intuition for how these non Gaussian distributions give you constraints that
you can use to distinguish the direction of causation with only two variables and there's a lot more stuff with nonlinear
models as well that I have not made slides on all right so we have all of these assumptions now let's talk about some of the other things that go wrong I
talked a bit about how Markov and faithfulness can go wrong so here's another thing that can go wrong I really like talking about the problems let's
say you have a variable that you want to intervene on like cholesterol there were a whole lot of clinical trials for
treatments that would lower cholesterol that wound up not having an effect on heart disease even though they successfully lowered cholesterol like
the coronary drug project and the reason that they didn't successfully reduce heart disease is that cholesterol is actually two different things
it's high-density cholesterol which reduces heart disease and low-density cholesterol which increases heart disease
so if you don't know which one you're intervening on you can't predict the result of that intervention I really wanted to mention this problem because it's particularly an issue in machine
learning where we do a lot of feature engineering if you throw every combination of your features every transformation of them into the model then you're going to have a lot of variables where it's not clear what it means to intervene on them
and the causal model won't make sense so I don't have an answer for this because I don't know what counts as like the
causally relevant set of variables just pointing it out as an issue for applying these methods in a machine learning
context yeah right you can only intervene on both or neither which is something I should have mentioned with interventions
we require that you can't have a fat-hand intervention which is an intervention that hits more than one node in the model if you have two nodes where you can't intervene on one without the other then the model's not specified correctly to have a causal interpretation so another problem is measurement error
let's say that I haven't measured exactly how much people smoke but I've just asked them like do you smoke now maybe they have a 40-pack-year history but they just quit last week in that case I've created a new variable smoking yes/no which is a child of the exact amount that they smoked and now I haven't measured the exact amount that they smoke so I've grayed this out in the model and unfortunately now
if I try to condition on this coarsened variable smoking yes no I won't screen
off yellow teeth from lung cancer I can't actually break that path so this is a really pervasive problem we won't
get the conditional independencies that we expect because we haven't measured
what we think we've measured and another really foundational and worrying thing
is that causal search is not uniformly consistent back to that unfaithfulness example let's say that it is a faithful
model these edge coefficients don't precisely cancel out alpha times beta equals gamma plus some little epsilon it's not zero so it's a faithful model but for any finite sample size I can pick an epsilon that's small enough that my constraint-based or score-based algorithm will say that there's no edge from birth control to thrombosis in that case the error I'm making in terms of how far I am from the true
coefficient of the model is arbitrarily large because I've said that gamma is zero when in fact gamma could be
anything so for any finite sample size I can be arbitrarily wrong even though if I had infinite data I would get the
truth because it's a faithful model so a response to this is to say well let's bound epsilon away from zero let's say we can't be that close to unfaithfulness it doesn't matter how small epsilon is if we assume it is some fixed nonzero number then we get uniform consistency so it's a natural response unfortunately Caroline Uhler has done this great work on the geometry of unfaithfulness so I said the
distributions that are unfaithful are a surface in a higher dimensional space and she's visualized the surface for
a tiny model with just three edges when you add up all the different surfaces where any of the covariances in that model vanish because they're so convoluted they wind up taking most of the space most of the volume so for any little epsilon around that surface if you're
gonna say the model can't be in that area you wind up taking a lot of the volume of possible distributions so the strong faithfulness assumption where you
say we can't be any closer to unfaithfulness than epsilon is a really strong assumption that rules out a lot of the
distributions so yes I believe it does but I don't have a good enough
understanding of Caroline's paper to really say so you might now be asking
why use causal learning at all and I would say it's because there are no better options so if you're really in
the situation where you don't know what the structure is you just have a whole lot of data often what we see say psychologists doing is guess-and-test I'm just gonna guess that the model is this I'll fit it to the data and then
I'll test the fit and if the p-value is big enough I'll say that's the model and there's no consistency results for
guess-and-test it is probably going to do worse than one of the causal discovery algorithms so whatever worries you have about
causal discovery and I have a lot I have more worries about guess-and-test so that's my sales pitch this is why you should use these methods and that is all
[Applause] we do have fun questions I have like
three but I was just you know the same sort of thing yeah so the question is when you are seeing weird results with cholesterol how do you know how to modify the model I don't know I think that's outside the causal discovery theory I think the only way we figured it out was by doing finer-scale mechanistic investigations of how
heart disease works but I am not a medical researcher so I don't know exactly the history of how we figured that out and I don't know how to do it in general [Music]
right so the question was about the parameterization of the model so yeah I said earlier on that I'm dividing the problem of causal discovery into two parts first learn the structure and then
fit the structural equation models so definitely the learning the structure is
a discrete task either the edge is there or it's not and then the problem of figuring out how strong the edge is or
whether the influence of two parents on the common child is going to be interactive or if they're going to be just additive I that is a statistical
estimation problem and once we know what the arguments are in each of those structural equation models we can then fit it using whatever fancy statistical
estimation tools we want sometimes it's really hard to fit those equations especially if some of the variables in them are unmeasured and there's a whole other literature on figuring out how
to fit those things when you know the causal structure but I've ignored it because I don't know enough about it [Music]
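A minimal sketch of that second, purely statistical stage, assuming the structure X → Y ← Z has already been learned (the data and the coefficients 1.5 and -0.5 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
# Suppose structure learning already gave us X -> Y <- Z; the remaining
# job is to estimate Y's structural equation from its parents
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.5 * x - 0.5 * z + rng.normal(size=n)  # true (hidden) coefficients

design = np.column_stack([x, z, np.ones(n)])       # parents plus intercept
coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # ordinary least squares
print(np.round(coef[:2], 1))  # recovers roughly [1.5, -0.5]
```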
yes so there is an open source package called Tetrad which is written in Java and was developed by people at Carnegie Mellon so the question was is there software to do this I didn't have time to do a live demo but maybe next time yeah so Tetrad you can download online search for maybe Tetrad Carnegie Mellon because there are other things called Tetrad or Tetrad causal discovery and there's some documentation on it as well
[Music] so the question is how big a problem are cycles there are some causal discovery
algorithms that were developed exactly to learn cyclic models so cyclic causal discovery or CCD was the first one
cyclic models can differ in how we interpret them something like supply and demand you might think of as a cyclic model but we could also interpret it as a model which is acyclic if you consider it over time so if you have time-indexed observations you can then fit that model
using a dynamic Bayes net learning algorithm but then there's others where either it's acyclic over time but we
just don't have good enough measurements to tease that out or maybe we really do think it's cyclic and then there are some algorithms that can handle that but
only for limited distributions CCD works for linear Gaussian distributions I don't think there's a general way of doing it yet
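The time-unrolled reading of a cycle can be sketched like this (the supply/demand coefficients are invented): a feedback loop becomes acyclic once each variable at time t depends only on values at t-1, so lagged regressions recover it.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5000
supply = np.zeros(T)
demand = np.zeros(T)
# A supply <-> demand "cycle", unrolled over time: each variable at t
# depends on the other at t-1, which is acyclic in the time-indexed graph
for t in range(1, T):
    supply[t] = 0.5 * demand[t - 1] + rng.normal()
    demand[t] = -0.5 * supply[t - 1] + rng.normal()

# Fitting the lagged equation recovers the loop coefficient for supply
design = np.column_stack([demand[:-1], np.ones(T - 1)])
coef, *_ = np.linalg.lstsq(design, supply[1:], rcond=None)
print(abs(coef[0] - 0.5) < 0.05)  # close to the true lagged coefficient
```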
[Music]
so the question is how to deal with
interactions when you have multiple parents that interact so there's a kind of easy answer to that where I say again that's a
statistical estimation issue but some kinds of interactions do cause problems for causal discovery so there's an
assumption that I don't know if I even listed it called compositionality which says that if a variable has a causal
influence on another one then that edge that dependence should appear even when you don't take other variables into
account but we can imagine cases where this doesn't hold like an exclusive or function each one of the parents could
be a random coin flip and the child could be you know on if both the parents are on and off of both of them are off like switches and that in that case
you've got an interaction and you will see no dependence between either parent individually and the child
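A quick simulation of that XOR case (sample size and thresholds are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
a = rng.integers(0, 2, size=n)  # parent: fair coin flip
b = rng.integers(0, 2, size=n)  # parent: fair coin flip
c = a ^ b                       # child: exclusive or of the parents

# Marginally, each parent looks independent of the child...
print(abs(np.corrcoef(a, c)[0, 1]) < 0.02)
# ...but holding the other parent fixed, the dependence is total
print(abs(np.corrcoef(a[b == 0], c[b == 0])[0, 1]) > 0.99)
```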
so it's an unfaithful model and compositionality fails so yeah interactions can make it really hard to
learn the structure but if we don't have that problem and we can learn the structure then I'm going to say that estimating how strong those interactions are is just a statistical
estimation problem someone else's problem does it help the causal discovery process if we know what variables we intervened on absolutely so the question is can you do better causal discovery if you give the algorithm information about what we
intervened on the answer is yes we can add that as a constraint and then that will rule out some of the models that would have been within the Markov equivalence class so we can narrow down
our results and Tetrad will allow you to do that I think we had one last question well may I ask one yeah so my question was do you know of any work that tries to do this causal extraction live so in the examples it kind of assumed that you had the data and you were post hoc analyzing it but it seems
like it's something you could do online where each time you needed an independence test you could actually then like ask a robot to do it or
something yeah so there are approaches that Frederick Eberhardt and his collaborators are working on using SAT solvers so what you give the SAT solver is a set of conditional independence constraints which have been translated into constraints for the
solver and then you can query it about the model you can ask is there an edge between these two variables and it'll say yes no I don't know and you can keep
feeding it constraints continuously and it will keep refining what it knows because what you can give the model is
such a weird subset of independence constraints I don't think they've figured out a way to represent what the SAT solver knows about the model like there's no complete partially directed acyclic graph no handy visual representation for it but you can query it and you can keep giving it information so that is
what I would call an online approach to this awesome thank you well that was such a great presentation
[Applause]