Elizabeth Silver — Causality and Causal Discovery
By Machine Learning and AI Meetup
Summary
Topics Covered
- Causality is fundamental to cognition
- Why study causal relationships at all
- Causal dependence is asymmetric
- How constraint-based causal discovery works
- Use causal learning because there are no better options
Full Transcript
I guess we should start with the joke. People are still getting beers, but I can probably do the joke while the beers happen. Anyway, this one is from my Twitter feed, so if you follow me, sorry, but I don't have that much new material. Here we go: this is the enterprise buzzword translator. When you hear someone in an enterprise saying "business analytics", what they mean is a spreadsheet. When you hear them say "process automation", what they mean is a spreadsheet. When they say "predictive model", you betcha, it's a spreadsheet. And when they say "big data", they mean a spreadsheet that's too big to email. And of course "machine learning" is linear regression, but done in a spreadsheet. Thanks, thanks. That's probably the best laugh I've gotten from one of my MLAI jokes, so I feel pretty good about that. So I think the next thing we'll be doing is the news. Normally I have more of a spiel, but today I just don't. The news is a section where you can call out whatever you like and I will write it down, and someone from the audience will surely demonstrate. What's making you happy and interested this week? Yeah, I love it.
That is really, actually, super cool. Yes: so Silverpond has an elephant detection model running in the wild, catching poachers. Considering that's an application of artificial intelligence which isn't, you know, neoliberal hypernormalisation, that's really great. And where can people find it? [inaudible] Okay. Yes, right, okay. What else is going on in the Melbourne or international AI scene? I'll start naming people if I don't hear something soon, so consider it your moral duty to say something. Oh, thank you. Even to install it you need to have a Google Cloud account? Zing, thank you. Has anyone given it a go? I saw it, but I didn't actually try it. Russ, what was your experience? [inaudible audience comment about credit-related data] Yeah, well, you know, you're sitting there trying to invent the next generation of artificial intelligence, and what you really need is the latitudes and longitudes of toilets in Portland. Excellent. So, any projects that people are working on? Even if it's just something you just started doing, or just thought about doing, you should probably say it; no one else is going to do it for you. I never know how long I should give people, how awkward they need to feel, before someone says something just to put us out of our misery. But anyway, I'm pretty good at this; I'm used to it now. All right.
Suzie, and then Alex. [Suzie] Yeah, I'm looking to develop an algorithm that comes up with unbiased results. An example is when you use Google or other search engines: what comes up, more often than not, is formulated results. [partly inaudible] She's looking for assistance and guidance once the project gets going. So if there's anyone here who is really keen to get that on its way, look into that project. If you're interested in a really cool project, shoot me a message and I'll put you in contact. Okay, all right. [Alex] [partly inaudible] We'll sort out the details here; the magic of editing. Alex, I knew you had something. Yeah, I was going to buy time for people with one bit about that project. Anyway, next week is a cloud summit, so I have a good prepared joke about this one: if you're a Chinese national, you are allowed to attend the summit, but you can't go to all of the talks. Nothing? Nothing? Come on, come on.
[Laughter] They come with DLSS, including supersampling; you get really good [inaudible]. Yeah, that's actually the tensor cores thing, which is really cool, because you used to have to get enterprise-grade cards to get the tensor cores. I saw some people doing tests, and on deep learning flops that card is going to blow all of the other ones out of the water, at least right now. Okay, well, let's have one more thing, why not; this is one of the best editions of the news we've ever done, so I'm touched and genuinely impressed. Who wants to be the last one? Oh, by the way, they work for Accenture or IBM. Okay, well, here we go.
So our next speaker, Elizabeth Silver, is someone I have been talking to about doing a talk for a while; I think this has been six months in the making. I was really excited to have it. To give a little bit of background: a lot of the talks we've been doing recently have been a little bit off the mainstream deep learning path, just because there are enough people talking about CNNs out there that you can get your fill on YouTube. So I've tried to highlight people with important and interesting ideas that I think will be incredibly influential, but that aren't yet universally appreciated. This talk is about causality, and I think it's one of the most important problems facing machine learning today. There was a famous experiment, done I think a little bit before they tightened up the ethics on animal experimentation, with two kittens. One kitten was in a cart attached to the other, and would just be passively dragged around by the active kitten; the cart was facing backwards, so one kitten would look one way and the active kitten the other. The active kitten could move around the world as it wished, while the passive kitten was restrained, so it could not interact with the world. And even though the passive kitten was exposed to a rich and continuous stream of sense data and movement, it did not develop properly. I think what this really says is that, at least for animals, causality matters. The fact that you can interact with the world, and that it's your actions that have consequences, is an extraordinarily fundamental part of cognition. A lot of the focus on simple probability and Bayesianism can tell you a lot about what things happen together, but the problem is that it isn't just what things happen together that matters; it's what happens when you do something, and without that you will be misled. So this is a talk that I was really excited for you to hear, and I'm really excited to hear it too. Please welcome Elizabeth Silver.

Okay, so for those of you who were here last month: Ross said that he was going to give an Icelandic saga of a talk. This is going to be more of a Wikipedia article, in the sense that it will be a really high-level overview of material that really deserves a deeper treatment, the quality will be variable, and none of it is my own work. Okay, I'm not using the clicker. So, one ground rule: because
this talk is about causation, there's probably at least one person who's just waiting for the moment to say "correlation is not causation". If that's you, I'm going to ask you to hold that question till the end and see if you still have it. I hope that I'll give you some more precise reasons to be skeptical, because I promise you, I'm also skeptical. So, why study causal relationships? Machine learning is really good at prediction, but you don't need to know about causation in order to predict things well, so why study it at all? Firstly, we need to know about causal relationships in order to predict the results of interventions: if we're going to intervene and change something about the world, we can't use prior data to predict what we're going to see next. Secondly, the environment that we're studying may change over time, and we would like to know about relationships that are going to be robust to changes in the environment. And thirdly, sometimes we don't just care about predictions; sometimes we have more of a scientific interest in how the data is generated. We want to know why it happens the way it does, or how. So
in those cases, we want to know about the causal mechanisms producing our data. So, firstly, I'm going to tell you about the inputs and outputs of causal discovery methods; the outputs are probably going to be new to some of you. The input data is very familiar: just tabular data, with random variables in the columns and observations, or cases, in the rows. But the output representation is going to be of a different form. It's not going to be predictions of some random variable; it has to represent causal relationships between those variables. So that representation has to fit with how we think causal relationships work, and it's a difficult problem finding a good representation. One natural way to represent causal relationships is with structural equation models. In this example I have used completely general structural equations: I'm not assuming that they're linear functions, they could be any function. If a variable is caused by some other variable, then that variable will be an argument to its structural equation; each variable is a function of its causes, plus some variable-specific error term. Note that I've written the assignment sign instead of the equals sign, and this captures that the equality there is not just accidental: this is how we think those values are assigned, by some process. But for the rest of the talk I'm not really going to talk about structural equations; instead I will use graphical models as a
shorthand for structural equations. So here is an example of that. In this graphical model, for every variable that was an argument to one of the equations on the previous slide, there is now an edge into the corresponding variable. For example, lung cancer is a function of smoking and asbestos, so there are edges from smoking to lung cancer and from asbestos to lung cancer. The nodes, the random variables, come straight out of the column names in our data set, so we already know what they are. What we're trying to learn is the set of edges; that's the learning problem. Normally we assume that these models have no cycles, so I'm going to use the term DAG to mean directed acyclic graph; it sounds funnier in Australia than in America, where I learned about this stuff. The other note on terminology: directed graphs often use kinship metaphors, so I might say, for example, that smoking is a parent of lung cancer, or that lung cancer is a child of smoking, to represent those relationships. Those edges we're going to interpret in
terms of interventions. What that means is that if I were to intervene and, say, force someone to smoke (that's what this do operator represents: you're actually intervening, not just observing whether they smoke or not), you would see a change in the probability that they get lung cancer. Things to notice about this. Firstly, it is relative to some setting of the other variables; it doesn't have to be all settings, there just has to be at least one setting where you see this difference-making, so that you'd have an edge. Secondly, it's probabilistic. This represents our intuition about causality, that it doesn't have to be deterministic: we know that not everyone who smokes gets lung cancer, but smoking still causes lung cancer. Thirdly (whatever number I'm up to), this shows the difference between a probabilistic dependence and a causal dependence. There's a probabilistic dependence between having nicotine stains on your teeth and having lung cancer: if someone has nicotine-stained teeth, they're more likely to have lung cancer. But if I intervene and paint their teeth yellow, I'm not going to see a change in the probability of them having lung cancer. So this is the whole "correlation is not causation" thing. And fourthly, probabilistic dependence is symmetric: having yellow teeth is correlated with smoking, and smoking is also correlated with yellow teeth. But causal dependence is asymmetric: if I force someone to smoke, I increase the probability that they have nicotine stains on their teeth, but if I paint their teeth, I don't make them start smoking. So there's an asymmetry there. So this
representation covers a lot of our intuitions about what causal relationships are supposed to look like. Another thing to notice is that the true set of edges, which, remember, is what we're trying to learn, will depend on the set of variables we're looking at. For example, if I'm looking at the variables years of education and income, there should be an edge from education to income. But if I expand the set of variables I'm interested in to include skills, then maybe the only effect of education on income is by increasing skills, and so in the expanded model you don't need a direct edge from education to income. So, for example, if like me you go and do a PhD in philosophy, it doesn't increase your skills, so it has no effect on income.
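The structural-equation reading of a graph like this can be sketched in code. This is a minimal simulation added for illustration, not from the talk's slides: the graph is the smoking example, and every probability below is a made-up number. Each variable is assigned (the ":=" idea) as a function of its parents plus independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Exogenous variables: no parents, just noise.
smoking = rng.random(n) < 0.3
asbestos = rng.random(n) < 0.1

# Each remaining variable is assigned as a function of its parents
# plus its own noise term (the structural-equation ":=").
lung_cancer = rng.random(n) < (0.01 + 0.10 * smoking + 0.10 * asbestos)
yellow_teeth = rng.random(n) < (0.05 + 0.60 * smoking)

# Yellow teeth and lung cancer share the parent "smoking", so they are
# probabilistically dependent even though neither causes the other.
p_cancer_given_stains = lung_cancer[yellow_teeth].mean()
p_cancer = lung_cancer.mean()
print(p_cancer_given_stains > p_cancer)
```

The point is only the direction of the dependence, not the magnitudes: conditioning on stained teeth raises the estimated cancer rate purely through the common parent.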
[Applause] Another thing this representation covers is that it allows us to represent interventions. If we're going to represent an intervention, we add one node to the model, and one edge into whatever variable we are intervening on. So here I'm going to intervene on yellow teeth by painting them, and so I have the paint node. If I'm just randomizing what colour I paint people's teeth, that means that smoking no longer has an effect on whether people have yellow teeth or not, so that breaks the edge from smoking to yellow teeth. This kind of intervention is called a hard, or surgical, intervention, because it breaks that edge. You can also do what are called soft interventions, where you change the probability of a node but allow other things to affect it. This is common, for example, in economics: we can experiment on people's income by giving them more money, but it's really hard to get ethics approval to take away the money they already have. In those cases you don't break that edge, but you can put a new edge in. Smoking, here, we would think of as a confounder of the relationship between having yellow teeth and having lung cancer, because it's a common cause; that's how we represent confounders in this framework. And this illustrates how randomized controlled trials, which do hard interventions, break the influence of those confounders. Even if you are doing a soft intervention, you might not learn the effect of the yellow teeth variable on lung cancer, but you still have this intervention variable that you've added, which is exogenous: it has no confounders going into it. So that's how interventions break the influence of confounders.
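That edge-breaking can be checked by a quick simulation (again with my own illustrative numbers, not the talk's). Under observation, yellow teeth and lung cancer are dependent through the confounder smoking; under a hard intervention that randomizes teeth colour, the dependence disappears.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

smoking = rng.random(n) < 0.3
lung_cancer = rng.random(n) < (0.02 + 0.10 * smoking)

# Observational regime: smoking -> yellow teeth (confounded with cancer).
yellow_obs = rng.random(n) < (0.05 + 0.60 * smoking)

# Hard intervention do(yellow_teeth := paint): the paint node replaces
# smoking as the only parent, deleting the edge smoking -> yellow teeth.
yellow_do = rng.random(n) < 0.5

def risk_difference(exposed, outcome):
    """P(outcome | exposed) - P(outcome | not exposed)."""
    return outcome[exposed].mean() - outcome[~exposed].mean()

print(risk_difference(yellow_obs, lung_cancer))  # clearly positive: confounding
print(risk_difference(yellow_do, lung_cancer))   # approximately zero
```

Randomization makes the intervention variable exogenous, which is exactly why the second risk difference collapses to noise.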
Okay, so I'm going to mention this now and come back to it in more detail in a bit, but the output that we get from causal discovery algorithms can include some uncertainty. If you were to give data from this model to a bunch of causal inference algorithms, they would probably return something like this model down below, where they've identified that there are these three edges, and they've oriented two out of the three, but they can't tell which direction this third edge is going in. So this is learning as much as we can about the true structure, and representing the remaining uncertainty accurately. That's going to be our goal: not to learn everything, but to be clear about what we know and what we don't. There are kind of two steps in causal structure learning. Firstly, you want to learn what the true set of edges is; and then secondly, once you know that, once you know what the arguments to each structural equation are, you can fit each equation using standard statistical estimation techniques. The first part of the problem is harder, and I think more interesting, and it's also the one that I know more about, so that's all I'm going to talk about; I'm not going to talk about step two. Another way to think about step one is avoiding model misspecification: we all know that if we regress on the wrong set of variables, we'll get the wrong edge coefficients. Okay, so here is a high-level overview of what we're doing. In regular statistical inference,
we have probability theory, which tells us how any probability distribution would generate data; and then, when we observe data, we use statistical inference to go back to the generating distribution. We're doing a similar thing here with causal learning. We have the causal axioms, which tell us how each causal structure would constrain the kinds of probability distributions that could be generated by that structure; and then, when we see constraints on the probability distribution, we use those to do causal inference back to the set of structures that could have generated that distribution. That's the high-level picture. There are lots of different kinds of constraints, of which I am mostly going to talk about conditional independencies, because conditional independencies are the only kind of constraint that holds in all parametric families. There are a bunch of other interesting kinds of constraints that I might describe briefly. We also get constraints from background knowledge about the sampling conditions (if I know I've done an intervention, I know that I've broken some edges), and some of the constraints might come from our assumptions, about acyclicity and so on.
So, conditional independence constraints: what are they? I'm going to start with the Markov condition. The Markov condition follows from properties of structural equation models; it's pretty general. The local version of the Markov condition says that, for any variable X, if you condition on its direct parents in the graph (anything that has an edge into X), then all of its non-descendants become independent of X: the parents screen off all of its ancestors. And what we can infer from Markov is this: Markov implies that if there's no edge in the model between X and Y, there will be some conditional independence. That means that if we see conditional dependence no matter what we condition on, so that X and Y are dependent on each other whatever we condition on, we infer that there has to be an edge between them. That's the backwards inference: the Markov condition means that when we see a dependence, we can infer an edge. When does this fail? This fails if we have some unobserved confounders that we have not included in our model. For example, if I just ignore smoking and look only at yellow teeth and lung cancer: because neither of them causes the other, the true causal model has no edges in it, it's just an empty graph, but I will see a dependence between the two of them. So Markov would say: you've seen a dependence, you should be able to infer an edge. But that is incorrect in this case, because we have omitted that confounder, smoking. This is a violation of what we call causal sufficiency, which just says that we haven't omitted any common cause of any two variables in the model. Now, you can include a variable without observing it. If you haven't observed it, you maybe can't learn much about the structure, but you won't make a mistake where you think you know the structure even though you don't. So I don't worry
too much about violations of Markov. The second condition, though: Markov only lets us infer that edges exist, but we already know that the complete graph, where everything causes everything, can fit any probability distribution. To learn a sparse model, we need some criterion that allows us to exclude edges. For that we turn to the faithfulness condition, which is exactly the converse of the Markov condition. The faithfulness condition says that if there is an edge in the model, then those variables are going to be dependent no matter what you condition on; and that means that when you see some conditional independence, you can infer that there's no edge between them. So it's exactly the converse of Markov: it says that all the independencies in the distribution are those that Markov entails, and there are no extra ones. So when does faithfulness fail? Faithfulness can fail when we have causal pathways that exactly cancel out, or distributions that are set up exactly right so that dependencies vanish. In this example, say you're interested in the effect of birth control on blood clots, thrombosis; you're looking for the direct effect. But you know that birth control also reduces the chance of pregnancy, and pregnancy increases the risk of blood clots, so you include pregnancy in the model. If this were a linear model with these edge coefficients, then if alpha times beta equals gamma, those two pathways exactly cancel out, and you will see that birth control is independent of thrombosis, which is unlikely. The argument for faithfulness usually goes like this: that cancellation isn't really going to happen, because the set of distributions with these exactly cancelling paths is a surface in a higher-dimensional space of distributions. It has Lebesgue measure zero, so there is probability zero that things will land exactly on that surface. But there are cases where we expect faithfulness to fail, specifically in self-regulatory systems. The thermostat in your home ensures that the indoor temperature will be independent of the outdoor temperature, even though the heat outside affects your home: it is precisely set up to make that dependence vanish. In biology, for example, there are a lot of cases where we expect the distribution to be unfaithful; homeostasis is like your body's own thermostat. So people worry more about unfaithfulness than about Markov failing. So now I'm going to give
you an intuitive example, just to see how we can orient edges, because the standard "correlation is not causation" argument usually says we can't tell whether A causes B or B causes A. Sometimes you can. Say we have this toy model where we're modelling a car, and we include the petrol tank, the battery, and whether the car starts or not. You would think that the petrol and the battery charge are independent of each other: neither of them has anything to do with the other. But as soon as you learn that the car did not start, and you know that you have plenty of petrol, what can you infer about the battery? It's probably dead. So this example shows that when you have two parents of a common child in your graph (I'll keep using the laser pointer, even though it keeps wandering off), if there's no causal connection between the parents, you will see that they are independent; but as soon as you condition on the child, they become dependent. This is called collider bias, or selection bias. And it's a case where, if we see this pattern of unconditional independence but conditional dependence, we can fit this structure: we can orient these edges inwards, because this is the only structure that gives you that pattern.
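The car example is easy to reproduce numerically; here is a sketch with made-up failure rates. Marginally the two causes are uncorrelated, but conditioning on the collider ("the car didn't start") makes them strongly negatively dependent.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

petrol = (rng.random(n) < 0.9).astype(float)    # tank has fuel
battery = (rng.random(n) < 0.9).astype(float)   # battery charged, independent of petrol
starts = (petrol * battery).astype(bool)        # car starts iff both are fine

# Marginally, the two parents are (near) uncorrelated...
marginal = np.corrcoef(petrol, battery)[0, 1]

# ...but among cars that failed to start, knowing the tank is full
# makes a dead battery almost certain: collider / selection bias.
failed = ~starts
conditional = np.corrcoef(petrol[failed], battery[failed])[0, 1]

print(round(marginal, 3), round(conditional, 3))
```

The conditional correlation comes out strongly negative because, among the failures, "full tank" and "charged battery" can almost never both hold.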
Yes, so the upside-down T symbol means "is independent of": here, gas tank is independent of battery. Where it's got a line through it, it means "not independent of": gas tank is dependent on the battery, conditional on the car starting being zero. And on dependence versus correlation: correlation is a subset of the kinds of dependence, it's linear dependence. I will tend to say "dependence" precisely because sometimes we have no correlation even though we do have dependence, and that's used by some of these algorithms. But if hearing the word "dependence" is confusing, just hear
it as correlation. So: if we have the true model A causes B causes C, this will give us exactly one conditional independence: if you condition on B, you make A and C independent. But that same independence is implied by two other models: C causes B causes A, and B causes both A and C. So if we see that one independence, and everything else is dependent, all we can infer is that there's no edge between A and C; we can't orient the other two edges. We can represent this, what's called a Markov equivalence class, containing three different models, using an object called a complete partially directed acyclic graph, or CPDAG. It has two edges that are left unoriented, because each of those edges goes one way in some of the models in the class and the other way in others. But in this second Markov equivalence class, again there's just one conditional independence, and that's A is independent of C conditional on nothing. This is the only model in its Markov equivalence class, so the CPDAG that represents that equivalence class orients all the edges. We'd really like to be in this situation, where we know more about the model; but how much we know about the model depends on the way the world is, and we have no control over that. So we want to accurately represent our uncertainty. Right, okay. So I've given you some intuitive
examples, and I want to be clear that there is a recipe for doing this. The d-separation criterion tells us all of the conditional independencies that are implied by a graph: you start with a graph, and if two things are d-separated, then they should be conditionally independent in the probability distribution. D-separation lets you move between a graph-separation criterion and a probabilistic independence. I don't really need to go through all the conditions; I just want to be clear that they exist, that this work has been done, and continue with my high-level Wikipedia-style overview. So this powers the constraint-based algorithms. The idea of constraint-based search: you start with some data, you feed it to a statistical inference engine, which tells you all of the conditional independencies that hold in the distribution your data comes from; that gives you the constraints. Then you feed those constraints to a causal discovery algorithm, which will tell you a Markov equivalence class; it should tell you exactly the Markov equivalence class that implies that same set of constraints. Notice that the statistical inference engine is separate from the causal discovery algorithm, so in this sense constraint-based search is nonparametric: you can give it any nonparametric conditional independence test you like. So that's nice, that's
general. But it's still really hard, and the reason it's hard is that there are a lot of different models we're searching through. Think about a directed acyclic graph: if we have n nodes, there are n choose 2 pairs of nodes, which is order n squared, and every pair of nodes can either have an edge between it or no edge, so there are at least 2 to the (n choose 2) models. If we consider the orientation of each edge, it could go forward or backward or be absent, so there are at most 3 to the (n choose 2) models. Some of them will be ruled out because they're cyclic, but still, the search space grows super-exponentially in the number of nodes. So this is a very difficult discrete search problem, and the advances in causal discovery are methods that make that search process efficient.
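To make that growth concrete, here is the back-of-envelope count from the talk: each of the C(n, 2) node pairs is forward, backward, or absent, giving at most 3^C(n, 2) graphs (an overcount, since the cyclic ones should still be excluded).

```python
from math import comb

# Upper bound on the number of structures over n variables:
# each unordered pair is ->, <-, or no edge.
for n in range(2, 8):
    print(n, 3 ** comb(n, 2))
# Already at n = 7 there are over 10 billion candidate structures.
```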
The pictures make it look so easy, but yes, there's a lot of computationally intense stuff happening here. So I'll go through the PC algorithm, which is relatively simple. It is named after Peter Spirtes and Clark Glymour, that is, Peter and Clark; if you're wondering why they didn't use their last names, it's because they'd already invented the Spirtes-Glymour-Scheines (SGS) algorithm, which is like PC, only less efficient. They do invent other algorithms, and they name them other strange things; PC, unfortunately, is quite difficult to Google. All right, the idea behind PC. Let's say we're starting with the true graph down the
bottom, where A and B both cause C, and C causes D. PC starts with the complete graph: everything is connected to everything else. The first step is to test the zero-order conditional independencies, and we find one: A is independent of B, conditional on nothing. By faithfulness we're allowed to remove that edge, but by Markov we have to keep all the other edges in there; so we remove that first edge. Then we move on to the first-order conditional independencies, and we find two: A is independent of D given C, because C screens it off, and B is independent of D given C, again because C screens it off. Everything else is dependent, so again faithfulness lets us remove those two edges. Now we have what's called the skeleton of the graph: it has all the right adjacencies. Oh yeah, another thing that makes PC efficient, which doesn't really come across in this small example, is that the set of variables included in the conditioning set is limited, again by the Markov condition: all the screening off that you need to do, you can do by conditioning on the parents of each node. We don't know which variables are the parents, but we know they have to be a subset of the neighbours, which is everything connected to a node regardless of direction; so we only condition on sets drawn from the neighbours. As the graph gets sparser, the number of tests that you have to do decreases, and we're starting with the low-order conditional independencies, which are more robust.
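The skeleton phase just described can be sketched in a few lines of Python. This is my own simplified sketch, not the full PC algorithm: `indep` stands in for whatever conditional independence test or oracle you supply, and here I hand-code the d-separation facts of the A → C ← B, C → D example.

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """Skeleton phase of PC (simplified sketch).

    `indep(x, y, S)` answers "is x independent of y given the set S?"
    via a statistical test or, as below, a d-separation oracle."""
    adj = {v: set(nodes) - {v} for v in nodes}
    order = 0
    # Keep testing while some node still has enough neighbours to condition on.
    while any(len(adj[x] - {y}) >= order for x in nodes for y in adj[x]):
        for x in nodes:
            for y in list(adj[x]):
                # Only condition on current neighbours of x: parents suffice
                # by Markov, and parents are a subset of the neighbours.
                for S in combinations(adj[x] - {y}, order):
                    if indep(x, y, set(S)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        order += 1
    return adj

# Hand-coded independence facts for the true graph A -> C <- B, C -> D.
def oracle(x, y, S):
    pair = frozenset({x, y})
    if pair == frozenset({"A", "B"}):
        return not (S & {"C", "D"})   # conditioning on the collider opens A-B
    if pair in (frozenset({"A", "D"}), frozenset({"B", "D"})):
        return "C" in S               # C screens D off from A and from B
    return False                      # adjacent pairs are never independent

skeleton = pc_skeleton(list("ABCD"), oracle)
print(skeleton)  # A-C, B-C, C-D: the correct adjacencies
```

Note how the zero-order test removes A-B and the first-order tests remove A-D and B-D, exactly mirroring the walkthrough above.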
The next step, since we see no more conditional independencies, is to orient the v-structures. You take every triple of nodes where two of them are connected to the third, and you see whether you get a dependence between the two endpoint nodes when you condition on that third node. We find one v-structure: A and B can be oriented into C, because of this dependence, A being dependent on B conditional on C; it's just like the car example. So we have oriented two edges, and we have a v-structure and an undirected edge. If we just left that edge undirected, that would imply that in the Markov equivalence class there are some models where D points into C; but if that were the case, then we would have found another v-structure, and we didn't. So we can orient the last edge away from the v-structure, and then PC is done; it exits, having found the true graph, in this small example that I picked specifically because it was easy. The benefit of PC is that it's very efficient, especially on sparse models; if you have a really dense graph, where everything's connected to everything else, you will have to do all of the conditional independence tests just to learn that it's that dense. The downside of PC is that it can propagate errors: if it makes a mistake early on, everything else is going to be screwed up. And it does really rely on having a good conditional independence test. Okay, so that's an example of constraint-based search. Another thing
that's done is score-based search. In score-based search, for each Markov equivalence class of models, you pick one model within that class and fit it to the data; if it were a linear model, you'd learn the edge coefficients. Once you've fitted it, you can score the fit, using for example the BIC score, and once you've done that for all the models you're interested in, you can just take the argmax: the model that gives you the best score. So this is more of a Bayesian approach. Again, it suffers from the problem that the search space is huge: you'd have to score a lot of models. One approach that people take is to artificially ignore most of the search space; a common way of doing this is just to say, I really think that no node is going to have more than four parents, and that will dramatically limit your search space if you have a lot of nodes.
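For concreteness, here is what scoring one node's candidate parent set with BIC looks like under a linear-Gaussian assumption. This is my own sketch (constant terms are handled loosely, and the penalty counts only the intercept and the edge coefficients); the data is simulated so that c's true parents are {a, b}, and that parent set should score best.

```python
import numpy as np

def bic(y, parents, n):
    """BIC for one node given a matrix of candidate parents (higher is better).
    Linear-Gaussian sketch: fit by least squares, penalise parameter count."""
    X = np.c_[np.ones(n), parents] if parents.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = (resid ** 2).mean()
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * X.shape[1] * np.log(n)

rng = np.random.default_rng(3)
n = 5_000
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.5 * rng.normal(size=n)   # true structural equation for c

score_none = bic(c, np.empty((n, 0)), n)
score_a = bic(c, a[:, None], n)
score_ab = bic(c, np.c_[a, b], n)
print(score_ab > score_a > score_none)
```

The log-likelihood gain from including a true parent swamps the complexity penalty, which is why a consistent score can drive the greedy search described next.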
another option which I'm going to describe is to take a greedy approach where you don't score all the models but you move through the space of models
trying to find a better scoring one so if you have a consistent scoring
criterion and it's been shown that BIC is consistent you'll find that okay if we have these three models the true model is G star and then we have an augmented model G plus where we've added an edge and removed an independence and a
diminished model G minus where we've removed an edge and created an independence we will find that the score of the Augmented model sorry the score
of the true model is always greater than that of the Augmented model and the Augmented model is always better than the diminished model it's always best to
account for all the dependencies that you have even if you have some extra ones and it's even better again to not
have any extra ones at least according to the BIC score so the really nice thing is a beautiful proof by Max Chickering which shows that these score differences hold not just for the whole model but they also
apply to little sub graphs so we can move through the space of models by adding and removing individual edges and
then if we have infinite data the consistent score will give us the true model so this is the idea behind greedy
equivalence search you start with the empty model unlike PC which starts with a complete model and then in the forward phase while there's still some valid edge addition that improves the score you add that edge and just keep doing that till the score stops improving and then you start on the backward phase while there's some edge that you can validly remove and still improve the score then
you remove it GES at each stage finds the Markov equivalence class of whatever model it's gotten to so it moves through a space of Markov
equivalence classes which means the search space is smaller but more complicated so for example if you
add the first edge between two nodes the Markov equivalence class of that model is going to have an undirected edge so you start by adding a directed edge and then in the next step you unorient it and then you go to add another edge so
because the score of an augmented model is always better it's always going to be worth it to add extra dependencies to account for the ones in your data the
forward phase is always going to account for all of your dependencies and then if you've accidentally added some extra ones the backward phase will get rid of
them this is why we only need two phases if we have infinite data so you might wonder why I'm talking about this infinite-data consistency result because we never have infinite data and so I'm emphasizing the theoretical results
because real finite sample validation is really hard for causal models this is kind of the soft underbelly of
causal discovery how do we validate our
results yes eventually we'll get there so if we're doing regular statistical
learning we can validate on the same kind of data that we use to train the model but for causal inference we have
to test on the outcome of interventions which we don't necessarily have that's why we're doing causal discovery so it's
really hard to find a good validation set for causal discovery and the space of potential experiments is really large
you could intervene on any one of the variables in your model or any subset of them so the power set of your variables and you could decide to set those values
any way you like so it's a huge space of potential experiments the model that you're learning is going to have
implications for all of them so the alternatives if you can't do that kind
of validation are you know to get theoretical results and you can validate on simulated data you create a generative model you know what the true causal structure is generate some synthetic data from it and
then see if you can take that data and learn the generative model so this is what is standardly done in causal discovery papers and it's a pretty good
approach but it doesn't necessarily assuage the really bad worries we have about this approach because it's quite
difficult to generate synthetic data where the assumptions of the causal algorithms will be violated in a
realistic way and we know that almost every case in reality is going to violate some of these assumptions whether causal sufficiency or acyclicity or
faithfulness but we don't know how badly and what we want to know is how robust the algorithms are to those violations
of the assumptions you can also pick a known causal relationship that you know from background knowledge but that's kind of a shorthand way of saying we've
done the experiment before so I'm gonna give you a couple of examples where validation was done really well because
it can be done so this research group Stekhoven et al were developing a new causal discovery algorithm but they validated it on this plant Arabidopsis thaliana looking for genes that regulate flowering time in Arabidopsis so they had microarrays giving them the gene
expression of all genes in the genome and that's thousands of variables and
also for each of those plants what time did that plant flower so they used a variant of PC called PC-stable which I forgot to describe earlier and subsampling to find edges that were stable on various subsamples of the data and they had an extra gadget called IDA for estimating
the model and they finally got to their 25 top genes and of those 25 5 of them were already known to influence flowering time so that's pretty good
finding a needle in a haystack but they went further and tried to test the other 20 of those 20 there were 13 where there were commercially available mutant
strains someone had already you know mutated that gene and you could just buy the mutant seeds so they grew the plants and of those 13 strains nine of them
produced enough viable plants to measure flowering time accurately and of those nine four of them had
measurably different flowering time so from just running an algorithm on gene
expression they were able to really help prioritize experiments to find new regulators of flowering time so if you're a biologist this is really gonna
save you a lot of time you could be testing any of those tens of thousands of genes in the genome but this one
experiment found four good regulators so another validation set is actually a Kaggle challenge from 2012 or 2013 on cause and effect pairs hence the little chicken and egg picture so you'll note that there's no conditional independence that allows you to distinguish the direction of
causation between just two variables without any other context but I mentioned that there are a bunch of other constraints that you can use in causal discovery and that's what the
winning entries in this challenge used so the training data there were about 4,000 training pairs but most of them
were semi-artificial they mixed together some real variables to create artificial causal relationships but they did have
hundreds of real pairs where just from background knowledge we know the direction of causation and this is an example so you have a fifty percent chance of getting it right would anyone like to guess whether A causes B or B causes A in this example
this isn't chickens but I'll just tell you so B in this case is
altitude and A is temperature so you might think that we can cool a mountain down if we build it higher but we probably can't make it taller by cooling it down right so on the interventionist theory you know B causes A but A doesn't cause B and if you had been
using some of the causal discovery algorithms that use other constraints you might have noticed that this looks
like it's a non-invertible function that's been inverted it looks like there's a bit of a curve there and that is one of the signs of a causal relationship that's been fitted the wrong way so the winning entry in this challenge had an accuracy
of about 0.8 rather than the 0.5 that we'd expect from chance actually I think they had some cases with a confounder so you'd expect the accuracy to be even
lower by chance so pretty impressive and if you want a large-scale evaluation of your causal discovery algorithm that
data is still there so I just wanted to give you an intuition for LiNGAM so LiNGAM stands for linear non-Gaussian acyclic models and the idea behind it is if I set up a linear but non-Gaussian model so X has some uniform noise and Y has its own uniform noise plus some multiple of X if I try to fit Y on X which is the true direction of causation I find that
the residuals from that regression have no association with X but if I fit the reverse model if I regress X on Y the residuals from that regression will be associated with Y they won't be correlated the linear correlation has been removed by the regression so here's a case of dependence without correlation
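That asymmetry is easy to simulate; here is a minimal sketch (the coefficient 0.9 and the magnitude-correlation dependence proxy are my own choices, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# linear non-Gaussian model: X causes Y, both with uniform noise
x = rng.uniform(-1, 1, size=n)
y = 0.9 * x + rng.uniform(-1, 1, size=n)

def fit_residuals(target, predictor):
    # ordinary least squares of target on predictor, returning residuals
    coef = np.polyfit(predictor, target, 1)
    return target - np.polyval(coef, predictor)

def dependence(resid, predictor):
    # crude nonlinear-dependence proxy: correlate the magnitudes
    # (the plain linear correlation is ~0 in both directions)
    return abs(np.corrcoef(np.abs(resid), np.abs(predictor))[0, 1])

forward = dependence(fit_residuals(y, x), x)  # causal direction
reverse = dependence(fit_residuals(x, y), y)  # anti-causal direction
print(forward < reverse)  # residual dependence shows up only the wrong way round
```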
but you can see there's a dependence because there's no data in this little corner here but there is data
along here so depending on the value of x you either will see samples of Y in
that range or not so that's the intuition for how these non Gaussian distributions give you constraints that
you can use to distinguish the direction of causation with only two variables and there's a lot more stuff with nonlinear
models as well that I have not made slides on all right so we have all of these assumptions now let's talk about some of the other things that go wrong I
talked a bit about how Markov and faithfulness can go wrong so here's another thing that can go wrong I really like talking about the problems let's
say you have a variable that you want to intervene on like cholesterol there were a whole lot of clinical trials for
treatments that would lower cholesterol that wound up not having an effect on heart disease even though they successfully lowered cholesterol like
the coronary drug project and the reason that they didn't successfully reduce heart disease is that cholesterol is actually two different things
it's high-density cholesterol which reduces heart disease and low-density cholesterol which increases heart disease
so if you don't know which one you're intervening on you can't predict the result of that intervention I really wanted to mention this problem because it's particularly an issue in machine
learning where we do a lot of feature engineering if you throw every combination of your features every transformation of them into the model then you're going to have a lot of variables where it's not clear what it means to intervene on them
and the causal model won't make sense so I don't have an answer for this because I don't know what counts as like the
causally relevant set of variables just pointing it out as an issue for applying these methods in a machine learning
context yeah right you can only intervene on both or neither which is something I should have mentioned with interventions
we require that you can't have a fat-hand intervention which is an intervention that hits more than one node in the model if you have two nodes where you can't intervene on one without the other then the model's not specified correctly to have a causal interpretation so another problem is measurement error
let's say that I haven't measured exactly how much people smoke but I've just asked them like do you smoke now maybe they have a 40-pack-year history but they just quit last week in that case I've created a new variable smoking yes/no which is a child of the exact amount that they smoked and now I haven't measured the exact amount that they smoke so I've grayed this out in the model and unfortunately now
if I try to condition on this coarsened variable smoking yes no I won't screen
off yellow teeth from lung cancer I can't actually break that path so this is a really pervasive problem we won't
get the conditional independencies that we expect because we haven't measured
what we think we've measured and another really foundational and worrying thing
is that causal search is not uniformly consistent back to that unfaithfulness example let's say that it is a faithful
model these edge coefficients don't precisely cancel out alpha times beta equals gamma plus some little epsilon it's not zero so it's a faithful model but for any finite sample size I can pick an epsilon that's small enough that my constraint-based or score-based algorithm will say that there's no edge from birth control to thrombosis in that case the error I'm making in terms of how far I am from the true
coefficient of the model is arbitrarily large because I've said that gamma is zero when in fact gamma could be
anything so for any finite sample size I can be arbitrarily wrong even though if I had infinite data I would get the
truth because it's a faithful model so a response to this is to say well let's bound epsilon away from zero let's say we can't be that close to unfaithfulness it doesn't matter how small epsilon is if we assume it is some fixed nonzero number then we get uniform consistency so it's a natural response unfortunately Caroline Uhler has done this great work on the geometry of unfaithfulness so I said the
distributions that are unfaithful are a surface in a higher dimensional space and she's visualized the surface for
a tiny model with just three edges when you add up all the different surfaces where any of the covariances in that model vanish because they're so convoluted they wind up taking most of the space most of the volume so for any little epsilon around that surface if you're
gonna say the model can't be in that area you wind up taking a lot of the volume of possible distributions so the strong faithfulness assumption where you
say we can't be any closer to unfaithfulness than epsilon is a really strong assumption that rules out a lot of the
distributions so yes I believe it does but I don't have a good enough
understanding of Caroline's paper to really say so you might now be asking
why use causal learning at all and I would say it's because there are no better options so if you're really in
the situation where you don't know what the structure is you just have a whole lot of data often what we see say psychologists doing is guess-and-test I'm just gonna guess that the model is this I'll fit it to the data and then
I'll test the fit and if the p-value is big enough I'll say that's the model and there's no consistency results for
guess-and-test it is probably going to do worse than one of the causal discovery algorithms so whatever worries you have about
causal discovery and I have a lot I have more worries about guess-and-test so that's my sales pitch this is why you should use these methods and that is all
[Applause] we do have fun questions I have like
three but I was just you know the same sort of thing yeah so the question is when you are seeing weird results with cholesterol how do you know how to modify the model I don't know I think that's outside the causal discovery theory I think the only way we figured it out was by doing finer-scale mechanistic investigations of how
heart disease works but I am not a medical researcher so I don't know exactly the history of how we figured that out and I don't know how to do it in general [Music]
right so the question was about the parameterization of the model so yeah I said earlier on that I'm dividing the problem of causal discovery into two parts first learn the structure and then
fit the structural equation models so definitely the learning the structure is
a discrete task either the edge is there or it's not and then the problem of figuring out how strong the edge is or
whether the influence of two parents on the common child is going to be interactive or if they're going to be just additive I that is a statistical
estimation problem and once we know what the arguments are in each of those structural equation models we can then fit it using whatever fancy statistical
estimation tools we want sometimes it's really hard to fit those equations especially if some of the variables in them are unmeasured and there's a whole other literature on figuring out how
to fit those things when you know the causal structure but I've ignored it because I don't know enough about it [Music]
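A minimal sketch of that second, purely statistical stage, assuming the structure X → Y ← Z has already been learned (the data and the coefficients 1.5 and -0.5 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
# Suppose structure learning already gave us X -> Y <- Z; the remaining
# job is to estimate Y's structural equation from its parents
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.5 * x - 0.5 * z + rng.normal(size=n)  # true (hidden) coefficients

design = np.column_stack([x, z, np.ones(n)])       # parents plus intercept
coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # ordinary least squares
print(np.round(coef[:2], 1))  # recovers roughly [1.5, -0.5]
```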
yes so there is an open source package called Tetrad which is written in Java and was developed by people at Carnegie Mellon so the question was is there software to do this I didn't have time to do a live demo but maybe next time yeah so Tetrad you can download online search for maybe Tetrad Carnegie Mellon because there are other things called Tetrad or Tetrad causal discovery and there's some documentation on it as well
[Music] so the question is how big a problem are cycles there are some causal discovery
algorithms that were developed exactly to learn cyclic models so cyclic causal discovery or CCD was the first one
cyclic models can differ in how we interpret them something like supply and demand you might think of as a cyclic model but we could also interpret it as a model which is acyclic if you consider it over time so if you have time-indexed observations you can then fit that model
using a dynamic Bayes net learning algorithm but then there's others where either it's acyclic over time but we
just don't have good enough measurements to tease that out or maybe we really do think it's cyclic and then there are some algorithms that can handle that but
only for limited distributions CCD works for linear Gaussian distributions I don't think there's a general way of doing it yet
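The time-unrolled reading of a cycle can be sketched like this (the supply/demand coefficients are invented): a feedback loop becomes acyclic once each variable at time t depends only on values at t-1, so lagged regressions recover it.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5000
supply = np.zeros(T)
demand = np.zeros(T)
# A supply <-> demand "cycle", unrolled over time: each variable at t
# depends on the other at t-1, which is acyclic in the time-indexed graph
for t in range(1, T):
    supply[t] = 0.5 * demand[t - 1] + rng.normal()
    demand[t] = -0.5 * supply[t - 1] + rng.normal()

# Fitting the lagged equation recovers the loop coefficient for supply
design = np.column_stack([demand[:-1], np.ones(T - 1)])
coef, *_ = np.linalg.lstsq(design, supply[1:], rcond=None)
print(abs(coef[0] - 0.5) < 0.05)  # close to the true lagged coefficient
```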
[Music]
so the question is how to deal with
interactions when you have multiple parents that interact so there's a kind of easy answer to that where I say again that's a
statistical estimation issue but some kinds of interactions do cause problems for causal discovery so there's an
assumption that I don't know if I even listed it called compositionality which says that if a variable has a causal
influence on another one then that edge that dependence should appear even when you don't take other variables into
account but we can imagine cases where this doesn't hold like an exclusive or function each one of the parents could
be a random coin flip and the child could be you know on if both the parents are on and off of both of them are off like switches and that in that case
you've got an interaction and you will see no dependence between either parent individually and the child
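A quick simulation of that XOR case (sample size and thresholds are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
a = rng.integers(0, 2, size=n)  # parent: fair coin flip
b = rng.integers(0, 2, size=n)  # parent: fair coin flip
c = a ^ b                       # child: exclusive or of the parents

# Marginally, each parent looks independent of the child...
print(abs(np.corrcoef(a, c)[0, 1]) < 0.02)
# ...but holding the other parent fixed, the dependence is total
print(abs(np.corrcoef(a[b == 0], c[b == 0])[0, 1]) > 0.99)
```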
so it's an unfaithful model and compositionality fails so yeah interactions can make it really hard to
learn the structure but if we don't have that problem and we can learn the structure then I'm going to say that estimating how strong those interactions are is just a statistical
estimation problem someone else's problem does it help the causal discovery process if we know what variables we intervened on absolutely so the question is can you do better causal discovery if you give the algorithm information about what we
intervened on the answer is yes we can add that as a constraint and then that will rule out some of the models that would have been within the Markov equivalence class so we can narrow down
our results and Tetrad will allow you to do that I think we had one last question well may I ask one yeah so my question was do you know of any work that tries to do this causal extraction live so in the examples it kind of assumed that you had the data and you were post hoc analyzing it but it seems
like it's something you could do online where each time you needed an independence test you could actually then like ask a robot to do it or
something yeah so there are approaches that Frederick Eberhardt and his collaborators are working on using SAT solvers so what you give the SAT solver is a set of conditional independence constraints which have been translated into constraints for the
solver and then you can query it about the model you can ask is there an edge between these two variables and it'll say yes no I don't know and you can keep
feeding it constraints continuously and it will keep refining what it knows because what you can give the model is
such a weird subset of independence constraints I don't think they've figured out a way to represent what the SAT solver knows about the model like there's no complete partially directed acyclic graph no handy visual representation for it but you can query it and you can keep giving it information so that is
what I would call an online approach to this awesome thank you well that was such a great presentation
[Applause]