Causal discovery in Python - Aleksander Molak, Lingaro | GHOST Day: AMLC 2022
By GHOST Day: AMLC
Summary
Topics Covered
- A is a cause of B if B listens to A
- Causal Discovery extracts structure from data
- Colliders flip variable independence rules
- No universal algorithm fits all data
- Causal inference already uses ML estimators
Full Transcript
Thank you. Welcome, everyone, to today's session on causal discovery in Python. We have a lot of material to cover today, so without further ado, let's start with a quick agenda. In the beginning we'll mention a couple of things that will be important for us today, and I will introduce myself very briefly. Then we'll jump into causal inference and causal discovery: we'll talk about what they are, what the differences between causal inference and causal discovery are, and why we should even bother with these things. Then we'll talk about causal graphs, which are essential building blocks of causal models. Then we will talk about three families of methods that we use for causal discovery, and finally we'll do some practical examples in Python. After that, hopefully we'll have five minutes for your questions.
Great, so let's start. Ah, one more important thing: we'll have a couple of QR codes in the presentation, so you might want to scan them to access the materials, the repository, the notebooks, and so on. Let's start with the materials. We have a repository for this talk and a main notebook that we will use in the second part, so you can open them; if you don't have a QR code reader, you can just use the link at the top. I'll give you five seconds more. Okay, great. Three words about myself: my name is Aleksander Molak. I'm a machine learning engineer and researcher, specializing in natural language processing, probabilistic modeling, and causal inference. If you're on LinkedIn, I'm very happy to connect with you; you can scan this QR code and it will take you to my LinkedIn profile. So I'm happy to connect, stay in touch, and discuss interesting topics in machine learning and causal inference.
Great. Let's talk about causality. What is causal inference, what is causal discovery, and why should we even bother? Before we answer these questions, let's ask a more elementary one: what is causality itself? We'll start with a definition from Wikipedia. This definition states that causality is the relationship between causes and effects. Is it a good definition? I don't know; you might feel that it's a little bit circular, in the sense that we are trying to define causality in terms of causes, which doesn't necessarily explain too much. But this is how it is, and it's just one of the many, many definitions that have appeared throughout history, because causality has been a very interesting topic for many people, from ancient Greece, ancient China, and the ancient Middle East until today, discussed by philosophers, scientists, mathematicians, and so on. Anyway, this definition might not be the best, at least not for our purposes, so we'll use something different: a definition that comes from Judea Pearl. Judea Pearl is a godfather of causality, you could say; he devoted his life to working on problems of causality, and we will use a lot of ideas from his work in today's presentation. The definition he proposes is a very actionable one: A is a cause of B if B listens to A. What does it mean that B listens to A? It means that if we change A, then we should observe a change in B as well.
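A minimal sketch of this idea (my own toy example, not from the talk's materials, with an assumed structural equation B := 2A + noise): intervening on A moves B, because B listens to A, while intervening on B leaves A untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def simulate(do_a=None, do_b=None):
    """Tiny structural model: A := noise, B := 2*A + noise.
    do_a / do_b override a variable, mimicking an intervention."""
    a = rng.normal(size=n) if do_a is None else np.full(n, do_a)
    b = 2 * a + rng.normal(size=n) if do_b is None else np.full(n, do_b)
    return a, b

# Intervening on A shifts B: B "listens to" A...
_, b0 = simulate(do_a=0.0)
_, b1 = simulate(do_a=1.0)
print(b1.mean() - b0.mean())   # ~2.0

# ...but intervening on B leaves A untouched: A does not listen to B.
a0, _ = simulate(do_b=0.0)
a1, _ = simulate(do_b=5.0)
print(a1.mean() - a0.mean())   # ~0.0
```

The asymmetry between the two interventions is exactly what the "listens to" definition captures, and it is invisible in the joint distribution alone.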
You might ask: why should we model causality? We have machine learning, and machine learning solves many pretty complex problems, problems that a decade or two ago it would have been really hard to imagine being solved by computers, maybe even easily. To answer the question of why we should bother with modeling causality, I will ask another question: how do we increase the number of non-commercial space launches per year? One of the answers, a data-driven one, is to increase the number of PhDs in sociology in the United States. Is it a good answer? Well, I don't know, but certainly from a data point of view it seems like a really good answer, in the sense that the number of doctorates in sociology in the US is a very good predictor of those non-commercial space launches. You can also see this data as a scatter plot with a fitted line, and you can see that the relationship is pretty strong. Nonetheless, if we think back to the definition from Judea Pearl, it would seem rather unlikely, probably for most of us (that's my assumption), that if we changed the number of doctorates in sociology, the number of non-commercial space launches would also change. Why is that? Because we don't see what a direct causal model between those two things could be, and yet the changes are very directly related: it's not that when we have more or fewer of those PhDs, we see a change in non-commercial space launches years later; it happens immediately. So for most people this wouldn't be a reasonable explanation. So now
we can think about the causal inference process, which is a process of estimating a causal effect from observational data. A causal effect is different from correlation; as you know, one could famously say that correlation is not causation. This is what we are taught in many statistics courses, but still, we can actually estimate causal effects; we just need slightly different tools for this than, let's say, first-order statistical tools. What do I mean by first-order? We will see in a second. Now, to estimate these causal effects, we need some causal structure: we don't only need data, we also need some causal structure that we provide to the model. We provide a model with data, as in any classical statistical or machine learning problem, but we also provide it with some causal structure. You can think about this as a model that works not only on the data but also on the story behind the data.

Okay, now let's think about causal discovery, our today's topic. Causal discovery is a process of estimating causal structure from observational data. We said that for causal inference we need data and a story, and now we want to learn the story from the data. We can also look at this problem from a slightly different perspective. If you're familiar with machine learning, you know that we have data and we have a model; there is usually some data-generating process behind the data, but in machine learning we only look at those two top elements. In causal inference we have something additional: we have some information about this data-generating process, and we provide the model with this information as well as with the data itself. And finally, in causal discovery, what we do is try to obtain this information about the data-generating process directly from the data, and then pass it to the causal inference model. So in a sense you can think of the causal discovery module as second-order statistical learning, which in many cases is just what it is. Causal discovery, this little
block here, will be our focus today.

Okay, let's talk about causal graphs. Causal graphs, as we said in the beginning, are essential building blocks of causal inference. We'll start with a simple example to learn about these building blocks, and we'll learn about the three most basic ones. Let's imagine a fire alarm system that only reacts to smoke: it does not react to fire itself. We might have different fire alarm systems on the market, but for the sake of this example we are focusing only on the type that reacts to smoke, scanning the air and detecting small particles in it. A causal model of such an alarm could be presented like this: we have a fire that causes smoke, and then the smoke triggers the alarm. Now, an important thing to notice here is that if we have smoke here in the middle, then alarm and fire become independent, so there is no arrow between fire and alarm. Why is that? Because our alarm only reacts to smoke; we can imagine a situation where there's no fire but there is smoke, and our alarm reacts to it.

Now let's make it a little more abstract: instead of talking about fire and smoke, we'll just talk about A, B, and C. The structure that we've seen in the fire alarm example is called a chain, and one of its important properties is that if we control for B, then A and C become independent. The second structure is a fork. As you can see, a fork has one arrow that goes in the other direction compared to a chain: the chain goes all one way, while the fork goes left and right. Its important property is exactly the same: if we control for B, A and C are independent, but they will be dependent if we do not control for B. This is an important aspect. Finally, the third structure is called a collider, and a collider is different: as you can see, now both arrows are directed into B. A very important property of this structure is that A and C become dependent if we control for B, but they are independent otherwise. So this is reversed compared to
the fork and the chain. Now let's see the same thing in numbers. I performed a simple experiment for you: I built a chain, a fork, and a collider, and then I tried to predict C with A using a simple model, a linear regression. As you can see, A is not a significant predictor of C when controlling for B; this is for the chain structure. For the fork structure it's exactly the same: A is also not a significant predictor of C when controlling for B. But when we look at the collider, we see that A is a significant predictor of C given B, that is, when controlling for B. You can see the notebook, it's the first notebook in the repository, and you can use this QR code to perform the experiment yourself. The important thing to remember from this section is that the collider has a different independence structure than the fork and the chain.
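If you want the flavor of that experiment without the notebook, here is a small self-contained sketch (my own illustration; the talk's notebook uses a linear regression with significance tests, while here "controlling for B" is done via partial correlations computed with plain numpy):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z
    (i.e. 'controlling for' z)."""
    Z = np.column_stack([z, np.ones(len(z))])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return corr(rx, ry)

# Chain: A -> B -> C. Dependent marginally, independent given B.
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
chain_marginal, chain_partial = corr(a, c), partial_corr(a, c, b)

# Collider: A -> B <- C. Independent marginally, dependent given B.
a = rng.normal(size=n)
c = rng.normal(size=n)
b = a + c + 0.5 * rng.normal(size=n)
collider_marginal, collider_partial = corr(a, c), partial_corr(a, c, b)

print(f"chain:    {chain_marginal:+.2f} marginal, {chain_partial:+.2f} given B")
print(f"collider: {collider_marginal:+.2f} marginal, {collider_partial:+.2f} given B")
```

A fork behaves exactly like the chain here; swapping B into the parent role reproduces it. The reversal for the collider is the signature that constraint-based methods exploit.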
Now let's talk about causal discovery methods. We talked about three basic graph structures, and we can leverage their properties in causal models; now we'll talk about three families of causal discovery methods. These are methods that use different aspects of the data to build a causal graph.

The first family is constraint-based methods. They're called constraint-based because they leverage the independence constraints of graphs, which is exactly what we discussed a second ago: they look at the independence structures between any three variables in your data set and try to construct a graph. They say: hey, these two are dependent, but they become independent when you control for the third one, so maybe there's a collider, or maybe there's no collider, and so on. In this sense they try to detect connections between variables and also to direct the edges, because if there's a collider, we know the edges point into it; but if there is no collider, the edges could still be oriented in a couple of ways, and maybe a fourth variable will help us orient them. The second family of methods is called score-based methods: these methods generate many graphs and then compare them based on their fit to the data, computing some kind of score, maybe BIC or some other score. And finally we have functional methods, which leverage statistical properties of the variables and the noise in the model, using asymmetries between them to orient the edges in the graph.

So, to summarize constraint-based methods: they leverage the graph independence structure, based on colliders and non-colliders, to build a causal graph. They also need an important assumption called faithfulness; we won't talk about this too much, but it basically says that if there's some independence in the data, then this independence maps to an independence in the graph as well. An example of a constraint-based method, a classic one, maybe not the best from today's point of view, but a classic one, is the PC algorithm.

Then we have score-based methods, the second family. Score-based methods compare different graphs in terms of their fit to the data, as we said. Finding the optimal graph is NP-hard, though, so exhaustive search is not realistic; these methods usually use heuristics, sometimes pretty smart heuristics, to find an optimal graph. It's not always globally optimal, but at least it might be locally optimal. One example algorithm of this type is the so-called GES, which stands for greedy equivalence search.

Finally, we have functional methods. They leverage asymmetries or irregularities in variable and noise distributions. They assume that the causal model has a functional form, or can be represented in one, where any variable that has parents is a function of those parents and some noise. There are many different functional models: additive, non-additive, linear, non-linear, and so on. One of the classic examples of functional methods is an algorithm called LiNGAM, which stands for linear non-Gaussian acyclic model. It's non-Gaussian because it leverages asymmetries in the model: Gaussian variables are symmetric, with a mean and the same shape on the left and on the right, while other distributions are not symmetric, and this can be used to estimate the causal graph.

There are also other methods: hybrid methods that combine the constraint-based and score-based approaches, reinforcement-learning-based methods, and end-to-end methods that do causal discovery and inference at once. Sometimes they also do missing data imputation, because they build a generative model that generates the graph and can also generate the data itself. This is very interesting, but it's beyond the scope of today's presentation. Great, so we are ready to go
and do something in practice: we'll jump to the notebook now. Hopefully you can see this. Great. We are importing a couple of libraries. The main library that we'll be using today is called gCastle; you import it as `castle`, and you can install it using pip. You can also install the whole environment using the YAML file in the repository that we linked to. We'll also use a GES library that implements greedy equivalence search, because this is not currently implemented in gCastle, and we'll use networkx for graph visualizations and numpy for some data generation.

Okay, let's talk about the PC algorithm. We talked about those three families: constraint-based, score-based, and functional models; the PC algorithm is an example of a constraint-based model. We're talking about graphs today, about the three basic graph structures, chains, forks, and colliders, and we will make use of them right now. Let's imagine that this is our true graph: X and Y cause Z, and Z causes W. The PC algorithm starts with a fully connected, undirected graph, and then it removes edges based on the graph independence structure: first it removes edges between variables that are always independent, pairwise independent, and so on. As you can see, at some point it is left with undirected edges. Then it directs the edges here, because we have a collider on Z: it can find out that X and Y are both causing Z. Then we are left with this one edge, and PC checks whether this is a collider, checking the independence and conditional independence between W and Y; it finds it's not a collider, so it can infer that the direction goes this way. This is pretty neat, and I must give you one additional piece of information here: this is a simple example, and PC works very well for it, but there are more complex cases where the PC algorithm is able to find the edges but not able to direct them, so it returns a so-called Markov equivalence class. The name might ring a bell with GES, greedy equivalence search, and that's very good, because it's the same notion: a Markov equivalence class means the graph has some undirected edges which, from a graphical-model, statistical point of view, might be directed either way; we just don't know. Anyway, in this case PC works very well, so let's try to implement it. We have our variable X, which is just a Gaussian variable with 1,000 observations; Y is the same thing; Z is X plus Y plus some noise, and W is 0.7 times Z plus some noise as well. We'll stack these, which just builds a little matrix out of those variables, and we have our data set; a little sanity check, okay. Now let's learn the causal structure. It's very simple: we import `PC` from `castle.algorithms`, call its `learn` method, and provide it with the data set. We can see that it returns a causal matrix. This next code is a lot of lines, but don't worry, it's just visualizing the graph. Let's see: we have X causing Z, Y causing Z, and Z causing W, so this is exactly what we have here. That's great, that's pretty awesome.
And it's really amazing to see this, right? It really works; it was able to recover the causal structure from the data. Now we will build a couple of data sets that are more complex, and we'll try a couple more methods on them. This is our graph, the original graph, and we'll have a couple of data sets based on it, linear and non-linear, Gaussian and non-Gaussian; this is the same graph as a matrix. We use the PC algorithm; GES, the score-based algorithm; LiNGAM, in one of its flavors, called ICA-LiNGAM; and GOLEM, which is a more advanced algorithm. We'll run these algorithms on all the data sets; I did it in advance, because some of these algorithms take time, so we can just briefly examine the results. As you can see, the PC method did not do a very good job, but not a very bad job either: its false discovery rate is 0.2, recall 0.75, F1 score 0.75, and the number of undirected edges is one, meaning one extra edge detected by this algorithm was added undirected. GES, the score-based method, does much worse than PC on this data set, and LiNGAM is, as you can see, pretty bad, maybe not terrible, but pretty bad. Why? Because it's a non-Gaussian method and this was a linear Gaussian data set, so it's not a good method for this type of data. The GOLEM algorithm did a very good job, actually an F1 score of 100 percent. Then, for the linear exponential data set, you can see that PC does a worse job than on the previous one, and GES is also not very good, but LiNGAM is actually perfect: here we have a linear model with a non-Gaussian distribution, and this algorithm works perfectly, compared to the Gaussian case, where it worked much worse.
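The reason LiNGAM needs non-Gaussian noise can be seen in a tiny two-variable simulation (my own sketch, not the ICA-LiNGAM procedure itself): with non-Gaussian noise, only the regression run in the true causal direction leaves residuals that are independent of the predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# True model: X -> Y, with uniform (non-Gaussian) noise
x = rng.uniform(-1, 1, size=n)
y = x + rng.uniform(-0.5, 0.5, size=n)

def residual(target, predictor):
    """Residual of a simple least-squares regression of target on predictor."""
    beta = np.cov(target, predictor)[0, 1] / predictor.var()
    return target - beta * predictor

def dependence(u, v):
    """Crude dependence check: |corr| of the squared values.
    Near-zero is necessary (not sufficient) for independence."""
    return abs(np.corrcoef(u**2, v**2)[0, 1])

forward = dependence(x, residual(y, x))   # regress along the causal direction
backward = dependence(y, residual(x, y))  # regress against the causal direction

print(forward, backward)  # forward ~ 0, backward clearly nonzero
```

In both directions the residuals are uncorrelated with the predictor by construction; it is the higher-order dependence that breaks the symmetry. With Gaussian noise, both directions would look identical, which is exactly why LiNGAM failed on the Gaussian data set above.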
The GOLEM algorithm again gives us a very good result. Finally, we have a non-linear quadratic data set. We can see that PC is not very good on this, because PC, at least in its classical implementation, works best on linear data sets; GES is also not too good, LiNGAM not too good, and GOLEM not too good either. We should also say that GOLEM was only run for 25,000 iterations, so increasing the number of iterations could bring a better result. Okay, great. I encourage you to look through these results yourself.
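As a complement, here is the score-based idea in miniature (my own sketch; GES uses a much smarter search, but the score below is a standard linear-Gaussian BIC). It also illustrates the Markov equivalence point: the two chains, which are observationally equivalent, get numerically identical scores.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
c = 0.8 * b + 0.6 * rng.normal(size=n)
data = {"A": a, "B": b, "C": c}

def node_loglik(child, parents):
    """Max Gaussian log-likelihood of one node under a linear fit on its parents."""
    y = data[child]
    X = np.column_stack([data[p] for p in parents] + [np.ones(n)])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return -0.5 * n * (np.log(2 * np.pi * resid.var()) + 1)

def bic(dag):
    """dag maps each node to its parent list; higher = better fit/size trade-off."""
    loglik = sum(node_loglik(v, ps) for v, ps in dag.items())
    n_params = sum(len(ps) + 2 for ps in dag.values())  # weights + intercept + variance
    return loglik - 0.5 * n_params * np.log(n)

chain = {"A": [], "B": ["A"], "C": ["B"]}    # A -> B -> C (the truth)
reverse = {"C": [], "B": ["C"], "A": ["B"]}  # C -> B -> A (Markov-equivalent)
empty = {"A": [], "B": [], "C": []}          # no edges at all

print(bic(chain) - bic(reverse))  # ~0: equivalent graphs, equal scores
print(bic(chain) - bic(empty))    # large and positive: the edges pay for themselves
```

A score-based search simply walks through candidate graphs looking for the highest such score, which is why it can only ever distinguish graphs up to their equivalence class.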
Now let's go back to the presentation for a second. The last thing I want to say to you is thank you. If you're interested in causality and related topics, I invite you to my talk on the 29th at PyData Hamburg; it will be online, and you can get a link to the event through this QR code. This QR code will also lead you to a quick survey; I'll be very grateful for your opinions on this talk, how you liked it, so if you can fill in this survey, that will be super helpful and will help me prepare better talks in the future. Please feel free to connect with me on LinkedIn; I'll be very happy to have you in my network. Thank you so much for your attention, and hopefully see you again.

Thank you very much, Aleksander, for an interesting talk. There are some questions from the audience. Jana asks: what are the areas of science in which you would particularly like to see causal discovery applied, and perhaps where existing methods could be replaced to get better results?

I think I need to... sorry, I have some interruption of my network, so I'll try to sit in another room with a bit better connection.
Can you hear me? Yes. Can you hear me, Andre? Yes, yes, I can. Maybe now, can you hear me? Yes. Okay, good. Could you repeat the question, please? Because I had some interruption, unfortunately.

Okay, so Jana asked: what are the areas of science in which you would particularly like to see causal discovery applied, and perhaps where existing methods could be replaced to get better results?

Yeah, I think there are many areas of science where causal discovery can be applied. One area where scientists extensively use these types of methods is the subfields of biology, cellular, molecular, and so on: finding those structures, finding the interactions. These are certainly fields where you can find these methods. Also the social sciences, especially economics: I think this year's or last year's Nobel Prize in economics went, among others, to Guido Imbens, who is a person involved in causal models, a slightly different school of causal models than the one we are discussing here today, which is more Pearlian. But certainly you can find many areas in the social sciences and in economics where people are using these kinds of methods; biology, as I mentioned, and neuroscience, but I also think it might be applicable in things like chemistry and so on. Okay, thank you very much for your answer.
Michał says thank you for a great talk, and asks: could you please tell us a little bit about human-aided inference? Are there any methods that allow prior beliefs to be input as guidance for the algorithms?

That's a great question. The methods that we discussed today, in the form they are implemented in, do not allow for passing prior knowledge, but some of them could potentially be modified to support it. That said, there are more advanced methods that allow for this: many methods today use some form of Bayesian inference behind the scenes to find the causal graph, and as these methods are Bayesian, they operate on priors. You can provide the method with your prior, and you can also, to an extent, set this prior to be stronger or weaker, and so on. I'm not sure if this answers your question, but I hope it at least gives you a bit of an outlook on how this looks; the short answer would be: yes, it's possible.

Okay, thank you. There is also a follow-up from me: how about the size of the data sets required for causal discovery?

Well, we said during the talk that causal discovery is often some kind of statistical estimation; I called it second-order in the sense that it's not working on the data to answer the final question directly, it's working on the data to extract the causal structure and then pass that structure on. So, as with any statistical method, the bigger the data set, the better your estimate. In particular, when estimating independence between variables, it might be very important to have a sufficient sample size in each region of your data space, speaking very generally. So sample size has an impact, but I would not give you a precise number, because it depends on the method, on your data set, and on many other factors; there is no golden rule for this. But certainly, especially when the relationships in your data set are more complex, non-linear, maybe very complex non-linear, when there are many of them, with interactions and that kind of thing, the more data the better.
interactions is another topic we could talk about this for for a while okay thank you very much and Anita asks do you see causal Discovery
used alongside machine learning model methods in the future uh yes definitely uh it's also a great question um we use this uh today oh it's already
happening so we're talking about uh a castle this causal Discovery but in the inch in the introduction we also talked about causal inference and what people use in in marketing for instance today
is causal inference that is powered by Machine learning methods uh so um causal inference was not our main topic today so we didn't go into deep
but uh when you do console inference you know you do not only have an estimate so something that we in machine learning we are trying to estimate some results right
but you also have something called estimate so s demand is this is the way how we identify causal relationships in the graph when we already have a graph
so now for S demands we use set and methods to get them so it might be something called a pact or criteria for instance or something else
but then for estimators to estimate the causal effect when we already have the estimate we can use any machine learning model virtually any machine learning model
and there are methods like causal trees for instance uh or Coastal Forest which is based on on causal um which is a causal version of random Forest it's a
pretty powerful um it's a pretty powerful algorithm you also have a thing called double machine learning which can use also any estimators including neural networks
working on potentially on any type of data so it's already happening we are doing this in many areas this is and this is a very also very Lively area of
research I would say a lot of papers on this topic appearing every month okay thank you and one more question
by mikhao he wonders if the described methods allow to determine the effect size and
the confidence regarding at a given causation yes but this is a domain of mainly a domain of causal inference so in causal
Discovery traditionally at least we are only finding a graph so this is a adjacency matrix that defines which variable influences which which
variable in in like directed causal way now in causal inference what we are doing in causal inference we are actually estimating effect sizes as you
said or like causal effects they might be estimated using different approaches it might be like an average uh treatment effect so-called or average causal effect it might be conditional average
treatment effect which gives you an effect for every group however it is is defined hopefully it answers your question if if
it does not please feel free to ping me on LinkedIn I'm happy to continue this conversation thank you very much Alexander for your
answers and let me again thank you to all our speakers today for a very interesting and very enlightening talks
and thank you again to Intel for sponsoring the second talk I hope you'll all enjoy the conference