Causal discovery in Python - Aleksander Molak, Lingaro | GHOST Day: AMLC 2022
By GHOST Day: AMLC
Summary
Topics Covered
- A is a cause of B if B listens to A
- Causal Discovery extracts structure from data
- Colliders flip variable independence rules
- No universal algorithm fits all data
- Causal inference already uses ML estimators
Full Transcript
Thank you. Welcome, everyone, to today's session on causal discovery in Python. We have a lot of material to cover today, so without further ado, let's start with a quick agenda. In the beginning we'll mention a couple of things that will be important for us today, and I will introduce myself very briefly. Then we'll jump into causal inference and causal discovery: we'll talk about what they are, what the differences between causal inference and causal discovery are, and why we should even bother with these things. Then we'll talk about causal graphs, which are essential building blocks of causal models. Then we will talk about three families of methods that we use for causal discovery, and finally we'll do some practical examples in Python. After that, hopefully we'll have five minutes for your questions.
Great, so let's start. Ah, one more important thing: we'll have a couple of QR codes in the presentation, so you might want to scan them to access the materials, the repository, the notebooks, and so on. Let's start with the materials. We have a repository for this talk and a main notebook that we will use in the second part, so you can open them; if you don't have a QR code reader, you can just use the link at the top. I'll give you five seconds more. Okay, great. Three words about myself: my name is Aleksander Molak. I'm a machine learning engineer and researcher, specializing in natural language processing, probabilistic modeling, and causal inference. If you're on LinkedIn, I'm very happy to connect with you; you can scan this QR code and it will take you to my LinkedIn profile. So I'm happy to connect, stay in touch, and discuss interesting topics in machine learning and causal inference.
Great. Let's talk about causality. What is causal inference, what is causal discovery, and why should we even bother? Before we answer these questions, let's ask a more elementary one: what is causality itself? We'll start with a definition from Wikipedia. This definition states that causality is the relationship between causes and effects. Is it a good definition? I don't know; you might feel that it's a little bit circular, in the sense that we are trying to define causality in terms of causes, which doesn't necessarily explain too much. But this is how it is, and it's just one of the many, many definitions that have appeared throughout history, because causality has been a very interesting topic for many people, from ancient Greece, ancient China, and the ancient Middle East until today, discussed by philosophers, scientists, mathematicians, and so on. Anyway, this definition might not be the best, at least not for our purposes, so we'll use something different: a definition that comes from Judea Pearl. Judea Pearl is a godfather of causality, you could say; he devoted his life to working on problems of causality, and we will use a lot of ideas from his work in today's presentation. The definition he proposes is a very actionable one: A is a cause of B if B listens to A. What does it mean that B listens to A? It means that if we change A, then we should observe a change in B as well.
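A minimal sketch of this idea (my own toy example, not from the talk's materials, with an assumed structural equation B := 2A + noise): intervening on A moves B, because B listens to A, while intervening on B leaves A untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def simulate(do_a=None, do_b=None):
    """Tiny structural model: A := noise, B := 2*A + noise.
    do_a / do_b override a variable, mimicking an intervention."""
    a = rng.normal(size=n) if do_a is None else np.full(n, do_a)
    b = 2 * a + rng.normal(size=n) if do_b is None else np.full(n, do_b)
    return a, b

# Intervening on A shifts B: B "listens to" A...
_, b0 = simulate(do_a=0.0)
_, b1 = simulate(do_a=1.0)
print(b1.mean() - b0.mean())   # ~2.0

# ...but intervening on B leaves A untouched: A does not listen to B.
a0, _ = simulate(do_b=0.0)
a1, _ = simulate(do_b=5.0)
print(a1.mean() - a0.mean())   # ~0.0
```

The asymmetry between the two interventions is exactly what the "listens to" definition captures, and it is invisible in the joint distribution alone.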
You might ask: why should we model causality? We have machine learning, and machine learning solves many pretty complex problems, problems that a decade or two ago it would have been really hard to imagine being solved by computers, maybe even easily. To answer the question of why we should bother with modeling causality, I will ask another question: how do we increase the number of non-commercial space launches per year? One of the answers, a data-driven one, is to increase the number of PhDs in sociology in the United States. Is it a good answer? Well, I don't know, but certainly from a data point of view it seems like a really good answer, in the sense that the number of doctorates in sociology in the US is a very good predictor of those non-commercial space launches. You can also see this data as a scatter plot with a fitted line, and you can see that the relationship is pretty strong. Nonetheless, if we think back to the definition from Judea Pearl, it would seem rather unlikely, probably for most of us (that's my assumption), that if we changed the number of doctorates in sociology, the number of non-commercial space launches would also change. Why is that? Because we don't see what a direct causal model between those two things could be, and yet the changes are very directly related: it's not that when we have more or fewer of those PhDs, we see a change in non-commercial space launches years later; it happens immediately. So for most people this wouldn't be a reasonable explanation. So now
we can think about the causal inference process, which is a process of estimating a causal effect from observational data. A causal effect is different from correlation; as you know, one could famously say that correlation is not causation. This is what we are taught in many statistics courses, but still, we can actually estimate causal effects; we just need slightly different tools for this than, let's say, first-order statistical tools. What do I mean by first-order? We will see in a second. Now, to estimate these causal effects, we need some causal structure: we don't only need data, we also need some causal structure that we provide to the model. We provide a model with data, as in any classical statistical or machine learning problem, but we also provide it with some causal structure. You can think about this as a model that works not only on the data but also on the story behind the data.

Okay, now let's think about causal discovery, our today's topic. Causal discovery is a process of estimating causal structure from observational data. We said that for causal inference we need data and a story, and now we want to learn the story from the data. We can also look at this problem from a slightly different perspective. If you're familiar with machine learning, you know that we have data and we have a model; there is usually some data-generating process behind the data, but in machine learning we only look at those two top elements. In causal inference we have something additional: we have some information about this data-generating process, and we provide the model with this information as well as with the data itself. And finally, in causal discovery, what we do is try to obtain this information about the data-generating process directly from the data, and then pass it to the causal inference model. So in a sense you can think of the causal discovery module as second-order statistical learning, which in many cases is just what it is. Causal discovery, this little
block here, will be our focus today.

Okay, let's talk about causal graphs. Causal graphs, as we said in the beginning, are essential building blocks of causal inference. We'll start with a simple example to learn about these building blocks, and we'll learn about the three most basic ones. Let's imagine a fire alarm system that only reacts to smoke: it does not react to fire itself. We might have different fire alarm systems on the market, but for the sake of this example we are focusing only on the type that reacts to smoke, scanning the air and detecting small particles in it. A causal model of such an alarm could be presented like this: we have a fire that causes smoke, and then the smoke triggers the alarm. Now, an important thing to notice here is that if we have smoke here in the middle, then alarm and fire become independent, so there is no arrow between fire and alarm. Why is that? Because our alarm only reacts to smoke; we can imagine a situation where there's no fire but there is smoke, and our alarm reacts to it.

Now let's make it a little more abstract: instead of talking about fire and smoke, we'll just talk about A, B, and C. The structure that we've seen in the fire alarm example is called a chain, and one of its important properties is that if we control for B, then A and C become independent. The second structure is a fork. As you can see, a fork has one arrow that goes in the other direction compared to a chain: the chain goes all one way, while the fork goes left and right. Its important property is exactly the same: if we control for B, A and C are independent, but they will be dependent if we do not control for B. This is an important aspect. Finally, the third structure is called a collider, and a collider is different: as you can see, now both arrows are directed into B. A very important property of this structure is that A and C become dependent if we control for B, but they are independent otherwise. So this is reversed compared to
the fork and the chain. Now let's see the same thing in numbers. I performed a simple experiment for you: I built a chain, a fork, and a collider, and then I tried to predict C with A using a simple model, a linear regression. As you can see, A is not a significant predictor of C when controlling for B; this is for the chain structure. For the fork structure it's exactly the same: A is also not a significant predictor of C when controlling for B. But when we look at the collider, we see that A is a significant predictor of C given B, that is, when controlling for B. You can see the notebook, it's the first notebook in the repository, and you can use this QR code to perform the experiment yourself. The important thing to remember from this section is that the collider has a different independence structure than the fork and the chain.
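If you want the flavor of that experiment without the notebook, here is a small self-contained sketch (my own illustration; the talk's notebook uses a linear regression with significance tests, while here "controlling for B" is done via partial correlations computed with plain numpy):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z
    (i.e. 'controlling for' z)."""
    Z = np.column_stack([z, np.ones(len(z))])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return corr(rx, ry)

# Chain: A -> B -> C. Dependent marginally, independent given B.
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
chain_marginal, chain_partial = corr(a, c), partial_corr(a, c, b)

# Collider: A -> B <- C. Independent marginally, dependent given B.
a = rng.normal(size=n)
c = rng.normal(size=n)
b = a + c + 0.5 * rng.normal(size=n)
collider_marginal, collider_partial = corr(a, c), partial_corr(a, c, b)

print(f"chain:    {chain_marginal:+.2f} marginal, {chain_partial:+.2f} given B")
print(f"collider: {collider_marginal:+.2f} marginal, {collider_partial:+.2f} given B")
```

A fork behaves exactly like the chain here; swapping B into the parent role reproduces it. The reversal for the collider is the signature that constraint-based methods exploit.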
Now let's talk about causal discovery methods. We talked about three basic graph structures, and we can leverage their properties in causal models; now we'll talk about three families of causal discovery methods. These are methods that use different aspects of the data to build a causal graph.

The first family is constraint-based methods. They're called constraint-based because they leverage the independence constraints of graphs, which is exactly what we discussed a second ago: they look at the independence structures between any three variables in your data set and try to construct a graph. They say: hey, these two are dependent, but they become independent when you control for the third one, so maybe there's a collider, or maybe there's no collider, and so on. In this sense they try to detect connections between variables and also to direct the edges, because if there's a collider, we know the edges point into it; but if there is no collider, the edges could still be oriented in a couple of ways, and maybe a fourth variable will help us orient them. The second family of methods is called score-based methods: these methods generate many graphs and then compare them based on their fit to the data, computing some kind of score, maybe BIC or some other score. And finally we have functional methods, which leverage statistical properties of the variables and the noise in the model, using asymmetries between them to orient the edges in the graph.

So, to summarize constraint-based methods: they leverage the graph independence structure, based on colliders and non-colliders, to build a causal graph. They also need an important assumption called faithfulness; we won't talk about this too much, but it basically says that if there's some independence in the data, then this independence maps to an independence in the graph as well. An example of a constraint-based method, a classic one, maybe not the best from today's point of view, but a classic one, is the PC algorithm.

Then we have score-based methods, the second family. Score-based methods compare different graphs in terms of their fit to the data, as we said. Finding the optimal graph is NP-hard, though, so exhaustive search is not realistic; these methods usually use heuristics, sometimes pretty smart heuristics, to find an optimal graph. It's not always globally optimal, but at least it might be locally optimal. One example algorithm of this type is the so-called GES, which stands for greedy equivalence search.

Finally, we have functional methods. They leverage asymmetries or irregularities in variable and noise distributions. They assume that the causal model has a functional form, or can be represented in one, where any variable that has parents is a function of those parents and some noise. There are many different functional models: additive, non-additive, linear, non-linear, and so on. One of the classic examples of functional methods is an algorithm called LiNGAM, which stands for linear non-Gaussian acyclic model. It's non-Gaussian because it leverages asymmetries in the model: Gaussian variables are symmetric, with a mean and the same shape on the left and on the right, while other distributions are not symmetric, and this can be used to estimate the causal graph.

There are also other methods: hybrid methods that combine the constraint-based and score-based approaches, reinforcement-learning-based methods, and end-to-end methods that do causal discovery and inference at once. Sometimes they also do missing data imputation, because they build a generative model that generates the graph and can also generate the data itself. This is very interesting, but it's beyond the scope of today's presentation. Great, so we are ready to go
and do something in practice: we'll jump to the notebook now. Hopefully you can see this. Great. We are importing a couple of libraries. The main library that we'll be using today is called gCastle; you import it as `castle`, and you can install it using pip. You can also install the whole environment using the YAML file in the repository that we linked to. We'll also use a GES library that implements greedy equivalence search, because this is not currently implemented in gCastle, and we'll use networkx for graph visualizations and numpy for some data generation.

Okay, let's talk about the PC algorithm. We talked about those three families: constraint-based, score-based, and functional models; the PC algorithm is an example of a constraint-based model. We're talking about graphs today, about the three basic graph structures, chains, forks, and colliders, and we will make use of them right now. Let's imagine that this is our true graph: X and Y cause Z, and Z causes W. The PC algorithm starts with a fully connected, undirected graph, and then it removes edges based on the graph independence structure: first it removes edges between variables that are always independent, pairwise independent, and so on. As you can see, at some point it is left with undirected edges. Then it directs the edges here, because we have a collider on Z: it can find out that X and Y are both causing Z. Then we are left with this one edge, and PC checks whether this is a collider, checking the independence and conditional independence between W and Y; it finds it's not a collider, so it can infer that the direction goes this way. This is pretty neat, and I must give you one additional piece of information here: this is a simple example, and PC works very well for it, but there are more complex cases where the PC algorithm is able to find the edges but not able to direct them, so it returns a so-called Markov equivalence class. The name might ring a bell with GES, greedy equivalence search, and that's very good, because it's the same notion: a Markov equivalence class means the graph has some undirected edges which, from a graphical-model, statistical point of view, might be directed either way; we just don't know. Anyway, in this case PC works very well, so let's try to implement it. We have our variable X, which is just a Gaussian variable with 1,000 observations; Y is the same thing; Z is X plus Y plus some noise, and W is 0.7 times Z plus some noise as well. We'll stack these, which just builds a little matrix out of those variables, and we have our data set; a little sanity check, okay. Now let's learn the causal structure. It's very simple: we import `PC` from `castle.algorithms`, call its `learn` method, and provide it with the data set. We can see that it returns a causal matrix. This next code is a lot of lines, but don't worry, it's just visualizing the graph. Let's see: we have X causing Z, Y causing Z, and Z causing W, so this is exactly what we have here. That's great, that's pretty awesome.
And it's really amazing to see this, right? It really works; it was able to recover the causal structure from the data. Now we will build a couple of data sets that are more complex, and we'll try a couple more methods on them. This is our graph, the original graph, and we'll have a couple of data sets based on it, linear and non-linear, Gaussian and non-Gaussian; this is the same graph as a matrix. We use the PC algorithm; GES, the score-based algorithm; LiNGAM, in one of its flavors, called ICA-LiNGAM; and GOLEM, which is a more advanced algorithm. We'll run these algorithms on all the data sets; I did it in advance, because some of these algorithms take time, so we can just briefly examine the results. As you can see, the PC method did not do a very good job, but not a very bad job either: its false discovery rate is 0.2, recall 0.75, F1 score 0.75, and the number of undirected edges is one, meaning one extra edge detected by this algorithm was added undirected. GES, the score-based method, does much worse than PC on this data set, and LiNGAM is, as you can see, pretty bad, maybe not terrible, but pretty bad. Why? Because it's a non-Gaussian method and this was a linear Gaussian data set, so it's not a good method for this type of data. The GOLEM algorithm did a very good job, actually an F1 score of 100 percent. Then, for the linear exponential data set, you can see that PC does a worse job than on the previous one, and GES is also not very good, but LiNGAM is actually perfect: here we have a linear model with a non-Gaussian distribution, and this algorithm works perfectly, compared to the Gaussian case, where it worked much worse.
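The reason LiNGAM needs non-Gaussian noise can be seen in a tiny two-variable simulation (my own sketch, not the ICA-LiNGAM procedure itself): with non-Gaussian noise, only the regression run in the true causal direction leaves residuals that are independent of the predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# True model: X -> Y, with uniform (non-Gaussian) noise
x = rng.uniform(-1, 1, size=n)
y = x + rng.uniform(-0.5, 0.5, size=n)

def residual(target, predictor):
    """Residual of a simple least-squares regression of target on predictor."""
    beta = np.cov(target, predictor)[0, 1] / predictor.var()
    return target - beta * predictor

def dependence(u, v):
    """Crude dependence check: |corr| of the squared values.
    Near-zero is necessary (not sufficient) for independence."""
    return abs(np.corrcoef(u**2, v**2)[0, 1])

forward = dependence(x, residual(y, x))   # regress along the causal direction
backward = dependence(y, residual(x, y))  # regress against the causal direction

print(forward, backward)  # forward ~ 0, backward clearly nonzero
```

In both directions the residuals are uncorrelated with the predictor by construction; it is the higher-order dependence that breaks the symmetry. With Gaussian noise, both directions would look identical, which is exactly why LiNGAM failed on the Gaussian data set above.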
The GOLEM algorithm again gives us a very good result. Finally, we have a non-linear quadratic data set. We can see that PC is not very good on this, because PC, at least in its classical implementation, works best on linear data sets; GES is also not too good, LiNGAM not too good, and GOLEM not too good either. We should also say that GOLEM was only run for 25,000 iterations, so increasing the number of iterations could bring a better result. Okay, great. I encourage you to look through these results yourself.
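As a complement, here is the score-based idea in miniature (my own sketch; GES uses a much smarter search, but the score below is a standard linear-Gaussian BIC). It also illustrates the Markov equivalence point: the two chains, which are observationally equivalent, get numerically identical scores.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
c = 0.8 * b + 0.6 * rng.normal(size=n)
data = {"A": a, "B": b, "C": c}

def node_loglik(child, parents):
    """Max Gaussian log-likelihood of one node under a linear fit on its parents."""
    y = data[child]
    X = np.column_stack([data[p] for p in parents] + [np.ones(n)])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return -0.5 * n * (np.log(2 * np.pi * resid.var()) + 1)

def bic(dag):
    """dag maps each node to its parent list; higher = better fit/size trade-off."""
    loglik = sum(node_loglik(v, ps) for v, ps in dag.items())
    n_params = sum(len(ps) + 2 for ps in dag.values())  # weights + intercept + variance
    return loglik - 0.5 * n_params * np.log(n)

chain = {"A": [], "B": ["A"], "C": ["B"]}    # A -> B -> C (the truth)
reverse = {"C": [], "B": ["C"], "A": ["B"]}  # C -> B -> A (Markov-equivalent)
empty = {"A": [], "B": [], "C": []}          # no edges at all

print(bic(chain) - bic(reverse))  # ~0: equivalent graphs, equal scores
print(bic(chain) - bic(empty))    # large and positive: the edges pay for themselves
```

A score-based search simply walks through candidate graphs looking for the highest such score, which is why it can only ever distinguish graphs up to their equivalence class.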
Now let's go back to the presentation for a second. The last thing I want to say to you is thank you. If you're interested in causality and related topics, I invite you to my talk on the 29th at PyData Hamburg; it will be online, and you can get a link to the event through this QR code. This QR code will also lead you to a quick survey; I'll be very grateful for your opinions on this talk, how you liked it, so if you can fill in this survey, that will be super helpful and will help me prepare better talks in the future. Please feel free to connect with me on LinkedIn; I'll be very happy to have you in my network. Thank you so much for your attention, and hopefully see you again.

Thank you very much, Aleksander, for an interesting talk. There are some questions from the audience. Jana asks: what are the areas of science in which you would particularly like to see causal discovery applied, and perhaps where existing methods could be replaced to get better results?

I think I need to... sorry, I have some interruption of my network, so I'll try to sit in another room with a bit better connection.
Can you hear me? Yes. Can you hear me, Andre? Yes, yes, I can. Maybe now, can you hear me? Yes. Okay, good. Could you repeat the question, please? Because I had some interruption, unfortunately.

Okay, so Jana asked: what are the areas of science in which you would particularly like to see causal discovery applied, and perhaps where existing methods could be replaced to get better results?

Yeah, I think there are many areas of science where causal discovery can be applied. One area where scientists extensively use these types of methods is the subfields of biology, cellular, molecular, and so on: finding those structures, finding the interactions. These are certainly fields where you can find these methods. Also the social sciences, especially economics: I think this year's or last year's Nobel Prize in economics went, among others, to Guido Imbens, who is a person involved in causal models, a slightly different school of causal models than the one we are discussing here today, which is more Pearlian. But certainly you can find many areas in the social sciences and in economics where people are using these kinds of methods; biology, as I mentioned, and neuroscience, but I also think it might be applicable in things like chemistry and so on. Okay, thank you very much for your answer.
Michał says thank you for a great talk, and asks: could you please tell us a little bit about human-aided inference? Are there any methods that allow prior beliefs to be input as guidance for the algorithms?

That's a great question. The methods that we discussed today, in the form they are implemented in, do not allow for passing prior knowledge, but some of them could potentially be modified to support it. That said, there are more advanced methods that allow for this: many methods today use some form of Bayesian inference behind the scenes to find the causal graph, and as these methods are Bayesian, they operate on priors. You can provide the method with your prior, and you can also, to an extent, set this prior to be stronger or weaker, and so on. I'm not sure if this answers your question, but I hope it at least gives you a bit of an outlook on how this looks; the short answer would be: yes, it's possible.

Okay, thank you. There is also a follow-up from me: how about the size of the data sets required for causal discovery?

Well, we said during the talk that causal discovery is often some kind of statistical estimation; I called it second-order in the sense that it's not working on the data to answer the final question directly, it's working on the data to extract the causal structure and then pass that structure on. So, as with any statistical method, the bigger the data set, the better your estimate. In particular, when estimating independence between variables, it might be very important to have a sufficient sample size in each region of your data space, speaking very generally. So sample size has an impact, but I would not give you a precise number, because it depends on the method, on your data set, and on many other factors; there is no golden rule for this. But certainly, especially when the relationships in your data set are more complex, non-linear, maybe very complex non-linear, when there are many of them, with interactions and that kind of thing, the more data the better.
interactions is another topic we could talk about this for for a while okay thank you very much and Anita asks do you see causal Discovery
used alongside machine learning model methods in the future uh yes definitely uh it's also a great question um we use this uh today oh it's already
happening so we're talking about uh a castle this causal Discovery but in the inch in the introduction we also talked about causal inference and what people use in in marketing for instance today
is causal inference that is powered by Machine learning methods uh so um causal inference was not our main topic today so we didn't go into deep
but uh when you do console inference you know you do not only have an estimate so something that we in machine learning we are trying to estimate some results right
but you also have something called estimate so s demand is this is the way how we identify causal relationships in the graph when we already have a graph
so now for S demands we use set and methods to get them so it might be something called a pact or criteria for instance or something else
but then for estimators to estimate the causal effect when we already have the estimate we can use any machine learning model virtually any machine learning model
and there are methods like causal trees for instance uh or Coastal Forest which is based on on causal um which is a causal version of random Forest it's a
pretty powerful um it's a pretty powerful algorithm you also have a thing called double machine learning which can use also any estimators including neural networks
working on potentially on any type of data so it's already happening we are doing this in many areas this is and this is a very also very Lively area of
research I would say a lot of papers on this topic appearing every month okay thank you and one more question
by mikhao he wonders if the described methods allow to determine the effect size and
the confidence regarding at a given causation yes but this is a domain of mainly a domain of causal inference so in causal
Discovery traditionally at least we are only finding a graph so this is a adjacency matrix that defines which variable influences which which
variable in in like directed causal way now in causal inference what we are doing in causal inference we are actually estimating effect sizes as you
said or like causal effects they might be estimated using different approaches it might be like an average uh treatment effect so-called or average causal effect it might be conditional average
treatment effect which gives you an effect for every group however it is is defined hopefully it answers your question if if
it does not please feel free to ping me on LinkedIn I'm happy to continue this conversation thank you very much Alexander for your
answers and let me again thank you to all our speakers today for a very interesting and very enlightening talks
and thank you again to Intel for sponsoring the second talk I hope you'll all enjoy the conference