Ruibo Tu - A brief introduction to causal discovery
By Digital Futures: Research Hub for Digitalization
Summary
Topics Covered
- Conditional Probability vs Interventional Distribution
- Correlation Is Not Causation, But We Can Still Discover Causes
- Functional Causal Models Identify Direction of Causation
- Auto-Regressive Normalizing Flow for Causal Discovery
Full Transcript
I would like to share some of my opinions about causal discovery, so I definitely have some bias based on my literature studies and my own work; if you know more and I am wrong somewhere, please just correct me. For today's talk I assume the audience has no prior knowledge of causal discovery, so this will be a very brief introduction at the basic level, and then I will mention some connections between causal discovery and machine learning studies. Let me start.
Maybe I could give you some causality examples first. Causality is not something unfamiliar; it actually appears in our daily life. We see it in reports, papers, and newspapers, in phrases like "explained by" or "leads to", and you may have seen figures like the famous one plotting the chocolate consumption rate of a country against its number of Nobel prizes. So what is causality and why is it so important? To answer that question, I would like to share a very concrete example, because the previous ones are a bit blurry. The example is made up by me, nothing is real here; it's just for demonstration. Let's imagine that we are given a causal graph. This example is based on the pandemic experience, something like Covid-19.
First we are given a causal graph, and we want to show how we can use causality in our daily research or applications. The graph tells us that we are interested in the status of our patients, that we have some special medicine, and that we have another variable telling us whether a patient has a fever or not. The graph says there is a causal relationship between the medicine and the status of the patient; meanwhile, the fever has an influence on the medicine, which means that whether we give the medicine to a patient depends on whether the patient has a fever. And another arrow tells us that whether the patient has a fever also influences the status of the patient: if the patient has a fever, they will feel bad. Now we are interested in one thing: we want to know whether this specific medicine is effective or not. So what can we do?
For example, we could run randomized controlled trials. We could also think about using some data-driven method: we collect data, just randomly. We go to the patients and ask whether they have a fever, whether they have ever taken the medicine, and how they felt after taking it. Then we have a data set where, say, a one in the status column means "I feel better", a one in the medicine column means "I took the medicine", and a one in the fever column means "I do have a fever". Let's move on. To answer the question we could use randomized controlled trials; of course, what I show here is a simplified version, not something used in practice. For example, we have a group of patients and we randomly split them into two groups. We give the medicine to the first group and give a capsule with nothing effective in it to the second group, then we see whether they become better or not after taking it, we look at the recovery, and we draw a conclusion based on that.
You may have already seen that there are some issues with this, and we can come back to them later: for example ethical issues, whether we are even allowed to do it, or that it is very expensive. But let's put this method aside and think about whether we can do something else, since we have already got the data. Can we do something with the data? Yes, of course. What can we do? For example, we can compute conditional probability distributions over the variables, such as the probability of the status of the patient conditioned on the medicine. Conditioning on the medicine variable equal to one means we only consider the group of patients who took the medicine, and equal to zero means those who never took it. From the frequency of the good states observed in the collected data set we get a probability for each group, and here the two turn out to be the same. What does that mean? Does it mean the medicine is effective, or not effective? We don't know, because they are the same; maybe we could conclude it doesn't help, but at least it didn't make things worse. Then we compare something else: now we only look at the group of people who have a fever and compute exactly the same probabilities. We get one number for those who took the medicine and another for those who didn't, and this gives us a different conclusion: if we use this second row of probabilities, the medicine is actually helpful. So which one should I trust? The two conclusions conflict with each other, depending on which conditional probability you use.
However, I want to say that this misleading result does not mean we cannot use observational data to draw a conclusion; it means we need a more careful way to compute the probabilities. In causal inference there is the concept of the interventional distribution. It does something similar to a randomized controlled trial. In a randomized controlled trial, randomly giving patients the medicine actually breaks the arrow from the fever state to the medicine, so whether a patient takes the medicine no longer depends on the fever state. Then we can ask: what is the effect of the medicine? Once we break this misleading path, we see what happens. In causal inference, we compute the interventional distribution to check the effectiveness of the medicine. Instead of computing the conditional probability, we compute the interventional probability, and the only difference, as you may have already observed, is that it uses the marginal distribution of the fever instead of the conditional distribution. The graph we want to reason about is the intervened one, but we don't have data from it, so the idea is to use the data from one causal graph to compute probabilities that represent the graph we actually care about. That is the basic idea of the interventional distribution, and with it we can conclude whether the medicine is effective or not. You may have noticed that the conditional probability distribution is not the interventional distribution, at least in this case, and this is why we need causal modeling for this application: without it, we could not reach such a conclusion.
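This adjustment can be sketched numerically. The sketch below is a toy model of the fever/medicine/status graph, with all probability numbers made up for illustration; it contrasts the conditional probability P(S=1 | M=m) with the interventional probability P(S=1 | do(M=m)), which weights the fever by its marginal distribution instead of conditioning on the medicine.

```python
# Toy sketch: conditional vs interventional probability for the
# fever (F) -> medicine (M) -> status (S) example, with F -> S as well.
# All numbers are made up for demonstration only.

P_F = {0: 0.5, 1: 0.5}                        # P(fever)
P_M_given_F = {0: {0: 0.9, 1: 0.1},           # P(medicine | fever)
               1: {0: 0.1, 1: 0.9}}
P_S_given_MF = {(0, 0): 0.8, (1, 0): 0.9,     # P(status good | medicine, fever)
                (0, 1): 0.3, (1, 1): 0.4}

def conditional(m):
    """P(S=1 | M=m): weights the fever by P(F=f | M=m)."""
    pm = sum(P_M_given_F[f][m] * P_F[f] for f in (0, 1))
    return sum(P_S_given_MF[(m, f)] * P_M_given_F[f][m] * P_F[f] / pm
               for f in (0, 1))

def interventional(m):
    """P(S=1 | do(M=m)): weights the fever by its marginal P(F=f)."""
    return sum(P_S_given_MF[(m, f)] * P_F[f] for f in (0, 1))
```

With these numbers the naive conditional comparison suggests the medicine hurts, because feverish patients both feel worse and are far more likely to receive it, while the interventional comparison, which is what a randomized trial estimates, shows the medicine helps within every stratum.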
Good. This is one example of why we need causality, and how we can use causal knowledge, or a causal graph, in applications.
Then let's move on to another question. Now we know we can use a causal graph for some applications; however, how can we get such a causal graph in the first place? It is helpful, but where does it come from? In most cases we don't have the causal graph. We could take it from a textbook or construct it from domain knowledge, but what if we don't have that? Then it becomes super expensive and time consuming, or we need to do experiments like randomized controlled trials. In many applications the gold standard for obtaining this causal knowledge, or creating this causal graph, is still the randomized controlled trial. However, we are asking: can we get causal knowledge, or the causal graph, purely from observational data? Is that possible at all? We have a lot of data; can we really mine it to get causal knowledge out of it?
Well, there is a very famous quote telling us that correlation is not causation. If you just use standard statistical methods you may have problems, because correlation is not causation; but that doesn't mean we cannot use observational data for causal discovery. It means we need to do more rigorous things to get causal relations from the data. For example, we may need some meaningful assumptions, or some principles, which are more general things, and then we can form causal hypotheses or causal conclusions purely based on the observational data.
In today's talk I will introduce only some of the causal discovery methods. There is one class of methods named constraint-based causal discovery methods; this class tries to use the independence relationships in the data, together with some assumptions, to infer a causal graph. The second category is the score-based methods: we define a score function that satisfies some constraints or conditions to evaluate a causal structure. You can design your score function for your application, then optimize it, and the graph structure with the best score is taken as the ground-truth causal graph. One commonly used method in this category is GES; I think I give the reference somewhere later. The third category is more recent, from maybe the last 15 years, and is named functional causal models: for example LiNGAM, which means Linear Non-Gaussian Acyclic Models, the second one is Additive Noise Models, and the third one is Post-Nonlinear Models.
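In the usual bivariate notation (x the cause, y the effect, e a noise term independent of x), these three model families can be written as:

```latex
\begin{align*}
\text{LiNGAM:} \quad & y = b\,x + e, && e \text{ non-Gaussian} \\
\text{Additive Noise Model:} \quad & y = f(x) + e \\
\text{Post-Nonlinear Model:} \quad & y = g\bigl(f(x) + e\bigr), && g \text{ invertible}
\end{align*}
```

Each family generalizes the previous one, and the identifiability conditions differ per family.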
There are also many other methods I leave out, for example entropy-based methods; I remember there are some works at KTH based on this entropy, information-theoretic approach. There are also methods based on the independent causal mechanism principle, and others, but there has been a lot of work recently and I will not have a chance to introduce those methods. Let's move on. This is basically the outline of today's talk. First I will introduce the constraint-based method, very briefly because I don't have enough time in this talk, and then some extensions of the method; in the second row you can see more recent works. I give the links here, I will share these slides, and you can directly check the papers through the links. The first row shows the causal discovery methods I mentioned, and the second row shows the related machine learning methods or models; the blue and yellow ones are the more recent methods. First I will introduce one of the constraint-based methods, the PC algorithm, named after its authors Peter and Clark, together with some extensions; in this talk I will cover the missing data problem. Of course there are other problems, like selection bias and so on, but I don't have a chance to introduce them. For the second category I may not have time, so I refer the interested audience to the papers on these optimization-based, score-based methods. And I will introduce the functional causal models in this talk, and one method that uses normalizing flows for causal discovery and causal inference. Well, let's start.
Before I introduce the constraint-based method, I would like to highlight some very important concepts, because I was confused about them for a long time: DAGs, that is, Directed Acyclic Graphs, and causal graphs. What is the relation between these two concepts? Let's first have a look at DAGs. A directed acyclic graph is a very effective way to represent human knowledge, like prior knowledge or domain knowledge; we can summarize our knowledge in a graph. Each node is a factor or random variable, and the directions may implicitly indicate causal directions, but what the graph actually encodes is the conditional independencies between variables, and it gives us the local and global Markov conditions, which I will introduce later. Regarding DAGs there are a lot of methods, like Bayesian network methods; there is a lot of work in PGM, probabilistic graphical models, and we can use those tools for problems involving DAGs. And there is actually a very close relationship between DAGs and causal graphs: most recent causal discovery methods are based on this acyclicity assumption, that is, on DAGs.
Then let's look at what a causal graph, or causality, is. Causal reasoning is a very important part of human cognition; we reason about so many things in this causal sense, which is a somewhat subjective thing. I can give you some examples. The sun rises, and then we observe that a bird sings a song. This is one direction: the sunrise is the cause of the bird singing. We don't have the other direction telling us that the reason the sun rises is that a bird is singing a song. So we only have one direction to go between these two observations; this graph is acyclic, and with only two variables it is a very simple DAG. The second example: we have a bird and then we have an egg, which is one direction; the other direction is that we have an egg and then we have a bird. This is a typical cyclic causal graph. So a causal graph can be cyclic; it is not necessarily acyclic, and there are many cyclic causal graphs. Recently there have been many works trying to handle directed cyclic causal graphs, but that is out of the scope of today's talk, so today let's only think about DAGs.
DAGs can be considered as causal graphs, and here I introduce this from the perspective of constraint-based causal discovery methods. In which case can we say a DAG is a causal graph? I use the interpretation from Dominik's work here (I share the reference): a DAG is a causal graph, or this statement is acceptable, if and only if the joint distribution satisfies the Markov condition with respect to the DAG. There is of course another interpretation, one more related to economics, while this one is stated in a more machine learning way. How can we understand this statement; what does the Markov condition mean? There is a local Markov condition and a global Markov condition, and I would like to interpret causal relationships in the sense of the local Markov condition. The local Markov condition tells us something very informative: given a variable of interest, we only need to know its parent variables, the parent nodes in the graph, and then we know that the variable of interest is independent of everything else except its descendants. That means we don't need to look at all the variables; we only need to look at a few of them, a subset of the whole set, and they are sufficient for us to know about the variable of interest.
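As a small numerical illustration of the local Markov condition, consider a chain X → Y → Z with made-up conditional probability tables: given its parent Y, the variable Z is independent of its non-descendant X, which can be verified exactly by summing over the joint distribution.

```python
from itertools import product

# Local Markov condition on the chain X -> Y -> Z: given its parent Y,
# Z is independent of the non-descendant X. The joint distribution
# factorizes along the graph; all CPT numbers are made up.

P_X = {0: 0.6, 1: 0.4}
P_Y_X = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(Y | X)
P_Z_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P(Z | Y)

joint = {(x, y, z): P_X[x] * P_Y_X[x][y] * P_Z_Y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def p_z_given(x=None, y=None):
    """P(Z=1 | evidence) computed from the joint by summation."""
    num = sum(p for (xi, yi, zi), p in joint.items()
              if zi == 1 and (x is None or xi == x) and (y is None or yi == y))
    den = sum(p for (xi, yi, zi), p in joint.items()
              if (x is None or xi == x) and (y is None or yi == y))
    return num / den
```

Conditioning additionally on the non-descendant X does not change the distribution of Z once the parent Y is fixed, which is exactly what the local Markov condition promises.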
You can apply this idea, for example, to causal feature selection, or to increase the robustness of your machine learning models; it has been applied in many machine learning models. But this statement itself interprets the causal relationships in a DAG. That is the constraint-based causal discovery perspective.
Okay, now I think we are ready to learn something about constraint-based methods. First I would like to introduce two very important assumptions of this kind of constraint-based method, especially of PC. Of course there is another one, minimality, but for understanding the method I think these two are the most important. The first one is the causal Markov condition. It tells us that, given the causal relationships in a graph (we have a graph here: we know the fever state is a cause of the medicine, as I have already introduced), we obtain the d-separation relationships, although I will not introduce d-separation formally today. So, given the causal relationships, we have a way to know the statistical dependencies, the conditional independencies, in the data; they are consistent, so given the graph we can say what is supposed to hold in the data. Moreover, for causal discovery we care about the other direction, because that is what we are more interested in: given the data, we want to say something about the causal graph. We have the data but we don't have the graph, so how can we say something about the graph? The second assumption, faithfulness, tells us that given the statistical dependencies, the conditional independencies in the data, we can infer the d-separations, the causal relationships, in the graph; the two are supposed to be consistent with each other.
This is quite a powerful assumption, quite strict, and there is a lot of discussion around it. Let's look at a case in which the faithfulness assumption is violated. Recall the causal graph: we care about the medicine variable and the status of the patient, and there are two paths between these two variables. The first is the direct cause; the second goes through the confounder, the fever. In the case where the effect along one path is negative, the effect along the other is positive, and the two cancel out, we will not observe any correlation in the data, any hint of the causal relation, because they cancel. This is a violation of faithfulness, because faithfulness tells us that what we observe in the data can be used to infer the causal relations in the graph; if something like this happens, we cannot do that.
Okay now maybe we could have a look at the PC algorithm, yeah.
You can think of the PC algorithm as a package in Python, or as a function. The input is the data and, I would like to emphasize, the assumptions: the assumptions of causal discovery methods are very important, because they indicate the view under which we understand the causal relationships and the results. Given the function, we input the data, the algorithm runs two steps, and after these two steps it returns a completed partially directed acyclic graph (CPDAG); I will explain that later. Here is the ground-truth causal graph, and we can generate data based on this graph.
Now we have the data. First we initialize a complete graph, where every variable is connected with every other. The first step is the skeleton search: it checks, for every pair of variables, whether they are conditionally dependent or independent given subsets of the remaining variables. It checks every pair, though of course there are efficient ways to do this. Once we find that X is independent of Y conditioning on some variables, we remove the edge between them; similarly we remove the other edges, and the result is the skeleton. We call it the causal skeleton; it is the undirected version of the causal graph. Then comes the second step: we orient the edges following some rules summarized in the algorithm. After applying those rules we have directions for some of the edges, but not all of them; I will explain why later. And this is the result of the PC algorithm.
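The two steps can be sketched with a conditional independence oracle instead of statistical tests, so that only the algorithm's logic is shown. The sketch below runs them on the collider X → Z ← Y, whose only true independence is X ⊥ Y given the empty set; edge removal plus the collider rule then recovers both arrows. This is a toy illustration, not a full PC implementation: it omits the efficiency tricks and the later orientation rules.

```python
from itertools import combinations

# Oracle-based sketch of the two PC steps on the collider X -> Z <- Y.
nodes = ["X", "Y", "Z"]
# In X -> Z <- Y the only independence is X _||_ Y given the empty set.
independences = {("X", "Y", frozenset())}

def indep(a, b, cond):
    key = (min(a, b), max(a, b), frozenset(cond))
    return key in independences

# Step 1: skeleton search - remove edge a-b if a _||_ b | S for some
# subset S of the remaining nodes; remember S as the separating set.
edges = {frozenset(p) for p in combinations(nodes, 2)}
sepset = {}
for a, b in combinations(nodes, 2):
    rest = [n for n in nodes if n not in (a, b)]
    for k in range(len(rest) + 1):
        for S in combinations(rest, k):
            if indep(a, b, S):
                edges.discard(frozenset((a, b)))
                sepset[frozenset((a, b))] = set(S)

# Step 2: collider orientation - for an unshielded triple a - c - b,
# orient a -> c <- b when c is NOT in the separating set of (a, b).
arrows = set()
for a, b in combinations(nodes, 2):
    if frozenset((a, b)) in edges:
        continue
    for c in nodes:
        if (frozenset((a, c)) in edges and frozenset((b, c)) in edges
                and c not in sepset.get(frozenset((a, b)), set())):
            arrows.add((a, c))
            arrows.add((b, c))
```

With real data the oracle is replaced by a statistical test (for example a partial correlation or kernel independence test), and that is where the practical difficulties of PC come from.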
You can see that we do recover the direction of some edges, which means some causes are identified, but for the other edges we do not know which end is the cause. We are also using the causal sufficiency assumption here; that is why we can say there is a causal relationship, a cause, at all. Now we know what the PC algorithm is, and maybe we can understand something about its results: why we only recover part of the causal directions, not all of them. The reason is that the remaining graphs are in the same Markov equivalence class. For example, in a three-variable case with X, Y, Z, three of the possible graphs are equivalent to each other in the sense of d-separation: they have the same d-separations, so they are supposed to produce the same statistical dependencies in the data, and that is why we cannot distinguish them. However, the collider case has the same causal skeleton but different d-separations, so we can distinguish it from all the others. That is why we can orient the causal directions in that case, and there are some further orientation rules that follow from this. There is a theorem saying that two graphs are Markov equivalent to each other if and only if they have the same skeleton, the causal skeleton, and the same colliders. So that is why we end up with a partially oriented graph.
Now we know our first causal discovery method. Of course it has many limitations and challenges. For example, there are model limitations: we cannot use it in dynamical scenarios, and at least this version of PC cannot be used for time-series data, although of course it can be extended; and for cyclic graphs we certainly cannot use this version, or rather you can run it but the results will be misleading. Especially in the bivariate case we cannot distinguish a pair of variables, because a pair in which one causes the other lies in the same Markov equivalence class, so we cannot distinguish even the simplest case. And when we apply the PC algorithm in practice we face a lot of practical challenges: for example the missing data problem, selection bias, unobserved variables acting as confounders, and heterogeneous data. These are all challenges for causal discovery, and ongoing, state-of-the-art research is focusing on them. I will briefly introduce one of these challenges, missing data, which I was working on:
given the causal graph here, we can generate data from it. I will go through the rest of the talk quickly. So we have a causal graph, we generate data from it, and the variable Y has missing values: in the generated data, these entries are missing. One way to deal with missing data is simply to delete all the records that contain a missing value and then apply PC to the data set that is left. However, in our study we showed that the conditional independence relationships in the data set after deletion can be different from those in the complete data set, and if you use the deleted one you can get misleading results. Instead, we recover the conditional independence relationships with a method that uses the incomplete data directly, and with that we can correct the wrong results.
Now let's move to some more recent studies; this is actually the important part of the talk. The following studies are more recent ones, from maybe the last 15 years. Let's consider a simplified scenario: we have only two variables, and we know one of them is the cause of the other. But how can we distinguish the cause from the effect? Can we use the PC algorithm, or statistical independence, to do that? Let's look at some examples. Suppose both variables are Gaussian distributed: then the scatter plot of x against y and that of y against x look the same. Now consider another scenario in which x is uniformly distributed and we don't know about y; we plot the observational data and then look at it from the other direction, and the two views look different. So what is happening here? Why does it look different when I change the distribution from Gaussian to uniform?
The hint is here. I generated the data using a linear model: y = a x + e, where x is the cause and e is a noise term; you can think of this as a linear regression model with e as the residual. In this case I used uniform distributions for both x and e. Now, given the data, we apply linear regression of y on x and obtain the residual e_y; then we consider the other direction and fit the regression model of x on y, obtaining another residual. In the Gaussian case I don't think the two directions show any difference. But in the uniform case something differs: we can observe that in one direction the residual is not independent of the hypothesized cause, while in the other it is, and the independence holds only in the ground-truth direction, x causes y; in the other direction it does not. This idea is very important for the identifiability of the causal direction. Let's give another formulation of the same observation. We have a linear function and some noise distribution, Gaussian or uniform, and the hypothesis that x causes y, with the data generated as y = a x + e. Under this assumed data generating process, and under some conditions, if we apply linear regression to the data in both directions, we will observe different things about the regressors and the residuals: x is independent of the first residual, but y is not independent of the other residual.
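This asymmetry can be sketched in a few lines. The code below generates y = x + e with uniform x and e, regresses in both directions, and measures residual dependence with a crude proxy, the correlation between the squared regressor and the squared residual (a real method would use a proper independence test such as HSIC); only the anti-causal direction shows dependence. The constants and the dependence proxy are choices made for this illustration.

```python
import random
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def corr(u, v):
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / sqrt(sum((a - mu) ** 2 for a in u)
                      * sum((b - mv) ** 2 for b in v))

random.seed(0)
n = 20000
x = [random.uniform(-1, 1) for _ in range(n)]
e = [random.uniform(-1, 1) for _ in range(n)]
y = [xi + ei for xi, ei in zip(x, e)]      # ground truth: x causes y

def residual_dependence(cause, effect):
    """OLS-regress effect on cause, return |corr(cause^2, residual^2)|."""
    mc, me = mean(cause), mean(effect)
    beta = (sum((c - mc) * (f - me) for c, f in zip(cause, effect))
            / sum((c - mc) ** 2 for c in cause))
    resid = [f - me - beta * (c - mc) for c, f in zip(cause, effect)]
    return abs(corr([c ** 2 for c in cause], [r ** 2 for r in resid]))

forward = residual_dependence(x, y)    # causal direction: near zero
backward = residual_dependence(y, x)   # anti-causal direction: clearly non-zero
```

In the Gaussian case both directions would give independent residuals, which is exactly why the Gaussian scatter plots earlier looked symmetric and why non-Gaussianity is what makes the direction identifiable.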
And this idea is true, I mean this intuition is actually true and we can use it to add it to identify the causal relationship of two variables and how actually we have to introduce this the functional causal models, the concept of functional causal models here. So this y goes to f x
and the noise this is actually the functional causal model and it assumes that the x which is the cause is independent of the noise.
And then the effect y can be represented by the function of the x caused, independent noise and then if we further but on its own we cannot identify the causal assumption, but under some assumptions it identifiable. The causal direction become
identifiable at once. For example the assumption is additive noise model so this is very common in machine learning, we have additive causal models and you may think this is like Gaussian process or a deterministic non-linear function and e is like some random noise.
And here there is a more general version, the post-nonlinear model, y = g(f(x) + e); of course there are some assumptions on the functions f and g, and x is independent of the noise. If we rethink the whole functional causal model, it is trying to model the data-generating process with two parts. One is f, which tries to capture the data-generating mechanism (think about the additive noise case), and the other is the noise term, which tries to capture the stochastic behavior in the observational data. And then we have x and y. This is quite powerful, because under some assumptions causal directions that lie in the same Markov equivalence class become identifiable. Functional causal models also have a close relation with many machine learning methods. Well, there are 15 minutes left, so I will give a very quick introduction to the last part of this talk. I don't think I have enough time for the score-based methods, but I will go through the last example, which is the connection between functional causal models and normalizing flows. A functional causal model in this case could also be called a structural causal model. We have
this causal graph, and we can write down the corresponding functional causal models, or structural causal models. If you are familiar with Bayesian networks, you can factorize the joint distribution in this way; let's call this the causal order, or the causal factorization. Let's keep this in mind: we have a causal order used to factorize the variables, and we have the corresponding structural causal models. You can see that they correspond to each other, and there is a very important concept here named the independent causal mechanisms. You can read it directly from this formulation.
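As a toy illustration of this correspondence (a made-up three-variable example, not one from the talk), we can write a structural causal model along a causal order and sample from it. Each assignment line is one causal mechanism with its own independent noise, and the induced joint distribution factorizes as p(x1) p(x2|x1) p(x3|x1,x2):

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical SCM following the causal order x1 -> x2 -> x3.
# Each line is one causal mechanism with its own independent noise,
# which is the independent-causal-mechanism idea written as code.
def sample_scm(n):
    e1, e2, e3 = rng.normal(size=(3, n))   # mutually independent noises
    x1 = e1                                # x1 = f1(e1)
    x2 = 2.0 * x1 + e2                     # x2 = f2(x1, e2)
    x3 = x2 - x1 + 0.5 * e3                # x3 = f3(x1, x2, e3)
    return x1, x2, x3

x1, x2, x3 = sample_scm(100_000)
print(np.corrcoef(x1, x2)[0, 1])  # close to 2/sqrt(5) ~ 0.894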
Okay and then we have and then we have this normalizing flow, well I don't think I have enough time, so clearly introduce all the details of nominative flow but I will quickly go through it, like the normalizing flow you can use it to do the density estimation so actually it hypothesized some simple distribution which are easier to sample from,
then you they want to, so the normalizing flow trying to use that distribution to estimate a more complicated distribution here, like zk. And if you do it in a compositional way, like neural network one layer another layer until the final layer. And it's actually using the change of variable formula x so we may think x
it is observational data or variables set Gaussian distribution the hypothesized variables. And we can model we can estimate the observational data
hypothesized variables. And we can model we can estimate the observational data distribution with a single distribution quotient distribution and a Jacobian function of this function f in the compositional way. And this F it is actually a neural network. And a concrete way is we can write it
down we can like write down the F in this way which is auto regressive model way. And what is auto regression model is like, you can, but like yeah you could have it. But let's assume you know what is auto regression model.
You can write it down in this way and then you can write down the Jacobian of set in this way, the determinant of the occupant matrix in this way and you can easily compute the Jacobian matrix.
And you can let's assume we can train we can learn the function f with data.
Yeah now I will highlight what other relations between autoregressive model and causal factorization, right. So here is the causal factorization, if we go back to the autoregressive model it is actually can be used to factorize the joint distribution in a specific order. So we can factorize it in this
specific order. So we can factorize it in this way and this way they are equivalent with each other. And actually the causal factorization is one of the autoregressive factorization
other. And actually the causal factorization is one of the autoregressive factorization and all of them, so let's assume we just use normalizing flow so autoregressive numbers flow, we have this nonlinear functional form, we just need to learn this normalizing functional functional form from the data then we get the model we want to have.
Aand yeah so now since causal factorization is one of the auto regressive factorization that I'm representing the work which is highlighted here, then can we use autoregressive normalizing flow for causal discovery or causal inference. Well actually yes
we can do that, so now we can again we write down we write down the causal structural causal model, xy to s3, and the function are actually the neural network. So it's like it's a vector we have vector z we have vector x and we have a function. And this is a normalizing flow, auto
a function. And this is a normalizing flow, auto regressive flow and the f can be written down in this way and we can learn it with this maximum likelihood function, maximum likelihood.
Yeah and actually we can use it for causal discovery or causal inference, suppose we have the causal graph we just need to learn the parameter of the functions here and then directly apply it as a normal causal universal method which is so given the structural causal model we definitely can do causal inference, so there's no
doubt about it. Then the question is that can we use auto-regressive normalizing flow model for causal discovery?
Well we could think in a simple scenario that we have two variables and which now the causal discovery task is just to distinguish the causal direction. Suppose there's a relationship between each other, then which one is the causal one? Again we go back to the bivariate case and now we have
two causal hypotheses, so we can factorize this joint distribution in this way in the first way, a way, we also factorize it in a second way which is the b way, and then we can write down the corresponding structural causal model which is corresponding to two different auto regressive normalized flow. And then this work
actually tells us that we can use, so the one with causal direction that one has larger likelihood. So you can you can tell that you can
likelihood. So you can you can tell that you can do the model selection with likelihood ratio, which is proposed by this work, and then you just use it for causal discovery. You compare two model
then you get just the better one which is the causal one. Yeah there are definitely some limitations, could be due, to could be solved later in the future work. For example
here how can you make it, so for example how can we apply this idea to a larger graph? For
example, we have like 10 to 20 nodes, then we have to enumerate all the causal graph, then compare each other, then we may need to learn a lot of autoregressive normalizing flow which is problematic. And yeah so this the model selection procedure could be become
could be the future work, definitely. And that's
it and then we can conclude that okay we can use auto regressive normalizing flow for causal discovery.
Loading video analysis...