10 - Causal Discovery from Observational Data
By Brady Neal - Causal Inference
Summary
Topics Covered
- The Faithfulness Assumption's Hidden Vulnerability
- Markov Equivalence: The Ceiling of What Graphs We Can Identify
- The PC Algorithm's Three-Step Blueprint for Causal Discovery
- Non-Gaussian Noise Breaks the Markov Equivalence Barrier
- Non-Linear Additive Noise: Exact Identification Without Linearity
Full Transcript
hi everyone and welcome to the 10th week of introduction to causal inference this week we'll be covering causal discovery from observational data a natural question is what do i mean by
causal discovery and to motivate that consider what we've been doing throughout this course we've been doing causal inference inferring causal effects
given some causal assumptions basically where we assume that we have the causal graph but what if we don't have the causal graph that's what causal discovery
is here for so with causal discovery our goal is to start from just data and then learn the causal graph from that data then if we wanted to infer causal effects from
there we can just apply stuff that we've learned earlier in the course and i've used the word identification throughout the course now we're going to have a new sort of definition for identification we're
going to have structure identification which we'll be talking about in this lecture so whenever i say identify in this lecture it's about if we can identify the causal graph
identification of structure rather than of causal estimands here's the outline of this lecture there's two main categories of methods that we'll
see a bit of the first is independence-based causal discovery in that we try to discover conditional independencies in the data and then use those to infer the causal
graph and then the second one is semi-parametric causal discovery there we make some parametric assumptions about the functional form
of how the data was generated and then leverage those assumptions to do causal discovery as you know we always have a big focus on assumptions in this course so with that let's get right into the
assumptions for independence-based causal discovery the new assumption that we need to make for independence-based causal discovery is what's known as the faithfulness assumption
so recall the markov assumption from week three of the course when we were introducing graphs this assumption says that if x and y
are d-separated in the graph by some conditioning set z then that implies that x and y are conditionally independent in distribution when we're
conditioning on that conditioning set z so this markov assumption tells us how to go from the causal graph on the left-hand side of this implication
to some statement about the distribution of the data on the right-hand side of this implication but to do causal discovery we really want to be able to go the other direction we want to be able to
start with data and then infer the causal graph from there so we need an assumption that has an implication in the other direction and this is the faithfulness assumption
it's just the converse of the markov assumption so using faithfulness we can search for conditional independencies in the data so in this data distribution
p and then once we find those then that will tell us something about the graphical structure it will tell us what's
d-separated from what so we haven't had to make the faithfulness assumption at all throughout this course until now and that's good because the faithfulness
assumption is a bit less attractive than the markov assumption because it's easy to think of counter examples to the faithfulness assumption so here's one example for
a violation of faithfulness we'll just copy and paste the assumption here and the common example of violation of faithfulness is to have a graph where let's focus on a
and d here we have a graph where the association flowing from a to d is cancelled out along these two different paths
so the idea here is that the blue path cancels the red path in terms of the association flowing along them these paths are like opposites in a sense and
their associations cancel so that we get that a is independent of d in the distribution even though a and d aren't d-separated in this graph
okay so if faithfulness were true we would have that a and d are independent which then
implies that a and d are d-separated that's what we would have if faithfulness held but in this example it is violated to make this a bit more
concrete i'm going to add some coefficients to these paths here and what these mean is in a linear example it corresponds to these structural equations
i haven't added any noise terms to the structural equations but you could do the same thing with noise terms it would just be a bit less minimal of an example
okay so this edge from a to b corresponds to this structural equation with alpha here from a to c is this structural equation with gamma and then the d structural equation
corresponds to these two edges coming into d with beta and delta then if we plug in for b and c in the structural equation for d
if we plug in their corresponding structural equations then we get this for d in terms of a this captures the association flowing
from a to d it's alpha times beta plus gamma times delta and so the concrete version of the path canceling that we were graphically
depicting with that blue and red arcs is if alpha times beta equals the negative of gamma times delta
so if this is the case then this factor in front of a is equal to zero meaning that the association that's flowing from a to d is zero so
that's an example of a violation of faithfulness and we're assuming that we don't have these kinds of violations when we're assuming the faithfulness assumption and then the other two assumptions are
much more familiar the first one causal sufficiency is just a way of saying that there's no unobserved confounders of any of the variables in the graph and the second one acyclicity is just
what we've been doing throughout this course the graphs are acyclic so listing all the assumptions together we have the markov assumption still the faithfulness assumption
causal sufficiency and acyclicity that brings us to the end of the assumptions section and to our first question which is why is the markov assumption
plus causal sufficiency and acyclicity not enough for learning causal graphs from data in this next section we'll introduce the very important concept of markov
equivalence and give the main theorem around independence-based causal discovery and what it has to do with markov equivalence so here we have a chain graph
going to the right a chain graph going to the left and a fork graph recall that we first saw these structures back in week 3 and we saw what
conditional independence assumption these graphs imply given the markov assumption that's that x1 is independent of x3 conditional on x2
in all three of these different graphs this is the case and minimality also told us that x1 is dependent on x2 and x2 is dependent on x3
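As a quick sanity check, these independence patterns can be simulated. The sketch below assumes linear-Gaussian mechanisms purely for illustration and uses partial correlation as a crude stand-in for a conditional independence test, contrasting a chain with an immorality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def pcorr(a, b, c):
    """Partial correlation of a and b given c, via linear residualization."""
    ra = a - np.cov(a, c)[0, 1] / np.var(c) * c
    rb = b - np.cov(b, c)[0, 1] / np.var(c) * c
    return np.corrcoef(ra, rb)[0, 1]

# chain: x1 -> x2 -> x3
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
x3 = x2 + rng.normal(size=n)
chain_marginal = np.corrcoef(x1, x3)[0, 1]   # dependent marginally
chain_given_x2 = pcorr(x1, x3, x2)           # ~0 once we condition on x2

# immorality: x1 -> x2 <- x3
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x1 + x3 + rng.normal(size=n)
imm_marginal = np.corrcoef(x1, x3)[0, 1]     # ~0 marginally
imm_given_x2 = pcorr(x1, x3, x2)             # dependent given the collider x2

print(f"chain:      marginal {chain_marginal:.3f}, given x2 {chain_given_x2:.3f}")
print(f"immorality: marginal {imm_marginal:.3f}, given x2 {imm_given_x2:.3f}")
```

The chain shows marginal dependence that disappears when conditioning on x2, while the immorality shows the reverse pattern, which is exactly why immoralities end up in their own Markov equivalence class.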
here faithfulness will be guaranteeing that x1 is dependent on x3 minimality didn't quite guarantee this because there can be
some intransitive cases but the main thing you can focus on here is just what the markov assumption tells us the thing i want to emphasize here with the fact that i put these three
graphs on the same slide and say they all imply the same conditional independence given the markov assumption is that these three graphs are markov equivalent
that's just what we mean by they imply the same conditional independence assumption and some other terminology that we'll be using is that they're all in the same
markov equivalence class that's a class of graphs that all imply the same conditional independencies so that's chains and forks but
immoralities are a bit special if you remember from back in week three here i've just copy and pasted the chains and forks markov equivalence class
on the right here where i have this blue circle around it to indicate that this is a class of graphs a set of graphs unlike
chains and forks immoralities have this interesting property where x1 is independent of x3 conditional on nothing
right so this property does not match the chains and forks one where we have to condition on x2 to get independence otherwise they're dependent and with immoralities
x1 and x3 are dependent if we do condition on x2 right so it's kind of like the reverse of what's going on here in the chains and forks so this means that immoralities are
outside of the markov equivalence class for chains and forks and they're in their own markov equivalence class so unlike chains and forks if you're an
immorality then you're single in your markov equivalence class so if we can somehow figure out the
markov equivalence class when the true graph is the basic immorality here then we can actually identify the full graph so once we know this blue
circle and because there's only one graph in there we've identified the causal graph as the basic immorality whereas if we just figure out the
markov equivalence class for when the true graph is a chain or a fork then all we know is that it's one of these three graphs we don't know which one it is
okay so that's the sort of special information that immoralities tell us but chains and forks also tell us some useful information that we can leverage in causal discovery and
that has to do with skeletons so the skeleton of these three graphs here the two chains and the fork is this graph where the way we get a skeleton is just by
turning directed edges into undirected edges so you can check that for these three graphs if we turn the directed edges into undirected edges they all give this same skeleton
and the information that this skeleton tells us is that x1 is independent of x3 conditional on x2 that's in contrast to a graph like this one
which is the chain where we've added an edge from x1 to x3 in this graph x1 is still dependent
on x3 even after we condition on x2 because there's this direct edge from x1 to x3 okay so this graph here is outside
this markov equivalence class and we can tell that based on the fact that it has a different skeleton than the chains and fork have
so the skeleton of a graph tells us some useful information about the markov equivalence class as well so there are two important qualities of graphs that we can use to
distinguish graphs from each other based on the conditional independencies that they encode and that's immoralities which we saw a
couple slides ago how those are special and the skeleton of a graph which we saw on just the last slide so we've built up intuition for these
two things on the previous two slides and that leads us to this very important theorem which is that two graphs are markov equivalent if and only if they have the same
skeleton and the same immoralities if you think about this a bit and realize you have no intuition for this then i recommend going back to slides
and re-watching that bit on why immoralities are special and then on skeletons and then why do we care about this theorem like what what does it do for us
well it tells us that if we're trying to infer the graph from just conditional independencies that we
find in the data then the best we can do is discover what's called the essential graph which is just the skeleton of the true
graph plus the immoralities so take the skeleton of the real graph and then orient the edges
of the immoralities this sort of partially directed graph is what we can expect to discover from independence-based causal discovery
another word for this is cpdag where cpdag stands for completed partially directed acyclic graph not all of the edges will
be directed and this essential graph or cp dag you can think of as a sort of graphical representation of the markov equivalence class
we'll see an algorithm for discovering this essential graph or this markov equivalence class that we can actually infer from data based on conditional independencies in the data
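The theorem above (same skeleton and same immoralities) translates directly into a small Markov-equivalence checker. A sketch, representing a DAG as a set of directed edge tuples:

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(e) for e in edges}

def immoralities(edges):
    """Triples x -> z <- y where x and y are not adjacent."""
    parents = {}
    for tail, head in edges:
        parents.setdefault(head, set()).add(tail)
    skel = skeleton(edges)
    result = set()
    for z, ps in parents.items():
        for x, y in combinations(sorted(ps), 2):
            if frozenset((x, y)) not in skel:
                result.add((frozenset((x, y)), z))
    return result

def markov_equivalent(g1, g2):
    # two DAGs are Markov equivalent iff they share skeleton and immoralities
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain = {("x1", "x2"), ("x2", "x3")}
fork = {("x2", "x1"), ("x2", "x3")}
immorality = {("x1", "x2"), ("x3", "x2")}

print(markov_equivalent(chain, fork))        # True
print(markov_equivalent(chain, immorality))  # False
```

Running it on the week-3 structures confirms that the chain and fork are Markov equivalent while the immorality sits in its own class, and the same checker can be used on the exercise graphs below.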
but before we do that let's make sure that this markov equivalence stuff is clear and go through a few questions and examples on it the first question is what graphs are markov equivalent to
the basic fork graph the second question is what graphs are markov equivalent to the basic immorality the answers to both of these two
questions are in a few slides ago so go ahead and check those out if you can't quite recall the answers then the next question is what graphs
is the following graph markov equivalent to so basically the way we'll do this is just try to flip edges and see if we get graphs that still have
the same skeleton and immoralities we're going to ensure that we have the same skeleton by just flipping the edges and then we'll make sure we don't introduce any new immoralities
or get rid of any this graph doesn't have any immoralities so we just need to not introduce new immoralities so the first graph this is markov equivalent to is if we flip
the a to b edge note that if we had flipped the c to b edge then we would introduce an immorality c to b and then a to b would be an immorality
so that's a graph that this graph is not markov equivalent to okay so here's one that we saw it's markov equivalent to and then there are two more another is if we take this graph and then we
flip the c to b edge and then finally there's this graph where the d to b edge is pointing to b and then the other two edges are pointing outward
so again for example if we were to take this graph on the right here and then flip the a b edge so that we have it pointing from a to b then we'd be introducing a new immorality so that graph
would not be markov equivalent to these four graphs it would be in a different markov equivalence class the next question is what graphs is the following graph
markov equivalent to this is a bit of a trick question in that this graph is alone in its own markov equivalence class so that's because it's engaged in an
immorality this a to c and then b to c is an immorality which means that this graph will be alone
in its markov equivalence class so to see that let's consider what would happen if we flipped edges if we flipped this ac edge so that we had an edge from c to
a then we would get rid of this immorality and so then it would imply different conditional independencies based on that theorem remember we have to keep the skeleton
and the same immoralities same if we were to flip this cb edge and then finally if we flipped this d
c edge so that it went from d to c then we would have new immoralities right then we'd have this immorality where d
and b would have c as a child together without being connected and then also a and d would have c as a child together without being connected right so there'd
just be immoralities all over the place but importantly we needed to have only this immorality with a and b as the
immoral parents of c we can't have d jumping in there making things more immoral as according to the theorem that would give us a graph that is not markov
equivalent to this one and then the final question is this one where i'm just asking you to give a few graphs that the following graph is markov equivalent
to i won't actually give you the answer for this one so go ahead and pause and take a think here
next up we'll cover the pc algorithm which is an algorithm for learning the essential graph this graphical representation of the markov equivalence class that we can
discover using independence-based causal discovery this is one popular algorithm for independence-based causal discovery but there are others we'll give a quick overview of the pc
algorithm before we cover each of its steps so there's three steps and in this overview we'll have a sort of true graph in mind we're going to try to learn the
essential graph of this true graph using these three steps of the pc algorithm so the first thing or so step zero in the pc algorithm is to
start with a complete undirected graph so a complete graph is one in which there is an edge between every pair of variables a complete graph
essentially encodes that we're not assuming any conditional independencies about the distribution so we're allowing distributions to be any distribution then the pc algorithm is three steps the
first step is to identify the skeleton this is us making use of the information about the skeleton
that we can actually get from the data assuming we have enough data to do these conditional independence tests then the second step is to identify the
immoralities and orient them so we can identify this a to c b to c immorality and then in the third step
we can orient more edges using the fact that we would have discovered any immoralities that existed in phase two so edges that are incident
on colliders in immoralities can sometimes be oriented away from the collider so in steps two and three here
we're using the information in the data that tells us about immoralities in the graph so remember there are two key structures we emphasized there's the
skeleton and then there's immoralities and the skeleton is step one immoralities are steps two and three okay so that's just kind of an overview
of the pc algorithm but now we'll go into each of these steps in more detail in step one for identifying the skeleton what we do is we start with the complete
undirected graph and then we remove edges x y where x and y are independent condition on some conditioning set z where that
conditioning set could be empty could be of size one or two whatever conditioning set you want except that it can't include x or y and
we start with conditioning sets of smaller size and then increase the size of the conditioning set so first start with the empty set conditioning set of size 0
then go to conditioning sets of size 1 do that for all possible variables we can condition on then conditioning sets of size two so that's the number of nodes choose two
then of size three that's number of nodes choose three and so on okay so let's see what this looks like in our example where this is the true graph
we start with this complete graph and then the first thing we do is in the data detect that a is independent of b we find that
when we are running the independence test for the empty set as the conditioning set and we have this independence because we try to look for any path from a to b
and the only one is blocked by this collider c so that means that all paths from a to b are blocked so there's no association flowing between a
and b a and b are independent this means we can remove this edge between a and b in our graph that started off as a complete graph then we move to conditioning sets of
size one and so what we do is for all other pairs of variables other than c here so c is going to be the main
variable that's useful to condition on so for all other pairs x and y so that's like a and d is one pair a and e is another pair
b and d b and e and even a and b for all of the pairs we detect the ones that are independent so that's everything
and then we remove all those edges okay so what's going on there let's start with a and d there's this path there's only one path from a to d
which is through c it's this chain and that path is blocked by conditioning on c so a is independent of d conditional on c
that's why we removed this edge that was here same thing from a to e that's another
chain and that's why we removed this edge from a to e then the same thing for b to d and b to e and i mentioned that we'll also do the
test for a and b but we find that a and b are not independent conditional on c because this is a collider and we're just gonna
store that information for later as that might be useful information for detecting colliders for example but so we've gone up to conditioning sets of size
one and we've already identified the skeleton so we've completed step one essentially and you can keep going up in higher and higher conditioning sets but we don't need to do that for now because we have the skeleton
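Step one can be sketched in code by swapping the finite-sample conditional independence tests for an oracle. Here the oracle is hand-coded from the lecture's true graph A → C ← B, C → D, C → E, as if we had perfect tests on infinite data; the skeleton search itself is the standard PC step 1:

```python
from itertools import combinations

# pairs that are nonadjacent in the assumed true graph a -> c <- b, c -> d, c -> e
NONADJACENT = {frozenset(p) for p in ("AD", "AE", "BD", "BE", "DE")}

def independent(x, y, z):
    """Hand-coded d-separation oracle for the example graph."""
    z = set(z)
    if {x, y} == {"A", "B"}:
        # conditioning on the collider C (or a descendant) opens the path
        return not (z & {"C", "D", "E"})
    # every path between the other nonadjacent pairs is blocked by C
    return frozenset((x, y)) in NONADJACENT and "C" in z

def pc_skeleton(nodes, independent):
    """PC step 1: start complete, remove edge x - y once some conditioning
    set drawn from x's current neighbors makes x and y independent,
    increasing the conditioning set size level by level."""
    adj = {v: set(nodes) - {v} for v in nodes}
    level = 0
    while any(len(adj[x]) - 1 >= level for x in nodes):
        for x in sorted(nodes):
            for y in sorted(adj[x]):
                for z in combinations(sorted(adj[x] - {y}), level):
                    if independent(x, y, z):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        level += 1
    return {frozenset((x, y)) for x in nodes for y in adj[x]}

print(sorted("".join(sorted(e)) for e in pc_skeleton("ABCDE", independent)))
# → ['AC', 'BC', 'CD', 'CE']
```

With the oracle, A - B drops out at level 0 (empty conditioning set) and the remaining spurious edges drop out at level 1 by conditioning on C, reproducing the walkthrough above.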
though you might not know that you have the skeleton if you're doing this in practice okay so we have the skeleton from step one and now in step two we
want to identify the immoralities so in this step what we do is for any paths x z y in our working graph that's at the bottom right here where the following two conditions are
true one we discover that there is no edge between x and y in our previous step so we've removed the direct edge between x and y in step
one graphically that means that in the structure xzy there's no arc here connecting x and y and then two z was not in the
conditioning set that makes x and y conditionally independent another way you could think of this is that when you condition on z x and y are dependent
but just because the algorithm will work by testing for conditional independence that's why it's phrased this way in condition two so if we have these two conditions
satisfied then we know that x z y forms an immorality where x and y are pointing into z okay so
in our example on the bottom right here let's look at condition one there's no edge between x and y and a previous step so that's a and b there's no edge connecting them we removed that edge on the previous
step and then condition two is that z was not in the conditioning set that makes x and y conditionally independent so z here is c and c was not in the conditioning set that
makes a and b independent right so that conditioning set was empty so that condition 2 is satisfied so then we know
that xzy structure which is the a c b here forms an immorality so in step two here we have identified all the immoralities in this graph it's
just one immorality so this allows us to orient some of these edges and finally in step three we're going to try to orient some other edges
the main idea here is that in the last step we would discover all the immoralities and we can use that fact that we should have discovered all the immoralities to orient even more edges
that are not part of immoralities so for example this c d edge here if d were pointing to c
that would form another immorality with a and with b as well so because we didn't detect that immorality back in step two where we detected immoralities
that means that the edge actually has to be pointing from c to d so the way that we do this more formally is that any edge z
y part of a partially directed path of the form x points to z and then undirected edge to y
where there is no edge connecting x and y so no like arc like that any structure like this the
edge that's undirected here can be oriented so that it's directed from z to y right so in that example that we just discussed with d let's just consider this path a to c to
d that matches this x to z then undirected to y and there's no edge connecting x and y there's no edge connecting
x and y that is a and d here so that means that this edge z to y right in this structure here which is c
and d here this edge can be oriented like that and then the edge c e can be oriented for the same reasons
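This orientation rule (the first of what are commonly called Meek's rules) can also be sketched in code. A minimal version, seeded with the immorality edges from step two and the remaining undirected edges of the example:

```python
def adjacent(a, b, directed, undirected):
    return (frozenset((a, b)) in undirected
            or (a, b) in directed or (b, a) in directed)

def orient_away_from_colliders(directed, undirected):
    """Orient z - y as z -> y whenever some x -> z exists with x and y
    non-adjacent: otherwise x -> z <- y would be an immorality that
    step 2 would already have found."""
    directed, undirected = set(directed), set(undirected)
    changed = True
    while changed:
        changed = False
        for edge in sorted(undirected, key=sorted):
            a, b = sorted(edge)
            for z, y in ((a, b), (b, a)):
                if any(h == z and t != y and not adjacent(t, y, directed, undirected)
                       for t, h in directed):
                    undirected.discard(edge)
                    directed.add((z, y))
                    changed = True
                    break
    return directed, undirected

# lecture example after step 2: immorality a -> c <- b, undirected c - d, c - e
directed = {("A", "C"), ("B", "C")}
undirected = {frozenset(("C", "D")), frozenset(("C", "E"))}
d, u = orient_away_from_colliders(directed, undirected)
print(sorted(d), sorted(map(sorted, u)))
```

On this input the rule orients C → D and C → E and leaves nothing undirected, matching the fully identified graph from the walkthrough.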
so that's step three step three here for this specific example allowed us to actually completely identify the causal graph right so by that i mean that we oriented
all of the edges we didn't leave any edges undirected which means we discovered the markov equivalence class of this graph
which happens to be a singleton class so cool in this example we have identified the graph that won't be the case in general we won't always be able to identify
the exact graph from conditional independencies in the data rather we as we say we'll be able to identify the markov equivalence class and that markov equivalence class will
often have several graphs in it represented by some of the edges being undirected in the essential graph so that's the pc algorithm and what if we don't like some of the assumptions
that we had to make for this algorithm so for example say we don't want to assume causal sufficiency we want to allow for unobserved confounders well there's algorithms that try to drop that assumption so for
example there's the fast causal inference fci algorithm or if you don't like acyclicity there's this ccd algorithm as an example and if you want
to drop both of those assumptions there's this satisfiability based causal discovery so for example you can check out these two papers
where sat or satisfiability if you're familiar with it what they do is they reduce causal discovery to a satisfiability problem and then they just give those
constraints to a sat solver i won't really get into that here but basically you're able to drop the causal sufficiency and acyclicity assumptions
when you're using those methods and then one final thing that we'll discuss before we complete this section is that conditional independence testing is hard so these causal discovery algorithms
that are independence based where we're taking data and then trying to discover conditional independencies in the data to identify the graphs to discover
graphs these algorithms rely on having accurate conditional independence tests and they can definitely have that if we have infinite data for example
then conditional independence testing is relatively easy however in general we don't have infinite data and it can be quite a hard problem when we have finite data
as conditional independence tests can sometimes require a lot of data to get accurate test results you can check out this paper if you want to learn more about that that concludes the section on the pc
algorithm and brings us to this question which is what are the essential graphs of the following graphs on these four graphs here so the first one on the left here its
essential graph is when we just undirect all of the edges then the next one this basic immorality as we discussed it is its own
essential graph it's the only graph in its markov equivalence class then the essential graph for this next graph is where all of the edges are undirected and finally
the essential graph for this last one is itself right we can't change the orientation of any of these edges without changing the immoralities and then the second
question which i won't give an answer to is just walk through the steps of pc to get these graphs then the next question is what is the essential graph for this graph you don't have to walk
through the steps of pc here but go ahead and try to figure out what the essential graph is for this one
then a natural question is can we do better than identifying just the markov equivalence class so we saw that with the faithfulness assumption
we can identify the essential graph we can identify the markov equivalence class of the true graph but say there's multiple graphs in that markov
equivalence class can we narrow the class down more so ideally we would get to just a single graph can we
do that for example if we make more assumptions well in the setting where we have multinomial distributions
or we have linear gaussian structural equations this is the best we can do we can only identify a graph up to its markov
equivalence class we can't get anything smaller than that okay but this seems a bit specific like why are we talking about linear gaussian
structural equations what if we care about non-gaussian structural equations where the noise is non-gaussian or
non-linear structural equations and the spoiler is that we can get identifiability of the exact graph in those cases so that's what we'll see in this semi-parametric
causal discovery section it's semi-parametric in the sense that we'll be making assumptions about the parametric form of these structural equations right so the sort of functional form of how the
data was generated we saw some examples of parametric assumptions for identifiability of causal estimates not of graph structures but of causal estimands
in the week on instrumental variables for example so to motivate this section let's quickly discuss some issues with independence-based causal discovery that we don't
that we won't have in semi-parametric causal discovery the first is that independence-based causal discovery requires the faithfulness assumption which we saw is maybe sometimes not that
attractive of an assumption it's kind of easy to think up counter examples to it then the next big issue which we already touched on is that it can require large samples for
accurate conditional independence tests and finally with independence-based causal discovery we can only identify up to the markov equivalence class of the true graph
whereas with semi-parametric causal discovery we'll often be able to pick out the exact graph that is the true causal graph assuming that our semiparametric
assumptions are correct in the first part of this section on semi-parametric causal discovery we're going to point out that you can't
identify the true causal graph without making any parametric assumptions so by identifiability here in this whole lecture
we're referring to identifying the graph not some causal estimate and if we want to identify that graph not just the markov equivalence class of that graph if we want to identify the
specific graph we have to make parametric assumptions so we'll see why that's the case now first we'll look at this through the perspective of markov equivalence so if
we have infinite data we have access to p of x comma y here we have two variables x and y and say we want to distinguish just
these two graphs is our data generated by a graph where y is generated as a function of x
or is our data generated by a graph where x is generated as a function of y right which causal graph is it can we infer that
from this infinite data that we have well if we're looking at conditional independencies in the data no right the essential graph which depicts the markov equivalence class here
doesn't have the direction of the edge in it right both of these graphs correspond to where we're not making any conditional independence assumptions
so they could correspond to any distribution and here's the markov equivalence class for that that's the perspective on if we can
identify which of these two graphs generated this data with two variables from markov equivalence but maybe if we look at it from a
different perspective from scm structural causal models and their structural equations then maybe we can identify which of those two causal graphs generated our data
well unfortunately it turns out that's not the case so there's this proposition that for every joint distribution p of x comma y
on two real valued random variables there is an scm in either direction that generates data consistent with that distribution
in other words an scm where x is generated from y could have generated the data or an scm where y is generated from x could have generated the data so we don't know which variable was
generated from which we can't distinguish between those two causal graphs mathematically this is written as follows so there exists a function f sub y
such that y equals f sub y where you feed x and some noise variable
u sub y and where x and u sub y are independent so there exists a function in that direction mapping x to y and similarly there exists a function
going the other direction f sub x and similarly we have independence here so there exists these functions that are
consistent with the data distribution we can generate data that is consistent with p of x comma y
using either this scm or this scm in the opposite direction so that first one corresponds to this causal graph
x to y and then the second one corresponds to this causal graph where there's an arrow from y to x either causal graph could explain
our data so here we haven't made any restrictions on our scms the scms can be whatever and when that's the case we are not able to
identify the true causal graph we can't distinguish between these two causal graphs to know which one is the true one and in order to do that we're going to need to make some assumptions about the
parametric form the first class of assumptions we'll make is that our structural equations are linear with non-gaussian noise
so recall that we cannot hope to identify the graph more precisely than its markov equivalence class in the linear gaussian noise setting
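to see concretely why the linear gaussian setting is stuck at the markov equivalence class, here's a small simulation sketch of my own (not code from the lecture): with linear-gaussian data the regression residuals look independent of the input in both directions, so there's no asymmetry to exploit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# linear-gaussian SCM: y := x + u, with gaussian noise independent of x
x = rng.normal(0, 1, n)
y = x + rng.normal(0, 1, n)

def dependence(target, pred):
    """fit target ~ pred by least squares, then check whether the spread
    of the residuals changes with |pred| (a crude independence check)."""
    a, b = np.polyfit(pred, target, 1)
    resid = target - (a * pred + b)
    return np.corrcoef(resid**2, np.abs(pred))[0, 1]

# both directions look equally "clean" -- values near zero in each case
print(f"causal direction (y on x):      {dependence(y, x):+.3f}")
print(f"anti-causal direction (x on y): {dependence(x, y):+.3f}")
```

for jointly gaussian variables, uncorrelated residuals are actually independent of the input, so this check comes up empty in both directions and we can't tell the two graphs apart.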
but that setting seemed oddly specific what if the noise is non-gaussian for example so by the linear non-gaussian noise assumption
i mean that all structural equations causal mechanisms that generate the data are of the following form where
y is generated as a linear function of x so linear function plus some noise term u where x
and u are independent and importantly u is distributed as some non-gaussian distribution so if u were gaussian here we have this impossibility result
that we can only do as well as the markov equivalence class but if u is non-gaussian we're going to be able to exactly identify the graph
that's what this theorem from shimizu et al 2006 states so it's saying that in the linear non-gaussian setting if the true scm
is this one where y is generated from x then there does not exist an scm in the reverse direction where x is generated from y
where y and this u tilde are independent so there doesn't exist such an scm in the reverse direction that can generate the data consistent
with the observational distribution of course there exists an scm in the true direction that can generate data consistent with the observational distribution but in the linear non-gaussian setting
there doesn't exist an scm in the reverse direction for generating that observational distribution because there doesn't exist an scm in
the reverse direction we are able to identify the direction of this edge we won't cover the proof in this lecture but you can see the proof in the course
book okay so that's the result but what's something that you can use to help remember this to give you some intuition for why we
have this result it turns out it's going to have a lot to do with these independencies here so the input variable being independent from the
noise and then here we're going to have that the input variable is not going to be independent from the noise in the reverse direction
okay so that's what we'll see here so say that we have some data and we're going to regress y on x so we're going to fit a linear
regression to these data points and get this line and we'll say that this is the correct causal direction in other words y is generated from x when we do the linear regression in this causal
direction we get this nice line that seems to fit the data pretty well but when you do the regression in the anticausal direction when the noise term is non-gaussian you
don't get such a nice line you actually get something that looks like this okay so this line doesn't quite look right it doesn't look like this line where we do the regression in
the causal direction but in the anti-causal direction we get this line and this happens only because the noise term is
non-gaussian if the noise term were gaussian we wouldn't have this okay so something seems off here but to get a bit more clarity on this we have to look at the
residuals so by looking at the residuals that's where we subtract our prediction here so this is our
prediction if we were to fit this function using linear regression we'd subtract our prediction from the true value right so it's just like moving this to the other side then we isolate the
residual same thing in the causal direction we would be fitting this function and subtract that to the other side we'd be trying to estimate this residual so we're going to look at the
residual plot to kind of get plots of this u and this tilde u here so this is what our residual plot looks like when we do the linear regression in the
causal direction regressing y on x in this plot we have the residuals on the y axis which is like an estimate of u
and we have x on the x axis so we're kind of trying to see if x is independent of u and it looks pretty independent
of u here but then when we look at the regression in the anti-causal direction now here we're regressing x on y and now on the
y-axis we have this u-tilde and on the x-axis we have y now because we want to be matching this sort of input variable
now we see that y is dependent on u tilde okay so this first plot looks pretty independent right so the distribution
of the residual u here doesn't seem to be changing as we condition on different values of x so they're independent but then here
this distribution of u so it's pretty thin here when we condition on values of y over here it's a bit wider as we condition on values of y over here the distribution
is getting wider okay and then it thins out again so clearly the support is changing as we change y
so u tilde absolutely depends on y and this is the anti-causal scm right we are trying to fit an scm in the
anti-causal direction but we didn't get one right because we needed y to be independent of the noise term in the anti-causal direction but it's clearly not
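here's a small simulation of that intuition (a sketch of my own, not code from the lecture): with uniform, hence non-gaussian, noise, the residuals from the anti-causal regression are visibly dependent on the input, while the causal direction's residuals are not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# true SCM: y := x + u, with uniform (non-gaussian) noise independent of x
x = rng.uniform(-1, 1, n)
u = rng.uniform(-1, 1, n)
y = x + u

def dependence(target, pred):
    """fit target ~ pred by least squares, then check whether the spread
    of the residuals changes with |pred| (a crude independence check)."""
    a, b = np.polyfit(pred, target, 1)
    resid = target - (a * pred + b)
    return np.corrcoef(resid**2, np.abs(pred))[0, 1]

# causal direction: residual spread doesn't change with x (value near 0)
# anti-causal direction: residual spread shrinks as |y| grows (clearly negative)
print(f"causal direction (y on x):      {dependence(y, x):+.3f}")
print(f"anti-causal direction (x on y): {dependence(x, y):+.3f}")
```

note that a plain correlation between residuals and predictor is zero by construction of least squares, which is why the check looks at the squared residuals against the magnitude of the predictor instead; the asymmetry it finds only appears because the noise is non-gaussian.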
you can check out the proof in the book if you want to see kind of the technical stuff going on for how to prove this but hopefully this gives you some good intuition and helps you
remember this there are several extensions to this linear non-gaussian identifiability result and method so what we're looking at was just for
two variables x and y but you can extend it to multiple variables so this multivariate setting and if you want to drop some assumptions you can drop the causal sufficiency assumption that's what they work on in
this paper or drop the acyclicity assumption that's what they work on in this paper in the final section we'll look at
a non-linear semiparametric assumption where we're assuming additive noise so let's see what that is again recall that we cannot hope to identify the graph more precisely than
the markov equivalence class in the linear gaussian noise setting but what if the structural equations are non-linear so here's this nonlinear additive noise
assumption for all variables for all i the structural equation that generates x sub i is a nonlinear function
f sub i so f sub i is a nonlinear function of the parents of the ith variable plus some noise term here okay so that's the
nonlinear additive noise assumption so importantly this noise is additive under this assumption and under the markov assumption the causal sufficiency assumption acyclicity and then this nonlinear
additive noise assumption we just mentioned and then a technical condition from the paper where this comes from which we won't get into under these assumptions we can identify the causal
graph right so we don't just get the markov equivalence class we get the precise graph a single graph but it's only the correct graph
under this nonlinear additive noise assumption and the technical condition from their paper cool so this seems a bit more flexible in the sense that we don't have to assume
that the functional form is linear it can be nonlinear now but say you don't like that we're assuming this additive noise well we can get that to be a bit more
general so you don't like additive noise then just kind of chuck in another function on the outside there right so it's post
nonlinear in the sense that this g comes after we've added u here but it's not quite additive noise in the sense that we also have
this function of both of these after we've added it okay so in this post non-linear setting where you just have this function g this function f then you can get an
identifiability result there as well if you want to learn even more about causal discovery from observational data here are three nice sources
so the first one is a review article from frederick eberhardt the second one is another review article from some pretty big
authors so all three of these guys do a lot of research on this topic i think this guy is the p
in the pc algorithm i think the p comes from this guy so his first name starts with a p peter and then this guy is the c
in the pc algorithm clark and then this guy just does a ton of papers on this topic so yeah so they all wrote a review paper together
and then the last resource on this list is a book called elements of causal inference where it's basically the only book on causal discovery from observational data
and we're lucky enough to have the first author on this book jonas peters give a guest talk in this course so the next video that gets uploaded to the channel
will be his guest talk which will be in the full lectures playlist if you're looking at the playlist of smaller videos it's not in that playlist it's in the full lectures one
if you want to get notified for that lecture or any stuff coming up in the future any lectures any other videos coming up in the future then go ahead and hit the subscribe button below
and same with the bell icon next to it if you have any good jokes about immoralities go ahead and leave those in the comments below thanks for watching and i hope to see you in the next one
not that i can actually see you through your webcam or anything i don't think the nsa cares enough about causal inference quite yet for that to be the case but i don't know
what time you're watching this lecture in the future