10 - Causal Discovery from Observational Data
By Brady Neal - Causal Inference
Summary
Topics Covered
- The Faithfulness Assumption's Hidden Vulnerability
- Markov Equivalence: The Ceiling of What Graphs We Can Identify
- The PC Algorithm's Three-Step Blueprint for Causal Discovery
- Non-Gaussian Noise Breaks the Markov Equivalence Barrier
- Non-Linear Additive Noise: Exact Identification Without Linearity
Full Transcript
hi everyone and welcome to the 10th week of introduction to causal inference this week we'll be covering causal discovery from observational data a natural question is what do i mean by
causal discovery and to motivate that consider what we've been doing throughout this course we've been doing causal inference inferring causal effects
given some causal assumptions basically where we assume that we have the causal graph but what if we don't have the causal graph that's what causal discovery
is here for so with causal discovery our goal is to start from just data and then learn the causal graph from that data then if we wanted to infer causal effects from
there we can just apply stuff that we've learned earlier in the course and i've used the word identification throughout the course now we're going to have a new sort of definition for identification we're
going to have structure identification which we'll be talking about in this lecture so whenever i say identify in this lecture it's about if we can identify the causal graph
identification of structure rather than of causal estimands here's the outline of this lecture there's two main categories of methods that we'll
see a bit of the first is independence-based causal discovery in that we try to discover conditional independencies in the data and then use those to infer the causal
graph and then the second one is semi-parametric causal discovery there we make some parametric assumptions about the functional form
of how the data was generated and then leverage those assumptions to do causal discovery as you know we always have a big focus on assumptions in this course so with that let's get right into the
assumptions for independence-based causal discovery the new assumption that we need to make for independence-based causal discovery is what's known as the faithfulness assumption
so recall the markov assumption from week three of the course when we were introducing graphs this assumption says that if x and y
are d-separated in the graph by some conditioning set z then that implies that x and y are conditionally independent in distribution when we're
conditioning on that conditioning set z so this markov assumption tells us how to go from the causal graph on the left-hand side of this implication
to some statement about the distribution of the data on the right-hand side of this implication but to do causal discovery we really want to be able to go the other direction we want to be able to
start with data and then infer the causal graph from there so we need an assumption that has an implication in the other direction and this is the faithfulness assumption
it's just the converse of the markov assumption so using faithfulness we can search for conditional independencies in the data so in this data distribution
p and then once we find those then that will tell us something about the graphical structure it will tell us what's
d-separated from what so we haven't had to make the faithfulness assumption at all throughout this course until now and that's good because the faithfulness
assumption is a bit less attractive than the markov assumption because it's easy to think of counter examples to the faithfulness assumption so here's one example for
a violation of faithfulness we'll just copy and paste the assumption here and the common example of violation of faithfulness is to have a graph where let's focus on a
and d here we have a graph where the association flowing from a to d is cancelled out along these two different paths
so the idea here is that the blue path cancels the red path in terms of the association flowing along them these paths are like opposites in a sense and
their associations cancel so that we get that a is independent of d in the distribution even though a and d aren't d-separated in this graph
okay so if faithfulness were true we would have that a and d are independent which then
implies that a and d are d-separated that's what we would have if faithfulness held but in this example it is violated to make this a bit more
concrete i'm going to add some coefficients to these paths here and what these mean is in a linear example it corresponds to these structural equations
i haven't added any noise terms to the structural equations but you could do the same thing with noise terms it would just be a bit less minimal of an example
okay so this edge from a to b corresponds to this structural equation with alpha here from a to c is this structural equation with gamma and then the d structural equation
corresponds to these two edges coming into d with beta and delta then if we plug in for b and c in the structural equation for d
if we plug in their corresponding structural equations then we get this for d in terms of a this captures the association flowing
from a to d it's alpha times beta plus gamma times delta and so the concrete version of the path canceling that we were graphically
depicting with that blue and red arcs is if alpha times beta equals the negative of gamma times delta
so if this is the case then this factor in front of a is equal to zero meaning that the association that's flowing from a to d is zero so
that's an example of a violation of faithfulness and we're assuming that we don't have these kinds of violations when we're assuming the faithfulness assumption and then the other two assumptions are
much more familiar the first one causal sufficiency is just a way of saying that there's no unobserved confounders of any of the variables in the graph and the second one acyclicity is just
what we've been doing throughout this course the graphs are acyclic so listing all the assumptions together we have the markov assumption still the faithfulness assumption
causal sufficiency and acyclicity that brings us to the end of the assumptions section and to our first question which is why is the markov assumption
plus causal sufficiency and acyclicity not enough for learning causal graphs from data in this next section we'll introduce the very important concept of markov
equivalence and give the main theorem around independence-based causal discovery and what it has to do with markov equivalence so here we have a chain graph
going to the right a chain graph going to the left and a fork graph recall that we first saw these structures back in week 3 and we saw what
conditional independence assumption these graphs imply given the markov assumption that's that x1 is independent of x3 conditional on x2
in all three of these different graphs this is the case and minimality also told us that x1 is dependent on x2 and x2 is dependent on x3
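As a quick sanity check, these independence patterns can be simulated. The sketch below assumes linear-Gaussian mechanisms purely for illustration and uses partial correlation as a crude stand-in for a conditional independence test, contrasting a chain with an immorality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def pcorr(a, b, c):
    """Partial correlation of a and b given c, via linear residualization."""
    ra = a - np.cov(a, c)[0, 1] / np.var(c) * c
    rb = b - np.cov(b, c)[0, 1] / np.var(c) * c
    return np.corrcoef(ra, rb)[0, 1]

# chain: x1 -> x2 -> x3
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
x3 = x2 + rng.normal(size=n)
chain_marginal = np.corrcoef(x1, x3)[0, 1]   # dependent marginally
chain_given_x2 = pcorr(x1, x3, x2)           # ~0 once we condition on x2

# immorality: x1 -> x2 <- x3
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x1 + x3 + rng.normal(size=n)
imm_marginal = np.corrcoef(x1, x3)[0, 1]     # ~0 marginally
imm_given_x2 = pcorr(x1, x3, x2)             # dependent given the collider x2

print(f"chain:      marginal {chain_marginal:.3f}, given x2 {chain_given_x2:.3f}")
print(f"immorality: marginal {imm_marginal:.3f}, given x2 {imm_given_x2:.3f}")
```

The chain shows marginal dependence that disappears when conditioning on x2, while the immorality shows the reverse pattern, which is exactly why immoralities end up in their own Markov equivalence class.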
here faithfulness will be guaranteeing that x1 is dependent on x3 minimality didn't quite guarantee this because there can be
some intransitive cases but the main thing you can focus on here is just what the markov assumption tells us the thing i want to emphasize here with the fact that i put these three
graphs on the same slide and say they all imply the same conditional independence given the markov assumption is that these three graphs are markov equivalent
that's just what we mean by they imply the same conditional independence assumption and some other terminology that we'll be using is that they're all in the same
markov equivalence class that's a class of graphs that all imply the same conditional independencies so that's chains and forks but
immoralities are a bit special if you remember from back in week three here i've just copy and pasted the chains and forks markov equivalence class
on the right here where i have this blue circle around it to indicate that this is a class of graphs a set of graphs unlike
chains and forks immoralities have this interesting property where x1 is independent of x3 conditional on nothing
right so this property does not match the chains and forks one where we have to condition on x2 to get independence otherwise they're dependent and with immoralities
x1 and x3 are dependent if we do condition on x2 right so it's kind of like the reverse of what's going on here in the chains and forks so this means that immoralities are
outside of the markov equivalence class for chains and forks and they're in their own markov equivalence class so unlike chains and forks if you're an
immorality then you're single in your markov equivalence class so if we can somehow figure out the
markov equivalence class when the true graph is the basic immorality here then we can actually identify the full graph so once we know this blue
circle and because there's only one graph in there we've identified the causal graph as the basic immorality whereas if we just figure out the
markov equivalence class for when the true graph is a chain or a fork then all we know is that it's one of these three graphs we don't know which one it is
okay so that's the sort of special information that immoralities tell us but chains and forks also tell us some useful information that we can leverage in causal discovery and
that has to do with skeletons so the skeleton of these three graphs here the two chains and the fork is this graph where the way we get a skeleton is just by
turning directed edges into undirected edges so you can check that for these three graphs if we turn the directed edges into undirected edges they all give this same skeleton
and the information that this skeleton tells us is that x1 is independent of x3 conditional on x2 that's in contrast to a graph like this one
which is the chain where we've added an edge from x1 to x3 in this graph x1 is still dependent
on x3 even after we condition on x2 because there's this direct edge from x1 to x3 okay so this graph here is outside
this markov equivalence class and we can tell that based on the fact that it has a different skeleton than the chains and fork have
so the skeleton of a graph tells us some useful information about the markov equivalence class as well so there are two important qualities of graphs that we can use to
distinguish graphs from each other based on the conditional independencies that they encode and that's immoralities which we saw a
couple slides ago how those are special and the skeleton of a graph which we saw on just the last slide so we've built up intuition for these
two things on the previous two slides and that leads us to this very important theorem which is that two graphs are markov equivalent if and only if they have the same
skeleton and the same immoralities if you think about this a bit and realize you have no intuition for this then i recommend going back to slides
and re-watching that bit on why immoralities are special and then on skeletons and then why do we care about this theorem like what what does it do for us
well it tells us that if we're trying to infer the graph from just conditional independencies that we
find in the data then the best we can do is discover what's called the essential graph which is just the skeleton of the true
graph plus the immoralities so take the skeleton of the real graph and then orient the edges
of the immoralities this sort of partially directed graph is what we can expect to discover from independence-based causal discovery
another word for this is cpdag where cpdag stands for completed partially directed acyclic graph not all of the edges will
be directed and this essential graph or cp dag you can think of as a sort of graphical representation of the markov equivalence class
we'll see an algorithm for discovering this essential graph or this markov equivalence class that we can actually infer from data based on conditional independencies in the data
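The theorem above (same skeleton and same immoralities) translates directly into a small Markov-equivalence checker. A sketch, representing a DAG as a set of directed edge tuples:

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(e) for e in edges}

def immoralities(edges):
    """Triples x -> z <- y where x and y are not adjacent."""
    parents = {}
    for tail, head in edges:
        parents.setdefault(head, set()).add(tail)
    skel = skeleton(edges)
    result = set()
    for z, ps in parents.items():
        for x, y in combinations(sorted(ps), 2):
            if frozenset((x, y)) not in skel:
                result.add((frozenset((x, y)), z))
    return result

def markov_equivalent(g1, g2):
    # two DAGs are Markov equivalent iff they share skeleton and immoralities
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain = {("x1", "x2"), ("x2", "x3")}
fork = {("x2", "x1"), ("x2", "x3")}
immorality = {("x1", "x2"), ("x3", "x2")}

print(markov_equivalent(chain, fork))        # True
print(markov_equivalent(chain, immorality))  # False
```

Running it on the week-3 structures confirms that the chain and fork are Markov equivalent while the immorality sits in its own class, and the same checker can be used on the exercise graphs below.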
but before we do that let's make sure that this markov equivalence stuff is clear and go through a few questions and examples on it the first question is what graphs are markov equivalent to
the basic fork graph the second question is what graphs are markov equivalent to the basic immorality the answers to both of these two
questions are in a few slides ago so go ahead and check those out if you can't quite recall the answers then the next question is what graphs
is the following graph markov equivalent to so basically the way we'll do this is just try to flip edges and see if we get graphs that still have
the same skeleton and immoralities we're going to ensure that we have the same skeleton by just flipping the edges and then we'll make sure we don't introduce any new immoralities
or get rid of any this graph doesn't have any immoralities so we just need to not introduce new immoralities so the first graph this is markov equivalent to is if we flip
the a to b edge note that if we had flipped the c to b edge then we would introduce an immorality c to b and then a to b would be an immorality
so that's a graph that this graph is not markov equivalent to okay so here's one that we saw it's markov equivalent to and then there are two more another is if we take this graph and then we
flip the c to b edge and then finally there's this graph where the d to b edge is pointing to b and then the other two edges are pointing outward
so again for example if we were to take this graph on the right here and then flip the a b edge so that we have it pointing from a to b then we'd be introducing a new immorality so that graph
would not be markov equivalent to these four graphs it would be in a different markov equivalence class the next question is what graphs is the following graph
markov equivalent to this is a bit of a trick question in that this graph is alone in its own markov equivalence class so that's because it's engaged in an
immorality this a to c and then b to c is an immorality which means that this graph will be alone
in its markov equivalence class so to see that let's consider what would happen if we flipped edges if we flipped this ac edge so that we had an edge from c to
a then we would get rid of this immorality and so then it would imply different conditional independencies based on that theorem remember we have to keep the skeleton
and the same immoralities same if we were to flip this cb edge and then finally if we flipped this d
c edge so that it went from d to c then we would have new immoralities right then we'd have this immorality where d
and b would have c as a child together without being connected and then also a and d would have c as a child together without being connected right so there'd
just be immoralities all over the place but importantly we needed to have only this immorality with a and b as the
immoral parents of c we can't have d jumping in there making things more immoral as according to the theorem that would give us a graph that is not markov
equivalent to this one and then the final question is this one where i'm just asking you to give a few graphs that the following graph is markov equivalent
to i won't actually give you the answer for this one so go ahead and pause and take a think here
next up we'll cover the pc algorithm which is an algorithm for learning the essential graph this graphical representation of the markov equivalence class that we can
discover using independence-based causal discovery this is one popular algorithm for independence-based causal discovery but there are others we'll give a quick overview of the pc
algorithm before we cover each of its steps so there's three steps and in this overview we'll have a sort of true graph in mind we're going to try to learn the
essential graph of this true graph using these three steps of the pc algorithm so the first thing or so step zero in the pc algorithm is to
start with a complete undirected graph so a complete graph is one in which there is an edge between every pair of variables a complete graph
essentially encodes that we're not assuming any conditional independencies about the distribution so we're allowing distributions to be any distribution then the pc algorithm is three steps the
first step is to identify the skeleton this is us making use of the information about the skeleton
that we can actually get from the data assuming we have enough data to do these conditional independence tests then the second step is to identify the
immoralities and orient them so we can identify this a to c b to c immorality and then in the third step
we can orient more edges using the fact that we would have discovered any immoralities that existed in phase two so edges that are incident
on colliders in immoralities can sometimes be oriented away from the collider so in steps two and three here
we're using the information in the data that tells us about immoralities in the graph so remember there are two key structures we emphasized there's the
skeleton and then there's immoralities and the skeleton is step one immoralities are steps two and three okay so that's just kind of an overview
of the pc algorithm but now we'll go into each of these steps in more detail in step one for identifying the skeleton what we do is we start with the complete
undirected graph and then we remove edges x y where x and y are independent condition on some conditioning set z where that
conditioning set could be empty could be of size one or two whatever conditioning set you want except that it can't include x or y and
we start with conditioning sets of smaller size and then increase the size of the conditioning set so first start with the empty set conditioning set of size 0
then go to conditioning sets of size 1 do that for all possible variables we can condition on then conditioning sets of size two so that's the number of nodes choose two
then of size three that's number of nodes choose three and so on okay so let's see what this looks like in our example where this is the true graph
we start with this complete graph and then the first thing we do is in the data detect that a is independent of b we find that
when we are running the independence test for the empty set as the conditioning set and we have this independence because we try to look for any path from a to b
and the only one is blocked by this collider c so that means that all paths from a to b are blocked so there's no association flowing between a
and b a and b are independent this means we can remove this edge between a and b in our graph that started off as a complete graph then we move to conditioning sets of
size one and so what we do is for all other pairs of variables other than c here so c is going to be the main
variable that's useful to condition on so for all other pairs x and y so that's like a and d is one pair a and e is another pair
b and d b and e and even a and b for all of the pairs we detect the ones that are independent so that's everything
and then we remove all those edges okay so what's going on there let's start with a and d there's this path there's only one path from a to d
which is through c it's this chain and that path is blocked by conditioning on c so a is independent of d conditional on c
that's why we removed this edge that was here same thing from a to e that's another
chain and that's why we removed this edge from a to e then the same thing for b to d and b to e and i mentioned that we'll also do the
test for a and b but we find that a and b are not independent conditional on c because this is a collider and we're just gonna
store that information for later as that might be useful information for detecting colliders for example but so we've gone up to conditioning sets of size
one and we've already identified the skeleton so we've completed step one essentially and you can keep going up in higher and higher conditioning sets but we don't need to do that for now because we have the skeleton
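Step one can be sketched in code by swapping the finite-sample conditional independence tests for an oracle. Here the oracle is hand-coded from the lecture's true graph A → C ← B, C → D, C → E, as if we had perfect tests on infinite data; the skeleton search itself is the standard PC step 1:

```python
from itertools import combinations

# pairs that are nonadjacent in the assumed true graph a -> c <- b, c -> d, c -> e
NONADJACENT = {frozenset(p) for p in ("AD", "AE", "BD", "BE", "DE")}

def independent(x, y, z):
    """Hand-coded d-separation oracle for the example graph."""
    z = set(z)
    if {x, y} == {"A", "B"}:
        # conditioning on the collider C (or a descendant) opens the path
        return not (z & {"C", "D", "E"})
    # every path between the other nonadjacent pairs is blocked by C
    return frozenset((x, y)) in NONADJACENT and "C" in z

def pc_skeleton(nodes, independent):
    """PC step 1: start complete, remove edge x - y once some conditioning
    set drawn from x's current neighbors makes x and y independent,
    increasing the conditioning set size level by level."""
    adj = {v: set(nodes) - {v} for v in nodes}
    level = 0
    while any(len(adj[x]) - 1 >= level for x in nodes):
        for x in sorted(nodes):
            for y in sorted(adj[x]):
                for z in combinations(sorted(adj[x] - {y}), level):
                    if independent(x, y, z):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        level += 1
    return {frozenset((x, y)) for x in nodes for y in adj[x]}

print(sorted("".join(sorted(e)) for e in pc_skeleton("ABCDE", independent)))
# → ['AC', 'BC', 'CD', 'CE']
```

With the oracle, A - B drops out at level 0 (empty conditioning set) and the remaining spurious edges drop out at level 1 by conditioning on C, reproducing the walkthrough above.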
though you might not know that you have the skeleton if you're doing this in practice okay so we have the skeleton from step one and now in step two we
want to identify the immoralities so in this step what we do is for any paths x z y in our working graph that's at the bottom right here where the following two conditions are
true one we discover that there is no edge between x and y in our previous step so we've removed the direct edge between x and y in step
one graphically that means that in the structure xzy there's no arc here connecting x and y and then two z was not in the
conditioning set that makes x and y conditionally independent another way you could think of this is that when you condition on z x and y are dependent
but just because the algorithm will work by testing for conditional independence that's why it's phrased this way in condition two so if we have these two conditions
satisfied then we know that x z y forms an immorality where x and y are pointing into z okay so
in our example on the bottom right here let's look at condition one there's no edge between x and y and a previous step so that's a and b there's no edge connecting them we removed that edge on the previous
step and then condition two is that z was not in the conditioning set that makes x and y conditionally independent so z here is c and c was not in the conditioning set that
makes a and b independent right so that conditioning set was empty so that condition 2 is satisfied so then we know
that xzy structure which is the a c b here forms an immorality so in step two here we have identified all the immoralities in this graph it's
just one immorality so this allows us to orient some of these edges and finally in step three we're going to try to orient some other edges
the main idea here is that in the last step we would discover all the immoralities and we can use that fact that we should have discovered all the immoralities to orient even more edges
that are not part of immoralities so for example this c d edge here if d were pointing to c
that would form another immorality with a and with b as well so because we didn't detect that immorality back in step two where we detected immoralities
that means that the edge actually has to be pointing from c to d so the way that we do this more formally is that any edge z
y part of a partially directed path of the form x points to z and then undirected edge to y
where there is no edge connecting x and y so no like arc like that any structure like this the
edge that's undirected here can be oriented so that it's directed from z to y right so in that example that we just discussed with d let's just consider this path a to c to
d that matches this x to z then undirected to y and there's no edge connecting x and y there's no edge connecting
x and y that is a and d here so that means that this edge z to y right in this structure here which is c
and d here this edge can be oriented like that and then the edge c e can be oriented for the same reasons
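This orientation rule (the first of what are commonly called Meek's rules) can also be sketched in code. A minimal version, seeded with the immorality edges from step two and the remaining undirected edges of the example:

```python
def adjacent(a, b, directed, undirected):
    return (frozenset((a, b)) in undirected
            or (a, b) in directed or (b, a) in directed)

def orient_away_from_colliders(directed, undirected):
    """Orient z - y as z -> y whenever some x -> z exists with x and y
    non-adjacent: otherwise x -> z <- y would be an immorality that
    step 2 would already have found."""
    directed, undirected = set(directed), set(undirected)
    changed = True
    while changed:
        changed = False
        for edge in sorted(undirected, key=sorted):
            a, b = sorted(edge)
            for z, y in ((a, b), (b, a)):
                if any(h == z and t != y and not adjacent(t, y, directed, undirected)
                       for t, h in directed):
                    undirected.discard(edge)
                    directed.add((z, y))
                    changed = True
                    break
    return directed, undirected

# lecture example after step 2: immorality a -> c <- b, undirected c - d, c - e
directed = {("A", "C"), ("B", "C")}
undirected = {frozenset(("C", "D")), frozenset(("C", "E"))}
d, u = orient_away_from_colliders(directed, undirected)
print(sorted(d), sorted(map(sorted, u)))
```

On this input the rule orients C → D and C → E and leaves nothing undirected, matching the fully identified graph from the walkthrough.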
so that's step three step three here for this specific example allowed us to actually completely identify the causal graph right so by that i mean that we oriented
all of the edges we didn't leave any edges undirected which means we discovered the markov equivalence class of this graph
which happens to be a singleton class so cool in this example we have identified the graph that won't be the case in general we won't always be able to identify
the exact graph from conditional independencies in the data rather we as we say we'll be able to identify the markov equivalence class and that markov equivalence class will
often have several graphs in it represented by some of the edges being undirected in the essential graph so that's the pc algorithm and what if we don't like some of the assumptions
that we had to make for this algorithm so for example say we don't want to assume causal sufficiency we want to allow for unobserved confounders well there's algorithms that try to drop that assumption so for
example there's the fast causal inference fci algorithm or if you don't like acyclicity there's this ccd algorithm as an example and if you want
to drop both of those assumptions there's this satisfiability based causal discovery so for example you can check out these two papers
where sat or satisfiability if you're familiar with it what they do is they reduce causal discovery to a satisfiability problem and then they just give those
constraints to a sat solver i won't really get into that here but basically you're able to drop the causal sufficiency and acyclicity assumptions
when you're using those methods and then one final thing that we'll discuss before we complete this section is that conditional independence testing is hard so these causal discovery algorithms
that are independence based where we're taking data and then trying to discover conditional independencies in the data to identify the graphs to discover
graphs these algorithms rely on having accurate conditional independence tests and they can definitely have that if we have infinite data for example
then conditional independence testing is relatively easy however in general we don't have infinite data and it can be quite a hard problem when we have finite data
as conditional independence tests can sometimes require a lot of data to get accurate test results you can check out this paper if you want to learn more about that that concludes the section on the pc
algorithm and brings us to this question which is what are the essential graphs of the following graphs on these four graphs here so the first one on the left here its
essential graph is when we just undirect all of the edges then the next one this basic immorality as we discussed it is its own
essential graph it's the only graph in its markov equivalence class then the essential graph for this next graph is where all of the edges are undirected and finally
the essential graph for this last one is itself right we can't change the orientation of any of these edges without changing the immoralities and then the second
question which i won't give an answer to is just walk through the steps of pc to get these graphs then the next question is what is the essential graph for this graph you don't have to walk
through the steps of pc here but go ahead and try to figure out what the essential graph is for this one
then a natural question is can we do better than identifying just the markov equivalence class so we saw that with the faithfulness assumption
we can identify the essential graph we can identify the markov equivalence class of the true graph but say there's multiple graphs in that markov
equivalence class can we narrow the class down more so ideally we would get to just a single graph can we
do that for example if we make more assumptions well in the setting where we have multinomial distributions
or we have linear gaussian structural equations this is the best we can do we can only identify a graph up to its markov
equivalence class we can't get anything smaller than that okay but this seems a bit specific like why are we talking about linear gaussian
structural equations what if we care about non-gaussian structural equations where the noise is non-gaussian or
non-linear structural equations and the spoiler is that we can get identifiability of the exact graph in those cases so that's what we'll see in this semi-parametric
causal discovery section it's semi-parametric in the sense that we'll be making assumptions about the parametric form of these structural equations right so the sort of functional form of how the
data was generated we saw some examples of parametric assumptions for identifiability of causal estimates not of graph structures but of causal estimands
in the week on instrumental variables for example so to motivate this section let's quickly discuss some issues with independence-based causal discovery that we don't
that we won't have in semi-parametric causal discovery the first is that independence-based causal discovery requires the faithfulness assumption which we saw is maybe sometimes not that
attractive of an assumption it's kind of easy to think up counter examples to it then the next big issue which we already touched on is that it can require large samples for
accurate conditional independence tests and finally with independence-based causal discovery we can only identify up to the markov equivalence class of the true graph
whereas with semi-parametric causal discovery we'll often be able to pick out the exact graph that is the true causal graph assuming that our semiparametric
assumptions are correct in the first part of this section on semi-parametric causal discovery we're going to point out that you can't
identify the true causal graph without making any parametric assumptions so by identifiability here in this whole lecture
we're referring to identifying the graph not some causal estimate and if we want to identify that graph not just the markov equivalence class of that graph if we want to identify the
specific graph we have to make parametric assumptions so we'll see why that's the case now first we'll look at this through the perspective of markov equivalence so if
we have infinite data we have access to p of x comma y here we have two variables x and y and say we want to distinguish just
these two graphs is our data generated by a graph where y is generated as a function of x
or is our data generated by a graph where x is generated as a function of y right which causal graph is it can we infer that
from this infinite data that we have well if we're looking at conditional independencies in the data no right the essential graph which depicts the markov equivalence class here
doesn't have the direction of the edge in it right both of these graphs correspond to where we're not making any conditional independence assumptions
so they could correspond to any distribution and here's the markov equivalence class for that that's the perspective on if we can
identify which of these two graphs generated this data with two variables from markov equivalence but maybe if we look at it from a
different perspective from scm structural causal models and their structural equations then maybe we can identify which of those two causal graphs generated our data
well unfortunately it turns out that's not the case so there's this proposition that for every joint distribution p of x comma y
on two real valued random variables there is an scm in either direction that generates data consistent with that distribution
in other words an scm where x is generated from y could have generated the data or an scm where y is generated from x could have generated the data so we don't know which variable was
generated from which we can't distinguish between those two causal graphs mathematically this is written as follows so there exists a function f sub y
such that y equals f sub y where you feed x and some noise variable
u sub y and where x and u sub y are independent so there exists a function in that direction mapping x to y and similarly there exists a function
going the other direction f sub x and similarly we have independence here so there exists these functions that are
consistent with the data distribution we can generate data that is consistent with p of x comma y
using either this scm or this scm in the opposite direction so that first one corresponds to this causal graph
x to y and then the second one corresponds to this causal graph where there's an arrow from y to x either causal graph could explain
our data so here we haven't made any restrictions on our scms the scms can be whatever and when that's the case we are not able to
identify the true causal graph we can't distinguish between these two causal graphs to know which one is the true one and in order to do that we're going to need to make some assumptions about the
parametric form the first class of assumptions we'll make is that our structural equations are linear with non-gaussian noise
so recall that we cannot hope to identify the graph more precisely than its markov equivalence class in the linear gaussian noise setting
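to see concretely why the linear gaussian setting is stuck at the markov equivalence class, here's a small simulation sketch of my own (not code from the lecture): with linear-gaussian data the regression residuals look independent of the input in both directions, so there's no asymmetry to exploit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# linear-gaussian SCM: y := x + u, with gaussian noise independent of x
x = rng.normal(0, 1, n)
y = x + rng.normal(0, 1, n)

def dependence(target, pred):
    """fit target ~ pred by least squares, then check whether the spread
    of the residuals changes with |pred| (a crude independence check)."""
    a, b = np.polyfit(pred, target, 1)
    resid = target - (a * pred + b)
    return np.corrcoef(resid**2, np.abs(pred))[0, 1]

# both directions look equally "clean" -- values near zero in each case
print(f"causal direction (y on x):      {dependence(y, x):+.3f}")
print(f"anti-causal direction (x on y): {dependence(x, y):+.3f}")
```

for jointly gaussian variables, uncorrelated residuals are actually independent of the input, so this check comes up empty in both directions and we can't tell the two graphs apart.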
but that setting seemed oddly specific what if the noise is non-gaussian for example so by the linear non-gaussian noise assumption
i mean that all structural equations causal mechanisms that generate the data are of the following form where
y is generated as a linear function of x so linear function plus some noise term u where x
and u are independent and importantly u is distributed as some non-gaussian distribution so if u were gaussian here we have this impossibility result
that we can only do as well as the markov equivalence class but if u is non-gaussian we're going to be able to exactly identify the graph
that's what this theorem from shimizu et al 2006 states so it's saying that in the linear non-gaussian setting if the true scm
is this one where y is generated from x then there does not exist an scm in the reverse direction where x is generated from y
where y and this u tilde are independent so there doesn't exist such an scm in the reverse direction that can generate the data consistent
with the observational distribution of course there exists an scm in the true direction that can generate data consistent with the observational distribution but in the linear non-gaussian setting
there doesn't exist an scm in the reverse direction for generating that observational distribution because there doesn't exist an scm in
the reverse direction we are able to identify the direction of this edge we won't cover the proof in this lecture but you can see the proof in the course
book okay so that's the result but what's something that you can use to help remember this to give you some intuition for why we
have this result it turns out it's going to have a lot to do with these independencies here so the input variable being independent from the
noise and then here we're going to have that the input variable is not going to be independent from the noise in the reverse direction
okay so that's what we'll see here so say that we have some data and we're going to regress y on x so we're going to fit a linear
regression to these data points and get this line and we'll say that this is the correct causal direction in other words y is generated from x when we do the linear regression in this causal
direction we get this nice line that seems to fit the data pretty well but when you do the regression in the anticausal direction when the noise term is non-gaussian you
don't get such a nice line you actually get something that looks like this okay so this line doesn't quite look right it doesn't look like this line where we do the regression in
the causal direction but in the anti-causal direction we get this line and this happens only because the noise term is
non-gaussian if the noise term were gaussian we wouldn't have this okay so something seems off here but to get a bit more clarity on this we have to look at the
residuals so by looking at the residuals that's where we subtract our prediction here so this is our
prediction if we were to fit this function using linear regression we'd subtract our prediction from the true value right so it's just like moving this to the other side then we isolate the
residual same thing in the causal direction we would be fitting this function and subtract that to the other side we'd be trying to estimate this residual so we're going to look at the
residual plot to kind of get plots of this u and this tilde u here so this is what our residual plot looks like when we do the linear regression in the
causal direction regressing y on x in this plot we have the residuals on the y axis which is like an estimate of u
and we have x on the x axis so we're kind of trying to see if x is independent of u and it looks pretty independent
of u here but then when we look at the regression in the anti-causal direction now here we're regressing x on y and now on the
y-axis we have this u-tilde and on the x-axis we have y now because we want to be matching this sort of input variable
now we see that y is dependent on u tilde okay so this first plot looks pretty independent right so the distribution
of the residual u here doesn't seem to be changing as we condition on different values of x so they're independent but then here
this distribution of u so it's pretty thin here when we condition on values of y over here it's a bit wider as we condition on values of y over here the distribution
is getting wider okay and then it thins out again so clearly the support is changing as we change y
so u tilde absolutely depends on y and this is the anti-causal scm right we are trying to fit an scm in the
anti-causal direction but we didn't get one right because we needed y to be independent of the noise term in the anti-causal direction but it's clearly not
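here's a small simulation of that intuition (a sketch of my own, not code from the lecture): with uniform, hence non-gaussian, noise, the residuals from the anti-causal regression are visibly dependent on the input, while the causal direction's residuals are not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# true SCM: y := x + u, with uniform (non-gaussian) noise independent of x
x = rng.uniform(-1, 1, n)
u = rng.uniform(-1, 1, n)
y = x + u

def dependence(target, pred):
    """fit target ~ pred by least squares, then check whether the spread
    of the residuals changes with |pred| (a crude independence check)."""
    a, b = np.polyfit(pred, target, 1)
    resid = target - (a * pred + b)
    return np.corrcoef(resid**2, np.abs(pred))[0, 1]

# causal direction: residual spread doesn't change with x (value near 0)
# anti-causal direction: residual spread shrinks as |y| grows (clearly negative)
print(f"causal direction (y on x):      {dependence(y, x):+.3f}")
print(f"anti-causal direction (x on y): {dependence(x, y):+.3f}")
```

note that a plain correlation between residuals and predictor is zero by construction of least squares, which is why the check looks at the squared residuals against the magnitude of the predictor instead; the asymmetry it finds only appears because the noise is non-gaussian.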
you can check out the proof in the book if you want to see kind of the technical stuff going on for how to prove this but hopefully this gives you some good intuition and helps you
remember this there are several extensions to this linear non-gaussian identifiability result and method so what we're looking at was just for
two variables x and y but you can extend it to multiple variables so this multivariate setting and if you want to drop some assumptions you can drop the causal sufficiency assumption that's what they work on in
this paper or drop the acyclicity assumption that's what they work on in this paper in the final section we'll look at
a non-linear semiparametric assumption where we're assuming additive noise so let's see what that is again recall that we cannot hope to identify the graph more precisely than
the markov equivalence class in the linear gaussian noise setting but what if the structural equations are non-linear so here's this nonlinear additive noise
assumption for all variables for all i the structural equation that generates x sub i is a nonlinear function
f sub i so f sub i is a nonlinear function of the parents of the ith variable plus some noise term here okay so that's the
nonlinear additive noise assumption so importantly this noise is additive under this assumption and under the markov assumption the causal sufficiency assumption acyclicity and then this nonlinear
additive noise assumption we just mentioned and then a technical condition from the paper where this comes from which we won't get into under these assumptions we can identify the causal
graph right so we don't just get the markov equivalence class we get the precise graph a single graph but it's only the correct graph
under this nonlinear additive noise assumption and the technical condition from their paper cool so this seems a bit more flexible in the sense that we don't have to assume
that the functional form is linear it can be nonlinear now but say you don't like that we're assuming this additive noise well we can get that to be a bit more
general so you don't like additive noise then just kind of chuck in another function on the outside there right so it's post
nonlinear in the sense that this g comes after we've added u here but it's not quite additive noise in the sense that we also have
this function of both of these after we've added it okay so in this post non-linear setting where you just have this function g this function f then you can get an
identifiability result there as well if you want to learn even more about causal discovery from observational data here are three nice sources
so the first one is a review article from frederick eberhardt the second one is another review article from some pretty big
authors so all three of these guys do a lot of research on this topic i think this guy is the p
in the pc algorithm i think the p comes from this guy so his first name starts with a p peter and then this guy is the c
in the pc algorithm clark and then this guy just does a ton of papers on this topic so yeah so they all wrote a review paper together
and then the last resource on this list is a book called elements of causal inference where it's basically the only book on causal discovery from observational data
and we're lucky enough to have the first author on this book jonas peters give a guest talk in this course so the next video that gets uploaded to the channel
will be his guest talk which will be in the full lectures playlist if you're looking at the playlist of smaller videos it's not in that playlist it's in the full lectures one
if you want to get notified for that lecture or any stuff coming up in the future any lectures any other videos coming up in the future then go ahead and hit the subscribe button below
and same with the bell icon next to it if you have any good jokes about immoralities go ahead and leave those in the comments below thanks for watching and i hope to see you in the next one
not that i can actually see you through your webcam or anything i don't think the nsa cares enough about causal inference quite yet for that to be the case but i don't know
what time you're watching this lecture in the future