MIA: Cheng Zhang and Nick Pawlowski, Deep End-to-end Causal Inference; Primer: Causal Discovery
By Broad Institute
Summary
Topics Covered
- Correlation Does Not Imply Causation
- A/B Testing Cannot Scale to Real-World Problems
- Interventions Are Fundamentally Different from Observations
- End-to-End Causal Inference Unifies Discovery and Estimation
- Combining Deep Learning and Causality Enables Real Solutions
Full Transcript
decision optimization with causal machine learning. As introduced, we will talk about the basic concepts in the first part, and then more about our recent research in the second portion of the talk. In general we want to use machine learning to answer what-if questions, so that we can get insights from all the past decisions, from everything that has happened, and improve our future decisions, whether that means driving better revenue or helping patients get better health. That is the reason we care about decision making. As introduced, today's talk has two parts: the first part will be generic causal machine learning, namely causal discovery and causal inference; the second part will focus on deep end-to-end causal inference. In this way we really want to make large-scale, real-world applicable causal machine learning algorithms, so that we can help society and help different patients, customers, and users.
So let's get started. First, let's think about what "causal" means; we use this word all the time, but what makes something causal? We all know the classic example of ice cream consumption and the weather: we are in the UK and it is raining today, and it doesn't matter how much ice cream I eat, the weather will not get better. So what is causal? Causality tells us that if we actually change the cause, something will happen to the effect. In layman's terms, that is really what defines causal and why we care about it: a change in the cause leads to a change in the effect. That leads to the next question: what is the difference between causation and correlation? Reichenbach's common cause principle tells us that any correlation we observe in the data has to be induced by some causal relationship. For example, if two variables are correlated, it must be one of the following situations: X causes Y, Y causes X, or both are caused by something else together. So if we observe a correlation, it is commonly induced because, in the underlying system, there is some kind of causal relationship. If we don't care about causality, we just ignore this: we don't distinguish between these situations, we only observe the correlation. But the fact that X and Y are correlated doesn't tell us anything about what we can do to them, or what will happen if we act.
Why is this important? If we don't care about causality and just build a machine learning system that is purely correlation based, what will happen? Say we want to drive revenue, and how big or how beautiful the office building is happens to be highly correlated with revenue. A system that doesn't care about causality will say: if we spend more money on the office building, the revenue will grow. That would be completely wrong, and the same holds in healthcare. So we want to make sure we bring causality into decision making, so that we can make optimized decisions. Please do ask questions if you have any, whether you are in the room or joining remotely.

When we want to establish real causal relationships, the common approach is the good old randomized controlled trial, which is still the gold standard: split people into an A group and a B group, give one group the treatment and not the other, and then measure the difference between the two groups. Then we know whether a causal relationship exists and how big the effect is. But most of the time that is simply not possible, especially in healthcare: you are not allowed to just randomly assign patients to a treatment, it is not ethical, and in many other applications it is far too expensive. This shows the limitations of our good old A/B testing. It has a very high cost, where cost can mean money, people's health, or time. Because the cost is so high, it can only be run at small scale, and most of the time it needs a long waiting period, during which the world keeps changing, so by the time you gather the insight it may no longer be valid. And it only has low resolution: it tells us, on average, whether there is a treatment effect between two groups, but it does not tell us what the treatment effect is at the individual level, so the resolution is really low.
That leads to why we care about causal machine learning; hopefully you are convinced by now about causality. What we want to do in causal machine learning is answer the same causal questions, but get the insight from existing data alone. We don't want to run A/B tests for everything to get insight, but rather use what we already have. Existing data can be observational data alone, but it can also be a mixture of observational and interventional data. If you happen to have some A/B testing results, of course you want to use that kind of data, but the approach does not require you to run all possible tests in the world to get the insights, because that would be a huge cost, and we want to help you avoid it. So causal machine learning is a field that uses causal principles in machine learning, works with existing data, and gives us causal insight.

When we talk about causality, there are actually two common causal tasks. Personally I don't like dividing things this way, because we should really think about how we can help users, but in the research community these are the two most common tasks: one is called causal discovery and one is called causal inference. Causal discovery focuses on the following: given the data, we want to understand the causal relationships. For example, does an incentive cause the number of sales to grow? We want to find whether a causal relationship exists, which variable is the cause and which is the effect, and the output is commonly a causal graph. Causal inference cares more about what exactly happens as the consequence of an action: if we do something, how much will something else change? That is the focus of causal inference. These are indeed the two common causal tasks, historically, in research. Now let's get a bit more into the details.
How do we represent causality in a more fundamental way? I already showed you one way: a causal graph. A causal graphical model is very easy to understand: we have nodes and we have edges, and in the most common case we assume it is a directed acyclic graph. Of course there are cycles in the real world, but in most of the cases we will see, it is a directed acyclic graph. In this graph, the edges represent the causal relationships, so, as we defined, if you change A you expect a change in its children. Another common language to describe causality is structural equation models. The idea is that each variable is a function of its parents, and the parents are its causes: every observation is generated from its parents through some function, with some noise. These are the two most common languages for causality. In a way the SEM is richer, because it even tells you how the functions along the edges look, while the graph gives the guidance: it tells you which variables enter each function as parents, for example that C is a function of A and B plus noise and of nothing else. So the causal graph tells you which variables are the parents, and that is reflected in the SEM; conversely, you can read an SEM and recover the causal graph.
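As a concrete illustration of the two languages (a minimal made-up sketch, not code from the talk), here is how one might write down and sample from a small structural equation model with additive noise in Python; the graph A -> B -> C and the coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Structural equation model for the DAG A -> B -> C:
# each variable is a function of its parents plus independent noise.
A = rng.normal(size=n)                     # A has no parents
B = 2.0 * A + rng.normal(size=n)           # B := f_B(A) + noise
C = -1.5 * B + rng.normal(size=n)          # C := f_C(B) + noise

data = np.column_stack([A, B, C])
# All three variables are correlated, but only the SEM (and its graph)
# tells us what would change under an intervention.
print(np.corrcoef(data, rowvar=False).round(2))
```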
With these languages we also need to think about the data distribution, and the graph is a very convenient tool for working with causality. There are some very basic assumptions that you see almost everywhere. The first is the Markov condition. What does the Markov condition tell us? It says that the conditional independencies you can read from the graph also hold in the data. For example, if X causes Z and Z causes Y, you can read from the graph that X and Y are independent conditional on Z, and the Markov condition says this will be reflected in the data. Then there is another condition, faithfulness. Faithfulness goes in the other direction, from the data distribution to the graph: if two variables show an independence property in the data, they should also be independent in the graph. So it translates data properties into graph properties. You can see why there are particular cases where this does not hold: for example, if you have a graph with two paths from A to B and the two effects along the paths exactly cancel, then the faithfulness assumption is violated. But these are fairly mild assumptions, because such exact cancellations do not happen that often. These are the two fundamental assumptions we use in causality very often, and you can see why they are important: combining the Markov assumption and the faithfulness assumption tells us that how we read a causal graph is consistent with the statistical properties we see in the data. With that, we can build causal machine learning on top of historical data, because what we read from the graph and what we read from the data distribution are consistent with each other.
OK, I introduced the basic causal graph, but there are actually a lot more kinds of graphs. I will not go through the full taxonomy of causal graphs, I will just give one example. We talked about graph properties and data properties, but sometimes the data properties you have can be explained by multiple graphs. For example, look at these three directed graphs on the left: they all imply the same conditional independence conclusions about the data distribution: A and B are dependent, B and C are dependent, and given B, A and C are independent. So for a given set of data properties there may be multiple graphs that explain it. Graphs that have the same conditional independencies are called Markov equivalent, and the set of such graphs is commonly called the Markov equivalence class. In causality, because we have a Markov equivalence class, we can represent it with something called a CPDAG, a completed partially directed acyclic graph. In general we do not know how to orient some of the edges, and we represent that the way you see on the right: an undirected edge means the edge could go either way among the members of the class. Colliders are special here: you can never create new colliders, so all the graphs in one equivalence class share the same collider structure. So, apart from the DAG itself, the CPDAG is another representation that stands for many graphs at once, and it is very easy to work with; it is one of the common languages we use in causality.
OK, so that was causality as a preliminary. In the next part, Nick and I will dive a little bit more into the basic methods for causal discovery and causal inference; as we said, these are the two tasks. We will introduce a few of the basic methods, which reflects how people think about this problem, and that will lead into the second part of the talk later. This is research done by many great researchers, and it really laid the groundwork for today's larger-scale work. Let's dig into causal discovery first. In causal discovery, the input is historical data, so you can imagine a big table of data, and the output is a causal graph: we want to figure out the causal relationships, which variables cause which. I will give some examples from each family of methods, hopefully just to give you a flavor of how we think about causal discovery and what is possible. Personally, when I first got into this field I always thought, how can you possibly find causal relationships from data, that must be impossible; but the field really shows that it is possible, and that is what makes me excited about the whole area.
I will just cover a few very basic, very early constraint-based methods. One is called PC and another is called FCI. PC is actually named after its authors, and it uses quite mild assumptions: the Markov and faithfulness assumptions we just talked about, and in the basic version we also assume causal sufficiency, which means that everything we care about is observed in the data, there is no really critical hidden variable that affects everything without us knowing it. PC in general has two steps: skeleton search and edge orientation. I won't introduce the other methods in as much detail, because the basic idea is the same: how do we set up conditional independence tests and use them to find the causal graph.

For PC, say we have a dataset. We start by assuming everything is fully connected, and then we start doing conditional independence tests. If a test says two variables are independent, the edge between them cannot be there, so you remove it; and you keep testing more conditional independencies with larger and larger conditioning sets. For each remaining edge, as soon as you find a conditioning set that makes the two endpoints independent, you remove the edge; think about the Markov and faithfulness conditions, independence means the edge cannot be in the graph. Once there is no edge you can remove anymore, the skeleton search is finished, and you end up with a skeleton: you do not know the orientations yet, but this is already good progress. For the orientation there are some rules that come from very basic graph theory. You look at triplets of nodes, say A, C, B, where A and B are not adjacent: we do not know the directions, but if A and B become dependent when we condition on C, the triple must be a collider, with A and B both pointing into C, because that is the only structure consistent with this. So you look at all the triplets, find all the colliders, and after that you can orient more edges with further rules: for example, if A points into C and C has an undirected edge to D, with A and D not adjacent, then the edge must be C pointing to D, because if D pointed to C you would create a new collider that we did not detect. With these simple rules you can orient the causal directions. So you can see that just using conditional independence properties and graph-theoretic analysis, under this list of assumptions, we can find the causal graph, which tells us what is a cause and what is an effect.
This is already very rich information, and it can already make a huge real-world impact for many applications: you may care about gene regulatory networks, or in education people care about what actually drives education outcomes, and so on. I just want to say one thing about the example I gave: the algorithm usually actually returns a CPDAG. We talked about how many graphs can share the same independence conditions; for example, in this case one edge cannot be oriented, so the final result leaves it undirected. That just tells you it can be either one graph or the other; we orient as much as we can. So for this type of algorithm, what is returned is in general a partially directed graph, which represents a set of graphs: we cannot narrow it down any further, but the true graph is one of them, and they all have the same conditional independence properties. OK, that is PC. This gives you a flavor of the constraint-based approach: we pose constraints from independence tests, and with those constraints and some basic rules we find the causal relationships. There are also rigorous proofs that, when the full list of assumptions is fulfilled, we do recover the correct graph up to its equivalence class. And you can imagine that there are, of course, practical difficulties, for example the errors introduced when you do the conditional independence tests.
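To make the two-stage idea concrete, here is a minimal sketch of a PC-style skeleton search for continuous data; it is not the speakers' implementation, and the partial-correlation test, the threshold, and the simplified bookkeeping (no separating-set storage for the orientation phase) are all illustrative choices.

```python
import itertools
import numpy as np
from scipy import stats

def ci_test(data, i, j, cond, alpha=0.05):
    """Partial-correlation test: are columns i and j independent given `cond`?"""
    x, y = data[:, i], data[:, j]
    if cond:
        Z = np.column_stack([np.ones(len(data))] + [data[:, k] for k in cond])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # residualize x on the conditioning set
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # residualize y on the conditioning set
    _, p_value = stats.pearsonr(x, y)
    return p_value > alpha                                  # large p-value -> treat as independent

def pc_skeleton(data, alpha=0.05):
    """Skeleton step of PC: start fully connected, remove an edge as soon as its
    endpoints look independent given some subset of the remaining neighbours."""
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}        # fully connected start
    for size in range(d - 1):                               # grow the conditioning-set size
        for i in range(d):
            for j in list(adj[i]):
                candidates = adj[i] - {j}
                for cond in itertools.combinations(candidates, size):
                    if ci_test(data, i, j, list(cond), alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
    return adj
```

The orientation phase (collider detection plus the propagation rules described above) would then run on this skeleton using the separating sets found during the tests.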
Of course, as researchers we do not want to limit ourselves to that whole list of assumptions, because the number of real-world problems we can solve with them is very limited, and we know causal sufficiency cannot be guaranteed in many applications. If there is latent confounding, the algorithm as I described will break. FCI is one algorithm that relaxes this assumption. It works very similarly to PC, but it has a list of new rules that can handle the situation where latent variables exist, that is, when you know there may be variables that influence the system but are not observed in the dataset. I am not going to go through all the rules; there are eleven rules for orienting the edges, so we will skip them. This constraint-based family of causal discovery methods is still an active research field, and in the past two decades there has been a lot of work relaxing the various assumptions: for example, how can we allow cycles, or what happens when the data has values missing not at random, can we still use a PC-style algorithm in all these situations? They can also be combined with other methods: instead of using only these orientation rules, can we combine them with other ways of orienting the edges? And there are other types of constraints, for example people look at small patterns, certain properties that must hold across the graph. But this is the typical example of constraint-based causal discovery: you set up constraints and you search for the graph. Let's go to the second type, which is score-based methods.
Conceptually this is extremely simple: a score-based method just asks which graph explains the data better. And there is actually theory behind it: of course there is again a list of assumptions, but there are theorems telling you that, under those assumptions, the graph that explains the data best will be the true causal graph. Algorithm-wise it is very simple: you want to find the graph, typically required to be a directed acyclic graph, that maximizes your score. The score is up to your definition, but a traditional, commonly used one is BIC; more modern methods use likelihood-based scores and information-based scores, and you can define your own score as needed. One of the early score-based methods just keeps it at BIC: you compute the BIC and evaluate which graph fits your data best. So conceptually it is very, very simple.
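As a hedged illustration of what "scoring a graph" can mean in practice, here is a small sketch that computes a BIC-style score for a candidate DAG under a linear-Gaussian model by regressing each node on its parents; this is not the exact score used by any particular method mentioned in the talk.

```python
import numpy as np

def bic_score(data, parents):
    """BIC-style score of a candidate DAG (higher is better here).

    `parents` maps each column index to the list of its parent columns,
    e.g. {0: [], 1: [0], 2: [1]} for the chain A -> B -> C.
    """
    n, d = data.shape
    log_lik = 0.0
    n_params = 0
    for j in range(d):
        y = data[:, j]
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents[j]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = max(resid.var(), 1e-12)
        # Gaussian log-likelihood of this node given its parents
        log_lik += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        n_params += X.shape[1] + 1                 # coefficients plus noise variance
    return log_lik - 0.5 * n_params * np.log(n)    # data fit minus complexity penalty
```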
So what is the challenge here? The set of all possible graphs is a huge space; it grows super-exponentially with the number of nodes, so even with 10 nodes you are not able to score all the graphs anymore. For score-based methods, in the early days people developed heuristics, and of course you need to prove these are still correct, for how to narrow down the search space. GES is one of the earliest such methods. In general the heuristics work along these lines: start from the empty graph, add edges step by step whenever they increase the score, and then remove edges whenever that improves the score. These heuristic rules narrow down the search space and make the search feasible. There are more rules, but essentially it is a forward and a backward search. So the recipe is: narrow down the search space to make it feasible, find the graph that best explains the data, and that is the causal graph under the listed assumptions.
In more recent years, people started to think about whether we can convert this discrete decision, checking all the possible graphs, into a constrained continuous optimization. One piece of work that is very popular nowadays, which many groups including ours build on, is called NOTEARS, from 2018. One of its contributions is that, instead of enumerating possible graphs, it finds a way to characterize what it means to be an acyclic graph: there is a smooth function of the weighted adjacency matrix, based on the trace of a matrix exponential, that equals zero exactly when the graph is acyclic. So you can set that as an equality constraint, the problem becomes a constrained continuous optimization, and you can use an augmented Lagrangian to solve it. This was a large step towards making such methods more scalable, and we will get a bit more into it in the second part of the talk.
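For reference, the acyclicity characterization being described can be written in a few lines; the sketch below follows the published NOTEARS constraint h(W) = tr(exp(W*W)) - d, which is zero exactly when the weighted adjacency matrix W corresponds to a DAG (here W*W is the elementwise square).

```python
import numpy as np
from scipy.linalg import expm

def notears_constraint(W):
    """h(W) = tr(exp(W * W)) - d; equals zero iff W is the weighted adjacency
    matrix of a directed acyclic graph (Zheng et al., 2018)."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # elementwise square keeps entries non-negative

# Example: a 3-node chain is acyclic, a 2-cycle is not.
chain = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
cycle = np.array([[0, 1], [1, 0]], dtype=float)
print(notears_constraint(chain))   # ~0.0
print(notears_constraint(cycle))   # > 0
```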
So these are the first two types of causal discovery methods, and both of them can potentially return equivalence classes. The last type is functional causal models, which is yet another class. As we talked about, besides the graph there is the SEM, and the idea is that by making some fairly simple assumptions on the functional form of the structural causal model, for example a deterministic function of the parents plus additive noise, we may be able to identify the graph and the causal directions. One of the early examples, and I find it really enlightening, is LiNGAM, the linear non-Gaussian noise model. Instead of dealing with a fully general functional form, we look at additive noise models, and at the linear case first. Take a simple example: we have data for X and Y, and we want to find out whether X causes Y, Y causes X, or there is no causal relationship at all. In the linear case we know the functional form, so if we assume X causes Y we can just fit a linear regression; after fitting it, we can look at the noise, the residual, and at the assumed cause, which here is X. If the model is correct, the residual and the cause should be independent. But maybe it is the other situation, Y causes X; then we fit another linear regression, X equals a times Y plus noise, and again we look at the residual and the assumed cause, which is now Y. What you see is that only in the first case are the noise and the cause independent; in the second case the residual and the cause are not independent anymore, so that direction is wrong. Only one configuration is consistent, so by making this assumption on the functional form we are able to identify which configuration must be the correct one, and that is how we discover the causal relationship.
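A hedged sketch of that pairwise procedure: fit a linear regression in each direction and check in which direction the residual looks independent of the putative cause. The crude nonlinear-correlation measure below is only a stand-in for a proper independence test (for example HSIC), and the toy data are made up.

```python
import numpy as np

def fit_residual(cause, effect):
    """OLS fit of effect ~ cause (with intercept); return the residual."""
    X = np.column_stack([np.ones_like(cause), cause])
    beta, *_ = np.linalg.lstsq(X, effect, rcond=None)
    return effect - X @ beta

def dependence(a, b):
    """Crude nonlinear dependence measure between a and b
    (a stand-in for a proper independence test such as HSIC)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return abs(np.mean(np.tanh(a) * b)) + abs(np.mean(a * np.tanh(b)))

def lingam_direction(x, y):
    """Return 'x->y' or 'y->x': pick the direction whose regression
    residual looks more independent of the putative cause."""
    score_xy = dependence(x, fit_residual(x, y))   # residual of y given x, versus x
    score_yx = dependence(y, fit_residual(y, x))   # residual of x given y, versus y
    return "x->y" if score_xy < score_yx else "y->x"

# Toy check with uniform (non-Gaussian) noise and true direction x -> y.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20_000)
y = 1.5 * x + rng.uniform(-1, 1, 20_000)
print(lingam_direction(x, y))   # expected: 'x->y'
```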
So that is the basic idea behind functional causal models. Here the noise must be non-Gaussian, and just with additive non-Gaussian noise we can already identify a lot of things. Why do we need non-Gaussianity? It is essential because, for linear additive noise models, the Gaussian case is the only one that is not identifiable: you can fit both directions and you will see the same result. That is why we need the assumption. There is a whole line of research relaxing these assumptions so that more cases become identifiable: for example, instead of the linear additive noise model there is the post-nonlinear model, where the functions can be nonlinear and things are still fine, with only five special cases that are not identifiable. And of course we do not only deal with two variables: in the LiNGAM case we can handle multiple variables, and it can actually be very efficient, because all the noises are independent, which is the key fact, and we can borrow the good old ICA machinery to estimate the whole thing. So we can discover graphs over many nodes as well. That is the spirit of functional causal models. I think by now I have given you an introduction to the fundamentals of the three types of causal discovery methods: constraint-based, score-based, and those based on functional causal models. You can also see that so far we have not really used the functional relationships in the SEM to answer causal inference questions; for the next part, Nick will tell you more about how the fundamental causal inference research has been conducted.
Thanks, Cheng. As mentioned, that was the first intro and causal discovery; I am going to take over and talk a bit more about causal inference. Just to remind us why we are doing this: the first part was really about finding edges and graphs; the second part is about asking questions like, how much ice cream would we sell if it were sunny, given that we know sunshine causes ice cream consumption rather than the other way around. The important part here is that causal inference estimates causal quantities; there are a bunch of them that we can talk about, and different interesting questions we can ask and answer, but all of them assume some sort of knowledge about a graph, a partial graph, or some equivalent assumptions.
So let's start off by talking about what it means to perform an intervention, to actually act and change something in the system. At least one way of thinking about it is Pearl's do-operator, which basically means: if you want to act on C in this graph and, say, set its value to one, what we do in the model is cut the edge from A to C, because there is now no relationship anymore; no change in A changes the value of C, the intervention disentangles it. If we then want to look at the distribution of B given that we have done this intervention on C, that is, we have set the value of C to one, we can compare that to the conditional distribution, where we just say we have a bunch of observational data and ask what the outcome of B is given that we have observed C equal to one. These are actually not the same thing. Just to visualize this a bit, assume we are in a mostly binary scenario, so in particular C is binary. In the left case, the conditional, we still need to marginalize over the conditional distribution of A given C. In the right case we do not condition the distribution of A on C anymore, because this edge is cut. This is the important difference: we have mutilated the graph, as it is called, by cutting this edge and getting rid of this dependence, and then we marginalize not over the conditional distribution of A but only over the marginal distribution of A. If you compute both quantities you will, in most cases at least, get different results. So you need to do something different from just computing conditional probabilities with all the usual rules you might already know.
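To make the difference concrete, here is a small made-up numerical example (not from the talk) with the graph A -> C, A -> B, C -> B, comparing P(B=1 | C=1) with P(B=1 | do(C=1)) by forward simulation; all the probabilities are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical binary SCM: A -> C, A -> B, C -> B (numbers are made up).
A = rng.random(n) < 0.5
C = rng.random(n) < np.where(A, 0.8, 0.2)          # A strongly influences C
B = rng.random(n) < 0.1 + 0.3 * A + 0.4 * C        # B depends on both A and C

# Observational: condition on C == 1 (A is no longer at its marginal 50/50).
p_b_given_c1 = B[C == 1].mean()

# Interventional: do(C = 1) cuts A -> C, so we regenerate B with C forced to 1
# while A keeps its natural marginal distribution.
B_do = rng.random(n) < 0.1 + 0.3 * A + 0.4 * 1
p_b_do_c1 = B_do.mean()

print("P(B=1 | C=1)     ~", round(p_b_given_c1, 3))   # pulled up, since C=1 favours A=1
print("P(B=1 | do(C=1)) ~", round(p_b_do_c1, 3))      # averages over the marginal of A
```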
So what are the things we want to calculate? The most common one is probably the so-called average treatment effect, where we calculate effects on the whole population. In the previous scenario the intervention variable, which I now call X, is our cause, or our treatment as it is often called, and Y is the effect variable. We then typically care about the effect of changing our treatment from zero to one: we look at the expected outcome Y under the interventional distribution when we perform the intervention on the treatment variable, minus the expected value of the outcome under a different intervention, which is often a baseline or reference treatment. The main thing here is that we take this expectation over the full distribution of the whole population that we have observed.
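Written out, the quantity being described is

```latex
\mathrm{ATE} \;=\; \mathbb{E}\bigl[\,Y \mid \mathrm{do}(X = 1)\,\bigr] \;-\; \mathbb{E}\bigl[\,Y \mid \mathrm{do}(X = 0)\,\bigr],
```

where X = 0 plays the role of the baseline or reference treatment.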
The second thing, which might be more interesting to a lot of people, is the so-called conditional average treatment effect, where we care about the effect on subpopulations. For example, in the ice cream scenario: if we could turn the sun on, what would the ice cream consumption be in a country like the UK versus a country like Italy, where it is sunny most of the time anyway, so there might be less of a difference; or people in Scandinavia, who have dark weather a lot of the time, might like ice cream and eat it all the time regardless, so there is not much of an effect there. Similarly, you can think about sales scenarios in different countries, or medical scenarios with different genders, different preconditions, different historical exposures that you might care about and want to condition on. And the last one, which is potentially the most interesting, or the most complex to calculate, is the so-called individual treatment effect, where we really care about questions like: what if I change something about myself, what is my ice cream consumption when it is sunny versus when it is not, what happens if I take this drug. This is much more targeted at a single individual or a single instance rather than any population, and it requires more understanding of the actual functions and how the world works. That was just to lay out the things we care about; I am mainly going to talk about the average treatment effect, because it is the most common thing to deal with.
So what do we actually use to estimate those things? The first ingredient is the graph, either the one we found using causal discovery or one we have from some sort of domain knowledge, and we then use do-calculus. Do-calculus is the combination of the general rules of conditional probability with the do-operator and a set of rules around it, and basically what it does is identify the subgraphs that are required to estimate the effect of the intervention we care about. If we look at this graph here, we have A, B and C again from the previous example, but with some extra example nodes on top as well. If we care about the intervention on C, we cut all the edges that lead into C. We can also get rid of this top-right node, because it is completely disconnected from our outcome variable B, so we can just throw it away; we do not need to measure it and it gives us no information. We also cut the edges into this other node on the left, and by applying more rules of probability we can see that, as long as we observe A, we can get rid of that node too. By following all of those rules we can identify a smaller subgraph, the variables we actually need to observe, so that we can estimate this effect from observational data, knowing that this is the graph that generates the data. And then we get back to the equation we saw previously: we just marginalize over the marginal distribution of A, set C to one, and we obtain the interventional outcome.
This obviously works for much more complicated examples as well. We can for example look at this graph, where we have some unobserved confounding between X and Z4, so there is a node that causes both of them but that we cannot observe, plus some other nuisance variables that we do not necessarily care about. Knowing that this is the graph, we might now ask: which variables do I need to observe so that I can actually identify and estimate the effect? For that we look at the so-called backdoor criterion, which is probably the most common, or at least the easiest, criterion for identifying causal effects. It says that a set of nodes Z in our graph satisfies the backdoor criterion with respect to a pair of variables X and Y, where X is our treatment and Y is our outcome, meaning we already need to know which treatment effect we want to estimate, if two things hold: no node in Z is a descendant of X, and Z blocks every path between X and Y that contains an arrow into X. This basically means we need to block all the backdoor paths, in particular the ones created by confounders.
If we then look at our example graph, the first thing we see is Z4, which we need to control for and take into our set. But we also see that if we control for Z4, we actually open a new path all the way around, a backdoor path that we then need to block as well, so we additionally need to control for either Z2 or Z5. If we do that, we can choose our conditioning set; in this example let's say we control for Z4 and Z5. We can then estimate the treatment effect of X on Y as long as we measure Z4 and Z5; the other variables we do not actually need to measure, and we can still estimate the treatment effect, or the interventional distribution.
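For reference, once a valid backdoor set Z (here Z4 and Z5) has been chosen, the interventional distribution follows from the standard adjustment formula:

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z).
```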
Then what we basically do is, again, we are interested in this expectation over the whole population, where we ask what changes on average, for example in a phenotype if we knock out a gene, and we can say: let's actually run this experiment, we now know which variables we need to measure. One short note here: we still have Z6 on this path, but it just gets absorbed into the noise of the outcome variable, so we do not really need to care about it and we can still estimate the effect. So using do-calculus we can identify which variables we actually need to measure, so that you can then go and design your experiments, or, if you already have an experiment, you can ask: do I have all the variables and all the measurements that I need to estimate my effect?
Just a quick question: why does controlling for Z4 open new paths? Can you go back to the definition and explain a little bit? Sure, sorry, this was still coming up. If we control for Z4, we have this open path from Z2 all the way down to Y: there is a collider structure between Z2, Z4, and an unobserved node that we know exists but can never measure. I did not draw it in, but there is a collider there, and that path gets opened once we control for Z4. So that means you assume that there might be some possible confounder there? Yes, exactly. And this is one thing to keep in mind: you need to know your graph to be able to apply those rules, to identify whether you can estimate the effect and how to estimate it. Well, what if there are unobserved confounders, say between Z4 and Z2, do we get into trouble then? If we have unobserved confounders between Z4 and Z2, in addition to the one between X and Z4: as long as we measure Z4 and Z2 and the unobserved confounders are only between those two variables, then this is still fine. OK. Any more questions?
Cool, thank you. So basically there are a bunch of rules that you can follow and apply, and they will tell you, first of all, whether you can estimate your causal effect at all, and then also how to estimate it. Let me give two very simple ideas of how to calculate the effect; there are many more out there, but this is just a first introduction. One is that we can run a simple linear regression: we say our outcome is a linear transformation of all the variables that we now know influence the outcome, and we run a linear regression on that. Given it is a linear regression and we want to look at the effect of moving the treatment from zero to one, we can then simply read off the coefficient that the linear regression assigns to the treatment. This is the very, very simple case that a lot of people used in the very beginning.
A similarly easy case is that you can simply count, quite literally, at least in the binary setting. If you have a table like the one up there, with the variables we identified that we need to measure, so Z4, Z5, X and Y, you can count the cases where the treatment was one and look at the average outcome for those, then look at the cases where the treatment was the other value and take the average of Y there, and, assuming the distribution of Z4 and Z5 is the same across the groups, just compare these averages to get the treatment effect. Often this obviously does not do the trick on its own and you would need to do some other things.
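And a hedged sketch of the counting idea in its stratified form (column names are illustrative): within each configuration of the adjustment variables, compare the average outcomes under the two treatments, then weight the differences by how common each configuration is. It assumes every stratum contains both treated and untreated units.

```python
import pandas as pd

def stratified_ate(df, outcome="Y", treatment="X", adjust=("Z4", "Z5")):
    """Backdoor adjustment by counting: sum_z P(Z=z) * (E[Y|X=1,z] - E[Y|X=0,z]).
    Assumes every stratum contains both treated and untreated rows (positivity)."""
    ate = 0.0
    for _, stratum in df.groupby(list(adjust)):
        weight = len(stratum) / len(df)                                  # P(Z = z)
        mean_treated = stratum.loc[stratum[treatment] == 1, outcome].mean()
        mean_control = stratum.loc[stratum[treatment] == 0, outcome].mean()
        ate += weight * (mean_treated - mean_control)
    return ate
```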
But let's talk about one more extension, or really a different way of thinking about causal inference. So far I have only been talking about structural causal models, or structural equation models, and do-calculus, but there is one more interpretation of it all, and that is the so-called potential outcomes framework. There the idea is that you always have potential outcomes: you always have Y0 and Y1, the outcome if I had taken action zero and the outcome if I had taken action one. You just do not observe both of them. They are always there, so you get a table like this one, where Y0 and Y1 are columns, but you can only ever observe one of the two, and this turns the whole problem into more of a missing data problem: your counterfactual, the outcome if you had taken the other action, is simply an unobserved, missing data value. And this is basically the same as what we did previously. The graph on the left is the one I showed you before, previously with the labels A, B and C; what you add in this graph are your potential outcomes Y0 and Y1, which are directly caused by the covariates, and the treatment only really selects which outcome you are actually seeing. This is where the so-called ignorability assumption comes in: the potential outcomes framework assumes that your potential outcomes are independent of your actual treatment variable given your conditioning set. And just to make sure, this is not the same as saying that your actual observed outcome is independent of your treatment given the conditioning set, because, as you can see here, your actual observed outcome is the one that gets chosen by the treatment that we actually apply. So this is exactly the same picture as before: if you know the confounding structure and you know do-calculus, you get the usual triangle graph, and you arrive at exactly the same thing using the potential outcomes framework as using Pearl's framework of do-calculus and structural causal models.
And then there are more sophisticated ways of actually estimating those quantities. One thing we can do, when the distribution of our conditioning variables is not the same across the treatment groups, is to look for examples where it is the same: calculate the treatment effects for those examples, and then average those treatment effects. That basically means we want to fill in those columns of the potential outcomes table, the outcome if we had applied a different treatment, for some instances or individual entries in our data set. What we do there is simply match instances with the same properties. In my toy case here the property is simply the colour of the individual, and we fill things in: in the first row we have an orange individual, we applied treatment one, so we do not know the potential outcome under treatment zero, but we do know the outcome under treatment one; somewhere further down we have a separate orange individual where we did apply the other treatment, and we know the outcome under that treatment. Because we believe the two to be very similar, at least they have the same colour (I did not think too hard about the example, the colour just stands for having the same properties), we can simply fill this in, use it to calculate the treatment effect for this subgroup of individuals, fill in the other entries by finding similar individuals in the other groups, average everything, and that gives our average treatment effect. The problem then becomes how you decide whether two individuals are similar or not, but I am not going to go into that right now.
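A hedged sketch of that matching idea, where the "colour" stands in for whatever covariates define similarity: for each unit, impute the missing potential outcome from the nearest unit that received the other treatment, then average the individual differences. The distance measure and the one-to-one matching are illustrative simplifications.

```python
import numpy as np

def matching_ate(covariates, treatment, outcome):
    """1-nearest-neighbour matching estimate of the average treatment effect.
    For each unit, the unobserved potential outcome is filled in with the
    outcome of the closest unit (in covariate space) from the other group."""
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]
    treatment = np.asarray(treatment)
    outcome = np.asarray(outcome, dtype=float)

    effects = []
    for i in range(len(outcome)):
        other = np.flatnonzero(treatment != treatment[i])        # units with the opposite treatment
        dists = np.linalg.norm(covariates[other] - covariates[i], axis=1)
        match = other[np.argmin(dists)]                          # the most similar such unit
        if treatment[i] == 1:
            effects.append(outcome[i] - outcome[match])          # y1 observed, y0 imputed
        else:
            effects.append(outcome[match] - outcome[i])          # y1 imputed, y0 observed
    return float(np.mean(effects))
```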
So, summing this up a bit: one problem that we have in causality a lot is, first of all, that there are different communities developing different methods under different assumptions. Just a quick question on the previous slide: the second orange person has a zero there, that is the treatment, but how do you estimate the treatment effect from that? Ah, sorry, yes, the treatment actually should not be shown there anymore; if you look on the right, we have treatment one for the one orange individual and treatment zero for the other orange individual, so we just take the outcome of the one and the outcome of the other one up there. OK, so under treatment zero the outcome is two, and under treatment one it is the other value? Yes, and then you can look at the difference and get the treatment effect for that pair. Any more questions on that, or anything else?
seen Global imprints both of them are somewhat separate communities working with like different methods and different assumptions the same is true for like call um cause of imprints using pearls framework as well as a DPO
framework and um it's got like a bunch of problems there kind of like trying to unify them but there's work on that we've seen that um DPO framework is actually equivalent to
um curls2 calculus and you can find rules to transition a graph into kind of like those assumptions and vice versa because there's also kind of like work that um yeah helps to like deal like
with unobserved compounding NPO for example and you can then use those assumptions that they make to build a graph and turn that like do it vice versa as well and um kind of like
another problem there is that a lot of classical methods are often actually restricted to just linear binary settings in both the input at least which is not true in the real world and there's been a lot of work in recent
years um for example double machine learning that is really good at dealing with some of those problems and um this is kind of like showing that the introduction of machine learning really helps with
scaling this to more realistic functions um but most recently deep learning has really made an impact on causality as well because of the machine learning this is kind of like what the second top
Just recapping the things we talked about, roughly one slide each, what I like to call the things we care about. First of all, interventions are not the same as conditionals; I keep trying to hammer this in, because sometimes people ask why you don't just condition on your data, and it is not the same. We then want to calculate, for example, average treatment effects on the whole population, regardless of subgroup, where you need to find enough data across all of these groups and marginalize appropriately. The next one is the conditional average treatment effect, where we are interested in subpopulation averages: say, sales in a specific country, the effect of a gene knockout on a specific cell line, the effect of a treatment for a specific disease or specific personal preconditions, and so on and so forth. And the last one is the individual treatment effect, where you are very much interested in questions like: if I take this drug, what is the effect on me personally, versus on Cheng or any other person in this room. That is the holy grail to some extent, where you get to personalization and can have the most impact.
Moving on, the next thing we covered: we saw different ways of doing causal discovery, using conditional independencies, and independence constraints in general, as well as different score functions and functional-form assumptions; essentially, how do we turn a table of data like the one up there into a graph, and you always need to make some assumptions to do that. And lastly, how do we turn this table of data, together with a graph, into an average treatment effect, or causal effects more generally. As mentioned previously, it is really important that you get a good graph here, otherwise the estimate at the bottom is most probably going to be wrong. Often causal inference uses domain knowledge to get those graphs, but we believe that causal discovery can have a big impact there, because a lot of domains have either too many variables or not enough domain knowledge to actually identify these graphs. So what we are doing is using data and principled assumptions to find the effect of an action, and we can either use something like the potential outcomes framework, where we assume ignorability, meaning that our potential outcomes are independent of the treatment, or we can use Pearl's framework, where we assume that at least a subgraph is known; again, they are equivalent. Using those ingredients we can then estimate the effect with either parametric or non-parametric models, parametric being for example a linear regression, non-parametric being simply counting and averaging things. And that's it; I think we are basically at time, sorry for running a bit long, but if there are any questions, please ask.
I have one question. Cheng mentioned that for causal discovery we don't necessarily need to have interventional data; if we do have interventional data, does it help or improve our chances of getting a better graph? It helps a lot, significantly. Any kind of prior knowledge helps a lot, because it is really hard to find the graph with only observational data. With interventional data you already know some of the edges, so you effectively have a partial graph; different methods can encode that differently, but it narrows down the search space significantly. And it is not only interventional data: any domain prior knowledge helps a lot. In our team we even find situations where domain experts only have a guess, like "I feel this one is more likely a cause" or "this edge is very unlikely". In some situations, for example where they have run interventions, the experts really do know, but in many other situations it is more like five people think an edge exists and another three disagree. Even then, encoding this with the appropriate uncertainty helps improve the quality a lot, because the search space is huge, and getting the graph really correct, especially with a limited amount of data, is extremely hard. So any interventional data or domain knowledge is very helpful. OK, thank you.
So what we are doing is trying to use causal machine learning to tackle difficult decision optimization problems and to answer what-if questions. For doctors, we want to answer what will happen to the patient if I give treatment A versus treatment B; for a business owner, what will happen to my revenue if I give a 20% discount. If we know the answers, then the decision making is easy, because we just choose the better option. Let's recap the problem we are looking at: we want to do causal machine learning with existing data. In all these scenarios, what people give us is a table, their historical data, which can be a mix of observational and interventional data, and what they want is, in some sense, a high-resolution, large-scale table without the missing entries: you have many partners, customers, patients, users, you have all the different actions, and you want the hidden counterfactual information, or you just want to know what happens under each intervention, the interventional distribution, and then you can make your decision. If you look at this, we want to do it at low cost, at large scale, in real time, and at high resolution; that is what causal machine learning should be about. But if you look at what we introduced in the first hour, none of it answers this by itself. So now let me define what end-to-end causal inference actually means. We want to use only the user's data, but instead of treating causal discovery and causal inference as separate problems, we would like to take the user's data and help them optimize their decisions based on this big table, in real time and at large scale. To be able to do that, we need to know the graph, at least partial or summary graph information, and we need to know the functional relationships, so that we can compute the effects. This task is what we call end-to-end causal inference: in some sense it joins causal discovery and causal inference, and the input is just the data. That is our definition, and you can see this is what the real world actually needs: very few people care about either piece in isolation, and for most of the people we talk with, especially sitting in a corporate lab, this is what they really need, which motivates us to aim for a general solution.
But the existing methods are not sufficient. Existing methods either do causal discovery, where we have a table and return a graph, a partial graph, or a set of graphs, or they do causal inference with a completely different set of assumptions, where the graph is assumed to be given; none of them solves this problem end to end. And there are more issues, which the next part of the talk will touch on: even within causal inference, people make different assumptions. You might look at discovery and inference and say, fine, we just put them together and the problem is solved, but that is actually naive. Let me give a worst-case scenario you might run into. Say you use LiNGAM for causal discovery, which requires the model to be linear with non-Gaussian noise; if the model were linear Gaussian, that is exactly the case it cannot identify. But then you use a causal inference method that only works with linear Gaussian noise models, which is true of many causal inference methods. Put the two together and you have solved absolutely nothing. This is the worst-case scenario, but you can see the problem. And there is more: the causal inference methods we talked about assume you already hand them a graph, a full graph, but the causal discovery methods I talked about return CPDAGs or other equivalence-class representations. None of the outputs from causal discovery is in the format that the causal inference side needs; we do research assuming we will be given the full graph, but that is actually not true. So even in the cases where chaining them in a row works, you only get a little bit better; the general problem remains.
Moreover, because these two communities are separate, the causality community has almost treated machine learning as an arch enemy, and the two fields work very differently. One of the fundamental things in causality is that we care about theory; it is very important to care about the theory, because otherwise we cannot claim anything is causal. But meanwhile, with theory alone, I still read papers that scale up to six nodes, and that does not solve real-world problems. If a medical doctor comes and wants to know how to treat patients, there are a lot of patients in the hospital; businesses have a lot of customers; there are a lot of variables and a lot of decisions to make. So we need to be able to solve real-world, large-scale problems, and for that we need machine learning, especially deep learning types of models, to make it scalable, flexible, and applicable to the real world; we cannot only look at six-node linear Gaussian models. That is what the whole second part of the talk is about, and it is also part of our group's mission: we want to do deep end-to-end causal inference. The goal is that for general users we can deliver general solutions: people provide historical data, and we give them all the insights they need. OK, let's get started. I would say this is a field where we have more questions than answers; we have made some steps, which we will explain and show you, and we are also inviting everyone to join us in contributing to this. I personally feel very, very excited about it.
OK, let's recall what we talked about: structural equation models. Structural equation models tell us how an effect is generated from its causes: they specify exactly how each variable depends on its parents, with a noise term, as shown here. But if you think about it, this is a generative model. A causal system, in a way, takes exogenous noise as input, and what you observe is just the generated data; it is a generative system. The difference is only that we care about the true underlying mechanism of how the data are generated. And generative modelling is exactly where machine learning, especially in recent years, has developed so many different models with the flexibility to solve real-world problems: my favourites are variational autoencoders and normalizing flows, along with the more recent generative models. People have been using these models by taking noise as input and modelling, say, images; in some way they take a noise input and generate the observational data. So we can really bridge the gap and think about how we can use these advanced machine learning tools to solve the causal questions; deep generative models are a perfect match, because a causal system is a generative system. So now let's get into the details of bridging deep generative models and structural causal models, and see how well we can do.
So now let's go into detail on bridging generative models with structural causal models and see how well we can do. Given the data, we care about the graph, right? How can we do that? One thing we did is build on the work of the great researchers who came before us: we can use a score-based causal discovery method. We choose an objective - the score - that measures how well a graph explains the data; here that is the marginal likelihood, which is very well established, where G is the graph we need to learn and X is the observed data, and we can factorize it this way.
Now we want this to be scalable and applicable to the real world, so let's look at the likelihood. The likelihood just tells us how well the graph explains the data. We do not know the graph yet, but assume for a moment that we have it; then we can think about a transformation. In general we can define a transformation like this: z is the observation x minus the function of its parents under the graph G. This is an invertible transformation when G is a DAG - we prove this in the paper - and if you can define such an invertible transformation, you can see this is just the additive noise model rearranged.
In causal discovery, when we talk about functional causal models, we said that if you make some mild assumptions about the functional form you can show identifiability. The non-linear additive noise model is one of the cases in causal discovery that is identifiable, so in this setting, with some other assumptions and infinite data, we can identify the unique causal graph. So if we find this transformation, what we can do is define our likelihood like this. In this case you could plug in many generative models, but we use a normalizing flow to define the likelihood. You can see it is the same type of score-based method; we are just being more flexible about the likelihood, so we can utilize modern machine learning methods and go beyond the very constrained basic cases like linear regression. So we define the likelihood using a normalizing flow, and you can think of this as an extension of an earlier paper on causal autoregressive flows; in that paper, though, the graph is already given - they consider only two nodes and fit both directions to see which one is better - whereas if we think about the whole objective, we generalize it.
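As a rough illustration of that likelihood construction (assumed names, not the paper's implementation: `f` stands for the learned functions mapping each node's parents under graph `A` to a prediction, and a standard normal base density stands in for the learned normalizing flow on the residuals):

```python
import torch

def log_likelihood(x, A, f):
    """x: (n, d) data; A: (d, d) adjacency matrix; f(x, A): per-node predictions from parents."""
    z = x - f(x, A)   # residuals of the additive noise model; this map is invertible when A is a DAG
    # Under a topological ordering the Jacobian of x -> z is triangular with unit diagonal,
    # so log|det J| = 0 and the log-density reduces to the base density of the residuals.
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z).sum()
```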
You have a question? Yeah, I'm just curious: you said that when the noise is gaussian the causal direction is not identifiable, but here you're using an autoregressive flow, which assumes the noise is gaussian, right? So isn't it non-identifiable? It is only non-identifiable when the model is linear with gaussian noise; non-linear with gaussian noise is identifiable. Okay, that's great, thank you. So in the first, basic version we assume gaussian noise, and the functional form is shown down here. Okay, so we have the likelihood, and now we need to think about the prior. What do we want the prior to carry? A couple of things. First, you can see we need the graph to be acyclic - we need a DAG - and we already know there is a constraint formulation for that: the continuous acyclicity constraint, where the term equal to zero tells us the graph is a DAG, so we can build it into our prior as the first term. Why put it in this form? Because we can then optimize it with something like an augmented Lagrangian, which is very simple. And of course, in the spirit of BIC, we want the graph to be sparse, so the other term in the prior is a sparsity penalty on the adjacency matrix. With this prior form you can actually add a lot more; back to your earlier question, if we have domain experts we can easily set other constraints, and we can encode domain expertise into the prior - which edges are more likely and which are not. So we can define the prior like this.
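A minimal sketch of such a prior (illustrative weights, not the paper's exact form): it combines a NOTEARS-style continuous acyclicity penalty with an L1 sparsity penalty on a soft adjacency matrix, plus an optional term for edge-level domain knowledge:

```python
import torch

def dag_penalty(A):
    # Continuous DAG characterization: h(A) = tr(exp(A * A)) - d equals 0 iff A has no cycles.
    d = A.shape[0]
    return torch.trace(torch.matrix_exp(A * A)) - d

def log_prior(A, lambda_dag=10.0, lambda_sparse=1.0, expert_logits=None):
    lp = -lambda_dag * dag_penalty(A) - lambda_sparse * A.abs().sum()
    if expert_logits is not None:          # optional prior belief about individual edges
        lp = lp + (expert_logits * A).sum()
    return lp
```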
Some people know me as an approximate inference person, so now this becomes an approximate inference problem: we have our objective, we want to find the graph with the best score, we use a normalizing flow for the likelihood, and we define our prior encouraging acyclicity and sparsity and all these things. We do not know how to compute the posterior over G - G is a global structure, and since G is a graph, in practice we represent it with an adjacency matrix. For the first step I used my good old hammer called mean-field variational inference, and then we can construct the evidence lower bound. Mean-field variational inference is, in general, the standard approach: we assume a Q distribution and we minimize the KL divergence - a fairly standard technique. So we convert a causal discovery problem into an approximate inference problem, and with variational inference we convert the whole thing into an optimization problem.
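A minimal sketch of that optimization (illustrative: `log_likelihood` and `log_prior` are placeholders for the flow-based likelihood and graph prior sketched above with the functional parameters folded in, and the relaxed-Bernoulli sampling is just one way to keep the sampled graph differentiable):

```python
import torch

d = 5                                                  # number of variables (assumption)
edge_logits = torch.zeros(d, d, requires_grad=True)    # mean-field q(G): one Bernoulli per edge

def elbo(x, n_samples=8):
    probs = torch.sigmoid(edge_logits)
    q = torch.distributions.RelaxedBernoulli(temperature=torch.tensor(0.5), probs=probs)
    mc = 0.0
    for _ in range(n_samples):
        A = q.rsample()                                # soft adjacency sample, differentiable
        mc = mc + log_likelihood(x, A) + log_prior(A)
    entropy = torch.distributions.Bernoulli(probs=probs).entropy().sum()
    return mc / n_samples + entropy                    # E_q[log p(x|G) + log p(G)] + H[q]

# Maximizing elbo(x) with any torch optimizer turns graph learning into optimization.
```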
Of course, without theory we cannot claim this is causal, so let me go into the details. In the paper we prove that, with infinite data and a couple of other standard assumptions - things like the Markov property, faithfulness, and no latent confounders - all detailed in the paper, we are able to identify the correct causal graph; under infinite data we can find the correct causal graph. So, to summarize the first method we propose: given the data, we construct an objective that is in line with score-based discovery, convert it to an evidence lower bound, and thereby turn the whole thing into an optimization problem, casting graph learning as an approximate inference problem. This is extremely flexible because it uses normalizing flows, and extremely scalable because all the good techniques from modern deep learning optimization can be used here. You learn the graph - the learned posterior is your graph - and meanwhile, because you are doing the optimization jointly, you have already learned the functional relationships implicitly. Now Nick will tell you how we can do treatment effect estimation and decision making from here. Thank you.
So we have seen how to learn the graph and how to learn the functions by having a deep learning model with the graph as a latent variable and optimizing everything jointly. Now the question is how we actually get those treatment effect estimates out of it. You might say: we have a generative model, it should be straightforward - and it kind of is, but there are a few small nitty-gritty details I should talk about. So let's talk about our graph posterior first. Because we have this mean-field variational inference, as Cheng was saying, what we really have is a distribution over graphs: we might find one graph and say this is the correct graph and assign it a certain probability, but find another graph, with one additional edge here and one missing edge there, to which we assign a lower probability - but it is still not zero. Because we cannot know which of the graphs in our distribution is the correct one, we need to look at the full distribution when we do our treatment effect estimation. The way we get the ATE or CATE out is to take a Bayesian view and do model averaging: we calculate the expectation of the ATE or CATE, marginalizing out our graph. What does that look like in practice? Just to remind you, the average treatment effect is itself an expectation: we want the expected outcome given some intervention minus the expected outcome under another intervention. How do we estimate this? First of all,
we do what we have previously done when we have a single graph: we look at our intervention and cut the corresponding edges - actually mutilate the graph - to get rid of any dependency that the treatment variable has on its parents, which we would otherwise also use in our deep generative model. So rather than doing the forward propagation for treatment effect estimation on the original graph, we actually put in this new graph with the edges cut and the treatment variable set to a fixed value. We are only talking about atomic interventions here, for anybody who might be wondering about more complicated things: we say let's set this treatment variable to a single value, we cut the edges, and then the magic sauce really boils down to Monte Carlo estimation. The problem is that we have a distribution over graphs: we can sample from it, but it is hard to marginalize out in closed form - not just because it is a deep learning model, but especially because it is a deep learning model that depends on the graph. Anybody who has done deep-learning-based probabilistic computations before will know that marginalizing deep learning models analytically is really hard and basically impossible.
So what we do instead is sample a lot of graphs. For every graph that we sample - this is a very graphical example - we mutilate it, put our intervention in, and then sample from the interventional distribution of our outcome variable; we use our neural networks to forward-propagate. If you have a chain of edges between your treatment variable and the outcome variable, what you need to do is set your treatment variable, estimate the intermediate nodes, put those new values in, and run this forward until you reach the outcome variable.
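A minimal sketch of that forward pass (assumed data structures: a topological order, a parent list, per-node mechanisms `f[node]` and noise samplers `noise[node]`; intervened nodes are simply clamped):

```python
import numpy as np

def forward_simulate(order, parents, f, noise, interventions, n):
    """Ancestral sampling of the mutilated model: visit nodes in topological order,
    clamp any intervened node, and sample every other node from its parents plus noise."""
    samples = {}
    for node in order:
        if node in interventions:
            samples[node] = np.full(n, interventions[node])      # do(X_node = value)
        else:
            if parents[node]:
                pa = np.stack([samples[p] for p in parents[node]], axis=1)
            else:
                pa = np.zeros((n, 0))
            samples[node] = f[node](pa) + noise[node](n)         # x := f(parents) + z
    return samples
```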
You do this both for the first intervention and for the second intervention, and for all the graphs that we sample from the graph posterior. Given all of those samples, we then simply calculate the average of the samples of our outcome variable under one intervention minus the average of the samples under the other, across all the graphs and all the samples we get, and this then gives us the average treatment effect.
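Putting the pieces together, a minimal sketch of this ATE estimate (the `sample_graph` / `simulate` methods are hypothetical stand-ins for the learned graph posterior and the forward simulation above; the adjacency convention is A[i, j] = 1 for an edge i -> j):

```python
import torch

def estimate_ate(model, treat_idx, value_a, value_b, n_graphs=50, n_samples=500):
    effects = []
    for _ in range(n_graphs):
        A = model.sample_graph().clone()         # G ~ q(G), the learned graph posterior
        A[:, treat_idx] = 0                      # mutilate: cut all edges into the treatment
        y_a = model.simulate(A, {treat_idx: value_a}, n_samples)   # outcome under do(X_t = a)
        y_b = model.simulate(A, {treat_idx: value_b}, n_samples)   # outcome under do(X_t = b)
        effects.append(y_a.mean() - y_b.mean())
    return torch.stack(effects).mean()           # Monte Carlo average over graphs and samples
```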
Taking the difference between them, we are good, and we can do that. A question: where the parent is - I guess that would be t here, right? Yes, t is the treatment. So let's say, for t, are you sampling from the distribution of x_t? No, x_t is the treatment variable, so we set it to a fixed value. But if we had, say, a variable in between x_t and x_e - actually, I'm asking about the adjustment sets; I didn't mean to ask about t. Right, so for the adjustment set: yes, we do sample all of that as well. All right, but I don't see that on your slides - where is that? You still need it, even for the ATE. Well, yes; they are not on the slides here because this is already marginalized - it is the joint, multi-dimensional distribution. Right, but you could also have used a non-parametric estimator for the adjustment set given a graph structure - just the raw data distribution. It is more a question of why we would even want to find an adjustment set: now we know the whole graph, so there are a couple of ways forward. We could find the adjustment set and do it the more traditional way, but now that we have a scalable deep learning model, even with the full graph we just apply the simple graph mutilation rule, like in the previous slide, so we do not even need to consider the adjustment set - we've got everything;
we also do the forward simulation all together. Basically, to walk you through this example: we have a generative model that knows how to sample x_0, so we sample from that; we know we do not have to simulate x_t, because x_t is our intervention; and then to sample x_e we pass in our samples of x_0 and our treatment - multiple samples of x_0, giving multiple values of x_e. Thank you for clarifying, that does answer my question. I just want to point out that you could also have used a non-parametric estimate for x_0, since you could have just used the original data distribution; with what you're doing here, any error in estimating p(x_0) would carry over - but I think this does make sense. Yeah, exactly - this is essentially a non-parametric sample from the data; if we assume our marginal distributions are estimated perfectly, you could draw the same samples.
A follow-up question: if you have a lot of data, what do you need to estimate to get a generative model out of it? I understand you can estimate a graph, but it seems like you need to first estimate the graph. Well, using the generative model to estimate the graph is one thing, and the second thing depends on the treatment effects you want to estimate. For ATEs there is a lot you can do with just the data distribution, but we use the full deep generative model because you can then decide post hoc which interventions you want to run - you can decide that you now want to intervene on x_0, or on x_c rather than x_t, after you have trained the full model on all the data. That gives you a lot of flexibility at so-called test time, or deployment time, and it lets you look at a lot more properties. And for the next thing I was about to talk about, conditional average treatment effects, the problem is that you will have a harder time estimating those from observational data alone with a non-parametric estimate, whereas we can do some neat tricks with generative models to actually estimate them.
Cool - so that was the normal average treatment effect: as mentioned, we forward-sample in our generative model, having mutilated the graph and set the treatment assignment. Obviously everything we have done so far assumes no unobserved confounding, which is the same assumption we have made previously and that is made most of the time in causal inference; otherwise some of this fails, but we will get to that later. Talking about the conditional average treatment effect, though: if we want to actually condition - quickly going back to show you the graph again - if you want to condition on x_c here, where x_c is a descendant of the adjustment set, or of the conditioning set you need to marginalize over, and does not have a direct impact on your outcome variable x_e, you will have a hard time, because you would need to apply Bayes' rule and marginalize some things out, which is very difficult, as those of you who have done Bayesian deep learning or more Bayesian probability theory will know. The problem is that we cannot easily estimate the conditional distribution of x_0 given x_c, because we have deep learning components for all of that and you would basically need to invert one of the edges in our model. What we do instead is just use another model
as a surrogate model. We do not use a deep learning model here because we want to be a bit more robust. What we do instead is sample a lot of data from our interventional distribution - not conditioning on the conditioning variable, but simply from the distribution with the intervention applied. We can then train a model that predicts the outcome variable as a function of your conditioning variable. Rather than using the conditioning variable as is, we lift it with a set of random Fourier features to make this a bit non-linear, and then use a linear surrogate model - essentially just running a linear regression on those features - so we end up with a model that predicts the outcome given the conditioning variable, or the conditioning set, on which you want to evaluate your conditional average treatment effect. You actually need to train two of these surrogate models, one for your first interventional value and one for your second; you just sample a lot of values to train the surrogates,
and the conditional average treatment effect then becomes the difference between those surrogate models evaluated at your conditioning value. You are still marginalizing over graphs, so there is another layer of complexity in there, but otherwise you do the same thing as we have done previously, just adding those surrogate models.
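A minimal sketch of one such surrogate (illustrative, not the paper's exact recipe: random Fourier features of the conditioning variable followed by least-squares regression onto interventional samples of the outcome):

```python
import numpy as np

rng = np.random.default_rng(0)

class RFFSurrogate:
    """Linear regression on random Fourier features of the conditioning variable."""
    def __init__(self, dim, n_features=100, lengthscale=1.0):
        self.w = rng.normal(0.0, 1.0 / lengthscale, (dim, n_features))
        self.b = rng.uniform(0.0, 2 * np.pi, n_features)

    def _phi(self, xc):
        return np.cos(xc @ self.w + self.b)

    def fit(self, xc, xe):
        # xc, xe: samples drawn from ONE interventional distribution of the generative model.
        self.coef, *_ = np.linalg.lstsq(self._phi(xc), xe, rcond=None)
        return self

    def predict(self, xc):
        return self._phi(xc) @ self.coef

# CATE at a conditioning value: the difference of two surrogates, one per intervention value,
# e.g. cate = surrogate_a.predict(xc_query) - surrogate_b.predict(xc_query)
```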
So the surrogate model is predicting the outcome variable? Going back two slides: for the surrogate model, basically, we sample a lot of x_e's given our treatment, and we train a model that predicts x_e from x_c, so that we can condition on x_c to find the value of x_e given that we ran this intervention - so that we can effectively run this arrow backwards, which is otherwise hard to do. Right, I understand the motivation, but then don't all the other variables become unobserved confounders for your surrogate model? They are not unobserved confounders, because we trained the surrogate model only on the interventional distribution. All right. Because we only use the interventional samples for this surrogate model, it learns to model the interventional relationship, and the interventional data is generated by your generative model.
If I can just add something before the next slide: mentally, you can think of this type of model as building a huge simulator - the deep generative model becomes a huge simulator that respects the true causal structure. So for all of these things, whether conditional or interventional treatments, what we can do is brute-force the simulator to generate a lot and a lot of data, and that is what the surrogate model is trained on, so it ends up being a really good model. I think Nick has a visualization. Yes, exactly - it's a very neat
graphic that one of our collaborators made for the paper. What we see here is, again, the correct graph on the top left with our four variables x_0, x_c, x_t, and x_e. We want to calculate the conditional average treatment effect of the treatment variable x_t on the outcome variable x_e, conditioned on this conditioning variable x_c, and we have some observed distribution over here: the joint distribution of the outcome variable x_e on the y-axis and the conditioning variable x_c on the x-axis, which is literally just a density plot of our observations. We have then learned a generative model to estimate this observational distribution, which is what we show in this column: the middle column on the left shows the observational distribution from our generative model using the correct graph, where gray is the true observational data and blue is our learned observational distribution - they are fairly similar, and fairly good. We also see that the second graph we have learned has a different observational distribution, but it is still a very good fit to the data distribution, which is exactly the problem with identifying causal structure from data: if you have different models with the same or very similar observational distributions, it becomes harder to distinguish between them. So we are stuck with those two graphs, and we now want to look at the interventional distribution
of x_t on x_e, which we show on the right, where we have different interventional distributions for different intervention values. The red one uses our intervention setting x_t to one value, and the blue one is our reference or baseline where we set x_t to a different value. You can then look at the outcome just by drawing this line at the conditioning variable x_c - say we want to condition on x_c equal to two - and then you find the value of the blue distribution and of the red distribution and take the difference between them to get the true conditional average treatment effect. There's another question.
How are you using the density over x_e in this procedure - what is it needed for? Is it just for visualization, or do you have to fit it? This is just a histogram, like a kernel density estimate of the relationship, purely for visualization. We already have the joint distribution of everything for the entire data set; if we had, say, a hundred variables we could plot this for any of them. So this is a visualization to support the explanation. I understand, but in the procedure, the surrogate that you're learning - are you modeling x_e as well? Okay, I'll come back to that in a second.
Just to show what is actually on the slide: this is the ground truth that we want to compare to, and for actually estimating the CATE, we take our first graph and sample a bunch of variables under the intervention and under the reference distribution. So we have our interventional distribution here, the green one: we sample a bunch of variables, transform the conditioning variable using random Fourier features, and learn the surrogate model, which gives the green curve - non-linear because of the random Fourier features. We then sample a bunch of variables from our reference intervention, again get our outcome variable x_e, and learn a surrogate model, again using those random Fourier features, that regresses x_e - your y variable, on the y-axis - on x_c, on the x-axis. So you get those two lines, you plug in your conditioning value of two, which is this dashed line here, you find the value of your intervention surrogate model and the value of your reference surrogate model, you take the difference, and you get a conditional average treatment effect value, which here is 2.33 - not perfect, but decently close. Oh, does somebody want to read out the question in the chat at the moment?
Can you hear me? Yep. Okay, so there is a question in the chat which says: why do you need the surrogate - can you sample from x_c and x_e when you do the intervention and keep just the samples where x_c has the correct values in the conditioning set? Yes, we can do that. The problem is there might not be many such samples, or, if x_c is continuous, there might not be any value that is exactly your conditioning value, and especially if you have a multi-dimensional conditioning set it will be very hard to hit exactly the value you want. You would then need to come up with some other kind of matching algorithm, defining some distance function between conditioning sets to say this is close enough to the actual conditioning set I care about, and then average only those samples. So this is just one choice of doing it; you could also use such a matching algorithm and find samples close to the conditioning values you care about - it is just a different way of approaching this.
Thank you. So, going back to this: we trained the surrogate model on our first graph and its set of interventional distributions; for the second graph we do the same thing, where we sample values from the interventional distributions given the two different intervention values, learn surrogate models, take the difference, and get a second value, which is 1.1. Given that this is the wrong graph, it is expected that this value is not perfect and not as good as the previous one. We then take the weighted average of the different CATE estimates, taking the probability of each graph into account, and arrive at our final CATE estimate, which in this case is 1.84 - close enough to the true CATE to actually be valuable and useful.
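That last averaging step is simple; a sketch (the per-graph estimates and weights are placeholders, not actual outputs from the talk):

```python
import numpy as np

def posterior_weighted_cate(per_graph_cates, graph_probs):
    cates = np.asarray(per_graph_cates, dtype=float)   # e.g. one CATE estimate per sampled graph
    w = np.asarray(graph_probs, dtype=float)           # approximate posterior mass of each graph
    return float((w / w.sum()) @ cates)                # posterior-weighted model average
```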
So we can also do ATE estimation using those surrogate models. Yes, we could have used something else; this was just our choice, to have a decently robust way of estimating those conditional average treatment effects for any type of arbitrary conditioning set, be it a single variable or multivariate. Cool - so, too long, didn't read: we have a pretty cool, in my opinion, deep generative model that combines normalizing flows and neural networks with variational inference for learning the graph as well as the functions from observational data. It just takes some noise in, and you can then sample from the observational distribution as well as your interventional distribution, and can calculate average treatment effects as well as conditional average treatment effects - and people can even do things like individual treatment effect estimation, which we have not talked about here at all but is fairly simple using this model. The neat thing really is that you only need to train it once: you have your observational data, and you take all of this as well as some idea about the graph.
If you have some domain knowledge, as Cheng mentioned earlier, that helps a lot: you can say 'I believe this edge should be there and I give it a 70 percent probability', or 'I know that revenue is an effect variable and it should not have any descendants' and put a hard constraint in. You take all of this domain knowledge together with the data, and we then learn the graph, the functions, and this whole deep generative model, and you get a few things out of that. First of all, you have a graph, which in itself can be useful - think of gene regulatory networks learned from data, that could be something cool - just finding out which interactions are in the data. But you can also fill in the table that Cheng was talking about: getting those interventions, those counterfactuals, out, so you know the effect you would see if you ran a certain intervention on an individual. You can then use this to calculate average treatment effects, individual treatment effects, and conditional average treatment effects; but you can also, as a third thing, use it for personalized decision making at scale, and this is the main thing we care about: you can run all of this at scale with a relatively minor upfront cost, by training this model once, and then you have a lot of downstream applications. Please - can you hear me?
Yeah. So, given the difference between categorical and numerical data, can we have joint numerical and categorical data as the input data matrix? Yes, that is actually one of the next slides. Previously, the main thing we have been talking about is this non-linear additive noise model using gaussian noise, only looking at continuous variables with everything observed, but in the actual paper and the code base we relax those assumptions quite a lot. We use a normalizing flow as the noise model as well - a spline normalizing flow that allows us to have more complex noise distributions; we allow for categorical, binary, or any other type of discrete data, as well as grouping variables into a multivariate variable on a node, so that you can deal with any type of data, really. And lastly, we have also added an imputation network so that you can fill in missing observations in real-world data, which is something we observe quite a lot and which is quite important. Does that answer your question, or did you have more questions around the categorical data? Yeah, that's enough, thank you.
Thank you. That was just to give you an overview; there are many more features that we already deal with, but there is also still a lot more to do. However, we have already used this for some scalable applications internally, where we work with graphs at hundreds of variables - we have 500 or so raw values that we group into 100-and-something variables - and we get decent graph discovery performance out of that, where we then talk with domain experts who tell us that it actually makes sense, and we also use the treatment effect estimations to provide business insights. So this is really scalable, really usable, and gives good performance.
However, as I said, a lot of things are missing; one of them is, for example, estimating effects over time. I am going to run quickly over this, just to talk about some extensions: we have had some recent work extending this into the time series domain, because in a lot of cases you might actually be interested in, if I start taking an action now, or I have a few actions I can take, how does something change? For us, revenue is something that matters - how does the revenue change over time? And there might be differences: if you look at the outcome in two months then action B is really good, but if you look at the outcome in five months, action A might be really good. So the time series domain and the time dimension really matter, and there are some nice properties to it: when you look at the temporal graph, it has this autoregressive nature - edges cannot go backwards, the present or future does not influence your past - which is a really strong constraint that already makes the causal discovery a bit easier. I am not going to go through the details or say much more about this right now, but there is a paper online already, and otherwise just come talk to us; there we have a very flexible functional form that can deal with all types of noise and temporal data. That is one extension. I think there might be some applications even in biology - I know it is really hard to measure cell states over time - but there might be some other applications, in healthcare at least. Cool - so I think there are some really cool things to do there.
The second thing that we have also had some work on is actually dealing with confounders. In a business scenario again, you would say some sales are impacted by whether you give a discount or not, and by some other observed variables, but there is also something like the economic situation in the world, which has a really high impact on that. We can then start modeling this. One limitation is that we cannot have a direct edge between the variables that the confounder is confounding: in our case here, the economic situation has an impact on, say, salaries and sales, but to make our method work we need to assume there is no direct edge between those two variables, which is the so-called bow-free assumption. It is a first step, at least; making unobserved confounding work in our framework, or in causality in general, is a hard problem to solve.
So overall, what we really do is bring causality and deep learning together. Causality gives us a lot of good theory to prove that we find the correct causal graph and can estimate those causal effects, but some of its impact has been limited because of very restrictive assumptions and low scalability to large data. On the other hand, we have deep learning, which has had some really good impact with image classifiers and large language models - just last week, again, with the new OpenAI model - but all of them are correlation based and do not really tell you much about the actual causal reality. We are bringing both of them together to build scalable causal AI solutions. Another thing to talk
about here as well is that we are transforming the traditional causal pipeline. Previously, you would have a human specify a graph and give you some data from that specified graph; you would then do a causal identification or verification step, where you do something like do-calculus to find a way of estimating your treatment effect given the graph and the data you have - essentially getting rid of all the other nodes you do not actually need - and using this so-called estimand you would then run a causal estimation step, for example some regression algorithm, to get to the causal effect. We are really transforming this into a large-scale general solution where you can provide some incomplete prior knowledge, or possibly none at all; we run causal discovery and causal inference in the same algorithm and really do this in an end-to-end, general pipeline and method. This is our real approach to solving real-world problems, by putting all of those things together.
And we are on trend here: causal machine learning is one of the highly emerging topics according to recent technology reviews, and it is at the beginning of the innovation cycle, so we believe there is a lot of impact we can still have by working on this a bit more. We see the same thing when we look at causality papers on arXiv, so I can only encourage you to think about causality more, because it is a really cool, newly emerging field and a lot can be done. One way to do that is to look at our code on GitHub - all of it is open source, so you can just download the framework and run it; if there are any questions, email us or raise a GitHub issue. Also, for any students or anybody else out there, we are hiring for intern and full-time positions right now; there are links down there, send us an email, or just come ask a question. And this is really it: we hope that scalable, real-world-applicable causal AI can have a lot of impact in a lot of domains - not just business, but we have also worked with education, where we had a competition with actual interventional data showing that a lot can be done, and also healthcare, science, and all of that. Thank you. Hello - hey, can I ask a question? Hello.
There is a question in the chat; if anybody in the audience has questions, please feel free to go to the microphones. Go ahead. Great - thanks for the talk. I had two questions. One, I guess, was about the treatment effect estimation: the posterior your approach assigns to graphs says something about how well each graph fits the observational data, so for this to work, the approach assumes that correctness in terms of fitting the observational data should also reflect correctness of the treatment effect. How does that assumption really work?
I guess there are different ways to think about it. The main thing, to preface this, is the theory that Cheng showed: we have a proof that the graph we learn is the correct one, given that the data-generating process is part of the model class we assume, we have infinite data, and training fully converges to the global optimum. You see there are a lot of ifs; if all of this is true, then we get the correct graph, and in a lot of easy scenarios we see that we actually converge to a single graph, which is the correct one, and that is fine. We only really need this distribution over graphs because we have limited data, there might be some mismatch in the modeling assumptions, or training deep learning does not always converge to the global optimum. That is why we really start using this graph posterior. And yes, you could say let's just use the most likely graph, which is most probably the correct graph - it gives really good results, and in all the benchmark tables we have in the paper we include it as a baseline, and it gets good results. But it is not always the best performance, simply because you might be somewhat off, or you might have, say, five graphs each with about 20 percent probability where one of them has 21 percent, and you do not want to overweight the single graph that is just a tiny bit better than all the rest; you do get some performance from including the others. We improved performance by doing that, in most cases at least.
And the second quick thing: I guess with your dependence on normalizing flows you are not addressing the causal representation learning issues so much, so how do you envision that fitting in downstream? Yeah, I guess this is a different-side-of-the-same-coin kind of question. Right now we have really focused on doing deep learning or machine learning for causality, not necessarily the other direction, which you could call causal representation learning. There is a lot of work out there from a lot of groups - Bengio, Schölkopf, and so on and so forth - and there is not an easy way to use our model for that; you could run our graph discovery algorithm on, say, images and try to find relationships between pixels, but I do not believe that is going to be really useful, so there is still a lot of work to be done there. Let me add something: I do not think causal machine learning and causal representation learning are two different things at all - it is actually the same. Representation learning is learning latent confounders; the latent confounder, like in the bow-free work we were doing, is itself a representation. I guess I am just interested in when you do not have, you know, the pixel-level variables. Exactly - so the same algorithm for latent confounding can run on that type of data; it naturally finds the grouping. We have a latent confounder with multiple observations, and that latent confounder is the representation, which is representation learning. For example, there is a lot of recent work - Kun Zhang has recent work - about using latent confounders and all these theories, and solving exactly this, evaluating on image data and representation learning. So theoretically, under the hood, there is a lot of common theory; the implementations differ, in scale and all of this, but underneath, latent confounder learning and representation learning share the underlying theory. Great, thank you so much for your answers.
Okay, there are a couple more questions in the chat. One says: normalizing flows are well suited to the gaussian noise model; have you tried diffusion normalizing flows for sharper, more extreme noise distributions? Given the variational approximation, does that even make sense to try for graph discovery? And the person says: perhaps not, since you upgraded to splines and used Monte Carlo estimates for inference. So let's first go to the normalizing flow question, and then I might ask you to repeat the second part. Normalizing flows versus diffusion models: the way I understood this question - first of all, we use normalizing flows precisely so that we do not use gaussian noise. We talked about this noise assumption, the noise that is added onto your structural equation model when you say your data is a deterministic function of its parents plus noise - that is where this additive noise term comes from and why it is called an additive noise model. Originally we used gaussian noise there, but once we actually use splines or other normalizing flows, this noise term is not gaussian anymore. Yes, our base distribution is gaussian, but the actual noise distribution that we add to the variables and use as a likelihood term is not gaussian anymore - at least, I would say it is not gaussian. We have not tried using diffusion models yet as a likelihood model, or to learn this noise term, but it is certainly something interesting to try. Pedro Sanchez from the University of Edinburgh, working with Sotirios - I always forget his last name - look up Pedro Sanchez; he has some work on diffusion models for counterfactual inference,
but it is not integrated into a causal discovery framework yet, doing all of this end to end. Can I just add something? Running this is very easy, but the thing is, you have to prove it is causal and has identifiability, and that is challenging. For the moment, the z we use corresponds exactly to the additive noise in the structural equation model, but for a diffusion model, where you add noise through the diffusion process, what does that mean for the causal system? This is unclear. There is also research that I and others are doing, for example taking a whole dynamical-system point of view on causal systems - I think we have work on optimal transport for causal discovery that is under review, and I think a group in Amsterdam has a dynamical view on causal discovery. So being able to reach the theory is, I think, one of the challenges, and it depends on which view you are taking; we do not want to just make something run, we also care about why it should work and whether it is supposed to give us the correct results when the assumptions hold. What was the second part of this question, about the variational inference?
Yeah, so the second part was: given the variational approximation, does it even make sense to try, for the diffusion models, I guess - and the person says perhaps not, since you upgraded to splines and used Monte Carlo estimates for inference. Yeah, I do not fully understand that part of the question; maybe they can clarify it. Okay, sounds good - if the person can clarify; otherwise, I can read the next one, which says: recent papers have noted the sensitivity of NOTEARS-type methods to rescaling the data; can you comment on this, and is your network affected by this as well? And there is a reference here.
Right - so for NOTEARS, even in the original paper they already discussed this sensitivity. In NOTEARS it comes from the score they use and how the score is defined: the least-squares part is not scale-invariant. I have seen many models use a similar loss, but if the loss is changed to a likelihood-based loss, there is no sensitivity to the scaling anymore, because you are also fitting the marginal. Because we use a likelihood-based score, we are not sensitive to the scaling of the data, and in the paper we even have experiments showing results with scaled and unscaled data. That's great, thank you so much. Any other questions from the audience? Well, thank you so much, Nick and Cheng, that was very interesting - so let's thank the speakers again.