MIA: Cheng Zhang and Nick Pawlowski, Deep End-to-end Causal Inference; Primer: Causal Discovery
By Broad Institute
Summary
Topics Covered
- Correlation Does Not Imply Causation
- A/B Testing Cannot Scale to Real-World Problems
- Interventions Are Fundamentally Different from Observations
- End-to-End Causal Inference Unifies Discovery and Estimation
- Combining Deep Learning and Causality Enables Real Solutions
Full Transcript
decision optimization with causal machine learning. As introduced, we will talk about the basic concepts in the first part, and then more about our recent research in the second portion of the talk. In general we want to use machine learning to answer what-if questions, so that we can get insights from all the past decisions, from everything that has happened, and improve our future decisions, whether that means driving better revenue or helping patients get better health. That is the reason we care about decision making. As introduced, today's talk has two parts: the first part will be generic causal machine learning, namely causal discovery and causal inference; the second part will focus on deep end-to-end causal inference. In this way we really want to make large-scale, real-world applicable causal machine learning algorithms, so that we can help society and help different patients, customers, and users.
So let's get started. First, let's think about what "causal" means; we use this word all the time, but what makes something causal? We all know the classic example of ice cream consumption and the weather: we are in the UK and it is raining today, and it doesn't matter how much ice cream I eat, the weather will not get better. So what is causal? Causality tells us that if we actually change the cause, something will happen to the effect. In layman's terms, that is really what defines causal and why we care about it: a change in the cause leads to a change in the effect. That leads to the next question: what is the difference between causation and correlation? Reichenbach's common cause principle tells us that any correlation we observe in the data has to be induced by some causal relationship. For example, if two variables are correlated, it must be one of the following situations: X causes Y, Y causes X, or both are caused by something else together. So if we observe a correlation, it is commonly induced because, in the underlying system, there is some kind of causal relationship. If we don't care about causality, we just ignore this: we don't distinguish between these situations, we only observe the correlation. But the fact that X and Y are correlated doesn't tell us anything about what we can do to them, or what will happen if we act.
Why is this important? If we don't care about causality and just build a machine learning system that is purely correlation based, what will happen? Say we want to drive revenue, and how big or how beautiful the office building is happens to be highly correlated with revenue. A system that doesn't care about causality will say: if we spend more money on the office building, the revenue will grow. That would be completely wrong, and the same holds in healthcare. So we want to make sure we bring causality into decision making, so that we can make optimized decisions. Please do ask questions if you have any, whether you are in the room or joining remotely.

When we want to establish real causal relationships, the common approach is the good old randomized controlled trial, which is still the gold standard: split people into an A group and a B group, give one group the treatment and not the other, and then measure the difference between the two groups. Then we know whether a causal relationship exists and how big the effect is. But most of the time that is simply not possible, especially in healthcare: you are not allowed to just randomly assign patients to a treatment, it is not ethical, and in many other applications it is far too expensive. This shows the limitations of our good old A/B testing. It has a very high cost, where cost can mean money, people's health, or time. Because the cost is so high, it can only be run at small scale, and most of the time it needs a long waiting period, during which the world keeps changing, so by the time you gather the insight it may no longer be valid. And it only has low resolution: it tells us, on average, whether there is a treatment effect between two groups, but it does not tell us what the treatment effect is at the individual level, so the resolution is really low.
That leads to why we care about causal machine learning; hopefully you are convinced by now about causality. What we want to do in causal machine learning is answer the same causal questions, but get the insight from existing data alone. We don't want to run A/B tests for everything to get insight, but rather use what we already have. Existing data can be observational data alone, but it can also be a mixture of observational and interventional data. If you happen to have some A/B testing results, of course you want to use that kind of data, but the approach does not require you to run all possible tests in the world to get the insights, because that would be a huge cost, and we want to help you avoid it. So causal machine learning is a field that uses causal principles in machine learning, works with existing data, and gives us causal insight.

When we talk about causality, there are actually two common causal tasks. Personally I don't like dividing things this way, because we should really think about how we can help users, but in the research community these are the two most common tasks: one is called causal discovery and one is called causal inference. Causal discovery focuses on the following: given the data, we want to understand the causal relationships. For example, does an incentive cause the number of sales to grow? We want to find whether a causal relationship exists, which variable is the cause and which is the effect, and the output is commonly a causal graph. Causal inference cares more about what exactly happens as the consequence of an action: if we do something, how much will something else change? That is the focus of causal inference. These are indeed the two common causal tasks, historically, in research. Now let's get a bit more into the details.
How do we represent causality in a more fundamental way? I already showed you one way: a causal graph. A causal graphical model is very easy to understand: we have nodes and we have edges, and in the most common case we assume it is a directed acyclic graph. Of course there are cycles in the real world, but in most of the cases we will see, it is a directed acyclic graph. In this graph, the edges represent the causal relationships, so, as we defined, if you change A you expect a change in its children. Another common language to describe causality is structural equation models. The idea is that each variable is a function of its parents, and the parents are its causes: every observation is generated from its parents through some function, with some noise. These are the two most common languages for causality. In a way the SEM is richer, because it even tells you how the functions along the edges look, while the graph gives the guidance: it tells you which variables enter each function as parents, for example that C is a function of A and B plus noise and of nothing else. So the causal graph tells you which variables are the parents, and that is reflected in the SEM; conversely, you can read an SEM and recover the causal graph.
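As a concrete illustration of the two languages (a minimal made-up sketch, not code from the talk), here is how one might write down and sample from a small structural equation model with additive noise in Python; the graph A -> B -> C and the coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Structural equation model for the DAG A -> B -> C:
# each variable is a function of its parents plus independent noise.
A = rng.normal(size=n)                     # A has no parents
B = 2.0 * A + rng.normal(size=n)           # B := f_B(A) + noise
C = -1.5 * B + rng.normal(size=n)          # C := f_C(B) + noise

data = np.column_stack([A, B, C])
# All three variables are correlated, but only the SEM (and its graph)
# tells us what would change under an intervention.
print(np.corrcoef(data, rowvar=False).round(2))
```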
With these languages we also need to think about the data distribution, and the graph is a very convenient tool for working with causality. There are some very basic assumptions that you see almost everywhere. The first is the Markov condition. What does the Markov condition tell us? It says that the conditional independencies you can read from the graph also hold in the data. For example, if X causes Z and Z causes Y, you can read from the graph that X and Y are independent conditional on Z, and the Markov condition says this will be reflected in the data. Then there is another condition, faithfulness. Faithfulness goes in the other direction, from the data distribution to the graph: if two variables show an independence property in the data, they should also be independent in the graph. So it translates data properties into graph properties. You can see why there are particular cases where this does not hold: for example, if you have a graph with two paths from A to B and the two effects along the paths exactly cancel, then the faithfulness assumption is violated. But these are fairly mild assumptions, because such exact cancellations do not happen that often. These are the two fundamental assumptions we use in causality very often, and you can see why they are important: combining the Markov assumption and the faithfulness assumption tells us that how we read a causal graph is consistent with the statistical properties we see in the data. With that, we can build causal machine learning on top of historical data, because what we read from the graph and what we read from the data distribution are consistent with each other.
OK, I introduced the basic causal graph, but there are actually a lot more kinds of graphs. I will not go through the full taxonomy of causal graphs, I will just give one example. We talked about graph properties and data properties, but sometimes the data properties you have can be explained by multiple graphs. For example, look at these three directed graphs on the left: they all imply the same conditional independence conclusions about the data distribution: A and B are dependent, B and C are dependent, and given B, A and C are independent. So for a given set of data properties there may be multiple graphs that explain it. Graphs that have the same conditional independencies are called Markov equivalent, and the set of such graphs is commonly called the Markov equivalence class. In causality, because we have a Markov equivalence class, we can represent it with something called a CPDAG, a completed partially directed acyclic graph. In general we do not know how to orient some of the edges, and we represent that the way you see on the right: an undirected edge means the edge could go either way among the members of the class. Colliders are special here: you can never create new colliders, so all the graphs in one equivalence class share the same collider structure. So, apart from the DAG itself, the CPDAG is another representation that stands for many graphs at once, and it is very easy to work with; it is one of the common languages we use in causality.
OK, so that was causality as a preliminary. In the next part, Nick and I will dive a little bit more into the basic methods for causal discovery and causal inference; as we said, these are the two tasks. We will introduce a few of the basic methods, which reflects how people think about this problem, and that will lead into the second part of the talk later. This is research done by many great researchers, and it really laid the groundwork for today's larger-scale work. Let's dig into causal discovery first. In causal discovery, the input is historical data, so you can imagine a big table of data, and the output is a causal graph: we want to figure out the causal relationships, which variables cause which. I will give some examples from each family of methods, hopefully just to give you a flavor of how we think about causal discovery and what is possible. Personally, when I first got into this field I always thought, how can you possibly find causal relationships from data, that must be impossible; but the field really shows that it is possible, and that is what makes me excited about the whole area.
I will just cover a few very basic, very early constraint-based methods. One is called PC and another is called FCI. PC is actually named after its authors, and it uses quite mild assumptions: the Markov and faithfulness assumptions we just talked about, and in the basic version we also assume causal sufficiency, which means that everything we care about is observed in the data, there is no really critical hidden variable that affects everything without us knowing it. PC in general has two steps: skeleton search and edge orientation. I won't introduce the other methods in as much detail, because the basic idea is the same: how do we set up conditional independence tests and use them to find the causal graph.

For PC, say we have a dataset. We start by assuming everything is fully connected, and then we start doing conditional independence tests. If a test says two variables are independent, the edge between them cannot be there, so you remove it; and you keep testing more conditional independencies with larger and larger conditioning sets. For each remaining edge, as soon as you find a conditioning set that makes the two endpoints independent, you remove the edge; think about the Markov and faithfulness conditions, independence means the edge cannot be in the graph. Once there is no edge you can remove anymore, the skeleton search is finished, and you end up with a skeleton: you do not know the orientations yet, but this is already good progress. For the orientation there are some rules that come from very basic graph theory. You look at triplets of nodes, say A, C, B, where A and B are not adjacent: we do not know the directions, but if A and B become dependent when we condition on C, the triple must be a collider, with A and B both pointing into C, because that is the only structure consistent with this. So you look at all the triplets, find all the colliders, and after that you can orient more edges with further rules: for example, if A points into C and C has an undirected edge to D, with A and D not adjacent, then the edge must be C pointing to D, because if D pointed to C you would create a new collider that we did not detect. With these simple rules you can orient the causal directions. So you can see that just using conditional independence properties and graph-theoretic analysis, under this list of assumptions, we can find the causal graph, which tells us what is a cause and what is an effect.
This is already very rich information, and it can already make a huge real-world impact for many applications: you may care about gene regulatory networks, or in education people care about what actually drives education outcomes, and so on. I just want to say one thing about the example I gave: the algorithm usually actually returns a CPDAG. We talked about how many graphs can share the same independence conditions; for example, in this case one edge cannot be oriented, so the final result leaves it undirected. That just tells you it can be either one graph or the other; we orient as much as we can. So for this type of algorithm, what is returned is in general a partially directed graph, which represents a set of graphs: we cannot narrow it down any further, but the true graph is one of them, and they all have the same conditional independence properties. OK, that is PC. This gives you a flavor of the constraint-based approach: we pose constraints from independence tests, and with those constraints and some basic rules we find the causal relationships. There are also rigorous proofs that, when the full list of assumptions is fulfilled, we do recover the correct graph up to its equivalence class. And you can imagine that there are, of course, practical difficulties, for example the errors introduced when you do the conditional independence tests.
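To make the two-stage idea concrete, here is a minimal sketch of a PC-style skeleton search for continuous data; it is not the speakers' implementation, and the partial-correlation test, the threshold, and the simplified bookkeeping (no separating-set storage for the orientation phase) are all illustrative choices.

```python
import itertools
import numpy as np
from scipy import stats

def ci_test(data, i, j, cond, alpha=0.05):
    """Partial-correlation test: are columns i and j independent given `cond`?"""
    x, y = data[:, i], data[:, j]
    if cond:
        Z = np.column_stack([np.ones(len(data))] + [data[:, k] for k in cond])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # residualize x on the conditioning set
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # residualize y on the conditioning set
    _, p_value = stats.pearsonr(x, y)
    return p_value > alpha                                  # large p-value -> treat as independent

def pc_skeleton(data, alpha=0.05):
    """Skeleton step of PC: start fully connected, remove an edge as soon as its
    endpoints look independent given some subset of the remaining neighbours."""
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}        # fully connected start
    for size in range(d - 1):                               # grow the conditioning-set size
        for i in range(d):
            for j in list(adj[i]):
                candidates = adj[i] - {j}
                for cond in itertools.combinations(candidates, size):
                    if ci_test(data, i, j, list(cond), alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
    return adj
```

The orientation phase (collider detection plus the propagation rules described above) would then run on this skeleton using the separating sets found during the tests.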
Of course, as researchers we do not want to limit ourselves to that whole list of assumptions, because the number of real-world problems we can solve with them is very limited, and we know causal sufficiency cannot be guaranteed in many applications. If there is latent confounding, the algorithm as I described will break. FCI is one algorithm that relaxes this assumption. It works very similarly to PC, but it has a list of new rules that can handle the situation where latent variables exist, that is, when you know there may be variables that influence the system but are not observed in the dataset. I am not going to go through all the rules; there are eleven rules for orienting the edges, so we will skip them. This constraint-based family of causal discovery methods is still an active research field, and in the past two decades there has been a lot of work relaxing the various assumptions: for example, how can we allow cycles, or what happens when the data has values missing not at random, can we still use a PC-style algorithm in all these situations? They can also be combined with other methods: instead of using only these orientation rules, can we combine them with other ways of orienting the edges? And there are other types of constraints, for example people look at small patterns, certain properties that must hold across the graph. But this is the typical example of constraint-based causal discovery: you set up constraints and you search for the graph. Let's go to the second type, which is score-based methods.
Conceptually this is extremely simple: a score-based method just asks which graph explains the data better. And there is actually theory behind it: of course there is again a list of assumptions, but there are theorems telling you that, under those assumptions, the graph that explains the data best will be the true causal graph. Algorithm-wise it is very simple: you want to find the graph, typically required to be a directed acyclic graph, that maximizes your score. The score is up to your definition, but a traditional, commonly used one is BIC; more modern methods use likelihood-based scores and information-based scores, and you can define your own score as needed. One of the early score-based methods just keeps it at BIC: you compute the BIC and evaluate which graph fits your data best. So conceptually it is very, very simple.
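As a hedged illustration of what "scoring a graph" can mean in practice, here is a small sketch that computes a BIC-style score for a candidate DAG under a linear-Gaussian model by regressing each node on its parents; this is not the exact score used by any particular method mentioned in the talk.

```python
import numpy as np

def bic_score(data, parents):
    """BIC-style score of a candidate DAG (higher is better here).

    `parents` maps each column index to the list of its parent columns,
    e.g. {0: [], 1: [0], 2: [1]} for the chain A -> B -> C.
    """
    n, d = data.shape
    log_lik = 0.0
    n_params = 0
    for j in range(d):
        y = data[:, j]
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents[j]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = max(resid.var(), 1e-12)
        # Gaussian log-likelihood of this node given its parents
        log_lik += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        n_params += X.shape[1] + 1                 # coefficients plus noise variance
    return log_lik - 0.5 * n_params * np.log(n)    # data fit minus complexity penalty
```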
So what is the challenge here? The set of all possible graphs is a huge space; it grows super-exponentially with the number of nodes, so even with 10 nodes you are not able to score all the graphs anymore. For score-based methods, in the early days people developed heuristics, and of course you need to prove these are still correct, for how to narrow down the search space. GES is one of the earliest such methods. In general the heuristics work along these lines: start from the empty graph, add edges step by step whenever they increase the score, and then remove edges whenever that improves the score. These heuristic rules narrow down the search space and make the search feasible. There are more rules, but essentially it is a forward and a backward search. So the recipe is: narrow down the search space to make it feasible, find the graph that best explains the data, and that is the causal graph under the listed assumptions.
In more recent years, people started to think about whether we can convert this discrete decision, checking all the possible graphs, into a constrained continuous optimization. One piece of work that is very popular nowadays, which many groups including ours build on, is called NOTEARS, from 2018. One of its contributions is that, instead of enumerating possible graphs, it finds a way to characterize what it means to be an acyclic graph: there is a smooth function of the weighted adjacency matrix, based on the trace of a matrix exponential, that equals zero exactly when the graph is acyclic. So you can set that as an equality constraint, the problem becomes a constrained continuous optimization, and you can use an augmented Lagrangian to solve it. This was a large step towards making such methods more scalable, and we will get a bit more into it in the second part of the talk.
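For reference, the acyclicity characterization being described can be written in a few lines; the sketch below follows the published NOTEARS constraint h(W) = tr(exp(W*W)) - d, which is zero exactly when the weighted adjacency matrix W corresponds to a DAG (here W*W is the elementwise square).

```python
import numpy as np
from scipy.linalg import expm

def notears_constraint(W):
    """h(W) = tr(exp(W * W)) - d; equals zero iff W is the weighted adjacency
    matrix of a directed acyclic graph (Zheng et al., 2018)."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # elementwise square keeps entries non-negative

# Example: a 3-node chain is acyclic, a 2-cycle is not.
chain = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
cycle = np.array([[0, 1], [1, 0]], dtype=float)
print(notears_constraint(chain))   # ~0.0
print(notears_constraint(cycle))   # > 0
```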
So these are the first two types of causal discovery methods, and both of them can potentially return equivalence classes. The last type is functional causal models, which is yet another class. As we talked about, besides the graph there is the SEM, and the idea is that by making some fairly simple assumptions on the functional form of the structural causal model, for example a deterministic function of the parents plus additive noise, we may be able to identify the graph and the causal directions. One of the early examples, and I find it really enlightening, is LiNGAM, the linear non-Gaussian noise model. Instead of dealing with a fully general functional form, we look at additive noise models, and at the linear case first. Take a simple example: we have data for X and Y, and we want to find out whether X causes Y, Y causes X, or there is no causal relationship at all. In the linear case we know the functional form, so if we assume X causes Y we can just fit a linear regression; after fitting it, we can look at the noise, the residual, and at the assumed cause, which here is X. If the model is correct, the residual and the cause should be independent. But maybe it is the other situation, Y causes X; then we fit another linear regression, X equals a times Y plus noise, and again we look at the residual and the assumed cause, which is now Y. What you see is that only in the first case are the noise and the cause independent; in the second case the residual and the cause are not independent anymore, so that direction is wrong. Only one configuration is consistent, so by making this assumption on the functional form we are able to identify which configuration must be the correct one, and that is how we discover the causal relationship.
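A hedged sketch of that pairwise procedure: fit a linear regression in each direction and check in which direction the residual looks independent of the putative cause. The crude nonlinear-correlation measure below is only a stand-in for a proper independence test (for example HSIC), and the toy data are made up.

```python
import numpy as np

def fit_residual(cause, effect):
    """OLS fit of effect ~ cause (with intercept); return the residual."""
    X = np.column_stack([np.ones_like(cause), cause])
    beta, *_ = np.linalg.lstsq(X, effect, rcond=None)
    return effect - X @ beta

def dependence(a, b):
    """Crude nonlinear dependence measure between a and b
    (a stand-in for a proper independence test such as HSIC)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return abs(np.mean(np.tanh(a) * b)) + abs(np.mean(a * np.tanh(b)))

def lingam_direction(x, y):
    """Return 'x->y' or 'y->x': pick the direction whose regression
    residual looks more independent of the putative cause."""
    score_xy = dependence(x, fit_residual(x, y))   # residual of y given x, versus x
    score_yx = dependence(y, fit_residual(y, x))   # residual of x given y, versus y
    return "x->y" if score_xy < score_yx else "y->x"

# Toy check with uniform (non-Gaussian) noise and true direction x -> y.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20_000)
y = 1.5 * x + rng.uniform(-1, 1, 20_000)
print(lingam_direction(x, y))   # expected: 'x->y'
```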
So that is the basic idea behind functional causal models. Here the noise must be non-Gaussian, and just with additive non-Gaussian noise we can already identify a lot of things. Why do we need non-Gaussianity? It is essential because, for linear additive noise models, the Gaussian case is the only one that is not identifiable: you can fit both directions and you will see the same result. That is why we need the assumption. There is a whole line of research relaxing these assumptions so that more cases become identifiable: for example, instead of the linear additive noise model there is the post-nonlinear model, where the functions can be nonlinear and things are still fine, with only five special cases that are not identifiable. And of course we do not only deal with two variables: in the LiNGAM case we can handle multiple variables, and it can actually be very efficient, because all the noises are independent, which is the key fact, and we can borrow the good old ICA machinery to estimate the whole thing. So we can discover graphs over many nodes as well. That is the spirit of functional causal models. I think by now I have given you an introduction to the fundamentals of the three types of causal discovery methods: constraint-based, score-based, and those based on functional causal models. You can also see that so far we have not really used the functional relationships in the SEM to answer causal inference questions; for the next part, Nick will tell you more about how the fundamental causal inference research has been conducted.
Thanks, Cheng. As mentioned, that was the first intro and causal discovery; I am going to take over and talk a bit more about causal inference. Just to remind us why we are doing this: the first part was really about finding edges and graphs; the second part is about asking questions like, how much ice cream would we sell if it were sunny, given that we know sunshine causes ice cream consumption rather than the other way around. The important part here is that causal inference estimates causal quantities; there are a bunch of them that we can talk about, and different interesting questions we can ask and answer, but all of them assume some sort of knowledge about a graph, a partial graph, or some equivalent assumptions.
So let's start off by talking about what it means to perform an intervention, to actually act and change something in the system. At least one way of thinking about it is Pearl's do-operator, which basically means: if you want to act on C in this graph and, say, set its value to one, what we do in the model is cut the edge from A to C, because there is now no relationship anymore; no change in A changes the value of C, the intervention disentangles it. If we then want to look at the distribution of B given that we have done this intervention on C, that is, we have set the value of C to one, we can compare that to the conditional distribution, where we just say we have a bunch of observational data and ask what the outcome of B is given that we have observed C equal to one. These are actually not the same thing. Just to visualize this a bit, assume we are in a mostly binary scenario, so in particular C is binary. In the left case, the conditional, we still need to marginalize over the conditional distribution of A given C. In the right case we do not condition the distribution of A on C anymore, because this edge is cut. This is the important difference: we have mutilated the graph, as it is called, by cutting this edge and getting rid of this dependence, and then we marginalize not over the conditional distribution of A but only over the marginal distribution of A. If you compute both quantities you will, in most cases at least, get different results. So you need to do something different from just computing conditional probabilities with all the usual rules you might already know.
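To make the difference concrete, here is a small made-up numerical example (not from the talk) with the graph A -> C, A -> B, C -> B, comparing P(B=1 | C=1) with P(B=1 | do(C=1)) by forward simulation; all the probabilities are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical binary SCM: A -> C, A -> B, C -> B (numbers are made up).
A = rng.random(n) < 0.5
C = rng.random(n) < np.where(A, 0.8, 0.2)          # A strongly influences C
B = rng.random(n) < 0.1 + 0.3 * A + 0.4 * C        # B depends on both A and C

# Observational: condition on C == 1 (A is no longer at its marginal 50/50).
p_b_given_c1 = B[C == 1].mean()

# Interventional: do(C = 1) cuts A -> C, so we regenerate B with C forced to 1
# while A keeps its natural marginal distribution.
B_do = rng.random(n) < 0.1 + 0.3 * A + 0.4 * 1
p_b_do_c1 = B_do.mean()

print("P(B=1 | C=1)     ~", round(p_b_given_c1, 3))   # pulled up, since C=1 favours A=1
print("P(B=1 | do(C=1)) ~", round(p_b_do_c1, 3))      # averages over the marginal of A
```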
So what are the things we want to calculate? The most common one is probably the so-called average treatment effect, where we calculate effects on the whole population. In the previous scenario the intervention variable, which I now call X, is our cause, or our treatment as it is often called, and Y is the effect variable. We then typically care about the effect of changing our treatment from zero to one: we look at the expected outcome Y under the interventional distribution when we perform the intervention on the treatment variable, minus the expected value of the outcome under a different intervention, which is often a baseline or reference treatment. The main thing here is that we take this expectation over the full distribution of the whole population that we have observed.
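Written out, the quantity being described is

```latex
\mathrm{ATE} \;=\; \mathbb{E}\bigl[\,Y \mid \mathrm{do}(X = 1)\,\bigr] \;-\; \mathbb{E}\bigl[\,Y \mid \mathrm{do}(X = 0)\,\bigr],
```

where X = 0 plays the role of the baseline or reference treatment.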
The second thing, which might be more interesting to a lot of people, is the so-called conditional average treatment effect, where we care about the effect on subpopulations. For example, in the ice cream scenario: if we could turn the sun on, what would the ice cream consumption be in a country like the UK versus a country like Italy, where it is sunny most of the time anyway, so there might be less of a difference; or people in Scandinavia, who have dark weather a lot of the time, might like ice cream and eat it all the time regardless, so there is not much of an effect there. Similarly, you can think about sales scenarios in different countries, or medical scenarios with different genders, different preconditions, different historical exposures that you might care about and want to condition on. And the last one, which is potentially the most interesting, or the most complex to calculate, is the so-called individual treatment effect, where we really care about questions like: what if I change something about myself, what is my ice cream consumption when it is sunny versus when it is not, what happens if I take this drug. This is much more targeted at a single individual or a single instance rather than any population, and it requires more understanding of the actual functions and how the world works. That was just to lay out the things we care about; I am mainly going to talk about the average treatment effect, because it is the most common thing to deal with.
So what do we actually use to estimate those things? The first ingredient is the graph, either the one we found using causal discovery or one we have from some sort of domain knowledge, and we then use do-calculus. Do-calculus is the combination of the general rules of conditional probability with the do-operator and a set of rules around it, and basically what it does is identify the subgraphs that are required to estimate the effect of the intervention we care about. If we look at this graph here, we have A, B and C again from the previous example, but with some extra example nodes on top as well. If we care about the intervention on C, we cut all the edges that lead into C. We can also get rid of this top-right node, because it is completely disconnected from our outcome variable B, so we can just throw it away; we do not need to measure it and it gives us no information. We also cut the edges into this other node on the left, and by applying more rules of probability we can see that, as long as we observe A, we can get rid of that node too. By following all of those rules we can identify a smaller subgraph, the variables we actually need to observe, so that we can estimate this effect from observational data, knowing that this is the graph that generates the data. And then we get back to the equation we saw previously: we just marginalize over the marginal distribution of A, set C to one, and we obtain the interventional outcome.
This obviously works for much more complicated examples as well. We can for example look at this graph, where we have some unobserved confounding between X and Z4, so there is a node that causes both of them but that we cannot observe, plus some other nuisance variables that we do not necessarily care about. Knowing that this is the graph, we might now ask: which variables do I need to observe so that I can actually identify and estimate the effect? For that we look at the so-called backdoor criterion, which is probably the most common, or at least the easiest, criterion for identifying causal effects. It says that a set of nodes Z in our graph satisfies the backdoor criterion with respect to a pair of variables X and Y, where X is our treatment and Y is our outcome, meaning we already need to know which treatment effect we want to estimate, if two things hold: no node in Z is a descendant of X, and Z blocks every path between X and Y that contains an arrow into X. This basically means we need to block all the backdoor paths, in particular the ones created by confounders.
If we then look at our example graph, the first thing we see is Z4, which we need to control for and take into our set. But we also see that if we control for Z4, we actually open a new path all the way around, a backdoor path that we then need to block as well, so we additionally need to control for either Z2 or Z5. If we do that, we can choose our conditioning set; in this example let's say we control for Z4 and Z5. We can then estimate the treatment effect of X on Y as long as we measure Z4 and Z5; the other variables we do not actually need to measure, and we can still estimate the treatment effect, or the interventional distribution.
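For reference, once a valid backdoor set Z (here Z4 and Z5) has been chosen, the interventional distribution follows from the standard adjustment formula:

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z).
```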
Then what we basically do is, again, we are interested in this expectation over the whole population, where we ask what changes on average, for example in a phenotype if we knock out a gene, and we can say: let's actually run this experiment, we now know which variables we need to measure. One short note here: we still have Z6 on this path, but it just gets absorbed into the noise of the outcome variable, so we do not really need to care about it and we can still estimate the effect. So using do-calculus we can identify which variables we actually need to measure, so that you can then go and design your experiments, or, if you already have an experiment, you can ask: do I have all the variables and all the measurements that I need to estimate my effect?
Just a quick question: why does controlling for Z4 open new paths? Can you go back to the definition and explain a little bit? Sure, sorry, this was still coming up. If we control for Z4, we have this open path from Z2 all the way down to Y: there is a collider structure between Z2, Z4, and an unobserved node that we know exists but can never measure. I did not draw it in, but there is a collider there, and that path gets opened once we control for Z4. So that means you assume that there might be some possible confounder there? Yes, exactly. And this is one thing to keep in mind: you need to know your graph to be able to apply those rules, to identify whether you can estimate the effect and how to estimate it. Well, what if there are unobserved confounders, say between Z4 and Z2, do we get into trouble then? If we have unobserved confounders between Z4 and Z2, in addition to the one between X and Z4: as long as we measure Z4 and Z2 and the unobserved confounders are only between those two variables, then this is still fine. OK. Any more questions?
Cool, thank you. So basically there are a bunch of rules that you can follow and apply, and they will tell you, first of all, whether you can estimate your causal effect at all, and then also how to estimate it. Let me give two very simple ideas of how to calculate the effect; there are many more out there, but this is just a first introduction. One is that we can run a simple linear regression: we say our outcome is a linear transformation of all the variables that we now know influence the outcome, and we run a linear regression on that. Given it is a linear regression and we want to look at the effect of moving the treatment from zero to one, we can then simply read off the coefficient that the linear regression assigns to the treatment. This is the very, very simple case that a lot of people used in the very beginning.
A similarly easy case is that you can simply count, quite literally, at least in the binary setting. If you have a table like the one up there, with the variables we identified that we need to measure, so Z4, Z5, X and Y, you can count the cases where the treatment was one and look at the average outcome for those, then look at the cases where the treatment was the other value and take the average of Y there, and, assuming the distribution of Z4 and Z5 is the same across the groups, just compare these averages to get the treatment effect. Often this obviously does not do the trick on its own and you would need to do some other things.
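And a hedged sketch of the counting idea in its stratified form (column names are illustrative): within each configuration of the adjustment variables, compare the average outcomes under the two treatments, then weight the differences by how common each configuration is. It assumes every stratum contains both treated and untreated units.

```python
import pandas as pd

def stratified_ate(df, outcome="Y", treatment="X", adjust=("Z4", "Z5")):
    """Backdoor adjustment by counting: sum_z P(Z=z) * (E[Y|X=1,z] - E[Y|X=0,z]).
    Assumes every stratum contains both treated and untreated rows (positivity)."""
    ate = 0.0
    for _, stratum in df.groupby(list(adjust)):
        weight = len(stratum) / len(df)                                  # P(Z = z)
        mean_treated = stratum.loc[stratum[treatment] == 1, outcome].mean()
        mean_control = stratum.loc[stratum[treatment] == 0, outcome].mean()
        ate += weight * (mean_treated - mean_control)
    return ate
```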
But let's talk about one more extension, or really a different way of thinking about causal inference. So far I have only been talking about structural causal models, or structural equation models, and do-calculus, but there is one more interpretation of it all, and that is the so-called potential outcomes framework. There the idea is that you always have potential outcomes: you always have Y0 and Y1, the outcome if I had taken action zero and the outcome if I had taken action one. You just do not observe both of them. They are always there, so you get a table like this one, where Y0 and Y1 are columns, but you can only ever observe one of the two, and this turns the whole problem into more of a missing data problem: your counterfactual, the outcome if you had taken the other action, is simply an unobserved, missing data value. And this is basically the same as what we did previously. The graph on the left is the one I showed you before, previously with the labels A, B and C; what you add in this graph are your potential outcomes Y0 and Y1, which are directly caused by the covariates, and the treatment only really selects which outcome you are actually seeing. This is where the so-called ignorability assumption comes in: the potential outcomes framework assumes that your potential outcomes are independent of your actual treatment variable given your conditioning set. And just to make sure, this is not the same as saying that your actual observed outcome is independent of your treatment given the conditioning set, because, as you can see here, your actual observed outcome is the one that gets chosen by the treatment that we actually apply. So this is exactly the same picture as before: if you know the confounding structure and you know do-calculus, you get the usual triangle graph, and you arrive at exactly the same thing using the potential outcomes framework as using Pearl's framework of do-calculus and structural causal models.
And then there are more sophisticated ways of actually estimating those quantities. One thing we can do, when the distribution of our conditioning variables is not the same across the treatment groups, is to look for examples where it is the same: calculate the treatment effects for those examples, and then average those treatment effects. That basically means we want to fill in those columns of the potential outcomes table, the outcome if we had applied a different treatment, for some instances or individual entries in our data set. What we do there is simply match instances with the same properties. In my toy case here the property is simply the colour of the individual, and we fill things in: in the first row we have an orange individual, we applied treatment one, so we do not know the potential outcome under treatment zero, but we do know the outcome under treatment one; somewhere further down we have a separate orange individual where we did apply the other treatment, and we know the outcome under that treatment. Because we believe the two to be very similar, at least they have the same colour (I did not think too hard about the example, the colour just stands for having the same properties), we can simply fill this in, use it to calculate the treatment effect for this subgroup of individuals, fill in the other entries by finding similar individuals in the other groups, average everything, and that gives our average treatment effect. The problem then becomes how you decide whether two individuals are similar or not, but I am not going to go into that right now.
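A hedged sketch of that matching idea, where the "colour" stands in for whatever covariates define similarity: for each unit, impute the missing potential outcome from the nearest unit that received the other treatment, then average the individual differences. The distance measure and the one-to-one matching are illustrative simplifications.

```python
import numpy as np

def matching_ate(covariates, treatment, outcome):
    """1-nearest-neighbour matching estimate of the average treatment effect.
    For each unit, the unobserved potential outcome is filled in with the
    outcome of the closest unit (in covariate space) from the other group."""
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]
    treatment = np.asarray(treatment)
    outcome = np.asarray(outcome, dtype=float)

    effects = []
    for i in range(len(outcome)):
        other = np.flatnonzero(treatment != treatment[i])        # units with the opposite treatment
        dists = np.linalg.norm(covariates[other] - covariates[i], axis=1)
        match = other[np.argmin(dists)]                          # the most similar such unit
        if treatment[i] == 1:
            effects.append(outcome[i] - outcome[match])          # y1 observed, y0 imputed
        else:
            effects.append(outcome[match] - outcome[i])          # y1 imputed, y0 observed
    return float(np.mean(effects))
```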
So, summing this up a bit: one problem that we have in causality a lot is, first of all, that there are different communities developing different methods under different assumptions. Just a quick question on the previous slide: the second orange person has a zero there, that is the treatment, but how do you estimate the treatment effect from that? Ah, sorry, yes, the treatment actually should not be shown there anymore; if you look on the right, we have treatment one for the one orange individual and treatment zero for the other orange individual, so we just take the outcome of the one and the outcome of the other one up there. OK, so under treatment zero the outcome is two, and under treatment one it is the other value? Yes, and then you can look at the difference and get the treatment effect for that pair. Any more questions on that, or anything else?
seen Global imprints both of them are somewhat separate communities working with like different methods and different assumptions the same is true for like call um cause of imprints using pearls framework as well as a DPO
framework and um it's got like a bunch of problems there kind of like trying to unify them but there's work on that we've seen that um DPO framework is actually equivalent to
um curls2 calculus and you can find rules to transition a graph into kind of like those assumptions and vice versa because there's also kind of like work that um yeah helps to like deal like
with unobserved compounding NPO for example and you can then use those assumptions that they make to build a graph and turn that like do it vice versa as well and um kind of like
another problem there is that a lot of classical methods are often actually restricted to just linear binary settings in both the input at least which is not true in the real world and there's been a lot of work in recent
years um for example double machine learning that is really good at dealing with some of those problems and um this is kind of like showing that the introduction of machine learning really helps with
scaling this to more realistic functions um but most recently deep learning has really made an impact on causality as well because of the machine learning this is kind of like what the second top
Just recapping the things we talked about, roughly one slide each, what I like to call the things we care about. First of all, interventions are not the same as conditionals; I keep trying to hammer this in, because sometimes people ask why you don't just condition on your data, and it is not the same. We then want to calculate, for example, average treatment effects on the whole population, regardless of subgroup, where you need to find enough data across all of these groups and marginalize appropriately. The next one is the conditional average treatment effect, where we are interested in subpopulation averages: say, sales in a specific country, the effect of a gene knockout on a specific cell line, the effect of a treatment for a specific disease or specific personal preconditions, and so on and so forth. And the last one is the individual treatment effect, where you are very much interested in questions like: if I take this drug, what is the effect on me personally, versus on Cheng or any other person in this room. That is the holy grail to some extent, where you get to personalization and can have the most impact.
Moving on, the next thing we covered: we saw different ways of doing causal discovery, using conditional independencies, and independence constraints in general, as well as different score functions and functional-form assumptions; essentially, how do we turn a table of data like the one up there into a graph, and you always need to make some assumptions to do that. And lastly, how do we turn this table of data, together with a graph, into an average treatment effect, or causal effects more generally. As mentioned previously, it is really important that you get a good graph here, otherwise the estimate at the bottom is most probably going to be wrong. Often causal inference uses domain knowledge to get those graphs, but we believe that causal discovery can have a big impact there, because a lot of domains have either too many variables or not enough domain knowledge to actually identify these graphs. So what we are doing is using data and principled assumptions to find the effect of an action, and we can either use something like the potential outcomes framework, where we assume ignorability, meaning that our potential outcomes are independent of the treatment, or we can use Pearl's framework, where we assume that at least a subgraph is known; again, they are equivalent. Using those ingredients we can then estimate the effect with either parametric or non-parametric models, parametric being for example a linear regression, non-parametric being simply counting and averaging things. And that's it; I think we are basically at time, sorry for running a bit long, but if there are any questions, please ask.
I have one question. Cheng mentioned that for causal discovery we don't necessarily need to have interventional data; if we do have interventional data, does it help or improve our chances of getting a better graph? It helps a lot, significantly. Any kind of prior knowledge helps a lot, because it is really hard to find the graph with only observational data. With interventional data you already know some of the edges, so you effectively have a partial graph; different methods can encode that differently, but it narrows down the search space significantly. And it is not only interventional data: any domain prior knowledge helps a lot. In our team we even find situations where domain experts only have a guess, like "I feel this one is more likely a cause" or "this edge is very unlikely". In some situations, for example where they have run interventions, the experts really do know, but in many other situations it is more like five people think an edge exists and another three disagree. Even then, encoding this with the appropriate uncertainty helps improve the quality a lot, because the search space is huge, and getting the graph really correct, especially with a limited amount of data, is extremely hard. So any interventional data or domain knowledge is very helpful. OK, thank you.
So what we are doing is trying to use causal machine learning to tackle difficult decision optimization problems and to answer what-if questions. For doctors, we want to answer what will happen to the patient if I give treatment A versus treatment B; for a business owner, what will happen to my revenue if I give a 20% discount. If we know the answers, then the decision making is easy, because we just choose the better option. Let's recap the problem we are looking at: we want to do causal machine learning with existing data. In all these scenarios, what people give us is a table, their historical data, which can be a mix of observational and interventional data, and what they want is, in some sense, a high-resolution, large-scale table without the missing entries: you have many partners, customers, patients, users, you have all the different actions, and you want the hidden counterfactual information, or you just want to know what happens under each intervention, the interventional distribution, and then you can make your decision. If you look at this, we want to do it at low cost, at large scale, in real time, and at high resolution; that is what causal machine learning should be about. But if you look at what we introduced in the first hour, none of it answers this by itself. So now let me define what end-to-end causal inference actually means. We want to use only the user's data, but instead of treating causal discovery and causal inference as separate problems, we would like to take the user's data and help them optimize their decisions based on this big table, in real time and at large scale. To be able to do that, we need to know the graph, at least partial or summary graph information, and we need to know the functional relationships, so that we can compute the effects. This task is what we call end-to-end causal inference: in some sense it joins causal discovery and causal inference, and the input is just the data. That is our definition, and you can see this is what the real world actually needs: very few people care about either piece in isolation, and for most of the people we talk with, especially sitting in a corporate lab, this is what they really need, which motivates us to aim for a general solution.
But the existing methods are not sufficient. Existing methods either do causal discovery, where we have a table and return a graph, a partial graph, or a set of graphs, or they do causal inference with a completely different set of assumptions, where the graph is assumed to be given; none of them solves this problem end to end. And there are more issues, which the next part of the talk will touch on: even within causal inference, people make different assumptions. You might look at discovery and inference and say, fine, we just put them together and the problem is solved, but that is actually naive. Let me give a worst-case scenario you might run into. Say you use LiNGAM for causal discovery, which requires the model to be linear with non-Gaussian noise; if the model were linear Gaussian, that is exactly the case it cannot identify. But then you use a causal inference method that only works with linear Gaussian noise models, which is true of many causal inference methods. Put the two together and you have solved absolutely nothing. This is the worst-case scenario, but you can see the problem. And there is more: the causal inference methods we talked about assume you already hand them a graph, a full graph, but the causal discovery methods I talked about return CPDAGs or other equivalence-class representations. None of the outputs from causal discovery is in the format that the causal inference side needs; we do research assuming we will be given the full graph, but that is actually not true. So even in the cases where chaining them in a row works, you only get a little bit better; the general problem remains.
Moreover, because these two communities are separate, the causality community has almost treated machine learning as an arch enemy, and the two fields work very differently. One of the fundamental things in causality is that we care about theory; it is very important to care about the theory, because otherwise we cannot claim anything is causal. But meanwhile, with theory alone, I still read papers that scale up to six nodes, and that does not solve real-world problems. If a medical doctor comes and wants to know how to treat patients, there are a lot of patients in the hospital; businesses have a lot of customers; there are a lot of variables and a lot of decisions to make. So we need to be able to solve real-world, large-scale problems, and for that we need machine learning, especially deep learning types of models, to make it scalable, flexible, and applicable to the real world; we cannot only look at six-node linear Gaussian models. That is what the whole second part of the talk is about, and it is also part of our group's mission: we want to do deep end-to-end causal inference. The goal is that for general users we can deliver general solutions: people provide historical data, and we give them all the insights they need. OK, let's get started. I would say this is a field where we have more questions than answers; we have made some steps, which we will explain and show you, and we are also inviting everyone to join us in contributing to this. I personally feel very, very excited about it.
OK, let's recall what we talked about: structural equation models. Structural equation models tell us how an effect is generated from its causes: they specify exactly how each variable depends on its parents, with a noise term, as shown here. But if you think about it, this is a generative model. A causal system, in a way, takes exogenous noise as input, and what you observe is just the generated data; it is a generative system. The difference is only that we care about the true underlying mechanism of how the data are generated. And generative modelling is exactly where machine learning, especially in recent years, has developed so many different models with the flexibility to solve real-world problems: my favourites are variational autoencoders and normalizing flows, along with the more recent generative models. People have been using these models by taking noise as input and modelling, say, images; in some way they take a noise input and generate the observational data. So we can really bridge the gap and think about how we can use these advanced machine learning tools to solve the causal questions; deep generative models are a perfect match, because a causal system is a generative system. So now let's get into the details of bridging deep generative models and structural causal models, and see how well we can do.
So now let's go into detail on bridging generative models with structural causal models and see how well we can do. Given the data, we care about the graph, right? How can we do that? One thing we did is build on the work of the great researchers who came before us: we can use a score-based causal discovery method. We choose an objective - the score - that measures how well a graph explains the data; here that is the marginal likelihood, which is very well established, where G is the graph we need to learn and X is the observed data, and we can factorize it this way.
Now we want this to be scalable and applicable to the real world, so let's look at the likelihood. The likelihood just tells us how well the graph explains the data. We do not know the graph yet, but assume for a moment that we have it; then we can think about a transformation. In general we can define a transformation like this: z is the observation x minus the function of its parents under the graph G. This is an invertible transformation when G is a DAG - we prove this in the paper - and if you can define such an invertible transformation, you can see this is just the additive noise model rearranged.
In causal discovery, when we talk about functional causal models, we said that if you make some mild assumptions about the functional form you can show identifiability. The non-linear additive noise model is one of the cases in causal discovery that is identifiable, so in this setting, with some other assumptions and infinite data, we can identify the unique causal graph. So if we find this transformation, what we can do is define our likelihood like this. In this case you could plug in many generative models, but we use a normalizing flow to define the likelihood. You can see it is the same type of score-based method; we are just being more flexible about the likelihood, so we can utilize modern machine learning methods and go beyond the very constrained basic cases like linear regression. So we define the likelihood using a normalizing flow, and you can think of this as an extension of an earlier paper on causal autoregressive flows; in that paper, though, the graph is already given - they consider only two nodes and fit both directions to see which one is better - whereas if we think about the whole objective, we generalize it.
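As a rough illustration of that likelihood construction (assumed names, not the paper's implementation: `f` stands for the learned functions mapping each node's parents under graph `A` to a prediction, and a standard normal base density stands in for the learned normalizing flow on the residuals):

```python
import torch

def log_likelihood(x, A, f):
    """x: (n, d) data; A: (d, d) adjacency matrix; f(x, A): per-node predictions from parents."""
    z = x - f(x, A)   # residuals of the additive noise model; this map is invertible when A is a DAG
    # Under a topological ordering the Jacobian of x -> z is triangular with unit diagonal,
    # so log|det J| = 0 and the log-density reduces to the base density of the residuals.
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z).sum()
```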
You have a question? Yeah, I'm just curious: you said that when the noise is gaussian the causal direction is not identifiable, but here you're using an autoregressive flow, which assumes the noise is gaussian, right? So isn't it non-identifiable? It is only non-identifiable when the model is linear with gaussian noise; non-linear with gaussian noise is identifiable. Okay, that's great, thank you. So in the first, basic version we assume gaussian noise, and the functional form is shown down here. Okay, so we have the likelihood, and now we need to think about the prior. What do we want the prior to carry? A couple of things. First, you can see we need the graph to be acyclic - we need a DAG - and we already know there is a constraint formulation for that: the continuous acyclicity constraint, where the term equal to zero tells us the graph is a DAG, so we can build it into our prior as the first term. Why put it in this form? Because we can then optimize it with something like an augmented Lagrangian, which is very simple. And of course, in the spirit of BIC, we want the graph to be sparse, so the other term in the prior is a sparsity penalty on the adjacency matrix. With this prior form you can actually add a lot more; back to your earlier question, if we have domain experts we can easily set other constraints, and we can encode domain expertise into the prior - which edges are more likely and which are not. So we can define the prior like this.
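A minimal sketch of such a prior (illustrative weights, not the paper's exact form): it combines a NOTEARS-style continuous acyclicity penalty with an L1 sparsity penalty on a soft adjacency matrix, plus an optional term for edge-level domain knowledge:

```python
import torch

def dag_penalty(A):
    # Continuous DAG characterization: h(A) = tr(exp(A * A)) - d equals 0 iff A has no cycles.
    d = A.shape[0]
    return torch.trace(torch.matrix_exp(A * A)) - d

def log_prior(A, lambda_dag=10.0, lambda_sparse=1.0, expert_logits=None):
    lp = -lambda_dag * dag_penalty(A) - lambda_sparse * A.abs().sum()
    if expert_logits is not None:          # optional prior belief about individual edges
        lp = lp + (expert_logits * A).sum()
    return lp
```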
Some people know me as an approximate inference person, so now this becomes an approximate inference problem: we have our objective, we want to find the graph with the best score, we use a normalizing flow for the likelihood, and we define our prior encouraging acyclicity and sparsity and all these things. We do not know how to compute the posterior over G - G is a global structure, and since G is a graph, in practice we represent it with an adjacency matrix. For the first step I used my good old hammer called mean-field variational inference, and then we can construct the evidence lower bound. Mean-field variational inference is, in general, the standard approach: we assume a Q distribution and we minimize the KL divergence - a fairly standard technique. So we convert a causal discovery problem into an approximate inference problem, and with variational inference we convert the whole thing into an optimization problem.
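A minimal sketch of that optimization (illustrative: `log_likelihood` and `log_prior` are placeholders for the flow-based likelihood and graph prior sketched above with the functional parameters folded in, and the relaxed-Bernoulli sampling is just one way to keep the sampled graph differentiable):

```python
import torch

d = 5                                                  # number of variables (assumption)
edge_logits = torch.zeros(d, d, requires_grad=True)    # mean-field q(G): one Bernoulli per edge

def elbo(x, n_samples=8):
    probs = torch.sigmoid(edge_logits)
    q = torch.distributions.RelaxedBernoulli(temperature=torch.tensor(0.5), probs=probs)
    mc = 0.0
    for _ in range(n_samples):
        A = q.rsample()                                # soft adjacency sample, differentiable
        mc = mc + log_likelihood(x, A) + log_prior(A)
    entropy = torch.distributions.Bernoulli(probs=probs).entropy().sum()
    return mc / n_samples + entropy                    # E_q[log p(x|G) + log p(G)] + H[q]

# Maximizing elbo(x) with any torch optimizer turns graph learning into optimization.
```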
Of course, without theory we cannot claim this is causal, so let me go into the details. In the paper we prove that, with infinite data and a couple of other standard assumptions - things like the Markov property, faithfulness, and no latent confounders - all detailed in the paper, we are able to identify the correct causal graph; under infinite data we can find the correct causal graph. So, to summarize the first method we propose: given the data, we construct an objective that is in line with score-based discovery, convert it to an evidence lower bound, and thereby turn the whole thing into an optimization problem, casting graph learning as an approximate inference problem. This is extremely flexible because it uses normalizing flows, and extremely scalable because all the good techniques from modern deep learning optimization can be used here. You learn the graph - the learned posterior is your graph - and meanwhile, because you are doing the optimization jointly, you have already learned the functional relationships implicitly. Now Nick will tell you how we can do treatment effect estimation and decision making from here. Thank you.
So we have seen how to learn the graph and how to learn the functions by having a deep learning model with the graph as a latent variable and optimizing everything jointly. Now the question is how we actually get those treatment effect estimates out of it. You might say: we have a generative model, it should be straightforward - and it kind of is, but there are a few small nitty-gritty details I should talk about. So let's talk about our graph posterior first. Because we have this mean-field variational inference, as Cheng was saying, what we really have is a distribution over graphs: we might find one graph and say this is the correct graph and assign it a certain probability, but find another graph, with one additional edge here and one missing edge there, to which we assign a lower probability - but it is still not zero. Because we cannot know which of the graphs in our distribution is the correct one, we need to look at the full distribution when we do our treatment effect estimation. The way we get the ATE or CATE out is to take a Bayesian view and do model averaging: we calculate the expectation of the ATE or CATE, marginalizing out our graph. What does that look like in practice? Just to remind you, the average treatment effect is itself an expectation: we want the expected outcome given some intervention minus the expected outcome under another intervention. How do we estimate this? First of all,
we do what we have previously done when we have a single graph: we look at our intervention and cut the corresponding edges - actually mutilate the graph - to get rid of any dependency that the treatment variable has on its parents, which we would otherwise also use in our deep generative model. So rather than doing the forward propagation for treatment effect estimation on the original graph, we actually put in this new graph with the edges cut and the treatment variable set to a fixed value. We are only talking about atomic interventions here, for anybody who might be wondering about more complicated things: we say let's set this treatment variable to a single value, we cut the edges, and then the magic sauce really boils down to Monte Carlo estimation. The problem is that we have a distribution over graphs: we can sample from it, but it is hard to marginalize out in closed form - not just because it is a deep learning model, but especially because it is a deep learning model that depends on the graph. Anybody who has done deep-learning-based probabilistic computations before will know that marginalizing deep learning models analytically is really hard and basically impossible.
So what we do instead is sample a lot of graphs. For every graph that we sample - this is a very graphical example - we mutilate it, put our intervention in, and then sample from the interventional distribution of our outcome variable; we use our neural networks to forward-propagate. If you have a chain of edges between your treatment variable and the outcome variable, what you need to do is set your treatment variable, estimate the intermediate nodes, put those new values in, and run this forward until you reach the outcome variable.
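A minimal sketch of that forward pass (assumed data structures: a topological order, a parent list, per-node mechanisms `f[node]` and noise samplers `noise[node]`; intervened nodes are simply clamped):

```python
import numpy as np

def forward_simulate(order, parents, f, noise, interventions, n):
    """Ancestral sampling of the mutilated model: visit nodes in topological order,
    clamp any intervened node, and sample every other node from its parents plus noise."""
    samples = {}
    for node in order:
        if node in interventions:
            samples[node] = np.full(n, interventions[node])      # do(X_node = value)
        else:
            if parents[node]:
                pa = np.stack([samples[p] for p in parents[node]], axis=1)
            else:
                pa = np.zeros((n, 0))
            samples[node] = f[node](pa) + noise[node](n)         # x := f(parents) + z
    return samples
```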
You do this both for the first intervention and for the second intervention, and for all the graphs that we sample from the graph posterior. Given all of those samples, we then simply calculate the average of the samples of our outcome variable under one intervention minus the average of the samples under the other, across all the graphs and all the samples we get, and this then gives us the average treatment effect.
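Putting the pieces together, a minimal sketch of this ATE estimate (the `sample_graph` / `simulate` methods are hypothetical stand-ins for the learned graph posterior and the forward simulation above; the adjacency convention is A[i, j] = 1 for an edge i -> j):

```python
import torch

def estimate_ate(model, treat_idx, value_a, value_b, n_graphs=50, n_samples=500):
    effects = []
    for _ in range(n_graphs):
        A = model.sample_graph().clone()         # G ~ q(G), the learned graph posterior
        A[:, treat_idx] = 0                      # mutilate: cut all edges into the treatment
        y_a = model.simulate(A, {treat_idx: value_a}, n_samples)   # outcome under do(X_t = a)
        y_b = model.simulate(A, {treat_idx: value_b}, n_samples)   # outcome under do(X_t = b)
        effects.append(y_a.mean() - y_b.mean())
    return torch.stack(effects).mean()           # Monte Carlo average over graphs and samples
```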
Taking the difference between them, we are good, and we can do that. A question: where the parent is - I guess that would be t here, right? Yes, t is the treatment. So let's say, for t, are you sampling from the distribution of x_t? No, x_t is the treatment variable, so we set it to a fixed value. But if we had, say, a variable in between x_t and x_e - actually, I'm asking about the adjustment sets; I didn't mean to ask about t. Right, so for the adjustment set: yes, we do sample all of that as well. All right, but I don't see that on your slides - where is that? You still need it, even for the ATE. Well, yes; they are not on the slides here because this is already marginalized - it is the joint, multi-dimensional distribution. Right, but you could also have used a non-parametric estimator for the adjustment set given a graph structure - just the raw data distribution. It is more a question of why we would even want to find an adjustment set: now we know the whole graph, so there are a couple of ways forward. We could find the adjustment set and do it the more traditional way, but now that we have a scalable deep learning model, even with the full graph we just apply the simple graph mutilation rule, like in the previous slide, so we do not even need to consider the adjustment set - we've got everything;
we also do the forward simulation all together. Basically, to walk you through this example: we have a generative model that knows how to sample x_0, so we sample from that; we know we do not have to simulate x_t, because x_t is our intervention; and then to sample x_e we pass in our samples of x_0 and our treatment - multiple samples of x_0, giving multiple values of x_e. Thank you for clarifying, that does answer my question. I just want to point out that you could also have used a non-parametric estimate for x_0, since you could have just used the original data distribution; with what you're doing here, any error in estimating p(x_0) would carry over - but I think this does make sense. Yeah, exactly - this is essentially a non-parametric sample from the data; if we assume our marginal distributions are estimated perfectly, you could draw the same samples.
A follow-up question: if you have a lot of data, what do you need to estimate to get a generative model out of it? I understand you can estimate a graph, but it seems like you need to first estimate the graph. Well, using the generative model to estimate the graph is one thing, and the second thing depends on the treatment effects you want to estimate. For ATEs there is a lot you can do with just the data distribution, but we use the full deep generative model because you can then decide post hoc which interventions you want to run - you can decide that you now want to intervene on x_0, or on x_c rather than x_t, after you have trained the full model on all the data. That gives you a lot of flexibility at so-called test time, or deployment time, and it lets you look at a lot more properties. And for the next thing I was about to talk about, conditional average treatment effects, the problem is that you will have a harder time estimating those from observational data alone with a non-parametric estimate, whereas we can do some neat tricks with generative models to actually estimate them.
Cool - so that was the normal average treatment effect: as mentioned, we forward-sample in our generative model, having mutilated the graph and set the treatment assignment. Obviously everything we have done so far assumes no unobserved confounding, which is the same assumption we have made previously and that is made most of the time in causal inference; otherwise some of this fails, but we will get to that later. Talking about the conditional average treatment effect, though: if we want to actually condition - quickly going back to show you the graph again - if you want to condition on x_c here, where x_c is a descendant of the adjustment set, or of the conditioning set you need to marginalize over, and does not have a direct impact on your outcome variable x_e, you will have a hard time, because you would need to apply Bayes' rule and marginalize some things out, which is very difficult, as those of you who have done Bayesian deep learning or more Bayesian probability theory will know. The problem is that we cannot easily estimate the conditional distribution of x_0 given x_c, because we have deep learning components for all of that and you would basically need to invert one of the edges in our model. What we do instead is just use another model
as a surrogate model. We do not use a deep learning model here because we want to be a bit more robust. What we do instead is sample a lot of data from our interventional distribution - not conditioning on the conditioning variable, but simply from the distribution with the intervention applied. We can then train a model that predicts the outcome variable as a function of your conditioning variable. Rather than using the conditioning variable as is, we lift it with a set of random Fourier features to make this a bit non-linear, and then use a linear surrogate model - essentially just running a linear regression on those features - so we end up with a model that predicts the outcome given the conditioning variable, or the conditioning set, on which you want to evaluate your conditional average treatment effect. You actually need to train two of these surrogate models, one for your first interventional value and one for your second; you just sample a lot of values to train the surrogates,
and the conditional average treatment effect then becomes the difference between those surrogate models evaluated at your conditioning value. You are still marginalizing over graphs, so there is another layer of complexity in there, but otherwise you do the same thing as we have done previously, just adding those surrogate models.
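A minimal sketch of one such surrogate (illustrative, not the paper's exact recipe: random Fourier features of the conditioning variable followed by least-squares regression onto interventional samples of the outcome):

```python
import numpy as np

rng = np.random.default_rng(0)

class RFFSurrogate:
    """Linear regression on random Fourier features of the conditioning variable."""
    def __init__(self, dim, n_features=100, lengthscale=1.0):
        self.w = rng.normal(0.0, 1.0 / lengthscale, (dim, n_features))
        self.b = rng.uniform(0.0, 2 * np.pi, n_features)

    def _phi(self, xc):
        return np.cos(xc @ self.w + self.b)

    def fit(self, xc, xe):
        # xc, xe: samples drawn from ONE interventional distribution of the generative model.
        self.coef, *_ = np.linalg.lstsq(self._phi(xc), xe, rcond=None)
        return self

    def predict(self, xc):
        return self._phi(xc) @ self.coef

# CATE at a conditioning value: the difference of two surrogates, one per intervention value,
# e.g. cate = surrogate_a.predict(xc_query) - surrogate_b.predict(xc_query)
```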
So the surrogate model is predicting the outcome variable? Going back two slides: for the surrogate model, basically, we sample a lot of x_e's given our treatment, and we train a model that predicts x_e from x_c, so that we can condition on x_c to find the value of x_e given that we ran this intervention - so that we can effectively run this arrow backwards, which is otherwise hard to do. Right, I understand the motivation, but then don't all the other variables become unobserved confounders for your surrogate model? They are not unobserved confounders, because we trained the surrogate model only on the interventional distribution. All right. Because we only use the interventional samples for this surrogate model, it learns to model the interventional relationship, and the interventional data is generated by your generative model.
If I can just add something before the next slide: mentally, you can think of this type of model as building a huge simulator - the deep generative model becomes a huge simulator that respects the true causal structure. So for all of these things, whether conditional or interventional treatments, what we can do is brute-force the simulator to generate a lot and a lot of data, and that is what the surrogate model is trained on, so it ends up being a really good model. I think Nick has a visualization. Yes, exactly - it's a very neat
graphic that one of our collaborators made for the paper. What we see here is, again, the correct graph on the top left with our four variables x_0, x_c, x_t, and x_e. We want to calculate the conditional average treatment effect of the treatment variable x_t on the outcome variable x_e, conditioned on this conditioning variable x_c, and we have some observed distribution over here: the joint distribution of the outcome variable x_e on the y-axis and the conditioning variable x_c on the x-axis, which is literally just a density plot of our observations. We have then learned a generative model to estimate this observational distribution, which is what we show in this column: the middle column on the left shows the observational distribution from our generative model using the correct graph, where gray is the true observational data and blue is our learned observational distribution - they are fairly similar, and fairly good. We also see that the second graph we have learned has a different observational distribution, but it is still a very good fit to the data distribution, which is exactly the problem with identifying causal structure from data: if you have different models with the same or very similar observational distributions, it becomes harder to distinguish between them. So we are stuck with those two graphs, and we now want to look at the interventional distribution
of x_t on x_e, which we show on the right, where we have different interventional distributions for different intervention values. The red one uses our intervention setting x_t to one value, and the blue one is our reference or baseline where we set x_t to a different value. You can then look at the outcome just by drawing this line at the conditioning variable x_c - say we want to condition on x_c equal to two - and then you find the value of the blue distribution and of the red distribution and take the difference between them to get the true conditional average treatment effect. There's another question.
How are you using the density over x_e in this procedure - what is it needed for? Is it just for visualization, or do you have to fit it? This is just a histogram, like a kernel density estimate of the relationship, purely for visualization. We already have the joint distribution of everything for the entire data set; if we had, say, a hundred variables we could plot this for any of them. So this is a visualization to support the explanation. I understand, but in the procedure, the surrogate that you're learning - are you modeling x_e as well? Okay, I'll come back to that in a second.
Just to show what is actually on the slide: this is the ground truth that we want to compare to, and for actually estimating the CATE, we take our first graph and sample a bunch of variables under the intervention and under the reference distribution. So we have our interventional distribution here, the green one: we sample a bunch of variables, transform the conditioning variable using random Fourier features, and learn the surrogate model, which gives the green curve - non-linear because of the random Fourier features. We then sample a bunch of variables from our reference intervention, again get our outcome variable x_e, and learn a surrogate model, again using those random Fourier features, that regresses x_e - your y variable, on the y-axis - on x_c, on the x-axis. So you get those two lines, you plug in your conditioning value of two, which is this dashed line here, you find the value of your intervention surrogate model and the value of your reference surrogate model, you take the difference, and you get a conditional average treatment effect value, which here is 2.33 - not perfect, but decently close. Oh, does somebody want to read out the question in the chat at the moment?
Can you hear me? Yep. Okay, so there is a question in the chat which says: why do you need the surrogate - can you sample from x_c and x_e when you do the intervention and keep just the samples where x_c has the correct values in the conditioning set? Yes, we can do that. The problem is there might not be many such samples, or, if x_c is continuous, there might not be any value that is exactly your conditioning value, and especially if you have a multi-dimensional conditioning set it will be very hard to hit exactly the value you want. You would then need to come up with some other kind of matching algorithm, defining some distance function between conditioning sets to say this is close enough to the actual conditioning set I care about, and then average only those samples. So this is just one choice of doing it; you could also use such a matching algorithm and find samples close to the conditioning values you care about - it is just a different way of approaching this.
Thank you. So, going back to this: we trained the surrogate model on our first graph and its set of interventional distributions; for the second graph we do the same thing, where we sample values from the interventional distributions given the two different intervention values, learn surrogate models, take the difference, and get a second value, which is 1.1. Given that this is the wrong graph, it is expected that this value is not perfect and not as good as the previous one. We then take the weighted average of the different CATE estimates, taking the probability of each graph into account, and arrive at our final CATE estimate, which in this case is 1.84 - close enough to the true CATE to actually be valuable and useful.
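That last averaging step is simple; a sketch (the per-graph estimates and weights are placeholders, not actual outputs from the talk):

```python
import numpy as np

def posterior_weighted_cate(per_graph_cates, graph_probs):
    cates = np.asarray(per_graph_cates, dtype=float)   # e.g. one CATE estimate per sampled graph
    w = np.asarray(graph_probs, dtype=float)           # approximate posterior mass of each graph
    return float((w / w.sum()) @ cates)                # posterior-weighted model average
```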
So we can also do ATE estimation using those surrogate models. Yes, we could have used something else; this was just our choice, to have a decently robust way of estimating those conditional average treatment effects for any type of arbitrary conditioning set, be it a single variable or multivariate. Cool - so, too long, didn't read: we have a pretty cool, in my opinion, deep generative model that combines normalizing flows and neural networks with variational inference for learning the graph as well as the functions from observational data. It just takes some noise in, and you can then sample from the observational distribution as well as your interventional distribution, and can calculate average treatment effects as well as conditional average treatment effects - and people can even do things like individual treatment effect estimation, which we have not talked about here at all but is fairly simple using this model. The neat thing really is that you only need to train it once: you have your observational data, and you take all of this as well as some idea about the graph.
If you have some domain knowledge, as Cheng mentioned earlier, that helps a lot: you can say 'I believe this edge should be there and I give it a 70 percent probability', or 'I know that revenue is an effect variable and it should not have any descendants' and put a hard constraint in. You take all of this domain knowledge together with the data, and we then learn the graph, the functions, and this whole deep generative model, and you get a few things out of that. First of all, you have a graph, which in itself can be useful - think of gene regulatory networks learned from data, that could be something cool - just finding out which interactions are in the data. But you can also fill in the table that Cheng was talking about: getting those interventions, those counterfactuals, out, so you know the effect you would see if you ran a certain intervention on an individual. You can then use this to calculate average treatment effects, individual treatment effects, and conditional average treatment effects; but you can also, as a third thing, use it for personalized decision making at scale, and this is the main thing we care about: you can run all of this at scale with a relatively minor upfront cost, by training this model once, and then you have a lot of downstream applications. Please - can you hear me?
Yeah. So, given the difference between categorical and numerical data, can we have joint numerical and categorical data as the input data matrix? Yes, that is actually one of the next slides. Previously, the main thing we have been talking about is this non-linear additive noise model using gaussian noise, only looking at continuous variables with everything observed, but in the actual paper and the code base we relax those assumptions quite a lot. We use a normalizing flow as the noise model as well - a spline normalizing flow that allows us to have more complex noise distributions; we allow for categorical, binary, or any other type of discrete data, as well as grouping variables into a multivariate variable on a node, so that you can deal with any type of data, really. And lastly, we have also added an imputation network so that you can fill in missing observations in real-world data, which is something we observe quite a lot and which is quite important. Does that answer your question, or did you have more questions around the categorical data? Yeah, that's enough, thank you.
Thank you. That was just to give you an overview; there are many more features that we already deal with, but there is also still a lot more to do. However, we have already used this for some scalable applications internally, where we work with graphs at hundreds of variables - we have 500 or so raw values that we group into 100-and-something variables - and we get decent graph discovery performance out of that, where we then talk with domain experts who tell us that it actually makes sense, and we also use the treatment effect estimations to provide business insights. So this is really scalable, really usable, and gives good performance.
However, as I said, a lot of things are missing; one of them is, for example, estimating effects over time. I am going to run quickly over this, just to talk about some extensions: we have had some recent work extending this into the time series domain, because in a lot of cases you might actually be interested in, if I start taking an action now, or I have a few actions I can take, how does something change? For us, revenue is something that matters - how does the revenue change over time? And there might be differences: if you look at the outcome in two months then action B is really good, but if you look at the outcome in five months, action A might be really good. So the time series domain and the time dimension really matter, and there are some nice properties to it: when you look at the temporal graph, it has this autoregressive nature - edges cannot go backwards, the present or future does not influence your past - which is a really strong constraint that already makes the causal discovery a bit easier. I am not going to go through the details or say much more about this right now, but there is a paper online already, and otherwise just come talk to us; there we have a very flexible functional form that can deal with all types of noise and temporal data. That is one extension. I think there might be some applications even in biology - I know it is really hard to measure cell states over time - but there might be some other applications, in healthcare at least. Cool - so I think there are some really cool things to do there.
The second thing that we have also had some work on is actually dealing with confounders. In a business scenario again, you would say some sales are impacted by whether you give a discount or not, and by some other observed variables, but there is also something like the economic situation in the world, which has a really high impact on that. We can then start modeling this. One limitation is that we cannot have a direct edge between the variables that the confounder is confounding: in our case here, the economic situation has an impact on, say, salaries and sales, but to make our method work we need to assume there is no direct edge between those two variables, which is the so-called bow-free assumption. It is a first step, at least; making unobserved confounding work in our framework, or in causality in general, is a hard problem to solve.
So overall, what we really do is bring causality and deep learning together. Causality gives us a lot of good theory to prove that we find the correct causal graph and can estimate those causal effects, but some of its impact has been limited because of very restrictive assumptions and low scalability to large data. On the other hand, we have deep learning, which has had some really good impact with image classifiers and large language models - just last week, again, with the new OpenAI model - but all of them are correlation based and do not really tell you much about the actual causal reality. We are bringing both of them together to build scalable causal AI solutions. Another thing to talk
about here as well is that we are transforming the traditional causal pipeline. Previously, you would have a human specify a graph and give you some data from that specified graph; you would then do a causal identification or verification step, where you do something like do-calculus to find a way of estimating your treatment effect given the graph and the data you have - essentially getting rid of all the other nodes you do not actually need - and using this so-called estimand you would then run a causal estimation step, for example some regression algorithm, to get to the causal effect. We are really transforming this into a large-scale general solution where you can provide some incomplete prior knowledge, or possibly none at all; we run causal discovery and causal inference in the same algorithm and really do this in an end-to-end, general pipeline and method. This is our real approach to solving real-world problems, by putting all of those things together.
And we are on trend here: causal machine learning is one of the highly emerging topics according to recent technology reviews, and it is at the beginning of the innovation cycle, so we believe there is a lot of impact we can still have by working on this a bit more. We see the same thing when we look at causality papers on arXiv, so I can only encourage you to think about causality more, because it is a really cool, newly emerging field and a lot can be done. One way to do that is to look at our code on GitHub - all of it is open source, so you can just download the framework and run it; if there are any questions, email us or raise a GitHub issue. Also, for any students or anybody else out there, we are hiring for intern and full-time positions right now; there are links down there, send us an email, or just come ask a question. And this is really it: we hope that scalable, real-world-applicable causal AI can have a lot of impact in a lot of domains - not just business, but we have also worked with education, where we had a competition with actual interventional data showing that a lot can be done, and also healthcare, science, and all of that. Thank you. Hello - hey, can I ask a question? Hello.
There is a question in the chat; if anybody in the audience has questions, please feel free to go to the microphones. Go ahead. Great - thanks for the talk. I had two questions. One, I guess, was about the treatment effect estimation: the posterior your approach assigns to graphs says something about how well each graph fits the observational data, so for this to work, the approach assumes that correctness in terms of fitting the observational data should also reflect correctness of the treatment effect. How does that assumption really work?
I guess there are different ways to think about it. The main thing, to preface this, is the theory that Cheng showed: we have a proof that the graph we learn is the correct one, given that the data-generating process is part of the model class we assume, we have infinite data, and training fully converges to the global optimum. You see there are a lot of ifs; if all of this is true, then we get the correct graph, and in a lot of easy scenarios we see that we actually converge to a single graph, which is the correct one, and that is fine. We only really need this distribution over graphs because we have limited data, there might be some mismatch in the modeling assumptions, or training deep learning does not always converge to the global optimum. That is why we really start using this graph posterior. And yes, you could say let's just use the most likely graph, which is most probably the correct graph - it gives really good results, and in all the benchmark tables we have in the paper we include it as a baseline, and it gets good results. But it is not always the best performance, simply because you might be somewhat off, or you might have, say, five graphs each with about 20 percent probability where one of them has 21 percent, and you do not want to overweight the single graph that is just a tiny bit better than all the rest; you do get some performance from including the others. We improved performance by doing that, in most cases at least.
And the second quick thing: I guess with your dependence on normalizing flows you are not addressing the causal representation learning issues so much, so how do you envision that fitting in downstream? Yeah, I guess this is a different-side-of-the-same-coin kind of question. Right now we have really focused on doing deep learning or machine learning for causality, not necessarily the other direction, which you could call causal representation learning. There is a lot of work out there from a lot of groups - Bengio, Schölkopf, and so on and so forth - and there is not an easy way to use our model for that; you could run our graph discovery algorithm on, say, images and try to find relationships between pixels, but I do not believe that is going to be really useful, so there is still a lot of work to be done there. Let me add something: I do not think causal machine learning and causal representation learning are two different things at all - it is actually the same. Representation learning is learning latent confounders; the latent confounder, like in the bow-free work we were doing, is itself a representation. I guess I am just interested in when you do not have, you know, the pixel-level variables. Exactly - so the same algorithm for latent confounding can run on that type of data; it naturally finds the grouping. We have a latent confounder with multiple observations, and that latent confounder is the representation, which is representation learning. For example, there is a lot of recent work - Kun Zhang has recent work - about using latent confounders and all these theories, and solving exactly this, evaluating on image data and representation learning. So theoretically, under the hood, there is a lot of common theory; the implementations differ, in scale and all of this, but underneath, latent confounder learning and representation learning share the underlying theory. Great, thank you so much for your answers.
Okay, there are a couple more questions in the chat. One says: normalizing flows are well suited to the gaussian noise model; have you tried diffusion normalizing flows for sharper, more extreme noise distributions? Given the variational approximation, does that even make sense to try for graph discovery? And the person says: perhaps not, since you upgraded to splines and used Monte Carlo estimates for inference. So let's first go to the normalizing flow question, and then I might ask you to repeat the second part. Normalizing flows versus diffusion models: the way I understood this question - first of all, we use normalizing flows precisely so that we do not use gaussian noise. We talked about this noise assumption, the noise that is added onto your structural equation model when you say your data is a deterministic function of its parents plus noise - that is where this additive noise term comes from and why it is called an additive noise model. Originally we used gaussian noise there, but once we actually use splines or other normalizing flows, this noise term is not gaussian anymore. Yes, our base distribution is gaussian, but the actual noise distribution that we add to the variables and use as a likelihood term is not gaussian anymore - at least, I would say it is not gaussian. We have not tried using diffusion models yet as a likelihood model, or to learn this noise term, but it is certainly something interesting to try. Pedro Sanchez from the University of Edinburgh, working with Sotirios - I always forget his last name - look up Pedro Sanchez; he has some work on diffusion models for counterfactual inference,
but it is not integrated into a causal discovery framework yet, doing all of this end to end. Can I just add something? Running this is very easy, but the thing is, you have to prove it is causal and has identifiability, and that is challenging. For the moment, the z we use corresponds exactly to the additive noise in the structural equation model, but for a diffusion model, where you add noise through the diffusion process, what does that mean for the causal system? This is unclear. There is also research that I and others are doing, for example taking a whole dynamical-system point of view on causal systems - I think we have work on optimal transport for causal discovery that is under review, and I think a group in Amsterdam has a dynamical view on causal discovery. So being able to reach the theory is, I think, one of the challenges, and it depends on which view you are taking; we do not want to just make something run, we also care about why it should work and whether it is supposed to give us the correct results when the assumptions hold. What was the second part of this question, about the variational inference?
Yeah, so the second part was: given the variational approximation, does it even make sense to try, for the diffusion models, I guess - and the person says perhaps not, since you upgraded to splines and used Monte Carlo estimates for inference. Yeah, I do not fully understand that part of the question; maybe they can clarify it. Okay, sounds good - if the person can clarify; otherwise, I can read the next one, which says: recent papers have noted the sensitivity of NOTEARS-type methods to rescaling the data; can you comment on this, and is your network affected by this as well? And there is a reference here.
Right - so for NOTEARS, even in the original paper they already discussed this sensitivity. In NOTEARS it comes from the score they use and how the score is defined: the least-squares part is not scale-invariant. I have seen many models use a similar loss, but if the loss is changed to a likelihood-based loss, there is no sensitivity to the scaling anymore, because you are also fitting the marginal. Because we use a likelihood-based score, we are not sensitive to the scaling of the data, and in the paper we even have experiments showing results with scaled and unscaled data. That's great, thank you so much. Any other questions from the audience? Well, thank you so much, Nick and Cheng, that was very interesting - so let's thank the speakers again.