
AI Trends 2023: Causality and the Impact on Large Language Models with Robert Osazuwa Ness - 616

By The TWIML AI Podcast with Sam Charrington

Summary

Topics Covered

  • Deep Learning Finally Links Causal Discovery to Real Tasks
  • Simulators Are Causality's Missing Ingredient
  • Humans Use Counterfactual Simulation for Causal Judgment
  • LLMs Already Learn Causal Nuance—We Just Can't Verify It
  • Causal Models Could Make LLMs Trustworthy

Full Transcript

All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Robert Osazuwa Ness. Robert is a senior researcher at Microsoft Research, a professor at Northeastern University, and founder of Altdeep.ai. Before we get going, take a moment to hit that subscribe button wherever you're listening to today's show, and of course you can follow us on all social media platforms, including TikTok and Instagram, at @twimlai. Robert, welcome back to the podcast.

Thanks for having me back, Sam. I'm looking forward to digging into our talk.

It's been just over three years since the last time you were on the show. It's hard to believe that it's been that long, and you've had, shall we call it, a change of scenery. So why don't we take a minute and start by having you update folks on what you're up to nowadays?

Sure. As you said, I'm at Microsoft Research, and my research generally focuses on probabilistic machine learning; I'm very much interested in the intersection between that and causality. I also work with some other causal inference researchers on tools for causal reasoning. This includes a library called PyWhy, an open-source library of which Microsoft is a major contributor, and now there are other companies involved. So: making those tools useful for data scientists and analysts and executives who are trying to do causal reasoning and build causal reasoning workflows. Some of those tools are no-code tools. And of course I'm doing a lot of fundamental research at the intersection of probabilistic machine learning, causality, and, recently, large language models, and, generally speaking, on how we can make formal algorithms for reasoning, including causal reasoning, more accessible to the broader public.

This episode, of course, is part of our trends series, so we'll be talking about the research in the field of causality and causal modeling over the past year, and kind of the outlook from your perspective. But one question I'd like to start us off with is your general reflection on the space over the past year.

I think the last time we spoke, it was right on the heels of Yoshua Bengio's "System 1, System 2" talk at NeurIPS, which, in my estimation, put causality on the map, and in the minds and mouths, of a lot of the machine learning and AI community. Then, not long after that, we went into the pandemic, and there was a lot of desire to try to take advantage of causal reasoning and causal models in healthcare and things like that. Do you think that interest has continued to accelerate, or has it slowed down? I hear about it a little bit less, which is maybe where my question is coming from.

Well, sure. After Yoshua Bengio's talk, I think you're right that it put causality on the radar for a lot of machine learning researchers. In thinking about what the trends were for last year, in preparation for our talk, I did end up thinking a lot about causal discovery methods and causal representation learning methods that were directly inspired by the seed Bengio planted that year. But also, to your point, during the pandemic I think a lot of people realized that when you have a so-called black swan event like a pandemic, a lot of your historical data tends not to be as useful as you would hope. As people were scrambling to apply their expertise to solving problems in that space, they realized: okay, we need some causal representations in our model, some model of intervention, in order to adequately model how we're going to make decisions, policy decisions with respect to epidemiology, as well as to understand the molecular biology of the disease. So I think you're right that the pandemic also shifted sights back toward causality. In the last year, perhaps you haven't heard as much about causality as about other growing fields, but I see a lot of connections with some of the things that were more popular last year. The topics you've probably been talking about with other guests, the ones that were popular last year, were certainly big themes in the causal workshops we saw at the big machine learning conferences, and I'm happy to dig more into those ideas. Broadly speaking, I think there's a lot of momentum pushing causal reasoning forward in the space of machine learning, and I'm excited about what's going to happen in 2023.

That's awesome. And I think one of those fields that you've already mentioned, and that we'll spend some time digging into, is everything that's going on around large language models, ChatGPT, especially at Microsoft. But before we dig into that topic, we asked you to think about some of the major trends in the field, and the first one you wanted to highlight was deep learning methods for causal discovery. Tell us a little bit about that field, or that area of research, and what's important there.

So, causal discovery, and I think we should define terms, is learning from data what causes what.

Another way of thinking about it is causal structure learning from data: typically you're trying to learn a directed acyclic graph, a DAG. And I would say that before 2022 it had already become common to cast causal discovery as a continuous optimization problem: you would take the space of DAGs, or some other graph generalization, figure out some kind of continuous representation of that space, and then try to optimize over that space.
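To make that concrete, here is a minimal sketch of the continuous-optimization idea, in the spirit of NOTEARS-style methods (illustrative only, not any specific paper's implementation): parameterize a weighted adjacency matrix W and penalize the differentiable acyclicity measure h(W) = tr(exp(W*W)) - d, which is zero exactly when W encodes a DAG.

```python
import torch

def acyclicity(W: torch.Tensor) -> torch.Tensor:
    # h(W) = tr(exp(W * W)) - d; equals zero iff the weighted graph is acyclic
    d = W.shape[0]
    return torch.matrix_exp(W * W).trace() - d

def fit_dag(X: torch.Tensor, lam=0.1, rho=10.0, steps=2000):
    # X: (n, d) data; fit a linear SEM X ≈ X @ W while pushing W toward a DAG
    d = X.shape[1]
    W = torch.zeros(d, d, requires_grad=True)
    opt = torch.optim.Adam([W], lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        recon = ((X - X @ W) ** 2).mean()                       # data-fit term
        loss = recon + lam * W.abs().sum() + rho * acyclicity(W) ** 2
        loss.backward()
        opt.step()
    return W.detach()
```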

And just to continue your effort to define causal discovery: in papers, we might just write down a causal graph, write down the relationship between one actor in a relationship and another, and do that based on intuition or what have you. What you're talking about with causal discovery is the idea that we've got a bunch of data, kind of like what we're trying to do with deep learning; that data captures relationships and patterns and all these things, and the question is how we extract those relationships, that graph, that pretty picture you might see in a paper, from the data itself.

Right. And, putting my own personal spin on it, this was something I spent a lot of time working on, particularly during my PhD, in areas like systems biology. I had seen a trend then of trying to scale these algorithms to larger graphs with more nodes; the space of graphs is obviously super-exponential, so there was a lot of focus on just getting bigger graphs. That particular approach seemed divorced from the practical use cases for learning a graph, because typically, if you were going to do some kind of downstream causal reasoning task using a graph, you didn't need a giant hairball of causal edges; you needed something very specific to the domain of the problem you were working on, and then likely some subset of those nodes to reason about to actually solve the downstream task. So I had become a little bit jaded, to be honest, and when people started switching to using, say, the Adam optimizer to solve this problem, I would kind of shrug my shoulders, like, okay, I guess that's useful.

But can you give a concrete example of the kind of relationships you might be looking at, in your PhD research and since?

So what was happening was that in those past efforts, I had recognized that it's important to treat causal discovery as a step in a downstream task, and that you might want to orient your analysis with respect to whatever that downstream task is. In the case of biology, it was something like drug discovery: if you learn a graph and then want to reason on that graph to figure out where a drug should hit the system, well, there's some uncertainty around that learned graph, and maybe you should propagate that uncertainty down to the selection of the intervention. But I was typically focused on learning one graph, and learning as big a graph as possible, so I became a little disenchanted at the time and switched to working on probabilistic machine learning and probabilistic programming. I kept watching causal discovery at conferences, and saw that we had started using deep learning frameworks, and their ability to optimize in high-dimensional spaces, to essentially solve that same problem, but now optimizing over a continuous space. And last year, in 2022, the trend I saw was actually connecting the learning of causal structure to downstream machine learning tasks. I thought that was a really interesting development, and I was excited about it.

So what are some examples of that connection? You talked about it in the context of the biology use case, but from a pure machine learning perspective, what does it mean to connect causal discovery to the downstream machine learning?

Sure. For example, there's a paper called "On the Generalization and Adaptation Performance of Causal Models." It uses an idea from causal inference called independence of mechanisms. The idea is that suppose you have some cause and some effect, two variables, one a cause and one an effect. The mechanism that drives the cause is separate from the mechanism that drives the effect given the cause. From a probabilistic standpoint, that means that if you model the probability distribution of the cause and the conditional probability distribution of the effect given the cause, each with its own parameter vector, then because of the independence of mechanisms, those parameter vectors will be orthogonal. From a learning standpoint, if you're learning the parameters for your entire system, you would like to use that orthogonality inside the learning process. What this paper did was look at adaptation. The idea: if I have the correct direction between cause and effect, and I bring in a new data set that I didn't train on, so I've done some pre-training and I bring in a new data set, then the only parameter vector that should update is the parameter vector of the cause, because the parameter vector of the effect given the cause is stable. If that's true and you have the right graph, you have fewer parameters to update in order to adapt to the new data set. So they use that intuition to basically say: the more right my graph is, the faster it should adapt to the new data set. And they use that speed of adaptation as a signal inside the optimization toward the causal graph.
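A toy sketch of that adaptation-speed signal for the two-variable case, loosely in the spirit of the meta-transfer idea (all names and details here are illustrative): pre-train a model for each candidate direction, then compare how quickly each recovers log-likelihood on post-shift data.

```python
import torch
import torch.nn as nn

class TwoVarModel(nn.Module):
    # Models one factorization p(A) * p(B|A) over two 10-category variables.
    def __init__(self):
        super().__init__()
        self.marginal = nn.Parameter(torch.zeros(10))   # logits for p(A)
        self.conditional = nn.Linear(10, 10)            # logits for p(B|A)

    def log_prob(self, a, b):
        log_pa = torch.log_softmax(self.marginal, 0)[a]
        log_pba = torch.log_softmax(
            self.conditional(nn.functional.one_hot(a, 10).float()), -1
        ).gather(1, b[:, None]).squeeze(1)
        return (log_pa + log_pba).sum()

def adaptation_score(model, a, b, steps=20, lr=0.1):
    # Fine-tune on shifted data; higher accumulated log-likelihood means
    # faster adaptation, i.e. evidence for this causal direction.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total = 0.0
    for _ in range(steps):
        ll = model.log_prob(a, b)
        total += ll.item()
        opt.zero_grad(); (-ll).backward(); opt.step()
    return total

# Pre-train m_ab on p(A)p(B|A) and m_ba on p(B)p(A|B) with the same data, then
# compare adaptation_score(m_ab, a_new, b_new) vs. adaptation_score(m_ba, b_new, a_new)
# on data collected after an intervention on the cause.
```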

of adaption are we talking about um you know what we might typically think of as like a convergent convergence rates and things like that

like meta parameters of the training process as opposed to some some type of parameter or hyper parameter so again

just like learning rate with respect to out of distribution data is that I I don't or or speed or or more generally kind of how fast it it converges in the in the out of

distribution data yeah is that um in your experience is that um kind of a novel idea to pull that

I would say it's not the first paper I've seen, historically, that uses this kind of causal invariance to enforce some modularity on the parameters being trained. But it's the first time I've seen it done in terms of how fast a model trained on a candidate graph adapts to out-of-distribution data.

And you identified some additional papers around this idea of causal discovery and its advancement over the past year. A lot of that work seems to be happening at DeepMind and at Mila, in Montreal.

Yeah, in fact I'm mentioning three papers by Rosemary Ke, who is at DeepMind, with collaborators at Mila including Yoshua Bengio, plus Stefan Bauer, Bernhard Schölkopf, and Silvia Chiappa at DeepMind. These people working with Rosemary have really led this trend of using downstream machine learning tasks to guide causal discovery. The other paper I shared was "Learning Neural Causal Models from Unknown Interventions."

This uses the intuition that if you have a generative causal model and you do an intervention on one of the variables, one of the nodes in the graph, the only things that should respond to that intervention are the things causally downstream of it. They're assuming in this paper that you have some intervention data, but, to make it more practical, the targets of the interventions are unknown, so they try to predict what the targets are. Given that you've predicted the right target, if you have the right graph, the only things that should respond are the variables in the graph that are downstream of the intervention target. So you compare that predicted intervention distribution to the actual empirical distribution of what responded to the intervention and what didn't in the training data, and they showed how you can learn causal graphs using that kind of signal. Does that make sense?

I think so. The idea is that the more correct your graph is, the less effect a given intervention will have on nodes in the graph that aren't downstream of where the intervention occurs.

And if you had the correct graph, they wouldn't be affected at all; the more right your graph is, the more you expect only things that are downstream to be affected.

And so do they then translate the degree to which the intervention impacts these other nodes into some kind of loss function, and then optimize over that?

Exactly. When I learned causal inference, things were very binary: either you're right or you're wrong. In causal discovery you would just define some scoring function and then optimize something like a penalized likelihood of the data over graphs. And of course there are causal discovery algorithms that use intervention data to learn the causal graph; this is the first time I've seen one where the interventions are unknown. Well, that's not quite true: I've seen a few papers in the past that softened the assumption that you know exactly what the intervention targets are. But I would say this one seamlessly integrates it into a loss function that you optimize in an unsupervised training procedure, and so I thought that was quite novel.
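As a schematic of that training signal (hypothetical code, simplified to a hard scoring rule rather than the paper's differentiable loss): a candidate graph is rewarded when the nodes that actually shifted under an unknown intervention match the nodes it predicts to be downstream of some guessed target.

```python
import numpy as np

def downstream(adj, node):
    # Nodes reachable from `node` in the candidate DAG's adjacency matrix.
    seen, stack = set(), [node]
    while stack:
        i = stack.pop()
        for j in range(len(adj)):
            if adj[i][j] and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def score_graph(adj, obs, interv):
    # obs, interv: (n, d) samples before/after one unknown-target intervention.
    # Treat nodes whose marginal mean shifted as "responders" (crude test).
    shift = np.abs(interv.mean(axis=0) - obs.mean(axis=0)) > 0.5
    responders = set(np.where(shift)[0])
    best = -np.inf
    for target in range(len(adj)):              # marginalize over guessed targets
        predicted = {target} | downstream(adj, target)
        best = max(best, len(predicted & responders) - len(predicted ^ responders))
    return best  # higher: this graph better explains which nodes responded
```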

Talk us through it: presumably you've got a known graph, and you've got unknown interventions present in some data. And for those unknown interventions, I think you said it's not only the nature and degree of the intervention, but also where the intervention occurs, that is unknown? Or is that too broad? Is there a simple toy example we could construct here to make it more concrete?

Okay, let's think of a randomized experiment. Say we're data scientists at a gaming company, and you're interested in the relationship between engagement with the game's side quests and the amount of in-game purchases. There's a common cause there: whether or not you're a member of a guild. People who are members of guilds tend to collaborate more, and thus maybe they engage in side quests less; and guild members might pay different amounts for in-game items, because they're going to share them and pool resources, for example. So suppose I wanted to run an A/B test, an experiment, to understand the causal effect of side-quest engagement on in-game purchases. Normally, when somebody logs on and goes through my digital platform, their level of side-quest engagement would be affected by whether or not they're in a guild. But what I'm going to do is modify the game dynamics for that player, and maybe coerce them into engaging more in side quests, or into engaging less. So I'm specifically targeting the side-quest-engagement node and setting it to a fixed value. That's an intervention, what's called an ideal intervention. It's set by a random policy, but sure, it's an ideal intervention.
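To put the example in code, here is a minimal structural causal model sketch of that scenario (all variable names, functional forms, and coefficients are invented for illustration): the do-operator replaces the natural side-quest mechanism with a fixed assignment, which is exactly the "ideal intervention" described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, do_side_quests=None):
    guild = rng.binomial(1, 0.4, n)                              # common cause
    if do_side_quests is None:
        side_quests = 0.8 - 0.5 * guild + rng.normal(0, 0.1, n)  # natural mechanism
    else:
        side_quests = np.full(n, do_side_quests)                 # ideal intervention: do(S=s)
    purchases = 2.0 * side_quests + 1.5 * guild + rng.normal(0, 0.1, n)
    return guild, side_quests, purchases

# Confounded observational estimate vs. interventional contrast:
_, s, p = simulate(100_000)
naive = np.polyfit(s, p, 1)[0]                 # regression slope, biased by guild
effect = (simulate(100_000, 1.0)[2].mean()
          - simulate(100_000, 0.0)[2].mean())  # ≈ 2.0, the true causal effect
print(naive, effect)
```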

Now, in this scenario, that's realistic data to have, but oftentimes you don't know exactly what node is being targeted. Earlier I talked about signaling pathways in systems biology. Say my intervention on the system is, I don't know, oxidative stress on the cell. Some of the nodes in my graph refer to the activity of certain proteins in the signaling pathway, but I don't know exactly which proteins that intervention is going to affect, or exactly what values it's going to set them to. That's a much more flexible, more general, intervention regime to be in, and they solve the problem for that regime: they don't assume you know exactly which nodes get targeted or what values they get set to. The idea is that you're collecting so much data that, on average, you're able to make inferences about the nature of these interventions and what nodes they're targeting.

what nodes they're targeting um the the I guess the scenario the the reaction I'm experience is the scenario seems so open-ended I'm uh it's not

clear to me at all how you you know derive the the how you get the result that they've that they've achieved so what they're trying to do is predict

which nodes get or are the targets of the intervention and then conditional on that uh uh you are able to predict what's the

downstream consequences of the intervention would be and based on the accuracy of those predictions with respect to so you know you obviously don't have any supervision

over which nodes are become intervened on but you have you have intervention data and you can see okay well there was an intervention here and these nodes were affected and so we want to find the causal graph

that says like well supposing there wasn't a dewy you know we haven't we always intervention let's suppose it was this node and uh and that would cause certain these nodes to be affected now we have that data and suppose it was

that no then that causes those nodes to be affected um now you don't know exactly which nodes were effective that's there's uncertain desolate and variable but of course um you can propagate it down towards

What notes are affected Acro and and try to find a graph that best matches your your your data set in that sense

And is this necessarily a scenario that involves time-series data, where you're able to look at propagation, or is it a data set of independent interventions and outcomes?

In this case it was equilibrium data, so we're assuming that once the intervention has been applied, you see the equilibrium consequences of the intervention.

Okay, super interesting. Was there another paper you wanted to touch on in this topic?

Yeah, there's another paper called "Learning to Induce Causal Structure"; some of the authors overlap with the previous papers.

Typically, the way we've done causal discovery treats it, in machine learning terms, as essentially an unsupervised learning problem: sometimes just observational data, sometimes a mix of observational and interventional data, as in the previous example. That's difficult because the results tend to be pretty sensitive to the hyperparameters in your model; it's hard to get those configured the way you like for a given problem, and of course you then have to redo it for another problem. So in this paper they made it a supervised learning problem, in the sense that they took data sets consisting of the data and the corresponding graph, treated the data as the features and the graph itself as the label, and then predicted what the graph would be given the data.
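A cartoon of that supervised setup (illustrative sizes and architecture, not the paper's): simulate (data, graph) pairs, pool over samples so the predictor sees a whole data set at once, and treat the adjacency matrix as the label.

```python
import torch
import torch.nn as nn

d, n = 5, 200                       # variables per problem, samples per data set

class GraphPredictor(nn.Module):
    # Maps a whole data set (n x d) to edge logits for a d x d adjacency matrix.
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Linear(64, d * d)

    def forward(self, X):
        z = self.enc(X).mean(dim=0)          # permutation-invariant pooling over rows
        return self.head(z).view(d, d)

def sample_task():
    # Random strictly-upper-triangular DAG plus linear-Gaussian data from it.
    A = torch.triu(torch.bernoulli(torch.full((d, d), 0.3)), diagonal=1)
    W = A * torch.randn(d, d)
    X = torch.randn(n, d)                    # exogenous noise
    for j in range(d):                       # columns are in topological order
        X[:, j] += X @ W[:, j]
    return X, A

model = GraphPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):                     # supervised: the graph is the label
    X, A = sample_task()
    loss = nn.functional.binary_cross_entropy_with_logits(model(X), A)
    opt.zero_grad(); loss.backward(); opt.step()
```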

I was skeptical when I first saw it, because in causal discovery we have this equivalence-class problem, which is to say it's possible to have two or more graphs that are completely statistically consistent with the data. These graphs will have the same skeleton, but some of their edges will point in different directions, so they're fundamentally different causal graphs; yet from a likelihood point of view, or a conditional-independence point of view, they're indistinguishable with respect to the data. So my thinking was: if you build, or simulate, these big data sets of graphs and corresponding data, there might be some bias you inadvertently introduce. Say, for example, I have a graph of two nodes: I can orient the edge from A to B, or from B to A. These, by the way, are two graphs in the same Markov equivalence class, so they should be indistinguishable using just the statistical information in the data. You could have other inductive biases, but from a conditional-independence and likelihood standpoint, they're indistinguishable.
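Spelled out for the two-node case, the tie is just the chain rule; both factorizations describe exactly the same joint distribution, so no likelihood-based score computed from observational data can separate them:

$$P(A,B) \;=\; P(A)\,P(B \mid A) \quad (A \to B) \;=\; P(B)\,P(A \mid B) \quad (B \to A)$$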

But if I create a data set that has these as examples, maybe I put in the A-to-B graph more often than the B-to-A graph, because I like things to be alphabetical, for example. When I read the paper, I was a little concerned that that kind of bias would pollute the approach. But based on the ablations and the sensitivity tests they did in the analysis, it didn't seem to be a problem, and generally speaking they got good results.

Now, did they expect that to be a problem, and did they design their method to address it specifically? And if so, how did they approach it?

Well, I mean, it's just a matter of making sure that you don't introduce that kind of bias into the data.

Not necessarily the alphabetizing issue, but the equivalence issue: how does their method avoid this problem of there being many ways to represent a graph from the same underlying data?

Well, the easiest thing to do is just be aware of the issue. If you infer a graph, you caveat it by saying: this is true up to an equivalence class. We do have representations for equivalence classes of graphs, so if you thought this was a problem, you could turn a directed acyclic graph into something called a completed partially directed acyclic graph.

I don't see that acronym quite as much.

Right, a CPDAG. It just means: imagine you had a DAG, and for the edges where you're not 100% sure of the orientation from A to B, you make them undirected. So an undirected edge means you can't resolve causality on that edge, while a directed edge means you can. You could just do that, and there are algorithms that will turn a DAG into a CPDAG based on certain assumptions.

Another interesting thing, as I mentioned earlier: you might be interested in a downstream task where you want to incorporate uncertainty. Rather than having essentially a point estimate of a graph, you'd recognize that this is one of several highly probable graphs given the data, and maybe you want to propagate that uncertainty to the downstream task. This approach uses a Transformer architecture to encode a joint distribution over the edges of a causal graph. It's one of those things where the first time you see it you think, of course: the edges form an ordered sequence, so you could use Transformer networks the way you would model any other sequence. But it takes someone actually doing it for you to go, aha, that's a great idea. Because if you wanted to work with uncertainty in causal discovery before, what you would often do is sample some ensemble of graphs, or use some variational inference approach, and get basically a probability matrix where each element is the marginal probability of an edge. When you work with just the marginal probabilities of edges, you lose a lot of global information about the graph, for example the fact that the graph has to be acyclic; with just a probability matrix you can lose that. The Transformer, though, encodes a joint distribution over the edges. You obviously have to generate from it to make use of it, but I thought that was interesting; that was the first time I'd seen that as well.
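A compact sketch of that joint-distribution idea (illustrative, not the paper's architecture): flatten the d x d adjacency into a sequence of binary edge decisions and model them autoregressively with a masked Transformer, so each edge is sampled conditional on all previously sampled edges rather than from an independent marginal. Training would maximize the Bernoulli log-likelihood of adjacency matrices drawn from your graph posterior or ensemble.

```python
import torch
import torch.nn as nn

d = 4                                   # number of nodes
L = d * d                               # one token per potential edge

class EdgeSequenceModel(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(3, dim)        # 0/1 = edge absent/present, 2 = BOS
        self.pos = nn.Parameter(torch.zeros(L, dim))
        layer = nn.TransformerEncoderLayer(dim, 4, 128, batch_first=True)
        self.tf = nn.TransformerEncoder(layer, 2)
        self.out = nn.Linear(dim, 1)

    def logits(self, edges):
        # edges: (B, L) binary; predict each edge from all previous ones.
        x = torch.cat([torch.full_like(edges[:, :1], 2), edges[:, :-1]], dim=1)
        h = self.embed(x) + self.pos
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.out(self.tf(h, mask=mask)).squeeze(-1)

    @torch.no_grad()
    def sample(self):
        edges = torch.zeros(1, L, dtype=torch.long)
        for t in range(L):               # autoregressive: joint, not marginal, sampling
            p = torch.sigmoid(self.logits(edges)[0, t])
            edges[0, t] = torch.bernoulli(p).long()
        return edges.view(d, d)          # a full adjacency matrix drawn jointly
```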

Well, we've spent a bunch of our time digging into the first trend you identified. And it's probably fair to acknowledge that, for me at least, causality and machine learning is one of those topics I'm super intrigued by, but it's one of those, like quantum computing, that I have to hear five or ten times for it to fully sink in. I guess we'll be talking about the book you're writing to help folks grok the topic when we talk about things to look forward to in 2023. But if anyone else is, like me, trying to keep up with this conversation: it's all good.

It's all good, yeah. And if I can, I'll come up with more clarifying examples. Please hold me to task on being clear and giving simpler examples if I'm talking too much in terms of nodes and graphs. It's also a difficult topic in that whenever you're talking about graphs, you kind of want to draw them.

Yeah, but this is a podcast, and so there you have it.

Exactly, exactly.

All right, so the next thing you had was providing causal inductive bias to models. What are you seeing there?

Yeah, so again I'm drawing trend lines over papers that probably weren't thinking about what the others were doing, but there seems to me to be a connecting thread. The first one I surfaced was a paper by Lars Lorch, working with Bernhard Schölkopf and some others, called "Amortized Inference for Causal Structure Learning." At its core it's a causal discovery paper using variational inference. What was interesting about it to me was that they were using simulators to simulate the data. The simulator has a ground-truth graph; you use it to simulate data, you train the encoder to map back from the data to the structure of the simulator, and then you use that artifact in downstream causal discovery tasks.
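Schematically, the "amortized" part looks something like this (hypothetical encoder and simulator interfaces): all of the expensive optimization is done once, against simulator draws, and a new data set then costs only a single forward pass.

```python
import torch

def train_amortized(encoder, simulator, opt, steps=10_000):
    # simulator() -> (X, G): a data matrix and its ground-truth adjacency, as tensors.
    # encoder(X) -> per-edge Bernoulli logits: an amortized posterior q(G | X).
    for _ in range(steps):
        X, G = simulator()
        logits = encoder(X)
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, G)
        opt.zero_grad(); loss.backward(); opt.step()

# At inference time there is no per-data-set optimization: one forward pass of
# the trained encoder gives an approximate posterior over structures for a new,
# real data set, e.g. q = torch.sigmoid(encoder(X_real)).
```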

What was interesting to me there was the use of a simulator. There are a lot of fields in engineering and the sciences with what scientists call process models and engineers call simulators: in systems biology, whole-cell models of organisms, gene-gene interaction simulators, protein-protein signaling simulators. These are ways that people encode knowledge, often from physical domains, about causal mechanisms. Using that to create the inductive bias for causal discovery, or for any other inference task where you want the inference to be biased toward the causal knowledge baked into a possibly black-box simulator, I think is an interesting direction.

So you mentioned causal discovery in describing this work. Let's maybe take a step back and have you clearly delineate that first trend you saw from this one. What I'm hearing is: the first trend is that you have a bunch of data and you want to pull from that data the graph, i.e. the causal relationships that occur in that data. And here, you're trying to create a model that reflects causal relationships in the world, and we're going to talk through some ways folks have approached that, one of which is through simulation.

Well, just to zoom out from the causal discovery thing, I'll mention the second paper, which is called "ClimaX: A Foundation Model for Weather and Climate."

This is a multimodal foundation model: you take in different types of climate and weather data sets, train a foundation model on top of them, and then do some fine-tuning for whatever downstream weather or climate prediction task you have. What was really interesting about this was not just the diversity of the types of data going into the model, but also the use of simulated data. You have process models, which are obviously big in climate science as well as meteorology, and these encode a bunch of causal information about the physical mechanisms that regulate the climate and the weather at different resolutions and different time scales.

In causality, we know that oftentimes a set of interventional or observational data alone is not enough to learn a full causal model. But a simulator is essentially somebody's very finely specified causal model of a data-generating process, where, rather than thinking in terms of graphs and edges, they're thinking in terms of the local-level interactions between components in the system and the physical laws that regulate them. In some cases you can extend this to simulators of non-physical phenomena, like you would see in software like NetLogo: say, simulating how disease passes through a population by creating some rules for how the people in that population interact. Using that as an example: it's hard for me to come up with a causal graph for how disease passes through a population, but I could write a simulation in which this person leaves their house, bumps into that person, and there's some chance they spread the contagion to that person; some people are superspreaders and some are not. You can think about the causal system in terms of local-level interactions, build that into a simulator, and then use that simulated data to create the training data for your foundation model, especially if you can combine it with other sources of actual data.
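A tiny agent-based sketch in that NetLogo spirit (every parameter here is invented for illustration): the causal knowledge lives entirely in local interaction rules, and the population-level curve falls out as simulated training data.

```python
import random

def simulate_outbreak(n=1000, days=60, meet=8, seed=0):
    rng = random.Random(seed)
    # Per-person transmissibility: a few superspreaders, most people not.
    spread = [0.3 if rng.random() < 0.05 else 0.02 for _ in range(n)]
    infected = [False] * n
    infected[0] = True
    curve = []
    for _ in range(days):
        newly = []
        for i in range(n):
            if not infected[i]:
                continue
            for _ in range(meet):                  # local rule: random encounters
                j = rng.randrange(n)
                if not infected[j] and rng.random() < spread[i]:
                    newly.append(j)
        for j in newly:
            infected[j] = True
        curve.append(sum(infected))                # aggregate, higher-level variable
    return curve

print(simulate_outbreak()[-1])   # simulated data you could train a model on
```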

What's interesting about that to me is that you can control how that data is simulated. So you can make sure you're simulating data from across the manifold that you would need to actually learn a faithful causal model. We could talk about this more under trends for 2023, but I think it's an interesting, exciting area of research. We have these simulators; it's a billion-dollar industry. And there's already a lot of work on using machine learning models to build surrogates of simulators, particularly those that are expensive to run. So it's a natural extension to want to build causal machine learning models that are surrogate simulators, making sure you're not just approximating the simulator, but also being intentional about the causal knowledge, the causal information, that you extract from it.

What is it about the nature of causal models that makes it such that you can know enough to build a simulator of some kind of effect, but the causal model you could build just by mapping what you know went into the simulator isn't sufficiently useful that you wouldn't just do that directly?

So, one thing: when you build a simulator, you can stay at a very low level of resolution. You just think about what the basic, for lack of a better word, particles or entities in your system are, and how they interact. Now, the causal question you might ask might be at a higher level of abstraction, but you can always simulate data and then figure out some way to aggregate it, so that you can work out the right level of abstraction from the simulated data. In a traditional causal model, by contrast, you have to build that abstraction directly into the model.

Say, for example: a lot of causal inference research has come from places like economics and econometrics, where people kind of take the nodes in the graph as a given. But in machine learning, we're often working with very low-level features, like pixels, or bits in an audio file.

And what are examples of the kind of nodes in the graph we're talking about at the economics and econometrics level of abstraction?

Well, say you were going to do some kind of analysis of the impact of No Child Left Behind on graduation rates. If you had to draw a DAG for that system, you'd be thinking about things like the amount of money spent on the school, and the number of students who graduated and who didn't. In that kind of domain there's obviously some flexibility in the abstractions: think, for example, about how variables like race and ethnicity, in terms of what the census form looks like, have changed a lot over several decades. But generally speaking, in those social science domains there's an understanding of what the nodes, the variables, in the model are going to be. In a physical science, it might just be easier to start with whatever the highest-resolution, lowest-level representation is, focus on the micro-interactions between those particles, and then do your reasoning about the higher-level state of the system from simulated data.

Got it. And to your point earlier, the practice of simulation in many of these fields is decades long, well established, and to some degree or another off the shelf in many cases.

Yeah, and often involving large, more or less black-box simulators that are widely in use. To be able to extract causal information from that black box would be valuable.

Awesome.

So the next paper you wanted to talk about is "Inducing Causal Structure for Interpretable Neural Networks." Is this one related to simulation as well?

Not as explicitly, though I think it could be: if our goal is to figure out how to be intentional about extracting causal information from simulators, you could apply this technique. They were actually applying it to neural nets. They were saying: suppose we wanted to train a neural network to have certain causal structure, to reflect the causal structure in the data-generating process, in order to, for example, be more explainable. What they do is build on an idea, from previous papers this work builds upon, called the interchange intervention. This is essentially an intervention on the internal states of the neural network model. For example, if you had a language model that predicted sentiment from text, you could switch out the embedding for the input text with the embedding of some other text, and then see how the output changes; that would help you understand something of how the embedding affects the sentiment. They were using this technique in training a neural network, rather than just analyzing one. What that means is that you're taking steps to make sure you have some causal model that's generating the data that's going to be used to train the neural network, and you make sure that, say, the nodes in the causal DAG correspond to representations in the neural network. You align those representations using the intervention technique, and then make sure that, for example, interventional and counterfactual queries simulated from your causal model align with interventional and counterfactual inferences from the neural net.
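A bare-bones version of an interchange intervention on a toy network (the architecture and the patching site are invented for illustration): run the source input, cache an internal activation, then patch it into the forward pass on the base input and compare outputs. In the paper's training setup, a loss term would push the patched output toward what the high-level causal model says the counterfactual should be; here we only show the intervention itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

def interchange(model, base, source, layer=1):
    # Cache the source run's activation at `layer`, then patch it into
    # the base run and observe how the output changes.
    cache = {}

    def save(mod, inp, out):
        cache["act"] = out.detach()

    def patch(mod, inp, out):
        return cache["act"]      # returning a value replaces the module output

    h = model[layer].register_forward_hook(save)
    model(source); h.remove()
    h = model[layer].register_forward_hook(patch)
    patched = model(base); h.remove()
    return model(base), patched  # plain vs. interchange-intervened output

base, source = torch.randn(1, 8), torch.randn(1, 8)
plain, swapped = interchange(model, base, source)
print(plain, swapped)
```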

I brought this up because it's been an eclectic area in terms of understanding and enforcing causal abstractions in neural nets. But it was clear to me that you could use this if you wanted to, say, use a simulator, or some other causal model, to train a foundation model: rather than just simulating a bunch of data and feeding it into your model, you're generating the data to be fed into your foundation model, or your large language model, or your neural network, in a way that's very explicit about the causal information being digested by the model that's learning. That would provide some theoretical guarantees. There are a lot of questions now about whether foundation models and large language models can reason causally, and if you take techniques like this, you could provide theoretical guarantees: we could say, this was the kind of intervention data we simulated, so we know that, at least given enough data, the model is learning that causal information when you train the actual foundation model, or whatever model you're training. So I think it's an interesting direction for understanding how we can incorporate causal information more systematically and intentionally into models like foundation models.

So at a high level, this one is saying: we want to accomplish the same goal, using data to impart some degree of causal structure into a neural network we're training. Is this paper primarily focused on ways to manipulate and represent that input data, that feature data, so as to accentuate these causal relationships for the network or the training process?

Yes: the causal relationships, the causal representations, the outputs of simulating from an intervention distribution, so simulating what happens if you do an intervention, simulating counterfactual outcomes, and making sure, rigorously, that the neural net being trained is faithfully learning those elements.

Awesome. The next one you identified from last year is causal representation learning. Tell us a little bit about what you're seeing there.

Sure. For people who are new to the space: causal representation learning is the problem of learning, from low-level, high-dimensional data, latent representations that correspond to causal objects in the data-generating process. It's connected to previous work on learning disentangled representations, and it has deep roots in independent component analysis. In terms of trends, I don't think we've seen any huge applications of causal representation learning yet, but I think we're moving there steadily. I saw some papers that were formalizing the desiderata for representation learning, in other words, what makes a good causal representation, in causal terms. Once you do that, it allows you to use causal inference theory to be a lot more formal about how a good causal representation ought to behave. The talk of what makes a good representation becomes a lot more formal and a lot less hand-wavy, and it enables you to understand exactly if and when you could learn that representation from your data. In the previous research on these desiderata, people were trying to hack at the problem even though there was some evidence, and later some papers, showing that it was actually not solvable, at least in an unsupervised, observational-data setting.

So I mentioned a paper here by Yixin Wang and Michael Jordan, "Desiderata for Representation Learning: A Causal Perspective." I thought this was interesting in that it uses probabilities of causation, which are related to what lawyers, for example, call but-for causation and proximate causation. The idea is something like the probability of necessity: given that some causal event led to some outcome, you ask, had the causal event not occurred, would the outcome still have occurred, and what's the probability that that statement is true? And the probability of sufficiency, where you have a statement like: supposing the causal event didn't happen and the outcome didn't happen, what's the probability that, had the causal event happened, the outcome would then have happened? So in the one case you're asking whether the causal event was sufficient to cause the outcome event, and in the other whether the causal event was necessary for the outcome event.
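In Pearl's counterfactual notation these have standard definitions, with $x$ the causal event, $y$ the outcome, and $x'$, $y'$ their absence:

$$\mathrm{PN} = P\!\left(Y_{x'} = y' \mid X = x,\; Y = y\right), \qquad \mathrm{PS} = P\!\left(Y_{x} = y \mid X = x',\; Y = y'\right)$$

PN conditions on the cause and outcome having occurred and asks whether the outcome would have been avoided without the cause; PS conditions on both having been absent and asks whether the cause would have produced the outcome.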

This is something we have a lot of theory behind in the univariate case. The paper explored it in the context of learning a lower-dimensional, but obviously still multivariate, representation from high-dimensional data, and it used a lot of that older theory about probabilities of causation to understand whether you could even learn these probabilities of sufficiency and necessity from the data itself. The idea was that a good causal representation has both a high probability of necessity and a high probability of sufficiency.

Maybe again taking a step back and trying to distinguish this trend from previous ones we mentioned: in the case of causal discovery, you've got a bunch of data and you want to learn the graph, learn the relationships. Here you have a bunch of data and you still kind of want to learn relationships, but it's more about learning a minimal, core representation of the causal structure. How would you differentiate that from learning a minimal graph?

I probably should have led with this: it's not actually unrelated to causal discovery.

Causal representation learning, and disentangled representation learning, are often thinking about the high-dimensional data common in machine learning, while causal discovery traditionally deals with tabular data, numeric and categorical. So it's a bit of a setting distinction.

Right, but the problem is similar, or the same.

Yeah. Most causal discovery algorithms don't assume any latent variables, and in particular they don't assume latent causes. In causal inference we call that causal sufficiency: you have a set of observed variables, and there are no common unobserved causes for those observed variables.

Kind of an IID type of assumption.

Exactly, or conditionally IID. So you could, for example, assume that you have some latent variables, that the observed data is conditionally IID given those latent variables, and then try to learn what those latent variables are. I would call that a type of causal representation learning, insofar as you're trying to learn, say, from a vector of ten observed variables, a smaller vector of two or three or four unobserved causes that some of those observed variables share. The causal representation learning problem is essentially that, but now extended to pixels, for example.

So, linking the two: these are essentially similar things. You're trying to learn some latent variables that are latent causes, and you're hoping to learn the causal structure between them. That's a kind of causal representation learning on ordinary tabular data, and I think that's actually a really important direction, because not everybody is working with high-dimensional videos and the like; sometimes you just want to do good data science and learn causes. If anything, one of the biggest problems of traditional causal discovery is that it assumes there are no latent variables, which is almost never the case. So I mentioned here a paper from Kun Zhang's group at CMU called "Identification of Linear Non-Gaussian Latent Hierarchical Structure." Again, this is exactly that: we're taking tabular data, learning some latent causes, and learning the causal structure between those latent causes, relying on linear non-Gaussian assumptions to get the math to work, and using observational data. I think that's another trend we're seeing in causal representation learning: can we simplify the problem a little in order to understand what the causal limitations are, and resolve those first, before scaling up? Does that make sense?

Yeah, in the sense that you're not trying to simplify it all the way down to the causal discovery types of problems we were talking about earlier, but, given the high-dimensionality setting, asking whether there's some core element of causality that you can identify.

The way I think about it is like this.

One mode of research that I think has been very successful in deep learning is to essentially brute-force a practical problem with a lot of compute and a lot of data: try different architectures, different configurations, different activation functions, different setups of the problem, until you get something that works. Essentially trial and error. Then, once it works, you do something like an ablation study to figure out what's driving why it works.

Right, what's the thing, and can you reduce it down to that thing?

Right, the "X is all you need" paper. But that's difficult to do when causal reasoning is involved, because when it's not working, you don't know if the reason is that the causal problem is not identified, in other words ill-posed given the data and your causal assumptions, or if it's because of some of the other problems common in machine learning, say the kinds of issues that come with working in high-dimensional settings. So to resolve this, if you don't know whether it's the causal issue or a scaling issue or whatever other non-causal issue it might be, I think you have to simplify the problem to something that isolates just the causal issues, resolve those, and then, once you know you're standing on solid formal causal-theory ground, scale it up. That's why I highlight Kun Zhang's work: he calls it causal representation learning, but it's linear, non-Gaussian, with variables you can put in a pandas data frame. Similarly, I mentioned this work, "Systematic Evaluation of Causal Discovery in Visual Model-Based Reinforcement Learning."

In 2021, I think, there was a reinforcement learning environment called CausalWorld, which involved a 3D environment of robotic hands manipulating blocks. The goal was to create an RL environment where causality was important: for example, in learning how to manipulate the blocks, the agent should learn that the color of the blocks doesn't matter, because it isn't causally related to any of the issues that come up in manipulating blocks. But it was just so hard, there are so many hard things you have to solve there, that we hadn't made much progress. This paper, for example, introduces a simpler 2D physics environment with blocks of different weights, where heavier blocks can push lighter blocks. It distills the problem down to the core puzzle: if I do an intervention and push on this block, the other block is only going to move if it's lighter than this one. You've got simple images, you're focused on that core causal problem, and hopefully, once you've solved it there, you can scale it up to something like CausalWorld. So I think that was an interesting and useful development in the way people working on the causal aspects of reinforcement learning are starting to think about their problems.

Interesting. So you've got a last category here called actual causality and causal judgments. What's that one about?

Yeah, so these essentially mean the same thing to different audiences. In traditional causal inference, actual causality is in contrast to what most of us think of in terms of what causes what. If I say smoking causes lung cancer, that's type-level causality. But if I, for example, observe a dead person and say, well, this person died of lung cancer, and I'm interested in whether it was the smoking, or the smog, or a hereditary predisposition, that's actual causality, sometimes called token causality, because you're talking about a specific instance, as opposed to one concept being a cause of another concept. Instead of talking about what variables cause what variables, you're talking about what events caused what events. And causal judgment is what cognitive psychologists call the way humans reason about actual causality.

sense that what happens is um were interested in is from a causal standpoint if we say like you know if there was somebody who

was smoking in the woods and there was a forest fire if you draw like you know the presence of oxygen inside that as a note in that Dag the you know there's there's no way of

looking at that and saying like no the smoking in the woods was more of a causal factor in his outcome than the presence of oxygen just because oxygen is always there um or something less extreme is like

well you know foreign it was you were in the woods during uh Fire season and therefore you cannot blame the dryness of the trees for the fire you have to blame you you have to blame the individual because they were

smoking a causal kind of graphical causal differences had trouble dealing with that and and they came up with uh and people

People came up with ways of quantifying actual causality using graphical causal inference, but the way that research developed was that you would come up with some approach, and then somebody would come up with a counterexample: something where it's obvious to a human that, in this case, clearly this thing is at fault, but the quantitative method from the causal inference literature fails. So people go back and add nuance to their definitions of actual causes to accommodate the new case, and then somebody comes up with another counterexample, and you have to update again. By the end of the day, the definitions of actual causes in a graphical sense tend to get a bit weird, because they've just expanded to accommodate all kinds of edge cases.

At the same time, there's this literature from computational cognitive scientists who are trying to understand how humans make causal judgments. They observe that humans are really good at, say, observing some outcome event, a bunch of blocks fallen on the floor, a bunch of milk spilled on the floor, and figuring out what happened. Sometimes we're really good at it: if you have kids and you walk into the living room and see some kind of scene, you can make instant conclusions about what unfolded to get there, and it's not clear how we do that.

So what this approach to research does is say: okay, let's come up with some set of causal narratives where we vary different factors in the narrative, do a study with some undergraduate students or some Mechanical Turk workers, and ask them how they make causal judgments, how they make attributional judgments, how they assign responsibility, how they assign blame in these situations. Then they quantify the statistical properties of those responses; they've designed the questions so they can understand the key variables in how humans make these judgments. And then they build a computational model of that process.

In the papers I list here, I mentioned one called "A Counterfactual Simulation Model of Causal Judgments for Physical Events," and a follow-up, "What Would Have Happened? Counterfactuals, Hypotheticals and Causal Judgements." The idea they came up with is essentially that humans imagine counterfactuals in their heads; in other words, they mentally simulate counterfactuals. To understand this, imagine some billiard balls on a table. I roll a billiard ball towards a pocket, but it hits another ball. Then I ask a human: hey, if that other ball had not been there, would the original ball have gone in the pocket? That's a counterfactual question, and what the human does is mentally imagine the trajectory of that ball with the other ball removed. They mentally simulate the counterfactual situation, and the idea is that this is how humans make causal judgments about why things happened.
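As a rough illustration of that counterfactual simulation idea (my own toy sketch, not the model from the cited papers), you can run the same shot twice, once as observed and once with the obstructing ball removed, and attribute causation when the outcomes differ:

```python
# A toy sketch of counterfactual simulation for a 1D billiards shot:
# run the observed world and the counterfactual world (obstacle
# removed) and compare outcomes. Not the model from the cited papers.

def simulate(start: float, velocity: float, pocket: float,
             obstacle: float | None) -> bool:
    """Roll the ball toward the pocket; an obstacle stops it short."""
    pos = start
    for _ in range(100):                 # coarse time steps
        pos += velocity * 0.1
        if obstacle is not None and pos >= obstacle:
            return False                 # collision: shot is blocked
        if pos >= pocket:
            return True                  # ball reaches the pocket
    return False

observed = simulate(0.0, 1.0, pocket=8.0, obstacle=6.0)         # False
counterfactual = simulate(0.0, 1.0, pocket=8.0, obstacle=None)  # True

# Judge the obstacle a cause of the miss exactly when removing it
# flips the outcome.
print("obstacle caused the miss:", counterfactual and not observed)
```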

And we see this in law, right? I mentioned probabilities of necessity and sufficiency; lawyers call this "but-for" and proximate causation. You're trying to determine the guilt of, for example, a defendant in a criminal trial by asking: would this bad thing have happened had they not done this act? And was what they did sufficient to cause this bad outcome? Proximate causation, for example: somebody dies because somebody else punched them. They shouldn't have punched him, but was the punch sufficient to actually kill a person in most situations? You ask these kinds of questions to make a judgment about the individual. This research is trying to model how those types of judgments work, and I think it's a really interesting area of the literature. It's often been overlooked by the actual causality literature because human causal judgments involve things that are nebulous to people thinking in terms of causal inference and statistics.

For example, a big factor in causal judgments is how normal a cause is. Or, you know, should you have been there in the first place? If I go into a grocery store and accidentally knock over some olive oil, and somebody slips and breaks their hip, the judgment of my guilt in that situation is different than if the exact same scenario unfolded but I had broken into the grocery store. So the background probabilities, the normality, some sense of how expected the background causes are, is a big factor that has been challenging to capture in traditional actual causality, but that these researchers have done a good job of highlighting.

I mentioned another paper here called "Counterfactuals and the Logic of Causal Selection." Causal selection, again, is picking the right event out of multiple candidate causal events. Humans are very good at it, and it's been hard for us to get right in algorithms. This paper highlights that there's counterfactual simulation going on, that we're imagining and simulating different counterfactual possibilities, but one common thread is that they are all likely counterfactual possibilities. We tend to focus on counterfactual outcomes that are probable; I think of it like a counterfactual Occam's razor. And the strength of the cause on the effect is another big factor in how people assess these causal judgments. I think this is really important going forward, particularly since there's a lot we can use here to understand, for example, how large language models reason causally, as well as to train them, say through fine-tuning or through reinforcement learning from human feedback, to make better causal judgments.

Are the papers you identified generally of the cognitive science perspective, trying to understand how humans make these causal judgments? Or are they kind of methods papers? You mentioned simulation, which suggests a method: hey, we observe this in humans, let's try to do something approximate in computers and see if we can get some interesting result.

The authors of these papers often publish in both machine learning and cognitive science conferences. A lot of the time the structure of the paper is: here's how we think causal reasoning in humans works; we're going to design a study with humans to evaluate whether that's true; and then we're going to build a computational model that can replicate those results. In other words, building a model of the algorithm that happens in the human head.

Interesting. Awesome. Well, we covered a lot of ground there. Maybe one way to net this out is that, well, maybe this last one was a little bit different, but a lot of what you talked about was: we've got a lot of data, what can we learn about causal structure from that data? That's a big driving theme in the field right now, and you've identified some specific areas where we've made progress recently. So let's switch gears a little bit and talk about tools, open source projects, benchmarks, anything of note you've seen that folks may be interested in checking out.

Sure. I encourage people to check out the PyWhy library. Again, this is an open source library; it has grown beyond Microsoft Research and has active involvement from our collaborators at Amazon, and it's being used both in research and in industry. I mentioned that causal representation learning work from Kun Zhang's group at CMU; it was implemented in a package called causal-learn that has now been added to PyWhy, so you have a few causal discovery and causal representation learning options in that suite of libraries.

I also want to call out a really cool package that doesn't get a lot of play, called y0, pronounced "why not"; it's lowercase y and then zero. It's a causal inference library that implements identification algorithms. If you've heard of the do-calculus, for example: the do-calculus is how you prove that some causal question you want to answer can be identified from your data, and then you can think about statistical estimation after that. There are algorithms, called identification algorithms, that will do that reasoning for you, and I would say that, with the exception of an R package called cfid, you didn't see these algorithms accessible as open source until y0 came around. There's the regular ID algorithm, something called ID*, and then conditional ID*; those are all implemented in y0. So if you're interested in identification algorithms and causal inference, that's the Python library to go to.

I also included a link to the simplified version of the Causal World reinforcement learning environment that I mentioned; it's called Causal MBRL. And there was a big paper called "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." It had about a billion authors on it, but it's a set of benchmarks for evaluating large language models, and it has several benchmarks related to causal inference and causal reasoning. So if you're interested in the intersection of causality and large language models, I think that would be the first place to look for good benchmarks to evaluate against.

How about on the commercial side? Are we seeing commercial application of some of the trends you're identifying, or new tools that are making it easy for folks to use causality and causal modeling?

I mean, obviously there's commercializing foundation models, right? With the big advances in large language models that we saw last year, and I'm at Microsoft, so obviously we had a big announcement recently about the incorporation of GPT into Bing and Edge. Going forward, I think we're going to see a lot of really impactful developments in terms of productizing these models in this space, as well as developing new foundation models. Some of those foundation models, like the large language models, will already be surprisingly good at reasoning causally, and then of course we'll be developing new foundation models where we can actually be pretty intentional about causal reasoning during training and in terms of the data we provide, using simulators, for example. I think that started last year, and it's going to continue to be a trend.

When you say those models will be good at reasoning causally, are you speaking about models beyond, say, GPT-3, or even GPT itself?

So, if you go to Bing's new chat service right now and you ask it: hey, I'm interested in the causal relationships between smoking and lung cancer, give me a causal DAG that represents the causal relationships between smoking and lung cancer, it will give you a DAG, and it will be pretty plausible. That doesn't necessarily mean it's doing causal reasoning; it could have just found a DAG somewhere.

So my personal observation, and maybe a prediction for 2023, is that we're going to have to get a lot more precise about what we mean by "reasoning." But from my observations, these language models learn what causes what, call it common-sense causal knowledge, from statistical regularities and statements of causal relationships in the training data. They learn something of transitivity, for example: if the model says that the cost of cigarettes affects whether or not somebody smokes, and whether or not somebody smokes affects whether or not they get lung cancer, then it does okay at concluding that the cost of cigarettes indirectly causes lung cancer.

They also learn something of the nature of the causal relationships: for example, not just that smoking is a cause of lung cancer, but that increased smoking increases the probability, the risk, of getting lung cancer. And I've found they pick up types of useful causal nuance. Say you ask, on a scale of one to ten, how likely is it that there are some people out there for whom more smoking leads to a lower chance of lung cancer. Compare that to: there are some subscribers out there for whom more promotional emails will actually increase their risk of churning, of not using the product, even though the promotional emails are intended to get them to use it more. The first relationship tends to be monotonic, while the second is not, and even though those statements are structured the same, the model can pick that up as well. That matters in causal inference: if I know there are no subpopulations where smoking decreases the risk of lung cancer, but I do know there are some populations of customers where more promotions actually turn them off, that's very useful functional information about the relationship between the cause and the effect that I can use in causal inference. So it's picking that up in the training data.

Now, there's a question of whether these models can reason causally, and I think it's a bit of an ill-posed question, because it would be hard to validate. If the model is, for example, using the do-calculus to answer some causal inference query, how would you know? And how would you know that the result is repeatable when you get to another domain? So whether or not they can reason causally is interesting to ask, although probably a little ill-posed. Looking forward, I think of some recent papers,

like "Program of Thoughts Prompting." There have been these papers where you get improved reasoning by asking the language model to reason step by step, and the Program of Thoughts idea was: rather than reasoning step by step in natural language, can you move it to some symbolic language that's executable, for example in a programming environment, and then have the model enumerate its steps, or train it to enumerate steps, in terms of that symbolic language?

Yeah, show its work.

Right, but rather than natural language, it uses some kind of domain-specific language for formal reasoning, and only at the very end do you evaluate the expression in an external interpreter, so the model focuses on the actual steps of the formal reasoning. We mentioned the do-calculus and other formal reasoning algorithms you have in causal inference; that's exactly what those are. So you could possibly train these models to go through the steps of formal causal reasoning and get good results. I don't know if you would get any theoretical guarantees, but you'd probably be able to build something useful with it.

guarantees but you'd probably be able to gets build something useful with it so when you think about the the

intersection of foundation models llms and uh causality you know where do you think or what do you think excuse me what do you think are the most exciting research

opportunities and and where do you think we'll see significant gains uh over the next you know year to end years pick your end

sure I mean I think I mean I think there's a lot of low hanging fruit in terms of combining causality with these large language models so I think uh this like I just said like if I can

Like I just mentioned, I can go into Bing chat and get a graph. I can also ask it to take that graph and implement it for me in NetworkX, which is a Python library for representing graphs.
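For instance, the kind of NetworkX snippet you might get back could look like this. The graph here is my own minimal sketch of a plausible smoking DAG, not actual Bing output:

```python
# A minimal sketch of the kind of NetworkX DAG an LLM might hand back
# for the smoking question. The graph is invented for illustration,
# not actual Bing output.
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("cigarette_cost", "smoking"),
    ("smoking", "tar_in_lungs"),
    ("tar_in_lungs", "lung_cancer"),
    ("genetics", "smoking"),
    ("genetics", "lung_cancer"),
])

assert nx.is_directed_acyclic_graph(dag)
# Transitivity as reachability: cost indirectly causes cancer.
print(nx.has_path(dag, "cigarette_cost", "lung_cancer"))  # True
```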

Or, since I mentioned the PyWhy suite of tools, there's one in there called DoWhy, which is for causal effect estimation, among other things. I could say: all right, given this graph, and here's some tabular data, show me Python code for a causal effect analysis that gives me the causal effect of smoking on cancer. And it will give me that code, and I can just go and run it.
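That generated analysis typically boils down to a few lines of DoWhy. Here is a minimal sketch on synthetic data, assuming DoWhy's CausalModel accepts a DOT-format graph string; the graph and coefficients are invented for illustration:

```python
# Minimal DoWhy sketch of the effect of smoking on cancer, run on
# synthetic data; the graph and coefficients are invented for
# illustration.
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 5_000
genetics = rng.binomial(1, 0.3, n)               # confounder
smoking = rng.binomial(1, 0.2 + 0.4 * genetics)  # treatment
cancer = rng.binomial(1, 0.05 + 0.25 * smoking + 0.1 * genetics)
df = pd.DataFrame({"smoking": smoking, "cancer": cancer,
                   "genetics": genetics})

model = CausalModel(
    data=df, treatment="smoking", outcome="cancer",
    graph=("digraph {genetics -> smoking; genetics -> cancer; "
           "smoking -> cancer;}"),
)
estimand = model.identify_effect()               # backdoor via genetics
estimate = model.estimate_effect(
    estimand, method_name="backdoor.linear_regression")
print(estimate.value)                            # ~0.25 by construction
```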

So from the standpoint of building a tool for a data analyst who doesn't want to interact directly with the Python code, or of making knowledge extraction from a domain expert a lot easier, this changes the game. It really reduces the amount of effort that goes into just getting a causal effect inference analysis off the ground. And that's probably obvious to everybody by now.

A colleague of mine, Amit Sharma, recently published a tweet about this. There's a dataset called the Tübingen cause-effect pairs; it's data for pairs of causes and effects, say, for example, altitude and temperature, and the goal was to use algorithms to figure out what causes what. You can throw away the data and just ask large language models about these pairs, and they get very high accuracy in terms of what causes what. Now, in causal inference we're often concerned with objective truth, and obviously objective truth in the real world can be an issue with large language models. But in terms of bootstrapping a causal graph and a causal coding workflow, if you can just surface the graph to the human, and the human can edit it and say, oh wait, that never causes that, that saves a lot of effort.

So that's LLMs as a tool for researchers, developers, data scientists, and analysts who want to use causal models. What about the reverse: causality as a tool to make LLM output better? We're well aware, and I think the term "hallucinations" has come up in this conversation, that LLM output is not perfect. One would think that if we've got this set of tools around causality that's sufficiently developed, we could somehow couple it with LLMs to get them to produce better output in some way. How do you see that playing out?

It seems kind of obvious, right?

Large language models are probabilistic; they're a model of the joint distribution over language. One way of thinking about what they're doing is that they compress a bunch of regular causal relationships in the world into some latent representation, and then, based on that, they generate causal statements that are plausible in some world, if not true in this world. And, you know, if it's not true in this world, it's true in some world. Like I said when we were talking about counterfactuals: you're reasoning over different possible worlds, some are more likely than others, and in one world things actually happened.

That sounds like the thing an LLM would say as its own apologist.

Right. A causal model could, in some sense, be an ombudsman for the veracity of what's coming out of these models.

Particularly if, for example, the model can generate something and then you can ask it to put that in terms of some domain-specific language that I can run through a causal validator, a causal fact validator. And maybe it's not a causal model; maybe it's a knowledge base, maybe it's something going through an SMT solver or a theorem prover. We have a lot of useful algorithms for reasoning with constraints, or about causal fact versus fiction. And since large language models are so good at generating code, it seems obvious that you would want to pair these together. I think we'll see a lot of developments there in the coming year.
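A minimal sketch of that generate-then-validate loop might look like the following. `generate_claims` is a hypothetical stand-in for a model call, and the validator here is just a lookup against a tiny knowledge base; a production version might call out to an SMT solver or theorem prover instead:

```python
# Sketch of a generate-then-validate loop: the LLM emits causal claims
# in a tiny structured form and a validator checks them against a
# knowledge base of accepted edges. `generate_claims` is a hypothetical
# stand-in for a model call.

KNOWN_EDGES = {("smoking", "lung_cancer"), ("altitude", "temperature")}

def generate_claims(prompt: str) -> list[tuple[str, str]]:
    # Hypothetical parsed model output as (cause, effect) pairs.
    return [("smoking", "lung_cancer"), ("lung_cancer", "altitude")]

def validate(claims: list[tuple[str, str]]):
    for cause, effect in claims:
        status = "ok" if (cause, effect) in KNOWN_EDGES else "flagged"
        yield cause, effect, status

for cause, effect, status in validate(generate_claims("...")):
    print(f"{cause} -> {effect}: {status}")
```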

And again, that seems to me the kind of thing that's obvious to a lot of folks by now. There's also a paper on A* search decoding; the name is escaping me, maybe you'll have to look it up for your show notes. It was a large language model paper that modified the decoder with heuristics for some downstream outcome. In other words, rather than just generating some text and then rejecting it if, for example, it wasn't true, you could intervene in the actual decoding of the generated text directly. That particular paper was focused on the A* search algorithm, but there's no reason you couldn't use the same idea to define heuristics that avoid words you don't want the algorithm to generate, or avoid asserting facts that you know, based on some prior knowledge, to be untrue.

So that's attaching a model in some way, as opposed to prompting as a way to guide the model's output, like hidden prompting or starter prompts or things like that. I think you would use all of these in concert. If you look at the new service we have with Bing, it's using Bing's search technology to augment both the prompt and the results, and I think future products that use, say, formal causal reasoning or other kinds of formal algorithms will probably have that type of interplay.

How deep have you gotten into the Bing product? I'm curious whether it's something you're familiar with. You put in some prompt, it gives you a response, and then, uniquely relative to ChatGPT at least, it will provide references. I saw a tweet that referred to those as sources, and it's unclear to me whether those are sources in a causal sense, meaning it got this information and the information influenced the output, or whether it created the output and then correlated it to these external sources. I'm wondering if you have any insight into what's going on there that you can share.

I think you're right that it's better to view those as the latter. If they were causal, you'd be assuming that in generating the text the algorithm somehow knew which sources were used in the training data, and that's a stretch. But just giving people useful links, so that if it's making a statement you can click through and go deeper, that's quite valuable.

Thinking about this space more broadly: everything I've mentioned so far about connecting large language models, on some kind of inner or outer loop, with formal reasoning algorithms like you have in causal inference, I think that's low-hanging fruit. If I had to stretch a little and think about what else we might see: you can take a large language model, like I said, get it to give you a graph, then get it to tell you how to implement that graph in code, and then run it through an analysis. That intermediate representation is where I think there's a lot we could do to improve reasoning with these algorithms. Say, for example, the representation of my causal knowledge about a problem is a causal DAG. It's not just a visualization; I can do things with that causal DAG.

I can reason about probability distributions. I can assert that something is conditionally independent of other things given its causal parents. I can model an intervention by applying graph surgery: if I want to understand how the world changes when I intervene on a node, I remove the incoming edges to that node in the graph. In other words, I'm doing a transformation on that causal representation.
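In NetworkX terms, that graph surgery is only a few lines. Here is a small sketch of the do-operation as edge removal, my own illustration of the standard construction:

```python
# Sketch of graph surgery for an intervention do(X): delete the edges
# coming into the intervened node and leave the rest of the DAG intact.
import networkx as nx

def do(dag: nx.DiGraph, node: str) -> nx.DiGraph:
    """Return the mutilated graph modeling an intervention on `node`."""
    mutilated = dag.copy()
    mutilated.remove_edges_from(list(mutilated.in_edges(node)))
    return mutilated

dag = nx.DiGraph([("genetics", "smoking"), ("genetics", "cancer"),
                  ("smoking", "cancer")])
post = do(dag, "smoking")
print(list(post.in_edges("smoking")))  # []: incoming edges removed
print(sorted(post.edges))              # genetics->cancer, smoking->cancer
```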

And I think it's going to be interesting to see how that ability to operate on an intermediate representation, in ways that have proofs of correctness, at the interface between natural language and some kind of computation, plays out; it might have some interesting use cases. An example I think of is reinforcement learning from human feedback.

With the reinforcement learning from human feedback that was used to train ChatGPT, you have humans rank completions, and that provides signal back to the algorithm about which completions are more human-like, or correspond with what humans might say. One could imagine that, instead of just ranking a few completions, the model could provide, say, a causal graph, and the human could edit edges in that causal graph. The entire graph then becomes a much richer representation than a ranked set of choices. Humans are very good at reasoning about graphs. It could be a causal graph, a hierarchy of topics, or a graph of relationships; these are very rich representations that humans understand intuitively, and arguably human inductive biases are partly in terms of cause-and-effect relationships or hierarchies. If you could feed that kind of information back into the large language model, or the foundation model, rather than just the ranked list, that could perhaps be a more powerful reward signal for improving these models. That's just one way of thinking about it.
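As a sketch of how such a graph edit could become a scalar reward, in the speculative spirit of the idea above, you could score the model's proposed DAG by how many edge insertions and deletions separate it from the human-corrected one:

```python
# Speculative sketch: turn a human's corrections to a model-proposed
# causal graph into a scalar reward by counting edge edits. This is an
# illustration of the idea, not an existing RLHF pipeline.
import networkx as nx

def edge_edit_distance(g1: nx.DiGraph, g2: nx.DiGraph) -> int:
    """Edge insertions plus deletions needed to turn g1 into g2."""
    return len(set(g1.edges) ^ set(g2.edges))

proposed = nx.DiGraph([("smoking", "cancer"), ("cancer", "genetics")])
corrected = nx.DiGraph([("smoking", "cancer"), ("genetics", "cancer"),
                        ("genetics", "smoking")])

reward = -edge_edit_distance(proposed, corrected)
print(reward)  # -3: three edge edits separate the graphs
```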

Another one might be: like I said, I can ask the large language model to generate a graph for me, then ask it to change that graph to represent some intervention. Once I can do that, I can potentially ask it to do some pretty advanced causal reasoning by chaining together a bunch of graph operations, and get to something that's perhaps better than the kind of informal causal reasoning you'd find in a natural language corpus.

Setting aside the topic of LLMs for a second: you've been talking about these causal graphs, and in the pure machine learning world, one of the topics that has been popular over the past few years is geometric reasoning, reasoning around graphs and symmetries and things like that. Is that a thing in causality, or is the only thing the two have in common the word "graph"?

I think that is important in graph neural networks, and there is some intersection between causal graphs and graph neural networks, so I think there is some connection there, but I'm not sure how well I can speak to it.

All right. You identified an interesting use case that you've seen emerge this year, SayCan. Can you talk a little bit about that?

Yeah, so SayCan is, I guess I'd call it a use case, from Google, for using large language models with robotics. The idea was that you would take the large language model and ask it for a set of instructions for how the robot could solve some problem, execute some task. So basically you're using a large language model for planning. But of course the large language model doesn't have embodied agency in the world; it can't intervene in the world. So what the robot does is take that plan generated by the large language model and modify it, such that it focuses on a series of interventions that the agent can actually execute in its environment. I thought this was a good example. When I saw it, I think it made a big splash in the reinforcement learning and robotics community, but for me, I'm thinking: well, this is just causal decision making, sequential causal decision making with servos, right? We were talking earlier about whether you could take a large language model, produce something that can be digested by a causal model, and then have the causal model modify that output so that it's causally feasible. That seems like a very natural extension of this SayCan work; in fact easier, because you don't have to worry about the various constraints that a robot faces.
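The core of that SayCan-style selection can be sketched in a few lines: score each candidate step by the product of the language model's preference and the robot's learned feasibility (affordance) value, then take the argmax. The numbers below are invented for illustration:

```python
# Sketch of SayCan-style action selection: combine the LLM's score for
# each candidate step with the robot's affordance value (how feasible
# the step is in the current state). All scores here are invented.

llm_score = {          # LLM's preference given "clean up the spill"
    "pick up sponge": 0.60,
    "pick up mop": 0.30,
    "apologize to the spill": 0.10,
}
affordance = {         # value function: can the robot actually do it?
    "pick up sponge": 0.90,
    "pick up mop": 0.05,          # mop is out of reach
    "apologize to the spill": 0.0,
}

combined = {a: llm_score[a] * affordance[a] for a in llm_score}
best = max(combined, key=combined.get)
print(best)  # "pick up sponge": both useful and executable
```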

I alluded to it earlier, but maybe before we wrap we can touch on another prediction for 2023, and that's that there'll be a new book hitting the shelves that you've written. Tell us a little bit about that effort and what you're hoping to accomplish with the book.

Sure, yeah. I'm writing a book called Causal Machine Learning. You mentioned at the beginning of our talk how Yoshua Bengio's talk kind of opened the machine learning community to the idea that causality is relevant to machine learning and that deep learning techniques can scale up causal models, and this book is essentially about that. Most causal inference books are very much focused on the traditional causal effect inference workflows you see in data science: how much does something affect something else, and doing so in a way that adjusts for confounding. Obviously I cover that, but I want to make sure you can do it using probabilistic machine learning models in tools like PyTorch, and that the inference algorithms we've built with these deep learning tools, variational inference for example, can be leveraged to do the statistical heavy lifting of causal inference. This lets you focus on understanding how causality works, implementing it in machine learning code using PyTorch, and leveraging your experience with machine learning and deep learning tools to solve causal problems, and of course on understanding the interplay between machine learning, particularly deep learning, and causal inference.

Awesome. And along those lines, through your company Altdeep you've developed and taught a number of courses, and have done so via the TWIML community in the past as well. Do you have any courses coming up that folks should know about, if they hear all this conversation and are inspired to learn a bit more?

Yes. We still run our causal machine learning course, and we have a probabilistic machine learning course that grew out of it, covering all the things to do with probabilistic inference and modeling in machine learning that I couldn't focus on in the causal machine learning course. It's a lot like Kevin Murphy's book, essentially, those types of topics: variational inference, variational autoencoders, normalizing flows, Bayesian hierarchical models, and latent variable models. We're also working on a course on building products around large language models. I'm offering it at Northeastern University to graduate students, and once we figure out how well it's going and smooth out the rough edges, we'll offer it to professionals.

The idea is that we've already seen a whole bunch of products that are, say, using ChatGPT or GPT-3 to write an SEO-optimized blog post about brownies or something like that. But really understanding what the limitations of these models are, and how to deal with those limitations by connecting them with other existing tools and building products for end users that solve an important problem, that's the topic of the course. It's a bit of an experiment; it's much more a how-to-build-products-with-AI style course as opposed to an AI theory type of course. We're excited about it.

Awesome. We'll definitely make sure folks in our community know once it's online and ready to go.

Thank you. Well, Robert, this has been a wonderful conversation, great catching up. I certainly learned a ton, and it's been wonderful having you on the show again.

Thanks for having me, Sam. I really enjoyed myself.
