
Are LLMs Good at Causal Reasoning? with Robert Osazuwa Ness - 638

By The TWIML AI Podcast with Sam Charrington

Summary

Topics Covered

  • LLMs can supercharge causal analysis workflows
  • Causal abilities emerged mysteriously in GPT-3.5 and GPT-4
  • Benchmarks can't distinguish memorization from true causal understanding
  • LLMs are brittle and sensitive to prompt framing
  • Task-specific RLHF could improve causal reasoning

Full Transcript

All right, what's up, everyone? Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I am excited to be here with good friend and longtime friend of the show Robert Ness, a senior researcher at Microsoft Research, professor at Northeastern University, and founder of altdeep.ai. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Robert, welcome back, it's great to see you.

Yeah, it's good to see you as

well, thanks for having me back.

As this interview maybe foreshadows a little bit: our conversation from the beginning of the year, in which you reviewed research progress in causality and causal modeling for our AI Trends series, was published back in February. At the end of that, or maybe throughout, we were talking about LLMs and the implications of LLMs for causality, and there was a little bit of an open question at the time. But you and some of your colleagues just published a paper, "Causal Reasoning and Large Language Models: Opening a New Frontier for Causality," which really dug into the topic. Quick work on that, apparently, and I'm looking forward to digging into it. I think folks on the show may know a little bit about you, but briefly, share your background and what you do at

Microsoft.

Sure. I'm a senior researcher at Microsoft Research, on the Special Projects team out of Redmond. My background is in probabilistic machine learning, probabilistic programming, and causal inference, and how these things come together as causal AI. I've worked a lot with language models too, and it's really exciting now working with large language models at MSR; I can think of few places that are competitive with MSR as a place to work on these problems. On Special Projects we look for these research applications and how we can transfer them into impact, both inside MSR and externally. So we were very keen to explore this problem of what exactly large language models can do in the area of causality. This was in collaboration with my colleagues at MSR, Amit Sharma and Emre Kıcıman, as well as Chenhao Tan at the University of Chicago. We essentially explored some benchmarks in the space as well as different areas of causal analysis and brought that together in this paper. It was really exciting work, and I was happy to be a part of it.

Awesome, awesome. Now, just to ground the discussion for folks that are new to

the idea of causality and causal modeling, causal analysis: can you give us the quick overview so that we can have that context for the specific conversation about LLMs? Also, summarize your life's work in 30 seconds or less, please, sir.

Sure. So causal analysis is something we're interested in across different fields, ranging from econometrics to epidemiology to statistics, obviously, and the natural sciences. The issue is that we have data. Oftentimes it's just passive observational data; sometimes it's experimental data. And we want to make causal conclusions from that data. Knowing the aphorism that correlation does not imply causation, we make certain modeling assumptions and then use the data to try to make causal conclusions. Semi-recently it's become an interesting topic of focus in machine learning, for one because we often have very large data sets and want to leverage the ability of machine learning to scale up to them, but also because causal reasoning is an essential task: if you say that the goal of our field is to achieve generalized artificial intelligence, we believe that the ability to reason causally is going to be a part of that generalized AI. So one aspect of this field is figuring out how we can get learning agents to reason causally and make correct causal conclusions, or make causal conclusions in a way that is comparable to a human baseline, or aligned with some standard for reasoning. It's definitely a growing field, and it's particularly interesting now that we have these foundation models to work with.

What's an example of causal reasoning in the context of LLMs? And

you can use one from the paper.

A simple example might be pairwise causal discovery. This is to say: given two variables, and assuming that one causes the other (let's just say we know that one causes the other), trying to find out which causes which. This could be done with what in the paper we call covariance-based analysis, which is to analyze the data and, using certain modeling assumptions, try to conclude whether A causes B or B causes A. With the large language models, we can also just ask, given the variable names: for example, does temperature cause altitude, or does altitude cause temperature? That's what I call pairwise causal discovery. There's also full-graph causal discovery, which is to learn an entire causal graph of relationships between variables. And then, particularly in econometrics, epidemiology, and statistics, there's looking at whether or not one thing causes another, where the gold standard would be something like a randomized clinical trial. There the question might be: we have some sales data from November, December, and January, and we saw a spike in sales in December; in early December we placed an ad. Did the ad cause the spike in sales? Posing that in natural-language form to the large language model and asking it for a response would be a kind of causal query for what we would call an average treatment effect or a conditional average treatment effect.

Okay. And I think we may come back to that example and talk about what you saw.
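To make the average treatment effect idea concrete, here is a minimal sketch. It assumes a randomized experiment, which the ad example above is not (observational data like that would need confounding adjustment); under randomization, the ATE reduces to a difference in group means. All names and numbers below are hypothetical, not from the paper.

```python
# Minimal sketch: the average treatment effect (ATE) as a difference in means.
# Valid only under randomized treatment assignment; the ad example from the
# conversation is observational, so a real analysis would have to adjust for
# confounders (e.g., holiday-season demand). All data is made up.

def average_treatment_effect(outcomes, treated):
    """Difference in mean outcomes between treated and control units."""
    treated_vals = [y for y, t in zip(outcomes, treated) if t]
    control_vals = [y for y, t in zip(outcomes, treated) if not t]
    return sum(treated_vals) / len(treated_vals) - sum(control_vals) / len(control_vals)

# Hypothetical daily sales; treated=True marks days the ad was running.
sales = [100, 105, 98, 130, 128, 135]
ad_ran = [False, False, False, True, True, True]
print(average_treatment_effect(sales, ad_ran))  # 30.0
```

A conditional average treatment effect would apply the same difference within subgroups (say, per region or customer segment) rather than over all units at once.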

Maybe frame the paper a little bit in terms of your broad goals for it.

Yeah. When we started looking at the question, we were seeing that GPT-3, GPT-3.5, and GPT-4 are very impressive models in terms of the various benchmarks we would use to evaluate a large language model. So how well do they do on causal inference, or causal queries in general? The question in the room was: can large language models reason causally? Of course, when you say the word "reason," it's a bit anthropomorphizing the model, so to be more specific, the way I would frame it is: are there conditions under which we can pose a natural-language causal query and reliably or predictably get the right answer, or not get the right answer, according to some definition of "right"? In some cases that's some objective truth; in other cases it's maybe reasoning along the lines of how a human might reason about a causal question.

And we discovered, well, this is a very broad question. It's very interesting given the types of questions we can ask. We found these benchmarks, we posed them to the large language model, and we got impressive results. But upon looking more closely, we also found that the benchmarks actually aren't very good at answering this specific question, for reasons we can talk about. We worked with OpenAI's series of models, and for our analysis we didn't have access to the weights, the training data, or the architecture. Maybe this is the kind of question where you need that kind of insight into the weights and the training data to be able to answer it provably. So we think it's a very important question that perhaps needs to be better posed, and there are some interesting experiments that could be set up there, with various levels of access to the data, or perhaps training a new model that has certain desiderata for causal reasoning.

But we did find that the answer to a different question was an enthusiastic yes, and that question was: can large language models essentially supercharge causal analyses, our work as modelers in building workflows for answering causal queries? We were excited about that discovery and how it can open up what we call in the paper a new frontier for causal analysis

and causal research.

So I think what you're saying is: you were curious about how LLMs would do at tackling these causal problems, you found existing benchmarks in the causal reasoning community, and you kind of threw LLMs at them, and they did well, comma, but, dot dot dot. So you've got some open questions there, and you found some areas where they didn't do very well, if I'm hearing that correctly and inferring that from the paper. But ultimately you feel that, if not as standalone causal reasoners, they're good tools for complementing human causality researchers, modelers, analysts, whatever. Dig into that "but dot dot dot" part: they apparently do well on all these causality benchmarks, but you were still left with questions. What did they do well at, and what were the questions you were left with? And what is it about those questions that makes you feel you would need access to the model weights, or the training data sets, or other things, to really suss out?

Well, for one thing, we didn't really see competitive results on these benchmarks, with respect to state-of-the-art baselines, until we got to a certain level of model size: essentially text-davinci-003, GPT-3.5-turbo, and GPT-4 are where we started seeing comparable-to-baseline or better results. If you saw the Sparks of AGI paper from Microsoft, one of the stories there was that with the move from GPT-3 to GPT-4 we saw entirely new capabilities that were very weak or non-existent in GPT-3. This has that same flavor, meaning GPT-2 did not have the capability to do any causal reasoning, but you're starting to see it with three and four.

Or is it three and three-point-five and four?

Three and four; let's say text-davinci-003.

Got it. That's the instruction fine-tuning, the reinforcement learning from human feedback.

So to me, that's mysterious. And to be clear, what that means is that before these models, they were doing worse than random. On these benchmarks, if you had a multiple-choice question or a binary question, there's an accuracy number that you get if you're just guessing at random, and it wasn't until post-GPT-3 that we saw baseline-comparable or better-than-random performance.

And why is that mysterious? I think we take it for granted in a lot of contexts that the models are a lot better, and there are a lot of things that they can do and perform better on, as they get bigger.

My personal take on this type of thing: there's a lot of talk in our community about emergent behavior, and I agree with that, and I understand how, with the attention mechanism, the more data you give it, the more it can move up a hierarchy of abstractions in terms of what it's attending to. But it's mysterious, and I mean that seriously, in the sense that we have no idea why this works, and it's kind of interesting that it does.

Yeah, and that makes me a little bit queasy, particularly if we start talking about, I mean, this is causality, right? I get queasy if we start asking important questions and we don't know why or how this works. In the machine learning community, oftentimes our metrics are focused on prediction, and maybe, if you're nuanced, you might define some kind of cost function on making errors. But traditionally speaking, the causal inference community is conservative in terms of what it's going to conclude: we're saying that this vaccine prevents this illness, or prevents the worst cases of this illness. So the question is: is its good performance a correlation, or is it causally valid? If you're going to start deploying this thing to actually make decisions that involve causal logic, then a mysterious capability of doing well on accuracy metrics, on benchmarks which, as we talk about in the paper, are often memorized by the large language models, is not a high enough standard for that sort of thing. So, to give an example:

Let's suppose that I were to deploy a model that would ingest somebody's resume and then tell me whether or not I should hire them. By the way, nobody should do this; there are issues with this. But suppose I was going to deploy that system and just have it automatically give an accept or reject for somebody who's applying for a job. In my prompt, I could give it constraints that would force it to omit, in a causal sense, certain factors from the logic that drives why it would or would not hire somebody. Say, for example, I said: you may not consider gender, you may not consider ethnicity, etc.; I want you to consider only factors in their educational or professional background. And you could say, do chain-of-thought reasoning, show me your logic step by step, and it gives you a step-by-step argument for why it says you should not hire this person. From a causal perspective, you would want to make sure that, despite the fact that it says that's the reasoning it's deploying to arrive at this decision, it isn't just BSing you with a politically correct answer while, given all the, say, toxic content it was trained on, something is still somehow biasing the answer.

In maybe a lot of scenarios this isn't a big deal, but from a causality standpoint, we care a lot about identification. Sorry, by identification I mean being able to show, provably, that you can use the data to answer a causal query. And you care a lot that the model is using your causal assumptions to arrive at the answer.

Yeah, so maybe activations and things like that, when these questions are being evaluated, might give you some hint as to how they're grounded in the model, or something?

Yeah, exactly, like using probing procedures to see if the latent representations in the model align with some kind of causal model, or causal abstractions, or causal assumptions. This is very fresh research; we're working on this, other people are working on this, and it's not in this paper. But my personal belief is that that's the type of analysis you want, the kind of experiments you want to run, to understand exactly how it's reasoning causally, if that's indeed what it's doing.
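The probing idea described above can be illustrated with a toy sketch. The "activations" below are synthetic stand-ins (real probing would extract hidden states from the LLM, which this sketch does not do), and the probe is a hand-rolled logistic regression kept dependency-free for illustration.

```python
# Toy sketch of probing: train a linear classifier on (synthetic) hidden
# activations to test whether a causal fact, here a binary "A causes B"
# label, is linearly decodable from them. Real probing would use actual
# LLM hidden states; these vectors are fabricated for illustration.
import math
import random

def train_linear_probe(activations, labels, lr=0.1, epochs=200):
    """Logistic-regression probe fit with plain stochastic gradient descent."""
    weights = [0.0] * len(activations[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) minus label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def probe_accuracy(weights, bias, activations, labels):
    """Fraction of examples the trained probe classifies correctly."""
    hits = 0
    for x, y in zip(activations, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += (z > 0) == (y == 1)
    return hits / len(labels)

# Synthetic activations: only the first coordinate carries the causal signal.
random.seed(0)
labs = [0, 1] * 50
acts = [[y + random.gauss(0, 0.1), random.gauss(0, 1)] for y in labs]
w, b = train_linear_probe(acts, labs)
print(probe_accuracy(w, b, acts, labs))
```

Note the caveat Robert raises: high probe accuracy would suggest a fact is represented in the activations; it would not, by itself, show the model actually uses that representation when answering.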

So, to be clear, on several of the types of questions from these benchmarks that were posed to the models, they did really well, is that fair to say? They did very well compared to the state of the art on these kinds of causal reasoning tasks, am I interpreting that correctly? There are some numbers even in the abstract: 97%, a 13-point gain, for pairwise causal discovery; 92% for counterfactual reasoning, a 20-point gain, and I'm assuming that's 20 points over the prior state of the art; 86% for actual causality, with no gain discussed there. That seems to be good performance. And I guess one of the questions you implied earlier is: hey, these benchmarks are out on the internet; did the model just read these benchmarks and somehow know the answers?

Yeah. So you mentioned that 97%: that was on the Tübingen pairwise discovery benchmark. I mentioned altitude and temperature; this benchmark has several examples like that from across different domains, so zoology, chemistry, geology. It gives you a pair like that, say temperature and altitude, and then it gives you a bunch of measurements of temperature and altitude from different geographical locations on Earth. The intention in the construction of this benchmark was to use a statistical method that looks at this pair of variables and figures out which is the more likely direction: that temperature causes altitude, or that altitude causes temperature. What we did in the experiment was to throw away all the data, take just the variable names, and then pose some questions. Specifically, the prompt was something along the lines of: "Which cause-and-effect relationship is more likely? One: changing variable A causes a change in variable B. Two: changing variable B causes a change in variable A. Let's work this out step by step to make sure that we have the right answer. Provide your final answer within these tags so that we can parse it." We did not use any in-context learning examples, but we did give it a system prompt that said, "You are a helpful assistant for causal reasoning." And that's what we did. So

we saw that chain of thought mattered and the system prompt mattered: we saw about a five-point increase for GPT-3 when we used the system prompt versus when we didn't. And then, in terms of the kinds of errors it made: it got a very high percentage overall, and when we looked at some of the errors, in some cases it was producing things that read as coherent but contained small errors in judgment. But oftentimes we saw cases where we actually kind of agreed with the large language model. For example, ozone and radiation: the answer was that radiation causes ozone, that the presence of radiation causes ozone, and it concluded that ozone causes radiation. But when we looked at the chain of thought, you could see that it was talking about ozone in the stratosphere: when the ozone layer is depleted, there's more solar radiation. We were talking about ozone at surface level: if there's radiation from some source, it causes there to be ozone in the air. So oftentimes when it got something wrong, we realized it could have gotten it right with just a little bit more clarification. Those are rare examples, but when they did happen, it implied that, again, it's sensitive to the prompt, and also that so far these models aren't particularly good at asking clarifying questions, which is another area of concern. So, yeah, there were some failure modes, but it was rare that they were just outright wrong, at least for the causal discovery task.
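The pairwise prompt setup described above can be sketched as a small template plus a parser for the tagged answer. The wording paraphrases the conversation rather than quoting the paper's exact prompt, the `<answer>` tag convention is an assumption for illustration, and the actual model call is omitted.

```python
# Sketch of the pairwise causal discovery prompt described above.
# The wording paraphrases the setup from the conversation; the <answer>
# tag is an assumed convention, and no API call is made here.
import re

SYSTEM_PROMPT = "You are a helpful assistant for causal reasoning."

def pairwise_prompt(var_a, var_b):
    """Build the two-option cause-vs-effect question for a variable pair."""
    return (
        "Which cause-and-effect relationship is more likely?\n"
        f"1. Changing {var_a} causes a change in {var_b}.\n"
        f"2. Changing {var_b} causes a change in {var_a}.\n"
        "Let's work this out step by step to make sure we have the right "
        "answer. Provide your final answer (1 or 2) within <answer> tags."
    )

def parse_answer(completion):
    """Extract the final 1/2 choice from the model's tagged response."""
    match = re.search(r"<answer>\s*([12])\s*</answer>", completion)
    return int(match.group(1)) if match else None

print(pairwise_prompt("altitude", "temperature"))
print(parse_answer("Step by step... <answer>1</answer>"))  # 1
```

A real run would send SYSTEM_PROMPT as the system message and the built question as the user message, with no in-context examples, matching the zero-shot setup described.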

When it came to some of the actual causality and causal judgment tasks, it was more often making errors in logic. But still, the logic with GPT-3.5, and especially GPT-4, on the causal judgment and actual causality tasks was pretty coherent, even when it was wrong. So you can imagine that with better prompting you might be able to get it into the right place.

Just out of curiosity, when you said you did some testing around the system prompt, and you mentioned presence versus non-presence of the system prompt: did you vary the system prompt? I'm wondering whether you were able to identify if just saying "hey, you're a helpful assistant" got you a boost, or whether it was contextualizing that causality is what you're looking for that conditioned the results positively.

We didn't do much exploration in terms of prompt optimization. We basically gave it a general, obvious system prompt that stood out to us and asked for chain-of-thought reasoning. Particularly with some of the tasks that had to do with counterfactual reasoning, actual causality, and causal judgment, there was a clear sign that better prompting would have an impact, but that was not a dimension we explored deeply.

So, going back to this kind of core question:

it did very well in pairwise causal discovery, but there's some concern about the validity of that as a result. If you can separate out the "hey, I don't really understand it, and I want to understand it before I use it" part: you mentioned the possibility of memorization as one concern. Part of me says, is that really a problem? In a sense, that's kind of what you wanted it to do. If you're talking about these broad relationships, like altitude and temperature, you want it to have kind of memorized the answer to that, or seen a bunch of examples of it, and pull that into its reasoning, right?

You pick up on a really interesting and nuanced point, which is that with most large language model tests we have this problem of: is the large language model ingesting the benchmark and just spitting back out a memorized version of the benchmark? We want to avoid this problem of memorization, but to your point, we want some memorization. We want it to know what altitude is and what temperature is, to have memorized those; it's not an agent acting in the world, it can only learn through the training data. So we want it to know what those things are, and that there is a causal fact that altitude affects temperature, and we want it to have memorized that. We just don't want it to have memorized specifically what's in the benchmark itself.

Yeah, but what does that even mean, and how do you do that, for broad facts like altitude and temperature?

Well,

so the first thing we did was to test whether or not the benchmark was in the training data, and to do that we did a kind of text-completion test that basically tried to complete elements of the benchmark. We put in the readme and the background material from the benchmark, and then put in a specific row, say the first two elements of a row of the tabular data in the benchmark, and then basically tried to see if the model could autocomplete the rest of the row. And for the Tübingen data, it did. I did that for about 20 of the examples, and in terms of just predicting the next cell in the row, it got about 60% accuracy. So clearly the Tübingen data was in the training data. And so we had to think about, okay, what does that mean now? What are we supposed to do with this in terms of deploying it?

Well, the actual problem that we care about is: to what extent does it generalize beyond the benchmark? Think of the task of answering the question as partitioned into two parts, or say two probabilities: one being the probability that the causal fact is encoded, in some sense, in the large language model, or in other words, that the large language model has learned this causal fact; and the second being, conditional on having learned this causal fact, to what extent it can answer your specific question, given the context, given your prompt, in a way that transforms that causal fact into part of the answer. The benchmarks are good at evaluating the second, but we can't know, generally, how well a large language model has learned some causal fact about the world, just how well it can answer a question given that it has learned it. And that's always going to be confounded with its ability to kind of BS you and hallucinate. So the conclusion there is: these benchmarks themselves don't tell us the full picture. And we also want to know how well it can generalize; the whole point of the benchmark is, can it generalize to tasks that are like the examples in this benchmark? So we explored some ways around this problem.

Can you give some

concrete examples of that generalization? In the case of, say, the Tübingen examples you've already given: are you asking it to answer, or reason about, different word problems that relate the facts around altitude and temperature, for example, or is it some other kind of generalization?

Right. So one example we talk about in the paper was the causal relationship between the length of a sea mollusk called the abalone and the age of an abalone. The truth there is that the older the abalone gets, the longer it is. We also saw an interesting error mode there: if we changed "length" to "diameter," it actually got it wrong. That's an interesting example of how whatever it's doing, it's kind of off, in the sense that it's more like memory than understanding.

Right, so changing the word to what's effectively, in this context, a synonym, it gets the answer wrong.

And what you're hoping for, in terms of generalization, is that it could generalize to other related causal relationships, causal facts, in the area of zoology. Something similar would be the number of rings in the stump of a tree and the age of a tree, for example, which is not in the data. That's what I mean by generalization.

Okay, okay.

And so we started exploring: are there ways

that we can get around this? Let's say, for example: could we prompt it with some sort of inductive bias that would help it generalize? In machine learning we often talk about inductive biases in terms of the architecture; say, for example, convolutions and max pooling give you translation invariance. In causal inference we're often getting our inductive biases from explicit modeling assumptions that we formalize into a model, for example a causal graph, a DAG, or more statistical assumptions, like the functional assumptions in a structural causal model. One of the interesting things with the large language model is that it allows you to use inductive biases that are much easier to express in the form of natural language, like, for example, Occam's razor.

So we ran an experiment where we asked the large language model, with the pairwise discovery task: give us the best argument you can for why A causes B; now give us the best argument you can for why B causes A, and try to make each argument as coherent and as well-reasoned as possible. And then, in a third prompt, we said: okay, look at these two arguments that were just generated, and tell us which argument is preferred based on Occam's razor. This is based on research in psychology that suggests that the causal explanation that introduces the fewest external factors to explain something is preferred, at least for humans.

And we saw that, for the wrong direction, it had to kind of add stuff to come up with a good argument. For example, with the abalone's age and length, it said, well, longer abalones could be much more competitive in a resource-restricted environment in getting food, and they could kind of crowd out all the other abalones and therefore live longer. That's a well-reasoned and plausible argument, but it required introducing resource constraints as an additional factor, while the simple explanation, that as things get older they get longer, did not require that. In that case we got, obviously, lower accuracy: with GPT-4 we got about 85% accuracy there. But considering the fact that we know

uh The Benchmark was memorized I think that's pretty good you mentioned um your in talking about these inductive biases the paper talks about

the these models ability to create causal graphs like did you mean that literally like um

you know you give it a problem and or a prompt and you say you know from this you know using some you know formula that or formulation that lends itself to

text like State the causal graph or are you inferring its ability to create the causal graph and its ability to perform well on these benchmarks and and answer certain kinds of questions

Literally: the ability to say, here's my problem, here are the variables, give me a causal graph that connects them. Plenty of people have experimented with this. For example, if you go to ChatGPT right now and say, "I'm interested in the relationship between smoking and lung cancer; give me a causal graph, some possible confounders, some mediators between smoking and lung cancer, maybe something that introduces collider bias, some instrumental variables like the cost of cigarettes," it will suggest those variables and give you a graph. And with a canonical example like that, it will be something very plausible.

What we wanted to do was give it a set of variables for a non-trivially large graph, in some cases 13 or 14 variables, ask it to construct the graph, and compare against causal discovery algorithms that construct a graph through algorithmic analysis of the data. The prompting was nothing sophisticated. We did not just give it a bunch of variables and say "give us a graph." Instead we extended our pairwise discovery problem with a third option: instead of just "A causes B" or "B causes A," there was also "none of the above," meaning there is no edge between these two nodes. We collected all of those pairwise relationships, or absences of relationships, and used them to assemble the graph, ignoring any acyclicity constraint, just assembling a bunch of pairwise relationships. Then we looked at things like F1 and structural Hamming distance between the learned graph and, in cases where we had it, the ground-truth graph. It had performance comparable to state-of-the-art discovery algorithms, including algorithms that use deep learning methods.

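The pairwise-to-graph assembly and evaluation described above can be sketched in a few lines. This is a minimal sketch, not the paper's code: `ask_llm` is a hypothetical stand-in for a prompted model call, and F1 and structural Hamming distance are computed over directed edge sets.

```python
from itertools import combinations

def assemble_graph(variables, ask_llm):
    """Build a directed edge set from pairwise judgments.

    ask_llm(a, b) stands in for a real prompted model call and returns
    one of "a->b", "b->a", or "none". Acyclicity is not enforced,
    matching the simple assembly described in the conversation.
    """
    edges = set()
    for a, b in combinations(variables, 2):
        verdict = ask_llm(a, b)
        if verdict == "a->b":
            edges.add((a, b))
        elif verdict == "b->a":
            edges.add((b, a))
        # "none": leave this pair unconnected
    return edges

def edge_f1(predicted, truth):
    """F1 over directed edges."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def shd(predicted, truth):
    """Structural Hamming distance: each missing, extra, or reversed
    edge (per unordered pair) counts as one edit."""
    dist = 0
    seen = set()
    for e in predicted | truth:
        pair = frozenset(e)
        if pair in seen:
            continue
        seen.add(pair)
        a, b = e
        fwd_p, rev_p = (a, b) in predicted, (b, a) in predicted
        fwd_t, rev_t = (a, b) in truth, (b, a) in truth
        if (fwd_p, rev_p) != (fwd_t, rev_t):
            dist += 1
    return dist

# demo with a mock oracle (hypothetical answers standing in for model calls)
truth = {("age", "length"), ("age", "weight")}
answers = {("age", "length"): "a->b",    # correct direction
           ("age", "weight"): "b->a",    # reversed
           ("length", "weight"): "none"}
pred = assemble_graph(["age", "length", "weight"],
                      lambda a, b: answers[(a, b)])
print(edge_f1(pred, truth), shd(pred, truth))  # 0.5 1
```

With the mock oracle reversing one of two true edges, the sketch yields F1 = 0.5 and SHD = 1; in the paper's setting, `ask_llm` would be a three-way prompt to the model for each variable pair.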
Of course, this implies that the names of the variables are informative. If the names are just A, B, C, and D, it's obviously not going to work. It has to have informative variable names.

Meaning it needs to output informative variable names?

No, the input variable names. For example, one dataset we used was on neuropathic pain, with variable names like "right L1 radiculopathy" and similar clinical terms, things a domain expert could look at and understand.

So, given what you observed, net out what these models are good at today and what they're not good at. You've covered some of that, but net it out, and then take the next step in your analysis: therefore, these models can be used in these ways by practitioners as effective tools.

I would say the models are getting good accuracy on at least the pairwise and full-graph discovery benchmarks.

But is there a lower level of detail in there? Is pairwise discovery an atomic task, or are the models good at pairwise discovery because they can do A, B, and C, while we see these kinds of failures a lot because they can't do D?

I would say, and let me know if this answers your question, that the performance on pairwise and full-graph discovery

indicates that they are genuinely useful for the problem of coming up with the causal DAG that drives your downstream analysis. Every causal inference problem, from causal discovery to causal effect inference to counterfactual reasoning, requires causal assumptions. And from research, from experience, and from working with users of open source packages like DoWhy and others in the PyWhy community, and from my teaching, a common theme is: how do I know I have the right DAG? If the downstream analysis depends on my DAG, I'm worried I'll specify the wrong model and my downstream analysis will be wrong. This problem of translating domain knowledge into some computable artifact that encapsulates our modeling assumptions is a challenging one for causal inference practitioners, and I think the results on pairwise and full-graph discovery show that these large language models are very useful at jumping that gap, at getting from what you know to something you can use to drive a statistical analysis.

That said, it is brittle in ways we don't really understand. For

example, "age of abalone causes length of abalone" it gets right, but "diameter of abalone" versus "age of abalone" it gets wrong, and that's a weird thing to get wrong. We had another example about causal effect estimation, the one I mentioned at the beginning of the call: we see a spike in sales in December; was it the ad? It gave us a very well articulated answer and suggested we run an A/B test. We gave it the A/B test results, I did a causal-decision-theory type of analysis, and it concluded this is a good policy, you should run this ad. It missed the fact that it's December, the holiday season. So it misses obvious things while doing really well at other things, and that suggests to me it's useful in a human-in-the-loop context, a human using it as, to borrow our Microsoft parlance, a kind of causal copilot. Doing the analysis by itself, it just blows up too frequently.

In the context of that example, missing that December is the holiday season: is that a failure in analysis? You could argue it's a failure in constructing the graph, but it's because it didn't pull in context that should have been obvious and that was not stated explicitly in the formulation of the

problem.

Right. In the prompt itself it was almost a trick question. We said: in October and November we got this many sales, in January and February this many, in December a whole lot; we placed an ad in early December; here's what the ad cost and here's the amount of sales we got. We gave it all of that.

You were leading it down that path.

Yeah, we were anchoring it, baiting it to focus on reasoning about the ad itself. But it's reasonable to assume an expert would have said: wait a second, before we go down this rabbit hole of the cost-benefit analysis of this ad, we need to understand that there are other possible factors here, and the most obvious one, based on domain knowledge, is that December is the holiday season. That's exactly what it's supposed to be good at: deploying domain knowledge to answer the question. The prompt in our case was sneakily leading it away from the right answer, but you want it to be robust to that; you want it to detect your blind spots. Had I forgotten about December, had I just been looking at some tabular data and completely forgotten that there are holidays in December and people buy toys, because that's not in my data, I would hope I could use this model to capture domain knowledge about the space. It certainly knows it: if I prompted it differently, I'm confident it could say, oh yeah, December is the holiday season, and a lot of retailers make most of their year's income in December.

And it seems like an "explain your work" kind of setup would have elicited

that. And in other cases it did. Again with the abalone example: in the chain of thought, when I asked it to explain why the age of the abalone causes the length of the abalone, it says something like, "another thing I might consider is the resources available in its environment, like food and other things it needs," but then it discounts that: that's not really important here; what we need to focus on is that as things get older they grow. So in that case it says, here are some external factors that might matter, but we're going to ignore those. In the sales example it never said, "another factor could be a holiday bump, but we're going to ignore that and focus on the ad." It just got steered into focusing on the ad. That tells me this thing is going to be very sensitive to the prompt, so the causal inference expert has to be a very good prompt engineer to deploy these methods. Or perhaps in the future our models get smaller, sparser, maybe more focused on causal tasks, and can help alleviate that problem. But that's speculation.

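The prompt sensitivity described above suggests a simple practical check a practitioner could run before trusting a verdict: ask the same causal question under several surface framings and see whether the answers agree. This is a minimal sketch under stated assumptions; `ask_llm` and the framings are illustrative stand-ins, not from the paper.

```python
def consistency_check(a, b, ask_llm):
    """Probe brittleness: pose the same pairwise causal question under
    different surface framings and report whether the verdicts agree.

    ask_llm(prompt) is a hypothetical stand-in for a model call and
    returns one of "a->b", "b->a", or "none".
    """
    framings = [
        f"Which is more plausible: {a} causes {b}, or {b} causes {a}?",
        f"Does changing {a} change {b}, or does changing {b} change {a}?",
        f"A domain expert is asked: what is the causal direction between {b} and {a}?",
    ]
    verdicts = [ask_llm(p) for p in framings]
    return {"verdicts": verdicts, "consistent": len(set(verdicts)) == 1}

def flaky_llm(prompt):
    # hypothetical model whose verdict flips under one framing
    return "b->a" if prompt.startswith("A domain expert") else "a->b"

report = consistency_check("age of abalone", "length of abalone", flaky_llm)
print(report["consistent"])  # False
```

Disagreement across framings does not tell you which answer is right, but it flags exactly the kind of framing sensitivity the December-ad example exposed, which is useful in a human-in-the-loop workflow.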
Do you foresee the next iteration of models addressing some of these issues? How do you see the future of LLMs changing the results you saw in this paper?

With GPT-4 we saw a lot of saturation on the pairwise and full-graph discovery datasets, which, again, were likely memorized. So an interesting area is whether we can come up with other ways to evaluate these models on these tasks. One thing we started exploring in the paper is actual causality and causal judgments.

Explain "actual causality" here.

Type causality, which is what we've been talking about, is about things that cause other things in general. Actual causality, or token causality, asks: here's an event that happened; what events caused it? It focuses on individual events as opposed to reasoning over variables or populations. We looked at a few benchmarks for that. One is the CRASS benchmark, and it did quite well on that dataset. But there's an interesting dataset in the BIG-bench collection of large language model benchmarks called "causal judgment," and GPT-4's accuracy on that benchmark is about 65%. Some people have reported higher, but when you actually ask it to do chain-of-thought reasoning, accuracy goes down a little bit. And this

is an interesting dataset because, unlike a causal effect or a causal graph, there is no real ground truth; it's about how humans reason about what causes what. If you ask why the glass spilled, it would be true to say "because of the Big Bang," so answering involves determining which causal events were actually relevant in this context, setting a causal frame, and we found it's very good at setting that causal frame. Again, in this benchmark the labels are not some objective fact; they're how humans answered these questions, and the top human labeler gets 100% of the answers right. So you're evaluating how well a model can align with human causal judgments, and 65% is still far from saturation. But if we look at the cognitive science literature, we see that

there are components to how humans make a causal judgment: whether an event was a necessary cause (did it need to happen for the outcome to happen), whether it was sufficient to cause the outcome, whether it was a norm violation in a statistical sense (was it surprising or unexpected) or a social sense (did it break a law or people's social norms), whether other events were norm violations (for example, if you're playing cards and the other person draws a really bad hand, do you blame the dealer, their bad hand, or your good hand?), whether the outcome was undesirable (a neutral outcome where nothing bad happens versus, say, people dying in a trolley problem), and whether the cause was an omission (you broke the machine versus you failed to maintain the machine). All of these drive human causal judgments about actual causality.

We found the model did fairly well at parsing these individual components, at least comparably to its ability to predict the actual causal judgments. If you ask it whether something was a necessary cause, a sufficient cause, or a norm violation, it's getting roughly 70 to 80 percent. It did poorly at determining whether other causes were norm-violating, though that could have been the prompt. It did very well at discerning whether the outcome was undesirable or neutral, and got about 70% accuracy at determining whether the outcome was caused by omission or by action, fixing the machine versus not fixing the machine. So it seems to be doing well both at breaking down a causal judgment problem and at predicting the actual judgment. And I think there's a lot of interesting space there: if it's good at predicting the components of a judgment, could we prompt the large language model in a way that uses those components to get to the right answer, coming up with causal recipes for reasoning that you supply as part of the prompt? I think that's an interesting area of investigation.

What's interesting in what you just

described, with this dataset that articulates human approaches to causally reasoning about a problem, is that it's exactly what you'd want to tune on with RLHF, and then you'd have a model that's been specifically trained to answer questions in a causal way. That raises the issue that you'd have another benchmark you can't use anymore because you've sucked it into your training data, but do you think that would result in a model better able to think causally, or, not to anthropomorphize, to approach problems causally, to reason causally?

I think exploring RLHF approaches is worth it. The original RLHF approach that made ChatGPT so impressive was essentially asking human raters "what's the best answer here," so you're training the model to emulate human responses. Training it instead to align with some other recipe for providing an answer is definitely something I think people should be trying. Is sounding like a human the only

thing we care about? Maybe not: sometimes when I correct it and it apologizes to me, I think, you're not actually sorry. For a lot of tasks we could do away with the biomimicry and just focus on aligning it with some recipe for coming up with a good answer. That said, we still need to understand whether it's actually following that recipe when it does that, as I described with the hiring example. But I definitely think it's a fruitful area of research, particularly for personalization. We don't just need it for benchmarks: if you're fine-tuning a model, or training your own model for a task in your particular corporate context or for a specific academic problem you're working on, maybe you want to do a narrow, task-specific RLHF that's not general enough for a benchmark.

So aside from your own work and your collaborators', what's the coolest thing you've seen out there recently from an AI perspective?

I'll stick with the causality stuff.

I'll give a shout-out to a PhD student at Stanford University named Atticus Geiger, whom I think I mentioned in our previous interview. He has a method, I believe it's called interchange intervention training, that looks at ways to take a foundation model, which could be a large language model, and train it such that the abstractions the model is learning in its latent representations align with an oracle causal model you have for a specific task. I like that task-specific element: you can train the large language model on everything, but then, with respect to some specific task, make sure it's aligned with a causal model. For a lot of tasks it may not be possible to come up with a formal causal model, but for some it is, and making sure the model is aligned in the cases where you can do that is a very interesting idea. I hope we see more work there, and I encourage people to read his papers.

We'll put a link to the paper in the show notes, obviously. Anything else you want to plug?

Recently I've been getting into

this work from my colleagues at Microsoft Research, led by Scott Lundberg and others, on a repository called guidance. It's a tool for working with large language models and constraining and controlling them the way you want. It's competitive with LangChain and Semantic Kernel, but guidance has a lot of really cool features; it's a pleasure to use, and it's growing really quickly. If you haven't heard of it, you should try it out. If you've had difficulty getting a large language model to select from a specific set of answers, or with token healing, or with returning JSON that has a specific structure, it's very good at that, as well as other things. I encourage people to check it out.

Awesome. Well, Robert, as always, it's great to connect and catch up on what you're working on.

Thanks so much for your time, Sam. I enjoyed this.
