Create Your Causal Inference Roadmap. Causal Inference, TMLE & Sensitivity | Mark van der Laan S2E6
By Causal Python with Alex Molak
Summary
Topics Covered
- Observational Studies Can Be As Powerful As RCTs
- Uncertainty Has Multiple Layers in Causal Inference
- You'd Need to Believe a 45% Cure Rate to Dismiss This
- TMLE Is the Future of Causal Inference
- ChatGPT Made My Son More Powerful Than College
Full Transcript
And then in an observational study you might still have knowledge, right? Because, let's say the doctors assigned the treatment, you can talk to the doctors and ask them: how do you make these treatment decisions? What variables do you take into account?
It was a huge effect in the observational data. So if you tell people that, and then you look at your data and you see that everybody who doesn't take the treatment doesn't get cured, it's almost 100% no cure, then you start realizing: wow, this is clearly very significant, even though the study had all kinds of problems.
What is a plausible causal gap? Now that's where you can play all kinds of games, and that's where these different ideas can come in. One is, of course, that negative controls can be helpful.
That will also give you an idea, so that you could say at the end: my causal gap has to be three times as big as the impact of these particular confounders in my data set. So you have to believe that the unmeasured confounders out there are much stronger than the measured confounders to still wash away this finding.
What's your favorite sensitivity analysis approach?
Hey Causal Bandits, welcome to the second season of the Causal Bandits podcast, the best podcast on causality and AI on the internet.
Early on in his life, he was only interested in two things: chess and tennis. Today, his inspiration comes from real-world problems. He's driven by large-scale vision, not step-wise thinking. The godfather of TMLE, professor of statistics at UC Berkeley. Ladies and gentlemen, please welcome Professor Mark van der Laan. Let me pass it to your host, Alex Molak.
Welcome to the podcast, Mark.
Thank you. Pleasure to be here.
It's a great pleasure to have you here. You're such a big name in the community, and I'm very excited that we finally managed to meet and that we can have this conversation and share it with the community.
Excellent. Excellent. Yeah, I'm looking forward to it.
So, I would like to start with a topic that some of our listeners will be very familiar with, and some others maybe a little bit less. Nevertheless, I think it's an important topic, and hopefully an interesting one: the idea of the causal roadmap. When we think about a traditional machine learning project in industry, for instance, we have some steps we take to make sure that the solution we build will be of good quality. When we do statistical analysis, when we design a study and so on, we also have certain steps to check, to make sure that, I don't know, our randomization is correct and so on. With causal inference, especially causal inference from observational or mixed data, this picture gets even more complicated. We have some authors, starting maybe with Judea Pearl, and then people like Amit Sharma, the author of the DoWhy library, who tried to pin down, so to say, this process and encapsulate it into a finite number of steps that could make it easier for practitioners to grasp what's going on in this process of causal inference. You also proposed a roadmap with your colleagues. Can you tell us a little bit more about your vision of the causal inference roadmap, and in particular the role uncertainty plays in this roadmap?
Right. Yeah, we called it the roadmap for causal learning, or targeted learning as well.
It is very important for us and has played a big role in anything we do, really, and also in our curriculum for the students. The basic steps, as we stated them: step one is that you have to realize that your data is the result of an experiment. So you have to describe the experiment which generated the data, and with that you can also start describing the kind of causal question you have. So step one is just coding the data.
You might already give it some notation, because you probably have a causal question in mind, and with that causal question come certain definitions: what you might call the treatment, and there might also be things like censoring. We give those special notation; these variables are the so-called intervention nodes in causal inference. This way we start describing the data purely in notation, so that we can start naming variables and refer to them. For example, if you go to Jamie Robins' early papers, he would use the notation L(t) and A(t) over time: L(t) would be everything you measure at time point t or in an interval t, such as time-dependent biomarkers and the outcome process, and A(t) is used for the treatment at time t, and you can also have censoring. That's step one.
Then step two is describing the likelihood of the data. By the way, there are slight differences in how people present the ordering. Some start with a causal model. In this case, I say let's write down the likelihood of the data. What you typically do there is just ask: how was the data collected? Was there a time ordering? Did we first collect this, then that, then that? That's already getting somewhat towards what you might think of as a causal model, but usually this is just known. You just know things were collected over time, and that helps you, because causal inference gets hard if you don't have a time ordering.
When you describe the likelihood, you also start thinking about what you know about it. In traditional statistics we talk about maximum likelihood estimation, which means you need a model, and we call that a statistical model. The statistical model involves, in this case, asking about each next variable, for example the conditional probability of the treatment assigned to the person given the history: what do you know about that? That's an example of knowledge. Of course, in a randomized trial, you would literally know it completely. It would be a flip of a coin and it would be a marginal probability. You might have more advanced sequential randomized trials where you still know it, but then these probabilities are learned from the past.
And in an observational study, you might still have knowledge, right? Because, let's say the doctors assigned the treatment, you can talk to the doctors and ask them: how do you make these treatment decisions? What variables do you take into account?
That's what we call conditional independence assumptions, where you say the treatment is independent of the history given these key variables. That's incredibly important knowledge. If you have that, it actually makes an observational study as powerful as an RCT, with some extra challenges, but you can compensate for those with a larger sample size. That's the statistical model. So that's essentially just writing down the likelihood as a product over time points of conditional probabilities of what you observed next given the past, and what you know about them. Then comes the next question: what is the feature we're trying to learn from the data? That's what we call the statistical estimand.
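The likelihood factorization described here can be written out as a sketch. The notation is an assumption for illustration: it borrows the L(t), A(t) convention mentioned earlier, writes a bar for the history up to a time point, and uses V(t) as a hypothetical name for the "key variables" the treatment decision depends on.

```latex
% Observed data on one subject: O = (L(0), A(0), L(1), A(1), ..., L(K), A(K), L(K+1))
% Likelihood as a product over time points of "what comes next given the past"
p(O) \;=\; \prod_{t=0}^{K+1} p\bigl(L(t) \mid \bar{L}(t-1), \bar{A}(t-1)\bigr)
      \;\prod_{t=0}^{K} p\bigl(A(t) \mid \bar{L}(t), \bar{A}(t-1)\bigr)

% Knowledge about the treatment mechanism as a conditional independence assumption:
% treatment depends on the full history only through the key variables V(t)
p\bigl(A(t) \mid \bar{L}(t), \bar{A}(t-1)\bigr) \;=\; p\bigl(A(t) \mid V(t)\bigr),
\qquad V(t) \subset \bigl(\bar{L}(t), \bar{A}(t-1)\bigr)

% In a simple randomized trial the treatment factor is known exactly,
% e.g. p(A(0) = 1) = 1/2 for a fair coin flip.
```

The statistical model is then the set of all distributions of this form that respect whatever is actually known about each factor, with everything else left unrestricted.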
Now that's challenging, and that's why you need a process for it. That's where we get into defining a causal model, which kind of represents a perfect world where you can observe everything you'd like to observe. For example, in causal inference there's the so-called Neyman-Rubin model, or the structural equation model, but basically it allows you to define potential outcomes: for every subject you can think of an outcome under one treatment and an outcome under another treatment. That defines what we call the full data. It's the perfect data you would have liked to observe on everybody, and in that world you define the feature you're trying to learn, which is now the causal estimand. It's not something you can necessarily learn from your observed data, but in that perfect world you could have. That helps a lot, because then you can sit down with your collaborators and define the real question, and define it in terms of a hypothetical experiment, like a hypothetical randomized trial, or a perfect world where you can collect all these potential outcomes, and then ask: what do you really want? Now we have defined the causal estimand; you can think of a simple ATE, the average treatment effect. Then comes the next reality check: how are we going to identify that from the observed data distribution?
In other words, how can we go from the likelihood of the data, which we can learn, towards that causal quantity? That's where you get so-called identification results, and part of what causal inference is about is establishing identification results. There are a lot of them out there: based on the no unmeasured confounders assumption there's the so-called G-computation formula, or the longitudinal G-computation formula (Robins, 1986), and there are also identification results based on instrumental variables, and so on. It's like a theorem. It says: I can describe this causal quantity as a feature of the distribution of the data under the following assumptions. At that point, you have found a statistical estimand, something you can learn from the data, which you can claim to be equal to the answer to your causal question in case these assumptions hold. At that point, you might have to discuss what you think about these assumptions.
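As a minimal sketch of what an identification result buys you, the point-treatment G-computation formula says that, under no unmeasured confounders, the ATE equals the average over covariates of E[Y | A=1, W] minus E[Y | A=0, W]. The example below is an illustration only: the simulated data, the function name `g_computation_ate`, and the linear outcome regression are all assumptions made here, and a parametric regression is used purely to keep the sketch short, not because the roadmap recommends it.

```python
import numpy as np


def g_computation_ate(w, a, y):
    """Plug-in G-computation estimate of the ATE.

    w: (n, d) measured confounders, a: (n,) binary treatment, y: (n,) outcome.
    The outcome regression is a linear model with treatment-by-confounder
    interactions (an assumption of this sketch).
    """
    n = len(y)

    def design(a_vec):
        # Columns: intercept, treatment, confounders, treatment x confounders
        return np.column_stack([np.ones(n), a_vec, w, a_vec[:, None] * w])

    # Fit E[Y | A, W] on the observed data
    beta, *_ = np.linalg.lstsq(design(a), y, rcond=None)
    # Predict for everyone under A=1 and under A=0, then average the difference
    y1 = design(np.ones(n)) @ beta
    y0 = design(np.zeros(n)) @ beta
    return float(np.mean(y1 - y0))


# Hypothetical simulation: W confounds treatment and outcome; true ATE is 2.0
rng = np.random.default_rng(0)
n = 20_000
w = rng.normal(size=(n, 1))
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-1.5 * w[:, 0]))).astype(float)
y = 2.0 * a + 3.0 * w[:, 0] + rng.normal(size=n)

naive = y[a == 1].mean() - y[a == 0].mean()  # ignores confounding
ate_hat = g_computation_ate(w, a, y)         # adjusts for W
print(f"naive difference: {naive:.2f}, G-computation: {ate_hat:.2f}")
```

In this simulation the unadjusted difference in means is pulled away from 2.0 by confounding, while the plug-in estimate recovers it because the measured W is the only confounder. If an unmeasured confounder were added, the same code would target the statistical estimand but no longer the causal one.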
It might drive you to say: we can do better, let's collect some more data or think about another type of experiment. But either way, at some point you say: this is it, this is what I want to do. You have some sense of what we call the causal gap, which is the difference between the statistical estimand and the causal estimand.
Sometimes you just know it's equal.
Great, no causal gap. And sometimes you say: well, I have some sense that it's pretty good, so I'm moving forward. At that point, the statistical estimation problem is defined. You have your data, you have your statistical model, and you have your statistical estimand. And so now comes the estimation step.
So overall, what I've talked about so far is describing the experiment, describing the likelihood, describing your statistical assumptions, that is, your statistical model, and, through a causal model and an identification result, also describing the statistical estimand.
At that point, estimation becomes the next step. The estimation step is another thing where we are very precise, by saying it has to be a priori specified. In other words, something which plays with the data and tries things left and right is not an estimator. An estimator is an a priori specified algorithm you're going to apply to the data. You hope it gets close to the statistical estimand, that is, the answer to your question, and you also hope that there is a way to understand its sampling distribution. So there might be a central limit theorem, and you might have an approximately normal sampling distribution. How you construct such estimators is then a whole field, and in particular there's efficiency theory, which talks about what would be an optimal estimator from an asymptotic perspective. These are called asymptotically efficient estimators.
They do not only approach the truth; they approach the truth at the typical rate, one over the square root of the sample size. But more importantly, you can use them to construct confidence intervals, and they also have minimal variance, asymptotically at least. There's a whole lot to say about that, and we're probably going to talk a little bit about how you do that. That has actually been a big part of my journey: understanding that problem, from a PhD thesis about understanding nonparametric maximum likelihood, to working with Jamie Robins and so on and understanding the estimating equation approach. Eventually we came up with TMLE as a kind of sophisticated way of generalizing what people do in traditional statistics, while also making it possible to do it in these realistic models. Because that's another point: when I asked what your statistical model assumptions are, what do you know, that's a real question. It's not a model out of convenience. That's why that model is taken very seriously, and it's supposed to be realistic, meaning it will never be a parametric model. It will always be what we think of as an infinite-dimensional model, where you might have knowledge like conditional independence assumptions, but you don't have knowledge up to a finite-dimensional parameter, like in a logistic regression or something. So that's why, when you do the estimation, it will recognize: hey, this estimand, like the ATE, depends on the conditional probability of death, if that's the outcome, as a function of the treatment and the covariates. So I have to learn that. That's where machine learning comes in. Then you plug that into the ATE, and then you want to have the right properties.
I want to ask you to talk a little bit more about uncertainty, because in causal inference, and I remember this is also a part of your roadmap, this step regarding uncertainty quantification is not really one step, because we have more than one source of uncertainty, right? We can think about conformal prediction, maybe, and say: okay, under exchangeability this will be our quantification of uncertainty. But here, as you mentioned, we also have uncertainty regarding the identification itself, and so on. So we might have different levels. What's your take on that?
That's also why the roadmap is so important, because it addresses all these different levels of uncertainty. One is of course the formulation, which is so important. That's why I talked about these first steps, because if you don't get the formulation right, and that's the start of everything, then you're probably already off and you have no idea what you're doing. So what is uncertainty at that moment, right? We need to know precisely what the target is. That's why we have the causal estimand. Then we get the identification result, and that creates one level of uncertainty, which is the causal gap. But at least we know what we're talking about. We have an actual, real target. Then comes the statistical uncertainty, and that's what the estimator is about. That's where we have to get valid confidence intervals, and so that's another level of uncertainty. Of course, the statistical estimation and the confidence interval construction take care of that: at least if you do it right, you can get a confidence interval with good coverage for the statistical estimand. But then you still have that other uncertainty, which is that the causal gap might be there. That's where we use a final step, again all recognized by the literature, which we call the sensitivity analysis step, and that's where you start moving towards real conclusions.
What's your favorite sensitivity analysis approach? Which one do you think is most useful?
In the past there were these more parametric model-based sensitivity analyses, and I didn't like them. So that's why we defined, at the time, a nonparametric sensitivity analysis. In particular, that leads to an interesting plot, which I think is a great starting point. It's an xy plot, and on the x-axis you put the causal gap. So you ask: if the causal gap is zero, what will be your confidence interval for your causal quantity? That would just be the confidence interval from your statistical estimation. So you can say: if the causal gap is zero, this is my statistical conclusion. The truth is somewhere in this confidence interval, and I might reject the null, or look at the confidence interval. What if the causal gap is this much? Then you can essentially just shift the confidence interval, and under that causal gap that would be the confidence interval. So now you can make a whole plot where you see all these confidence intervals shifting, and you might see that if the causal gap is positive, it shifts towards the null, and at some point it might actually include the null hypothesis. In particular, you can ask the question: how big can the causal gap be and still leave a significant finding? We call that the G-value.
Mhm.
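The shifting-interval idea can be sketched in a few lines. This is an illustration under stated assumptions, not the published procedure: the causal gap is taken as statistical estimand minus causal estimand, the function names are hypothetical, and the interval (0.22, 0.38) is a made-up example around a risk difference of 0.30.

```python
def shifted_interval(ci, gap):
    """Interval for the causal estimand if the causal gap equals `gap`.

    Assumes gap = statistical estimand - causal estimand, so the interval
    for the causal quantity is the statistical interval shifted by -gap.
    """
    lo, hi = ci
    return (lo - gap, hi - gap)


def excludes_null(ci, null=0.0):
    """True if the interval does not contain the null value."""
    lo, hi = ci
    return not (lo <= null <= hi)


def g_value(ci, null=0.0):
    """Size of the causal gap at which the shifted interval first touches the null.

    Returned as a magnitude; 0.0 if the unshifted interval already covers the null.
    """
    lo, hi = ci
    if lo <= null <= hi:
        return 0.0
    return min(abs(lo - null), abs(hi - null))


# Hypothetical 95% interval for a risk difference
ci = (0.22, 0.38)
for gap in (0.0, 0.1, 0.2, 0.3):
    lo, hi = shifted_interval(ci, gap)
    print(f"gap={gap:.1f}  interval=({lo:.2f}, {hi:.2f})  "
          f"significant={excludes_null((lo, hi))}")
print("G-value:", g_value(ci))
```

Plotting the shifted intervals against the gap gives the xy plot described above; the G-value is the point on the x-axis where the interval first reaches the null.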
So at the end of the day, if you trust your statistical confidence interval, you can conclude that the causal gap can be this big and you would still have a significant finding. Then comes the next thing: okay, how big is that? Now you have a number, and you need to understand it. This will actually help you a lot already, because you get a sense: oh wait a minute, this is like 20% of the raw risk or something. So you get some sense of whether this is important or kind of borderline. And again, keep in mind what you know from the experiment: how precise the experiment was, how carefully you collected the confounders. But anyway, then you have another layer, where you try to get towards what a plausible causal gap is. That's where you can play all kinds of games, and that's where these different ideas can come in. One is, of course, that negative controls can be helpful. You might have a negative control treatment or a negative control outcome. You can redo the whole analysis, with the same estimator, on that data and see what happens. If you see that the confidence interval and the estimate are away from zero, then you know you might have some bias, some causal gap, there, and that might give you a sense of the size of causal gap which might be plausible. Of course, it all depends on the negative control having a similar confounding structure, but at least you get a sense. Another thing we do: in your analysis, leave out some key confounders, ones you actually measured, and see what happens with the change in your estimate and confidence interval. That will also give you an idea, so that you could say at the end: my causal gap has to be three times as big as the impact of these particular confounders in my data set. So you have to believe that the unmeasured confounders out there are much stronger than the measured confounders to still wash away this finding.
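The "three times as big as the impact of these confounders" statement is just a ratio, which a small sketch makes concrete. Everything here is hypothetical: the function names, the two estimates, and the G-value of 0.18 are invented numbers chosen only to show the arithmetic.

```python
def gap_unit(estimate_adjusted, estimate_without_confounders):
    """Change in the estimate when a group of measured confounders is left out.

    This change is used as one "unit" of causal gap.
    """
    return abs(estimate_without_confounders - estimate_adjusted)


def g_value_in_units(g_val, unit):
    """Express a G-value as a multiple of the measured-confounder impact."""
    if unit == 0:
        raise ValueError("leaving out the confounders did not change the estimate")
    return g_val / unit


# Hypothetical numbers: fully adjusted estimate 0.30, estimate after dropping
# the key measured confounders 0.36, and a G-value of 0.18 on the raw scale.
unit = gap_unit(0.30, 0.36)
multiple = g_value_in_units(0.18, unit)
print(f"one unit = {unit:.2f}; the causal gap would need to be "
      f"{multiple:.1f} units to wash away the finding")
```

With these numbers, unmeasured confounding would need roughly three times the impact of the measured confounders that were dropped before the finding stops being significant.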
How can we quantify this causal gap? For those people who are not familiar with this approach, what intuition could we give them? What is on this x-axis? What is the meaning of this number?
Yeah. Let's say we are estimating an average effect, and our estimate might be a difference in proportions of something like 30%, with a confidence interval. Then a causal gap could be 0.1, 0.2, 0.3. So these are numbers, but how you interpret them is then one question. Again, these are things where people can be creative. Some people like simulations, where you simulate things and see what that says. But what we also like is creating a unit. For example: if I leave out all the measured confounders and redo my analysis, how would my estimate change? That change in estimate we make a unit. This can be done with all confounders, or with a subset of the key confounders. Your unit now becomes: one unit means I'm getting this difference in my estimate due to leaving out that group of confounders. Then you can measure the so-called G-value, which is the maximal size of causal gap which would wash away your finding, in those units. You can say: hey, it's 4.5. That corresponds to having to believe that the causal gap is four and a half times as large as the change you would see by leaving out some key confounders in your data set. That's how you can help people get some sense, because that's what people have to think about: what confounders did I maybe miss? Are they comparable in impact with some things I've measured? That might help you get your head around it. So that's one approach you can carry out, but there are others. In our original paper we actually did a real analysis. It was a paper with Iván Díaz, and it was on Chagas disease. It was actually a company which came to me and said: we have this drug and we want to produce it in the United States, and the problem is we cannot run a randomized trial here in the United States, because the drug is already used out there in Latin American countries. So there's observational data which clearly suggests this drug is very effective.
So how can we do this? How can we talk to the FDA about this? That's where, at the time, we formulated this sensitivity analysis. There we did what I just described, but we were also able to bound the causal gap by something which is more interpretable. In that case there was this treatment versus control, and it's actually a very long-term outcome: people get this parasite, and it sometimes takes 30 years to really develop all these heart problems. There was observational data out there, and essentially it's a binary outcome, some kind of death or developing this heart disease, and you can talk about the cure rate. If you take the drug, what would your cure rate be, versus if you don't take it? Then it's essentially zero, because you never get cured. So we were able to bound the causal gap by the counterfactual cure rate if everybody doesn't take the treatment, that is, under control, and then we expressed the G-value in terms of that. So we could show that you had to believe that the counterfactual cure rate under no treatment is 45% in order to wash away this finding, because it was a huge effect in the observational data.
So if you tell people that, and then you look at your data and you see that everybody who doesn't take the treatment doesn't get cured, it's almost 100% no cure, then you start realizing: wow, this is clearly very significant, even though the study had all kinds of problems, because there's missingness and unmeasured confounding. But it becomes obvious. So that was an example where it was possible to be very conservative in the bounding of the causal gap by something which was very interpretable, so that even biologically you say there's no way. You cannot always do that, but that's the kind of thing you sometimes can do. In the end, transparency is important, because that's where other sensitivity analyses sometimes start assuming all kinds of things, like a Cox model for your failure time, and then you put in some kind of covariate which is a counterfactual failure time under no treatment, and then you talk about the coefficient of that and how big it is. If it's zero, then there's no unmeasured confounding. It becomes such a mess that people don't even know what it means. It's like a smoke curtain you throw out there, and in the end it's not helpful. So I think it's very important to stay very transparent, so that people can follow what you're claiming, and I think that's possible with this kind of nonparametric sensitivity analysis.
That's really great. You mentioned the importance of transparency, and I think the roadmap you shared with us before, similar to other roadmaps proposed by Pearl and by other people, all of those roadmaps, in my opinion, help practitioners and researchers to be more transparent. I like to say that transparency is one of the most important things in causal inference.
Because this also gives us a foundation to then be courageous, to go and explore things. Even if there is something that we don't know, we go out there and try to get as much knowledge as possible, try to incorporate this knowledge into the process, and then hopefully get to some useful conclusions. Because, as you said about the sensitivity analysis, we won't always be able to get very narrow estimates, but in many cases we can still get to useful conclusions, which means conclusions that will allow us to make an informed decision.
Mhm.
You also mentioned in the beginning the importance of assumptions and knowledge in the process of causal inference. In one of your recent papers, with Carlos Meixide, in the abstract you write at the end, that's the last sentence, a sentence that really drew my attention. The sentence says: the shift from identification under assumptions to identification under observation redefines how the problem of causal inference is approached. Can you share with us a little bit more about what's behind this sentence? What's the logic there?
Anyway, just to hit home what you said: transparency is clearly very important, and I think it also has to be reproducible. It helps to write out that whole roadmap, the steps you took, so all of that is part of the transparency. In the end, it also has to be good. You can be very transparent and say, I did this, I did that, but then nobody knows what to make of it, because it makes all these weird assumptions. That's why the discussion is often not either-or: all of it has to be good, and every step has to be carried out very carefully. So it's a combination of transparency and high-quality methods. That's why, for me, the FDA is an important demonstration, almost, for science, because they have to do approvals, and approvals have to be based on a priori specified analyses. That's challenging: how do you do such a sophisticated submission, which is not a standard logistic regression anymore, which uses state-of-the-art machine learning, and which all has to be a priori specified, like targeted maximum likelihood estimation? That is a very challenging thing to do, and I think if you carry out such an analysis in that context, under regulatory guidance and approval, then that's a very high standard for your analysis regarding everything you said: the transparency, dealing with every level of uncertainty. So for me that's a very beautiful demonstration project, in a sense, of how exactly we are going to do that and how we are going to be successful. But anyway, that was more of a side note. You were asking about this paper, which is about instrumental variables.
Maybe the shift in thinking in that paper is this. Let's do the simplest case: on every subject in your sample you observe baseline covariates, and then there is some instrument. That could be something randomized. It might actually be a treatment assignment in a randomized trial, but it could be something else, like some kind of recommendation you're giving to a person in an observational study, where you know it's kind of random. We call that an instrument, and it's supposed to encourage a treatment. Say I randomize people to treatment, and you suppose that people actually take it, but they might be non-compliant. Or similarly, you might work with a doctor and give them a recommendation: your algorithm gives some recommendation, that could be an instrument that might have been randomized, but then the doctor still decides what he does, and that's then the treatment. So you have these data sets: baseline covariates, an instrument which you understand is randomized, then the actual treatment, and then the outcome. It's easy, or at least it's nice, to estimate the causal effect of the instrument, because the instrument was randomized, so you can actually rely on the no unmeasured confounding assumption, and everything can be done beautifully. On the other hand, what people normally do is say: okay, I also care about the causal effect of the actual treatment. So I'm going to ask myself, for example, what's the ATE: what happens if I force everybody to take the treatment versus what happens if nobody takes the treatment, and what's the difference? Then you say: okay, darn, treatment is subject to unmeasured confounders, so I have to use some kind of identification result based on the instrument, and that will naturally rely on all kinds of assumptions which are not always easy to sell. So what we did here is say: let's not do that. Let's actually just ask what we can identify. For example, the causal impact of any kind of intervention on the instrument. These could be static interventions, but also stochastic interventions. If I put any kind of stochastic intervention on the instrument, I can identify the so-called post-intervention distribution of your outcome, so, for example, what the mean outcome would be under a particular stochastic intervention on the instrument.
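For a discrete instrument, the identification of the mean outcome under a stochastic intervention on the instrument can be sketched with the standard G-computation formula. The notation is an assumption made here for illustration and is not taken from the paper: W denotes baseline covariates, Z the instrument, Y the outcome, and g*(z | W) the user-chosen conditional distribution that defines the stochastic intervention.

```latex
% Mean outcome if the instrument were drawn from g^*(\cdot \mid W)
% instead of its observed assignment mechanism, assuming Z is
% randomized (unconfounded) given W:
E\bigl[Y^{g^*}\bigr]
  \;=\; E_W\!\left[\, \sum_{z} E\bigl(Y \mid Z = z,\, W\bigr)\; g^*(z \mid W) \right]

% A static intervention Z = z_0 is the special case
% g^*(z \mid W) = \mathbf{1}\{z = z_0\}.
```

Every term on the right-hand side is a feature of the observed data distribution, which is why no assumption beyond the randomization of the instrument is needed for this quantity.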
But any intervention you do on the instrument also changes the distribution of the treatment.
Right? Think about the medical doctor.
You are giving the person a recommendation that influences the doctor, and so it changes the way the doctor is going to assign treatment.
So what really happens is that if I assign an intervention to the instrument, it implies an intervention on the treatment, and we can actually identify that.
So we can actually write out, for any given conditional distribution of the instrument given the covariates, so how you assigned it, the conditional distribution of the treatment it implies.
And so really what we're then getting is that we can actually identify, without any assumptions beyond the instrument, the mean outcome under any of the stochastic interventions on the treatment implied by a stochastic intervention on the instrument. So we turn it around in that way.
We are not asking how we can identify this particular causal question, like the causal effect of treatment, and then need all kinds of assumptions. We say no, what we can identify is the mean outcome under any kind of stochastic intervention on the treatment corresponding with a stochastic intervention on the instrument.
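As a rough sketch of the mapping described here, the two identification formulas can be written as follows. The notation is assumed for illustration and is not taken from the paper: W for baseline covariates, Z for the instrument, A for the treatment, Y for the outcome (all discrete), and g*_Z for a user-chosen stochastic intervention on the instrument, which is taken to be randomized given W.

```latex
\begin{align*}
% stochastic intervention on the treatment implied by g^*_Z
g^*_A(a \mid w) &= \sum_{z} P(A = a \mid Z = z, W = w)\, g^*_Z(z \mid w) \\
% mean outcome under g^*_Z, identified because Z is randomized given W
E\big[Y_{g^*_Z}\big] &= E_W\Big[\sum_{z} E[Y \mid Z = z, W]\; g^*_Z(z \mid W)\Big]
\end{align*}
```

Both right-hand sides involve only the observed-data distribution, which is the sense in which nothing beyond the randomization of the instrument is needed.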
And so we have this mapping, which maps a conditional distribution on the instrument, which is a stochastic intervention on the instrument, into the corresponding conditional distribution on the treatment, which is a stochastic intervention on the treatment.
And so now we do actually get identification results for the mean outcome under any of these stochastic interventions. So if you, for example, take two different interventions on the instrument, they imply two stochastic interventions on the treatment, and so it gives some kind of difference in mean outcome, a causal contrast of two treatment regimens.
And then on top, you can say that's already nice, that we can actually write down what we can learn from the data and the kind of causal effects we will see. But we can also ask the question: among all these possible stochastic interventions on the treatment we can identify, which one is closest to what we would really like to do? And so we can learn that too.
And so this way we are not getting into assumptions we don't know or don't trust or don't feel comfortable with. We just literally learn from the data what we can get and how close we can get to maybe what you desired. But either way you get all kinds of interesting results, and I also think that's actually quite natural, right?
Again, let's say I'm developing a recommendation system with some fancy algorithm for doctors, and I know I will randomize that, and it might have different levels. These things are not necessarily binary. There might be all kinds of choices you're making.
Well, I kind of want to understand, if I assign my recommendation in this way, how does it impact how the doctor actually starts treating? Actually, we can work that out, and there's a formula for that. And so we can actually learn that. It's kind of the reality. This is how it works.
You have your recommendation system, and through how you intervene on it you can influence how the doctor's treatment changes. So that itself is an interesting thing. That will also then give you the corresponding causal effect on the outcome of the way you were able to influence the doctor.
And obviously you will see that if your recommendation is hardly influencing the doctor, then you will have no impact, and you will see that also from the actual treatment regimen implied by your regimen on the instrument. So yeah, I think it's just the reality, and you're trying to answer questions you can actually get to.
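To make the recommendation-system example concrete, here is a minimal sketch in Python. All numbers and names (`p_w`, `p_a1_given_zw`, `mean_y_given_zw`) are hypothetical and stand in for quantities that would be estimated from data in practice; the instrument Z is a binary recommendation and the treatment A is the doctor's binary decision.

```python
# Hypothetical, known-for-illustration components of the data distribution.
p_w = {0: 0.6, 1: 0.4}  # P(W = w) for a binary covariate W

# P(A = 1 | Z = z, W = w): how the doctor responds to recommendation z.
p_a1_given_zw = {(0, 0): 0.2, (1, 0): 0.7, (0, 1): 0.4, (1, 1): 0.9}

# E[Y | Z = z, W = w]: outcome regression on the (randomized) instrument.
mean_y_given_zw = {(0, 0): 0.30, (1, 0): 0.45, (0, 1): 0.50, (1, 1): 0.62}


def implied_treatment_policy(g_z):
    """Map a stochastic intervention on Z, given as P*(Z = 1 | W = w),
    to the implied stochastic intervention on A, P*(A = 1 | W = w)."""
    return {
        w: g_z[w] * p_a1_given_zw[(1, w)] + (1 - g_z[w]) * p_a1_given_zw[(0, w)]
        for w in p_w
    }


def mean_outcome(g_z):
    """Mean outcome under the stochastic intervention g_z on the instrument."""
    return sum(
        pw * (g_z[w] * mean_y_given_zw[(1, w)]
              + (1 - g_z[w]) * mean_y_given_zw[(0, w)])
        for w, pw in p_w.items()
    )


# Two interventions on the instrument: always recommend vs. never recommend.
always = {0: 1.0, 1: 1.0}
never = {0: 0.0, 1: 0.0}

policy_always = implied_treatment_policy(always)
policy_never = implied_treatment_policy(never)
contrast = mean_outcome(always) - mean_outcome(never)

print("implied P*(A=1|W) when always recommending:", policy_always)
print("implied P*(A=1|W) when never recommending: ", policy_never)
print("contrast in mean outcome:", round(contrast, 3))
```

If the recommendation barely moved the doctor, the two implied treatment policies would be nearly identical, which is the situation described above where the instrument has little impact.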
Yeah, I think that's really great and really interesting, and also my feeling is that your thinking here is highly practical in a sense, right? You really think about what would be the value of something like this in the real world, where we know our information is imperfect.
And also, you know, people came to me asking questions about how we do this in the time-dependent setting, where we have time-dependent treatment, and again there I think it's also very natural. So we'll have a paper following up on the one you saw, where we get into that as well.
And it's somewhat of a philosophical difference with some people. You know, with Judea I might have a discussion, and he would like a really clean definition of the causal quantity you care about, and then, if you can't identify it, you can't identify it. And I'm more willing to give up some of the interpretation of the causal quantity for something you can actually learn, instead of relying on these assumptions.
So that's what this gets at, right? It tries to push the definition of what you're trying to learn, the causal quantity, towards what's actually learnable, and that has always been the way I talk.
So in that sense we often think in terms of what are the causal questions we can actually answer, for which there's support, and this is an example of that, where it's perfect support in the sense that you have no causal gap. So these are the kind of causal questions we can truly identify without any causal gap, meaning you get true, honest inference for it, and then at the end you can interpret it.
And yeah, I agree, I think that is more the reality; those are the things you can actually learn, right? And depending on how good your instrument is, you might be able to have more influence on the treatment regimen of the doctor, and in this way you will also develop more regimens you can learn about, because you are able to change it with your instrument, and therefore you now get all kinds of causal questions you might even have desired from the get-go.
So yes, you talked about this practical aspect, the applied aspect. When we met before, I remember you also told me that this practical perspective has always been important for you. Can you tell us a little bit more about your personal history and how this idea of practicality, of applicability, played a role in your development and your thinking about causality?
Anyway, just as a little background, you know, I did my degree in mathematics and then I did a PhD in theoretical statistics, working with Richard Gill. And Richard Gill had a big impact on me because he's really also that kind of person: he's a deep, mathematically trained theoretician who has done incredible work. I mean, you can look at his work on counting processes and so on, really establishing deep probability foundations and all that. But he's always very interested in the real world.
Now that's also what I had. At the end, I'm not looking for theoretical problems to just have fun. I'm looking for real-world problems which need to be solved.
And then it happens to be that many times these real-world problems are very hard, and then you have to come up with theory, and that might sometimes be deep theory, to do it right. And so that's where I come from. So I never start with the toy problem. I always start with the real-world problem and go from there.
Yeah. That's also why I automatically got attracted, you know. During my PhD thesis I worked on things like bivariate survival function estimation based on bivariate right-censored data or multivariate right-censored data, and how to analyze these estimators, and how to repair the nonparametric maximum likelihood estimator, which breaks down in these problems, and so on.
And then I met Jamie Robins. There was a workshop at my graduation, and people like Peter were there, and Jon Wellner and so on, and also Jamie Robins, to give some talks and lectures and so on. So that was a nice way to get these people together.
Jamie gave a talk there and essentially said, you know, Mark, it's fun to work on bivariate right-censored data, which, you know, at that point in time I was pretty proud of, because it had been a challenging problem for decades, how to do this efficient estimation. And he said, Mark, who cares about this, right? I mean, who cares about bivariate right-censored data? In the real world you're going to have longitudinal, complex data structures.
This is the real world: patients enter your study. You have a whole history. They randomly drop in for visits. You have doctors' decisions. Doctors' decisions are made on the history of the patient. There might be dropout. Dropout might be informed by the past as well, and so on. So how are you going to do things there?
And that's suddenly a world of data, a dimension of data, which immediately pushes you out of the comfort boundaries.
There's nothing convenient about it.
It's not like I can just simply fit a simple parametric model. So that's where I started working with him, and mostly what I did at the time was trying to learn what he did, right? He had all these papers nobody read, essentially, because they were, you know, very hard to read. So I started reading the appendices and this work with Andrea Rotnitzky and so on.
And so we got into thinking about how to apply some of his methods to problems where you never observe the complete data: interval censoring, that kind of thing, current status data and so on. But I was automatically attracted to the kind of real-world approach Jamie Robins had to really addressing key questions in the real world and how to answer them.
Anyway, so that's very much aligned, and then of course with that came another journey, how I eventually got into things like targeted maximum likelihood estimation and so on. But yeah, the drive always comes from the real world, and if you look at some of my contributions, they're very much in that spirit, right?
When I studied cross-validation, for example, and I got to this thing like super learning, I mean, that's a theory not developed for least squares regression or linear models or something. No, it just asks, what's the real world out there? The real world out there was that people wanted to use, and I wanted to use, machine learning. You have a whole range of machine learning algorithms. Which one are you going to use?
It's not like model selection, because the algorithms are not even using a model. So it's not model selection. It's not variable selection. It's estimator selection. And so I defined it as estimator selection. You just have a bunch of algorithms. You have to choose among them. And they have a certain goal, like learning a prediction function or something like that. And so how do you choose among them? And what's the theory for doing that?
And so that became a generalized way of doing cross-validation for any kind of estimation problem and any kind of machine learning algorithm, and then developing the theory which shows that cross-validation is the optimal way to do it, and that it has the so-called oracle property and so on.
But it is immediately very general, very applicable to the real world, and it gave in particular this super learner, which was saying, you know, this is the way we should be doing machine learning. We shouldn't be betting on one algorithm. We should not be competing, like, I like this, you like that. No. And if people come up with a new algorithm, great. Let's include it in the library and build an even more powerful algorithm thanks to that extra contribution.
And so this way it becomes a very practical system of learning which integrates all the advances and really handles the real world. Yeah.
Mhm.
And so what you will see is that in the kind of work I've done, it's always from that perspective, that it's really applicable to solving these real-world problems, and that's also kind of how the theory I develop adapts to that challenge.
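The estimator-selection idea can be sketched in a few lines of Python. This is only the simplest "discrete" variant, which picks the single candidate with the smallest cross-validated risk; the full super learner also considers weighted combinations of the candidates. The three learners, the squared-error loss, the fold scheme, and the simulated data are all assumptions chosen for illustration.

```python
import math
import random
from statistics import fmean


def fit_mean(xs, ys):
    """Candidate 1: ignore x and predict the training mean."""
    m = fmean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate 2: univariate least-squares line."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)


def fit_knn(k):
    """Candidate 3 (factory): average of the k nearest training outcomes."""
    def fit(xs, ys):
        pairs = list(zip(xs, ys))

        def predict(x):
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            return fmean(y for _, y in nearest)

        return predict
    return fit


def cv_risk(fit, xs, ys, n_folds=5):
    """V-fold cross-validated mean squared error of one candidate."""
    n = len(xs)
    sq_errors = []
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        valid = [i for i in range(n) if i % n_folds == fold]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_errors += [(ys[i] - predict(xs[i])) ** 2 for i in valid]
    return fmean(sq_errors)


def discrete_super_learner(library, xs, ys):
    """Select the candidate with the smallest CV risk and refit it on all data."""
    risks = {name: cv_risk(fit, xs, ys) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks, library[best](xs, ys)


# Simulated data with a nonlinear signal (hypothetical).
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [math.sin(2 * x) + rng.gauss(0, 0.2) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear, "knn_5": fit_knn(5)}
best, risks, predictor = discrete_super_learner(library, xs, ys)
print("selected:", best)
print("cross-validated risks:", {k: round(v, 3) for k, v in risks.items()})
```

Adding a new algorithm only means adding one more entry to `library`, which mirrors the point about including new contributions instead of competing over a single favorite.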
You mentioned super learning, and that's a part of a broader framework that you have created over the years, and I'd love to talk about this, but before we move there I have one more question for you.
Mhm.
You mentioned Jamie Robins many times during our conversation today, and I always had a feeling when I was reading your works that Jamie Robins had a big impact on you and your thinking about causality. There's also something very interesting in your works that I do not often find when I'm reading other authors. You are taking Robins's thinking on one hand, and often you mix it with thinking coming from Pearl, for instance, and you kind of seamlessly blend these two perspectives in your writing.
Where does this come from?
Yeah. I'm also very pragmatic. So you don't get me very excited to be part of a debate about whether we should be using structural equation models or the Neyman-Rubin model, because for me they are both interesting ways of modeling the real world, and sometimes, to be honest, I feel more comfortable with one, like a structural equation approach, and sometimes I really like just the simple missing data structure on a bunch of potential intervention-specific processes or outcomes.
So I'm willing to switch between one and the other. At the end they both define a causal question, the same causal question, and at the end they give the same statistical estimand and all that.
And of course people have different contributions, right? Jamie Robins of course did a lot of work on causal model development, but also a lot on estimation, and Judea Pearl was much more on structural equation modeling and then also getting identification results and algorithmic identification results and so on.
So yeah, these are different brilliant people. They have their own views on things, and I happily learn from them and talk to them and interact with them. I don't ever feel very much that I belong to a certain camp, and often there's something to learn from different people. But I don't know if that touches on what you mean.
I really love your perspective, because at the end of the day, potential outcomes and structural causal models, these frameworks are largely interchangeable, right? They have almost the same meaning, with some exceptions maybe. But I really love reading authors who just blend them wherever something is more useful.
Yeah. And an example of that, you're right: I remember, you know, a student of mine and I were talking about some kind of mediation formula, and the mediation result at the time was in particular the one from structural equations, and they would put down an identification assumption in terms of structural equations.
But then, you know, when you write it on a piece of paper and you do it in terms of potential outcomes, you see that you can weaken that quite a bit, right? And it makes it very specific, about a very particular conditional mean of certain treatment-specific outcomes and some kind of conditional independence, and it suddenly becomes very different.
But from a structural equation model, where you just look at the graph and try to learn the identification from that, you wouldn't see that. That gives identification results for general post-intervention distributions, not for certain very specific questions such as a contrast of a mean outcome under different regimens.
And so that's where you see that, while there is maybe something very nice about looking at graphs and learning identification results from that, you also have to acknowledge it's limited that way, because it gives you assumptions that are too strong, which can often be weakened once you care about something very specific.
And so that's kind of the switching I naturally do, and I'm happy in both worlds, and then you do see the benefit in some cases from one versus the other. And so, depending on your goal, you might switch from one to the other.
That's a great perspective. I really love it. You know, I'm a huge fan of Pearl's framework. I always emphasize this. I think it gives a lot of clarity and also gives a lot of opportunities for transparency, right? And in particular transparency with people who are not necessarily technically trained or technically skilled. I work a lot in industry, and so causal graphs are just such a convenient way to communicate with business people, for instance.
Exactly.
And they find them, you know, after a while, maybe after a second of confusion after years of talking to data scientists coming from just machine learning and so on, when they grasp the idea of a causal graph, it becomes so intuitive that it's really incredible.
At the same time you see the benefit: people who are less mathematically inclined suddenly have a framework they can get their head around and make progress with, and yeah, embrace it. And so that's why all these different views and approaches have their utility, even though they may not be perfect for every purpose.
I also appreciate very much what you said, right? Sometimes if we look a little bit closer, from the point of view maybe a little bit more of estimation, of identification, we can notice something that is just invisible when we look from this purely structural point of view, and so just make our lives easier. So that's really great.
Mark, I wanted to go back now to what you said before. You mentioned super learners, and I think it would be fair to say super learning is a part of a broader framework that you've been working on for years, which is called targeted maximum likelihood estimation, or TMLE. What is TMLE about, and what inspired you to work on this and put so much effort, so much work, into this framework over the years?
Yeah, so that has been kind of one of the biggest drives of my journey: given a statistical estimation problem, so you have a certain study design, a certain type of data you collect, certain features or complexities such as dropout, missingness, informative monitoring, time-dependent treatment assignment, treatment by indication and all that, given the statistical estimation problem is well formulated.
You know what your statistical estimand is. You know what your statistical model is. But we also know they're not parametric models, because they're what I call realistic statistical models, meaning you're taking it seriously. You make assumptions which you can defend.
For such statistical estimation problems, how do we construct the best possible estimator?
And so that has been my journey and my goal. So that's also how I think, right? I'm literally always thinking that way.
So it's not like I'm working on a very specific problem. Yes, of course I also work on them, but in my mind I'm kind of philosophically thinking: how do we do this? How are we going to really optimally learn from data?
That has been a long journey, and I started with understanding nonparametric maximum likelihood estimation in reasonable models, but still nonparametric. And so that's what my PhD thesis was, to understand what they called NPMLE, nonparametric maximum likelihood estimation. Things like Kaplan-Meier, which is a nonparametric maximum likelihood estimator of the survival function based on right-censored data. Then I worked on bivariate right-censored data and so on.
So that helped me to develop a lot of theory and insights into how to analyze estimators. So it gave me a lot of tools, mathematical tools, to understand how to think about estimators, how to analyze them and all that.
So, as I pointed out, Jamie recognized that maximum likelihood ends quickly, because, you know, you can handle it a little bit in certain dimensions, but quickly it becomes too much. Even when you have right-censored failure time data with a single continuous covariate, it already becomes a problem how to do nonparametric maximum likelihood estimation. So you have to start getting beyond MLE or NPMLE, and so that's where I learned from him about the estimating equation approach.
And that was, you know, a whole story about what he called the orthogonal complement of the nuisance tangent space and the class of estimating functions, and you have to orthogonalize these estimating functions with respect to the nuisance parameters. Nowadays they also call that double robust machine learning, by the way, but that came after.
What Andrea and Jamie had done was that whole orthogonalizing of estimating functions, and then, you know, one of the estimating functions would be the best one, and that corresponds with what we call the efficient influence curve, the canonical gradient of the statistical estimand.
And I wrote a book with Jamie, which was Unified Methods for Censored Longitudinal Data and Causality. And when I was writing that book, and also based on all the work before that, and looking at how other people were experiencing things, and spending a lot of time implementing things and so on, it became clear that this approach was theoretically appealing. It's the double robust estimation people refer to. But it also has some serious flaws, and it also created kind of two camps.
It created the camp of people who want to live by the likelihood, and that includes the Bayesians, right, who work with models and model selection, and so that was one camp. And then there was this estimating equation approach camp, with double robust estimation.
And maybe theoretically you could say that double robust estimation was maybe better, but at the same time the people who were working with these likelihoods and doing MLE and Bayesian methods, they felt pretty darn good about what they were doing, because they felt it was quite robust.
That's because, you know, you're learning your likelihood kind of adaptively, and at the end you map that into what you care about. What do I want to learn about the stochastic system I'm learning? And if you're trying to learn a probability, then you will end up with a probability, and if you're trying to learn a difference of probabilities, you end up with a difference of probabilities.
And so all the bounds are naturally taken care of, because all you do is learn a density of the data, a probability distribution, and then plug it in. And so you get naturally robust estimators. They don't do something crazy.
And that was not the case with the estimating equation approach, because there you have this inverse weighting going on.
And because of that, you might end up with estimating equations which have no solutions, or have solutions which are negative or bigger than one when it's supposed to be a probability.
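A tiny numerical illustration of this point, as a sketch with made-up data: the plain inverse-probability-weighted (IPW) estimator, which solves a simple estimating equation, can return a "probability" above one, while a plug-in estimator built from conditional means cannot. The data set and the propensity scores in `g` are hypothetical and chosen to exaggerate the effect.

```python
from statistics import fmean

# Hypothetical toy data: (w, a, y) = (binary covariate, treatment, binary outcome).
data = [(0, 1, 1), (0, 0, 0), (0, 0, 0), (0, 0, 1),
        (1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1)]

# Assumed propensity scores P(A = 1 | W = w); the small value for w = 0
# mimics a near positivity violation or a poorly estimated propensity.
g = {0: 0.1, 1: 0.75}

# IPW estimate of P(Y_1 = 1): the solution of the simple IPW estimating equation.
ipw = fmean(a * y / g[w] for w, a, y in data)

# Plug-in estimate: average over W of the empirical P(Y = 1 | A = 1, W = w).
q1 = {w: fmean(y for v, a, y in data if v == w and a == 1) for w in (0, 1)}
plug_in = fmean(q1[w] for w, _, _ in data)

print("IPW estimate:    ", round(ipw, 3))      # not constrained to [0, 1]
print("plug-in estimate:", round(plug_in, 3))  # always within [0, 1]
```

The plug-in value is an average of conditional probabilities, so it respects the bounds by construction; the IPW value is an average of weighted outcomes and has no such guarantee.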
Mhm.
And there was another thing, which is that the estimating equation approach doesn't always apply.
There are lots of estimation problems where the so-called canonical gradient, which is kind of the optimal estimating function in that world, is not an estimating function. It's not something you can write as a function of the parameter of interest and the nuisance parameters. You just cannot do that. It depends on the data distribution, but you cannot split it up into parameter of interest and nuisance. And so then it doesn't even work.
So by then, the history was that in the 70s, 80s and 90s you had the so-called one-step estimator, which was an efficient estimator. And what that did is it constructed a plug-in estimator, right? Again, keep in mind the statistical estimand is just a mapping applied to the data distribution. So a plug-in estimator takes an estimate of the data distribution, which is in your statistical model, and plugs it into that mapping, and that then gives you a plug-in estimator.
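In symbols, and with notation assumed here for illustration (a statistical model M, true data distribution P_0, and an estimate of it inside the model), the plug-in idea reads roughly as follows. The second line uses the average treatment effect, which comes up later in the conversation, as a concrete estimand.

```latex
\begin{align*}
% the estimand is a mapping on the model; the plug-in estimator evaluates it at an estimate
\Psi : \mathcal{M} \to \mathbb{R}, \qquad \psi_0 = \Psi(P_0), \qquad
\hat\psi_n^{\text{plug-in}} = \Psi(\hat P_n), \quad \hat P_n \in \mathcal{M} \\
% example: average treatment effect with covariates W, treatment A, outcome Y
\Psi(P) = E_P\big[\, E_P(Y \mid A = 1, W) - E_P(Y \mid A = 0, W) \,\big]
\end{align*}
```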
So the one-step estimator constructed an initial plug-in estimator and then added an empirical mean of what we call the canonical gradient. Now, the canonical gradient is something you can compute: you take your statistical estimand, think of it as a mapping from the density of the data to the real line, and compute some kind of derivative, which we call the pathwise derivative.
Mhm.
And that derivative is characterized by a gradient that's called the canonical gradient. So the main lesson is you can analyze the statistical estimand as a function and figure out that so-called canonical gradient, and that canonical gradient is just a function of the unit data structure and of course of the data distribution at which you're doing it. So it's a function of P, the data distribution, and of the data structure you observe on every unit.
And so that's something you can actually do: find the canonical gradient for every statistical estimation problem. We can do this differentiation and figure it out.
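As a sketch of the one-step construction just described, the example below uses the average treatment effect, whose canonical gradient in a nonparametric model has a well-known closed form. The simulated data, the deliberately misspecified initial outcome regression `q_bar` (it ignores W), and the stratified propensity estimate `g_hat` are all assumptions made for illustration; the true effect in this simulation is 0.2.

```python
import random
from statistics import fmean

# Simulate hypothetical data (w, a, y) with confounding by a binary W.
rng = random.Random(1)
n = 20000
data = []
for _ in range(n):
    w = int(rng.random() < 0.5)
    a = int(rng.random() < (0.7 if w else 0.3))      # P(A = 1 | W)
    y = int(rng.random() < 0.2 + 0.2 * a + 0.4 * w)  # true ATE = 0.2
    data.append((w, a, y))

# Initial outcome regression: deliberately ignores W, so it is biased.
q_bar = {a: fmean(y for _, t, y in data if t == a) for a in (0, 1)}

# Propensity score: empirical treatment frequency within each stratum of W.
g_hat = {w: fmean(a for v, a, _ in data if v == w) for w in (0, 1)}

# Initial plug-in estimator of the ATE based on q_bar.
plug_in = q_bar[1] - q_bar[0]


def canonical_gradient(w, a, y, psi):
    """Canonical gradient (efficient influence curve) of the ATE,
    evaluated at the initial estimates q_bar, g_hat and psi."""
    weight = a / g_hat[w] - (1 - a) / (1 - g_hat[w])
    return weight * (y - q_bar[a]) + (q_bar[1] - q_bar[0]) - psi


# One-step estimator: plug-in plus the empirical mean of the canonical gradient.
one_step = plug_in + fmean(canonical_gradient(w, a, y, plug_in) for w, a, y in data)

print("initial plug-in:", round(plug_in, 3))
print("one-step:       ", round(one_step, 3))
```

In this setup the correction term removes most of the confounding bias of the initial plug-in estimate. Note that the corrected estimator is no longer a plug-in estimator, so nothing forces it to stay within the natural bounds of the estimand.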
It's a very important object. I often refer to it with my students as the most important object, because there's efficiency theory, and it says that among all estimators which are well behaved there's an efficient estimator, and an estimator is efficient if and only if it's so-called asymptotically linear with the canonical gradient as its influence curve. So you can write the estimator minus the true estimand as an empirical mean of the canonical gradient plus something negligible.
And so that's why the canonical gradient is also called the efficient influence curve, because when you say an estimator is asymptotically linear with a certain influence curve, that means the estimator minus the truth behaves like an empirical mean of that influence curve.
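Written out, with notation assumed for illustration (O_1, ..., O_n the observed units, P_0 the true data distribution, D* the canonical gradient), asymptotic linearity with the efficient influence curve says:

```latex
\begin{align*}
% estimator minus truth is an empirical mean of the canonical gradient plus a negligible remainder
\hat\psi_n - \psi_0 &= \frac{1}{n} \sum_{i=1}^{n} D^*(P_0)(O_i) + o_P\big(n^{-1/2}\big) \\
% consequence: asymptotic normality with variance given by the canonical gradient
\sqrt{n}\,\big(\hat\psi_n - \psi_0\big) &\;\xrightarrow{d}\; N\big(0,\ \operatorname{Var}_{P_0} D^*(P_0)(O)\big)
\end{align*}
```

The second line is the standard consequence via the central limit theorem, and it is what makes confidence intervals based on the sample variance of the estimated canonical gradient possible.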
So there's a connection between statistical efficiency and properties of the canonical gradient.
Yes. The canonical gradient defines an efficient estimator, the best estimator. And so what the one-step estimator was in the 1980s was a plug-in estimator plus the empirical mean of the canonical gradient at the initial estimator. That was the one-step estimator. The estimating equation approach says: the canonical gradient, I want to think of it as an estimating function.
And then TMLE was what came after that. And I was inspired because I wanted to have the theoretical properties of these double robust estimators, that asymptotically they are efficient and all that, but I also wanted something which generalized maximum likelihood estimation, so that we wouldn't have two camps anymore. If you want to be a maximum likelihood person, you can be one, except sometimes it doesn't work, so now you have to, but you still essentially stay as close as possible to maximum likelihood, and that's what we call targeted maximum likelihood.
And so that was something we wrote in 2006 in a paper, and that was inspired by the limitations of both maximum likelihood estimation and what you might nowadays call double machine learning, or the estimating equation approach, and the one-step estimator. So all these previous approaches. And then TMLE is the kind of thing which came out of that.
And what TMLE is, it's still just a plug-in estimator. So it's just like maximum likelihood: you learn the density of the data, or the relevant parts of the density of the data, and you plug it into your estimand. Right, for example, you care about the ATE, and the ATE can be written as the mean clinical outcome under treatment, conditioning on covariates, minus the mean clinical outcome under control, conditioning on these covariates. So that's the difference of two predictions, under treatment and control, at a particular covariate profile, and then you average that. That's how you can identify the average treatment effect.
TMLE would say: just estimate these conditional mean outcomes, but with machine learning, but update them, target these fits towards the purpose, and then plug them in and average them. And so that's a plug-in estimator, just like maximum likelihood would do, right?
Maximum likelihood normally uses a parametric model: it would estimate the outcome regression with a parametric model, get for every subject in your sample the prediction of the outcome under treatment and control, take the difference, and average. That will be the parametric-model-based estimate.
So TMLE is also doing that, except it uses machine learning, but it also uses the targeted maximum likelihood step to make the machine learning tailored for its purpose, which is estimating the statistical estimand. And then it has all the theory, right? And better, we can use state-of-the-art machine learning like super learning, still get statistical inference, and still
stick to normal plug-in estimation, which is very natural. Of course, once people are used to something unnatural, they think it's natural, but it's not natural. Right? If I tell you what the statistical estimand is, then you say, "Oh, so what do I need for that? Oh, I need this and this and this. Oh, let me learn it. Oh, I plug it in." That's natural.
While writing down a canonical gradient as an estimating function and setting the empirical mean equal to zero is not natural, right? You might be trained that way, but again, maximum likelihood wouldn't do that.
So that's why, yeah, targeted maximum likelihood: if maximum likelihood would work, then targeted maximum likelihood just uses the maximum likelihood initial estimate and then it doesn't do anything, so you just get maximum likelihood. So it generalizes maximum likelihood, but now we can incorporate all the machine learning, and therefore it generalizes to realistic statistical models.
Yeah, that was a paper we wrote in 2006, and then you always wait, you know, because you think maybe there's something wrong with it and we'll have to adapt again. But it was just a very powerful framework, and it kind of opened up, yeah, it's a very general, flexible framework which allowed us to essentially handle any kind of challenge. So whatever next challenge came up, we could always make it work within this TMLE framework. And what helped a lot is, foremost, that you have a light, you have a criterion, and so you can tailor your
decisions towards that.
I wanted to pause you here for a second, because you said a few very interesting things. The first one is that this work on TMLE was in a sense ecumenical, right? You said, oh, we don't have to make this decision, we don't have to subscribe to just one camp, we can have, in a sense, the best of both worlds. And that seems to me like yet another example of you looking beyond the popular demarcation lines between different camps and just taking what's useful and what's applicable.
Yeah, that's of course also where my journey had helped, because I had
worked of course on maximum likelihood, then I got into the camp of estimating equations and all that, which was great, and then I was kind of part of that camp, also thinking maximum likelihood will never work. But then, certainly if you start writing a book on it, you start thinking philosophically about it. Am I on the right track? And that's often what you have to do in science, right? You have to continuously ask yourself, what am I doing? Does this make sense philosophically? My adviser would always say, if you work on something and you think it's good, you should be able to explain it during a hike in a forest with a friend who's not necessarily the best trained person. You should be able to do that, because then it should be philosophical and it should be beautiful and it should be something you can explain. And if you start losing that, if you have to talk about formulas and this and that, then something is off. Yeah. That's why you always have to step back and say, am I on the right track? Does this feel right? Is this something I can philosophically explain? And once you do that, then suddenly you start saying, hm, this is really not that natural, that whole estimating equation approach. And you know it's not natural because you see that for most estimation problems there is not even an estimating function. So you just know it's not natural, because otherwise nature would just automatically adapt to it. So then I had maximum likelihood in my history as something I understood, and then also understanding the limitations of the one-step estimator approach and the estimating equation approach. And then I said, okay, I have to go back to maximum likelihood and combine them so I get both worlds. And yeah, that certainly was the right idea. And I think, like, Marco Carone, who was a former postdoc of mine, is now at the University of Washington. He
also saw a reference in Pfanzagl, who was a person in the 80s or 90s working on efficiency theory and even second-order efficient estimators. And in his writing you can see it written that he says, wouldn't it be nice if... Because he was doing these kinds of updates like the one-step estimator: you take an initial plug-in estimator, add an empirical mean of a first-order canonical gradient, then you add another, second-order thing. And the thing is, of course, that was painful. Because if you start adding things to, like, your estimated probability, and you keep adding these kinds of noisy objects, you have no reason why at the end it will be a nice bounded probability. So he recognized that, and he said, wouldn't it be nice if we could do the updating in the model space, update the density, instead of doing it in the parameter space? He didn't say how it should be done, but he recognized how painful it is to do these updates in the parameter space. And so, yeah, it is actually really the natural idea to do targeted maximum likelihood. And where are we now? You know, 2025. This paper was written in 2006. So we're 20 years down the road. And if anything, it became stronger and stronger. Sometimes you start seeing things falling apart over many years, because you see this problem, you see that problem. This is the opposite. It's just flexible enough to be able to handle the challenges we have to deal with.
When I listen to you, and you talk about this targeting of the parameter, it brings to my mind the idea of double machine learning. You mentioned orthogonalization and double robustness before, and so on.
So the ideas are very close to that framework, which is a more recent framework. It seems to me that your framework and double machine learning, although they originate from different sources, so to say, behind the scenes they are very close to each other. Would you agree with this view?
In a way I don't, in another way I might, right? Because again this represents the difference with the augmented IPCW, as they call it, the augmented inverse probability of censoring weighted approach, which was Robins and Rotnitzky, really. And double machine learning was doing that, right? They were using an IPW-type thing, an IPCW, and then orthogonalizing it. They called it Neyman orthogonalization; well, Jamie Robins called it orthogonalizing with respect to the nuisance tangent space. And then they create an estimator from that. And yes, they started talking about machine learning, which, you know, we did early on, right? We have been using machine learning from the get-go, when everybody was angry about it.
That's another way where you transgress the camp boundaries, right? Against the camps.
Against the camps, absolutely. Yeah, that was something you should not be doing: you shouldn't be using machine learning and still think you can do theory. So I would say double machine learning, and they have some novel ideas in that literature, right, beyond the augmented IPCW and orthogonalizing to the nuisance tangent space, is really missing the piece which was the journey where I went from estimating equations and the one-step estimator towards plug-in, and that's the TMLE. So they skipped that. They came after the TMLE, yeah, and in that sense it misses out on the TMLE. In some ways, yes, all these approaches are based on what we call the canonical gradient, right? Efficiency theory defines for us an efficient estimator. So now you know that if you want to be efficient, you should be an asymptotically linear estimator with influence curve the canonical gradient. So all of them are aiming for that and succeeding in that, under very similar assumptions.
Mhm.
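For readers who want that statement in symbols, here is a sketch for one concrete case. The notation is an assumption made for illustration and does not come from the conversation: observations $O = (W, A, Y)$ with covariates $W$, binary treatment $A$ and outcome $Y$, outcome regression $\bar Q(a, w) = E[Y \mid A = a, W = w]$, propensity score $g(w) = P(A = 1 \mid W = w)$, and the average treatment effect as the target parameter in a nonparametric model.

```latex
% Target parameter (average treatment effect):
\psi_0 = \Psi(P_0) = E_0\left[\bar Q_0(1, W) - \bar Q_0(0, W)\right]

% Asymptotic linearity with influence curve equal to the canonical gradient D^*:
\hat\psi_n - \psi_0 = \frac{1}{n}\sum_{i=1}^{n} D^*(P_0)(O_i) + o_P\left(n^{-1/2}\right)

% Canonical gradient (efficient influence curve) of the average treatment effect:
D^*(P)(O) = \left(\frac{A}{g(W)} - \frac{1 - A}{1 - g(W)}\right)\left(Y - \bar Q(A, W)\right)
            + \bar Q(1, W) - \bar Q(0, W) - \Psi(P)
```

An estimator that satisfies the second line is asymptotically normal, and its variance can be estimated by the sample variance of $D^*$ divided by $n$.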
So in that sense they're all related, but the one which really marries the whole world of maximum likelihood, sieve maximum likelihood, Bayesian, with the actual efficient estimation is the TMLE.
Mhm.
Yes, there's a lot of similarity, because all of them work with the canonical gradient. But then how you use the canonical gradient is the key, right? For finite sample performance. And you can use the canonical gradient in essentially three ways. Number one was the original way in the 1970s and '80s. That's the one-step estimator: take an estimator of your target parameter and add the empirical mean of the canonical gradient at the initial estimator. Number two, use the canonical gradient as an estimating function. That's the Robins and Rotnitzky augmented IPCW, and also double machine learning. Number three, don't use the canonical gradient in that way. Create an initial density estimator. Create a little parametric path through the initial density estimator. Make sure its score is the canonical gradient. The canonical gradient happens to be a score. So I construct a little parametric model with only one unknown parameter, or sometimes multiple, and I do a little maximum likelihood step. That parametric model uses as offset your current super learner, whatever your fancy machine learning density estimator is, and it does a little parametric maximum likelihood, and then you get an updated density estimator, the targeted density estimator, and then you plug it into your parameter. These are the ways you can use the canonical gradient, and in that sense they're fundamentally different. From my perspective that makes them very different, because I know how important it is to stick to the likelihood, to have that criterion to make your updates, and to really be a plug-in estimator in the end.
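The third route described above (an initial fit, a one-parameter path that uses the initial fit as offset, a small maximum likelihood step, then plug in) can be sketched for the average treatment effect as follows. This is a minimal illustration under stated assumptions, not the implementation of any TMLE package: the simulated data, the deliberately crude initial outcome fit, the stratified propensity fit and every function name are hypothetical choices made for the example. The one-step estimator is computed alongside for comparison.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def simulate(n, seed=0):
    """Toy data (an assumption): binary confounder W, treatment A, outcome Y.
    By construction the true average treatment effect is 0.3."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < 0.3 + 0.4 * w)
        y = int(rng.random() < 0.2 + 0.3 * a + 0.3 * w)
        rows.append((w, a, y))
    return rows


def estimate_ate(rows):
    # Initial outcome fit: deliberately crude, it ignores the confounder W.
    q0 = {a: mean(y for _, a_i, y in rows if a_i == a) for a in (0, 1)}
    # Propensity fit: empirical treatment rate within each level of W.
    g = {w: mean(a for w_i, a, _ in rows if w_i == w) for w in (0, 1)}

    def clever(a, w):
        # Covariate that makes the score of the path equal the canonical gradient.
        return a / g[w] - (1 - a) / (1 - g[w])

    def q_eps(eps, a, w):
        # One-parameter logistic path through the initial fit (used as offset).
        return expit(logit(q0[a]) + eps * clever(a, w))

    def gradient_mean(eps, psi):
        # Empirical mean of the canonical gradient at the fit indexed by eps.
        return mean(clever(a, w) * (y - q_eps(eps, a, w))
                    + q_eps(eps, 1, w) - q_eps(eps, 0, w) - psi
                    for w, a, y in rows)

    naive = q0[1] - q0[0]                       # plug-in of the initial fit
    one_step = naive + gradient_mean(0.0, naive)

    # Targeting step: maximum likelihood for eps via Newton iterations.
    eps = 0.0
    for _ in range(100):
        score = sum(clever(a, w) * (y - q_eps(eps, a, w)) for w, a, y in rows)
        info = sum(clever(a, w) ** 2 * q_eps(eps, a, w) * (1 - q_eps(eps, a, w))
                   for w, a, y in rows)
        step = score / info
        eps += step
        if abs(step) < 1e-12:
            break

    # Plug the updated fit into the parameter: average the predicted difference.
    tmle = mean(q_eps(eps, 1, w) - q_eps(eps, 0, w) for w, _, _ in rows)
    return {"naive": naive, "one_step": one_step, "tmle": tmle,
            "gradient_mean_at_tmle": gradient_mean(eps, tmle)}


if __name__ == "__main__":
    for name, value in estimate_ate(simulate(20000)).items():
        print(f"{name}: {value:.6f}")
```

In this toy setup the propensity fit is saturated, so the one-step and targeted estimates come out essentially identical; they generally differ when the nuisance fits are not saturated. The targeted estimate is an average of differences of probabilities, so it stays within the parameter's natural bounds by construction.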
When a practitioner is considering which type of estimator they should use, whether they should go with TMLE or maybe with something else, what questions should they ask themselves in order to make the best informed decision?
You're asking me, right? I say use TMLE, but keep in mind that's because I did the whole journey. Of course, when you haven't made that whole journey, you might be part of a camp. In one camp, your PhD advisor was a Bayesian, so you're Bayesian, or maximum likelihood, doing model selection, all that. In another camp, you were into augmented IPCW and you think, that's what I should be using, and look, it has great properties. But if you're not belonging to a camp, yeah, then I can tell you TMLE is the way to go. It is the future, and not because of whoever developed it. In essence it's really a product of a huge academic community. It's building on all these advances from the 70s and 80s, it's building on efficiency theory, it's building on empirical process theory, on all of it, right? And all the work of the Pearls and the Robinses and so on. So this is not a simple or, you know, side thing. It evolved exactly out of it and it takes in all these advances. And also keep in mind TMLE is not a single algorithm. It's not like, oh, I have a problem, I need to do the average treatment effect, or I have survival data and I want to get the causal relative risk on survival at five years, so I'm going to use TMLE. TMLE has lots of choices. It's a framework, a template for constructing targeted plug-in machine learning estimators with valid inference. And so there are choices to be made. There is how you write down your statistical estimand, right? For example, in survival you might write it in terms of the conditional failure time hazard. Okay, good. Let's estimate that conditional failure time hazard. How are you going to do that? What algorithm are you going to use? Super learning, or the highly adaptive lasso is one we like a lot, or put both in the super learner. And then also how you do the targeting precisely. There are often choices to be made, right? They all will end up being efficient estimators, but how you do things can actually be tailored towards your particular sample, because your sample size might be very small, or your sample size may be not so small but there are a lot of challenges, like a lot of positivity issues, as we call them, where weights blow up. All these things can really influence what's the best way to do the precise targeting step, for example. And nowadays we have adaptive TMLE, so we kind of regularize the targeting steps to tailor them again to your finite sample, so you can get the best bias-variance trade-off in the finite sample. So that's why it's not a simple thing like, oh, I'm using TMLE, let me grab something. Of course most things have software now and they have implementations, but those are often defaults, right? They're potentially not tailored for your particular setting. So yeah, you have to do some homework, and there might be standard software packages which are giving you enough to work with. So that's why for me the real question is, how are we going to do the TMLE? A lot of our research is about that: how do we set up our prespecified statistical analysis plans, including the specification of the TMLE, for your particular study. And anyway, that's kind of where the software has to evolve as well, because the current software is often open source and not necessarily maintained that way, or not necessarily tailored and not user friendly enough. And so that's why I think there is a future for producing robust, industry-quality software that is also user friendly, so that these choices are kind of made for you in the right way.
Mhm.
And that's, yeah, also something I'm working on. We have a little company where we get into that, so that we can be advising on how to do this.
For people who are just starting and are maybe interested in open source implementations, what can they do? In my book, I describe one of the possible TMLE estimators, but I'm not using any framework. The book is focused on Python, and at the time of writing I was not able to find anything in Python. But I know that there are some things in R, and maybe there are also some in Python. What would you recommend to someone who is just starting?
Oh, I'm actually glad you asked. That's a good question. Again, I'm not married to R. Python is great. It just happened to be that in the community I was in, R was kind of the language, so we program mostly in R. And so yes, you can definitely find all kinds of R packages. For example, there's the tmle package for point treatment analysis, where you just want to get the causal effect of a single treatment on some future outcome, where you might have missingness, so that's all included. Then there are also packages for survival data, like survtmle, and there's another one, concrete, that we have. And then there's ltmle, which is longitudinal causal inference, TMLE with the sequential regression approach. And there are lots of such packages, because all kinds of people do that. But I have a student, Toru Shirakawa, and he has been programming what we call Deep LTMLE, which uses a transformer and deep learning to do the fitting, and still does the targeted maximum likelihood and all that. And that's done in Python, which has actually been much better from a memory and computational perspective. So that's why I'm not married to recommending R. If you live in the Python world, go for it, right? That's what he does, and he has a very powerful implementation. These implementations keep evolving, and it's already available for people who want to use it. This is going to happen more and more, and you probably know more about that than I do, but I'm quite sure that these kinds of things which maybe were not yet available in Python are evolving and getting more and more integrated there. So at some point hopefully there's not much reason to necessarily go to R. I would be interested in your view on whether you think Python is just the way to go or whether both have their place. It's great that all these communities start developing these products and have them available. So far my guess is that Python is just better when we get into really massive data and so on. But some R person might say, "No, we can handle that too." You just have to know how to do it.
Yeah, of course. There are camps as well. I think R is great. I do my work mostly in Python, but I have R here on my laptop, and whenever I want to do a quick check, build some quick data generating process and fit some models, maybe mixed effects models or linear models, I do it in R, because it's just so much faster to write it in R. You don't have to import statsmodels or something, and so on. So both still have strengths in certain areas.
Oh, definitely.
Areas like data science might have more advances in Python to deal with that. That's why Toru wrote this whole thing in Python, and he seems to be quite content with it. But it really depends on the needs.
Yes, I agree. So the name of the package is Deep LTMLE.
Yeah, Deep LTMLE.
Great. We'll link to it.
It's actually quite exciting, because there was LTMLE. I don't know to what degree you or the listeners are familiar with the sequential regression approach for learning the mean potential outcome under a time-dependent treatment regimen. You can essentially identify it by sequentially regressing. You start with the final outcome, regress it on the most recent past, and integrate out the last treatment according to your intervention. Make that your outcome, right, the predictions for everybody. Then you do a regression again, and in that regression again substitute for your treatment your desired regimen. And so you keep sequentially regressing, every time evaluating your regression at the treatment of your choice, so that you end up with the mean outcome under that treatment regimen. That's an approach, and you can do targeting of the regressions, and then you have LTMLE, and that can be done for all kinds of extensions and so on. But Deep LTMLE still sticks to that kind of framework, but it doesn't do it sequentially.
Mhm.
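The sequential regression recipe just described can be sketched for two time points. This is a minimal, untargeted illustration under stated assumptions: the data generating process, the use of saturated cell means as the "regressions", and all function names are hypothetical and are not taken from ltmle or Deep LTMLE. The targeting of each regression, which is what turns this into LTMLE, is omitted.

```python
import random
from collections import defaultdict


def simulate(n, seed=1):
    """Toy longitudinal data (an assumption): L0, A0, L1, A1, Y, all binary.
    Under the regimen 'always treat' the true mean outcome is 0.73."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l0 = int(rng.random() < 0.5)
        a0 = int(rng.random() < 0.3 + 0.4 * l0)
        l1 = int(rng.random() < 0.2 + 0.3 * l0 + 0.3 * a0)
        a1 = int(rng.random() < 0.2 + 0.5 * l1)
        y = int(rng.random() < 0.1 + 0.2 * a0 + 0.3 * a1 + 0.2 * l1)
        rows.append((l0, a0, l1, a1, y))
    return rows


def cell_means(keys, values):
    """Saturated regression: the mean of `values` within each distinct key."""
    total, count = defaultdict(float), defaultdict(int)
    for key, value in zip(keys, values):
        total[key] += value
        count[key] += 1
    return {key: total[key] / count[key] for key in total}


def sequential_regression(rows, regimen=(1, 1)):
    d0, d1 = regimen

    # Step 1: regress the final outcome on the full past, then evaluate the
    # fit with the last treatment set to the regimen value d1.
    q2 = cell_means([(l0, a0, l1, a1) for l0, a0, l1, a1, _ in rows],
                    [y for *_, y in rows])
    pseudo = [q2[(l0, a0, l1, d1)] for l0, a0, l1, _, _ in rows]

    # Step 2: regress those predictions on the earlier past, then evaluate
    # with the first treatment set to d0.
    q1 = cell_means([(l0, a0) for l0, a0, *_ in rows], pseudo)
    final = [q1[(l0, d0)] for l0, *_ in rows]

    # Step 3: average over the sample.
    return sum(final) / len(final)


if __name__ == "__main__":
    print(sequential_regression(simulate(20000)))
```

With more time points the same pattern repeats backwards in time, one regression per time point, which is the limitation the conversation turns to next.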
And so it's called temporal difference learning. It means you are fitting all these sequentially defined regressions simultaneously. That's why the transformer, and it's beautiful. I must say it took me some time to figure out what he was doing, because I kept telling him you cannot do that, till I understood what he was doing. But it's actually beautiful. It's writing down one big sum of losses over all time points, even though it's weird, because in every loss you don't know what the outcome is yet. So you define it in terms of your previous parametric model, essentially, which is this deep learning neural network model. And so at the end it's this big sum of losses, squared error losses or log-likelihood losses, and then you do a gradient descent, but for every loss you only take the derivative with respect to the regression part and not the outcome part, which is also a model.
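The idea of a summed loss whose targets are themselves model outputs, with the derivative taken only through the regression part, can be illustrated with a small tabular stand-in. This is not the Deep LTMLE algorithm and uses no transformer: the lookup tables replacing the neural network, the per-cell averaged gradient step, the toy data and all names are assumptions made purely to show the stop-gradient mechanics.

```python
import random
from collections import defaultdict


def simulate(n, seed=1):
    """Toy longitudinal data (an assumption): L0, A0, L1, A1, Y, all binary.
    Under the regimen 'always treat' the true mean outcome is 0.73."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l0 = int(rng.random() < 0.5)
        a0 = int(rng.random() < 0.3 + 0.4 * l0)
        l1 = int(rng.random() < 0.2 + 0.3 * l0 + 0.3 * a0)
        a1 = int(rng.random() < 0.2 + 0.5 * l1)
        y = int(rng.random() < 0.1 + 0.2 * a0 + 0.3 * a1 + 0.2 * l1)
        rows.append((l0, a0, l1, a1, y))
    return rows


def fit_jointly(rows, regimen=(1, 1), epochs=60, lr=0.5):
    d0, d1 = regimen
    q2 = defaultdict(float)  # table for the regression at the last time point
    q1 = defaultdict(float)  # table for the regression at the first time point

    for _ in range(epochs):
        grad2, n2 = defaultdict(float), defaultdict(int)
        grad1, n1 = defaultdict(float), defaultdict(int)
        for l0, a0, l1, a1, y in rows:
            k2, k1 = (l0, a0, l1, a1), (l0, a0)
            # Loss at the last time point: (q2[k2] - y)^2.
            grad2[k2] += q2[k2] - y
            n2[k2] += 1
            # Loss at the first time point: (q1[k1] - target)^2, where the
            # target is the current q2 prediction under the regimen. The
            # target is treated as a constant: no gradient flows into q2.
            target = q2[(l0, a0, l1, d1)]
            grad1[k1] += q1[k1] - target
            n1[k1] += 1
        # Both tables are updated in the same pass (simultaneous fitting),
        # using a per-cell averaged gradient step.
        for key in grad2:
            q2[key] -= lr * grad2[key] / n2[key]
        for key in grad1:
            q1[key] -= lr * grad1[key] / n1[key]

    return sum(q1[(l0, d0)] for l0, *_ in rows) / len(rows)


if __name__ == "__main__":
    print(fit_jointly(simulate(10000)))
```

Because the targets are held fixed inside each gradient step, the fixed point of these updates is the same pair of regressions that fitting them one after another, backwards in time, would produce.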
So then it works: you're solving the right score equations and everything works. So it's a beautiful idea, actually, and it really helps us, because sequential regression had limitations: you had to have enough data, you couldn't have too many time points. And now with this Deep LTMLE we can have an arbitrary number of time points, we can handle continuous time, so yeah, it's really quite powerful.
That's really interesting. We'll link to the package in the show description so everybody can check it. We're talking about transformers, so I have to ask this question: what's your take on LLMs, on large language models?
Yes.
Yeah. No, it's phenomenal. I mean, I can see it with my children, and, you know, our youngest son is just living by it. He's using it for everything, and they're getting very good at it. So yeah, this ChatGPT is quite miraculous, I must say. And of course things will evolve and get better. But it's not something we can deny or suppress. There's no way; this is happening, and it's happening fast. Of course, we need to think carefully about how things are evolving and what we trust and don't trust. But yeah, I'm a big fan, and I think it really was kind of a revolutionary step. So it's quite remarkable. And I see how it changes people. I'm using my son as an example: how he, in a very playful way, learned so much more than he could have learned in college, just by continuously operating with ChatGPT and playing with it and programming with it and setting up websites or whatever. And he learns all these skills which he didn't know how to do. So I think it's quite something.
So he's using it as a teacher, as a learning partner.
It's a partner. It's a partner for him. But in the meantime, he learns a lot about all kinds of things. But yes, it's definitely a partner he collaborates with, in essence, and he does all kinds of stuff and guides it to what he needs. So I think all kinds of people have become so much more powerful than before by just having it as a partner. So it's quite something. And yes, we are using it to some degree. I mean, like I said, the transformer we use as part of the estimation in Deep LTMLE, but we are also working on co-pilots for developing causal estimates using large language models. We have a student working on that. So it's certainly something we think can be very helpful, also for statistical analysis plan development and automating things. And again, like you said before about structural equations being more accessible for certain people than others, similarly these large language models can really make things much more accessible and user friendly, to carefully go through the steps of the roadmap we talked about, right? And so really make that more automated, and think about the kind of things you have to think about. You can kind of train the large language model in that way of thinking and that way of answering, but at the same time, if it gets a little off, then it still has interesting things to say. So it's different from just writing a program, because in a program you can just say, ask for this, then do that. But in this case you essentially also train it on what to do, but at the same time it has that automatic flexibility that it can still jump in and answer things spontaneously.
Which can be a benefit and a risk at the same time, right?
Yes.
That's right. That's the tricky part. And that's why we need good ways to... I think it gives all kinds of new statistical problems to think about: how to evaluate, how to set up designs with change, how to optimize it for different purposes, and so on. So yeah, I think it's an exciting time.
Now I want to take a step back into the old school times and ask you: what are two books that changed your life?
To be honest, I'm not much of a book reader. I write books myself, but then I never read them. It's that bad. I like the writing, because that really gives me a way to concentrate and write up how I think and how the journey went. I'm not much of a reader, including of my own work. I mean, books which influenced me certainly were things like Andersen, Borgan, Gill and Keiding on counting processes, and also Aad van der Vaart's books influenced me, of course, and also Bickel, Klaassen, Ritov and Wellner. It's harder to read, but a lot is based on the work of people like that. Those are a few that came to mind; there are probably more. If anything, it's a weak point that I'm not much of a reader. Yeah. Now and then I look into things, but I let my students do a lot of the reading to make sure we're up to date on the literature.
If somebody could only read one of your books, which one should that be?
I would say the upcoming one. We are writing a book on the highly adaptive lasso in combination with TMLE. That will be, I think, a very powerful story. But yes, I would say read the first Targeted Learning book, because the book with Jamie, Unified Methods, was quite difficult, and then before that there's the multiple testing book we have. Yeah, I would say the Targeted Learning book is probably the better one. And more are coming.
Great. I'm looking forward to it, and I hope our listeners are as well. Mark, if you could give one piece of advice to people who are interested in working on challenging problems, maybe in your area or some other area, machine learning, maybe statistics, maybe causality, what would that be?
I mean, yeah, I know of course what I'm
kind of thinking about, but that's just one piece of things and kind of part of my journey, to be honest. That's why it naturally flows. You know, TMLE was developed in principle for what we call pathwise differentiable statistical estimands, so things you can estimate at a rate of one over the square root of the sample size, so that you can get asymptotic normality and all that. And what I've been working on over the last years is generalizing that to TMLEs for any function you're trying to learn, including, say, a conditional treatment effect, and still getting inference and confidence intervals and all that. That's also my work on HAL. So in essence what's happening in my research is I'm going back to maximum likelihood more and more. So it's like full circle. I started with analyzing maximum likelihood, then the estimating equations, then targeted maximum likelihood using the super learner, but now realizing that maximum likelihood is so important also for the sake of machine learning, and the highly adaptive lasso is an example of that. It's a very powerful machine learning algorithm with incredible theoretical properties, all because it's like maximum likelihood under constraints on the target functions or densities in terms of variation norms and all that. And then you can still get maximum likelihood estimators which are highly nonparametric solutions and get all the asymptotic normality and everything. So that's an area of research I think is still very interesting. But moreover I would say: understand the roadmap and attack the real world. And when you attack the real world, don't cut corners. Don't move immediately towards, oh, I'm going to do this or I'm going to do that. That's the whole idea of the roadmap: really be real about the real world, translate it into an estimation problem truly addressing the real world problem, and then you will see there's an interesting research area, right? Because it's not always things you can use off the shelf. There will always be new challenges. That has been my experience: you can get all kinds of interesting problems to work on by being out there in the real world and translating, and then you see, oh, that's a challenge, darn, how are we going to do that? And then you have an interesting research area. So that's kind of my take. I think adaptive designs, reinforcement learning combined with statistical inference, is another big one. Yeah, there's a lot out there. Overall, I would say we truly are getting to the point where we can start integrating, in both the designs and the adaptive designs, the state of the art of machine learning and still get valid inference and answer all kinds of causal questions.
That's really great. That's a really exciting prospect, you know.
Yeah.
Before we close, what's your message to the causal community?
I actually think the causal community is great. I mean, I've seen so many young smart people coming up. I meet them now and then, and I'm impressed by how eager they are to really not just do something, but actually also understand it and learn it. And I like that. I like the fact that there are lots of people out there who truly want to understand what they're doing, be formal, formulate the problem carefully, and try to do a rigorous job. And of course, even if you have your eye on the ball, some people are more on the pragmatic side, some are more on the data analysis side, some are more on the computing and programming side. All of that is important and every contribution is important, so people have to find their way and what drives them. That's also what I do with students. I try to make them figure out what makes them excited, and that's very different depending on the type of person. And so I think that goes in general for the causal community as well. Don't necessarily feel you have to belong to a particular camp or a particular expertise. Something has to fit you, and once it fits you, then things will happen. So just make sure you get inspired. If you're inspired, then you will learn a lot and it will work out. So yeah, that's my take. A large variety of contributions is needed.
Thank you so much. That's a really beautiful take. I really love it, and I think it's a great closing for our conversation as well. Mark, I hope to have you again when your new book is out.
Great.
I'm sure we'll have a lot of topics to discuss, and I'm really grateful that you found time for this conversation.
Absolutely.
Thank you so much.
It was great. Thanks a lot for setting this up. All right.
Thank you, Mark.
Bye.