Ultra-Fast Language Generation || Hybrid twinning using PBDW and DeepONet || Jan 16, 2026
By CRUNCH Group: Home of Math + Machine Learning + X
Summary
Topics Covered
- 64x faster language generation without infrastructure tricks
- Humans don't read token by token
- Why discrete diffusion breaks standard distillation
- Block diffusion: marrying autoregression and diffusion
- Teach AI only what physics cannot explain
Full Transcript
Hello everybody, welcome to today's CRUNCH seminar. Today we have two speakers and two talks. Our first speaker is Haoyang Zheng.
He will be giving a talk on ultra-fast language generation via discrete diffusion divergence instruct. Haoyang Zheng is a PhD candidate in the School of Mechanical Engineering at Purdue University, advised by Dr. Guang Lin. He studies generative modeling and reinforcement learning from a probabilistic modeling perspective. His research interests include large language models, continuous and discrete diffusion models, reinforcement learning, and sampling and optimization methods aimed at improving latency and robustness. With that short introduction, the floor is yours and you can start your presentation.
Okay. Thanks for the kind introduction, and thanks for the invitation from Professor Karniadakis. It is my pleasure to give a talk at the CRUNCH seminar. Today I will talk about our recent work on large language models. The title is here, and we also call it DiDi-Instruct.
This work is a joint effort with a fantastic team of collaborators: Xinyang from UT Austin; Cindy, who is a PhD student at Purdue; Nan, who graduated from Purdue University and is now a tenure-track assistant professor; and Zheyuan from NUS. Unfortunately I did not get his photo for this slide, but since he has worked closely with Professor Karniadakis, I believe many of you are familiar with him. Wei is another Purdue alumnus who is now a research scientist at Morgan Stanley. Weijian graduated from Peking University and is currently a research scientist at Rednote. And also Dr. Guang Lin.
So I will start my talk with a brief introduction to diffusion language models: how they differ from autoregressive models, and why we care about diffusion language models. Then I will give a mathematical definition: how we define the forward and reverse processes of a diffusion language model, and how we define the objective to train diffusion language models. Then comes the main part of this talk, the proposed DiDi-Instruct, which does distillation for few-step generation, or faster inference.
There are four main challenges that we faced when we developed this method, and I will talk about how we address them. We also ran a lot of experiments to validate our method, which I will introduce in the experimental section. The last part is our future plans; I will briefly introduce some directions that we are developing related to DiDi-Instruct.
Nowadays I believe all of us are familiar with large language models, especially autoregressive models like ChatGPT or Gemini. They generate a sequence token by token, from left to right. This works well because many companies do a lot of infrastructure work to optimize them, but it still faces a major limitation: inference speed. Think about the thinking mode in ChatGPT or Gemini: they need to generate a vast number of internal tokens to reason about a problem before giving an answer. So for complex tasks we often have to wait minutes to get a result. Due to this bottleneck, there is a growing trend in academia to explore diffusion language models as a potential solution.
So the core feature of diffusion language models is parallel sampling. In each step the neural network predicts the logits for all tokens simultaneously.
So theoretically this means we could generate a complete response in just a single sampling step. But we also need to care about generation quality, and getting high-quality text in one step is quite difficult in practice. A common tradeoff is to generate two tokens per step, which can achieve quality comparable to autoregressive models. That means, without considering any infrastructure optimization, a diffusion language model can achieve about a two times speedup compared to an autoregressive model.
But clearly this does not reach the theoretical speed limit of diffusion models, since they can generate everything at once. To bridge this gap we propose DiDi-Instruct, which allows sampling up to 64 tokens in a single step while maintaining high quality, without any infrastructure optimization. This means it achieves about a 64 times speedup compared to an autoregressive model. This is the motivation for developing DiDi-Instruct.
There are some other interesting statements that motivate this area. For example, there are some informal arguments that when humans process or understand language, we do not necessarily read from left to right, token by token. We just try to capture some keywords to understand the information globally.
Another point is that if we consider multimodal tasks like image generation or video generation, they inherently do not have a natural left-to-right order. So sampling from random positions, as a diffusion model does, is much more reasonable than forcing them into an autoregressive order. This means diffusion language models might be more compatible with multimodal tasks than autoregressive models.
Okay. So before I dive into the mathematical formulation, I created an animation to help you understand how a diffusion language model works. It is similar to a continuous diffusion model: it has a forward process and a reverse process. For the forward process, at first there is a complete, clean sequence. At each step we randomly select a subset of tokens, like this red one and this red one, and we transform them to mask tokens, like this one. Then we fix these two tokens. For the next step, we transform another two positions from text to mask tokens, and we keep doing this. Finally, at the end of the forward process, you get a complete sequence of mask tokens, like this one. This completes the forward process.
For the backward process we need to define a neural network. In practice we randomly select certain masked positions, like this red one, and based on the network's output we sample the token and convert the mask token to text. Once these tokens are sampled they are fixed, and for the next step we sample other mask tokens and transform them to text. We keep doing this until finally the backward process recovers the whole sequence, like this one. I hope this helps you understand the process. If you have any questions, feel free to stop me and I will try my best to explain everything.
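The forward (masking) half of the animation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `[MASK]` symbol, the function name `forward_mask`, the example sentence, and the fixed number of positions masked per step are all hypothetical and are not taken from the slides.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def forward_mask(tokens, num_steps, rng):
    """Absorbing forward process: at each step, mask a random subset of the
    still-clean positions. Positions that are already masked stay masked."""
    seq = list(tokens)
    clean_positions = list(range(len(seq)))
    rng.shuffle(clean_positions)
    per_step = -(-len(seq) // num_steps)  # ceiling division
    trajectory = [list(seq)]
    for _ in range(num_steps):
        for _ in range(min(per_step, len(clean_positions))):
            seq[clean_positions.pop()] = MASK
        trajectory.append(list(seq))
    return trajectory


rng = random.Random(0)
sentence = "diffusion models generate tokens in parallel order".split()
trajectory = forward_mask(sentence, num_steps=4, rng=rng)
for step, state in enumerate(trajectory):
    print(step, " ".join(state))
```

The first printed line is the clean sentence and the last is fully masked; the reverse process walks the same kind of trajectory in the opposite direction, with a network choosing the tokens.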
Okay.
All right. So we already have a high-level understanding of what the forward process and the reverse process look like, and here is the mathematical definition. Just like continuous diffusion, we have a time step t and a complete sequence of data x.
The key difference from continuous diffusion is that we also need a corruption rate, because we need a rate that says how a token is transformed to the mask state.
After defining this, we continue to define the transition probabilities. We need to define a transition from t = 0 to an intermediate state z_t, and we also need another transition probability from time step t to time step s.
So I think that's all we need to define the forward process, or maybe we also call it the absorbing process.
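The slide equations are not reproduced in the transcript. As a point of reference only, a common way to write an absorbing (masking) forward process in the masked diffusion literature is shown below. The notation is an assumption: \alpha_t denotes the probability that a token is still clean at time t (decreasing from 1 to 0), \mathbf{m} is the one-hot vector of the mask token, and s < t. The talk's own symbols and ordering conventions may differ.

```latex
% Marginal from clean data x to the intermediate state z_t
q(z_t \mid x) = \mathrm{Cat}\bigl(z_t;\ \alpha_t\, x + (1 - \alpha_t)\, \mathbf{m}\bigr)

% Transition between two times s < t
q(z_t \mid z_s) = \mathrm{Cat}\Bigl(z_t;\ \tfrac{\alpha_t}{\alpha_s}\, z_s
                  + \bigl(1 - \tfrac{\alpha_t}{\alpha_s}\bigr)\, \mathbf{m}\Bigr)
```

Under this convention each token is independently either kept or replaced by the mask token, which is why the process is called absorbing: once a position is masked it never returns to text in the forward direction.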
Hi, I have a question.
Yes, please.
Just a quick question on the last slide. I wanted to ask: the corruption rates are always non-negative. Is that correct?
Yeah, it should be between zero and one.
Right. Okay.
Thanks for the question.
Any other questions?
Yes, I have one question. How do you define these transition matrices? In the end, do you try to learn them, or do you define them beforehand?
That's a good question. For the backward process we need to approximate this forward process, to learn how to get the transition probability. Does that answer your question?
Sorry, can you say your question again?
Can you explain again how you define the transition probabilities, if they are learnable? Sorry, my internet dropped, so I didn't hear your answer.
Do you mean how I define the forward transition probability?
Yes. How do you construct them?
So I think you first need to have the corruption rate. You first need a time step, and the time step corresponds to a corruption rate. Once you have the corruption rate, if we have an intermediate step t, you use this equation to calculate the probability at this state, how it goes from the data to the intermediate state z_t, and then
Yeah, okay. So then it will be associated with the way that, because I heard you say that every time you randomly replace some tokens, you mask some tokens, right? So if I understood correctly, what you said is that you actually compute the transition probability, and you compute it based on the two states, right?
Yeah, I think you can see that we just follow this equation to compute the probability.
Okay.
Does it make sense?
Yeah.
And this is a categorical distribution, if you're not familiar with it.
Mhm. Okay.
Thank you.
Yeah. Okay.
So next we will move on to the reverse process. We first need to parameterize the neural network: the input is the token IDs and the time step, and the output is the logits. We also need to define a transition kernel to approximate the forward transition kernel q. Here I have an example to help you understand how this neural network works. Consider a batch size of eight and a sequence length of 1,000, and a tokenizer like the one for ChatGPT, or rather GPT-2, whose vocabulary size is about 50,000. The input of the neural network has shape batch size times sequence length, and we input the token ID here, so each entry is an integer and the dimension is one. We also need to input the time step. For the output, the batch size and sequence length are the same, but the difference is that now it outputs the logits of all possible tokens, so the last dimension will be 50,000.
With these logits we can transform them into probabilities and then sample the token, just like an autoregressive model. The difference from continuous diffusion is here: for discrete diffusion you need to sample the token, which is a discrete operation. So here we need to use the corruption rate that we defined previously. With probability 1 - alpha_t, the current position remains masked, and with probability alpha_t we transform the mask token to a token, which we sample according to the logits.
Here is an example. Suppose the corruption rate is 0.1 at this time. Because we have a sequence length of 1,000, that means around 100 positions will transform from mask tokens to text.
Okay, that's how we define the reverse process. We need a neural network and also a sampler to sample the tokens according to the logits provided by the neural network.
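A single reverse step of the kind just described can be sketched as follows. Everything here is a toy under stated assumptions: `toy_logits` is a hypothetical stand-in that returns random logits instead of a trained network, `MASK_ID` is an arbitrary id, and the vocabulary is shrunk to 16 entries (rather than roughly 50,000) so the example runs quickly in plain Python. The per-position unmasking probability of 0.1 follows the example in the talk.

```python
import math
import random

MASK_ID = -1  # hypothetical id for the mask token


def toy_logits(token_ids, t, vocab_size, rng):
    """Stand-in for the denoising network: one logit vector per position.
    A real model would condition on token_ids and the time step t."""
    return [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in token_ids]


def sample_from_logits(logits, rng):
    """Softmax sampling; this discrete draw is the non-differentiable step."""
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def reverse_step(token_ids, t, unmask_prob, vocab_size, rng):
    """Each masked position is unmasked with probability unmask_prob and
    filled by sampling from the logits; sampled tokens stay fixed."""
    logits = toy_logits(token_ids, t, vocab_size, rng)
    out = []
    for tok, pos_logits in zip(token_ids, logits):
        if tok == MASK_ID and rng.random() < unmask_prob:
            out.append(sample_from_logits(pos_logits, rng))
        else:
            out.append(tok)
    return out


rng = random.Random(0)
seq_len, vocab_size = 1000, 16
z_t = [MASK_ID] * seq_len
z_s = reverse_step(z_t, t=1.0, unmask_prob=0.1, vocab_size=vocab_size, rng=rng)
num_unmasked = sum(tok != MASK_ID for tok in z_s)
print(num_unmasked)  # on the order of 100 positions out of 1,000
```

Repeating `reverse_step` on its own output with a suitable schedule eventually fills every position, which is the multi-step sampling loop that distillation later tries to shorten.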
Okay. So now let's look at how we actually train the neural network. This is similar to continuous diffusion: we train with the negative evidence lower bound, just like this one.
Okay. So I think at this point we can train the diffusion language model, but there is still an obvious bottleneck here, which is why we care about distillation.
The issue is that while the ELBO-trained model generates high-quality text, it does not fully unlock the potential for ultra-fast or parallel sampling. To maintain high-quality generation, we are forced to have the diffusion language model sample at most two tokens per step. That means a diffusion language model trained with the ELBO still requires hundreds or even thousands of function evaluations, and that leads to substantial latency. So to get faster inference we propose DiDi-Instruct. The main idea is that we define a student model, and the student model has the same input and output as the teacher model. The teacher model is a diffusion language model trained with the ELBO. The main difference between the student and the teacher is that the student can do few-step generation: it requires about 8 or 16 steps to generate high-quality text, just like the teacher model. That means a speedup compared to the teacher model. The theoretical foundation is distribution matching distillation: we define the integral KL divergence and train the student model so that it can do few-step generation.
Okay. So actually this distillation works well in continuous diffusion. I will first briefly introduce how it works in continuous diffusion, then explain why it does not work for diffusion language models, and then how we solve some challenges to develop this framework. For continuous diffusion, if we want to do distillation, we also need to define a student model, whose input, output, and network structure are the same as the teacher model. We also need to define the forward marginal for the teacher model. You have the neural network p_theta to produce the output, then you get the data, and this q is the forward marginal, so it corrupts the data to some intermediate state z_s.
This is how we define the forward marginal. Our objective is to use the integral KL divergence to match the distributions of the student and the teacher, just like this one.
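The slide formula is not captured in the transcript. For orientation, the integral KL divergence used in distribution-matching distillation is usually written along the following lines. The notation is assumed rather than taken from the slides: \nu for student parameters, \theta for teacher parameters, w(t) for a positive weighting over time, and q(z_t \mid x) for the forward corruption kernel.

```latex
% Forward marginals induced by the student and the teacher
q_\nu(z_t) = \sum_{x} p_\nu(x)\, q(z_t \mid x), \qquad
q_\theta(z_t) = \sum_{x} p_\theta(x)\, q(z_t \mid x)

% Integral KL divergence: a time-weighted KL between the two marginals
\mathcal{D}_{\mathrm{IKL}}(q_\nu \,\|\, q_\theta)
  = \int_0^1 w(t)\, \mathrm{KL}\bigl(q_\nu(z_t) \,\|\, q_\theta(z_t)\bigr)\, dt
```

Matching the marginals at every noise level, rather than only at t = 0, is what lets the corrupted samples from student and teacher be compared even when their clean outputs barely overlap.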
The main challenge here is that we cannot get q_ν and q_θ when we train the neural network. I will explain this later, but you can keep it in mind for now. To get a tractable objective function, this work, called Diff-Instruct and published at NeurIPS 2023, developed an objective like this one. I will give some high-level intuition for this objective. s_ν and s_θ are the scores of the student and the teacher, and the partial-derivative term is the gradient of the generated data with respect to the parameters. So the objective can be written as a product of the score difference and the gradient of the data with respect to the parameters. For this term, we can get the sampled data and then use autograd to calculate it. And for the scores of the student and the teacher, we can get them directly. That means this objective is totally tractable, and we can directly use it to distill the student model and do few-step generation.
This works well in continuous diffusion, but when we apply it to diffusion language models we encounter an issue. The main issue is the term I mentioned before, the gradient of the data with respect to the parameters. It uses autograd to calculate the gradient, but in a diffusion language model this involves non-differentiable operations like argmax, which we need when we use the logits to sample the token. This interrupts the computational graph, and this is why this objective does not work for diffusion language models.
Okay. So let me stop for a while, in case you have any questions.
Maybe I have a question. It's a general one about distillation. In this case, does it mean that after training, your model will be just as good as the teacher model, so it cannot outperform the teacher model? Or is there a way to do something afterwards to improve it further?
Actually, when we do the distillation we have to match the performance of the teacher model. If you look at this red line, this is our proposed method, and this is a metric to test the performance of a diffusion language model. The smaller the better. So when we do the distillation and get fast inference, we are required to at least match the performance of the teacher model. The teacher model is the blue line. That means if we consider 16 steps, the sample quality can match the teacher model, and if you just use eight steps it does not match. So we would select 16 steps to do the inference. Does that make sense?
Yeah. So it actually can outperform it. And one question here: how did you choose that teacher model? I mean, is it possible to choose something like, I don't know, Gemini 3.0, or something that is a bit more
Yeah, that's a good question. In our framework there are two functions for the teacher model.
The first is to provide a good initialization for the student model. For what you said, Gemini, if you use an autoregressive model, I think the network structure is different from a diffusion language model, so we cannot directly use it for initialization.
The other reason we use it is that we need some regularization when we distill the model, because it is a pretty abrupt improvement. Sometimes the objective will push the student model too much, and that leads to performance issues like model collapse. So we need the teacher model to let the student know: you cannot move too far away from the teacher model. This is a regularization.
So I would say if you use an autoregressive model as the teacher, it does not work. You still need to train a teacher model with the ELBO, or maybe just use some open-source diffusion language model as the teacher. I hope this answers your question.
Yeah, that makes a lot of sense. Thank you.
Mhm.
Hi, I had one clarification on what has been discussed so far.
Is the teacher model also being trained in this process, or is it frozen throughout while the student model is being learned?
Yeah, that's a good question, which I will explain later actually. For the teacher model, the parameters will be frozen. It is just a reference for the student model.
We do not train the teacher.
Okay. So initially you have a fully, properly trained teacher model, and that is used to train the student model. Is that how it works, or is the teacher model also updated during the update of the student model, at some intervals?
The teacher model is just a good initialization for the student model. The teacher model will not be trained during the distillation. It is just frozen there.
Okay, got it. And could you also elaborate on the difference in the inference process?
With autoregressive models you respect causality, right, because you mask only the future time steps. But in this case are you saying that the masks are randomly generated and then you try to reconstruct the words from that?
Is that how it works?
Yeah. That's right.
Okay.
Actually this is a naive setting. We mainly focus on distillation to do fast inference. Of course there are some advanced decoding strategies that guide how a diffusion language model can get better sample quality. But because our work mainly focuses on distillation, we just use the naive setting, where we randomly decode each position.
Yeah, that's a good question.
Got it. Okay. And is this in some way similar to BERT, like some of the BERT models? Because even in BERT you have these maskings, which are done randomly with some defined fraction, right?
Yeah, there are different types of diffusion language models. As you see here, we mainly focus on masked diffusion models. It actually takes a lot of ideas from BERT, like handling different noise levels, and it is also like scaling up BERT.
I hope that answers your question.
Thank you.
Yeah.
So if there are no other questions, I will move on to introduce our training framework. As I mentioned before, we have the teacher model and the student model. The teacher is frozen and we only train the student model. Because we also use adversarial training, we also have a discriminator here. The main process is as follows. We first have a sequence of mask tokens, and we input it to the student model and the teacher model. We require them to recover the whole sequence, then corrupt it to some random time step t_i, which gives the partially masked sampled sequences, and then we input them to the discriminator. The task for the discriminator is to identify whether a sampled sequence is from the student or from the teacher. We have the ground truth and the binary output of the discriminator, and that gives the objective to train the discriminator. On the other hand, we also input the student sequence to the discriminator, apply some transformation to the discriminator output, and get a reward signal. The reward signal is used to train the student model, and we train the student model and the discriminator alternately.
Finally, when the student model and the discriminator are trained well, the student model can do few-step generation, and the discriminator can classify sampled sequences as coming from the student or the teacher.
Okay. So let's tackle the first challenge that I mentioned before: the objective from continuous diffusion cannot be used for diffusion language models. Here we first have a mathematical derivation to get this score-function identity. We take some ideas from the policy gradient, and we decompose the integral KL divergence objective into the product of the reward signal and the policy gradient. The reward signal here is the log density ratio between the student model and the teacher.
It tells you how much you should update along this direction in the parameter space.
The gradient term tells you the direction in which you should update the parameters. This policy-gradient-form objective can be directly used to train the diffusion language model.
I also want to give some high-level idea of why this objective is effective and why it allows the student model to do few-step generation. If you look at this diagram, it visualizes the reverse process. We start at time t = 1, which is a fully masked sequence, and the final destination here, t = 0, is a fully unmasked, clean sample. The teacher model is trained with the ELBO, so intuitively this objective aligns the update direction at any given point with the tangent of the trajectory. That means if we have many backward steps, we can set the step size to be pretty small, so the discretization error remains low and the teacher model can produce high-quality text. However, if we force the teacher model to do few-step generation, the step size will be pretty large. This introduces a large approximation error, and that's why the teacher model cannot do well in few-step generation.
Okay. So for the student model, our objective changes the update direction. It forces the update direction at every point on the trajectory to point directly to the final destination at t = 0. Because of this, even when the step size is pretty large, the student still moves in the correct global direction. That's the fundamental reason why the student model is capable of few-step generation.
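The score-function idea mentioned above, a reward equal to the log density ratio multiplied by the gradient of the log-probability, can be checked on a toy problem. The sketch below is not the DiDi-Instruct objective itself: it uses a single categorical "student" with four outcomes, a fixed "teacher" distribution, and no diffusion time, all of which are simplifying assumptions. It only shows that the reward-times-score expectation reproduces the true gradient of the KL divergence without differentiating through sampling.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(student_logits, teacher_probs):
    """KL(student || teacher) for categorical distributions."""
    p = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, teacher_probs))


def score_function_gradient(student_logits, teacher_probs):
    """E_{x ~ student}[ R(x) * grad log p(x) ] with R(x) = log p(x) - log q(x).
    The expectation is computed exactly by enumerating x (no sampling noise)."""
    p = softmax(student_logits)
    grad = [0.0] * len(p)
    for x, px in enumerate(p):
        reward = math.log(px) - math.log(teacher_probs[x])
        for j in range(len(p)):
            dlogp = (1.0 if j == x else 0.0) - p[j]  # d log p(x) / d logit_j
            grad[j] += px * reward * dlogp
    return grad


def finite_difference_gradient(student_logits, teacher_probs, eps=1e-6):
    """Central differences on the KL, used only as an independent check."""
    grad = []
    for j in range(len(student_logits)):
        up = list(student_logits)
        down = list(student_logits)
        up[j] += eps
        down[j] -= eps
        grad.append((kl(up, teacher_probs) - kl(down, teacher_probs)) / (2 * eps))
    return grad


student_logits = [0.5, -1.0, 2.0, 0.0]
teacher_probs = [0.1, 0.2, 0.3, 0.4]
sf_grad = score_function_gradient(student_logits, teacher_probs)
fd_grad = finite_difference_gradient(student_logits, teacher_probs)
print(sf_grad)
print(fd_grad)
```

In practice the expectation is estimated from sampled sequences instead of enumeration, which keeps the estimator unbiased but introduces variance.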
Okay. So this is how we address the first challenge. The second challenge is that the objective here is not tractable. As I mentioned before, this is because the log density ratio between q_ν and q_θ is not tractable. Each marginal is a sum of products: you first have the neural network output the sample, and then you have the forward process corrupt it to some intermediate state at time s. This requires going through the whole discrete space to cover all x, so this is not trainable. Our solution is to again use a discriminator.
The discriminator has the same input as the teacher model and a binary output. When it is trained well, it can represent the log density ratio between the student and the teacher, just like this one, and then we can represent the reward as a function of the discriminator, just like this one.
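The link between a well-trained binary discriminator and the log density ratio is the standard density-ratio trick, sketched below. The labeling convention is an assumption: the discriminator output d is taken to be the probability that a sample came from the student, and the function names and the two density values are hypothetical.

```python
import math


def optimal_discriminator(q_student, q_teacher):
    """Bayes-optimal output for balanced classes: P(student | sample)."""
    return q_student / (q_student + q_teacher)


def reward_from_discriminator(d):
    """Logit of the discriminator output, log(d / (1 - d))."""
    return math.log(d) - math.log(1.0 - d)


# Hypothetical densities of one partially masked sample under each model
q_student, q_teacher = 0.03, 0.01
d = optimal_discriminator(q_student, q_teacher)
reward = reward_from_discriminator(d)
true_log_ratio = math.log(q_student / q_teacher)
print(reward, true_log_ratio)
```

The point of the trick is that the logit can be read off a trained classifier, so neither marginal density ever has to be summed over the full discrete space.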
Okay. So here is another challenge: training instability, which is inherent in the policy gradient. It uses an unbiased estimate, but it introduces high variance. What we see when we train the diffusion language model is that it can generate somewhat high-quality samples, but the sample diversity is bad. So it may cause a mode collapse issue.
To address this we consider two different strategies, one for the reward part and one for the gradient part.
For the reward, we take an idea from DeepSeek's GRPO. We first have a batch of rewards, then we calculate the mean and variance of the rewards, and then we directly normalize the reward signal, just like this one.
To stabilize the gradient, we decompose this term into two terms, just like this one. Here is a visualization of how we do this decomposition.
This is still a backward process that starts at t = 1 and ends at t = 0. The left-hand term is like this blue line here: it starts from t = 1 and points directly to t = 0.
This would work well if we used just one step during inference. But language generation with a diffusion model is a pretty challenging task, so we cannot use just one step to get a high-quality sample. In our experiments it requires about 16 steps. If we train with this blue line, the student does not encounter any intermediate state on the trajectory. So at inference, when we jump to some intermediate state here, the student model has never seen that situation, so it will fail and have a mode collapse issue.
With the right-hand side, when we decompose it into two terms, it is like this red line here. We first sample the time step t_i, which corresponds to an intermediate state z_i. We start from t = 1 and point to the intermediate state z_i, and then we start from z_i and point directly to t = 0. This actually addresses the mode collapse issue when we train the student model.
Okay.
So the last challenge here is this: we have actually trained the discriminator, and the question is whether we can leverage the discriminator to help inference. The high-level idea is that the discriminator can provide a reward signal, and if we follow the reward signal it will guide the inference to some high-reward region, which helps to get a high-quality sample. So what we did is that, when we do inference, we take the discriminator output, transform it into a reward signal, and inject the reward signal into the sampling steps, just like this one and this one. This actually improves the sample quality a bit.
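One way to picture injecting a reward signal into a sampling step is an exponential tilt of the student's proposal probabilities. The sketch below is only an illustration of that idea, not the sampler used in the talk: the function name `reward_tilted_probs`, the use of the raw discriminator logit as the reward, and the `guidance` weight are all assumptions.

```python
import math

def reward_tilted_probs(student_probs, disc_logits, guidance=1.0):
    """Reweight candidate probabilities by exp(guidance * reward).

    Assumption: the discriminator logit of each candidate is used directly
    as its reward. guidance = 0 recovers the unguided student distribution.
    """
    weights = [p * math.exp(guidance * r)
               for p, r in zip(student_probs, disc_logits)]
    total = sum(weights)
    return [w / total for w in weights]

# Three hypothetical candidates for the next state of one sampling step.
student_probs = [0.5, 0.3, 0.2]
disc_logits = [-1.0, 0.0, 2.0]   # higher logit = looks more like real data

tilted = reward_tilted_probs(student_probs, disc_logits, guidance=1.0)
unguided = reward_tilted_probs(student_probs, disc_logits, guidance=0.0)
```

With a positive `guidance`, probability mass moves toward candidates that the discriminator scores highly, which is the "high-reward region" effect described above.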
Yeah. So these are the four challenges that we faced when we proposed this distillation method. For the remaining part I would like to briefly introduce the experiments.
For a quick review of the setup: we conduct our experiments on OpenWebText. For the models, we have the teacher model and the student model, which both have the same 169 million parameters. This is a pretty small model. For the discriminator, the teacher model outputs all the logits, while the discriminator only needs a binary output. So we initialize the discriminator from the teacher model and then replace the feature head, which gives 131 million parameters, smaller than the teacher model. We use H100s, and we use Adam to train the neural network.
And this is the main result that we got, just like what we showed before. The blue line is the teacher model and the red line is the model distilled by DiDi-Instruct; the green and orange lines are some other distillation methods, like consistency-based ones, which are trajectory-based distillation.
It is clear that our method always performs the best across different numbers of sampling steps. Perplexity is the metric here; like I mentioned before, the smaller the better.
I have a question.
Mhm.
Can you give us an intuition of what perplexity measures? I understand lower is better, but what is the thing that it is measuring?
Okay, that's a good question. So for perplexity you have a reference, and the reference is an advanced autoregressive model. When we calculate perplexity we do something like cross entropy. From left to right, we compare each token against the advanced large language model to check whether they agree or differ. If they agree, that means good performance; if they differ, the perplexity will increase. So we require the student model to generate a sequence of tokens, and we compare it against an advanced large language model to check whether they have the same quality. Does that make sense?
Kind of. But in this case, when we're talking about language, there are many possibilities. For example, there would be many ways of answering a question that may be valid, right? So in that case, wouldn't doing a comparison word by word be a bit misleading?
Yeah, that's a good point. I would say perplexity is a fairly naive way to measure performance, and there are some other metrics that may be good alternatives to perplexity. Actually, if we want to measure whether it performs well in the formal way, we need to do some downstream tasks: we first have the backbone, then we train something like a feature head, and we test whether this backbone maintains good sample quality, something like that. So I would say perplexity is a very simple measure, but I think it is still a good indicator that our distillation method achieves good performance. Yeah.
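To make the "something like cross entropy" remark concrete, here is a minimal sketch of the usual perplexity formula: the exponential of the average negative log-probability that a reference model assigns to each generated token. The per-token probabilities below are made-up numbers, and which reference model the experiments used is not stated here.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical log-probabilities a reference model assigns to four tokens.
confident = [math.log(0.5)] * 4   # reference model finds the text plausible
surprised = [math.log(0.1)] * 4   # reference model finds the text unlikely

ppl_confident = perplexity(confident)
ppl_surprised = perplexity(surprised)
```

Text that the reference model finds likely gets a low perplexity, which is why smaller is better. It also shows the limitation raised in the question: a valid answer that the reference model happens to find unlikely is still penalized.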
I think that makes sense, thank you for explaining. And maybe I have another question here. If I understand correctly, when you have this plot of the trajectory, you said that the teacher model will go tangent to the trajectory, but your model will somehow cut it, take some kind of shortcut, right? But if the blue line gets you to the end point of the trajectory, and the red one also gets you to the end point of the trajectory, why can you achieve better performance with the red one, if the end point of the trajectory is the same?
Okay. I see, this is a good point. If we go back to the framework, when we train we actually replace the teacher model with data. So when we train the model we use some data as a reference. That's why it can beat the teacher model. Does that make sense?
Yeah. Yeah, that makes a lot of sense.
Okay. Yeah. Great.
Yeah. So I think there is something misleading here. When we train the model, the teacher's reverse process is pretty slow, so we actually replace it with the data and then corrupt the data. That makes the training much faster.
Yeah, this is a good point.
Thank you.
Okay.
Yeah. So later we also did a lot of ablation studies to test the different components of our DiDi-Instruct framework because, like I mentioned before, we introduce a lot of components, like GPO, score decomposition and this reward guidance, so we need to test whether each component works well. For Table 1, we start from a baseline without any tricks, without things like GPO or score decomposition, and we test the performance from 8 steps to 128 steps. Then for each row we add one technique that we mentioned before, and finally the last row adds all the techniques. This is the performance that we get, and you can see from Table 1 that each component contributes to a better performance of the diffusion language model.
Table 2 is another ablation. We start from the last line here, which is the baseline; it has the same result as the last row of Table 1, so we have all the tricks here. What we did is remove one trick while keeping all the other tricks. We also test the performance from 8 steps to 128 steps.
Okay. So later we also did some model scaling, because nowadays for large language models most people are interested in whether a method can scale up and how to scale it up. So we also did a little bit of scaling, up to 400 million parameters, and we again test perplexity and entropy. For perplexity, the smaller the better; for entropy, the larger the better. The red line here is the result when we scale up. You can see that when we scale it up it actually performs better.
We also have some other applications, like protein sequence generation. That is because DiDi-Instruct is not specific to diffusion language models; it can be applied to discrete diffusion in general. That's why we apply it to different applications like protein sequence generation. Due to the time limit I will skip the details here, but in general it performs well when we apply it to different applications.
Hi, sorry, I have a question.
Mhm.
I just wanted to ask, do you know what protein this is?
Sorry, I'm not responsible for this experiment, so I'm not quite sure. I just know this was done by the second author. He used the model from TikTok, and they have, yeah.
Mhm.
Also, another question I wanted to ask: how do you make sure that it is preserving the functional constraints that are there in a protein structure? Do you validate whether it does or not?
This is a good question. I think we have some metrics to validate the details of how the protein maintains those constraints. Like I mentioned, I am not responsible for this experiment; you may need to go into the paper to see. I believe they did some work to validate that.
Okay. Yeah.
Thanks for the question.
Mhm.
Yeah. So the last part is that there are some promising directions that we are exploring at this time. The first: like I mentioned before, diffusion language models are compatible with image and video, and we are doing some text-to-image and text-to-video generation. We are also interested in scaling up, to a 7-billion-parameter diffusion language model like LLaDA. I think the main challenge here is infrastructure. Previously, with a small model like 169 million parameters, we could use distributed data parallel to distill the model, and one GPU could hold the entire model. But if we scale up to 7 billion, I believe one GPU cannot hold such a large model, so we need to do some tensor decomposition or state decomposition. That is the infrastructure optimization that challenges us in this project.
Another promising direction is called block diffusion. Nowadays autoregressive models have very good infrastructure optimization, and we want to leverage the advantages of both autoregressive models and diffusion language models. That's why block diffusion is getting popular. The main idea is that we decompose this 1,024-token sequence into blocks, like 32 blocks where each block has 32 tokens. We start from the first block, which goes from the first token to the 32nd token, and within this block we use diffusion to generate the tokens. Once it's complete we move to the next block, the 33rd token to the 64th token, and again use diffusion to generate all its tokens, then we move to the next block, and so on.
The good thing is that once we complete the first block, the second block can use the KV cache that we have from the first block. This can actually give good inference speed when we use block diffusion. And if we bring in DiDi-Instruct here, we can allow an autoregressive model to generate 32 tokens in one step, which means a 32 times speedup compared to a standard autoregressive model. And I think that's all I want to share today. Thanks for listening, and I'm happy to answer any questions.
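The block layout described above can be written down directly. This is a small sketch of the indexing only, assuming a 1,024-token sequence split into blocks of 32 tokens; the function name `block_ranges` is hypothetical and no actual model or KV cache is involved.

```python
def block_ranges(seq_len=1024, block_size=32):
    """Return (start, end) token index pairs, end exclusive, one per block."""
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    return [(start, start + block_size)
            for start in range(0, seq_len, block_size)]

blocks = block_ranges()

# Blocks are generated left to right (autoregressive across blocks), while
# the tokens inside one block are produced jointly by diffusion. When block i
# starts, the tokens of all earlier blocks are final, so their keys and
# values can be cached; this is the length of that cached prefix per block.
cached_prefix_lengths = [start for start, _ in blocks]
```

If each block is produced in a single step, the whole sequence takes 32 steps instead of 1,024 token-by-token steps, which is the 32 times figure quoted in the talk.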
Thank you so much for your great talk.
So now we open the floor for questions.
Do we have any other questions for the speaker?
Uh, yes.
Hi, I'm a postdoc. My name is Jim. I'm trying to understand the big picture. So here basically a teacher model is given, and by using a technique like distillation you can reduce the number of model inference calls. So that's the idea?
Yeah, that's right.
So from like 1,000 to like 16?
Yeah. Just like this figure, the red one here. We want to have fast inference.
Okay. So if the number of function evaluations is like 16, at the beginning it is fully random, and you need 16 steps to get the final sequence. Is that the right way to interpret it?
Yeah, I think so. I would say we first fix the sequence length, like 1,024. With an autoregressive model you have to sample token by token, so it requires 1,024 function evaluations (NFEs) to complete the sampling process. For masked diffusion, if we sample two tokens every time, it requires 512 NFEs to get the sample. And with this 64 times speedup, it requires like 16 NFEs to generate the sequence.
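The arithmetic in this answer can be checked in a few lines. The sketch assumes a fixed sequence length of 1,024 tokens and counts one function evaluation per sampling step; the helper name `nfe` is hypothetical.

```python
import math

def nfe(seq_len, tokens_per_step):
    """Number of function evaluations needed to fill seq_len tokens."""
    return math.ceil(seq_len / tokens_per_step)

seq_len = 1024
nfe_autoregressive = nfe(seq_len, 1)      # one token per model call
nfe_masked = nfe(seq_len, 2)              # two tokens unmasked per call
nfe_distilled = nfe_autoregressive // 64  # the quoted 64 times speedup
```

Here the speedup is measured in model calls relative to token-by-token autoregressive sampling, not in wall-clock time.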
Yeah. Yeah. Got it. Thank you.
Mhm.
Thank you. Thanks for the question.
Do we have any other questions?
Oh yeah, it's very good. Can you give us a feel for the time cost, from the smallest model, to 400 million parameters, to what you're targeting now, on an H100? Is it one day of training? Is it one week? What is it?
Yeah, that's a good question. Because it is actually a very small model, like 169 million parameters, it takes just one GPU hour to complete the task.
Oh, one GPU. I didn't see that. Really? One GPU. So when you go to 400 million parameters, how does the scaling go? Is it linear? Is it quadratic?
I remember it takes like two hours, probably two or three hours. It does not take too much. For example, for the previous consistency model, I believe to get this performance they require at least 20 or 30 hours. So the training is pretty fast for our DiDi-Instruct method.
So it was 30 hours and it goes down to one hour.
That's right.
So you said you have a 64 speedup, right? 64 times speedup.
Probably, probably worst case. Yeah.
Mhm. Yeah. I believe the reason they require so many hours is that they do something like progressive training. They start from a very small step to match the teacher model and gradually increase the step. That means it takes like six or seven runs to train the model, and six to seven runs require a lot of time. Yeah.
And how about the protein generation, the example that you showed? How long does that take?
Sorry, I did not ask the second author, though I believe it is also pretty fast. I remember it took like two or three weeks to complete the test. So I would say maybe one or two hours to complete the distillation. I do not have the exact number. Sorry.
But here, because I did ask you about the constraints: this is unconditional generation. Does unconditional mean that you don't really obey any constraints?
Yeah. That's right. There's no prompt or anything.
All right. Thanks. Great talk. Thank you.
Thanks for the question.
Yeah.
Thank you, Professor Karniadakis.
Yeah.
Yeah. Mhm.
Do we have any other questions? Because I want to close this session.
Thank you so much for your great talk.
And now it's time to move on to our second speaker. Steven, can you hear me?
Our second speaker is Steven Masala. He will be giving a talk on hybrid twinning using PBDW and DeepONet for the effective state estimation and prediction of partially known systems. He was born and raised in Congo-Brazzaville, where he completed most of his education before moving to France to pursue his undergraduate and graduate studies. He's currently a final-year PhD candidate in a joint program between École Normale Supérieure Paris-Saclay and Nanyang Technological University (NTU), Singapore, working on the DesCartes project, which aims to develop digital twinning technologies to support decision making in urban contexts. With that short introduction, the floor is yours and you can start your presentation.
Thank you, Naz.
Yeah.
Yeah, so she already presented me. I'm actually a third-year PhD student, but in France we usually do short PhDs, so it's my final year, and I'm going to defend my PhD in two months, actually. This research is based on some work that I did in the first year of my PhD.
I'm trying to make it as short as possible. Naz, is there any constraint on the time? Can I make it shorter?
Yes, you are the second and last speaker, so take your time. It usually goes for one hour, but you can present for like 45 minutes and then have a question and answer session.
Okay. I actually made it so that it goes quite fast. I synthesized it, but anyone can stop me at any time to ask a question; feel free to do so. So the context of my research and my PhD is this program called DesCartes, a collaboration between France and Singapore. They aim to create a digital twin for urban contexts. They want this twin to help them make decisions in complex systems, in real time, with a lot of uncertainty and things that we don't know well. Most of the use cases in the project are related to energy, drone trajectories and remote sensing. So there's a lot of structural health monitoring, drone trajectory planning where you update the trajectory of the drone in real time, and some energy grid model updating.
In this presentation I'm showing one project that we did during the first year of my PhD, but which was also applied to a drone case later, and I will show that at the end.
So a hybrid twin, in the way that we define it, is this additive model combination between a physics-based estimate and a data-driven correction, where you have a physics-based model. Okay, can you see the mouse moving?
Not really.
Oh, okay. So you have a physics-based model that makes an estimate, and a data-driven model that corrects it based on experimental data. That's the idea of the twin, and this data-driven model can be a machine learning model where you learn the correction. The idea is to have it in real time and take a prediction and correction approach on systems like that. In the context of this research we have an environment that we don't know perfectly: we might have wind that is changing, or some assumption that we made on the physics that is not correct, and we will try to take the best modeling approach to take this uncertainty into account.
The idea at the end is to have this hybrid twin that you can control, and then you can apply it to dynamical systems that are changing as well.
Yeah. So the construction of the hybrid twin, as I said, is this additive approach, and it usually comes with a lot of challenges. For example, when you are combining the two elements, the physics-based approach and the data-driven correction, you want some complementarity. You don't want a data-driven correction that adds information the physics already has. This orthogonality, or complementarity, is what intrigued me the most when I started this PhD, and I really wanted to work on it. How do you actually work on that, and how do you ensure that what you are correcting complements what exists? The idea is to have it in a control loop later, so that you have a model that gets better and better over time as you are controlling it. So the combined model layer is the twin.
The one that we build in this research is based on the idea of the PBDW. This approach is called parametrized-background data-weak; we inherit from it and then build on top of it, because this approach is very interesting. It was built by Yvon Maday, Anthony Patera and co-authors in 2015.
This is a data assimilation task where they try to reconstruct the state u of a system as an additive combination of a physics-based estimate and a data-driven correction, so in the same spirit as the hybrid twin that I just introduced. The originality here is that in their formulation they managed to have a naturally orthogonal combination, by the way it's built. Here the eta, which is red, and z live in two spaces that are orthogonal. That allows you to correct and enrich the space of the physics-based estimate, which is blue. We wanted to build on top of this because of the mathematical properties of the approach, which give you a good view on the convergence of the error estimate. It also comes with a built-in a priori error estimate that we can quantify thanks to the inf-sup stability constant, the beta here, and it comes with a way to select the sensors that maximize the enrichment of the space. So it's a very interesting approach, and it was extended to time-dependent problems by Willy Haik and Ludovic Chamoin in 2023, which was the PhD thesis right before me. They extended it to real time, to deal with biases that change and how you correct them in real time. So this is what I will be building on top of.
Please, you can stop me at any time if you have a question or if you want me to clarify anything.
So here we will work with two spaces. The blue space, which is the background space, is going to be the physics-based estimate, and the red space is going to be the observation space. The observation space is spanned by the Riesz representers of the experimental data. We assume here that we have a lot of experimental data and an incomplete physics-based model, so a physics model that does not completely represent reality. We might have made a lot of assumptions, or we don't understand the physics properly. So just estimating the state with that physics-based model would give a wrong estimate, correcting it with the experimental data would give a good estimate, and adding more experimental data would give a better and better estimate. That's the idea. There is also a way to deal with noise, because even though the physics is wrong, the experimental data might be noisy. So in the optimization problem that they solve to find z and eta, there is a way to tune the regularization hyperparameter to balance the trust between the incomplete physics and the noisy data, by following a kind of Morozov discrepancy approach.
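For readers who want the optimization problem behind this trade-off, the regularized PBDW statement is commonly written as below. This is a sketch taken from the general PBDW literature, not from the speaker's slides, and the symbol names are assumptions: $\mathcal{Z}_N$ is the background (physics) space, $\mathcal{U}$ the ambient Hilbert space, $\ell_m$ the sensor functionals, $y_m$ the sensor readings, and $\xi$ the regularization weight.

```latex
\[
(z^\star_\xi,\ \eta^\star_\xi)
  \;=\; \arg\min_{(z,\eta)\,\in\,\mathcal{Z}_N \times \mathcal{U}}
  \;\; \xi\,\|\eta\|^2
  \;+\; \frac{1}{M}\sum_{m=1}^{M}\bigl(\ell_m(z+\eta) - y_m\bigr)^2,
\qquad
u^\star_\xi \;=\; z^\star_\xi + \eta^\star_\xi .
\]
```

A large $\xi$ penalizes the correction and so trusts the physics more; letting $\xi$ go to zero enforces the sensor readings exactly, which is the noise-free case.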
Steven, can you say something about the two spaces?
Yes. So the first space, the physics-based space, is built with a reduced-order model. We use the reduced basis approach: you have this parametric domain, and you solve your model many times, for different values of the parameter that cover your parametric space.
What is it? Is it a Hilbert space, a Banach space? Is it H1? What is it? The space, or a subspace of that?
Yes. So U is, let's say, the physics-based space that you can approximate with finite elements. The space in which you define your solution is going to be the same space in which you build the physics-based space.
So for example H1?
For example H1. Yeah.
All right. And then the data space?
The data space. I actually have the next slide for that. For example, here you have the physics space on the left, which are the different modes: first mode, second mode, third mode, fourth mode. And the second space, the observation space, is made of the Riesz representers of the experimental data. Here we model the sensors as Gaussians: you have the first sensor, the second sensor, the third sensor, and the observation space is the span of those Riesz representers. That basically tells you where you need to correct the information. So those are the two spaces that we will use in this presentation.
But is there noise in the data, or do you not assume any?
Sorry?
The data are noisy, so on the observation space you can have stochasticity. How does that come in?
Yes, so we assume that the sensors are noisy. That's why in the optimization we have a regularization, the xi, in the second line, the online implementation, if you can see the arg.
So uncorrelated noise?
Uh, yes.
Okay.
Yes.
And then the saddle point: the solution of this problem reduces to this linear system, the saddle-point problem, with a first line and a second line. If you look at the second line, the projection of eta on B gives zero, and in B you have all the physics modes. That tells you there is orthogonality between the data-driven correction and the physics-based space, with every single one of its modes. And y is the experimental data. So we fix the experimental data as a forcing, we enforce it, and then we find the physics estimate and the correction such that the sum of them gives the experimental data.
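The saddle-point structure just described can be reproduced on a toy discretized problem. The sketch below is an illustration under stated assumptions, not the speaker's implementation: the state lives on a 1D grid with the plain Euclidean inner product, the physics modes are sines, the sensors are Gaussian bumps, and the names `pbdw_solve`, `Z`, `Q` and `xi` are hypothetical. The block matrix uses the usual regularized form, with `xi = 0` giving the noise-free case.

```python
import numpy as np

def pbdw_solve(Z, Q, y, xi=0.0):
    """Solve the PBDW saddle-point system in a discretized space.

    Z: (d, N) background (physics) modes as columns.
    Q: (d, M) Riesz representers of the sensors as columns.
    y: (M,) sensor readings.
    xi: regularization weight; 0 enforces the readings exactly.
    """
    M, N = Q.shape[1], Z.shape[1]
    A = Q.T @ Q   # Gram matrix of the representers
    B = Q.T @ Z   # coupling between sensors and physics modes
    K = np.block([[A + xi * M * np.eye(M), B],
                  [B.T, np.zeros((N, N))]])
    rhs = np.concatenate([y, np.zeros(N)])
    sol = np.linalg.solve(K, rhs)
    eta = Q @ sol[:M]   # data-driven correction
    z = Z @ sol[M:]     # physics-based estimate
    return z, eta

x = np.linspace(0.0, 1.0, 200)
Z = np.stack([np.sin((k + 1) * np.pi * x) for k in range(3)], axis=1)
centers = np.linspace(0.1, 0.9, 8)
Q = np.stack([np.exp(-((x - c) / 0.03) ** 2) for c in centers], axis=1)

# Made-up "true" state: one physics mode plus a term the physics misses.
u_true = np.sin(np.pi * x) + 0.3 * x * (1.0 - x)
y = Q.T @ u_true

z, eta = pbdw_solve(Z, Q, y)
```

The second block row of the system is what forces `Z.T @ eta` to vanish, so the correction is orthogonal to every physics mode regardless of where the sensors are placed, while the first block row makes the sum `z + eta` reproduce the sensor readings.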
Thank you.
And here the quality of the estimate depends on the stability constant, the inf-sup constant, the beta here. The bigger the beta, the closer to one, the better the estimate we have. So in the way we choose the sensors, we also use the inf-sup constant to choose the next sensor that maximizes the information gain from the correction. We follow this algorithm that was built by, I don't remember who exactly, but it was in 2015 in the same paper as Maday, which relies on this greedy stability and approximation approach, where you maximize the stability and minimize the approximation error every time you choose a new sensor, and that allows you to better correct and cover the physical space.
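In a discretized setting the stability constant can be computed from the two bases. The sketch below assumes a Euclidean inner product and full-rank bases with at least as many sensors as modes; under those assumptions beta is the smallest singular value of the product of the two orthonormalized bases. The function name `inf_sup_constant` and the random test data are hypothetical.

```python
import numpy as np

def inf_sup_constant(Z, Q):
    """Smallest cosine between the background space and the observation space.

    Z: (d, N) background modes, Q: (d, M) sensor representers, with M >= N.
    Returns a value in [0, 1]; 1 means every physics mode is fully observed.
    """
    Zo, _ = np.linalg.qr(Z)   # orthonormal basis of span(Z)
    Qo, _ = np.linalg.qr(Q)   # orthonormal basis of span(Q)
    singular_values = np.linalg.svd(Qo.T @ Zo, compute_uv=False)
    return float(singular_values.min())

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3))

# Sensors whose span contains the physics modes: perfectly stable.
Q_ideal = np.hstack([Z, rng.standard_normal((50, 2))])
beta_ideal = inf_sup_constant(Z, Q_ideal)

# Sensors unrelated to the physics modes: poorly stable.
Q_random = rng.standard_normal((50, 5))
beta_random = inf_sup_constant(Z, Q_random)
```

A greedy sensor selection in this spirit would evaluate candidate sensors and keep the one that raises beta the most; that loop is left out here.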
So I will just give an example of solving a PDE with the PBDW that I just presented, which is not what I built but what we are building on top of.
So I have a question on that. If you don't move the sensors, if you don't do adaptive sampling using this greedy method, then you cannot guarantee that the spaces will be orthogonal?
The spaces are going to be orthogonal no matter the choice of the sensors. It's by construction, here in the saddle-point construction.
Right, the saddle point gives you the orthogonality.
Yes. But the question is interesting, because if we choose them well then we need fewer sensors, and we actually have a unique solution, a better condition on the uniqueness and the existence.
But if beta is very small, let's say 0.01, that's bad, so you have to do adaptive sampling to improve it.
Yeah.
Exactly. Yes. If beta is very bad we need more sensors to improve the beta.
Yeah.
And if it's good then we need fewer sensors. But the condition here to have at least existence of the solution is that we need at least as many sensors as physics modes. So here the physics space is a reduced-order space, and we use N modes and M sensors. By using this algorithm we can use only N sensors and it will work, but if the choice is not optimal then we might need M bigger than N to have existence at least. The idea is that if we have fewer sensors, then we don't have enough sensors to understand the physics modes, so the physics modes are less and less observable and we might have instability. That's the idea.
So, the numerical example, just to show it. Here I chose one example, which is the Helmholtz equation, which is actually the initial problem that they chose in the paper. The physical domain is this square domain, and then you have the Helmholtz equation, and then we have a bias. The bias is introduced because we are making the assumption that the physics is imperfect; we don't know the physics perfectly. Here the bias comes in the source. The source is q, on the right-hand side of the equation, and in that source we introduce a bias g. That bias represents our lack of understanding of the physics, and the red dots are some of the experimental data. We also introduce another bias, a bias in the boundary conditions. Let's say the reality is that those boundary conditions are Dirichlet conditions, but when you are solving you don't know that, and you solve with Neumann. So that's also a misconception of reality. So we have different biases. This is the problem that I will solve with AI later, but in this experiment I'm just showing the perfect example without bias. So those are the two spaces.
So, for example, for a perfect model without bias, this is how PBDW works. You have the physics-based estimate on A; this is the estimate by the reduced-order model. Then you have the correction by PBDW, which is based on the observation data. The sum of them gives you C, which is the estimate of the state using this hybrid approach. And on D you have the relative error with respect to the finite element approximation of the same state. We do that while increasing the number of modes, and we see how this approximation changes.
We fix the number of sensors to 50. So we have 50 sensors and two POD modes, and by increasing the number of POD modes we see that the error gets more and more local: we actually need more sensors, and the error increases where we are far from the sensors. Yellow means the error is big and red means the error is low. If you look at the yellow areas, most of them are between sensors.
This is what I was talking about with instability as well. By increasing the number of modes for a fixed number of sensors, we become unstable, but if we add sensors it gets better, especially in the yellow areas. That's actually one of the criteria used in sensor selection: stability maximization and approximation error minimization.
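As a rough illustration of the experiment described above, the noise-free PBDW estimate can be written as one saddle-point solve. This is a minimal sketch under stated assumptions, not the speaker's implementation: it uses a Euclidean inner product, point-evaluation sensors, cosine modes instead of POD modes, and a made-up "true" state. The names `pbdw_solve` and `stability_constant` are hypothetical.

```python
import numpy as np

def pbdw_solve(Z, Q, y):
    """Noise-free PBDW estimate in a Euclidean setting (illustrative only).

    Z: (n, N) physics-based modes, Q: (n, M) Riesz representers of the
    sensors, y: (M,) observations. Returns (physics part, data correction).
    """
    M, N = Q.shape[1], Z.shape[1]
    A = Q.T @ Q                      # Gram matrix of the representers
    B = Q.T @ Z                      # sensors applied to the modes
    K = np.block([[A, B], [B.T, np.zeros((N, N))]])
    sol = np.linalg.solve(K, np.concatenate([y, np.zeros(N)]))
    return Z @ sol[M:], Q @ sol[:M]

def stability_constant(Z, Q):
    """Inf-sup constant: smallest singular value between span(Z) and span(Q)."""
    Zo, _ = np.linalg.qr(Z)
    Qo, _ = np.linalg.qr(Q)
    return np.linalg.svd(Qo.T @ Zo, compute_uv=False).min()

# Assumed 1D setup: 200 grid points, 50 point sensors, up to 10 cosine modes.
n, M = 200, 50
x = np.linspace(0.0, 1.0, n)
sensor_idx = np.linspace(0, n - 1, M).astype(int)
Q = np.eye(n)[:, sensor_idx]
modes, _ = np.linalg.qr(np.cos(np.pi * np.outer(x, np.arange(10))))

u_true = np.sin(3.0 * x) + 0.2 * x**2      # stand-in for "reality"
y = Q.T @ u_true                            # sensor readings

betas = []
for N in (2, 4, 6, 8, 10):
    Z = modes[:, :N]
    z, eta = pbdw_solve(Z, Q, y)
    betas.append(stability_constant(Z, Q))
    err = np.linalg.norm(z + eta - u_true) / np.linalg.norm(u_true)
    print(f"N={N:2d}  beta={betas[-1]:.3f}  relative error={err:.3e}")
```

In this sketch the estimate `z + eta` reproduces the sensor readings and `eta` stays orthogonal to the modes. Because the mode spaces are nested, the stability constant cannot increase as modes are added for a fixed sensor set, which is the effect the sensor-selection criterion tries to counter.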
Okay. So now I will introduce the contribution that we made during my first year of the PhD. The idea is to say: we have a model, the physics-based model, and we will not replace the physics-based model with an AI approach in this context. We will instead replace the data-driven correction model with an AI approach. But the condition is that we want to learn only what we don't know, so we will force the AI approach to live in the orthogonal complement of the physics-based space so that we can correct efficiently.
So here the new problem is that this twin is going to be the physics-based estimate plus an AI correction, and the AI correction is going to be a DeepONet. Why? Because I went to the PINNs summer school in 2022 in Stockholm, and that's where I got very curious about DeepONet, and I learned it there. I remember I was asking a lot of questions to Raj and George about this orthogonality, and I spent most of my time after that working on this problem.
It was interesting in this case because in DeepONet you have two AI models that you are projecting onto each other, and the second model actually gives a set of vectors that you learn at the same time as you are learning the coefficients. In this case, since we know the space in which we want to live, which is the space of the ignorance, we can actually replace the trunk with a set of bases that we build ourselves. So that was the idea here: the way to learn something that you don't know is to build a set of vectors that are orthogonal to what you know and force your AI to live inside it.
You have two ways to do it. The first way is the physics-informed one: you could penalize the lack of orthogonality to the physics-based estimate. Or you could simply do a physics-constrained approach, where you just replace the trunk with a set of bases, which is what I'm showing here. If you want the paper, you should be able to find it using the title of my presentation. In the paper I discuss both approaches, but here I'm just discussing the physics-constrained one because it showed better performance.
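To make the "build a set of vectors orthogonal to what you know" step concrete, here is a minimal sketch of a sequential Gram-Schmidt construction. It is an assumption about how such a basis could be built, not the paper's code: `Z` stands for orthonormal physics-based modes, `Q` for sensor representers, both random here, and `orthogonal_complement_basis` is a hypothetical name.

```python
import numpy as np

def orthogonal_complement_basis(Z, Q, tol=1e-10):
    """Sequential Gram-Schmidt on the columns of Q.

    Each returned column is orthogonal to every column of Z (assumed
    orthonormal) and to all previously built columns.
    """
    basis = []
    for q in Q.T:
        v = q - Z @ (Z.T @ q)            # remove what the physics already explains
        for w in basis:
            v = v - w * (w @ v)          # remove directions already in the basis
        norm = np.linalg.norm(v)
        if norm > tol:                   # skip vectors that bring nothing new
            basis.append(v / norm)
    return np.column_stack(basis)

rng = np.random.default_rng(0)
n, N, M = 100, 5, 12                     # grid size, physics modes, sensors (made up)
Z, _ = np.linalg.qr(rng.standard_normal((n, N)))
Q = rng.standard_normal((n, M))

W = orthogonal_complement_basis(Z, Q)

# Any coefficient vector (e.g. a network output) gives a correction
# that has no component along the physics-based modes.
coeffs = rng.standard_normal(W.shape[1])
correction = W @ coeffs
print(np.abs(Z.T @ correction).max())
```

With a fixed basis like `W` in place of the trunk, a network only has to output the coefficients, and the orthogonality to the physics-based space holds for every output rather than on average.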
So the idea here is to say that the bias that we build here, in red, is a linear combination of coefficients and vectors. Here the vectors are the Riesz representers, and the coefficients are how much weight you put on each of them. The Riesz representers are constructed offline, so we don't need to relearn them. So the DeepONet here is only learning the coefficients, and then the output is projected on the Riesz representers.
What we say is that if the output of the DeepONet is a combination of those coefficients on the orthogonal complement of the physics-based modes, then the bias is always going to be orthogonal to the physics-based estimate, and that's how we constructed it. By construction we have a constraint that holds all the time, compared to a physics-informed orthogonality where you would have something that only works on average. Here you have a constraint that works all the time: you train with the constraint, at inference you have the constraint, and it's working all the time. That was very important for us to guarantee, so that we correct efficiently.
And then the loss is just the classical MSE loss, and the training data was different values of the data-driven correction. So we would solve the PBDW many times, we get different values of that data-driven correction, and then we train the AI by minimizing the L2 distance to those data.
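The difference between the constraint that holds "all the time" and the physics-informed penalty can be sketched in a few lines. Everything here is an illustrative assumption: random orthonormal modes `Z`, a fixed complement basis `W` obtained by QR, random vectors standing in for network outputs, and the hypothetical helpers `hard_constrained_output`, `soft_penalty` and `mse`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, K = 80, 4, 10                      # grid size, physics modes, basis size (made up)

# Orthonormal physics modes Z and a fixed basis W of their orthogonal complement.
Z, _ = np.linalg.qr(rng.standard_normal((n, N)))
raw = rng.standard_normal((n, K))
W, _ = np.linalg.qr(raw - Z @ (Z.T @ raw))

def hard_constrained_output(coeffs):
    """The network predicts coefficients only; the fixed basis does the rest."""
    return W @ coeffs

def soft_penalty(field, weight=1.0):
    """Physics-informed alternative: penalise the component along Z."""
    return weight * np.sum((Z.T @ field) ** 2)

def mse(pred, target):
    """Classical mean-squared-error loss on the correction field."""
    return np.mean((pred - target) ** 2)

coeffs = rng.standard_normal(K)          # stand-in for an untrained network output
constrained = hard_constrained_output(coeffs)
unconstrained = rng.standard_normal(n)   # stand-in for a free-form network output

target = W @ rng.standard_normal(K)      # stand-in for one PBDW correction sample
loss = mse(constrained, target)

print("penalty, hard-constrained:", soft_penalty(constrained))
print("penalty, unconstrained:   ", soft_penalty(unconstrained))
```

The hard-constrained output has a penalty at round-off level even before any training, whereas a free-form output needs the penalty term in the loss and only satisfies it approximately.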
Yeah. So the classical DeepONet is on the left, and the contribution of this paper is on the right, where we replace the trunk with a set of bases that we build, and then the output is projected on the Riesz representers to reconstruct the scalar field of the correction. Then when we sum both of them we get the estimate with a guarantee on... Yeah, so that's the approach in this paper that I wanted to share here.
In terms of advantages, what it brings compared to PBDW is that instead of solving the saddle-point problem, with its complexity in M and N, we can solve on the reduced-order space only, and then the AI makes an approximation. So the inference time goes almost to zero. It goes very, very low, especially for a system where we have a lot of data; that brings the computational time very low, and that's interesting.
I also wanted to say that all this is possible because this saddle-point problem can be decomposed into two problems: one constrained problem and one linear problem. The linear problem is easy to solve, and the constrained problem is easy to solve as well. So instead of solving the saddle-point problem, my approach reduces to two steps: we do the inference on eta, and then we solve the constrained problem using the AI inference. So the problem is actually equivalent.
And then we actually tried it on some cases. So, still on the Helmholtz equation, here the bias was on the boundary condition: instead of having a Dirichlet condition we used the Neumann one. The prediction of the bias using the approach is on A, the correct bias is on B, the physics-based estimate is on C, and then on D you have the hybrid twin using this DeepONet approach and the orthogonality. On E you have the PBDW without DeepONet, and on F you have the ground truth using linear polynomial shape functions with FEM.
In this case I didn't use a ton of data to train this. I used, I think, maybe only 1,000 pairs of data, and it already showed quite interesting results. It could learn that we are wrong on the boundary condition; you can see that the value is different on the border. In terms of error, we were around 7% in L2 error on the field, compared to 5% for the classical PBDW. So PBDW did better; the DeepONet version was faster and 2% worse in this case.
And then in this case it was a combination of boundary condition and source: a wrong source and a wrong boundary condition. The prediction of the bias using DeepONet plus PBDW is on A, and the true bias using PBDW is on B. You can see that the DeepONet with the strong orthogonality managed to get this source and boundary condition error, with some slight mistakes in the left corner. On D you have the hybrid twin using DeepONet plus PBDW, E is the classical PBDW, and F is the ground truth using FEM.
In terms of error, here we were super close to the PBDW. I think it was 2% and 4% compared to the ground truth FEM.
This work was extended to time-dependent problems. I also worked on an encoder-decoder approach to do the same thing without DeepONet, and then it was also extended to time-dependent problems using a transformer to see if we can learn some encoding.
And then it was applied. What I've been working on recently is to apply it on a drone. The idea is, let's say you have a drone and then you have a change of environment. How do you actually update your model using the real-time information? So we also applied the PBDW on this.
The idea was to do Bayesian model updating. In this case we have an inertia that's changing. For example, one of the motors is going to break at some point, or maybe it hits something and the inertia changes. Now you have a bias on the inertia because you didn't model the inertia perfectly. So the idea is to say, okay, we can use this dual Kalman filter where you have a filter that's tracking the parameters, and the observation is actually the PBDW. So instead of having a fully data-based observation in the Kalman filter, we have a hybrid physics-and-data observation that is correcting the bias, and then we update the parameters in real time.
We tried it on this case. For example, the curve on the right is how the bias is estimated in real time and how the bias contributes to the state estimate of the drone in a 25-second simulation. Here the case was, let's say you have your drone following a fixed trajectory from point A to point B, and then halfway, at 10 seconds, the inertia suddenly drops. How do you adapt to it super quickly to regenerate a trajectory that you can still follow using MPC? So that's one of the cases where we use this PBDW in the observation of a Kalman filter to correct it. You can see that the three components of the inertia were updated quite fast to allow the MPC control to be done on the trajectory that we defined.
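The parameter-tracking half of such a dual Kalman filter can be sketched with a scalar example. This is a toy under loudly stated assumptions, not the drone model from the talk: a single parameter follows a random walk, the measurement model `y = H * theta + noise` is hypothetical, the observations are plain noisy data rather than PBDW estimates, and all numbers are invented apart from the 25-second horizon and the drop at 10 seconds.

```python
import numpy as np

def track_parameter(H, y, theta0, P0, q, r):
    """Scalar Kalman filter with a random-walk model for the parameter.

    Hypothetical measurement model: y_k = H_k * theta_k + noise.
    q: process-noise variance, r: measurement-noise variance.
    """
    theta, P = theta0, P0
    history = np.empty(len(y))
    for k in range(len(y)):
        P = P + q                                  # predict: the parameter may drift
        gain = P * H[k] / (H[k] ** 2 * P + r)      # Kalman gain
        theta = theta + gain * (y[k] - H[k] * theta)
        P = (1.0 - gain * H[k]) * P                # update the variance
        history[k] = theta
    return history

rng = np.random.default_rng(2)
dt, T = 0.01, 25.0
t = np.arange(0.0, T, dt)
theta_true = np.where(t < 10.0, 1.0, 0.6)          # parameter drops at t = 10 s
H = 1.0 + 0.5 * np.sin(t)                          # assumed known regressor
noise_std = 0.05
y = H * theta_true + noise_std * rng.standard_normal(t.size)

estimate = track_parameter(H, y, theta0=1.0, P0=1e-2, q=1e-4, r=noise_std**2)
print("estimate just before the drop:", estimate[t < 10.0][-1])
print("estimate at the end:          ", estimate[-1])
```

The process-noise variance `q` controls how quickly the estimate reacts to the drop: a larger value adapts faster but follows the measurement noise more closely.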
Yes, that's one of the use cases that I wanted to show: the Helmholtz and this one. And then in terms of computational time, doing this without PBDW on the drone would take a little bit longer. PBDW would gain a few seconds or milliseconds on the control, because you're solving an optimization problem at every time step.
Yes, that's it. That was my contribution for today. I would love to take any questions or contributions. Yes.
Yeah. Thank you so much. Do we have any questions for the speaker?
I can see only George.
Maybe I have a question.
Yeah.
Hi Stephen, that was a really good presentation. Thank you so much for sharing with us.
Thank you. Thank you.
Now, one of the things that you were mentioning was: did you choose this way of imposing the trunk because you wanted to impose orthogonality, right? And then you said that there are two ways of doing it. One was with this physics-based approach where you have a soft constraint.
Mhm.
But there's also another one, I'm not sure if you compared with it, where you do some kind of QR decomposition or SVD with the DeepONet, and essentially what they're trying to do is to transform your basis to make it orthogonal. Did you compare with that framework to see...
Oh no, I've never heard of that. Can you say more? Orthogonal to what?
Yeah. So no, it's not exactly what he wants. There the basis is orthogonal to itself, but it doesn't guarantee that it's orthogonal to the physics modes. So that's a...
Okay, it could be different.
So he doesn't really care about the trunk basis, because he replaced the trunk basis.
Yes.
That's the, um...
Yeah. So you replace the trunk with a basis that is orthogonal to the physics, right? So they are not orthogonal between themselves?
They are, because I build them with a kind of Gram-Schmidt orthonormalization.
Okay.
And I build them sequentially, one by one. So I ensure that each of them is orthogonal to what I want, so the physics that we already know, and then they are also orthogonal to each other.
Okay. Okay.
But I will check the QR that you mentioned. I will read the paper. Thank you for sharing.
So this is called two-stage training of PINNs.
Okay.
Yeah. The author is Shin, S-H-I-N, a former postdoc here.
But why is the accuracy worse when you use the DeepONet?
Oh, because I didn't use a lot of GPUs. I think I just did 1,000 simulations, and then I tried it on a very simple case and it was already amazing. So I was like, oh, okay.
But fundamentally there's no reason that the accuracy would be worse, right?
Yes. Yes. It should match.
And in terms of computational complexity, if M = N, I think the speed-up factor is 8 over 3. I checked: 3 + 3 in the middle, so 6 + 1 + 1, 8. So that's 8 N cubed versus 2 N cubed.
Okay.
Sorry, N cubed, N cubed, N cubed: three. So it's 8 over 3. Yeah, so 8 over 3. Substantial, but yeah.
Yes, I was thinking about M equals N, but if M is big then it would be more interesting.
Yeah, it would be more interesting. Yes. Okay.
Do we have any other questions?
Stephen, by the way, I'm on the scientific board of DesCartes.
Oh, I saw it. [laughter] I saw it.
I never attended the meetings, but...
Oh, no.
Too busy.
I understand. Yeah, I'm finishing now. I'm going to defend in two months.
Is Paco part of your thesis committee?
I think it's Elías Cueto instead. But yes, Paco follows my work. We invited Elías instead for the committee.
Great.
But if you want to attend my thesis defense I can send you a link.
Send me. Yeah, sure.
Sure. Okay. Thank you.
Tony Patera was my adviser.
Sorry, Tony Patera?
Oh, my adviser. [laughter]
Okay.
I was his first student. And Yvon Maday, I've known Yvon Maday since '86, '87, when he came to the MIT lab, where I first met him.
Oh, interesting. I see. I've never met them, unfortunately. But maybe I should go to JC at the Sorbonne to see them one day.
I will say he's at the Sorbonne. Yeah. I was at the Sorbonne last summer. Yeah.
But yes. Okay. It's a small world. But yeah, thanks very much for the talk.
Thank you. Thank you everyone, and for the questions as well.
Yeah, thank you so much. I think we don't have any more questions, so I think it's time to close the session. Thank you for your great talk, and thank you everybody for joining us today. Have a great weekend. Bye.
thank you thanks for organizing as