Ultra-Fast Language Generation || Hybrid twinning using PBDW and DeepONet || Jan 16, 2026

By CRUNCH Group: Home of Math + Machine Learning + X

Summary

Topics Covered

64x faster language generation without infrastructure tricks
Humans don't read token by token
Why discrete diffusion breaks standard distillation
Block diffusion: marrying autoregression and diffusion
Teach AI only what physics cannot explain

Full Transcript

Hello everybody, welcome to today's CRUNCH seminar. Today we have two speakers and two talks. Our first speaker is Haoyang Zheng. He will be giving a talk on ultra-fast language generation via discrete diffusion divergence instruct. Haoyang Zheng is a PhD candidate in the School of Mechanical Engineering at Purdue University, advised by Dr. Guang Lin. He studies generative modeling and reinforcement learning from a probabilistic modeling perspective. His research interests include large language models, continuous and discrete diffusion models, reinforcement learning, and sampling and optimization methods aimed at improving latency and robustness. With that short introduction, the floor is yours and you can start your presentation.

Okay. Thanks for the kind introduction, and thanks to Professor Karniadakis for the invitation. It is my pleasure to give a talk at the CRUNCH seminar. Today I will talk about our recent work on large language models. The title is here, and we also call it DiDi-Instruct.

This work is a joint effort with a fantastic team of collaborators: Xinyang from UT Austin; Cindy, who is a PhD student at Purdue; Nan, who graduated from Purdue University and is now a tenure-track assistant professor; and Zheyuan from NUS. Unfortunately I did not get his photo for these slides, but since he has worked closely with Professor Karniadakis, I believe many of you are familiar with him. Wei is another Purdue alumnus who is now a research scientist at Morgan Stanley. Weijian graduated from Peking University and is currently a research scientist at RedNote. And also Dr. Guang Lin.

So I will start my talk with a brief introduction to diffusion language models: how they differ from autoregressive models and why we care about them. Then I will give a mathematical definition: how we define the forward and reverse processes of a diffusion language model, and how we define the objective to train it. Then comes the main part of this talk, the proposed DiDi-Instruct, which distills the model for few-step generation, or faster inference. There were four main challenges when we developed this method, and I will talk about how we address them. We also ran a lot of experiments to validate our method, which I will introduce in the experimental section. The last part is our future plans; I will briefly introduce some directions that we are developing related to DiDi-Instruct.

Nowadays I believe all of us are familiar with large language models, especially autoregressive models like ChatGPT or Gemini. They generate a sequence token by token from left to right. This works well because many companies do a lot of infrastructure work to optimize them, but it still faces a major limitation in inference speed. Think about the thinking mode in ChatGPT or Gemini: they need to generate a vast number of internal tokens to reason about a problem before giving an answer. So for complex tasks we often have to wait minutes to get a result. Due to this bottleneck, there is a growing trend in academia to explore diffusion language models as a potential solution.

So the core feature of diffusion language models is parallel sampling. In each step, the neural network predicts the logits for all tokens simultaneously. Theoretically, this means we could generate a complete response in just a single sampling step. But we also need to care about generation quality, and getting high-quality text in one step is quite difficult in practice. A common trade-off is to generate two tokens per step, which achieves quality comparable to autoregressive models. That means, without considering any infrastructure optimization, a diffusion language model can achieve about a two-times speed-up compared to an autoregressive model.

But clearly this does not reach the theoretical speed limit of diffusion models, since they can generate everything at once. To bridge this gap we propose DiDi-Instruct. It allows sampling up to 64 tokens in a single step while maintaining high quality, without any infrastructure optimization. This means it achieves up to a 64-times speed-up compared to an autoregressive model, and this is the motivation for developing DiDi-Instruct.

There are some other interesting arguments that motivate this area. For example, some informal arguments say that when humans process or understand language, we do not necessarily read from left to right, token by token. We just try to capture some keywords to understand the information globally.

Another point is that multimodal tasks such as image generation and video generation inherently do not have a natural left-to-right order. So sampling from random positions, as a diffusion model does, is much more reasonable than forcing them into an autoregressive order. This means diffusion language models might be more compatible with multimodal tasks than autoregressive models.
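The speed-up figures quoted in this part of the talk come from counting sampling steps. A minimal sketch of that arithmetic follows; the function names and the sequence length of 1024 are assumptions chosen for illustration, and the count ignores any difference in per-step cost between model families.

```python
import math


def sampling_steps(seq_len: int, tokens_per_step: int) -> int:
    """Network evaluations needed to fill seq_len positions."""
    return math.ceil(seq_len / tokens_per_step)


def speedup_vs_autoregressive(seq_len: int, tokens_per_step: int) -> float:
    """Ratio of step counts against one-token-per-step decoding."""
    return sampling_steps(seq_len, 1) / sampling_steps(seq_len, tokens_per_step)


if __name__ == "__main__":
    seq_len = 1024  # hypothetical sequence length
    for tokens_per_step in (1, 2, 64):
        steps = sampling_steps(seq_len, tokens_per_step)
        ratio = speedup_vs_autoregressive(seq_len, tokens_per_step)
        print(f"{tokens_per_step:>2} tokens/step: {steps:>4} steps, {ratio:.0f}x")
```

Under these assumptions, 64 tokens per step on a 1024-token sequence corresponds to 16 sampling steps, which lines up with the 16-step setting mentioned later in the talk.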

Okay. Before I dive into the mathematical formulation, I created some animations to help you understand how a diffusion language model works. It is similar to a continuous diffusion model: it has a forward process and a reverse process. For the forward process, we start with a complete, clean sequence. At each step we randomly select a subset of tokens, like these red ones, and transform them into mask tokens, like this one. Then we fix these two tokens. In the next step, we transform another two positions from text to mask tokens, and we keep doing this. At the end of the forward process, you get a complete sequence of mask tokens, just like this one. This completes the forward process.

For the backward process, we need to define a neural network. In practice we randomly select certain masked positions, like this red one, and based on the network's output we sample a token and convert the mask token to text. Once these tokens are sampled, they are fixed. In the next step we sample other mask tokens and transform them to text, and we keep doing this until the backward process recovers the whole sequence, just like this one. I hope this helps you understand the process. If you have any questions, feel free to stop me and I will try my best to explain everything.
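The animation described above can be mimicked with a small toy simulation. This is only a sketch: the `[MASK]` string, the function names, and the `predict` callback (a stand-in for sampling from the network's output) are all hypothetical, and a fixed number of positions is changed per step for simplicity.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def forward_mask(tokens, per_step, rng):
    """Mask `per_step` random clean positions per step until all are masked.
    Returns the list of intermediate sequences, starting with the clean one."""
    seq = list(tokens)
    history = [list(seq)]
    clean = list(range(len(seq)))
    rng.shuffle(clean)
    while clean:
        for i in clean[:per_step]:
            seq[i] = MASK  # once masked, a position stays masked
        clean = clean[per_step:]
        history.append(list(seq))
    return history


def reverse_unmask(length, per_step, predict, rng):
    """Start fully masked and reveal `per_step` random positions per step.
    `predict(seq, i)` stands in for sampling a token from the network."""
    seq = [MASK] * length
    masked = list(range(length))
    rng.shuffle(masked)
    steps = 0
    while masked:
        for i in masked[:per_step]:
            seq[i] = predict(seq, i)  # once sampled, a token stays fixed
        masked = masked[per_step:]
        steps += 1
    return seq, steps


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "diffusion models can fill tokens in any order".split()
    for state in forward_mask(sentence, 2, rng):
        print(" ".join(state))
    # An oracle predictor is used here only to show the mechanics.
    recovered, steps = reverse_unmask(len(sentence), 2, lambda s, i: sentence[i], rng)
    print(steps, " ".join(recovered))
```

The oracle predictor simply returns the true token; in the setting of the talk, that role is played by a trained network whose logits are sampled.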

Okay.

All right. So we already have a high-level understanding of what the forward process and the reverse process look like, and here is the mathematical definition. Just like in continuous diffusion, we have a time step t and a complete sequence of data x. The key difference from continuous diffusion is that we also need a corruption rate, because we need a rate at which tokens are transformed into the mask state.

After defining this, we define the transition probabilities. We need to define a transition from the data at t = 0 to an intermediate state z_t, and another transition probability from time step t to time step s. I think that is all we need to define the forward process, which is also called the absorbing process.
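The slide equations are not captured in the transcript. As a reference point, one common way to write an absorbing (masked) forward process is shown below. The notation is an assumption rather than the speaker's exact slide: x is a one-hot token, m is the one-hot mask token, and alpha_t is taken to be the probability that a token is still unmasked at time t, decreasing from 1 at t = 0 to 0 at t = 1. The talk's "corruption rate" may use the complementary convention.

```latex
% Marginal: from clean data x to the intermediate state z_t
q(z_t \mid x) = \mathrm{Cat}\big(z_t;\ \alpha_t\, x + (1 - \alpha_t)\, m\big)

% Transition between two times 0 \le s < t \le 1
q(z_t \mid z_s) = \mathrm{Cat}\big(z_t;\ \alpha_{t \mid s}\, z_s + (1 - \alpha_{t \mid s})\, m\big),
\qquad \alpha_{t \mid s} = \frac{\alpha_t}{\alpha_s}
```

The two definitions are consistent: a token that is unmasked at time s with probability alpha_s and survives to time t with probability alpha_t / alpha_s is unmasked at time t with probability alpha_t. A masked token stays masked, which is why the process is called absorbing.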

Hi, I have a question.

Yes, please.

Just a quick question on the last slide. I wanted to ask: the corruption rates are always non-negative. Is that correct?

Yes, it should be between zero and one.

Right. Okay.

Thanks for the question. Any other questions?

Yes, I have one question. How do you define these transition matrices? In the end, do you want to try to learn them, or do you define them beforehand?

That's a good question. For the backward process we need to approximate this forward process, to learn the transition probability. Does that answer your question? Sorry, can you say your question again?

Can you explain again how you define the transition probabilities, and whether they are learnable? Sorry, my internet cut out, so I didn't hear your answer.

Do you mean how I define the transition probability for the forward process?

Yes. How do you construct them?

So I think you first need to have the corruption rate. You first need a time step, and the time step corresponds to a corruption rate. Once you have the corruption rate, for an intermediate step t you can use this equation to calculate the probability of the state at that time, that is, how it goes from the data to the intermediate state z_t.

Okay, so it is associated with the masking, because I heard that every time you randomly replace some tokens, you mask some tokens, right? So if I understood correctly, you actually compute the transition probability, and you compute it based on the two states, right?

Yes, I think you can see that we just follow this equation to compute the probability.

Okay.

Does that make sense?

Yeah.

And this is a categorical distribution, if you are not familiar with it.

Mhm. Okay. Thank you.

So next we move on to the reverse process. We first need to parameterize the neural network: the input is the token IDs together with the time, and the output is the logits. We also need to define a transition kernel to approximate the forward transition kernel q. Here I have an example to help you understand how this neural network works. Consider a batch size of eight and a sequence length of 1,000, and consider a tokenizer like the GPT-2 one, which has a vocabulary of about 50,000. The input of the neural network has shape batch size times sequence length, and we input the token IDs, so each entry is an integer with dimension one. We also need to input the time step. The output has the same batch size and sequence length, but the difference is that it now contains the logits of all possible tokens, so the last dimension is 50,000. With these logits we can transform them into probabilities and then sample the tokens, just like an autoregressive model does.

The difference from continuous diffusion is that for discrete diffusion you need to sample the token, which is a discrete operation. Here we use the corruption rate that we defined previously. With probability 1 minus alpha_t we keep the current position masked, and with probability alpha_t we transform the mask token to a token sampled according to the logits. Here is an example. Suppose the corruption rate is 0.1 at this time. Because the sequence length is 1,000, that means around 100 positions will transform from mask tokens to text.
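The shapes and the unmasking rule from this example can be sketched in plain Python. Everything named here is hypothetical: `MASK_ID`, `toy_network` (random logits standing in for a trained model), and `reverse_step`. A small vocabulary is used so the example runs quickly; the talk's numbers are a batch of 8, length 1,000, and roughly 50,000 vocabulary entries.

```python
import math
import random

MASK_ID = -1  # hypothetical id for the mask token


def softmax(logits):
    """Turn one logit vector into probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_network(token_ids, t, vocab_size, rng):
    """Stand-in for the denoiser. Input shape (B, L), output shape (B, L, V).
    A real model would condition on token_ids and t; this one returns noise."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in row]
            for row in token_ids]


def reverse_step(token_ids, logits, unmask_prob, rng):
    """Reveal each masked position with probability `unmask_prob`, drawing
    the token from softmax(logits). Unmasked positions are left untouched."""
    out = []
    for row, row_logits in zip(token_ids, logits):
        new_row = []
        for tok, pos_logits in zip(row, row_logits):
            if tok == MASK_ID and rng.random() < unmask_prob:
                probs = softmax(pos_logits)
                tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            new_row.append(tok)
        out.append(new_row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    batch, length, vocab = 2, 1000, 20
    z = [[MASK_ID] * length for _ in range(batch)]
    logits = toy_network(z, 0.5, vocab, rng)
    z_next = reverse_step(z, logits, 0.1, rng)
    # On average 0.1 * 1000 = 100 positions per sequence are revealed.
    print([sum(tok != MASK_ID for tok in row) for row in z_next])
```

Because each position is revealed independently, the count per sequence fluctuates around 100 rather than being exactly 100.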

Okay, that is how we define the reverse process. We need a neural network and also a sampler to sample the tokens according to the logits provided by the neural network.

Okay. Now let's look at how we actually train the neural network. This is similar to continuous diffusion: we train it with the negative evidence lower bound, just like this one.
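The slide formula is not in the transcript. For orientation, one widely used form of the negative ELBO for masked diffusion language models is given below. It is an assumption that the talk uses this exact form. Here alpha_t is the probability that a token is still unmasked at time t, alpha_t' is its time derivative (non-positive), m is the mask token, and p_theta is the denoising network.

```latex
\mathcal{L}_{\mathrm{NELBO}}(\theta)
  = \mathbb{E}_{x}\,\mathbb{E}_{t \sim \mathcal{U}(0,1)}\,
    \mathbb{E}_{z_t \sim q(\cdot \mid x)}
    \left[ \frac{\alpha_t'}{1 - \alpha_t}
      \sum_{\ell \,:\, z_t^{\ell} = m}
      \log p_\theta\!\left(x^{\ell} \mid z_t, t\right) \right]
```

In words, this is a time-weighted cross-entropy on the masked positions only. With the linear schedule alpha_t = 1 - t, the weight becomes -1/t, so the loss is the cross-entropy on masked tokens scaled by 1/t.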

Okay. At this point we can train the diffusion language model, but there is still an obvious bottleneck, which is why we care about distillation.

The issue is that while the ELBO-trained model generates high-quality text, it does not fully unlock the potential for ultra-fast parallel sampling. To maintain high-quality generation, the diffusion language model is forced to sample at most two tokens per step. That means a diffusion language model trained with the ELBO still requires hundreds or even thousands of function evaluations, and that leads to substantial latency. To get faster inference, we propose DiDi-Instruct. The main idea is that we define a student model that has the same input and output as the teacher model. The teacher model is a diffusion language model trained with the ELBO. The main difference between the student and the teacher is that the student can do few-step generation: it requires only eight or 16 steps to generate high-quality text like the teacher model, which means a speed-up compared to the teacher model. The theoretical foundation is distribution matching distillation. We define the integral KL divergence and train the student model so that it can do few-step generation.

Okay. This kind of distillation actually works well in continuous diffusion. I will first briefly introduce how it works in continuous diffusion, then explain why it does not work for diffusion language models, and then how we solve the challenges to develop this framework. For continuous diffusion, if we want to do distillation, we also need to define a student model whose input, output, and network structure are the same as the teacher model. We also need to define the forward marginal for the teacher model: you have the neural network p_theta to output the logits, you get the data, and q is the forward marginal that corrupts it to some intermediate state z_s.

This is how we define the forward marginal. Our objective is to use the integral KL divergence to match the distributions of the student and the teacher, just like this one.
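The objective on the slide is not in the transcript. A generic way to write an integral KL divergence between student and teacher marginals is sketched below. The notation is assumed: nu denotes the student parameters, theta the frozen teacher parameters, w(t) a positive time weighting, and t plays the role of the intermediate time that the talk calls s.

```latex
% Forward marginals induced by each model
q_\nu(z_t, t) = \sum_{x} p_\nu(x)\, q(z_t \mid x),
\qquad
q_\theta(z_t, t) = \sum_{x} p_\theta(x)\, q(z_t \mid x)

% Integral KL divergence between the two families of marginals
D_{\mathrm{IKL}}(q_\nu \,\|\, q_\theta)
  = \int_0^1 w(t)\, \mathrm{KL}\big(q_\nu(\cdot, t) \,\|\, q_\theta(\cdot, t)\big)\, dt
  = \int_0^1 w(t)\, \mathbb{E}_{z_t \sim q_\nu(\cdot, t)}
      \left[ \log \frac{q_\nu(z_t, t)}{q_\theta(z_t, t)} \right] dt
```

The sums over x in the marginals run over every possible clean sequence, which is the reason the speaker describes these marginals as unavailable during training.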

The main challenge here is that we cannot get q_nu and q_theta when we train the neural network. I will explain this later, but keep it in mind for now. To get a tractable objective function, this work, which is called Diff-Instruct and was published at NeurIPS 2023, developed an objective like this one. I will give some high-level intuition for this objective. s_nu and s_theta are the scores of the student and the teacher, and the partial-derivative term is the gradient of the data with respect to the parameters. So the objective can be written as a product of the score difference and the gradient of the data with respect to the parameters. For the gradient term, we can sample the data and then use autograd to calculate it. For the score term, we can directly get the scores of the student and the teacher. That means this objective is fully tractable, and we can directly use it to distill the student model and do few-step generation.

This works well in continuous diffusion, but when we apply it to diffusion language models we encounter an issue. The main issue is the term I mentioned before, the gradient of the data with respect to the parameters. It uses autograd to calculate the gradient, but in a diffusion language model this involves non-differentiable operations such as argmax when we use the logits to sample the token. This interrupts the computational graph, and this is why this objective does not work for diffusion language models.
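The non-differentiability mentioned here can be seen numerically without any autograd library. In the sketch below (all function names are illustrative), a hard token choice via argmax has zero finite-difference slope with respect to every logit, while a soft, softmax-weighted quantity responds to the same perturbation.

```python
import math


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_choice(logits):
    """Discrete token pick: piecewise constant in the logits."""
    return float(argmax(logits))


def soft_choice(logits):
    """Softmax-weighted average index: smooth in the logits."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


def finite_difference(f, x, index, eps=1e-4):
    """Forward-difference estimate of d f / d x[index]."""
    bumped = list(x)
    bumped[index] += eps
    return (f(bumped) - f(x)) / eps


if __name__ == "__main__":
    logits = [1.0, 2.0, 0.5]
    print([finite_difference(hard_choice, logits, i) for i in range(3)])
    print([finite_difference(soft_choice, logits, i) for i in range(3)])
```

A small change in the logits leaves the selected token unchanged, so no useful gradient flows back through the selection. This is the gap that the policy-gradient formulation later in the talk is meant to close.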

Okay. Let me stop for a while in case you have any questions.

Maybe I have a question. It is a general one about distillation. In this case, does it mean that after training your model will be just as good as the teacher model, so it cannot outperform the teacher model? Or is there a way to do something afterwards to improve it further?

When we do the distillation, we have to match the performance of the teacher model. If you look at this red line, this is our proposed method, and this is a metric to test the performance of a diffusion language model: the smaller, the better. When we do the distillation and have fast inference, we are required to at least match the performance of the teacher model. The teacher model is the blue line. That means if we consider 16 steps, the sample quality can match the teacher model, and if you use just eight steps, it does not match. So we would select 16 steps to do the inference. Does that make sense?

Yeah. So it actually can outperform it. One question here: how did you choose that teacher model? Is it possible to choose something like Gemini 3.0, or something that is a bit more...

That's a good question. In our framework there are two functions for the teacher model. The first is to provide a good initialization for the student model. For what you said, Gemini, if you use an autoregressive model, I think the network structure is different from a diffusion language model, so we cannot directly use it for initialization.

The other way we use the teacher is that we need some regularization when we distill the model, because it is a pretty abrupt improvement. Sometimes the objective will push the student model too much, and that leads to performance issues like mode collapse. So we need the teacher model to let the student know that it cannot move too far away from the teacher model. This is a regularization. So I would say that if you use an autoregressive model as the teacher, it does not work. You still need to train a teacher model with the ELBO, or use some open-source diffusion language model as the teacher. I hope this answers your question.

Yeah, that makes a lot of sense. Thank you.

Mhm.

Hi, I had one clarification on what has been discussed so far. Is the teacher model also being trained in this process, or is it frozen throughout while the student model is being learned?

That's a good question, which I will explain later. For the teacher model, the parameters are frozen. It is just a reference for the student model. We do not train the teacher.

Okay. So initially you have a fully trained teacher model, and that is used to train the student model. Is that how it works, or is the teacher model also updated at some intervals during the update of the student model?

The teacher model is just a good initialization for the student model. The teacher model will not be trained during the distillation. It is just frozen.

Okay, got it. Could you also elaborate on the difference in the inference process? With autoregressive models you respect causality, because you mask only the future time steps. But in this case, are you saying that the masks are randomly generated and then you try to reconstruct the words from that? Is that how it works?

Yes, that's right.

Okay.

Actually, this is a naive setting. We mainly focus on distillation for fast inference. Of course there are some advanced decoding strategies for diffusion language models that give better sample quality, but because our work mainly focuses on distillation, we just use the naive setting, where we randomly decode each position. That's a good question.

Got it. Okay. Is this also in some way similar to BERT, like some scaling of the BERT models? Because even in BERT you have these maskings, which are done randomly with some defined fraction.

Yes. There are different types of diffusion language models. As you see here, we mainly focus on masked diffusion models. They actually take a lot of ideas from BERT, extended to handle different noise levels, and they also scale up BERT. I hope that answers your question.

Thank you.

Yeah.

If there are no other questions, I will move on to introduce our training framework. As I mentioned before, we have the teacher model and the student model. The teacher is frozen and we only train the student model. Because we also consider adversarial training, we also have a discriminator here. The main process is as follows. We first take a sequence of mask tokens and input it to the student model and the teacher model, and we require them to recover the whole sequence. We then corrupt the results to some random time step t_i, which gives partially masked sampled sequences, and we input them to the discriminator. The task of the discriminator is to identify whether a sampled sequence comes from the student or from the teacher. We have the ground truth and the binary output of the discriminator, and from these we get the objective to train the discriminator. On the other hand, we also input the student sequence to the discriminator, apply a transformation to its output, and obtain a reward signal. The reward signal is used to train the student model, and we train the student model and the discriminator alternately.

Finally, when the student model and the discriminator are well trained, the student model can do few-step generation and the discriminator can classify sampled sequences from the student and the teacher.

Okay. Let's tackle the first challenge that I mentioned before: the objective from continuous diffusion cannot be used for diffusion language models. Here we first have a mathematical derivation to get this score-function identity. We take some ideas from policy gradient and decompose the gradient of the integral KL divergence into the product of a reward signal and a policy gradient term. The reward signal here is the log density ratio between the student model and the teacher. It tells you how much you should update along a direction in parameter space, and the gradient term tells you the direction in which you should update the parameters. With this policy-gradient form, the objective can be directly used to train the diffusion language model.

I also want to give some high-level idea of why this objective is effective and why it allows the student model to do few-step generation. This diagram visualizes the reverse process. We start at time t = 1, which is a fully masked sequence, and the final destination at t = 0 is a fully unmasked, clean sample. The teacher model is trained with the ELBO, so intuitively its objective aligns the update direction at any given point with the tangent of the trajectory. That means if we have many backward steps, we can set the step size to be pretty small, so the discretization error remains low and the teacher model can produce high-quality text. However, if we force the teacher model to do few-step generation, the step size will be pretty large. This introduces a large approximation error, and that is why the teacher model cannot do well in few-step generation.

For the student model, our objective changes the update direction. It forces the update direction at every point on the trajectory to point directly to the final destination at t = 0. Because of this, even when the step size is pretty large, the student can still move in the correct global direction. That is the fundamental reason why the student model is capable of few-step generation.

Okay. This is how we address the first challenge. The second challenge is that the objective here is not tractable. As I mentioned before, this is because the log density ratio between q_nu and q_theta is not tractable. Each marginal is a sum of products: you first have the neural network output the sample, and then the forward process corrupts it to some intermediate state at time s. Evaluating this requires going through the whole discrete space to cover all x, so it is not tractable. Our solution is again to use a discriminator. The discriminator has the same input as the teacher model and a binary output. When it is well trained, it can represent the log density ratio between the student and the teacher, just like this one, and then we can represent the reward as a function of the discriminator, just like this one.
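The link between a binary discriminator and a log density ratio is the standard density-ratio trick. A minimal sketch follows, assuming equal numbers of student and teacher samples and a discriminator that outputs the probability that its input came from the student. The function names and the sign convention (student over teacher) are assumptions and may differ from the slides.

```python
import math


def optimal_discriminator(p_student, p_teacher):
    """Bayes-optimal probability that a sample came from the student,
    given equal numbers of student and teacher samples."""
    return p_student / (p_student + p_teacher)


def reward_from_discriminator(d, eps=1e-12):
    """Logit of the discriminator output, which estimates
    log p_student - log p_teacher when the discriminator is optimal."""
    d = min(max(d, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(d) - math.log(1.0 - d)


if __name__ == "__main__":
    p_student, p_teacher = 0.3, 0.1
    d = optimal_discriminator(p_student, p_teacher)
    print(d, reward_from_discriminator(d), math.log(p_student / p_teacher))
```

A trained network only approximates the optimal discriminator, so the resulting reward is an estimate of the log ratio rather than its exact value.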

Okay. Another challenge is training instability, which is inherent in policy gradient. It uses an unbiased estimator, but it introduces high variance. What we see when we train the diffusion language model is that it can generate high-quality samples, but the sample diversity is bad, so it may cause mode collapse.

To address this we consider two different strategies, one for the reward part and one for the gradient part.

For the reward, we take an idea from DeepSeek's GRPO. We first collect a batch of rewards, calculate the mean and variance of the rewards, and then directly normalize the reward signal, just like this one.

To stabilize the gradient, we decompose this term into two terms, just like this one. Here is a visualization of how we do this decomposition.

This is still the backward process, which starts from t = 1 and ends at t = 0. The left-hand term is like this blue line: it starts from t = 1 and points directly to t = 0.

This would work well if we used just one step during inference. But language generation with a diffusion model is a pretty challenging task, so we cannot use just one step to get a high-quality sample; in our experiments it requires about 16 steps. That means if we train with this blue line, the student never encounters any intermediate state on the trajectory. So when we do inference and jump to some intermediate state, the student model has not seen that situation, and it will fail and show mode collapse.

When we use the right-hand side and decompose it into two terms, it is like this red line. We first sample a time step t_i, which corresponds to an intermediate state z_i. We start from t = 1 and point to the intermediate state z_i, and then start from z_i and point directly to t = 0. This addresses the mode collapse issue when we train the student model.

Okay.

The last challenge is this: we have trained the discriminator, so can we leverage it to help inference? The high-level idea is that the discriminator provides a reward signal, and if we follow the reward signal it guides the inference toward a high-reward region, which helps to get high-quality samples. So when we do inference, we take the discriminator output, transform it into the reward signal, and inject the reward signal into the sampling step, just like this one and this one. This improves the sample quality a bit.

Yeah. So these are the four challenges that we faced when we proposed this distillation method. For the remaining part, I would like to briefly introduce the experimental part.

For a quick review of the setup: we conduct our experiments on OpenWebText. For the models, the teacher model and the student model have the same 169 million parameters. This is a pretty small model. For the discriminator, the teacher model outputs all the logits, while the discriminator only needs a binary output. So we initialize the discriminator from the teacher model and then replace the feature head. So it has 131 million parameters, smaller than the teacher model. And we use an H100 and Adam to train the neural network.

And yeah, so these are the main results that we got, just like what we showed before. This blue line is the teacher model, the red line is the model distilled by DD instruct, and the green line and the orange line here are some other distillation methods, such as consistency distillation, which are trajectory-based.

So it is clear that our method always performs the best across different numbers of sampling steps. And perplexity is a metric where, like I mentioned before, the smaller the better.

I have a question. [clears throat]

Mhm. Can you give us an intuition of what perplexity measures? I understand lower is better, but what is the thing that it is measuring?

Okay, that's a good question. So for perplexity, you have a reference, and the reference is an advanced autoregressive model. When we calculate perplexity, we do something like cross-entropy. From left to right, we compare each token against the advanced large language model to check whether they agree. If they agree, that means it gets a good performance; if they differ, the perplexity will increase. So we require the student model to generate a sequence of tokens, and we compare it against an advanced large language model to check whether they have the same quality. Does that make sense?

Kind of. But in this case, when we're talking about language, there are many possibilities. For example, there would be many ways of answering a question that may be valid, right? So in that case, wouldn't doing a comparison word by word be a bit misleading?

Yeah, that's a good point. So I would say that perplexity does sound like a very naive way to measure the performance, and there are some other metrics that may be good alternatives to perplexity.

Actually, if we want to measure whether it performs well, the formal way is to do some downstream tasks: we first have the backbone, then we train something like a feature head, and we test whether this backbone can maintain good sample quality, something like that. So I would say perplexity is a very simple way, but I think it is still a good indicator that our distilled model achieves good performance. Yeah.
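For readers who want the quantity behind this discussion, here is a minimal sketch of the standard perplexity definition: the exponential of the average negative log-likelihood per token. It assumes, in line with the description above, that a reference autoregressive model has already assigned a log-probability to each generated token; the function name and the toy numbers are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability that the
    reference (evaluator) model assigns to each generated token.
    """
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / n
    return math.exp(mean_nll)

# Toy cases: the evaluator gives every token probability 0.5, or 0.1.
confident = [math.log(0.5)] * 4
surprised = [math.log(0.1)] * 4

print(perplexity(confident))  # about 2: like choosing between 2 options
print(perplexity(surprised))  # about 10: like choosing between 10 options
```

Under this definition the score does not require an exact token match with a single reference answer: any continuation the evaluator finds likely receives a low perplexity, which partly addresses the concern about multiple valid answers.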

I think that makes... thank you for explaining. Yeah, it makes sense. And also, maybe I have another question here. If I understand correctly, when you have this plot of the trajectory, you said that the teacher model will go tangent to the trajectory, but your model will somehow cut it with some kind of shortcut, right? But if the blue line gets you to the end point of the trajectory, and the red one also gets you to the end point of the trajectory, why can you achieve better performance with the red one if the end point of the trajectory is the same?

Okay. I see, this is a good point. So actually, if we go back to the framework, when we train we replace this teacher model with data. So when we train the model, we use some data as a reference. That's why it can beat the teacher model. Does that make sense?

Yeah. Yeah, that makes a lot of sense.

Okay. Yeah. Great. Thank

Yeah. So I think there is something misleading here. When we train the model, running the reverse process of the teacher model is pretty slow. So we actually replace it with the data and then corrupt it. That makes the training much faster.

Yeah, this is a good point.

Thank you.

Okay.

Yeah. So later we also did a lot of ablation studies to test the different components of our DD instruct framework. Because, like I mentioned before, we introduce a lot of components, like gpo score decomposition and this reward guidance. So we need to test whether each component works well.

For table one, we start from a baseline without any tricks like gpo score decomposition, and we test the performance from 8 steps to 128 steps. Then for each row we add one technique that we mentioned before. And finally, in the last row here, we add all the techniques that we mentioned before, and this is the performance that we get. You can see from table one that each component contributes to a better performance of the diffusion language model.

Table two is another ablation. We start from the last line here, which is the baseline. It has the same result as the last row of table one, so we have all the tricks here. What we did here is remove one trick but keep all the other tricks. We also test the performance from 8 steps to 128 steps.

Okay. So later we also did some model scaling, because nowadays, for large language models, most people are interested in whether it can scale up and how to scale it up. So we also scaled up a little bit, to 400 million parameters, and we again test perplexity and entropy. For perplexity, the smaller the better; for entropy, the larger the better. The red line here is the result when we scale up. You will see that when we scale it up, it actually performs better.

We also have some other applications, like protein sequence generation. I think that is because DD instruct is not specific to diffusion language models; it can be applied to discrete diffusion in general. That's why we apply it in different applications like protein sequence generation. Due to the time limit I will skip the details here, but in general it performs well when we apply it to different applications.

Um hi sorry I have a question.

Mhm.

Um I just wanted to ask um do you know what protein this is?

Uh sorry because I'm not responsible for this experiment so I'm not quite sure.

Uh I I just know uh uh this is done by the second author. So he used the model from the Tik Tok and they they have yeah

they have Mhm.

Um, also, another question I just wanted to ask is: how do you make sure that it is preserving certain functional constraints that are there in a protein structure? How do you validate whether it does or not?

This is a good question. So I think we have some metrics to validate the details of how the protein maintains some constraints. Because, like I mentioned, I am not responsible for this experiment, you may need to go into the paper to see. And I believe, yeah, they did some work to validate that.

Okay. Yeah.

Thanks for the question.

Mhm.

Yeah. So the last part is that there are some promising directions that we are exploring at this time. The first is that, like I mentioned before, diffusion language models are compatible with image and video, and we are doing some text-to-image and text-to-video generation. And we are also interested in scaling it

up like to scale up to 7 billion diffusion language model like uh light up. So uh I think the main challenge

here is an infrastructure issue. Previously, with a small model like 169 million parameters, when we distill the model we can use distributed data parallel to train it, and one GPU can hold the entire model. But if we scale it up to 7 billion, I believe one GPU cannot hold such a large model, so we need to do some tensor decomposition or state decomposition. So that's the infrastructure optimization that challenges us in this project.

Another promising direction is called block diffusion. Nowadays autoregressive models have very good infrastructure optimization, and we want to leverage the advantages of both autoregressive models and diffusion language models. That's why block diffusion is getting popular nowadays. The main idea is that we decompose this 1,024-token sequence into blocks, like 32 blocks, where each block has 32 tokens. We start from the first block, which is the 1st token to the 32nd token, and within this block we use diffusion to generate the tokens. Once it's complete, we move to the next block, the 33rd token to the 64th token, and again use diffusion to generate all the tokens. Then we move to the next block, and so on.
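The block layout just described can be made concrete with a small sketch. It only computes the token ranges of each block; the diffusion sampler that would fill each block is not shown. The helper name `block_ranges` is hypothetical, and the ranges are 0-indexed and half-open, so `(0, 32)` corresponds to the 1st through 32nd token in the speaker's counting.

```python
def block_ranges(seq_len, block_size):
    """Split a sequence of `seq_len` tokens into consecutive blocks.

    Returns half-open (start, end) index pairs. Blocks are meant to be
    generated left to right, each one conditioned on the earlier blocks.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    return [(start, start + block_size)
            for start in range(0, seq_len, block_size)]

blocks = block_ranges(1024, 32)
print(len(blocks))   # number of blocks
print(blocks[0])     # first block
print(blocks[1])     # second block
print(blocks[-1])    # last block
```

With a sequence length of 1,024 and a block size of 32 this yields 32 blocks, matching the numbers quoted in the talk.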

And the good thing is that once we complete the first block, the second block can use the KV cache that we have from the first block. So we can actually get good inference speed when we use block diffusion. And if we consider DD instruct here, we can actually allow the autoregressive model to generate 32 tokens in one step. That means a 32 times speed-up compared to an autoregressive model. And yeah, I think that's all I want to share today. Thanks for listening, and I'm happy to answer any questions.

Thank you so much for your great talk.

So now we open the floor for questions.

Do we have any other questions for speaker?

Uh yes.

Uh, hi, I'm a postdoc. My name is Jim. So, well, I try to understand the big picture. Here basically a teacher model is given, and by using a technique like distillation you can reduce the number of model inferences. So that's the idea.

Yeah that's right.

So from like 1,024 to like 16.

Yeah. Yeah. Yeah. Just like this figure.

Yeah. The the red one here. we we want to have a fast inference.

Okay. So if the number of function evaluations is like 16, from the beginning it is fully random, so you need 16 steps to get the final sequence. Is that the right way to interpret it?

Uh, yeah, I think so. I would say we first fix the sequence length, like 1,024. If we have an autoregressive model, because you have to sample token by token, it requires 1,024 function evaluations to complete the sampling process. For masked diffusion, if we sample two tokens every time, that means it requires 512 NFEs to get the sample. And with this 64 times speed-up, it requires like 16 NFEs to generate the sequence.
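The arithmetic in this answer can be checked directly. The sketch below assumes a fixed sequence length of 1,024 tokens, as stated, and counts one function evaluation per sampling step; the helper name `nfe` is hypothetical.

```python
import math

def nfe(seq_len, tokens_per_step):
    """Number of function evaluations needed to fill `seq_len` tokens
    when each model call commits `tokens_per_step` tokens."""
    return math.ceil(seq_len / tokens_per_step)

SEQ_LEN = 1024

autoregressive = nfe(SEQ_LEN, 1)     # one token per model call
masked_diffusion = nfe(SEQ_LEN, 2)   # two tokens per model call
distilled = nfe(SEQ_LEN, 64)         # 64 tokens per model call on average

speedup = autoregressive / distilled
print(autoregressive, masked_diffusion, distilled, speedup)
```

The 64 times figure is therefore a ratio of function evaluations (1,024 versus 16), not a measured wall-clock ratio.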

Yeah. Yeah. Got it. Thank you.

Mhm.

Thank you. Thanks for the question.

Do we have any other questions?

Oh yeah, it's very good. Can you give us a feel of the time cost, from the smallest model, to 400 million parameters, to what you're targeting now, on an H100? Is it one day of training? Is it one week? What is it?

Oh, yeah, that's a good question. Because it is actually a very small model, like 169 million parameters, it just takes one GPU hour to complete the task.

Oh, one GPU. I didn't see that. Really? One GPU. So when you go to 400 million parameters, how does the scaling go? Is it linear? Is it quadratic?

I remember it takes like two hours, probably two or three hours. It does not take too much. For example, for the previous consistency model, I believe to get this performance they require at least 20 or 30 hours. Yeah, the training is pretty fast for our DD instruct method.

So it was 30 hours and it goes down to one hour.

That's right.

So you said you have a 64 speed up, right? 64 times speed up.


Probably probably worst case. Yeah.

Mhm. Yeah. I believe the reason they require so many hours is that they do something like progressive training. They start from a very small step to match the teacher model and gradually increase the step. So that means it takes like six or seven runs to train the model, and those six to seven runs require a lot of time. Yeah.

And how about the protein generation that you the example that you show? How

long does that take?

Uh sorry I I do not ask the second author though I believe it is also pretty fast because like I remember it

takes like uh two weeks or three weeks to complete the test. So I would say maybe one or two hours to to complete

the distillation.

I do not have the exact number. Sorry.

But here, because I did ask you about the constraints: here it says unconditional generation. Does unconditional mean that you don't really obey any constraints?

Yeah. Uh yeah. Yeah. That's right. Yeah.

There's no prompt or anything. Yeah.

All right. Thanks. Great talk. Thank

you.

Thanks for the question.

Yeah.

Thank you, Professor Karniadakis.

Yeah.

Yeah. Mhm.

Do we have any other questions? Because

I want to close this session.

Thank you so much for your great talk.

And now it's time to move on to our second speaker. Steven, can you hear me?

Our second speaker is Steven Masala. He will be giving a talk on hybrid twinning using PBDW and DeepONet for effective state estimation and prediction on partially known systems. He was born and raised in Congo-Brazzaville, where he completed most of his education before moving to France to pursue his undergraduate and graduate studies. He's currently a final-year PhD candidate in a joint program between École Normale Supérieure Paris and Nanyang Technological University (NTU), Singapore, working on the DesCartes project, which aims to develop digital twinning technologies to support decision making in urban contexts. With that short introduction, the floor is yours and you can start your presentation.

Thank you, Naz.

Yeah.

Um, yeah. So, she already presented. I'm actually a third-year PhD student, but in France we usually do short PhDs, so it's my final year, and I'm going to defend my PhD in two months, actually.

This research is based on some work that I did in the first year of my PhD. And I'm trying to make it as short as possible. Naz, is there any constraint on the time? Can I make it shorter?

Um yes, you are like a second speaker and last speaker. So take your time and uh it usually goes for 1 hour, but you

can present for like 45 minutes, then have question and answer session.

Okay. I actually made it so that it goes quite fast. I synthesized it, so anyone can stop me at any time to ask questions; feel free to do so. So the context of my research and my PhD is this program called DesCartes, a collaboration between France and Singapore. They aim to create a digital twin for urban contexts. They want to have this twin to help them make decisions in complex systems, in real time, with a lot of uncertainty and things that we don't know well. Most of the use cases in the project are related to energy, drone trajectories, and remote sensing. So there's a lot of structural health monitoring, drone trajectory planning where you update the trajectory of the drone in real time, and some energy grid model updating.

In this presentation I'm showing one algorithm, one project that we did in the first year of my PhD, but that was also applied to a drone case later, so I will show that at the end.

So a hybrid twin, in the way that we define it, is an additive model combination of a physics-based estimation and a data-driven correction, where you have a physics-based model... okay, can you see the mouse moving?

Not really.

Oh, okay. Okay. So, you have a physics-based model that makes an estimation and a data-driven model that corrects it based on experimental data. That's the idea of the hybrid twin, and this data-driven model can be a machine learning model where you learn the correction. The idea is to have it in real time and take a prediction-and-correction approach on systems like that. [clears throat] In the context of this research, we have an environment that we don't know perfectly: we might have wind that is changing, or some assumption that we made on the physics that is not correct. And we will try to take the best modeling approach to account for this uncertainty.

So the idea at the end is to have this hybrid twin where you can control it, and then you can apply it to dynamical systems that are changing as well.

Yeah. So the construction of the hybrid twin, as I said, with this additive approach, usually comes with a lot of challenges. For example, when you are combining the two elements, the physics-based approach and the data-driven correction, you want some complementarity. You don't want a data-driven correction that adds information the physics already has. So this orthogonality, or complementarity, is what occupied me the most when I started this PhD, and I really wanted to work on that. How do you actually ensure that what you are correcting complements what exists?

And the idea is to have it in a control loop later, so that you have a model that gets better and better over time as you are controlling it. So the combined model layer is the twin. The one that we build in this research is based on the idea of the PBDW. This approach is called parameterized-background data-weak; we inherit from it and build on top of it, because this approach is very interesting. It was built by Yvon Maday, Anthony Patera, and co-authors in 2015.

This is a data assimilation task where they try to reconstruct the state u of a system as an additive combination of a physics-based estimation and a data-driven correction, in the same spirit as the hybrid twin that I just introduced. The originality here is that in their formulation they manage to have a naturally orthogonal combination, in the way it's built. So here η, which is red, and z live in two spaces that are orthogonal. That allows you to correct and enrich the space of the physics-based estimate, which is blue. We wanted to build on top of this because of the mathematical properties of the approach, which give you a good view on the convergence of the error estimate. It also comes with a built-in a priori error estimate, which we can quantify thanks to the inf-sup stability constant here, the β, and it comes with a way to select the sensors that maximize the enrichment of the space. So it's a very interesting approach, and it was extended to time-dependent problems by Willy Haik and Ludovic Chamoin in 2023, which was the PhD thesis right before mine. They extended it to real time, to deal with biases that change and how you correct them in real time. So this is what I will be building on top of.

Please, you can stop me at any time if you have a question or if you want me to clarify anything.

So here we will work with two spaces. The blue space, which is the background space, is going to be the physics-based estimate, and then there is a red space, which is going to be the observation space. The observation space is spanned by the Riesz representers of the experimental data. We will assume here that we have a lot of experimental data and an incomplete physics-based model, a physics model that does not completely represent reality. We might have made a lot of assumptions, or we don't understand the physics properly. So estimating the state with just that physics-based model would give a wrong estimate, correcting it with the experimental data would give a good estimate, and adding more experimental data would give a better and better estimate. That's the idea. And there is a way to deal with noise, because even though the physics is wrong, the experimental data might be noisy. So in the optimization problem that they solve to find z and η here, there is a way to tune the regularization hyperparameter to balance the trust between the incomplete physics and the noisy data, following a kind of Morozov discrepancy approach.

Steven, can you say something about the two spaces?

Yes. So the first space, the physics-based space here, is built with a reduced-order model. We use the reduced basis approach. Let's say you solve your model many times: you have this parametric domain, and you solve the model for different values of the parameter that cover your parametric space.

What is it? Is it a Hilbert space? Is it a Banach space? Is it H1? What is it? A subspace of that?

Yes. So u is, let's say, in the physics-based space, which you can approximate with finite elements. So the space in which you define your solution is going to be the same space in which you are building the physics-based space.

So, for example, H1?

For example, H1. Yeah.

All right. And then the data space?

The data space. So I actually have the next slide for example. Here you have the physics space on the left, which is the different modes: first mode, second mode, third mode, fourth mode. And the second, the observation space, is the Riesz representers of the experimental data. For example, here we model the sensors as Gaussians: you have the first sensor, the second sensor, the third sensor, and the observation space is the span of those Riesz representers. So that would basically tell you where you need to correct the information. So those two spaces are the two spaces that we will use in this presentation.

But is there noise in the data, or you don't assume any?

Sorry?

The data are noisy, so on the observation space you can have stochasticity. How does that come in?

Yes. So we assume that the sensors are noisy. That's why in the optimization we have a regularization, the ξ, in the second line, if you can see the argmin.

So uncorrelated noise.

Uh, yes.

Okay.

Yes.

And then the saddle point. The solution of this problem reduces to this linear system, the saddle-point problem, with the first line and the second line. If you look at the second line, the projection of η on B gives zero, and in B you have all the physics modes. So that tells you that there's an orthogonality between the data-driven correction and the physics-based space, with every single mode of it. And y is the experimental data. So we fix the experimental data, we force it, and then we find the physics estimate and the correction such that the sum of them gives the experimental data.
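A small numerical sketch may help make the saddle-point structure concrete. It is not the speaker's implementation: it assumes a plain Euclidean inner product on a 40-dimensional vector space instead of a finite element discretization of H1, random physics modes and sensor representers, and noise-free data, so the regularization term mentioned earlier is omitted. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 40, 3, 6  # ambient dimension, physics modes, sensors (M >= N)

# Physics (background) modes as orthonormal columns of Z.
Z, _ = np.linalg.qr(rng.standard_normal((n, N)))
# Sensor representers as columns of Q (stand-ins for Riesz representers).
Q = rng.standard_normal((n, M))

# Synthetic "true" state and its noise-free sensor readings.
u_true = rng.standard_normal(n)
y = Q.T @ u_true

# Saddle-point system:
#   [A   B] [eta_coef]   [y]
#   [B^T 0] [z_coef  ] = [0]
A = Q.T @ Q                      # Gram matrix of the representers (M x M)
B = Q.T @ Z                      # representers against physics modes (M x N)
K = np.block([[A, B], [B.T, np.zeros((N, N))]])
rhs = np.concatenate([y, np.zeros(N)])
solution = np.linalg.solve(K, rhs)
eta_coef, z_coef = solution[:M], solution[M:]

z = Z @ z_coef        # physics-based estimate
eta = Q @ eta_coef    # data-driven correction
u_est = z + eta       # hybrid state estimate

print(np.abs(Z.T @ eta).max())          # second block row: eta is orthogonal to every mode
print(np.abs(Q.T @ u_est - y).max())    # first block row: the estimate reproduces the data
```

The second block row enforces that the correction is orthogonal to every physics mode, and the first block row enforces that the sum of the two parts matches the measurements, which are the two properties described above.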

Thank you.

And yeah, here the quality of the estimate depends on the stability constant, the inf-sup constant here, the β. The bigger the β, the closer to one, the better the estimate we have. So in the way we choose the sensors, we also use the inf-sup constant to choose the next sensor that maximizes the information gain from the correction. We follow this algorithm that was built by, I don't remember who exactly, but it was in 2015 in the same paper as Maday, which relies on [clears throat] this greedy stability-approximation approach, where you maximize the stability and minimize the approximation error every time you choose a new sensor. That allows you to better correct and cover the physical space.
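As a rough illustration of the stability constant, the sketch below computes an inf-sup constant between a physics space and an observation space in a Euclidean setting, where it equals the smallest singular value of the product of orthonormal bases of the two spaces. The Euclidean inner product, the function name, and the toy spaces are assumptions for the example; both input matrices are assumed to have full column rank.

```python
import numpy as np

def stability_constant(Z, Q):
    """Inf-sup constant between span(Z) (physics modes) and span(Q)
    (sensor representers) under the Euclidean inner product.

    Returns a value in [0, 1]: 1 means every physics mode is fully
    observable, 0 means some combination of modes is invisible.
    """
    Zo, _ = np.linalg.qr(Z)   # orthonormal basis of the physics space
    Qo, _ = np.linalg.qr(Q)   # orthonormal basis of the observation space
    if Qo.shape[1] < Zo.shape[1]:
        return 0.0            # fewer sensors than modes: not observable
    singular_values = np.linalg.svd(Qo.T @ Zo, compute_uv=False)
    return float(singular_values.min())

identity = np.eye(6)
sensors = identity[:, :4]          # sensors see coordinates 0..3
seen_modes = identity[:, :2]       # modes inside the observed subspace
hidden_modes = identity[:, 4:]     # modes outside the observed subspace

print(stability_constant(seen_modes, sensors))          # 1.0
print(stability_constant(hidden_modes, sensors))        # 0.0
print(stability_constant(seen_modes, sensors[:, :1]))   # 0.0 (one sensor, two modes)
```

A greedy sensor selection in this spirit would, at each step, add the candidate sensor that most increases this value; that loop is not shown here.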

So I will just give an example of solving a PDE with the PBDW that I just presented, which is not what I built but what we are building on top of.

So I have a question on that. If you don't move the sensors, if you don't do adaptive [clears throat and snorts] sampling using this greedy method, then you cannot guarantee that the spaces will be orthogonal?

The spaces are going to be orthogonal no matter the choice of the sensors. It's by construction here, in the saddle-point construction, right? The saddle point gives you the orthogonality.

Yes. But the question is interesting, because if we choose them well then we need fewer sensors, and we actually have a better condition on the uniqueness and the existence of the solution.

But if β is very small, let's say 0.001, that's bad, so you have to do adaptive sampling to improve it.

Yeah. Exactly. Yes. If β is very bad, we need more sensors to improve the β. And if it's good, then we need fewer sensors. But the condition here to have at least existence of the solution is that we need at least as many sensors as physics modes. So here the physics space is a reduced-order space, and we use N modes and M sensors. By using this algorithm we can use only N sensors and it will work, but if the choice is not optimal then we might need M bigger than N to have existence, at least. And the idea is that if we have fewer sensors, then we don't have enough sensors to understand the physics modes, so the physics modes are less and less observable and we might have instability.

So, the numerical example, just to show it. Here I just chose one example, which is the Helmholtz equation, which is actually the initial problem that they chose in the paper. You have this physical domain, this square domain, and then you have [clears throat] the Helmholtz equation, and then we have a bias. The bias here is introduced because we are making the assumption that the physics is imperfect; we don't know the physics perfectly. Here the bias comes in the source. The source is Q on the right-hand side of the equation, and in that source we introduce a bias g. That bias is going to represent our lack of understanding of the physics. And the red dots are some of the experimental data. We also introduce another bias, in the boundary conditions. Let's say the reality is that those boundary conditions are Dirichlet conditions, but when you are solving it you don't know that, and you solve it with Neumann. So that's also a misconception of reality. So we have different biases. This is the problem that I will solve with AI later, but in this experiment I'm just showing the perfect example without bias. So those are the two spaces.

So, for example, for a perfect model without bias, this is how PBDW works. You have the physics-based estimate in panel (a); this is the estimate by the reduced-order model. Then you have the correction by PBDW, which is based on the observation data. The sum of the two gives you panel (c), which is the estimate of the state using this hybrid approach. And in panel (d) you have the relative error with respect to the finite element approximation of the same state. We do that while increasing the number of modes, and we see how the approximation changes.

We fix the number of sensors to 50. So we have 50 sensors and two POD modes, and by increasing the number of POD modes we see that the error gets more and more localized. We actually need more sensors, and the error increases where we are far from the sensors. Yellow means the error is large and red means the error is low. If you look at the yellow areas, most of them are between sensors.

This is the instability I was talking about as well. By increasing the number of modes for a fixed number of sensors, the problem becomes unstable, but if we add sensors it gets better, especially in the yellow areas. That is actually one of the criteria used in sensor selection: stability maximization and approximation error minimization.
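To make the "physics-based estimate plus data-driven correction" structure concrete, here is a minimal sketch of the algebraic saddle-point form commonly used for PBDW, written in plain Python on a toy 4-dimensional space with the Euclidean inner product. The modes, sensor representers, and observation values are illustrative assumptions, not the setup from the talk.

```python
# Toy PBDW-style estimate in R^d (Euclidean inner product).
# Every vector and number below is an illustrative assumption.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def pbdw_estimate(modes, representers, observations):
    """Return (background, update, state) for the saddle-point system
    A eta + B z = y,  B^T eta = 0."""
    n_modes, n_sens = len(modes), len(representers)
    A = [[dot(qi, qj) for qj in representers] for qi in representers]
    B = [[dot(q, zeta) for zeta in modes] for q in representers]
    K = [A[m] + B[m] for m in range(n_sens)]
    K += [[B[m][n] for m in range(n_sens)] + [0.0] * n_modes
          for n in range(n_modes)]
    sol = solve(K, list(observations) + [0.0] * n_modes)
    eta, z = sol[:n_sens], sol[n_sens:]
    dim = len(modes[0])
    background = [sum(z[n] * modes[n][i] for n in range(n_modes))
                  for i in range(dim)]
    update = [sum(eta[m] * representers[m][i] for m in range(n_sens))
              for i in range(dim)]
    state = [b + u for b, u in zip(background, update)]
    return background, update, state

modes = [[1.0, 1.0, 1.0, 1.0]]                 # one reduced-order mode
representers = [[1.0, 0.0, 0.0, 0.0],          # sensor at component 0
                [0.0, 0.0, 1.0, 0.0]]          # sensor at component 2
background, update, state = pbdw_estimate(modes, representers, [2.0, 4.0])
print(state)  # the state matches the two sensor readings at components 0 and 2
```

In this sketch the update lies in the span of the sensor representers and is orthogonal to the reduced-order mode, which is the property the rest of the talk relies on.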

Okay. So now I will introduce the contribution we made during the first year of my PhD. The idea is this: we have a physics-based model, and in this context we will not replace it with an AI approach. Instead, we replace the data-driven correction model with an AI approach, on the condition that we learn only what we don't know. So we force the AI approach to live in the orthogonal complement of the physics-based space, so that we can correct efficiently.

So here the new problem is that the twin is going to be the physics-based estimate plus an AI correction, and the AI correction is going to be a DeepONet. Why DeepONet? Because I went to the PINNs summer school in Stockholm in 2022, and that's where I got very curious about DeepONet and learned it. I remember asking Raj and George a lot of questions about this orthogonality, and I spent most of my time after that working on this problem.

The idea was interesting in this case because in DeepONet you have two AI models that you project onto each other. The second model actually provides the vectors, and you learn them at the same time as you learn the coefficients. In our case, since we know the space in which we want to live, which is the space of our ignorance, we can replace the trunk with a set of bases that we build ourselves. So the idea here, for how to learn something that you don't know, is to build a set of vectors that are orthogonal to what you know and force your AI to live inside that space.

You have two ways to do it. The first way is physics-informed: you penalize the lack of orthogonality to the physics-based estimate. Or you can simply take a physics-constrained approach, where you replace the trunk with a set of bases, which is what I'm showing here. If you want the paper, you should be able to find it using the title of my presentation. In the paper I discuss both approaches, but here I'm only discussing the physics-constrained one because it showed better performance.

So the idea here is that the bias we built, shown in red, is a linear combination of coefficients and vectors. Each vector is a Riesz representer, and the coefficient is how much weight you put on it. The Riesz representers are constructed offline, so we don't need to relearn them. The DeepONet here only learns the coefficients, and then the output is projected onto the Riesz representers.

What we are saying is that if the output of the DeepONet is a combination of those coefficients on the orthogonal complement of the physics-based modes, then the predicted bias is always going to be orthogonal to the physics-based estimate. That is how we constructed it, and by construction we have a constraint that holds all the time. Compare that with a physics-informed orthogonality, where you would have something that works only on average. Here the constraint holds all the time: you train with the constraint, and at inference you have the constraint. That was very important for us to guarantee, so that we correct efficiently.

The loss is just the classical MSE loss, and the training data was different values of the data-driven correction. We would solve the PBDW many times, obtain different values of that data-driven correction, and then train the AI to minimize the L2 distance to those data.
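The difference between a constraint that holds "all the time" and a penalty that holds "on average" can be sketched in a few lines. In this minimal example the fixed basis, the physics mode, and the stand-in network outputs are all assumed values chosen for illustration; no real DeepONet is trained.

```python
# Hard orthogonality by construction versus a soft (penalized) orthogonality.
# All vectors and coefficients are illustrative assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

physics_mode = [1.0, 1.0, 1.0, 1.0]      # assumed single physics-based mode

# Assumed fixed basis built offline; each vector is orthogonal to physics_mode.
basis = [[1.0, -1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, -1.0]]

def constrained_correction(coefficients):
    """Correction field as a linear combination of the fixed basis vectors."""
    dim = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coefficients, basis))
            for i in range(dim)]

def orthogonality_penalty(field):
    """Soft-constraint term: squared inner product with the physics mode."""
    return dot(field, physics_mode) ** 2

def mse(prediction, target):
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

# Stand-ins for network outputs (hypothetical values).
coefficients = [0.7, -1.3]               # what a coefficient network might emit
free_field = [0.5, -0.2, 0.9, 0.1]       # what an unconstrained network might emit

hard = constrained_correction(coefficients)
print(orthogonality_penalty(hard))        # zero (up to rounding) for any coefficients
print(orthogonality_penalty(free_field))  # generally nonzero; must be penalized in the loss
```

With the constrained form, only the plain MSE against the PBDW corrections is needed as a loss, because no choice of coefficients can leave the orthogonal complement.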

Yeah. So the classical DeepONet is on the left, and the contribution of this paper is on the right, where we replace the trunk with a set of bases that we build, and the output is projected onto the Riesz representers to reconstruct the scalar field of the correction. When we sum both of them we get the estimate, with a guarantee on the orthogonality. So that's the approach in this paper that I wanted to share here.

In terms of advantages, why is it more interesting, or what does it bring compared to PBDW? Instead of solving the saddle-point problem, with its complexity in M and N, we solve on the reduced-order space only, and the AI makes an approximation of the correction. So the inference time goes almost to zero. It gets very low, especially for a system where we have a lot of data, which brings the computational time down a lot, and that's interesting.

I also wanted to say that all this is possible because the saddle-point problem can be decomposed into two problems: one constrained problem and one linear problem. The linear problem is easy to solve, and the constrained problem is easy to solve as well. So instead of solving the saddle-point problem, my approach reduces to two steps: we do the inference on the eta, and then we solve the constrained problem using the AI inference. So the problem is actually equivalent.

And then we tried it on some cases. Still on the Helmholtz equation, here the bias was on the boundary condition: instead of a Dirichlet condition we used a Neumann condition. The prediction of the bias using the approach is in panel (a), the correct bias is in (b), and the physics-based estimate is in (c). In (d) you have the hybrid twin using this DeepONet approach with the orthogonality, in (e) you have the PBDW without DeepONet, and in (f) you have the ground truth using FEM with linear polynomial shape functions.

In this case I didn't use a ton of data to train this. I think I used maybe only 1,000 pairs of data, and it already showed quite interesting results. It could learn that we are wrong on the boundary condition: you can see that the value is different on the border. In terms of error, we were around 7% L2 error on the field, compared to 5% for the classical PBDW. So PBDW did better; the DeepONet approach was faster and 2% worse in this case.

And then in this case it was a combination of boundary condition and source: a wrong source and a wrong boundary condition. The prediction of the bias using DeepONet is in panel (a), and the true bias using PBDW is in (b). You can see that the DeepONet with the strong orthogonality managed to capture this source and boundary-condition error, with some slight mistakes in the left corner. In (d) you have the hybrid twin, (e) is the classical PBDW, and (f) is the ground truth using FEM. In terms of error, here we were super close to the PBDW. I think it was 2% and 4% compared to the FEM ground truth.

This work was extended to time-dependent problems. I also worked on an encoder-decoder approach to do the same thing without DeepONet, and that was also extended to time-dependent problems using a transformer, to see if we can learn some encoding. And then it was applied. What I've been working on recently is applying it to a drone. The idea is: say you have a drone, and the environment changes. How do you update your model using the real-time information? So we also applied the PBDW to this.

The idea was to do Bayesian model updating. In this case we have an inertia that is changing. For example, one of the motors is going to break at some point, or maybe the drone hits something, and the inertia changes. Now you have a bias on the inertia, because you didn't model the inertia perfectly. So the idea is to use a dual Kalman filter, where you have a filter that tracks the parameters, and the observation is actually the PBDW. Instead of having a fully data-based observation in the Kalman filter, we have a hybrid physics-and-data observation that corrects the bias, and then we update the parameters in real time. We tried it on this case.

For example, the curve on the right shows how the bias is estimated in real time and how the bias contributes to the state estimate of the drone in a 25-second simulation. Here the case was: your drone is following a fixed trajectory from point A to point B, and halfway, at 10 seconds, the inertia suddenly drops. How do you adapt to it super quickly, to regenerate a trajectory that you can still follow using MPC? That's one of the cases where we use the PBDW in the observation of a Kalman filter to correct it. You can see that the three components of the inertia were updated quite fast, allowing the MPC control to be done on the trajectory that we defined.
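As a rough illustration of tracking a parameter that drops suddenly, here is a scalar Kalman filter with a random-walk parameter model. It is a heavily simplified stand-in, not the dual Kalman filter from the talk: the parameter is normalized, the noise levels are assumed, and the noisy measurement plays the role that the PBDW-corrected observation plays in the real setup.

```python
import random

def track_parameter(measurements, q=1e-3, r=2.5e-3, theta0=1.0, p0=1.0):
    """Scalar Kalman filter for a parameter modeled as a random walk.

    q: assumed process-noise variance (how fast the parameter may drift)
    r: assumed measurement-noise variance
    """
    theta, p = theta0, p0
    estimates = []
    for y in measurements:
        p = p + q                       # predict: random walk inflates uncertainty
        gain = p / (p + r)              # Kalman gain
        theta = theta + gain * (y - theta)
        p = (1.0 - gain) * p            # update the estimate variance
        estimates.append(theta)
    return estimates

random.seed(0)
n_steps, drop_step = 250, 100           # 25 s at 0.1 s per step, drop at 10 s
true_theta = [1.0 if k < drop_step else 0.6 for k in range(n_steps)]
measurements = [th + random.gauss(0.0, 0.05) for th in true_theta]
estimates = track_parameter(measurements)

print(estimates[drop_step - 1])         # estimate just before the drop
print(estimates[-1])                    # estimate at the end of the simulation
```

The process-noise variance `q` sets how quickly the filter reacts to the drop: a larger value adapts faster but gives a noisier estimate.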

Yes, those are the use cases that I wanted to show: the Helmholtz one and this one. In terms of computational time, doing this without PBDW on the drone would take a little bit longer. PBDW would gain a few seconds or milliseconds on the control, because you're solving an optimization problem at every time step.

Yes, that's it. That was my contribution for today. I would love to take any questions or comments.

Yeah. Thank you so much. Do we have any questions for the speaker?

I can see only George.

Maybe I have a question.

Yeah.

Hi Stephen, that was a really good presentation. Thank you so much for sharing with us.

Thank you. Thank you.

One of the things you were mentioning: did you choose this way of imposing the trunk because you wanted to enforce orthogonality? And then you said that there are two ways of doing it. One was this physics-based approach where you have a soft constraint.

Mhm.

But there's also another one, and I'm not sure if you compared with it, where you do some kind of QR decomposition or SVD on the DeepONet, and essentially what they're trying to do is transform your basis to make it orthogonal. Did you compare with that framework to see...

Oh no, I've never heard of that. Can you say more? Orthogonal to what?

Yeah. So no, it's not exactly what he wants. There the basis is orthogonal to itself, but that doesn't guarantee that it's orthogonal to the physics modes. So that's a...

Okay, it could be different.

He doesn't really care about the trunk basis, because he replaced the trunk basis.

Yes.

Yeah. So you replace the trunk with a basis that is orthogonal to the physics, right? So they are not orthogonal between themselves?

They are, because I build them with a kind of Gram-Schmidt orthogonalization.

Okay.

And I build them sequentially, one by one. So I ensure that each of them is orthogonal to what I want, that is, the physics that we already know, and then they are also orthogonal to each other.
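The sequential construction described in this answer can be sketched with a modified Gram-Schmidt loop: each candidate vector is first made orthogonal to the known physics modes and then to the basis vectors already accepted. The function names, the candidate vectors, and the tolerance are assumptions for illustration; the talk builds its basis from Riesz representers in a function space, not from unit vectors in R^4.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_components(vector, orthonormal_set):
    """Subtract from `vector` its projection on each vector of an orthonormal set."""
    v = list(vector)
    for k in orthonormal_set:
        c = dot(v, k)
        v = [vi - c * ki for vi, ki in zip(v, k)]
    return v

def complement_basis(physics_modes, candidates, tol=1e-10):
    """Build, one vector at a time, an orthonormal basis that is orthogonal
    to the physics modes and whose vectors are orthogonal to each other."""
    known = []
    for mode in physics_modes:          # orthonormalize what we already know
        v = remove_components(mode, known)
        length = math.sqrt(dot(v, v))
        if length > tol:
            known.append([vi / length for vi in v])
    n_physics = len(known)
    for cand in candidates:             # then add the new directions sequentially
        v = remove_components(cand, known)
        length = math.sqrt(dot(v, v))
        if length > tol:                # skip candidates that bring nothing new
            known.append([vi / length for vi in v])
    return known[n_physics:]

physics_modes = [[1.0, 1.0, 1.0, 1.0]]
candidates = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
basis = complement_basis(physics_modes, candidates)
print(len(basis))                       # dimension of the orthogonal complement
```

A QR or SVD of a learned trunk, as raised in the question, would make the trunk vectors orthogonal to each other, but only a step like the first projection here ties them to the physics modes.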

Okay. Okay.

But I will check the QR approach that you mentioned. I will read the paper. Thank you for sharing.

So this is called two-stage training of PINNs.

Okay.

Yeah. The author is Shin, S-H-I-N, a former postdoc here.

But why is the accuracy worse when you use the DeepONet?

Oh, because I didn't use a lot of GPUs. I think I just did 1,000 simulations, and I tried it on a very simple case and it was already amazing. So I was like, oh, okay.

But fundamentally there's no reason that the accuracy would be worse, right?

Yes. Yes. It should match.

And in terms of computational complexity, if M = N, I think the speed-up factor is 8 over 3. I checked: 3 + 3 in the middle, so 6 + 1 + 1, 8. So that's 8 N cubed versus 2 N cubed.

Okay.

Sorry, N cubed... 3 N cubed. So it's 8 over 3. Yeah. So 8 over 3: substantial, but yeah.

Yes, I was thinking of the case where M equals N, but if M is big then it would be more interesting.

Yeah, it would be more interesting. Yes. Okay.

Do we have any other questions?

Stephen, by the way, I'm on the scientific board of DesCartes.

Oh, I saw it. [laughter] I saw it.

I never attended the meetings, but Oh, no.

Too busy.

I understand. Yeah, I'm finishing now. I'm going to defend in two months.

Is Paco part of your thesis committee?

I think it's Elías Cueto instead. But yes, Paco follows my work. We invited Elías instead for the [clears throat] committee.

Great.

But if you want to attend my thesis defense, I can send you a link.

Send me. Yeah sure.

Sure. Okay. Thank you.

Tony Patera was my adviser.

Sorry, Tony Patera?

My adviser. [laughter]

Okay.

I was his first student. And Yvon Maday, I have known Yvon Maday since '86 or '87, when he came to the MIT lab, where I first met him.

Oh, interesting. I see. I've never met them, unfortunately. But maybe I should go to JC at the Sorbonne to see them one day.

I will say he's at the Sorbonne. Yeah. I was at the Sorbonne last summer. [clears throat]

But yes. Okay. It's a small world. But yeah, thanks very much for the talk.

Thank you, thank you everyone, and for the questions as well.

Yeah, thank you so much. I think we don't have any more questions, so I think it's time to close the session. Thank you for your great talk, and thank you everybody for joining us today. Have a great weekend. Bye.

thank you thanks for organizing as
