Top AI Researcher: We've Been Lied To About LLM Training
By Zachary Huang
Summary
Topics Covered
- GRPO Biases Toward Shorter Correct Responses
- Dr. GRPO Fixes Length and Difficulty Biases
- FP16 Precision Beats BF16 Range in RL
- Separate Engines Cause Precision Mismatch
- Realistic Benchmarks Drive True AI Progress
Full Transcript
And it's actually quite controversial. I saw that in the open review you actually got two very strong accepts.
>> People found that if you use vanilla GRPO for optimization you get very weird response length behavior. So for Dr. GRPO what we do is very simple. We just want to go back to the very basic but correct formulation of policy gradients.
>> What do you think would be the most important problems for large language model training? I want two answers. One is for the next one year and another is for the next 10 years.
>> I think for me a very important problem is creating...
>> Today I have the chance to talk with Dr., one of the top AI researchers working on
large language model training, specifically reinforcement learning. One of his works demonstrates that the well-known GRPO training algorithm that powers DeepSeek is actually flawed. A few months ago he published another paper uncovering a fundamental flaw in the floating point precision used by most reinforcement learning frameworks, and it caused quite a buzz; even Andrej Karpathy highlighted this work on Twitter. We'll talk about all his experiences and thoughts in this field. Welcome to the podcast. You are one of the top AI researchers. Do you mind giving us a quick introduction and what motivated you toward the domain of reinforcement learning?
>> Yeah. Hi Zachary, thanks for the invitation. I'm very happy to be here. I'm currently a final-year PhD student in Singapore and also a research engineer at Sea AI Lab in Singapore. During my past four years of PhD study, I started from classical deep reinforcement learning, focused on playing Atari games and simulation kind of stuff, and then later I switched to LLM post-training, because I found LLM post-training is a quite successful application of RL and it can create a big impact in the AI field moving forward. So that's why I chose to work on reinforcement learning.
>> That's amazing. Let's jump right into your most recent works that are quite influential. So let's start with the
paper Understanding R1-Zero-Like Training: A Critical Perspective. In this paper you talk about the theoretical issues within the popular GRPO algorithm, and it's actually quite controversial. I saw that in the open review you actually got two very strong accepts, which is quite rare in AI conferences. So do you mind sharing with us the backstory of this paper?
>> Oh yeah, sure. Actually we worked on this paper back in January of this year, which is right after the birth of the very famous DeepSeek-R1 paper. In the meantime we were actually working on RL for mathematical problem solving concurrently with DeepSeek-R1, but immediately after its release we read their paper, and we found we had some disagreements with some of the intuitions they bring and some of the findings they have, especially the "aha moment" and the phenomenon where, as training goes on, the response length keeps increasing. These two phenomena seemed quite abnormal to us at that time. So we decided to investigate why this happens, and that led us to the finding that there's actually no aha moment during the RL process, because the base model itself can already exhibit the self-reflection behavior. It also led us to the finding that GRPO might have some length bias or question difficulty bias, which can lead to the ever-increasing response length phenomenon reported by the DeepSeek-R1 paper.
>> Yeah, that's amazing. For the sake of the audience, could we just walk through the GRPO algorithm step by step, so we can get on the same page about the problem?
>> Oh yeah, sure. I'll use these slides for illustration. Basically GRPO is a simplified policy gradient algorithm, typically for LLM post-training. For any question or prompt, we sample a group of responses and grade each response with a reward scalar. Typically it's a binary scalar, say one or zero for correctness, and sometimes we may introduce some format reward as well. Then we normalize within the group to calculate the advantage, and finally we use the advantage to do the policy gradient estimation and use that policy gradient to update the model parameters.
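For concreteness, here is a minimal sketch (not code from the interview or from his framework) of how that group-normalized advantage is typically computed; `grpo_advantages` and the `rewards` tensor are hypothetical names:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages for one prompt.

    rewards: shape (G,), e.g. 1.0 for a correct response, 0.0 for an
    incorrect one. Each response gets a single scalar advantage that is
    later broadcast to all of its tokens.
    """
    mean = rewards.mean()
    std = rewards.std()
    # GRPO: subtract the group mean and divide by the group std.
    # (Dr. GRPO later drops the division by std; see the comparison below.)
    return (rewards - mean) / (std + eps)

# Example: 4 rollouts for one math question, two of them correct.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
print(advantages)  # positive for correct responses, negative for incorrect
```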
>> Do you mind walking through the different components of the GRPO algorithm? I saw that the GRPO algorithm has two summations. Could you give us a very high-level intuition on what each part of the algorithm is doing?
>> Oh yeah, sure. Looking at this equation, this capital G is the group size, that is, how many responses we would like to sample for each question, and the first summation is over these responses. The second summation is over each token of each response. So here, from t equals one to the length of that response, we sum up the losses on each token for that response and then divide by the response length in front. But actually, as we will see later, this is one bias that GRPO has, which we call the length bias.
>> Could you just step back: what is policy gradient? What is it optimizing for?
I think policy gradient precedes all the GRPO and PPO variants, so maybe let's just go back to the very basics here. What's policy gradient?
>> Yeah, sure. Policy gradient is a type of RL algorithm that optimizes your policy towards maximizing the cumulative reward. Basically, given an observation, you would like to take some action, and this action has some probability. After you take the action in the environment, you receive some reward, and then you adjust the probability of taking this action to increase that reward or to minimize a negative reward. So it is basically adjusting the probabilities of your actions towards maximizing the total reward.
>> And how do that observation, action, and reward translate into a large language model that's trying to solve a math problem?
>> In the large language model context, the observation we have is the question we give to the LLM, the action is the response it returns, and the reward is how good the response is. Typically, for example for a mathematical problem, it is how correct the final answer is for the question.
>> So could you help us walk through the raw formula for the policy gradient? What do the loss function and the gradients mean? Maybe that will help us transition better to the difference between GRPO and Dr. GRPO.
>> Yeah, sure. If we denote pi theta as the policy, then the policy gradient looks like this. Basically we sample trajectories from this pi theta policy and take the expectation over these trajectories. Each trajectory is a sequence of state, action, reward, state, action, reward, and so on, denoted s_t, a_t, r_t. The reward r_t is not visible here because it eventually gets converted into the advantage estimate, denoted A here. So we have the action probability pi theta of a_t given s_t and its advantage estimate A, and what we do for policy gradient is reweight the log probability of the action by its advantage and convert that into a gradient with respect to the model parameters to be updated. And note here that over the horizon of this trajectory, which is capital T, we do the summation over all the time steps instead of the average, which is a hint of why GRPO is biased.
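For reference, the vanilla policy gradient he is describing is commonly written as follows (a standard textbook form, not copied from his slides):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t
    \right]
```

The sum over t runs over the whole trajectory with no division by its length T; a division by the sampled length is exactly the extra factor GRPO introduces and Dr. GRPO removes.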
>> Amazing. Could you go through the formulas for GRPO and Dr. GRPO?
>> Yeah, sure. These two formulas look a little more complex than the previous one because we also add the PPO-style clipping, which is this ratio here. But if we ignore that, it actually looks quite similar to the vanilla policy gradient, except for two differences highlighted in red: this one-over-length and this division by std. This is how we argue GRPO is biased in terms of length and question difficulty. With this one-over-length, the policy gradient is reweighted by the length of the response itself. This reweighting is harmful because in GRPO we optimize the model parameters using a lot of rollouts for a single question. These rollouts have different lengths, and if we reweight the policy gradient by this length, then for correct responses, where all the advantages are positive values, the shorter the response is, the larger this reweighting is, so those positive gradients get upweighted. The overall optimization will be biased towards shorter but correct responses. On the other hand, if the response is incorrect, the advantage will be a negative value, which means for a longer response this reweighting term will be smaller. That means we penalize longer incorrect responses less. So in summary, for both correct and incorrect responses, this one-over-length term pushes the optimization in a biased direction, which is not what we want. And the dividing-by-std term introduces a question-level difficulty bias, because the reward vector, the rewards for response 1 through response G, will be very flat for very easy or very hard questions, nearly all ones or nearly all zeros, so the standard deviation will be quite small. Dividing by this small std will upweight the overall weight of the policy gradient for those questions. So that is how GRPO itself has these two types of biases, and that, especially the length bias, is why people found that if they use vanilla GRPO for optimization they get very weird response length behavior. For Dr. GRPO, what we do is very simple: we just go back to the very basic but correct formulation of policy gradient, so we remove these two bias terms in red, and we hope to do GRPO correctly, or as we call it, GRPO Done Right.
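As a rough sketch of the difference (my paraphrase of the two objectives, ignoring the PPO-style clipping ratio; `token_logps`, `advantages`, and `mask` are hypothetical tensors for one group of G responses padded to length T):

```python
import torch

def grpo_loss(token_logps, advantages, mask):
    """GRPO-style aggregation: per-response mean over tokens (1/|o_i|).
    Advantages are assumed group-normalized, i.e. divided by the group std.
    token_logps: (G, T) log-probs of sampled tokens, advantages: (G,),
    mask: (G, T) with 1 for real tokens and 0 for padding."""
    per_token = -token_logps * advantages.unsqueeze(1) * mask
    lengths = mask.sum(dim=1)                       # |o_i| per response
    return (per_token.sum(dim=1) / lengths).mean()  # <- length bias here

def dr_grpo_loss(token_logps, advantages, mask):
    """Dr. GRPO: sum over tokens with no 1/|o_i|, averaged over the
    (constant) group size. Advantages here are just R_i - mean(R),
    with no division by the group std."""
    per_token = -token_logps * advantages.unsqueeze(1) * mask
    return per_token.sum(dim=1).mean()
```

The divisor in `grpo_loss` depends on each sampled response's length, while in `dr_grpo_loss` it is a constant, so it cannot favor shorter correct responses or under-penalize longer incorrect ones.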
>> Wow, that's amazing, almost to the degree of unbelievable, because GRPO and DeepSeek were being born during that time. I remember when I opened the news it was even at the very top. So why do you think they could make such a bias error? Is it an oversight, or why do you think such a bias wasn't caught by the researchers?
>> Yeah, that's a great question. When we looked at the GRPO paper, which is the DeepSeekMath paper from 2024, we were also quite curious why this happened in the first place. One hypothesis from us is that, for this advantage normalization, what they did is the typical normalization we use in statistics, subtract the average and divide by the standard deviation; perhaps they didn't really consider the biasedness from the RL perspective. And actually, empirically, I don't think this causes a lot of issues until we really scale it up and observe some strange phenomena, so at that time, for their setting, it was perhaps fine empirically. Then for the one-over-length bias, I guess taking the average loss is a quite typical operation in many deep learning implementations or formulations: it looks like, if we sum over all the tokens, then we had better divide by the length of the whole response, the number of tokens we have here. That also seems reasonable from the very beginning, when the authors of GRPO wrote this equation, but in fact, from the RL perspective, especially for policy gradients, it is biased; perhaps they just didn't notice it until this year.
>> For the audience who don't really have a very strong theoretical background, when they look at this algorithm, they may
feel like this normalization makes intuitive sense, because in the loss you are essentially weighting the advantage of each rollout based on the sum of its token log probabilities, and what happens is that rollouts with more tokens get a larger weight even though the length should not really matter to the total loss. So what's wrong with such an intuitive understanding?
>> Yeah, great question. I think the part we might miss is that all these responses are sampled from our policy. RL optimizes the expected reward, and this expectation is over the states and over the policy; over the policy means we sample all the actions from the policy. All these responses, whether shorter or longer, are sampled from the policy, which means they already have different sampling probabilities. For longer responses, we might expect the probability of that exact long response to be smaller than that of a shorter response, so from the expectation point of view it averages out: sampling a longer response has a smaller probability, and summing over a longer response gives a larger loss, but they average out during this expectation process.
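In LLM terms, one way to write down the argument he is making (my paraphrase, not a formula from the slides):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\!\left[
      A(q, o) \sum_{t=1}^{|o|}
        \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})
    \right]
```

The length |o| is itself random and determined by the sampling distribution, so its effect is already accounted for inside the expectation; dividing the inner sum by |o| divides by a quantity correlated with the sample, which is where the length bias comes from.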
>> Got you. Just to zoom out a bit on the whole reinforcement learning training algorithm design: how much of this is theoretical versus practical engineering? Because if we look at the GRPO or PPO algorithms, you have all this clipping and the KL divergence factor that, I assume, already bias the original policy gradient, because you have this very sharp clipping factor. So there seem to already be a lot of practical and engineering factors coming into place that people just find useful even if they are already biased. Does bias versus unbiasedness still matter if our training has already been polluted by these engineering choices?
>> Got you, that's a really great question. It's very hard to be fully unbiased, I would say, especially when we take the engineering and implementation into consideration. Even from a pure optimization point of view, using for example FP16 or BF16 is already a quantization of the real values, which also introduces some bias if we take that into consideration. But I think there are different levels of biases here. What I just described, the numerical bias, is a very low-level bias which we cannot control. The GRPO bias we note here is at a very high level; it is theoretically incorrect, and we can fix it. And what you described just now, the clipping and the first-order approximation in PPO, those kinds of biases are something we choose as a trade-off between bias and variance. So there are different levels, and what we hope to do is fix the very high-level, theoretically incorrect biases to make it unbiased. Some other levels of biases are our deliberate choice for the trade-off, or something we cannot really control or fix; we let those go, and they will not create very harmful results, unlike the GRPO biases, which lead to very bad response length issues.
>> If we
further zoom out from reinforcement learning to the stage of pre-training, or maybe mid-training, or even SFT, I think they also have a somewhat similar problem: when we are performing fine-tuning, responses that are longer naturally get more weight, and in the pre-training stages people also try to balance the different distributions of the data mixture. Do you think something similar is going on with the theoretical parts of those algorithms?
>> Yeah, I think it also depends on the granularity at which we model the data. Here in RL the granularity is the response level: we treat each response as the action, and we decompose this macro action into token-level actions by factorizing the probability. That is why we need to sum up the log probabilities instead of taking the average. For the pre-training stage, from what I understand, the granularity is typically at the token level, so we do next-token prediction. In that sense we just pack all the tokens into a fixed context window for a training batch, and then whether we do the summation or the average, it is fine, because the context window is fixed. If we sum it up, it's fine; if we divide by the length, it is a constant length, so it won't introduce any bias there. So I guess the biases we are talking about here become severe once we go into the RL stage, once we sample these responses and do this kind of GRPO, this kind of contrastive learning; then these biases will emerge and create very harmful results.
>> Amazing. Moving from Dr. GRPO to your next paper, Defeating the Training-Inference Mismatch via FP16. This paper made quite a buzz on Twitter; according to Andrej Karpathy's tweet, your paper attracted a lot of his attention. Could you give us a high-level backstory of this paper?
>> Yeah, actually this is a paper in collaboration with a few amazing colleagues here at Sea AI Lab, and the backstory is that we were doing another project before this one; we didn't even intend to focus on this problem. During that project the training would always collapse, it was always unstable, and we wanted to investigate why this happens, what causes this instability. We narrowed it down and found that this training-inference mismatch matters a lot. During that time, I think the first blog post on truncated importance sampling came out, and we found this issue is actually quite relevant to what we observed for the training instability. We dug a bit deeper and found that precision matters a lot for this training-inference mismatch. Simply switching from BF16 to FP16 can resolve this mismatch issue to a large extent.
>> Awesome. So could you step back and walk through what FP16 versus BF16 is?
>> Oh yeah, maybe to most of us FP16 is a very old technology, because we are all using BF16 by default. Basically they both use 16 bits to represent numbers in computers, but the difference is that they allocate a different number of bits to the exponent and the mantissa. FP16 allocates more bits to the mantissa, which leads to higher precision but a smaller dynamic range for the values that can be represented, while BF16 has a very wide dynamic range but less precision. What we found here is that precision matters more than dynamic range for RL fine-tuning of LLMs.
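A quick way to see the trade-off he describes, using PyTorch's dtype metadata (a small sketch; the numbers are properties of the FP16 and BF16 formats themselves, not of any particular framework):

```python
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # FP16: 1 sign + 5 exponent + 10 mantissa bits -> fine precision, max ~65504
    # BF16: 1 sign + 8 exponent + 7 mantissa bits  -> coarse precision, max ~3.4e38
    print(dtype, "eps =", info.eps, "max =", info.max)

# The spacing between representable values near 1.0 (machine epsilon) is
# about 0.00098 for FP16 but about 0.0078 for BF16, so BF16 rounds much more
# coarsely, which amplifies the rollout-vs-training probability mismatch.
```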
dynamic range. Could you give us like a very high level overview on the numerical stability in the field of maybe not only reinforced learning but also machine learning in general because I think numerical stability is has been
there quite a while. People has all different techniques like residual connections or for vanishing gradients before. Why is this a solved problem?
before. Why is this a solved problem?
>> Yeah, definitely, it has a very long history, and many researchers and engineers have tackled these numerical issues. I guess that is also why BF16 came about. With BF16, what we want is a very large dynamic range while reducing the computation burden, because using FP32 is too heavy and too costly. We want half-precision training, but FP16 doesn't provide enough dynamic range, and that's why we have BF16. That is the history we went through in past years, and it became the default. What is different here is that for RL tuning of LLMs, the mismatch happens because we have very complex infrastructure for both the rollout engine and the training engine. This was not an issue several years ago, when we didn't really do large-scale RL training on large models; we just did SFT, supervised training, on smaller-scale models, and we didn't have separate training and rollout engines. So that is why this numerical issue has come to the surface these days, even though we have a long history of optimizing for it.
>> Could you share with us why we have two different engines even though the model parameters are essentially the same? Isn't that quite wasteful in terms of GPU memory usage?
>> Yeah, I think it's mostly an efficiency consideration. For example, in a typical LLM RL training framework we will have vLLM or SGLang as the rollout engine, which is highly optimized for autoregressive generation, and then we will have, for example, Megatron, FSDP, or DeepSpeed on the training engine side, which is suitable for different kinds of parallelization and acceleration. We want to leverage the good things of both sides and combine them together to get a highly optimized framework overall. That is why we choose to separate the engines, and even more, we may do asynchronous training, keeping both engines busy. That addresses the question you mentioned: the different engines might occupy different GPU memory, but they are not wasteful, because they reside on different GPUs with different memory, yet they are all doing a heavy amount of work and the overall throughput is very high.
>> So why was BF16 so popular before? Why haven't people
used FP16 for all the different training tasks here? Is it that, for reinforcement learning, we already impose a lot of constraints so the model does not deviate too much from the current model, so the range will be small anyway, but the discrepancy turns out to be larger, so we should trade range for more precision? Or are there other, deeper reasons?
>> Yeah, that's a great question. I think previously we used BF16 typically to optimize a network from scratch. In that phase the optimization landscape might be quite complex, and the values occurring during optimization might span quite a large range; that's how BF16 can help with this very large-range optimization process. After all this pre-training from scratch and mid-training, we have a quite stable parameter state where the values are in a smaller range, and for the RL process we typically just fine-tune a little bit, or even only a sub-network of the whole model, so the parameters won't deviate too much. Here what matters more is the numerical mismatch between the different engines, and by using FP16 we have a smaller numerical mismatch, which overall gives us more benefit than the larger dynamic range that BF16 offers.
>> So is it correct that even if FP16 is better for reinforcement learning, for the other stages of large language model training, say pre-training, mid-training, or even SFT, where you don't have two different engines, we should still stick with BF16 for its larger range? What would be your recommendation?
>> Yeah, I agree. I think what you propose is a quite reasonable procedure.
>> And how would quantization, which tries to further save GPU memory
by representing floating point values with integers, interact with this precision discrepancy issue?
>> Oh yeah. First of all, we didn't try FP8 or INT8 training yet, but I think the idea behind it is similar. Whatever quantized, floating point, or integer value representation we have, we will definitely have this training engine and inference engine mismatch, and we can fix it from the algorithm side by doing importance sampling; this is what people are currently doing, and it can fix it to some extent. But we'd better fix this mismatch from the very beginning, from the root cause, and if we use some quantized training, we'd better also align the value representations on both sides to keep the mismatch as small as possible, to avoid any training instability during our process.
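For context, a minimal sketch of the algorithm-side fix he mentions, truncated importance sampling, which reweights the training-engine loss by a capped ratio between training-engine and rollout-engine probabilities; the function and tensor names are hypothetical:

```python
import torch

def tis_policy_loss(train_logps, rollout_logps, advantages, mask, cap=2.0):
    """Truncated importance sampling over tokens.

    train_logps / rollout_logps: (B, T) log-probs of the sampled tokens
    under the training engine and the rollout engine respectively; they
    differ only because of the numerical (e.g. BF16) mismatch.
    """
    # Per-token importance ratio between the two engines, capped at `cap`
    # so a large mismatch cannot blow up the gradient.
    ratio = torch.exp(train_logps - rollout_logps).clamp(max=cap).detach()
    per_token = -train_logps * advantages.unsqueeze(1) * ratio * mask
    return per_token.sum() / mask.sum()
```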
>> Now, what do you think, given that we already have much stronger reasoning, math, and coding models from the frontier labs, and they probably face similar issues: what's your best guess at their solution to this problem before they even saw your paper? Is it that larger models are natively less vulnerable to this discrepancy, or do you think they have a better solution, or maybe they just use FP32 because they have more compute to burn?
>> Yeah, I think for the frontier labs, even if they don't use the FP16 technique, maybe they already have some other technology that can reduce the mismatch. So I guess frontier labs have definitely already fixed this numerical issue. For the open-source community, most people actually observe instability issues, so the issue is there and it is not fixed. But to deliver a good model we don't really need fully stable training, I guess, because whenever the training collapses we can save the checkpoint and restart again, and we have various tricks and engineering methods to make the final model performant, to deliver a model that is good on benchmarks, even if we haven't solved the root cause of the training instability. That is my guess.
>> Okay, that's amazing. And given all the
the current issues with reinforcement learning, I know you have your own reinforcement learning framework, called OAT. Do you mind sharing with us a bit of the backstory: when did you start writing this framework, what problem does it solve, and what is the story behind it?
>> Oh yeah, sure. I guess nowadays there are more and more RL frameworks, and OAT is just a simple attempt for me to create my own and to fulfill some of my own research projects. So I started coding it back in
project. So I started coding uh back in uh May of last year. Uh that is when VR is not so popular and uh same as other
some of the other frameworks. I didn't
intend to create a really large scale IO frameworks but instead I want to create a simple and uh hackable and userfriendly lightweight framework
especially for single node training because I found most of the research like like the academia research focus on single node training like less than 7B 8B model parameters I think that is
enough and I focus friendly more than scalability and the start of the odd development ment is from one of my online alignment project where I want to
implement online DPO and that time I don't find any suitable framework to start and because they they either do not implement online DPO or the efficiency is too low so that's why I
decide to write my own starting from online DPO and then later switch to online PO and online PPO with verifiable reward those are very simple extension based on the very first version of the framework
>> That's amazing. How hard is it to write an RL framework from scratch? What technical challenges have you faced? Could you share with us the step-by-step building process you went through?
>> Yeah, sure. Actually I had a bit of framework experience from several years ago because I worked on deep RL. It is very similar, because the basic building blocks are just the actor and the learner. The actor is where the policy takes actions given some observations. The learner samples from these experiences, the data buffer, and then optimizes the network, and what we have next is to synchronize the weights from the learner to the actor; that completes the whole loop of your framework. The components are already there and I knew this before, and what I needed was to fill in these components with modern LLM infrastructure. So I put vLLM as the actor, I put DeepSpeed as the learner, and we combine them together to make the full loop workable. Basically that is how it went, and during this process I also referred to many amazing infrastructure works, for example TRL, OpenRLHF, and of course vLLM and DeepSpeed. Without these prior infrastructure works I don't think it would have been possible for me to write OAT by myself.
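A highly simplified skeleton of the actor-learner loop he describes (hypothetical class and method names, not OAT's actual API): the actor wraps an inference engine to generate rollouts, the learner wraps a training engine to update the policy, and the weights are synced back each iteration.

```python
# Hypothetical skeleton of the actor/learner loop, not OAT's real API.

class Actor:
    """Wraps an inference engine (e.g. vLLM) to generate rollouts."""
    def generate(self, prompts):
        ...  # sample responses + log-probs with the current weights

    def load_weights(self, state_dict):
        ...  # receive synced weights from the learner

class Learner:
    """Wraps a training engine (e.g. DeepSpeed) to optimize the policy."""
    def step(self, rollouts):
        ...  # compute rewards/advantages, run the policy-gradient update

    def state_dict(self):
        ...  # export current weights for the actor

def train(actor: Actor, learner: Learner, prompt_loader, num_iters: int):
    for _ in range(num_iters):
        prompts = next(prompt_loader)
        rollouts = actor.generate(prompts)         # 1. act
        learner.step(rollouts)                     # 2. learn
        actor.load_weights(learner.state_dict())   # 3. sync, closing the loop
```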
>> That's amazing. How would you compare OAT against veRL? veRL is kind of the current most popular reinforcement learning framework, built by ByteDance and backed by a large industry team. So if I'm a new reinforcement learning researcher trying to pick a framework, what would you recommend?
>> Oh yeah, it's a hard question, but I think the comparison is relatively simple. I know veRL is backed by a group of full-time people, perhaps at ByteDance, and its development is very fast; they incorporate many new technologies in a very timely manner. Compared to veRL, I couldn't really maintain OAT as often. So veRL has many features and supports very large-scale multi-node training; it is more scalable, of course, but as a result their codebase is more complex as well. Typically a beginner will complain about veRL's complexity, too many layers of abstraction, and that it is very hard to customize some algorithms or low-level stuff. In that sense, OAT might be a friendlier start for new joiners, because OAT has very clear modular components: what you need is to write your sampling logic and your training logic in different classes and then combine them together, and you have a training script.
And I think OAT and veRL are not really in competition with each other. For a new beginner I actually don't recommend either specifically; I would like new beginners to first learn the overall concepts and then pick some runnable script to get a feel for the training. You need to look at the critical path of your code: for your runnable script, where does your rollout happen and where does your training happen? You need to locate that code, and you will find a very minimal critical path out of a very large codebase; then what you need to focus on is only that critical path and the necessary modifications, and that is all. So choosing OAT or veRL doesn't matter that much.
>> What do you mean by a runnable script? Do you mean doing all this reinforcement training in a pretty offline fashion? We first load the rollout engine and all the model parameters, do the rollouts, free the GPU memory, get onto the training engine, send the training data to it, free the memory again; is that the runnable script you were thinking about?
>> Yeah, something like that. I think there are some educational resources for RL training, but I can't remember the name. I strongly recommend new beginners to start from those.
>> Final question: what do you think would be the most important problems for large language model training? I want two answers. One is for the next one year,
another is for the next 10 years.
>> Okay, that is a very broad question. For the next one year, a disclaimer: I don't have a very definite answer for this, because in LLM training many people, many top researchers, are working on it, and the future is very unpredictable. I think for me a very important problem is creating good benchmarks for what we really want from LLMs. We know the LLM is a very good technology and impacts society a lot, but what the frontier labs are currently pushing is mostly maximizing the benchmarks. Are these benchmarks realistic enough for the real-life, real-world tasks human beings are really facing? Perhaps not very closely. So creating very realistic benchmarks is a high priority from my side, because benchmarks are also reward functions, for either the RL of LLMs or the RL of us human beings, and we also do reward hacking. So creating correct and realistic benchmarks is of quite high importance for me for the next few years.
>> What about 10 years?
>> For 10 years, maybe for the whole community to figure out how we can leverage the power of LLMs to benefit humanity, either to do frontier scientific discovery or, I don't know, to save energy, to improve different levels of efficiency in society. Yeah, it's more high-level stuff that the AI community as a whole needs to think about, I think.
>> Dr. Thank you so much for your time.