Top AI Researcher: We've Been Lied To About LLM Training
By Zachary Huang
Summary
Topics Covered
- GRPO Biases Toward Shorter Correct Responses
- Dr. GRPO Fixes Length and Difficulty Biases
- FP16 Precision Beats BF16 Range in RL
- Separate Engines Cause Precision Mismatch
- Realistic Benchmarks Drive True AI Progress
Full Transcript
And it's actually quite controversial. I saw that in the open review you actually got two very strong accepts.
>> People found that if you use vanilla GRPO for optimization you get very weird response length behavior. So for Dr. GRPO what we do is very simple. We just want to go back to the very basic but correct formulation of policy gradients.
>> What do you think would be the most important problems for large language model training? I want two answers. One is for the next one year and another is for the next 10 years.
>> I think for me a very important problem is creating...
>> Today I have the chance to talk with Dr., one of the top AI researchers working on
large language model training, specifically reinforcement learning. One of his works demonstrates that the well-known GRPO training algorithm that powers DeepSeek is actually flawed. A few months ago he published another paper uncovering a fundamental flaw in the floating point precision used by most reinforcement learning frameworks, and it caused quite a buzz; even Andrej Karpathy highlighted this work on Twitter. We'll talk about all his experiences and thoughts in this field. Welcome to the podcast. You are one of the top AI researchers. Do you mind giving us a quick introduction and what motivated you toward the domain of reinforcement learning?
>> Yeah. Hi Zachary, thanks for the invitation. I'm very happy to be here. I'm currently a final-year PhD student in Singapore and also a research engineer at Sea AI Lab in Singapore. During my past four years of PhD study, I started from classical deep reinforcement learning, focused on playing Atari games and simulation kind of stuff, and then later I switched to LLM post-training, because I found LLM post-training is a quite successful application of RL and it can create a big impact in the AI field moving forward. So that's why I chose to work on reinforcement learning.
>> That's amazing. Let's jump right into your most recent works that are quite influential. So let's start with the
paper Understanding R1-Zero-Like Training: A Critical Perspective. In this paper you talk about the theoretical issues within the popular GRPO algorithm, and it's actually quite controversial. I saw that in the open review you actually got two very strong accepts, which is quite rare in AI conferences. So do you mind sharing with us the backstory of this paper?
>> Oh yeah, sure. Actually we worked on this paper back in January of this year, which is right after the birth of the very famous DeepSeek-R1 paper. In the meantime we were actually working on RL for mathematical problem solving concurrently with DeepSeek-R1, but immediately after its release we read their paper, and we found we had some disagreements with some of the intuitions they bring and some of the findings they have, especially the "aha moment" and the phenomenon where, as training goes on, the response length keeps increasing. These two phenomena seemed quite abnormal to us at that time. So we decided to investigate why this happens, and that led us to the finding that there's actually no aha moment during the RL process, because the base model itself can already exhibit the self-reflection behavior. It also led us to the finding that GRPO might have some length bias or question difficulty bias, which can lead to the ever-increasing response length phenomenon reported by the DeepSeek-R1 paper.
>> Yeah, that's amazing. For the sake of the audience, could we just walk through the GRPO algorithm step by step, so we can get on the same page about the problem?
>> Oh yeah, sure. I'll use these slides for illustration. Basically GRPO is a simplified policy gradient algorithm, typically for LLM post-training. For any question or prompt, we sample a group of responses and grade each response with a reward scalar. Typically it's a binary scalar, say one or zero for correctness, and sometimes we may introduce some format reward as well. Then we normalize within the group to calculate the advantage, and finally we use the advantage to do the policy gradient estimation and use that policy gradient to update the model parameters.
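For concreteness, here is a minimal sketch (not code from the interview or from his framework) of how that group-normalized advantage is typically computed; `grpo_advantages` and the `rewards` tensor are hypothetical names:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages for one prompt.

    rewards: shape (G,), e.g. 1.0 for a correct response, 0.0 for an
    incorrect one. Each response gets a single scalar advantage that is
    later broadcast to all of its tokens.
    """
    mean = rewards.mean()
    std = rewards.std()
    # GRPO: subtract the group mean and divide by the group std.
    # (Dr. GRPO later drops the division by std; see the comparison below.)
    return (rewards - mean) / (std + eps)

# Example: 4 rollouts for one math question, two of them correct.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
print(advantages)  # positive for correct responses, negative for incorrect
```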
>> Do you mind walking through the different components of the GRPO algorithm? I saw that the GRPO algorithm has two summations. Could you give us a very high-level intuition on what each part of the algorithm is doing?
>> Oh yeah, sure. Looking at this equation, this capital G is the group size, that is, how many responses we would like to sample for each question, and the first summation is over these responses. The second summation is over each token of each response. So here, from t equals one to the length of that response, we sum up the losses on each token for that response and then divide by the response length in front. But actually, as we will see later, this is one bias that GRPO has, which we call the length bias.
>> Could you just step back: what is policy gradient? What is it optimizing for?
I think policy gradient precedes all the GRPO and PPO variants, so maybe let's just go back to the very basics here. What's policy gradient?
>> Yeah, sure. Policy gradient is a type of RL algorithm that optimizes your policy towards maximizing the cumulative reward. Basically, given an observation, you would like to take some action, and this action has some probability. After you take the action in the environment, you receive some reward, and then you adjust the probability of taking this action to increase that reward or to minimize a negative reward. So it is basically adjusting the probabilities of your actions towards maximizing the total reward.
>> And how do that observation, action, and reward translate into a large language model that's trying to solve a math problem?
>> In the large language model context, the observation we have is the question we give to the LLM, the action is the response it returns, and the reward is how good the response is. Typically, for example for a mathematical problem, it is how correct the final answer is for the question.
>> So could you help us walk through the raw formula for the policy gradient? What do the loss function and the gradients mean? Maybe that will help us transition better to the difference between GRPO and Dr. GRPO.
>> Yeah, sure. If we denote pi theta as the policy, then the policy gradient looks like this. Basically we sample trajectories from this pi theta policy and take the expectation over these trajectories. Each trajectory is a sequence of state, action, reward, state, action, reward, and so on, denoted s_t, a_t, r_t. The reward r_t is not visible here because it eventually gets converted into the advantage estimate, denoted A here. So we have the action probability pi theta of a_t given s_t and its advantage estimate A, and what we do for policy gradient is reweight the log probability of the action by its advantage and convert that into a gradient with respect to the model parameters to be updated. And note here that over the horizon of this trajectory, which is capital T, we do the summation over all the time steps instead of the average, which is a hint of why GRPO is biased.
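For reference, the vanilla policy gradient he is describing is commonly written as follows (a standard textbook form, not copied from his slides):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t
    \right]
```

The sum over t runs over the whole trajectory with no division by its length T; a division by the sampled length is exactly the extra factor GRPO introduces and Dr. GRPO removes.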
>> Amazing. Could you go through the formulas for GRPO and Dr. GRPO?
>> Yeah, sure. These two formulas look a little more complex than the previous one because we also add the PPO-style clipping, which is this ratio here. But if we ignore that, it actually looks quite similar to the vanilla policy gradient, except for two differences highlighted in red: this one-over-length and this division by std. This is how we argue GRPO is biased in terms of length and question difficulty. With this one-over-length, the policy gradient is reweighted by the length of the response itself. This reweighting is harmful because in GRPO we optimize the model parameters using a lot of rollouts for a single question. These rollouts have different lengths, and if we reweight the policy gradient by this length, then for correct responses, where all the advantages are positive values, the shorter the response is, the larger this reweighting is, so those positive gradients get upweighted. The overall optimization will be biased towards shorter but correct responses. On the other hand, if the response is incorrect, the advantage will be a negative value, which means for a longer response this reweighting term will be smaller. That means we penalize longer incorrect responses less. So in summary, for both correct and incorrect responses, this one-over-length term pushes the optimization in a biased direction, which is not what we want. And the dividing-by-std term introduces a question-level difficulty bias, because the reward vector, the rewards for response 1 through response G, will be very flat for very easy or very hard questions, nearly all ones or nearly all zeros, so the standard deviation will be quite small. Dividing by this small std will upweight the overall weight of the policy gradient for those questions. So that is how GRPO itself has these two types of biases, and that, especially the length bias, is why people found that if they use vanilla GRPO for optimization they get very weird response length behavior. For Dr. GRPO, what we do is very simple: we just go back to the very basic but correct formulation of policy gradient, so we remove these two bias terms in red, and we hope to do GRPO correctly, or as we call it, GRPO Done Right.
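As a rough sketch of the difference (my paraphrase of the two objectives, ignoring the PPO-style clipping ratio; `token_logps`, `advantages`, and `mask` are hypothetical tensors for one group of G responses padded to length T):

```python
import torch

def grpo_loss(token_logps, advantages, mask):
    """GRPO-style aggregation: per-response mean over tokens (1/|o_i|).
    Advantages are assumed group-normalized, i.e. divided by the group std.
    token_logps: (G, T) log-probs of sampled tokens, advantages: (G,),
    mask: (G, T) with 1 for real tokens and 0 for padding."""
    per_token = -token_logps * advantages.unsqueeze(1) * mask
    lengths = mask.sum(dim=1)                       # |o_i| per response
    return (per_token.sum(dim=1) / lengths).mean()  # <- length bias here

def dr_grpo_loss(token_logps, advantages, mask):
    """Dr. GRPO: sum over tokens with no 1/|o_i|, averaged over the
    (constant) group size. Advantages here are just R_i - mean(R),
    with no division by the group std."""
    per_token = -token_logps * advantages.unsqueeze(1) * mask
    return per_token.sum(dim=1).mean()
```

The divisor in `grpo_loss` depends on each sampled response's length, while in `dr_grpo_loss` it is a constant, so it cannot favor shorter correct responses or under-penalize longer incorrect ones.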
>> Wow, that's amazing, almost to the degree of unbelievable, because GRPO and DeepSeek were being born during that time. I remember when I opened the news it was even at the very top. So why do you think they could make such a bias error? Is it an oversight, or why do you think such a bias wasn't caught by the researchers?
>> Yeah, that's a great question. When we looked at the GRPO paper, which is the DeepSeekMath paper from 2024, we were also quite curious why this happened in the first place. One hypothesis from us is that, for this advantage normalization, what they did is the typical normalization we use in statistics, subtract the average and divide by the standard deviation; perhaps they didn't really consider the biasedness from the RL perspective. And actually, empirically, I don't think this causes a lot of issues until we really scale it up and observe some strange phenomena, so at that time, for their setting, it was perhaps fine empirically. Then for the one-over-length bias, I guess taking the average loss is a quite typical operation in many deep learning implementations or formulations: it looks like, if we sum over all the tokens, then we had better divide by the length of the whole response, the number of tokens we have here. That also seems reasonable from the very beginning, when the authors of GRPO wrote this equation, but in fact, from the RL perspective, especially for policy gradients, it is biased; perhaps they just didn't notice it until this year.
>> For the audience who don't really have a very strong theoretical background, when they look at this algorithm, they may
feel like this normalization makes intuitive sense, because in the loss you are essentially weighting the advantage of each rollout based on the sum of its token log probabilities, and what happens is that rollouts with more tokens get a larger weight even though the length should not really matter to the total loss. So what's wrong with such an intuitive understanding?
>> Yeah, great question. I think the part we might miss is that all these responses are sampled from our policy. RL optimizes the expected reward, and this expectation is over the states and over the policy; over the policy means we sample all the actions from the policy. All these responses, whether shorter or longer, are sampled from the policy, which means they already have different sampling probabilities. For longer responses, we might expect the probability of that exact long response to be smaller than that of a shorter response, so from the expectation point of view it averages out: sampling a longer response has a smaller probability, and summing over a longer response gives a larger loss, but they average out during this expectation process.
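In LLM terms, one way to write down the argument he is making (my paraphrase, not a formula from the slides):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\!\left[
      A(q, o) \sum_{t=1}^{|o|}
        \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})
    \right]
```

The length |o| is itself random and determined by the sampling distribution, so its effect is already accounted for inside the expectation; dividing the inner sum by |o| divides by a quantity correlated with the sample, which is where the length bias comes from.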
>> Got you. Just to zoom out a bit on the whole reinforcement learning training algorithm design: how much of this is theoretical versus practical engineering? Because if we look at the GRPO or PPO algorithms, you have all this clipping and the KL divergence factor that, I assume, already bias the original policy gradient, because you have this very sharp clipping factor. So there seem to already be a lot of practical and engineering factors coming into place that people just find useful even if they are already biased. Does bias versus unbiasedness still matter if our training has already been polluted by these engineering choices?
>> Got you, that's a really great question. It's very hard to be fully unbiased, I would say, especially when we take the engineering and implementation into consideration. Even from a pure optimization point of view, using for example FP16 or BF16 is already a quantization of the real values, which also introduces some bias if we take that into consideration. But I think there are different levels of biases here. What I just described, the numerical bias, is a very low-level bias which we cannot control. The GRPO bias we note here is at a very high level; it is theoretically incorrect, and we can fix it. And what you described just now, the clipping and the first-order approximation in PPO, those kinds of biases are something we choose as a trade-off between bias and variance. So there are different levels, and what we hope to do is fix the very high-level, theoretically incorrect biases to make it unbiased. Some other levels of biases are our deliberate choice for the trade-off, or something we cannot really control or fix; we let those go, and they will not create very harmful results, unlike the GRPO biases, which lead to very bad response length issues.
>> If we
further zoom out from reinforcement learning to the stage of pre-training, or maybe mid-training, or even SFT, I think they also have a somewhat similar problem: when we are performing fine-tuning, responses that are longer naturally get more weight, and in the pre-training stages people also try to balance the different distributions of the data mixture. Do you think something similar is going on with the theoretical parts of those algorithms?
>> Yeah, I think it also depends on the granularity at which we model the data. Here in RL the granularity is the response level: we treat each response as the action, and we decompose this macro action into token-level actions by factorizing the probability. That is why we need to sum up the log probabilities instead of taking the average. For the pre-training stage, from what I understand, the granularity is typically at the token level, so we do next-token prediction. In that sense we just pack all the tokens into a fixed context window for a training batch, and then whether we do the summation or the average, it is fine, because the context window is fixed. If we sum it up, it's fine; if we divide by the length, it is a constant length, so it won't introduce any bias there. So I guess the biases we are talking about here become severe once we go into the RL stage, once we sample these responses and do this kind of GRPO, this kind of contrastive learning; then these biases will emerge and create very harmful results.
>> Amazing. Moving from Dr. GRPO to your next paper, Defeating the Training-Inference Mismatch via FP16. This paper made quite a buzz on Twitter; according to Andrej Karpathy's tweet, your paper attracted a lot of his attention. Could you give us a high-level backstory of this paper?
>> Yeah, actually this is a paper in collaboration with a few amazing colleagues here at Sea AI Lab, and the backstory is that we were doing another project before this one; we didn't even intend to focus on this problem. During that project the training would always collapse, it was always unstable, and we wanted to investigate why this happens, what causes this instability. We narrowed it down and found that this training-inference mismatch matters a lot. During that time, I think the first blog post on truncated importance sampling came out, and we found this issue is actually quite relevant to what we observed for the training instability. We dug a bit deeper and found that precision matters a lot for this training-inference mismatch. Simply switching from BF16 to FP16 can resolve this mismatch issue to a large extent.
>> Awesome. So could you step back and walk through what FP16 versus BF16 is?
>> Oh yeah, maybe to most of us FP16 is a very old technology, because we are all using BF16 by default. Basically they both use 16 bits to represent numbers in computers, but the difference is that they allocate a different number of bits to the exponent and the mantissa. FP16 allocates more bits to the mantissa, which leads to higher precision but a smaller dynamic range for the values that can be represented, while BF16 has a very wide dynamic range but less precision. What we found here is that precision matters more than dynamic range for RL fine-tuning of LLMs.
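A quick way to see the trade-off he describes, using PyTorch's dtype metadata (a small sketch; the numbers are properties of the FP16 and BF16 formats themselves, not of any particular framework):

```python
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # FP16: 1 sign + 5 exponent + 10 mantissa bits -> fine precision, max ~65504
    # BF16: 1 sign + 8 exponent + 7 mantissa bits  -> coarse precision, max ~3.4e38
    print(dtype, "eps =", info.eps, "max =", info.max)

# The spacing between representable values near 1.0 (machine epsilon) is
# about 0.00098 for FP16 but about 0.0078 for BF16, so BF16 rounds much more
# coarsely, which amplifies the rollout-vs-training probability mismatch.
```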
dynamic range. Could you give us like a very high level overview on the numerical stability in the field of maybe not only reinforced learning but also machine learning in general because I think numerical stability is has been
there quite a while. People has all different techniques like residual connections or for vanishing gradients before. Why is this a solved problem?
before. Why is this a solved problem?
>> Yeah, definitely, it has a very long history, and many researchers and engineers have tackled these numerical issues. I guess that is also why BF16 came about. With BF16, what we want is a very large dynamic range while reducing the computation burden, because using FP32 is too heavy and too costly. We want half-precision training, but FP16 doesn't provide enough dynamic range, and that's why we have BF16. That is the history we went through in past years, and it became the default. What is different here is that for RL tuning of LLMs, the mismatch happens because we have very complex infrastructure for both the rollout engine and the training engine. This was not an issue several years ago, when we didn't really do large-scale RL training on large models; we just did SFT, supervised training, on smaller-scale models, and we didn't have separate training and rollout engines. So that is why this numerical issue has come to the surface these days, even though we have a long history of optimizing for it.
>> Could you share with us why we have two different engines even though the model parameters are essentially the same? Isn't that quite wasteful in terms of GPU memory usage?
>> Yeah, I think it's mostly an efficiency consideration. For example, in a typical LLM RL training framework we will have vLLM or SGLang as the rollout engine, which is highly optimized for autoregressive generation, and then we will have, for example, Megatron, FSDP, or DeepSpeed on the training engine side, which is suitable for different kinds of parallelization and acceleration. We want to leverage the good things of both sides and combine them together to get a highly optimized framework overall. That is why we choose to separate the engines, and even more, we may do asynchronous training, keeping both engines busy. That addresses the question you mentioned: the different engines might occupy different GPU memory, but they are not wasteful, because they reside on different GPUs with different memory, yet they are all doing a heavy amount of work and the overall throughput is very high.
>> So why was BF16 so popular before? Why haven't people
used FP16 for all the different training tasks here? Is it that, for reinforcement learning, we already impose a lot of constraints so the model does not deviate too much from the current model, so the range will be small anyway, but the discrepancy turns out to be larger, so we should trade range for more precision? Or are there other, deeper reasons?
>> Yeah, that's a great question. I think previously we used BF16 typically to optimize a network from scratch. In that phase the optimization landscape might be quite complex, and the values occurring during optimization might span quite a large range; that's how BF16 can help with this very large-range optimization process. After all this pre-training from scratch and mid-training, we have a quite stable parameter state where the values are in a smaller range, and for the RL process we typically just fine-tune a little bit, or even only a sub-network of the whole model, so the parameters won't deviate too much. Here what matters more is the numerical mismatch between the different engines, and by using FP16 we have a smaller numerical mismatch, which overall gives us more benefit than the larger dynamic range that BF16 offers.
>> So is it correct that even if FP16 is better for reinforcement learning, for the other stages of large language model training, say pre-training, mid-training, or even SFT, where you don't have two different engines, we should still stick with BF16 for its larger range? What would be your recommendation?
>> Yeah, I agree. I think what you propose is a quite reasonable procedure.
>> And how would quantization, which tries to further save GPU memory
by representing floating point values with integers, interact with this precision discrepancy issue?
>> Oh yeah. First of all, we didn't try FP8 or INT8 training yet, but I think the idea behind it is similar. Whatever quantized, floating point, or integer value representation we have, we will definitely have this training engine and inference engine mismatch, and we can fix it from the algorithm side by doing importance sampling; this is what people are currently doing, and it can fix it to some extent. But we'd better fix this mismatch from the very beginning, from the root cause, and if we use some quantized training, we'd better also align the value representations on both sides to keep the mismatch as small as possible, to avoid any training instability during our process.
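For context, a minimal sketch of the algorithm-side fix he mentions, truncated importance sampling, which reweights the training-engine loss by a capped ratio between training-engine and rollout-engine probabilities; the function and tensor names are hypothetical:

```python
import torch

def tis_policy_loss(train_logps, rollout_logps, advantages, mask, cap=2.0):
    """Truncated importance sampling over tokens.

    train_logps / rollout_logps: (B, T) log-probs of the sampled tokens
    under the training engine and the rollout engine respectively; they
    differ only because of the numerical (e.g. BF16) mismatch.
    """
    # Per-token importance ratio between the two engines, capped at `cap`
    # so a large mismatch cannot blow up the gradient.
    ratio = torch.exp(train_logps - rollout_logps).clamp(max=cap).detach()
    per_token = -train_logps * advantages.unsqueeze(1) * ratio * mask
    return per_token.sum() / mask.sum()
```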
>> Now, what do you think, given that we already have much stronger reasoning, math, and coding models from the frontier labs, and they probably face similar issues: what's your best guess at their solution to this problem before they even saw your paper? Is it that larger models are natively less vulnerable to this discrepancy, or do you think they have a better solution, or maybe they just use FP32 because they have more compute to burn?
>> Yeah, I think for the frontier labs, even if they don't use the FP16 technique, maybe they already have some other technology that can reduce the mismatch. So I guess frontier labs have definitely already fixed this numerical issue. For the open-source community, most people actually observe instability issues, so the issue is there and it is not fixed. But to deliver a good model we don't really need fully stable training, I guess, because whenever the training collapses we can save the checkpoint and restart again, and we have various tricks and engineering methods to make the final model performant, to deliver a model that is good on benchmarks, even if we haven't solved the root cause of the training instability. That is my guess.
>> Okay, that's amazing. And given all the
the current issues with reinforcement learning, I know you have your own reinforcement learning framework, called OAT. Do you mind sharing with us a bit of the backstory: when did you start writing this framework, what problem does it solve, and what is the story behind it?
>> Oh yeah, sure. I guess nowadays there are more and more RL frameworks, and OAT is just a simple attempt for me to create my own and to fulfill some of my own research projects. So I started coding it back in
project. So I started coding uh back in uh May of last year. Uh that is when VR is not so popular and uh same as other
some of the other frameworks. I didn't
intend to create a really large scale IO frameworks but instead I want to create a simple and uh hackable and userfriendly lightweight framework
especially for single node training because I found most of the research like like the academia research focus on single node training like less than 7B 8B model parameters I think that is
enough and I focus friendly more than scalability and the start of the odd development ment is from one of my online alignment project where I want to
implement online DPO and that time I don't find any suitable framework to start and because they they either do not implement online DPO or the efficiency is too low so that's why I
decide to write my own starting from online DPO and then later switch to online PO and online PPO with verifiable reward those are very simple extension based on the very first version of the framework
>> That's amazing. How hard is it to write an RL framework from scratch? What technical challenges have you faced? Could you share with us the step-by-step building process you went through?
>> Yeah, sure. Actually I had a bit of framework experience from several years ago because I worked on deep RL. It is very similar, because the basic building blocks are just the actor and the learner. The actor is where the policy takes actions given some observations. The learner samples from these experiences, the data buffer, and then optimizes the network, and what we have next is to synchronize the weights from the learner to the actor; that completes the whole loop of your framework. The components are already there and I knew this before, and what I needed was to fill in these components with modern LLM infrastructure. So I put vLLM as the actor, I put DeepSpeed as the learner, and we combine them together to make the full loop workable. Basically that is how it went, and during this process I also referred to many amazing infrastructure works, for example TRL, OpenRLHF, and of course vLLM and DeepSpeed. Without these prior infrastructure works I don't think it would have been possible for me to write OAT by myself.
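A highly simplified skeleton of the actor-learner loop he describes (hypothetical class and method names, not OAT's actual API): the actor wraps an inference engine to generate rollouts, the learner wraps a training engine to update the policy, and the weights are synced back each iteration.

```python
# Hypothetical skeleton of the actor/learner loop, not OAT's real API.

class Actor:
    """Wraps an inference engine (e.g. vLLM) to generate rollouts."""
    def generate(self, prompts):
        ...  # sample responses + log-probs with the current weights

    def load_weights(self, state_dict):
        ...  # receive synced weights from the learner

class Learner:
    """Wraps a training engine (e.g. DeepSpeed) to optimize the policy."""
    def step(self, rollouts):
        ...  # compute rewards/advantages, run the policy-gradient update

    def state_dict(self):
        ...  # export current weights for the actor

def train(actor: Actor, learner: Learner, prompt_loader, num_iters: int):
    for _ in range(num_iters):
        prompts = next(prompt_loader)
        rollouts = actor.generate(prompts)         # 1. act
        learner.step(rollouts)                     # 2. learn
        actor.load_weights(learner.state_dict())   # 3. sync, closing the loop
```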
>> That's amazing. How would you compare OAT against veRL? veRL is kind of the current most popular reinforcement learning framework, built by ByteDance and backed by a large industry team. So if I'm a new reinforcement learning researcher trying to pick a framework, what would you recommend?
>> Oh yeah, it's a hard question, but I think the comparison is relatively simple. I know veRL is backed by a group of full-time people, perhaps at ByteDance, and its development is very fast; they incorporate many new technologies in a very timely manner. Compared to veRL, I couldn't really maintain OAT as often. So veRL has many features and supports very large-scale multi-node training; it is more scalable, of course, but as a result their codebase is more complex as well. Typically a beginner will complain about veRL's complexity, too many layers of abstraction, and that it is very hard to customize some algorithms or low-level stuff. In that sense, OAT might be a friendlier start for new joiners, because OAT has very clear modular components: what you need is to write your sampling logic and your training logic in different classes and then combine them together, and you have a training script.
And I think OAT and veRL are not really in competition with each other. For a new beginner I actually don't recommend either specifically; I would like new beginners to first learn the overall concepts and then pick some runnable script to get a feel for the training. You need to look at the critical path of your code: for your runnable script, where does your rollout happen and where does your training happen? You need to locate that code, and you will find a very minimal critical path out of a very large codebase; then what you need to focus on is only that critical path and the necessary modifications, and that is all. So choosing OAT or veRL doesn't matter that much.
>> What do you mean by a runnable script? Do you mean doing all this reinforcement training in a pretty offline fashion? We first load the rollout engine and all the model parameters, do the rollouts, free the GPU memory, get onto the training engine, send the training data to it, free the memory again; is that the runnable script you were thinking about?
>> Yeah, something like that. I think there are some educational resources for RL training, but I can't remember the name. I strongly recommend new beginners to start from those.
>> Final question: what do you think would be the most important problems for large language model training? I want two answers. One is for the next one year,
another is for the next 10 years.
>> Okay, that is a very broad question. For the next one year, a disclaimer: I don't have a very definite answer for this, because in LLM training many people, many top researchers, are working on it, and the future is very unpredictable. I think for me a very important problem is creating good benchmarks for what we really want from LLMs. We know the LLM is a very good technology and impacts society a lot, but what the frontier labs are currently pushing is mostly maximizing the benchmarks. Are these benchmarks realistic enough for the real-life, real-world tasks human beings are really facing? Perhaps not very closely. So creating very realistic benchmarks is a high priority from my side, because benchmarks are also reward functions, for either the RL of LLMs or the RL of us human beings, and we also do reward hacking. So creating correct and realistic benchmarks is of quite high importance for me for the next few years.
>> What about 10 years?
>> For 10 years, maybe for the whole community to figure out how we can leverage the power of LLMs to benefit humanity, either to do frontier scientific discovery or, I don't know, to save energy, to improve different levels of efficiency in society. Yeah, it's more high-level stuff that the AI community as a whole needs to think about, I think.
>> Dr. Thank you so much for your time.