RLHF, PPO & GRPO Explained: A Top-Down Guide to LLM Policy Optimization
By Mei Li
Summary
Topics Covered
- SFT teaches imitation, not judgment
- KL penalty anchors policy against drift
- Baselines separate quality from prompt difficulty
- GRPO replaces the value model with group statistics
Full Transcript
Hello, welcome to my YouTube channel. In
this video, we will build a top-down understanding of LLM policy optimization:
why RLHF is needed, how it works, and how two important methods, PPO and GRPO, fit into the picture. My goal is to make the
explanation as self-contained as possible. We'll start from the core
intuition, connect it to the math, and then work through one step of the training loop for each method. By the
end of the video, you should have a clear mental model of how PPO and GRPO optimize LLM policies. Let's get
started.
Before we jump in, we first need to understand why supervised fine-tuning alone is not enough.
In supervised fine-tuning, or SFT, we give the model a prompt together with a single reference answer and train it to
reproduce that answer token by token.
The only signal is essentially copy this.
This is useful, but notice what it leaves out. SFT only ever shows one good
answer per problem. It never shows a comparison.
So, it can teach the model to imitate a demonstration, but not to judge which answer is better. What we actually need
is this: given the same prompt and two
candidate answers A and B, the model should learn that humans prefer A over B.
For large language models, this distinction is very important. To make a model more helpful, honest, and safe, we need preference feedback.
This kind of preference feedback, learning from comparisons rather than demonstrations, is a core motivation for RLHF,
or reinforcement learning from human feedback.
This naturally brings us to another area of machine learning called reinforcement learning.
The idea behind reinforcement learning is very similar to how humans learn.
Since we were babies, we have learned by interacting with the world around us: trying things, observing the results,
and gradually learning which actions lead to better outcomes.
More formally, RL is a type of machine learning technique where an agent learns to make optimal decisions by interacting
with an environment and receiving feedback via rewards or penalties, as shown in the diagram. Starting from
state S, the agent takes an action A.
The environment then returns a reward R and transitions to a new state S prime.
From S prime, the agent picks the next action and the loop repeats.
The goal of the agent is to learn a behavior that maximizes the total accumulated reward over time.
Before we map reinforcement learning to language models, let's define a few key RL terms. First, the policy describes how the
agent chooses actions.
A trajectory is the sequence of states, actions, and rewards produced as the agent interacts with the environment.
You will also often hear the word rollout. A rollout is basically a
trajectory generated by running the current policy.
So trajectory describes the sequence itself, while rollout emphasizes the process of sampling it from the policy.
The return is the total future reward from a given point onward. This is the quantity the agent ultimately wants to maximize.
The value of a state is the expected future reward from that state. In other
words, it tells us how good we expect things to be from here.
Finally, the advantage tells us whether a particular action was better or worse than what we expected from that state.
This will become very important when we discuss PPO later.
Now let's map reinforcement learning to the case of large language models.
In LLM reinforcement learning, the policy is the language model itself.
Given the current context, the model defines a probability distribution over the next token.
The state is a prompt together with the tokens generated so far. We can think of it as the current context seen by the model.
The action is the next token sampled from the model.
For example, suppose the prompt is "Where is Shanghai?" and the model has already generated "Shanghai is in". At this point, the
state is the full context, "Where is Shanghai? Shanghai is in", and the next action could be "China".
After the model chooses that action, the state changes. Now the generated prefix
includes "China", and the model uses this
new context to choose the next token.
One clarification before we continue. In
this slide and in the examples later, I
will often write tokens as if each word is one token. This is only for
simplicity. In real language models, tokenization is more complicated. A
token can be a full word, part of a word, punctuation, or even include white space. But conceptually, the RL
formulation is the same.
So, LLM generation can be viewed as a sequence of decisions. At each step, the model observes the current context,
chooses the next token, and then moves to a new state. In RLHF,
we optimize this policy so that the generated answers receive higher rewards according to human preferences.
So how do we usually apply reinforcement learning to language models?
A classical example is InstructGPT from OpenAI, which introduced the RLHF pipeline for aligning language models
with human preferences.
RLHF usually contains three main stages.
The first stage is supervised fine-tuning or SFT. Here we collect demonstration data and train a supervised policy.
The second stage is reward modeling. In
this stage, we collect comparison data where humans compare different answers to the same prompt, and then we train a
reward model to predict which answer is
preferred. The third stage is
reinforcement learning. Here we optimize
the policy model against the reward model so that the model learns to generate answers that receive higher
preference scores.
So in summary, the RLHF pipeline is supervised fine-tuning, reward modeling, and reinforcement
learning. Let's go through these steps
one by one.
Most of us are already familiar with the pre-training and supervised fine-tuning stages of language models. During
pre-training, we train the model on a very large collection of text data such as web pages, books, code, articles,
and many other sources.
The objective is simple. Given the
previous tokens, predict the next token.
Supervised fine-tuning, or SFT, is teacher-forced next-token prediction on a
reference answer. It has the same objective as
pre-training, but the data is prompt and answer pairs instead of arbitrary text.
The goal is to make the model more useful as an assistant by teaching it how to answer questions or follow instructions.
Suppose q is the prompt and o is the
target answer. We concatenate the tokens
from q and o and feed the full sequence into the language model.
At each position, the model predicts the
next token. So the target sequence is
basically the input sequence shifted by one token.
This shifted setup is exactly what teacher forcing means. At every position, the model is shown the true previous token from the reference answer as input
rather than its own previous predictions.
So even if the model would have generated a different token, training continues along the reference sequence.
More specifically, the training objective is the standard cross entropy loss or equivalently the negative log
likelihood of the target tokens.
For each position, the model predicts a probability distribution over the vocabulary and we want it to assign high
probability to the correct next token.
The important difference from pre-training is that in supervised fine-tuning, we usually compute this loss
only on the answer tokens.
The prompt tokens are used as context, but they are masked out from the loss.
So the SFT loss encourages the model to generate the target answer conditioned on the prompt.
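The masked loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the function name `sft_loss`, the boolean mask layout, and the example probabilities are all assumptions made for this sketch.

```python
import math

def sft_loss(token_log_probs, is_answer_token):
    """Mean negative log-likelihood over answer tokens only.

    token_log_probs[i] is the log-probability the model assigned to the
    correct next token at position i. is_answer_token[i] is True when that
    target token belongs to the answer; prompt targets are masked out.
    """
    masked = [-lp for lp, keep in zip(token_log_probs, is_answer_token) if keep]
    if not masked:
        raise ValueError("no answer tokens to train on")
    return sum(masked) / len(masked)

# Hypothetical example: 2 prompt targets followed by 3 answer targets.
log_probs = [math.log(0.1), math.log(0.2),
             math.log(0.5), math.log(0.25), math.log(0.8)]
mask = [False, False, True, True, True]
loss = sft_loss(log_probs, mask)
print(round(loss, 4))
```

Only the last three positions contribute, so the low probabilities on the prompt tokens do not affect the loss.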
For reward modeling, the first thing we need to do is to collect comparison
data. The idea is simple. For the same
prompt, we generate multiple candidate answers.
We can do this by sampling from the same model several times with a nonzero temperature or by collecting answers from different models.
Then we ask a human labeler to compare these answers and rank them from best to
worst. For example, suppose the prompt
is "Explain the moon landing to a six-year-old."
We generate four different answers A, B, C, and D. Here, four answers is just an
example. In practice, it is usually a
small number, often somewhere between four and nine. After reading them, the
human labeler ranks them from best to worst as C, then A, then D, and finally B.
From this ranking, we can construct pairwise preference examples.
Since there are four answers, we get six
pairs. C is preferred over A. C is
preferred over D. C is preferred over B.
A is preferred over D. A is preferred
over B. And D is preferred over B.
These pairwise comparisons are the training data we will use to train the reward model.
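As a small sketch of how a ranking expands into pairs, the snippet below uses `itertools.combinations` from the Python standard library. The helper name `ranking_to_pairs` is made up for this example.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Turn a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["C", "A", "D", "B"])
for chosen, rejected in pairs:
    print(f"{chosen} is preferred over {rejected}")
```

A ranking of K answers yields K * (K - 1) / 2 pairs, which is six pairs for K = 4.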
Now let's see how we train the reward model using the preference data we collected.
The reward model takes a prompt Q and an answer O and outputs a single scalar reward score.
Intuitively, a better answer should receive a higher reward score.
In practice, the reward model is usually initialized from the SFT model.
We reuse the language model backbone, remove the original language model head that predicts the next token, and add a
reward model head instead.
This head converts the hidden state of the final token into one scalar reward score for the whole answer.
For each preference pair, we have two
answers: the chosen answer, which humans
prefer, and the rejected answer, which humans prefer less. The reward model
gives us two scores: one score for the chosen answer and one score for the
rejected answer. We want the chosen
answer to get a higher score than the rejected answer.
To train this, we use the Bradley-Terry
model. The Bradley-Terry model is a
probabilistic model for pairwise
comparison. It says that if two
candidates have scalar scores, then the probability that one candidate is
preferred over the other depends on the difference between their scores.
In our case, the probability that the chosen answer is preferred over the rejected answer is computed as the
sigmoid of the reward difference:
the reward of the chosen answer minus the reward of the rejected answer.
If the reward of the chosen answer is much larger than the reward of the rejected answer, the difference is
very positive, the sigmoid is close
to one, and the loss is close to zero.
This means the reward model is already assigning the correct preference.
If the reward of the chosen answer is smaller than the reward of the rejected answer, the difference is
negative. The sigmoid is close to
zero and the negative log loss becomes
very large. So the model is strongly
penalized.
So the reward model loss is the negative log probability of the human preference pairs under the Bradley-Terry model.
In practice, we compute this loss for each chosen rejected pair and average it over the batch.
One practical detail is that different prompts in the training data set may produce different numbers of preference
pairs. To handle this, we may first
average the loss over the pairs from the same prompt and then average across prompts.
This prevents prompts with more candidate answers from dominating the batch.
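Here is a minimal sketch of the pairwise loss and the two-level averaging just described, using only the standard library. The function names and the grouping of pairs by prompt are assumptions for illustration; a real reward model would compute the scores with a neural network and backpropagate through them.

```python
import math

def pair_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(x)) = log(1 + exp(-x)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def reward_model_loss(pairs_by_prompt):
    """Average within each prompt first, then across prompts."""
    per_prompt = [
        sum(pair_loss(c, r) for c, r in pairs) / len(pairs)
        for pairs in pairs_by_prompt
    ]
    return sum(per_prompt) / len(per_prompt)

# Hypothetical (chosen, rejected) scores for two prompts.
batch = [
    [(2.0, 0.5)],                            # prompt 1: one pair
    [(1.0, 0.0), (1.0, -1.0), (0.0, -1.0)],  # prompt 2: three pairs
]
print(round(reward_model_loss(batch), 4))
```

With this grouping, the prompt with three pairs carries the same weight as the prompt with one pair.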
Now that we know how to train a reward model, we can move to the final stage of RLHF: reinforcement learning.
In this stage, we use the reward model as a training signal to optimize the language model policy.
The goal is to make the model generate answers that receive higher reward scores.
In the rest of this video, I will introduce two reinforcement learning methods commonly used for LLM policy optimization.
The first one is PPO, which stands for proximal policy optimization.
PPO is the more classical and widely used RLHF algorithm.
The second one is GRPO, which stands for group relative policy optimization.
GRPO is simpler in some ways because it removes the value model and uses group relative rewards instead.
Let's start with PPO.
There are four models involved in PPO.
The first one is the policy model. This
is the language model we want to optimize. For an LLM, the policy gives us the probability of the next token conditioned
on the prompt and the tokens generated so far.
The second one is the reference model.
This is usually a frozen copy of the model before RL training. It anchors the policy to what it learned during SFT,
preventing it from drifting too far and losing its general capabilities.
The third one is the reward model. This is
the model we trained in the previous
step. It gives a reward score for the
full answer generated by the policy model.
The fourth one is the value model. The
value model predicts the expected future reward from the current state, and it is used to compute the advantage.
At a high level, PPO works like this.
First, the policy model generates an answer o from a prompt q. Next, we
compute a KL penalty by comparing the current policy to the reference model.
We also score the answer using the
reward model. The KL penalty and the
reward score are then combined into the final reward signal R.
Separately, the value model produces value estimates V. The reward signal R and the value estimates V are then used
to compute the advantage A with a method called GAE. Finally, this advantage signal tells us how to update the policy
model.
As we mentioned earlier, in PPO the
reward signal usually combines two
parts: the final reward model score r_phi(q, o)
and a per-token KL penalty.
The reward model gives one score for the whole generated answer. But in LLM PPO, each generated token is treated as an
action. So we need token-level
quantities such as log probabilities, KL penalties, and advantages.
Since the reward model score is only available for the complete answer, we add it at the final token.
For all tokens before the last token, the step reward is just minus beta times the per-token KL estimate.
For the final token, the reward is the reward model score for the full answer minus beta times the per-token KL
estimate on the last token.
That per-token KL estimate is computed with the k1 estimator.
For each generated token, we compare its log probability under the current policy pi_theta with its log probability under the
reference model pi_ref, given the same context. Here beta
controls the strength of the KL penalty.
A larger beta keeps the policy closer to the reference model while a smaller beta allows the policy to move more
aggressively toward answers with a higher reward.
So the KL penalty acts as an anchor to the reference policy. It prevents the current policy from drifting too far during reinforcement learning
optimization.
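The per-token reward described above can be sketched as follows. This is an illustration under stated assumptions: the function name `token_rewards` and the example log probabilities are hypothetical, and the k1 estimate is taken here as the log-probability difference between the policy and the reference model for the sampled token.

```python
def token_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token rewards: -beta * KL_t everywhere, plus rm_score on the last token."""
    # k1 estimate for each sampled token:
    # log pi_theta(o_t | q, o_<t) - log pi_ref(o_t | q, o_<t)
    kl = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    rewards = [-beta * k for k in kl]
    # The reward model score exists only for the full answer,
    # so it is added at the final token.
    rewards[-1] += rm_score
    return rewards

# Hypothetical log probabilities for a three-token answer.
logp_policy = [-1.0, -2.0, -0.5]
logp_ref = [-1.5, -2.0, -1.0]
rewards = token_rewards(logp_policy, logp_ref, rm_score=2.0, beta=0.1)
print([round(r, 4) for r in rewards])
```

Raising `beta` makes every token's penalty larger, which pulls the policy back toward the reference model.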
Now let's look at the value model in more detail.
The value model estimates the expected future return from the current state s_t, before taking action a_t.
In the LLM setting, the state s_t is the prompt together with the tokens generated so far. The action a_t is the next
token.
Architecturally, we can think of the value model as an LLM backbone with a value head on top. This value head
predicts a scalar value for each generated token position.
In practice, this does not always mean we train a completely separate model just for value prediction.
Very often, the policy model and the value model share the same transformer
backbone. The policy head predicts the
next token distribution while the value head predicts the scalar value. So
conceptually we talk about a policy model and a value model, but implementation-wise they may share most of their parameters.
The return from step t, written as G_t, is defined recursively as the current reward R_t plus gamma times the future
return G_(t+1). Equivalently, it is the discounted sum of future rewards from this step onward. Here, gamma is a
discount factor. Intuitively, this is
like preferring to receive money sooner rather than much later. Future rewards
are still valuable, but they may be discounted.
In standard reinforcement learning, gamma is usually between zero and one. A
smaller gamma discounts future rewards more strongly, while a gamma close to one preserves more of the future reward.
In many LLM RLHF settings, gamma is often set close to one or even exactly one.
The reason is that the main reward often comes from the final complete answer.
But earlier tokens are also important because they determine the structure, direction, and reasoning path of the
answer. So we usually do not want to
discount the final reward too much when assigning credit to earlier tokens.
Also, unlike some traditional RL problems, the generated answer has a finite length.
So using gamma close to one is usually reasonable.
So the value model V applied to state s_t estimates the expected return G_t from that state.
Intuitively, this means: given the partial
answer so far, how good do we expect
the final outcome to be? To train the value model, we fit its predictions to
the returns. The value model loss is the
mean squared error between the predicted value V(s_t) and the target return
G_t, averaged over the generated tokens.
The value estimate will later help us compute the advantage, which tells us whether the sampled token was better or worse than expected.
Let's look at a simple example to better understand the value model.
Suppose the prompt q is "Explain gravity simply" and suppose the generated answer o is "Gravity pulls objects
together". Here we continue using the
simplification that each word is one token.
So this answer has four tokens.
Now suppose the reward model score for this prompt-answer pair, r_phi(q, o), is 2.
We also have per-token sampled KL estimates of 0.03, 0.02, 0.04, and 0.01.
If we set beta = 1, then the step rewards are calculated like this.
For the first three tokens, the reward is only the negative KL penalty. So we
get R_1 = -0.03, R_2 = -0.02,
and R_3 = -0.04.
For the final token, we add the reward model score for the complete answer. So
R_4 = 2 - 0.01 = 1.99.
Next, suppose the discount factor gamma
equals 1. This means we do not discount
the future rewards.
We can compute the returns backwards from the final token.
First, G_4 = R_4 = 1.99.
Then G_3 = R_3 + G_4 = -0.04 + 1.99, which equals 1.95.
G_2 = R_2 + G_3 = -0.02 + 1.95 = 1.93.
And finally, G_1 = R_1 + G_2 = -0.03 + 1.93,
which equals 1.90.
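The arithmetic above can be reproduced with a short backward loop. This is a sketch using the numbers from the example; the helper name `discounted_returns` is made up for illustration.

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute G_t = R_t + gamma * G_(t+1), working backwards from the last token."""
    returns = [0.0] * len(rewards)
    running = 0.0  # the return after the final token is zero
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Numbers from the worked example: beta = 1, reward model score = 2.
kl = [0.03, 0.02, 0.04, 0.01]
beta, rm_score = 1.0, 2.0
rewards = [-beta * k for k in kl]
rewards[-1] += rm_score

returns = discounted_returns(rewards, gamma=1.0)
print([round(g, 2) for g in returns])
```

Rounded to two decimals, the loop gives the same returns as the hand calculation, listed from the first token to the last.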
These returns become the training targets for the value model. At each position, the value head predicts a scalar value,
and we want that prediction to match the corresponding return G_t.
Plugging this in, the value model loss is the mean squared error between each predicted value
V(s_t) and its target return G_t,
here averaged over the four tokens of this answer.
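As a sketch of that loss, the snippet below computes the mean squared error against the four returns from the example. The predicted values are hypothetical numbers chosen only to make the arithmetic concrete; a real value head would produce them.

```python
def value_loss(predicted_values, target_returns):
    """Mean squared error between V(s_t) and G_t, averaged over tokens."""
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, target_returns)]
    return sum(errors) / len(errors)

returns = [1.90, 1.93, 1.95, 1.99]    # targets from the worked example
predicted = [1.50, 1.80, 2.00, 1.90]  # hypothetical value-head outputs
loss = value_loss(predicted, returns)
print(round(loss, 6))
```

The first position contributes most of the loss here because its prediction is furthest from its target.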
Now let's introduce GAE, which stands for generalized advantage estimation.
Before we define GAE directly, let's first set up the reinforcement learning objective.
Suppose tau is a trajectory. It contains
a sequence of states and actions: s_0,
a_0, s_1, a_1, and so on. This trajectory is generated by following the policy pi_theta.
At each time step t, the action a_t is sampled from the policy conditioned on the current state s_t.
After the agent takes action a_t, the
environment transitions to the next state s_(t+1) according to the
environment transition probability P. At each step, the agent also receives a reward R_t, which depends on the current
state, the action, and the next state.
The goal of reinforcement learning is to maximize the expected return.
We write this objective as J(theta). It is
the expected accumulated reward over trajectories sampled from the current policy pi_theta.
In other words, we want to adjust the policy parameters theta so that, on average, the trajectories generated by
the policy receive higher total reward.
Now let's compute the gradient of this objective.
Our objective J(theta) is the expected return over trajectories sampled from the policy.
By the definition of expectation, we can write this as an integral over all possible trajectories of the probability of the trajectory, p_theta(tau),
multiplied by the return of that trajectory, G(tau).
Now we want to take the gradient with respect to theta.
To do this, we use the log-derivative identity, which follows from applying the chain rule to the log probability.
The gradient of a probability equals the probability itself multiplied by the gradient of the log probability.
So the gradient of p_theta(tau) becomes p_theta(tau) times the gradient of log p_theta(tau).
After applying this identity, the gradient becomes an integral over all possible
trajectories of the probability of the
trajectory, p_theta(tau), multiplied by the gradient of the log probability of the trajectory,
multiplied by the return G(tau).
Now we expand the probability of a trajectory.
A trajectory starts from an initial
state. Then at each step the policy
chooses an action and the environment moves to the next state. So the
probability of the trajectory is the probability of the initial state, p(s_0), multiplied by all the policy
probabilities pi_theta(a_t | s_t), multiplied by all the environment transition probabilities
P(s_(t+1) | s_t, a_t).
If we take the log of this trajectory probability, the products become sums. So we get three parts: the log
probability of the initial state, plus the sum of the log policy probabilities, plus the sum of the log environment transition probabilities.
But only the policy depends on
theta. The initial state distribution
and the environment transition probability do not depend on theta.
Therefore, when we take the gradient with respect to theta, those two parts disappear, leaving the gradient of the
sum of log pi_theta(a_t | s_t).
Since the gradient distributes over the sum, this becomes the sum over time of the gradient of log pi_theta(a_t |
s_t).
Putting this back into the objective, we get the policy gradient. It is the integral over all possible trajectories of
three terms multiplied together: the
probability of the trajectory p_theta(tau), the sum of the gradients of the log action probabilities, and the return
G(tau). This is equal to the expectation of G(tau) multiplied by the sum of the gradients of the log action probabilities.
We can tighten this, and more generally weight each term, by replacing the full trajectory return G(tau) with a per-step signal Psi_t.
The policy gradient then becomes the expectation of a sum over time where each gradient of the log action
probability is weighted by its own Psi_t.
So the key question becomes: what should this weighting signal be?
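For readers who prefer to see the derivation written out, here is one way to typeset the steps just described. The symbol names, including G(tau) for the trajectory return and Psi_t for the per-step weight, are a reading of the spoken notation and may differ from the slides.

```latex
\begin{aligned}
J(\theta)
  &= \mathbb{E}_{\tau \sim p_\theta}\big[ G(\tau) \big]
   = \int p_\theta(\tau)\, G(\tau)\, d\tau \\[4pt]
\nabla_\theta J(\theta)
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G(\tau)\, d\tau
  \qquad \text{since } \nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta \\[4pt]
\log p_\theta(\tau)
  &= \log p(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t)
   + \sum_t \log P(s_{t+1} \mid s_t, a_t) \\[4pt]
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim p_\theta}\Big[ G(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
\end{aligned}
```

Replacing the full return with a per-step weight gives the general form used in the rest of the discussion:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta}\Big[ \sum_t \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
```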
Now let's connect this to advantage estimation.
Policy gradient methods estimate the gradient of the expected return by weighting each action's gradient with a per-step
signal Psi_t.
A good choice of this weighting signal is the advantage A^pi_theta(s_t, a_t).
It is defined as Q^pi_theta(s_t, a_t) minus V^pi_theta(s_t). This choice reduces
variance by subtracting a baseline.
Here Q^pi_theta(s_t, a_t) is the expected return if we take action a_t at state s_t
and then continue following policy pi_theta.
V^pi_theta(s_t) is the expected return from state s_t, averaging over actions sampled from policy pi_theta.
So the value function acts as a baseline.
Therefore the advantage tells us how much better than average is action at in state ST according to the current policy.
This baseline is important because it reduces variance.
For example, imagine two prompts. Prompt
A is easy, so most completions get high
scores. Prompt B is hard. Even the best
completions may get relatively low scores.
If we only look at raw reward, the easy prompt may always look good and the hard prompt may always look bad. But what we really want to know is whether a
completion is better or worse than expected for that specific prompt.
By subtracting a baseline, the advantage signal reflects better than expected quality, not just prompt difficulty.
In practice, however, the true advantage function is unknown. So, we need to
estimate it. This is where GAE comes in.
Now, let's see how we estimate the advantage with GAE.
First, we define the temporal difference residual, or TD residual for short. The TD
residual is a one-step estimate of the
advantage. It is written as delta_t, and
it compares two quantities:
what the value model predicted before, and what we observe after taking one
step. More specifically, delta_t equals
the reward at step t plus gamma times the value of the next state minus the
value of the current state.
Intuitively, this tells us whether the outcome was better or worse than what the value model expected.
If delta_t is positive, the result was better than expected. If it is negative,
the result was worse than expected.
GAE combines these TD residuals across multiple future steps. It takes an exponentially weighted sum of future TD
residuals.
The recursive form is the one we usually
implement. The advantage at step t
equals delta_t plus gamma times lambda
times the advantage at the next step.
Here lambda controls the bias-variance tradeoff.
When lambda equals zero, GAE becomes pure one-step TD estimation. This has lower variance but higher bias because it
only looks one step ahead. When lambda
equals one, GAE becomes closer to Monte Carlo estimation.
This has lower bias because it uses the full rollout, but higher variance because the estimate depends on many sampled future rewards. In practice,
lambda is often set to a value like 0.95, which gives a balance between bias and variance.
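The recursion can be sketched as a single backward pass. This is a minimal illustration, not a reference implementation: the function name `gae`, the example values, and the choice to treat the value after the final token as zero are assumptions made for this sketch.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one finished answer.

    values[t] is V(s_t). The value after the final token is assumed
    to be 0 because the episode ends there.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = R_t + gamma * V(s_(t+1)) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursive form: A_t = delta_t + gamma * lambda * A_(t+1)
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

rewards = [-0.03, -0.02, -0.04, 1.99]  # step rewards from the earlier example
values = [1.50, 1.80, 2.00, 1.90]      # hypothetical value-head outputs

one_step = gae(rewards, values, gamma=1.0, lam=0.0)     # pure TD residuals
monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)  # return minus value
print([round(a, 2) for a in one_step])
print([round(a, 2) for a in monte_carlo])
```

With lambda = 0 each advantage is just its own TD residual, and with lambda = 1 and gamma = 1 each advantage equals the return G_t minus V(s_t).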
Finally, we now have all the building blocks needed to apply PPO to a language model.
The PPO objective is something we want to maximize. At a high level, we want to increase the probability of tokens with
positive advantage and decrease the probability of tokens with negative advantage.
The formula may look complicated at first, but the idea is simple. We want
to update the current policy using rollouts generated by the old policy while making sure the update is not too
large. But why do we use rollouts from
the old policy instead of the current
policy? We will explain that next when
we discuss the importance ratio.
In this objective, we sample a prompt q from the prompt data set and then sample an answer o from the old policy
pi_theta_old. For each generated token, we
compute two surrogates:
an unclipped one and a clipped one. The
unclipped surrogate is the importance ratio r_t(theta) multiplied by the advantage
A_t. Here r_t(theta) is the probability of the
current action under the current policy divided by the probability of the same
action under the old policy:
pi_theta(a_t | s_t) over pi_theta_old(a_t |
s_t). In the LLM setting, this becomes
pi_theta(o_t | q, o_<t) over
pi_theta_old(o_t | q, o_<t).
The clipped surrogate is similar, but we clip the importance ratio so that it cannot go below 1 - epsilon and
cannot go above 1 + epsilon.
PPO takes the minimum of the two surrogates.
This clipping is the key idea of PPO. It
prevents the new policy from moving too far away from the old policy in one update.
Finally, the 1/|o| factor averages this objective over the tokens of the answer, and the outer expectation
averages over the batch of prompts.
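Written out, the objective described in words above looks roughly like this. The notation is a reconstruction from the spoken description, so symbol names such as D for the prompt data set are assumptions and may differ from the slide.

```latex
J_{\mathrm{PPO}}(\theta)
  = \mathbb{E}_{q \sim \mathcal{D},\; o \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
    \left[
      \frac{1}{|o|} \sum_{t=1}^{|o|}
      \min\!\Big(
        r_t(\theta)\, A_t,\;
        \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t
      \Big)
    \right],
\qquad
r_t(\theta)
  = \frac{\pi_\theta(o_t \mid q, o_{<t})}
         {\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})}
```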
Here A_t is the advantage term we introduced earlier in the GAE section. It
tells us whether this sampled token was better or worse than expected.
If this formula still feels abstract, don't worry. In the next few slides, we
will break down the three important
parts: the importance ratio, clipping,
and how the advantage A_t is computed in practice.
So why do we generate the rollout using the old policy instead of the current policy?
Well, the reason is efficiency. In pure
on-policy training, every time the policy changes, we would need to generate fresh rollouts from the new policy. But
for large language models, rollout generation is expensive, especially for long answers or long-horizon
tasks. So in PPO we first make a
frozen copy of the current policy and call it the old policy, denoted as
pi_theta_old. Then we use this old policy to
generate a batch of rollouts. After the
rollouts are generated, we optimize the current policy pi_theta using this fixed batch of data. This means we allow a
small, controlled off-policy mismatch.
So here is the problem. The rollout came from the old policy pi_theta_old, but the model we are updating is the current policy pi_theta.
To correct this mismatch, we multiply the advantage A_t by the importance ratio r_t(theta).
This ratio compares how likely the sampled token is under the current policy versus under the old policy. Multiplying
by this ratio corrects the mismatch between the policy that generated the rollouts and the policy being updated.
But we don't want this ratio to become too large or too small. That is why PPO introduces clipping.
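In code, the ratio is usually computed from stored log probabilities rather than raw probabilities. The sketch below assumes the per-token log probabilities of the sampled tokens are already available as plain lists; the function name `importance_ratios` and the example numbers are hypothetical.

```python
import math

def importance_ratios(logp_current, logp_old):
    """r_t = pi_theta(o_t | ...) / pi_theta_old(o_t | ...), via log probabilities."""
    # exp(log a - log b) = a / b, which avoids dividing tiny probabilities.
    return [math.exp(c - o) for c, o in zip(logp_current, logp_old)]

# Hypothetical per-token probabilities for a three-token answer.
logp_old = [math.log(0.20), math.log(0.50), math.log(0.10)]
logp_current = [math.log(0.30), math.log(0.50), math.log(0.05)]

ratios = importance_ratios(logp_current, logp_old)
print([round(r, 2) for r in ratios])
```

A ratio above one means the current policy now favors that token more than the old policy did, and a ratio below one means it favors it less.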
Now let's take a closer look at what PPO clipping is doing. This plot shows the objective for a single time step t. The
horizontal axis is the importance ratio r, which compares the probability of the token under the current policy with the
probability under the old policy.
The red dot indicates where the PPO update starts on this curve. At this
initial point, the ratio r equals one because before the update, the current policy is the same as the old policy.
Now consider the case when the advantage is positive.
A positive advantage means this token was better than expected. So we want to encourage the model to produce it more
often. To maximize this objective, PPO
wants to increase the importance ratio,
meaning the current policy should assign a higher probability to this token than the old policy. But PPO does not want
this increase to go too far. Once r is larger than 1 + epsilon, the objective becomes clipped.
So increasing the probability even more gives no additional gradient signal. On
the other hand, if r is much smaller than 1 - epsilon, it means the probability of a good token has dropped
too much. In that region, we keep the
unclipped objective active, so the
gradient can push the ratio back up.
Now consider the case when the advantage is negative.
A negative advantage means this token was worse than expected, so we want to discourage the model from producing it.
Since A_t is negative, the surrogate r times A_t becomes less negative when r decreases.
So to maximize the objective, PPO wants to decrease the importance ratio, meaning the current policy should assign a lower probability to this token than the old policy.
But again, PPO does not want the update to go too far. Once r is smaller than 1 minus epsilon, the objective becomes clipped.
So reducing the probability even more gives no additional benefit. And if r is larger than 1 plus epsilon, it means the probability of a bad token is still too high. In that region, the unclipped objective remains active, so the gradient can push the ratio back down.
So the key idea is: PPO clipping allows useful updates, but it prevents the policy from changing too aggressively in one optimization step.
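To make the clipping behavior concrete, here is a minimal sketch of the per-token clipped objective in plain Python. The function name `clipped_surrogate` and the default `epsilon=0.2` are assumptions chosen for illustration, not values taken from this video.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-token clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    epsilon=0.2 is an assumed illustrative default.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the objective stops growing once r exceeds 1 + epsilon,
# but it still tracks r when r falls below 1 - epsilon.
for r in (0.5, 1.0, 1.2, 1.5):
    print("A=+1", r, clipped_surrogate(r, advantage=1.0))

# Negative advantage: the objective stops improving once r drops below 1 - epsilon,
# but it still tracks r when r rises above 1 + epsilon.
for r in (0.5, 0.8, 1.0, 1.5):
    print("A=-1", r, clipped_surrogate(r, advantage=-1.0))
```

Taking the minimum of the clipped and unclipped terms is what keeps the objective active in the "wrong direction" regions described above, so the gradient can still pull the ratio back.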
Now let's briefly recap how the advantage A_t is computed.
A_t comes from applying GAE, generalized advantage estimation, to two inputs: the per-token rewards from the current step onward and the learned value function V.
The first input is the per-token reward r_t in PPO for RLHF.
This reward combines the per-token KL estimate KL_t and, at the final token, the reward model score r_phi of q and o.
Here KL_t measures how much the policy deviates from the reference model at token t, and beta times KL_t is the penalty term we subtract from the reward.
The second input is the value model. For each state, the value model predicts a scalar value.
With these two inputs in hand, GAE produces the advantage in two steps.
First, we compute the TD error delta_t: the reward at step t plus gamma times the value of the next state minus the value of the current state. Then we compute the advantage backward: A_t equals delta_t plus gamma times lambda times A_{t+1}.
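The two GAE steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the name `compute_gae` and the defaults `gamma=1.0` and `lam=0.95` are assumptions, and the boundary conventions (the value after the final token and the advantage after the final token are both zero) are stated in the comments.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Per-token advantages from per-token rewards r_t and value predictions V_t.

    gamma and lam defaults are assumed for illustration.
    """
    num_tokens = len(rewards)
    advantages = [0.0] * num_tokens
    next_value = 0.0      # convention: no state after the final token, so its value is 0
    next_advantage = 0.0  # convention: the recursion starts from A_{T+1} = 0
    for t in reversed(range(num_tokens)):
        # Step 1: TD error delta_t = r_t + gamma * V_{t+1} - V_t
        delta = rewards[t] + gamma * next_value - values[t]
        # Step 2: backward recursion A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = delta + gamma * lam * next_advantage
        next_value = values[t]
        next_advantage = advantages[t]
    return advantages


# A three-token answer whose only reward arrives at the final token.
print(compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5]))
```

With `lam=0` each advantage reduces to the one-step TD error, and with `lam=1` and `gamma=1` it becomes the sum of future rewards minus the current value.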
Finally, to conclude the PPO section, let's walk through one step of the PPO training loop.
First, we copy the current policy pi_theta and freeze it as the old policy pi_theta_old. Then for each prompt, we sample an answer o from the old policy. This answer is a sequence of generated tokens o_1, o_2, and so on up to o_T.
Next, we compute the reward model score for the complete answer, r_phi of q and o.
This gives us one scalar reward score for the full response.
Then for all response tokens, we compute several token-level quantities in parallel.
First, the value model predicts a value V_t, where the state s_t is the prompt plus the generated tokens before position t.
Second, we compute the per-token KL estimate KL_t.
Conceptually, this compares pi_theta with the reference policy.
We use log pi_theta_old here because at rollout time the old policy is a frozen copy of pi_theta and we already have its log probabilities from generation.
Using the reward model score and the KL estimate, we form the step reward r_t.
For tokens before the final token, the reward is just the negative KL penalty.
For the final token, we add the reward model score for the complete answer.
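As a rough sketch of how the step rewards are assembled, the snippet below takes per-token log probabilities from the old policy and the reference model, forms the log-ratio KL estimate, and adds the sequence-level score at the last token. The function name `step_rewards` and the default `beta=0.1` are assumptions for illustration.

```python
def step_rewards(logp_old, logp_ref, reward_model_score, beta=0.1):
    """Per-token rewards: -beta * KL_t everywhere, plus the answer score at the final token.

    logp_old / logp_ref: log probabilities of each sampled token under the old
    policy and the reference model. beta=0.1 is an assumed illustrative value.
    """
    # Log-ratio KL estimate per token: log pi_old(o_t) - log pi_ref(o_t)
    kl_estimates = [lo - lr for lo, lr in zip(logp_old, logp_ref)]
    rewards = [-beta * kl for kl in kl_estimates]
    # The reward model scores the whole answer, so its score lands on the last token only.
    rewards[-1] += reward_model_score
    return rewards


print(step_rewards(logp_old=[-1.0, -2.0, -0.5],
                   logp_ref=[-1.5, -2.0, -0.7],
                   reward_model_score=2.0))
```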
Once we have the step reward r_t and the value predictions, we compute the TD error. Delta_t equals r_t plus gamma times the next value V_{t+1} minus the current value V_t.
At the final token, the next value V_{T+1} is zero because the answer ends there and there is no next state.
Next, we compute the advantage and use it to update the model. Starting from the final token and moving backward, we compute the advantage A_t using the recursive GAE formula.
The advantage at step t equals the TD error delta_t plus gamma times lambda times the next advantage A_{t+1}.
We initialize the recursion with A_{T+1} equals zero.
We also compute the return target G_t, which is used to train the value model.
Remember, advantage is defined as return minus value.
So after we estimate the advantage, we can add the value baseline back: G_t equals A_t plus V_t. With GAE, this is an estimated return target that reuses the value baseline.
During PPO optimization, we then recompute the current policy log probability on the same rollout. In other words, for each generated token o_t, we compute its log probability under the current policy pi_theta given the prompt q and the previously generated tokens.
Then we compute the importance ratio r_t.
This is the exponential of the current policy log probability minus the old policy log probability for the same token and the same context.
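In code, the ratio is usually computed from log probabilities rather than raw probabilities. A minimal sketch, with the hypothetical helper name `importance_ratios`:

```python
import math


def importance_ratios(logp_new, logp_old):
    """r_t = exp(log pi_theta(o_t | context) - log pi_theta_old(o_t | context)) per token."""
    return [math.exp(new - old) for new, old in zip(logp_new, logp_old)]


# Before any gradient step the two policies agree, so every ratio is 1.
print(importance_ratios([-1.2, -0.3], [-1.2, -0.3]))

# After an update, a token whose log probability rose has a ratio above 1.
print(importance_ratios([-1.0, -0.3], [-1.2, -0.3]))
```

Working in log space avoids multiplying many small probabilities and matches what the model already produces at generation time.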
Finally, we calculate the training loss.
Since J_PPO is an objective we want to maximize, in implementation we minimize the negative PPO objective. We also add the value model loss weighted by a coefficient.
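Putting the pieces together, the loss for one answer could be sketched as below. Several details here are assumptions made for illustration rather than choices stated in the video: the name `ppo_loss`, averaging over tokens, a squared-error value loss, `epsilon=0.2`, and `value_coef=0.5`.

```python
import math


def ppo_loss(logp_new, logp_old, advantages, old_values, new_values,
             epsilon=0.2, value_coef=0.5):
    """Negative clipped objective plus a weighted value loss, averaged over tokens.

    old_values: V_t predicted at rollout time (used to build return targets).
    new_values: V_t predicted by the value head being trained.
    The token mean, squared-error value loss, and default coefficients are assumptions.
    """
    policy_terms, value_terms = [], []
    for lp_new, lp_old, adv, v_old, v_new in zip(
            logp_new, logp_old, advantages, old_values, new_values):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        policy_terms.append(min(ratio * adv, clipped_ratio * adv))
        return_target = adv + v_old  # G_t = A_t + V_t
        value_terms.append((v_new - return_target) ** 2)
    policy_objective = sum(policy_terms) / len(policy_terms)
    value_loss = sum(value_terms) / len(value_terms)
    # We maximize the objective, so the optimizer minimizes its negative.
    return -policy_objective + value_coef * value_loss


print(ppo_loss(logp_new=[-1.0, -0.4], logp_old=[-1.2, -0.3],
               advantages=[0.8, -0.5], old_values=[0.1, 0.2], new_values=[0.3, 0.1]))
```

In a real training setup these would be tensors and the loss would be backpropagated through the policy and value outputs; plain floats are used here only to show the arithmetic.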
As we mentioned before, the policy model and the value model often share the same transformer backbone, with one head for next-token prediction and another head for value prediction. During PPO training, we usually update the shared backbone and both heads together.
The reward model r_phi and the reference model pi_ref stay frozen.
Now let's move from PPO to GRPO, which stands for group relative policy optimization.
Compared with PPO, GRPO uses a simpler setup. Recall that PPO involved four models: the policy model, the reference model, the reward model, and the value model. GRPO keeps only three of them; the value model is dropped.
The first is the policy model pi_theta. This is the language model we want to optimize. For an LLM, the policy gives us the probability of the next token conditioned on the prompt and the previously generated tokens. Because GRPO samples a group of answers per prompt, we index by both the answer i and the token position t.
The second is the reference model pi_ref, a frozen copy of the model before RL training, used to anchor the KL penalty.
The third is the reward model r_phi, the model we trained earlier, which scores complete answers.
As for the value model, GRPO removes it entirely. And because there is no value model, we also do not need GAE to estimate the advantage. Instead, GRPO computes the advantage from a group of sampled answers.
We will see how on the next slide.
Now let's see how GRPO computes the advantage.
For each question q, we sample a group of answers from the old policy pi_theta_old. We denote them as o_1, o_2, and so on up to o_G.
Then the reward model scores each answer, giving us G reward scores r_1, r_2, and so on up to r_G. Instead of using a value model as a baseline, GRPO uses the rewards within the same group. For each answer i, we compute its normalized reward by subtracting the mean reward of the group and dividing by the standard deviation of the group rewards.
This normalized reward becomes the advantage for that answer. Then we assign the same advantage to every token in that answer.
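Here is a small sketch of the group-relative advantage using only the standard library. The helper names, the use of the population standard deviation, and the small `eps` added to avoid division by zero when all rewards are equal are assumptions for illustration.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize each answer's reward by the group mean and standard deviation.

    Population standard deviation and the eps guard are assumed choices.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_advantages(advantages, answer_lengths):
    """Copy each answer-level advantage to every token of that answer."""
    return [[adv] * length for adv, length in zip(advantages, answer_lengths)]


# Four sampled answers for one prompt: two pass a check (reward 1), two fail (reward 0).
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)
print(per_token_advantages(advs, answer_lengths=[3, 2, 4, 1]))
```

Note that when every answer in the group gets the same reward, all advantages are zero, so that group contributes no learning signal through the advantage term.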
So if an answer gets a higher reward than the group average, it has a positive advantage and GRPO will encourage the model to generate answers like this. If an answer gets a lower reward than the group average, it has a negative advantage and GRPO will discourage it. This is what group as baseline means. The reward is judged relative to a group of answers for the same prompt instead of relative to a learned value model.
The GRPO objective is very similar to the PPO objective but adapted to the group setting. Instead of sampling one answer for each prompt, we sample a group of G answers from the old policy pi_theta_old.
Then for each answer in the group and for each token in that answer, we compute the importance ratio r_{i,t}(theta).
This ratio compares the probability of token o_{i,t} under the current policy pi_theta with its probability under the old policy pi_theta_old, given the same prompt and the previous tokens.
The advantage A_{i,t} is computed using the group-relative reward we introduced on the previous slide.
The clipping part is the same as in PPO. It prevents the current policy from moving too far away from the old policy during one update.
One difference from the PPO version we discussed earlier is how the KL term is used. In PPO, we included the KL penalty inside the reward signal. In GRPO, the KL penalty is subtracted directly in the objective as a separate regularization term, and we use the k3 estimator for the KL term. Specifically, k3 approximates the KL as rho minus log of rho minus 1. Here rho is the ratio of reference to current probability for the token: pi_ref of o_{i,t} divided by pi_theta of o_{i,t}.
Compared with the k1 estimate we used in PPO, the k3 estimator is always non-negative and usually has lower variance.
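The difference between the two per-token estimators is easy to see numerically. In this sketch, k1 is taken to be the plain log ratio of policy to reference, and the function names are hypothetical.

```python
import math


def k1_estimate(logp_policy, logp_ref):
    """Log-ratio estimate: log pi(o_t) - log pi_ref(o_t). Can be negative for a single token."""
    return logp_policy - logp_ref


def k3_estimate(logp_policy, logp_ref):
    """k3 estimate: rho - log(rho) - 1, with rho = pi_ref(o_t) / pi(o_t)."""
    log_rho = logp_ref - logp_policy
    return math.exp(log_rho) - log_rho - 1.0


# A token the policy finds less likely than the reference does:
# k1 is negative here, while k3 stays non-negative.
print(k1_estimate(-2.0, -1.0), k3_estimate(-2.0, -1.0))

# When the two models agree on the token, both estimates are zero.
print(k1_estimate(-1.0, -1.0), k3_estimate(-1.0, -1.0))
```

The non-negativity of k3 follows from the inequality rho - 1 >= log(rho) for every positive rho.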
So in short, GRPO keeps the PPO-style clipped objective, removes the value model, replaces the value baseline with a group-relative baseline, and adds the KL penalty directly in the objective.
Let's walk through one step of the GRPO training loop, similar to what we did for PPO.
First, we copy the current policy pi_theta and freeze it as the old policy pi_theta_old.
Then for each prompt q, we sample a group of answers from the old policy: o_1, o_2, and so on up to o_G. Next, we use a reward model, or more generally a reward function, to score each answer. This gives us a group of reward scores r_1, r_2, and so on up to r_G, where r_i is the reward score for prompt q and answer o_i.
Then for each answer in the group and for each token in that answer, we compute several quantities in parallel.
First, we compute the importance ratio r_{i,t}(theta), which compares the probability of the token under the current policy and the old policy.
Second, we compute the advantage A_{i,t}.
This comes from the group-relative reward. We normalize each answer's reward by subtracting the group mean and dividing by the group standard deviation, and then assign that advantage to every token in the same answer.
Third, we compute the per-token KL term between the current policy and the reference model. This KL term is multiplied by beta and subtracted from the GRPO objective as a regularization term.
Finally, we calculate the training loss.
Since J_GRPO is an objective we want to maximize, in implementation we minimize the negative GRPO objective.
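The whole GRPO loss for one prompt can be sketched as follows. The name `grpo_loss`, the defaults `epsilon=0.2` and `beta=0.04`, and the choice to average over tokens within each answer and then over answers in the group are all assumptions for illustration.

```python
import math


def grpo_loss(logp_new, logp_old, logp_ref, advantages, epsilon=0.2, beta=0.04):
    """Negative GRPO objective for one prompt.

    logp_new / logp_old / logp_ref: one list of per-token log probabilities per answer.
    advantages: one group-relative advantage per answer, shared by all of its tokens.
    Averaging scheme and default hyperparameters are assumed, not from the source.
    """
    per_answer_objectives = []
    for new_i, old_i, ref_i, adv in zip(logp_new, logp_old, logp_ref, advantages):
        token_terms = []
        for lp_new, lp_old, lp_ref in zip(new_i, old_i, ref_i):
            # Clipped surrogate, same form as in PPO.
            ratio = math.exp(lp_new - lp_old)
            clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
            surrogate = min(ratio * adv, clipped_ratio * adv)
            # k3 KL estimate between the current policy and the reference model.
            log_rho = lp_ref - lp_new
            kl = math.exp(log_rho) - log_rho - 1.0
            token_terms.append(surrogate - beta * kl)
        per_answer_objectives.append(sum(token_terms) / len(token_terms))
    objective = sum(per_answer_objectives) / len(per_answer_objectives)
    return -objective  # minimize the negative objective


print(grpo_loss(logp_new=[[-1.0, -0.5], [-2.0]],
                logp_old=[[-1.1, -0.5], [-1.9]],
                logp_ref=[[-1.2, -0.6], [-1.8]],
                advantages=[1.0, -1.0]))
```

Unlike the PPO sketch, there is no value term here: the only trainable model is the policy, and the baseline comes entirely from the group statistics.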
During GRPO training, we update the policy model pi_theta.
The reference model pi_ref stays frozen, and the reward source, either a reward model or a reward function, is not updated by GRPO.
At a high level, what is the difference between PPO and GRPO?
The biggest difference is the choice of baseline. PPO uses a learned value model, or critic, as the baseline. GRPO uses the group of sampled answers as the baseline.
This leads to different characteristics for the two methods. PPO learns a value model and estimates the advantage at the token level. It is more general, but it also has a heavier training pipeline because we need to train and update both the policy model and the value model.
GRPO, in contrast, does not need a value model. Instead, it samples multiple completions for the same prompt, scores them, and normalizes the reward within the group.
Because of this, GRPO is usually simpler and cheaper to train. So, intuitively, PPO asks: was this token better than my value estimate?
GRPO asks: was this answer better than the other answers from the same prompt?
So when should we choose PPO?
PPO is a good choice when token-level credit assignment matters. Specifically, the reward model usually gives one score for the whole answer, not one reward per token. PPO uses the value model and GAE to propagate the sequence-level reward backward and estimate token-level advantages.
This is especially useful for long outputs or cases where different parts of the answer may contribute differently to the final quality.
PPO is also useful when the reward variance within a group of sampled answers is low. In that case, group normalization may not give a strong baseline, and a learned value model can be more robust.
Another reason to choose PPO is stability. PPO is a more mature and widely tested recipe for general RLHF training.
So the typical use case is general RLHF for assistant alignment: making the model more helpful, honest, and harmless.
Then when should we choose GRPO?
GRPO is a good choice when outcome-level supervision is enough. In other words, we only need to know whether the final answer is good or bad instead of estimating detailed token-level advantages with a value model.
GRPO also works well when rewards within the sampled group spread out naturally. For example, in math or code tasks, some completions may pass the verifier or unit tests while others fail. This gives a clear relative signal within the group.
Another advantage is simplicity. GRPO does not need a value model and it does not need GAE, so the training pipeline is simpler and cheaper.
This makes GRPO especially suitable for math, code, and other verifiable tasks.
That pretty much covers everything I want to share in this video. We started from the motivation for RLHF, then went through supervised fine-tuning, reward modeling, PPO, and finally GRPO.
Of course, this video only gives a high-level walkthrough, so I strongly recommend reading the original papers if you want to understand the design choices and the details more deeply.
If you have any questions, please leave a comment. If you find this video useful, please like it and share it with others.
Thanks for watching and I'll see you next time. Bye.