RLHF, PPO & GRPO Explained: A Top-Down Guide to LLM Policy Optimization

By Mei Li

Summary

Topics Covered

SFT teaches imitation, not judgment
KL penalty anchors policy against drift
Baselines separate quality from prompt difficulty
GRPO replaces the value model with group statistics

Full Transcript

Hello, welcome to my YouTube channel. In

this video, we will build a top-down understanding of LLM policy optimization,

why RLHF is needed, how it works, and how two important methods, PPO and GRPO, fit into the picture. My goal is to make the

explanation as self-contained as possible. We'll start from the core intuition, connect it to the math, and then work through one step of the training loop for each method. By the

end of the video, you should have a clear mental model of how PPO and GRPO optimize LLM policies. Let's get

started.

Before we jump in, we first need to understand why supervised fine-tuning alone is not enough.

In supervised fine-tuning, or SFT, we give the model a prompt together with a single reference answer and train it to

reproduce that answer token by token.

The only signal is essentially copy this.

This is useful, but notice what it leaves out. SFT only ever shows one good answer per problem. It never shows a comparison.

So, it can teach the model to imitate a demonstration, but not to judge which answer is better. What we actually need is this: given the same prompt and two candidate answers A and B, the model should learn that humans prefer A over B.

For large language models, this distinction is very important. To make a model more helpful, honest, and safe, we need preference feedback.

This kind of preference feedback, learning from comparisons rather than demonstrations, is a core motivation for RLHF,

or reinforcement learning from human feedback.

This naturally brings us to another area of machine learning called reinforcement learning.

The idea behind reinforcement learning is very similar to how humans learn.

Since we were babies, we have learned by interacting with the world around us, trying things, observing the results,

and gradually learning which actions lead to better outcomes.

More formally, RL is a type of machine learning technique where an agent learns to make optimal decisions by interacting

with an environment and receiving feedback via rewards or penalties. As shown in the diagram, starting from

state S, the agent takes an action A.

The environment then returns a reward R and transitions to a new state S prime.

From S prime, the agent picks the next action and the loop repeats.

The goal of the agent is to learn a behavior that maximizes the total accumulated reward over time.

Before we map reinforcement learning to language models, let's define a few key RL terms. First, the policy describes how the

agent chooses actions.

A trajectory is the sequence of states, actions, and rewards produced as the agent interacts with the environment.

You will also often hear the word rollout. A rollout is basically a trajectory generated by running the current policy.

So trajectory describes the sequence itself, while rollout emphasizes the process of sampling it from the policy.

The return is the total future reward from a given point onward. This is the quantity the agent ultimately wants to maximize.

The value of a state is the expected future reward from that state. In other

words, it tells us how good we expect things to be from here.

Finally, the advantage tells us whether a particular action was better or worse than what we expected from that state.

This will become very important when we discuss PPO later.

Now let's map reinforcement learning to the case of large language models.

In LLM reinforcement learning, the policy is the language model itself.

Given the current context, the model defines a probability distribution over the next token.

The state is a prompt together with the tokens generated so far. We can think of it as the current context seen by the model.

The action is the next token sampled from the model.

For example, suppose the prompt is "Where is Shanghai?" and the model has already generated "Shanghai is in". At this point the state is the full context: "Where is Shanghai? Shanghai is in". And the next action could be "China".

After the model chooses that action, the state changes. Now the generated prefix includes "China" and the model uses this

new context to choose the next token.

One clarification before we continue. In this slide and in the examples later, I will often write tokens as if each word is one token. This is only for simplicity. In real language models, tokenization is more complicated. A token can be a full word, part of a word, punctuation, or even include white space. But conceptually, the RL formulation is the same.

So, LLM generation can be viewed as a sequence of decisions. At each step, the model observes the current context,

chooses the next token, and then moves to a new state. In RLHF,

we optimize this policy so that the generated answers receive higher rewards according to human preferences.

So how do we usually apply reinforcement learning to language models?

A classical example is InstructGPT from OpenAI, which introduced the RLHF pipeline for aligning language models

with human preferences.

RLHF usually contains three main stages.

The first stage is supervised fine-tuning or SFT. Here we collect demonstration data and train a supervised policy.

The second stage is reward modeling. In

this stage, we collect comparison data where humans compare different answers to the same prompt and then we train a

reward model to predict which answer is preferred. The third stage is reinforcement learning. Here we optimize the policy model against the reward model so that the model learns to generate answers that receive higher

preference scores.

So in summary, the RLHF pipeline is supervised fine-tuning, reward modeling, and reinforcement learning. Let's go through these steps one by one.

Most of us are already familiar with the pre-training and supervised fine-tuning stages of language models. During

pre-training, we train the model on a very large collection of text data such as web pages, books, code, articles,

and many other sources.

The objective is simple. Given the

previous tokens, predict the next token.

Supervised fine-tuning, or SFT, is teacher-forced next-token prediction on a reference answer. It uses the same objective as pre-training, but the data is prompt and answer pairs instead of arbitrary text.

The goal is to make the model more useful as an assistant by teaching it how to answer questions or follow instructions.

Suppose Q is the prompt and O is the target answer. We concatenate the tokens from Q and O and feed the full sequence into the language model.

At each position, the model predicts the next token. So the target sequence is basically the input sequence shifted by one token.

This shifted setup is exactly what teacher forcing means. At every position, the model is shown the true previous tokens from the reference answer as input

rather than its own previous predictions.

So even if the model would have generated a different token, training continues along the reference sequence.

More specifically, the training objective is the standard cross entropy loss or equivalently the negative log

likelihood of the target tokens.

For each position, the model predicts a probability distribution over the vocabulary and we want it to assign high

probability to the correct next token.

The important difference from pre-training is that in supervised fine-tuning, we usually compute this loss

only on the answer tokens.

The prompt tokens are used as context, but they are masked out from the loss.

So the SFT loss encourages the model to generate the target answer conditioned on the prompt.
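To make the masking concrete, here is a minimal sketch in plain Python. It assumes we already have the log-probability the model assigned to each correct next token; the function name `sft_loss`, the probabilities, and the mask are made up for illustration and are not taken from any particular library.

```python
import math

def sft_loss(token_log_probs, is_answer_token):
    """Average negative log-likelihood over answer tokens only.

    token_log_probs[i] is the log-probability the model assigned to the
    correct next token at position i. is_answer_token[i] is True when that
    target token belongs to the answer and False when it belongs to the prompt.
    """
    masked = [-lp for lp, keep in zip(token_log_probs, is_answer_token) if keep]
    if not masked:
        raise ValueError("no answer tokens to train on")
    return sum(masked) / len(masked)

# Hypothetical probabilities: 2 prompt targets followed by 3 answer targets.
probs = [0.10, 0.20, 0.50, 0.25, 0.80]
log_probs = [math.log(p) for p in probs]
mask = [False, False, True, True, True]

loss = sft_loss(log_probs, mask)
print(round(loss, 4))
```

Only the last three positions contribute to the loss here, so the low probabilities on the prompt tokens do not affect the result.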

For reward modeling, the first thing we need to do is to collect comparison data. The idea is simple. For the same prompt, we generate multiple candidate answers.

We can do this by sampling from the same model several times with a nonzero temperature or by collecting answers from different models.

Then we ask a human labeler to compare these answers and rank them from best to worst. For example, suppose the prompt is "Explain the moon landing to a six-year-old."

We generate four different answers A, B, C, and D. Here, four answers is just an example. In practice, it is usually a small number, often somewhere between four and nine. After reading them, the

human labeler ranks them from best to worst as C, then A, then D, and finally B.

From this ranking, we can construct pairwise preference examples.

Since there are four answers, we get six pairs. C is preferred over A, C is preferred over D, and C is preferred over B.

A is preferred over D, A is preferred over B, and D is preferred over B.

These pairwise comparisons are the training data we will use to train the reward model.
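As a small illustration, the ranking-to-pairs step can be sketched with Python's standard `itertools.combinations`. The helper name `ranking_to_pairs` is hypothetical; the ranking is the one from the example above.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Turn a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["C", "A", "D", "B"])
for chosen, rejected in pairs:
    print(f"{chosen} is preferred over {rejected}")
```

A ranking of K answers yields K * (K - 1) / 2 pairs, which is six for K = 4.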

Now let's see how we train the reward model using the preference data we collected.

The reward model takes a prompt Q and an answer O and outputs a single scalar reward score.

Intuitively, a better answer should receive a higher reward score.

In practice, the reward model is usually initialized from the SFT model.

We reuse the language model backbone, remove the original language model head that predicts the next token, and add a

reward model head instead.

This head converts the hidden state of the final token into one scalar reward score for the whole answer.

For each preference pair, we have two answers: the chosen answer, which humans prefer, and the rejected answer, which humans prefer less. The reward model

gives us two scores: one score for the chosen answer and one score for the rejected answer. We want the chosen answer to get a higher score than the rejected answer.

To train this, we use the Bradley-Terry model. The Bradley-Terry model is a probabilistic model for pairwise comparison. It says that if two candidates have scalar scores, then the probability that one candidate is

preferred over the other depends on the difference between their scores.

In our case, the probability that the chosen answer is preferred over the rejected answer is computed as the

sigmoid of the reward difference, that is, the reward of the chosen answer minus the reward of the rejected answer.

If the reward of the chosen answer is much larger than the reward of the rejected answer, the difference is very positive. The sigmoid is close to one and the loss is close to zero.

This means the reward model is already assigning the correct preference.

If the reward of the chosen answer is much smaller than the reward of the rejected answer, the difference is very negative. The sigmoid is close to zero and the negative log loss becomes very large. So the model is strongly penalized.

So the reward model loss is the negative log probability of the human preference pairs under the Bradley-Terry model.

In practice, we compute this loss for each chosen rejected pair and average it over the batch.

One practical detail is that different prompts in the training data set may produce different numbers of preference

pairs. To handle this, we may first average the loss over the pairs from the same prompt and then average across prompts.

This prevents prompts with more candidate answers from dominating the batch.
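Here is a minimal sketch of that loss in plain Python, including the two-level averaging. The function names and the reward scores are hypothetical; a real reward model would produce the scores from a neural network and backpropagate through them.

```python
import math

def pair_loss(r_chosen, r_rejected):
    """Negative log-sigmoid of the reward difference for one pair."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branching on the sign
    # keeps the exponent non-positive and avoids overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def reward_model_loss(pairs_by_prompt):
    """Average within each prompt first, then across prompts."""
    per_prompt = [
        sum(pair_loss(c, r) for c, r in pairs) / len(pairs)
        for pairs in pairs_by_prompt.values()
    ]
    return sum(per_prompt) / len(per_prompt)

# Hypothetical (chosen, rejected) reward scores for each preference pair.
batch = {
    "prompt_1": [(2.0, 0.0), (1.0, 1.0), (0.5, 1.5)],
    "prompt_2": [(3.0, -1.0)],
}
print(pair_loss(5.0, 0.0))  # small: the ordering is already correct
print(pair_loss(0.0, 5.0))  # large: the ordering is wrong
print(reward_model_loss(batch))
```

With this averaging, prompt_1 and prompt_2 each contribute half of the batch loss even though prompt_1 has three pairs.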

Now that we know how to train a reward model, we can move to the final stage of RLHF: reinforcement learning.

In this stage, we use the reward model as a training signal to optimize the language model policy.

The goal is to make the model generate answers that receive higher reward scores.

In the rest of this video, I will introduce two reinforcement learning methods commonly used for LLM policy optimization.

The first one is PPO, which stands for proximal policy optimization.

PPO is the more classical and widely used RLHF algorithm.

The second one is GRPO, which stands for group relative policy optimization.

GRPO is simpler in some ways because it removes the value model and uses group-relative rewards instead.

Let's start with PPO.

There are four models involved in PPO.

The first one is the policy model. This

is the language model we want to optimize. For an LLM, the policy gives us the probability of the next token conditioned

on the prompt and the tokens generated so far.

The second one is the reference model.

This is usually a frozen copy of the model before RL training. It anchors the policy to what it learned during SFT,

preventing it from drifting too far and losing its general capabilities.

The third one is the reward model. This is the model we trained in the previous step. It gives a reward score for the full answer generated by the policy model.

The fourth one is the value model. The

value model predicts the expected future reward from the current state and it is used to compute the advantage.

At a high level, PPO works like this.

First, the policy model generates an answer O from a prompt Q. Next, we

compute a KL penalty by comparing the current policy to the reference model.

We also score the answer using the reward model. The KL penalty and the reward score are then combined into the final reward signal r.

Separately, the value model produces value estimates V. The reward signal r and the value estimates V are then used

to compute the advantage A with a method called GAE. Finally, this advantage signal tells us how to update the policy

model.

As we mentioned earlier, in PPO the reward signal usually combines two parts: the final reward model score r_phi(Q, O) and a per-token KL penalty.

The reward model gives one score for the whole generated answer. But in LLM PPO, each generated token is treated as an action. So we need token-level quantities such as log probabilities, KL penalties, and advantages.

Since the reward model score is only available for the complete answer, we add it at the final token.

For all tokens before the last token, the step reward is just minus beta times the per-token KL estimate.

For the final token, the reward is the reward model score for the full answer minus beta times the per-token KL

estimate on the last token.

That per-token KL estimate is computed with the k1 estimator, which is simply the log of the probability ratio for the sampled token.

For each generated token, we compare its probability under the current policy pi_theta with its probability under the

reference model pi_ref, given the same context. Here beta

controls the strength of the KL penalty.

A larger beta keeps the policy closer to the reference model while a smaller beta allows the policy to move more

aggressively toward answers with a higher reward.

So the KL penalty acts as an anchor to the reference policy. It prevents the current policy from drifting too far during reinforcement learning

optimization.
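The per-token reward described above can be sketched in a few lines of plain Python. This is an illustration under assumptions: the function name `token_rewards`, the log-probabilities, the reward model score, and the value of beta are all made up.

```python
def token_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token rewards: -beta * KL estimate, plus the RM score on the last token."""
    # k1 estimate for each sampled token: log pi_theta - log pi_ref.
    kl = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    rewards = [-beta * k for k in kl]
    rewards[-1] += rm_score  # the sequence-level score lands on the final token
    return rewards

# Hypothetical log-probabilities of four sampled tokens under each model.
logp_policy = [-1.00, -0.50, -2.00, -0.30]
logp_ref = [-1.10, -0.40, -2.30, -0.30]

rewards = token_rewards(logp_policy, logp_ref, rm_score=1.5, beta=0.1)
print([round(r, 3) for r in rewards])
```

Note that the k1 estimate for a single sampled token can be negative, as for the second token here. It is only non-negative in expectation over tokens sampled from the current policy.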

Now let's look at the value model in more detail.

The value model estimates the expected future return from the current state s_t, before taking action a_t.

In the LLM setting, the state s_t is the prompt together with the tokens generated so far. The action a_t is the next

token.

Architecturally, we can think of the value model as an LLM backbone with a value head on top. This value head

predicts a scalar value for each generated token position.

In practice, this does not always mean we train a completely separate model just for value prediction.

Very often, the policy model and the value model share the same transformer backbone. The policy head predicts the next-token distribution while the value head predicts the scalar value. So

conceptually we talk about a policy model and a value model, but implementation-wise they may share most of their parameters.

The return from step t, written as G_t, is defined recursively as the current reward r_t plus gamma times the future

return G_{t+1}. Equivalently, it is the discounted sum of future rewards from this step onward. Here, gamma is a discount factor. Intuitively, this is like preferring to receive money sooner rather than much later. Future rewards

are still valuable, but they may be discounted.

In standard reinforcement learning, gamma is usually between zero and one. A

smaller gamma discounts future rewards more strongly, while a gamma close to one preserves more of the future reward.

In many LLM RLHF settings, gamma is often set close to one or even exactly one.

The reason is that the main reward often comes from the final complete answer.

But earlier tokens are also important because they determine the structure, direction, and reasoning path of the

answer. So we usually do not want to discount the final reward too much when assigning credit to earlier tokens.

Also, unlike some traditional RL problems, the generated answer has a finite length.

So using gamma close to one is usually reasonable.

So the value model V applied to state s_t estimates the expected return G_t from that state.

Intuitively, this means: given the partial answer so far, how good do we expect the final outcome to be? To train the value model, we fit these predictions to the returns. The value model loss is the mean squared error between the predicted value V(s_t) and the target return

G_t, averaged over the generated tokens.

The value estimate will later help us compute the advantage which tells us whether the sample token was better or worse than expected.

Let's look at a simple example to better understand the value model.

Suppose the prompt Q is "Explain gravity simply" and suppose the generated answer O is "Gravity pulls objects together". Here we continue using the simplification that each word is one token.

So this answer has four tokens.

Now suppose the reward model score for this prompt-answer pair, r_phi(Q, O), is 2.

We also have per-token sampled KL estimates of 0.03, 0.02, 0.04, and 0.01.

If we set beta equal to 1, then the step rewards are calculated like this.

For the first three tokens, the reward is only the negative KL penalty. So we

get r_1 = -0.03, r_2 = -0.02,

and r_3 = -0.04.

For the final token, we add the reward model score for the complete answer. So

r_4 = 2 - 0.01 = 1.99.

Next, suppose the discount factor gamma equals 1. This means we do not discount the future rewards.

We can compute the returns backward from the final token.

First, G_4 = r_4 = 1.99.

Then G_3 = r_3 + G_4 = -0.04 + 1.99, which equals 1.95.

G_2 = r_2 + G_3 = -0.02 + 1.95 = 1.93.

And finally, G_1 = r_1 + G_2 = -0.03 + 1.93,

which equals 1.90.
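The backward pass over the rewards is easy to check in code. This short sketch uses the four step rewards from the example; the function name `discounted_returns` is just an illustrative choice.

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backward from the last token."""
    returns = [0.0] * len(rewards)
    running = 0.0  # return after the final token is zero
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Step rewards from the example: -beta * KL, plus the score of 2 on the last token.
rewards = [-0.03, -0.02, -0.04, 1.99]
returns = discounted_returns(rewards, gamma=1.0)
print([round(g, 2) for g in returns])  # [1.9, 1.93, 1.95, 1.99]
```

With gamma below one, the final reward would shrink by a factor of gamma for every step it is propagated backward.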

These returns become the training targets for the value model. At each position, the value head predicts a scalar value

and we want that prediction to match the corresponding return G_t.

Plugging this in, the value model loss is the mean squared error between each predicted value V(s_t) and its target return G_t,

here averaged over the four tokens of this answer.
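As a quick numeric check, here is a sketch of that mean squared error. The target returns come from the example; the predicted values are hypothetical numbers standing in for the value head's outputs.

```python
def value_loss(predicted_values, target_returns):
    """Mean squared error between V(s_t) and G_t over the generated tokens."""
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, target_returns)]
    return sum(errors) / len(errors)

returns = [1.90, 1.93, 1.95, 1.99]  # targets from the worked example
values = [1.50, 1.70, 1.90, 2.00]   # hypothetical value-head outputs

print(round(value_loss(values, returns), 6))
```

Real implementations may scale this loss or clip the value update; this sketch keeps only the plain mean squared error described in the video.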

Now let's introduce GAE, which stands for generalized advantage estimation.

Before we define GAE directly, let's first set up the reinforcement learning objective.

Suppose tau is a trajectory. It contains a sequence of states and actions: s_0, a_0, s_1, a_1, and so on. This trajectory is generated by following the policy pi_theta.

At each time step t, the action a_t is sampled from the policy conditioned on the current state s_t.

After the agent takes action a_t, the environment transitions to the next state s_{t+1} according to the environment transition probability P. At each step, the agent also receives a reward r_t, which depends on the current

state, the action, and the next state.

The goal of reinforcement learning is to maximize the expected return.

We write this objective as J(theta). It is the expected accumulated reward over trajectories sampled from the current policy pi_theta.

In other words, we want to adjust the policy parameters theta so that on average the trajectories generated by

the policy receive higher total reward.

Now let's compute the gradient of this objective.

Our objective J(theta) is the expected return over trajectories sampled from the policy.

By the definition of expectation, we can write this as an integral over all possible trajectories of the probability of the trajectory, P_theta(tau), multiplied by the return of that trajectory, G(tau).

Now we want to take the gradient with respect to theta.

To do this, we apply the chain rule to the log probability, a step often called the log-derivative trick.

The gradient of a probability equals the probability itself multiplied by the gradient of the log probability.

So the gradient of P_theta(tau) becomes P_theta(tau) times the gradient of log P_theta(tau).

After applying the chain rule, the gradient becomes an integral over all possible trajectories of the probability of the trajectory, P_theta(tau), multiplied by the gradient of the log probability of the trajectory,

multiplied by the return G(tau).

Now we expand the probability of a trajectory.

A trajectory starts from an initial state. Then at each step the policy chooses an action and the environment moves to the next state. So the

probability of the trajectory is the probability of the initial state, P(s_0), multiplied by all the policy

probabilities pi_theta(a_t | s_t), multiplied by all the environment transition probabilities

P(s_{t+1} | s_t, a_t).

If we take the log of this trajectory probability, the products become sums. So we get three parts: the log

probability of the initial state plus the sum of the log policy probabilities plus the sum of the log environment transition probabilities.

But only the policy depends on theta. The initial state distribution and the environment transition probability do not depend on theta.

Therefore, when we take the gradient with respect to theta, those two parts disappear, leaving the gradient of the

sum of log pi_theta(a_t | s_t).

Since the gradient distributes over the sum, this becomes the sum over time of the gradient of log pi_theta(a_t | s_t).

Putting this back into the objective, we get the policy gradient. It is the integral over all possible trajectories of

three terms multiplied together: the probability of the trajectory P_theta(tau), the sum of the gradients of the log action probabilities, and the return G(tau).

This equals the expectation of G(tau) multiplied by the sum of the gradients of the log action probabilities.
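For reference, the spoken derivation can be written out in LaTeX as follows. The symbols J, G, P_theta and p are a notational choice made here to match the narration: J(theta) for the objective, G(tau) for the return of a trajectory, P_theta(tau) for the trajectory probability, and p for the initial-state and transition probabilities.

```latex
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \right]
           = \int P_\theta(\tau)\, G(\tau)\, d\tau \\
\nabla_\theta J(\theta)
  &= \int P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)\, G(\tau)\, d\tau \\
P_\theta(\tau) &= p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \\
\nabla_\theta \log P_\theta(\tau) &= \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
\end{align*}
```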

We can tighten this, and more generally weight each term individually, by replacing the full-trajectory return G(tau) with a per-step signal Psi_t.

The policy gradient then becomes the expectation of a sum over time where each gradient of the log action

probability is weighted by its own Psi_t.

So the key question becomes: what should this weighting signal be?

Now let's connect this to advantage estimation.

Policy gradient methods estimate the gradient of the expected return by weighting each action's gradient with a per-step signal Psi_t.

A good choice of this weighting signal is the advantage A(s_t, a_t) under the policy pi_theta.

It is defined as Q(s_t, a_t) minus V(s_t), both taken under pi_theta. This choice reduces

variance by subtracting a baseline.

Here Q(s_t, a_t) is the expected return if we take action a_t at state s_t

and then continue following policy pi_theta.

V(s_t) is the expected return from state s_t, averaging over actions sampled from policy pi_theta.

So the value function acts as a baseline.

Therefore the advantage tells us how much better than average is action at in state ST according to the current policy.

This baseline is important because it reduces variance.

For example, imagine two prompts. Prompt A is easy, so most completions get high scores. Prompt B is hard. Even the best completions may get relatively low scores.

If we only look at raw reward, the easy prompt may always look good and the hard prompt may always look bad. But what we really want to know is whether a

completion is better or worse than expected for that specific prompt.

By subtracting a baseline, the advantage signal reflects better than expected quality, not just prompt difficulty.
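A toy illustration of this point, under a loud simplification: instead of a learned value function, the baseline here is just the mean reward of the completions for the same prompt. The scores and the helper name `centered_rewards` are made up.

```python
def centered_rewards(rewards):
    """Subtract the mean reward of the completions for the same prompt."""
    baseline = sum(rewards) / len(rewards)  # stand-in for a learned V
    return [r - baseline for r in rewards]

# Hypothetical reward scores for three completions per prompt.
easy_prompt = [8.0, 9.0, 10.0]
hard_prompt = [1.0, 2.0, 3.0]

print(centered_rewards(easy_prompt))  # [-1.0, 0.0, 1.0]
print(centered_rewards(hard_prompt))  # [-1.0, 0.0, 1.0]
```

The raw scores differ a lot between the two prompts, but after subtracting the baseline the best completion for each prompt gets the same positive signal.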

In practice, however, the true advantage function is unknown. So, we need to

estimate it. This is where GAE comes in.

Now, let's see how we estimate the advantage with GAE.

First, we define the temporal difference residual, or TD residual for short. The TD

residual is a one-step estimate of the advantage. It is written as delta_t and it compares two quantities:

what the value model predicted before and what we observe after taking one step. More specifically, delta_t equals the reward at step t plus gamma times the value of the next state minus the

value of the current state.

Intuitively, this tells us whether the outcome was better or worse than what the value model expected.

If delta_t is positive, the result was better than expected. If it is negative,

the result was worse than expected.

GAE combines these TD residuals across multiple future steps. It takes an exponentially weighted sum of future TD

residuals.

The recursive form is the one we usually implement. The advantage at step t equals delta_t plus gamma times lambda

times the advantage at the next step.

Here lambda controls the bias-variance tradeoff.

When lambda equals zero, GAE becomes pure one-step TD estimation. This has lower variance but higher bias because it

only looks one step ahead. When lambda

equals one, GAE becomes closer to Monte Carlo estimation: it reduces to the discounted return minus the value baseline.

This has lower bias because it uses the full rollout but higher variance because the estimate depends on many sampled future rewards. In practice,

lambda is often set to a value like 0.95, which gives a balance between bias and variance.
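The recursion can be sketched in plain Python as below. The rewards reuse the gravity example from earlier in the video; the value estimates are hypothetical, and the value after the final token is assumed to be zero because the answer has ended.

```python
def gae(rewards, values, gamma, lam):
    """A_t = delta_t + gamma * lam * A_{t+1}, computed backward."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0      # assumption: value after the final token is 0
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages

rewards = [-0.03, -0.02, -0.04, 1.99]
values = [1.50, 1.70, 1.90, 2.00]  # hypothetical value-head outputs

td_only = gae(rewards, values, gamma=1.0, lam=0.0)
monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)
blended = gae(rewards, values, gamma=1.0, lam=0.95)
print([round(a, 3) for a in td_only])
print([round(a, 3) for a in monte_carlo])
print([round(a, 3) for a in blended])
```

With lambda equal to zero each advantage is just the TD residual at that step, and with lambda equal to one each advantage is the return G_t minus V(s_t).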

Finally, we now have all the building blocks needed to apply PPO to a language model.

The PPO objective is something we want to maximize. At a high level, we want to increase the probability of tokens with

positive advantage and decrease the probability of tokens with negative advantage.

The formula may look complicated at first, but the idea is simple. We want

to update the current policy using rollouts generated by the old policy while making sure the update is not too large. But why do we use rollouts from the old policy instead of the current policy? We will explain that next when

we discuss the importance ratio.

In this objective, we sample a prompt Q from the prompt dataset and then sample an answer O from the old policy pi_theta_old. For each generated token, we compute two surrogates:

an unclipped one and a clipped one. The

unclipped surrogate is the importance ratio r_t(theta) multiplied by the advantage A_t. Here r_t(theta) is the probability of the current action under the current policy divided by the probability of the same

action under the old policy:

pi_theta(a_t | s_t) over pi_theta_old(a_t | s_t). In the LLM setting, this becomes pi_theta(O_t | Q, O_<t) over

pi_theta_old(O_t | Q, O_<t).

The clipped surrogate is similar, but we clip the importance ratio so that it cannot go below 1 - epsilon and

cannot go above 1 + epsilon.

PPO takes the minimum of the two surrogates.

This clipping is the key idea of PPO. It

prevents the new policy from moving too far away from the old policy in one update.

Finally, the 1/|O| factor averages this objective over the tokens of the answer, and the outer expectation

averages over the batch of prompts.

Here A_t is the advantage term we introduced earlier in the GAE section. It

tells us whether this sampled token was better or worse than expected.
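Written out, the objective described in words above takes roughly this form. This is a reconstruction from the narration, using Q for the prompt, O for the sampled answer, and r_t(theta) for the importance ratio; the slide itself is not part of the transcript, so the exact notation there may differ.

```latex
\begin{align*}
r_t(\theta) &= \frac{\pi_\theta(O_t \mid Q, O_{<t})}{\pi_{\theta_{\text{old}}}(O_t \mid Q, O_{<t})} \\
J_{\text{PPO}}(\theta) &= \mathbb{E}_{Q,\; O \sim \pi_{\theta_{\text{old}}}(\cdot \mid Q)}
  \left[ \frac{1}{|O|} \sum_{t=1}^{|O|}
  \min\Big( r_t(\theta)\, A_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \Big) \right]
\end{align*}
```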

If this formula still feels abstract, don't worry. In the next few slides, we will break down the three important parts: the importance ratio, clipping, and how the advantage A_t is computed in practice.

So why do we generate the rollouts using the old policy instead of the current policy?

Well, the reason is efficiency. In pure

on-policy training, every time the policy changes, we would need to generate fresh rollouts from the new policy. But

for large language models, rollout generation is expensive, especially for long answers or long-horizon tasks. So in PPO we first make a frozen copy of the current policy and call it the old policy, denoted as pi_theta_old. Then we use this old policy to generate a batch of rollouts. After the

rollouts are generated, we optimize the current policy pi_theta using this fixed batch of data. This means we allow a

small, controlled off-policy mismatch.

So here is the problem. The rollouts came from the old policy pi_theta_old, but the model we are updating is the current policy pi_theta.

To correct this mismatch, we multiply the advantage A_t by the importance ratio r_t(theta).

This ratio compares how likely the sampled token is under the current policy versus under the old policy. Multiplying

by this ratio corrects the mismatch between the policy that generated the rollouts and the policy being updated.

But we don't want this ratio to become too large or too small. That is why PPO introduces clipping.

Now let's take a closer look at what PPO clipping is doing. This plot shows the objective for a single time step t. The

horizontal axis is the importance ratio r, which compares the probability of the token under the current policy with the

probability under the old policy.

The red dot indicates where the PPO update starts on this curve. At this

initial point, the ratio r equals one because before the update, the current policy is the same as the old policy.

Now consider the case when the advantage is positive.

A positive advantage means this token was better than expected. So we want to encourage the model to produce it more often. To maximize this objective, PPO wants to increase the importance ratio,

meaning the current policy should assign a higher probability to this token than the old policy. But PPO does not want

this increase to go too far. Once r is larger than 1 + epsilon, the objective becomes clipped.

So increasing the probability even more gives no additional gradient signal. On

the other hand, if r is smaller than 1 - epsilon, it means the probability of a good token has dropped

too much. In that region, the minimum keeps the unclipped objective active, so the gradient can push the ratio back up.

Now consider the case when the advantage is negative.

A negative advantage means this token was worse than expected, so we want to discourage the model from producing it.

Since A_t is negative, the surrogate r times A_t becomes less negative when r decreases.

So to maximize the objective, PPO wants to decrease the importance ratio, meaning the current policy should assign a lower probability to this token than the old policy.

But again, PPO does not want the update to go too far. Once r is smaller than 1 minus epsilon, the objective becomes clipped.

So reducing the probability even more gives no additional benefit. And if r is larger than 1 plus epsilon, it means the probability of a bad token is still too high. In that region, the unclipped objective remains active, so the gradient can push the ratio back down.

So the key idea is: PPO clipping allows useful updates, but it prevents the policy from changing too aggressively in one optimization step.
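The four regions described above can be checked with a small sketch of the per-token clipped objective, the minimum of the unclipped and clipped surrogates. The name `clipped_surrogate` and the value epsilon = 0.2 are assumptions for illustration, not values taken from the video.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-token PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the objective goes flat once r exceeds 1 + epsilon,
# but stays unclipped (and keeps its gradient) when r is below 1 - epsilon.
print(clipped_surrogate(1.5, 1.0))   # clipped at (1 + epsilon) * A
print(clipped_surrogate(0.5, 1.0))   # unclipped: r * A

# Negative advantage: the objective goes flat once r drops below 1 - epsilon,
# but stays unclipped when r is above 1 + epsilon.
print(clipped_surrogate(0.5, -1.0))  # clipped at (1 - epsilon) * A
print(clipped_surrogate(1.5, -1.0))  # unclipped: r * A
```

Taking the minimum is what makes the clipping one-sided: it only removes the incentive to move further in the direction the advantage already favors.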

Now let's briefly recap how the advantage A_t is computed.

A_t comes from applying GAE, generalized advantage estimation, to two inputs: the per-token rewards from the current step onward, and the learned value function V.

The first input is the per-token reward r_t in PPO for RLHF. This reward combines the per-token KL estimate KL_t and, at the final token, the reward model score R(q, o).

Here KL_t measures how much the policy deviates from the reference model at token t, and beta times KL_t is the penalty term we subtract from the reward.

The second input is the value model. For each state, the value model predicts a scalar value.

With these two inputs in hand, GAE produces the advantage in two steps.

First, we compute the TD error delta_t: the reward at step t, plus gamma times the value of the next state, minus the value of the current state. Then we compute the advantage backward: A_t equals delta_t plus gamma times lambda times A_{t+1}.
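The two GAE steps can be sketched in a few lines of plain Python. This is only an illustration: the function name `gae_advantages` and the default gamma and lambda are assumptions, and the value and the advantage after the last token are both taken to be zero because the answer ends there.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Backward GAE recursion over one answer.

    rewards[t] is the step reward r_t and values[t] is the value prediction V_t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value = 0.0      # value after the final token
    next_advantage = 0.0  # advantage after the final token
    for t in reversed(range(T)):
        # Step 1: TD error delta_t = r_t + gamma * V_{t+1} - V_t
        delta = rewards[t] + gamma * next_value - values[t]
        # Step 2: A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = delta + gamma * lam * next_advantage
        next_value = values[t]
        next_advantage = advantages[t]
    return advantages

# Made-up example: reward only at the last token, constant value predictions.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=0.0))
```

With lambda equal to 1 every token sees the full future reward minus its own value, while with lambda equal to 0 each advantage collapses to the one-step TD error.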

Finally, to conclude the PPO section, let's walk through one step of the PPO training loop.

First, we copy the current policy π_θ and freeze it as the old policy π_θ_old. Then for each prompt, we sample an answer o from the old policy. This answer is a sequence of generated tokens o_1, o_2, and so on up to o_T.

Next, we compute the reward model score for the complete answer, R(q, o). This gives us one scalar reward score for the full response.

Then, for all response tokens, we compute several token-level quantities in parallel.

First, the value model predicts a value V_t, where the state s_t is the prompt plus the generated tokens before position t.

Second, we compute the per-token KL estimate KL_t. Conceptually, this compares π_θ with the reference policy. We use log π_θ_old here because at rollout time the old policy is a frozen copy of π_θ, and we already have its log probabilities from generation.

Using the reward model score and the KL estimate, we form the step reward r_t.

For tokens before the final token, the reward is just the negative KL penalty.

For the final token, we add the reward model score for the complete answer.
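Here is a small sketch of how the step rewards could be assembled. It assumes the per-token KL estimate is the simple log-ratio between the old policy and the reference model for the sampled token; the function name `step_rewards`, the value of beta, and the numbers are all hypothetical.

```python
def step_rewards(logp_old, logp_ref, reward_score, beta=0.1):
    """Per-token rewards: -beta * KL_t everywhere, plus the score on the last token."""
    # Assumed KL estimate: log pi_theta_old(o_t) - log pi_ref(o_t)
    kl = [lo - lr for lo, lr in zip(logp_old, logp_ref)]
    rewards = [-beta * k for k in kl]
    # Only the final token receives the reward model score for the whole answer.
    rewards[-1] += reward_score
    return rewards

# Made-up log-probabilities for a three-token answer and a made-up score of 2.0.
logp_old = [-1.0, -2.0, -0.5]
logp_ref = [-1.5, -2.0, -0.5]
print(step_rewards(logp_old, logp_ref, reward_score=2.0))
```

In this example only the first token drifts from the reference model, so it is the only one that pays a penalty, and the whole reward model score lands on the last token.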

Once we have the step reward r_t and the value predictions, we compute the TD error. Delta_t equals r_t plus gamma times the next value V_{t+1} minus the current value V_t.

At the final token, the next value V_{T+1} is zero, because the answer ends there and there is no next state.

Next, we compute the advantage and use it to update the model. Starting from the final token and moving backward, we compute the advantage A_t using the recursive GAE formula.

The advantage at step t equals the TD error delta_t plus gamma times lambda times the next advantage A_{t+1}.

We initialize the recursion with A_{T+1} equals zero.

We also compute the return target G_t, which is used to train the value model.

Remember, the advantage is defined as return minus value.

So after we estimate the advantage, we can add the value baseline back: G_t equals A_t plus V_t. With GAE, this is an estimated return target that reuses the value baseline.

During PPO optimization, we then recompute the current policy log probability on the same rollout. In other words, for each generated token o_t, we compute its log probability under the current policy π_θ, given the prompt q and the previously generated tokens.

Then we compute the importance ratio r_t(θ). This is the exponential of the current policy log probability minus the old policy log probability, for the same token and the same context.

Finally, we calculate the training loss.

Since J_PPO is an objective we want to maximize, in implementation we minimize the negative PPO objective. We also add

the value model loss weighted by a coefficient CT.
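Putting the pieces together, the loss for one rollout could look roughly like the sketch below. Several things here are assumptions rather than details from the video: the value loss is taken as a mean squared error against the return target, both terms are averaged over tokens, and the names `ppo_loss` and `value_coef` and the default numbers are hypothetical. A real implementation would use tensors with automatic differentiation.

```python
import math

def ppo_loss(logp_current, logp_old, advantages, values, returns,
             epsilon=0.2, value_coef=0.5):
    """Negative clipped PPO objective plus a weighted value loss (token averages)."""
    n = len(advantages)
    policy_objective = 0.0
    value_loss = 0.0
    for lc, lo, adv, v, g in zip(logp_current, logp_old, advantages, values, returns):
        ratio = math.exp(lc - lo)                       # importance ratio
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        policy_objective += min(ratio * adv, clipped * adv)
        value_loss += (v - g) ** 2                      # assumed squared error
    policy_objective /= n
    value_loss /= n
    # We maximize the objective, so the loss carries its negative.
    return -policy_objective + value_coef * value_loss

# Made-up two-token rollout, evaluated before any update (ratio equals 1).
advantages = [1.0, -1.0]
values = [0.0, 0.0]
returns = [a + v for a, v in zip(advantages, values)]  # return target = A_t + V_t
print(ppo_loss([-1.0, -2.0], [-1.0, -2.0], advantages, values, returns))
```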

As we mentioned before, the policy model and value model often share the same transformer backbone, with one head for next-token prediction and another head for value prediction. During PPO training, we usually update the shared backbone and both heads together.

The reward model R and the reference model π_ref stay frozen.

Now let's move from PPO to GRPO, which stands for group relative policy optimization.

Compared with PPO, GRPO uses a simpler setup. Recall that PPO involved four models: the policy model, the reference model, the reward model, and the value model. GRPO keeps only three of them; the value model is dropped.

The first is the policy model π_θ. This is the language model we want to optimize. For an LLM, the policy gives us the probability of the next token conditioned on the prompt and the previously generated tokens. Because GRPO samples a group of answers per prompt, we index by both the answer i and the token position t.

The second is the reference model π_ref, a frozen copy of the model before RL training, used to anchor the KL penalty.

The third is the reward model R, the model we trained earlier, which scores complete answers.

As for the value model V, GRPO removes it entirely. And because there is no value model, we also do not need GAE to estimate the advantage. Instead, GRPO computes the advantage from a group of sampled answers.

We will see how on the next slide.

Now let's see how GRPO computes the advantage.

For each question q, we sample a group of answers from the old policy π_θ_old. We denote them as o_1, o_2, and so on up to o_G.

Then the reward model scores each answer, giving us G reward scores r_1, r_2, and so on up to r_G. Instead of using a value model as a baseline, GRPO uses the rewards within the same group. For each answer i, we compute its normalized reward by subtracting the mean reward of the group and dividing by the standard deviation of the group rewards.

This normalized reward becomes the advantage for that answer. Then we assign the same advantage to every token in that answer.

So if an answer gets a higher reward than the group average, it has a positive advantage, and GRPO will encourage the model to generate answers like this. If an answer gets a lower reward than the group average, it has a negative advantage, and GRPO will discourage it. This is what group-as-baseline means. The reward is judged relative to a group of answers for the same prompt, instead of relative to a learned value model.
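As a quick illustration, the group-relative advantage can be sketched with Python's standard `statistics` module. The function name `group_advantages`, the use of the population standard deviation, and the small constant added to the denominator are assumptions made for this sketch.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize each answer's reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumption
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up group of four answers to the same prompt: two correct, two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)

# Every token in answer i gets that answer's advantage (hypothetical lengths).
answer_lengths = [3, 5, 2, 4]
token_advantages = [[a] * n for a, n in zip(advantages, answer_lengths)]
print(token_advantages[0])
```

When every answer in the group receives the same reward, all advantages come out as zero, so that prompt contributes no policy gradient signal.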

The GRPO objective is very similar to the PPO objective, but adapted to the group setting. Instead of sampling one answer for each prompt, we sample a group of G answers from the old policy π_θ_old.

Then, for each answer in the group and for each token in that answer, we compute the importance ratio r_{i,t}(θ).

This ratio compares the probability of token o_{i,t} under the current policy π_θ with its probability under the old policy π_θ_old, given the same prompt and the previous tokens.

The advantage A_{i,t} is computed using the group-relative reward we introduced on the previous slide.

The clipping part is the same as in PPO. It prevents the current policy from moving too far away from the old policy during one update.

One difference from the PPO version we discussed earlier is how the KL term is used. In PPO, we included the KL penalty inside the reward signal. In GRPO, the KL penalty is subtracted directly in the objective as a separate regularization term, and we use the k3 estimator for the KL term. Specifically, k3 approximates the KL as rho minus log of rho minus one. Here rho is the ratio of reference to current probability for the token: π_ref of o_{i,t} divided by π_θ of o_{i,t}.

Compared with the k1 estimate we used in PPO, the k3 estimator is always non-negative and usually has lower variance.
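To see the difference between the two estimators on a single token, here is a minimal sketch. It assumes k1 is the plain log-ratio of the sampling policy to the reference model and k3 is rho minus log rho minus one, as described above. The function names and probabilities are made up.

```python
import math

def k1(logp_policy, logp_ref):
    """Simple log-ratio estimate; can be negative for an individual token."""
    return logp_policy - logp_ref

def k3(logp_policy, logp_ref):
    """rho - log(rho) - 1, with rho = pi_ref / pi_policy for the sampled token."""
    log_rho = logp_ref - logp_policy
    return math.exp(log_rho) - log_rho - 1.0

# Hypothetical token where the policy is less confident than the reference model.
logp_policy = math.log(0.25)
logp_ref = math.log(0.50)
print(k1(logp_policy, logp_ref))  # negative for this token
print(k3(logp_policy, logp_ref))  # still non-negative

# When both models agree on the token, both estimates are zero.
print(k3(math.log(0.3), math.log(0.3)))
```

The non-negativity follows from the fact that e^x minus x minus 1 is never below zero for any real x.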

So in short, GRPO keeps the PPO-style clipped objective, removes the value model, replaces the value baseline with a group-relative baseline, and adds the KL penalty directly in the objective.

Let's walk through one step of the GRPO training loop, similar to what we did for PPO.

First, we copy the current policy π_θ and freeze it as the old policy π_θ_old.

Then for each prompt q, we sample a group of answers from the old policy: o_1, o_2, and so on up to o_G. Next, we use a reward model, or more generally a reward function, to score each answer. This gives us a group of reward scores r_1, r_2, and so on up to r_G, where r_i is the reward score for prompt q and answer o_i.

Then, for each answer in the group and for each token in that answer, we compute several quantities in parallel.

First, we compute the importance ratio r_{i,t}(θ), which compares the probability of the token under the current policy and the old policy.

Second, we compute the advantage A_{i,t}. This comes from the group-relative reward. We normalize each answer's reward by subtracting the group mean and dividing by the group standard deviation, and then assign that advantage to every token in the same answer.

Third, we compute the per-token KL term between the current policy and the reference model. This KL term is multiplied by beta and subtracted from the GRPO objective as a regularization term.

Finally, we calculate the training loss.

Since the GRPO objective is something we want to maximize, in implementation we minimize the negative GRPO objective.
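The whole step for one prompt can be sketched as a single function. Treat this as an illustration under stated assumptions: tokens are averaged within each answer and then answers are averaged within the group, the population standard deviation is used, and the name `grpo_loss` and the default values of epsilon and beta are hypothetical.

```python
import math
import statistics

def grpo_loss(logp_current, logp_old, logp_ref, rewards,
              epsilon=0.2, beta=0.04, std_eps=1e-8):
    """Negative GRPO objective for one prompt.

    Each logp_* argument is a list of per-answer lists of token log-probs,
    and rewards holds one score per answer.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    objective = 0.0
    for lc_seq, lo_seq, lr_seq, reward in zip(logp_current, logp_old, logp_ref, rewards):
        advantage = (reward - mean) / (std + std_eps)   # same for every token
        seq_total = 0.0
        for lc, lo, lr in zip(lc_seq, lo_seq, lr_seq):
            ratio = math.exp(lc - lo)
            clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
            surrogate = min(ratio * advantage, clipped * advantage)
            log_rho = lr - lc
            kl = math.exp(log_rho) - log_rho - 1.0      # k3 estimator
            seq_total += surrogate - beta * kl
        objective += seq_total / len(lc_seq)            # average over tokens
    objective /= len(rewards)                           # average over the group
    return -objective

# Made-up group of two answers with two tokens each; answer 0 got the higher reward.
old = [[-1.0, -1.0], [-1.0, -1.0]]
ref = [[-1.0, -1.0], [-1.0, -1.0]]
before = grpo_loss(old, old, ref, rewards=[1.0, 0.0])
after = grpo_loss([[-0.9, -0.9], [-1.0, -1.0]], old, ref, rewards=[1.0, 0.0])
print(before, after)
```

In the example, raising the probability of the tokens in the better answer lowers the loss, which is the direction the update should move.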

During GRPO training, we update the policy model π_θ.

The reference model π_ref stays frozen, and the reward source, either a reward model or a reward function, is not updated by GRPO.

At a high level, what is the difference between PPO and GRPO?

The biggest difference is the choice of baseline. PPO uses a learned value model, or critic, as the baseline. GRPO uses the group of sampled answers as the baseline.

This leads to different characteristics for the two methods. PPO learns a value model and estimates the advantage at the token level. It is more general, but it also has a heavier training pipeline, because we need to train and update both the policy model and the value model.

GRPO, in contrast, does not need a value model. Instead, it samples multiple completions for the same prompt, scores them, and normalizes the rewards within the group.

Because of this, GRPO is usually simpler and cheaper to train. So, intuitively, PPO asks: was this token better than my value estimate?

GRPO asks: was this answer better than the other answers from the same prompt?

So when should we choose PPO?

PPO is a good choice when token-level credit assignment matters. Specifically, the reward model usually gives one score for the whole answer, not one reward per token. PPO uses the value model and GAE to propagate the sequence-level reward backward and estimate token-level advantages.

This is especially useful for long outputs, or cases where different parts of the answer may contribute differently to the final quality.

PPO is also useful when the reward variance within a group of sampled answers is low. In that case, group normalization may not give a strong baseline, and a learned value model can be more robust.

Another reason to choose PPO is stability. PPO is a more mature and widely tested recipe for general RLHF training.

So the typical use case is general RLHF for assistant alignment: making the model more helpful, honest, and harmless.

Then when should we choose GRPO?

GRPO is a good choice when outcome-level supervision is enough. In other words, we only need to know whether the final answer is good or bad, instead of estimating detailed token-level advantages with a value model.

GRPO also works well when rewards in the sampled group spread out naturally. For example, in math or code tasks, some completions may pass the verifier or unit tests while others fail. This gives a clear relative signal within the group.

Another advantage is simplicity. GRPO does not need a value model and it does not need GAE, so the training pipeline is simpler and cheaper.

This makes GRPO especially suitable for math, code, and other verifiable tasks.

That pretty much covers everything I want to share in this video. We started from the motivation for RLHF, then went through supervised fine-tuning, reward modeling, PPO, and finally GRPO.

Of course, this video only gives a high-level walkthrough, so I strongly recommend reading the original papers if you want to understand the design choices and the details more deeply.

If you have any questions, please leave a comment. If you find this video useful, please like it and share it with others.

Thanks for watching and I'll see you next time. Bye.
