Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 3: Policy Gradients
By Stanford Online
Summary
Topics Covered
- Imitate Own Data Weighted by Reward
- Causality Demands Future Rewards Only
- Baseline Subtracts Average to Cut Variance
- Importance Sampling Enables Reuse
Full Transcript
Okay, let's get started. As a recap of what we've talked about so far: we've formulated the reinforcement learning problem in terms of states and actions. We collect trajectories, which are sequences of states and actions. We have a reward function, which tells us how good a particular state and action is, and a policy, which is usually what we're trying to learn and which represents the behavior we want. And the goal of reinforcement learning, which we talked about in the first lecture, is to maximize the expected sum of rewards.
So we want to produce behavior that leads to positive outcomes.
We can write this down mathematically as an expectation over the probability distribution over trajectories, where that distribution is a function of the policy that is acting in the world and the environment dynamics, which might be stochastic, and of course the objective involves the reward function.
Last week on Friday we started talking about one way to do this: if you have expert demonstrations, you can try to imitate the expert by maximizing the likelihood of the actions that the demonstrator took, that is, producing a policy that takes the actions the expert would have taken. This is a very simple and scalable approach, but it cannot outperform the demonstrator and doesn't allow improvement through trial and error.
We also introduced the notion of an offline algorithm, which only uses an existing dataset and no new data from the learned policy, and an online algorithm, which uses new data from the learned policy. Today we'll be talking about ways we can use reinforcement learning to outperform the demonstrator and improve from practice, and specifically we'll start talking about online reinforcement learning algorithms. In particular, our first online RL algorithm will be policy gradients. We'll cover a couple of different variants of policy gradients and introduce both the mathematical derivation and some of the intuition behind them. Policy gradients are the basis for some popular reinforcement learning algorithms, including ones used to train legged robots and ones used to train language models, and they're also part of the default project. We won't get to the complete algorithm used to train those systems today; we'll be roughly halfway there, and by Friday we'll have most of it.
Yeah, and the key goals will be to get the intuition behind policy gradients, how to implement them, and when to use them.
Okay. So the outline of an online reinforcement learning algorithm is going to look something like this. First, we need to initialize our policy in some way: when you train a neural network, you need to initialize its weights. We could initialize the weights of our policy randomly, we could initialize it with imitation learning, or we could use heuristics; it depends on the problem setting. After we've initialized our policy, we'll run the policy to collect some data, so we actually try out different things in the world. Then we'll use that batch of data to improve the policy.
And then we will repeat that process: we'll collect new data with our updated policy and update the policy again using that latest batch of data.
So we're going to attempt the task and try to get better. What this might look like is that in the first iteration we attempt some simple 2D navigation task. The reward for some trajectories is bad, the reward for others is okay, and after more iterations we might have a policy that's a bit better and able to get higher reward, but isn't perfect. Sometimes it gets higher reward, sometimes it still gets low reward. Then we'll use that data to improve the policy again, and after more iterations hopefully we'll get a policy that is able to succeed.
So this is the notion of online reinforcement learning, where we're collecting data online and improving using that data.
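As a rough sketch, this collect-and-improve loop is just a few lines of Python. The collect_fn and update_fn callables here are hypothetical placeholders standing in for "run the policy to gather a batch of rollouts" and "use that batch to improve the policy"; this only illustrates the loop structure and is not code from the lecture.

```python
# Minimal skeleton of the online RL loop described above.
# collect_fn and update_fn are hypothetical callables supplied by the user.
def online_rl_loop(policy, collect_fn, update_fn, num_iterations):
    for _ in range(num_iterations):
        batch = collect_fn(policy)          # 1. run the current policy, collect a batch of data
        policy = update_fn(policy, batch)   # 2. use that batch to improve the policy
    return policy                           # 3. the next iteration collects data with the updated policy
```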
So we talked about this reinforcement learning objective. Say you have some policy and you want to understand how good that policy is. To do that, we can't just look at the weights of the neural network; we actually have to run the policy. So to estimate this expectation, which is a function of the policy, we can run the policy multiple times, observe the resulting rewards, and compute the average over the observed rewards. That average is an estimate of how good our policy is. In other words, we approximate the expectation with a sum over samples from our policy.
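Here is a minimal sketch of that Monte Carlo estimate in Python. The toy one-dimensional "step forward or backward" environment and the particular policy are made up for illustration; the point is just that J(θ) is estimated by averaging the sum of rewards over sampled rollouts.

```python
import numpy as np

def rollout(policy, rng, horizon=10):
    """Run one trajectory in a toy 1-D environment and return its sum of rewards."""
    state = 0.0
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state, rng)   # sample a_t ~ pi(a | s_t)
        state = state + action        # toy dynamics: s_{t+1} = s_t + a_t
        total_reward += action        # toy reward: forward progress
    return total_reward

def estimate_objective(policy, num_trajectories=1000, seed=0):
    """Monte Carlo estimate of J(theta) = E[sum of rewards] by averaging sampled returns."""
    rng = np.random.default_rng(seed)
    returns = [rollout(policy, rng) for _ in range(num_trajectories)]
    return np.mean(returns)

# Example: a stochastic policy that steps forward with probability 0.7.
policy = lambda s, rng: 1.0 if rng.random() < 0.7 else -1.0
print(estimate_objective(policy))   # roughly 10 * (0.7 - 0.3) = 4.0
```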
Now let's start to think about how this trial-and-error process actually works. We collect some data and we improve our policy. Collecting the data is straightforward: we've talked about how to run a policy to get a rollout in the environment. We sample an action, observe the next state, run that through the policy again to get a new action, and repeat. But how do we actually improve the policy after we've collected that data?
So we have our objective: we have some policy π_θ, and our goal is to maximize the expected rewards when we sample from that policy. We can treat the trajectories sampled from the policy as being drawn from some distribution that I'll write as p_θ(τ), with a θ because the distribution over trajectories is a function of the policy you're using, and our goal is to maximize the sum of rewards. I'm going to use shorthand here: r(τ) just means the sum of the rewards along that trajectory. So our goal is to maximize E_{τ ~ p_θ(τ)}[r(τ)] with respect to our policy parameters θ.
Now, how do we actually optimize this? If we want to use gradient descent, because we typically train neural networks with gradient descent, we need to figure out the gradient of this objective with respect to θ. We'll refer to this term as J(θ); this is the objective we're trying to maximize.
It's a little bit challenging to get a gradient of this directly, because the dependence on θ is through the trajectories that we sampled. We can't just differentiate through that sampling process, and the sampling process also depends on the dynamics of the world, which we also don't know.
So if we're going to think about getting a gradient of J(θ), it's helpful to write the expectation as an integral: J(θ) = ∫ p_θ(τ) r(τ) dτ. This is just the definition of an expectation.
Now, because an integral is a summation, which is a linear operation, we can bring the gradient inside the integral. In particular, that gives us ∇_θ J(θ) = ∫ ∇_θ p_θ(τ) r(τ) dτ.
This doesn't directly help us, because we don't want to have to evaluate this integral in order to take a gradient. At this point there is a convenient trick we can use to put this into a more useful form. The trick is the following identity: consider p_θ(τ) ∇_θ log p_θ(τ). Does anyone remember what the derivative of a log is? One over its argument, right. So p_θ(τ) ∇_θ log p_θ(τ) = p_θ(τ) · (1 / p_θ(τ)) · ∇_θ p_θ(τ), where we used the chain rule to apply the gradient. The two p_θ(τ) terms cancel out, and this equals ∇_θ p_θ(τ).
The reason this equality is convenient is that we can replace the ∇_θ p_θ(τ) in the integral with p_θ(τ) ∇_θ log p_θ(τ). Making that substitution, we get ∇_θ J(θ) = ∫ p_θ(τ) ∇_θ log p_θ(τ) r(τ) dτ.
Now this is something useful, because we can represent the first factor as an expectation again: ∇_θ J(θ) = E_{τ ~ p_θ(τ)}[∇_θ log p_θ(τ) r(τ)].
So what we did there is use this log trick to pull a p_θ(τ) back out and get an expectation, and we're starting to get to something we can actually evaluate. In particular, this expectation is something we can estimate with samples, and θ now appears on the inside. We still have the ∇_θ log p_θ(τ) term, which doesn't yet look like π_θ(a|s), so we need to work on that term, but we're getting towards something we can evaluate.
Yeah, so the question is how this is more useful than the first expression, since we still have the sampling, and the issue with the first case was that we can't differentiate through the sampling. The thing that's different here is that θ appears inside the expectation. We're still not going to differentiate through the sampling itself, but we will be able to differentiate the term inside with respect to θ, and we can draw many samples and measure this quantity to get a pretty accurate estimate of the gradient in a way that actually depends on θ.
Okay, let's talk a little bit about this ∇_θ log p_θ(τ) term. And in case it's useful in terms of where we're at so far, the slides show it as well: we looked at J(θ), wrote it down as an integral, found the convenient identity that p_θ(τ) ∇_θ log p_θ(τ) = ∇_θ p_θ(τ), and used that to derive the expectation form. That's where we are now.
For ∇_θ log p_θ(τ), we can write out the trajectory distribution: p_θ(τ) = p(s_1) ∏_t π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t). Then when we take the log, as on the slide, the product becomes a sum: log p_θ(τ) = log p(s_1) + Σ_t [log π_θ(a_t | s_t) + log p(s_{t+1} | s_t, a_t)].
Then, once we take the gradient of log p_θ(τ) with respect to θ, note that θ doesn't affect the initial state distribution and it doesn't affect the dynamics in any way. So ∇_θ log p_θ(τ) is equal to just the gradient of the middle term: Σ_t ∇_θ log π_θ(a_t | s_t).
This is really useful, because it means we can plug this in, and it's something we can actually evaluate. The gradient of log π_θ is just a backward pass through the log of the output of our neural network. So we get a gradient that we can evaluate.
When we plug this in, the gradient of J(θ) is an expectation over trajectories sampled from our policy; inside that expectation we can bring the gradient inside the summation, so we have a sum over time steps of ∇_θ log π_θ(a_t | s_t), and then we still have r(τ), which we could also write as the summation Σ_t r(s_t, a_t):
∇_θ J(θ) = E_{τ ~ p_θ(τ)}[ (Σ_t ∇_θ log π_θ(a_t | s_t)) r(τ) ].
This is what's called the policy gradient, or the vanilla policy gradient. And this is a lot nicer than the version up there, because we can sample trajectories from our policy, compute their reward, also compute the gradient of log π_θ for each of the actions, multiply and sum those together, and get a gradient that we can apply to the weights of our neural network that actually leads to higher reward.
So again, on the slides: we wrote down the probability of a trajectory; if you take the log of that, you get a summation instead of a product, with a log of each of the individual terms. Then if we plug that in and take the gradient with respect to θ, the first term and the last term go away because they don't depend on θ, and we get this final expression when we plug it in.
So this is the gradient. When we run policy gradient, we'll collect data, evaluate this equation to represent the gradient with respect to the weights of the policy, apply that gradient step to our policy, and then repeat the process.
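As a minimal sketch of one such update in PyTorch, on placeholder data standing in for a collected batch: the shapes, the small network, and the random states, actions, and returns are all assumptions for illustration. The key line builds a scalar whose gradient is the return-weighted sum of log-probabilities, so one backward pass yields the policy gradient estimate.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
obs_dim, n_actions, N, T = 4, 3, 8, 20            # assumed sizes, for illustration

policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(), torch.nn.Linear(32, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states = torch.randn(N, T, obs_dim)               # s_t for each trajectory and time step (placeholder)
actions = torch.randint(0, n_actions, (N, T))     # a_t actually taken in the rollouts (placeholder)
traj_returns = torch.randn(N)                     # r(tau_i) = sum of rewards of trajectory i (placeholder)

logits = policy(states)                                      # forward pass: action distribution per (i, t)
log_probs = Categorical(logits=logits).log_prob(actions)     # log pi_theta(a_t | s_t), shape (N, T)

# grad J ~= (1/N) sum_i [ (sum_t grad log pi(a_t|s_t)) * r(tau_i) ]
# Built as a scalar to differentiate; negated because optimizers minimize.
loss = -(log_probs.sum(dim=1) * traj_returns).mean()

optimizer.zero_grad()
loss.backward()                                   # one backward pass gives the gradient estimate
optimizer.step()                                  # one gradient step on the policy weights
```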
So yeah, we also talked about this verbally, but in order to estimate this gradient, we'll run the policy to collect some data and then improve the policy using that data. In particular, you can think of the first summation as collecting N samples to estimate that expectation, and ideally N is large, because the higher N is, the better an estimate you'll get of the expectation. Then, once we evaluate this gradient that we're estimating with samples, we take a gradient step on our policy parameters.
So the full algorithm looks like: sample trajectories from our policy, evaluate an estimate of the gradient using those samples, apply that to the weights, and then repeat the process: collect a new batch of data, recompute the gradient, and improve the policy.
So the question is what it looks like to use the data to estimate this gradient, in particular the grad-log-pi term. When we're using the data to compute that term, we're getting a bunch of states and actions, N trajectories that in this case each have length T time steps, and we'll take the states in those trajectories and pass them through the policy as a forward pass to get a probability distribution over actions. We'll then evaluate the probability of the action in the data under the distribution that the policy output. So if it's a categorical distribution, we get the probability of that action, take the log of that probability, and then take the gradient, which is just a backward pass through the network after taking the log of the policy output for that action. This is called the REINFORCE algorithm, or the vanilla policy gradient algorithm.
Yeah, so this term is just r(τ); I'm using r(τ) as shorthand for the sum of rewards just because I didn't want to write the summation five times. And yes, exactly: this is an estimate, because we're only taking N samples to estimate the expectation. If you took infinitely many samples it would be exact, but of course taking infinite samples is not possible.
Yeah, so the question was whether we compute this gradient by hand. It's going to depend on the architecture, but you can use automatic differentiation: whatever library you use, PyTorch or otherwise, can compute that for you, so you don't have to write it out. Great.
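As a sketch of that per-step computation for a discrete-action policy (the tiny linear "policy" and the particular state and action are placeholders): forward pass to get a distribution, look up the action from the data, take the log of its probability, and let autodiff do the backward pass.

```python
import torch
from torch.distributions import Categorical

obs_dim, n_actions = 4, 3
policy = torch.nn.Linear(obs_dim, n_actions)      # tiny stand-in for a policy network

state = torch.randn(obs_dim)                      # s_t taken from the collected data (placeholder)
action_taken = torch.tensor(2)                    # a_t the policy actually took (placeholder)

logits = policy(state)                            # forward pass -> categorical logits
log_prob = Categorical(logits=logits).log_prob(action_taken)   # log pi_theta(a_t | s_t)
log_prob.backward()                               # backward pass: grad of log pi w.r.t. the weights
print(policy.weight.grad.shape)                   # autodiff computed this for you
```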
So let's talk about some of the intuition behind this. There's the math, but what does this gradient actually mean? Let's start by looking at the first term. If we think back to imitation learning, this first term is actually the same as the gradient of the imitation learning objective. In imitation learning we are trying to mimic actions from some demonstration dataset by maximizing the log probability, or minimizing the negative log probability, and if you take the gradient of that, it's similarly the gradient of the log probability. So you can think of this first term as saying: I'm going to try to increase the likelihood of the actions in the online data. The second term then weights that by the reward of the trajectory, by how good that trajectory was.
So you can think of this as the imitation gradient, but weighted by the sum of rewards seen in the data. Essentially, you're going to increase the likelihood of actions that had high reward and decrease the likelihood of actions that were taken in negative-reward trajectories. You're imitating your own data, the data collected by the policy, but depending on whether it's high reward or low reward, you might actually have a negative gradient. If the reward is negative, that pushes you away from the bad stuff and towards the good stuff.
Yes? Yeah, so this is inside the first summation; you can see the i comes up right here. We're summing over a bunch of trajectories, and this is all within the summation over N.
So the question is whether the difference between this and imitation learning is that in imitation learning you're either right or wrong, whereas here you're this amount of right or this amount of wrong. Well, in principle you can apply this in scenarios where the reward is very sparse, even where the reward is just one if you succeeded and zero otherwise. Now, this algorithm may not do very well in those scenarios, and we'll talk about that. One of the key differences with imitation learning is that you're doing this on your own rollouts: the policy is collecting its own data and then applying this objective to its own data. And the second difference is that we're actually using the reward, whereas in imitation learning you're just blindly imitating all the data, without any notion of how good the data is; it kind of assumes that it's good data.
Yeah. How do you sample the first state of the trajectory? That's typically defined by the problem setting. In some problem settings, like training an agent to play pinball, the initial state is just given by the pinball machine. In other cases, like language modeling, the initial state or observation might be the first question the user asks. So it's usually thought of as a property of the environment.
Yeah, so the question is whether you can use this to improve imitation learning by weighting the trajectories by reward. Yeah, you could definitely do that. We'll also talk about some of the good things and bad things about this objective in the coming slides.
Yeah, so the question was whether, in practice, it makes sense to start with imitation learning, run that for some number of iterations, and then switch to this. That can definitely make sense. If you start from a completely randomly initialized policy, it will do random stuff. If you have a very dense reward signal, that can be okay, but in other scenarios, if you have access to demonstrations, then initializing with demonstrations can be a very good fit, and it will allow you to improve on top of your initial policy.
So the question is, if you don't have a good conception of what's good and bad, of what the reward function should be, can you use imitation learning to get an initial sense for that? We'll talk about reward learning in a few lectures, and how you can potentially use expert data for that. In practice it's challenging.
Yeah. So we talked about how this is intuitively increasing likelihood on high-reward trajectories and decreasing likelihood on negative-reward trajectories. If we look at this visually: say we took a few trajectories, one with good reward and one with bad reward, and the red curve shows our initial distribution over trajectories. What this gradient is going to do is increase the likelihood on the green part, so you'll get something that looks like this, and then decrease the likelihood on the red part, so you'll get something that looks like this. It'll encourage the policy to take actions that are more like the high-reward trajectory.
In even briefer terms, you can think of this as trying to do more of the good stuff and less of the bad stuff that you observe.
Okay? And you can also think of this as a formalization of trial and error: you try different things, and if you see good stuff happening you'll try to do more of that, and if you make a mistake you'll try to stop doing that.
Cool. So now I have a question for you. Let's look at an example. Say we're trying to have a humanoid learn how to walk in simulation, and we define the reward function to be the forward velocity of the robot. This means the reward will be positive if it's moving forward, negative if it's moving backwards, and zero if it's not moving.
First we'll sample from our policy. Say we have a policy that's not very good at the beginning. In the first trajectory maybe it falls backwards. In the second trajectory maybe it falls forwards, which technically gives it positive reward, so the sum of rewards will be positive. Maybe in a third trajectory it manages to stand still, so the rewards are zero. In a fourth trajectory maybe it takes one small step forward and then falls backwards, so it gets positive rewards at the beginning of the trajectory and negative rewards later, and the sum of rewards is negative. And maybe in the last trajectory it takes one large step backwards and then a small step forwards, so it has negative rewards and then some positive rewards, but the sum of rewards is still negative.
Now, my question for you is: what will the gradient we defined encourage the policy to do if this is your dataset, if these are the five trajectories that you sampled? This is going to be one of our simpler examples, and we'll move to more complex examples in a couple more slides. I'd encourage you to pair up for a minute, discuss it briefly, and then we'll come back and share in one or two minutes.
Okay, let's come back. If you want to share your answer, can you raise your hand?
Okay, so the answer was that the gradient will encourage it to fall forwards and potentially stand still. Cool, that's mostly correct. If we think through each of these trajectories: the sum of rewards over the entire trajectory will be negative for trajectories one, four, and five, it should be about zero for three, and for two the sum should be positive. What this means is that you'll get negative gradients for any actions taken in one, four, and five, positive gradients for all the actions taken in two, and roughly zero gradients for the actions taken in trajectory three. So it will encourage the policy to fall forwards, and it'll discourage it from doing the things in one, four, and five: it will discourage it from taking steps backwards or forwards, and it will also discourage it from falling backwards.
Cool. So yeah, we encourage the policy to fall forwards and not take steps. This is distressing, because we don't want our policy to fall forward; we want it to take steps. And so this is an example of how the gradient we're getting is noisy and high variance. It means that depending on what these trajectories are, the gradient might vary drastically. If you did actually have some trajectories that were walking forwards, you might get a very different gradient estimate from that set of trajectories. And that's essentially the definition of having noise, or having high variance.
Yeah. So the question is why we're getting multiple trajectories from the same policy, like why not just get one? There are two reasons. One is that the policy is not deterministic, so if we want to explore different strategies, we will get different samples of different strategies because it's nondeterministic. And the second reason is that the environment might be stochastic, so even with a deterministic policy you might see different outcomes, and this will help you learn about those different outcomes.
Yeah, there's one thing we can do about this that directly improves the gradient in a fairly simple way. In particular, one observation we might make is that if you're in the middle of a trajectory and you take an action, that action has no influence on previous states. Right now the gradient is using the rewards at all states in the trajectory in the update, and it's even telling us to take actions later in the trajectory more or less often depending on whether the reward earlier in the trajectory was good or bad.
So if one of our trajectories was to take a large step backwards and then a small step forwards, we should actually be able to learn from that trajectory that taking a small step forward is good.
The way we can do that is, instead of having the reward sum run from t = 1 to T, we have the summation start at the current time step. Let's use a different letter: we sum over t' from t to T of r(s_t', a_t'). By having the weight on each action only look at future rewards, we're incorporating the fact that current actions can only affect the future, not the past. Oh yes, sorry, one other thing you have to do here, yes, good point, is to move this reward summation inside the summation over time steps, so that each ∇_θ log π_θ(a_t | s_t) gets its own sum of future rewards:
∇_θ J(θ) ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_t | s_t) ( Σ_{t'=t}^{T} r(s_t', a_t') ).
Why can we do this? The reason is that our action at, say, time t = 100 is only going to influence rewards in the future; it isn't going to have any impact on states in the past. It's completely independent of s_50, or s_51, or s_99. Because of that independence, we can drop the earlier rewards from the term for that action: when increasing the likelihood of an action at time t, we only use the future rewards that that action can influence.
It's a little bit hard to write out mathematically why you can do that; I can think about whether there's a nice mathematical expression for it. But it's really about causality in the world: policy behavior at time t is not going to influence rewards at earlier time steps. Cool. And I think this equation is right; I don't have a typo in this one. So yeah, we're bringing the reward summation inside the other one, and it's only a function of the future time steps.
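A short sketch of that reward-to-go computation, with placeholder rewards: the helper computes, for each time step, the sum of rewards from that step onward, which is the quantity that now multiplies each ∇_θ log π_θ(a_t | s_t).

```python
import numpy as np

def reward_to_go(rewards):
    """rewards: array of r(s_t, a_t) for t = 1..T. Returns, for each t, sum_{t' >= t} r(s_t', a_t')."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Example: a large step backwards (-1.0) then a small step forwards (+0.3).
print(reward_to_go(np.array([-1.0, 0.3])))   # [-0.7, 0.3]
# The whole-trajectory return (-0.7) would penalize both actions, but reward-to-go
# still gives the later small step forward a positive weight (+0.3).
```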
And this means that for this example, we will be decreasing the likelihood of actions at the beginning of the trajectory, but the future summations become positive once the robot takes a step forward, so you'll actually start increasing the likelihood of taking a small step forwards.
Yeah, so a very good point: it still doesn't help you if you take a step forward and then take a step backwards. The reason that case is a little bit difficult to handle is that it's possible the step forwards actually caused you to fall backwards, or caused you to take a step backwards. You don't really know. A big part of reinforcement learning is figuring out what influence actions have on the future, and you need to learn that from experience. If you take a step forwards, from the algorithm's point of view it could actually have caused something bad to happen in the future, and so we can't drop those rewards.
So the question was: if you had a lot of really bad stuff at the beginning and then took a good action later in the trajectory, would you have enough time to accumulate reward for it? In your particular example, the horizon was T = 10 and you did something really good at time step six. At time step six, the term will look like ∇_θ log π_θ(a_6 | s_6) times the sum over t' from 6 to 10 of r(s_t', a_t'), so if you got a high reward from that action immediately or in the next four time steps, then you would increase the likelihood of that action.
Now, it's really going to depend a lot on the reward function. You're saying that for fixed-horizon problems, the gradient will be different if you're summing over ten rewards versus over four rewards. One thing that's important if you have a fixed time horizon T is that for this to be a proper state, for the problem to be truly Markovian, the state needs to contain information about the future dynamics. If the horizon is fixed at 10 and the policy doesn't know the horizon is about to end, then it's not a truly Markovian state; you need to give the policy information about how many time steps it has left to succeed, and you can incorporate that into the state. So you would actually have different states in those different circumstances.
Okay. So I have another example for you, and this actually comes back to the question that came up earlier, although it's slightly different. Say we're now using the version of the gradient that only depends on future rewards, and we have the same problem, but our policy is a little bit better. One of the trajectories falls forward, one slowly stumbles forward, one walks forward, and one runs forward. My question for you is: what will the gradient encourage the policy to do in this case, using the updated version? So I'll encourage you to pair up again and discuss this, and then we'll come back in about two minutes.
Okay, let's come back. Please raise your hand if you want to share your answer.
Okay, yeah. So the first answer was that it's going to get the highest reward for the behavior that's moving forward at the highest velocity, so it will encourage that the most, and you were saying that even if it makes any sort of forward progress, it'll still encourage that. Yeah. Another hypothesis was that, going back to the previous example, it might encourage it to fall forward and then get back up again, and repeat, especially if falling forward is easy. Now, getting up again is actually a little bit more challenging, so that might also be tricky. And another thought was that the fourth trajectory might dominate if it has a much higher reward than the others, whereas if they're more even in terms of velocity and the reward is just about moving forward, it won't necessarily dominate and you'll do all four.
Yeah. So all of the rewards in these cases will be positive. I think what a lot of people are getting at is that the reward of the last trajectory will be the highest, so it will be encouraged the most. But because trajectories one, two, and three also have positive rewards, the policy is still going to be encouraged to do those things as well. Because the weight on those terms is non-negative when the reward is positive, you're still increasing the likelihood of the actions in those trajectories. Now, it won't do them as frequently as running forward, but it will still be encouraged to do them, and that might be a little concerning, because some of the time it will actually be encouraging the policy to stumble forward, and ideally we'd want a policy that runs as fast as possible.
Yeah. So the question was whether this is because of a local minimum: with this reward function the best outcome is to run forward, so why are we getting this kind of behavior from the objective? The reason is that the gradient is noisy; it comes down to variance. If you do have a policy that runs forward, it will encourage it to keep running forward, but when you have other data in there, that other data will influence the update in some ways. It won't necessarily converge to a local optimum, because at this point it will encourage running forward, and then the next batch of data will run forward more, which will encourage it to run forward even more. Eventually it will converge to the global optimum, but it will get there more slowly, because the gradient has a lot of variance depending on the samples that you select.
Will this encourage the robot to roll forward? With this data, it will not encourage it to roll forward, unless some weird combination of gradients causes that. Generally it will not. If you had data of rolling forward, then it would encourage rolling forward.
The question is, can the agent learn from a subset of the trajectory instead of the whole trajectory? In this case, it will learn from all future steps in the trajectory, so that is a subset, but it might not be the subset you want, and in the next lecture we'll talk about more sophisticated, smarter ways to estimate this.
Cool. So the thing we saw here is that things are sensitive to the scaling of the reward function. And on the flip side, if everything were negative, as came up in the question before, we would only be getting negative gradients for our policy and no positive gradients at all. That's a big problem with this kind of approach: it depends heavily on the scale of the reward. And there's actually a very simple thing we can do to fix this.
In particular, when we're using this gradient (I'll use the first version, E_{τ ~ p_θ(τ)}[∇_θ log p_θ(τ) r(τ)], just to make it a bit shorter), instead of weighting by r(τ), we can subtract a constant b, where that constant could be the average reward, and weight by (r(τ) - b). If we subtract the average reward, things that are better than average will get a positive weight and increase in likelihood, and things that are below average will get a negative weight and decrease in likelihood.
Now, you might be wondering whether we can just add this b here. Did I just make that up? Why are we allowed to do that? Well, we can look at the effect this has on the gradient, and we can show that subtracting a constant has, in expectation, zero effect on the gradient. In particular, the extra term this introduces into ∇_θ J(θ) is E_{τ ~ p_θ(τ)}[∇_θ log p_θ(τ) · b], and it's possible to show that this is equal to zero.
In particular, if we write this out as an integral, the expectation is ∫ p_θ(τ) ∇_θ log p_θ(τ) b dτ. We can then use the same trick that we used before, but in reverse: this is equal to ∫ ∇_θ p_θ(τ) b dτ. I'm a little bit out of space, but the gradient can be moved outside the integral, because integration is a linear operation, and we can also move b outside the integral because it's a constant. So this is equal to b ∇_θ ∫ p_θ(τ) dτ. Now, that integral is just equal to one, the gradient of one is zero, and b · 0 = 0.
So this means that, in expectation, this is going to have zero effect on the gradient. To write that out a little more cleanly: the effect the baseline has on the gradient is E[∇_θ log p_θ(τ) · b]; we write that as an integral, use the log trick in reverse, move b and the gradient outside of the integral, note that the integral is equal to one and the gradient of one is zero, and we get zero. So this means we can add this baseline.
Yeah, so the question was that this constant term wouldn't be helpful if we could truly evaluate the expectation, and yeah, that's correct. If we could measure this expectation exactly, we wouldn't need it. The reason it's helpful is that we're estimating the expectation with samples, and you can think of the baseline as a way to reduce the variance of that estimate, and it's an unbiased way to do that. In particular, as long as the baseline is a constant, it won't change the behavior of the gradient in expectation. That's what unbiased means; it's a term from statistics about the effect on random variables. And you can show that it reduces the variance of your estimate of the gradient.
Yeah, so the question is whether b is really a constant; it seems to depend on the trajectories. We'll use the same value of b for an entire batch, for all of the examples: we'll collect a batch of data, compute b, and then use that same b for all of the updates on that batch of data.
So how does it reduce the variance of the gradient? You can write down the variance of the gradient estimate analytically and show that it reduces it. I don't want to go through the math, because we have some other stuff to get through, but intuitively you can think about it in terms of this example: it will encourage more of the above-average stuff and less of the below-average stuff.
I could also mention that average reward is actually not the optimal baseline. You can derive the variance, take its derivative, and set it equal to zero to find the optimal baseline. But average reward is pretty good; the optimal baseline is very close to average reward, with some additional weighting terms.
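As an illustration of those two claims, here is a small numerical check on a toy one-step problem of my own construction, not from the lecture: a single-parameter policy with π_θ(a=1) = sigmoid(θ), reward 1 for action 1 and 0 otherwise. Subtracting the average-reward baseline leaves the average gradient essentially unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5
p1 = 1.0 / (1.0 + np.exp(-theta))                  # probability of taking action 1

actions = (rng.random(200_000) < p1).astype(float) # sampled actions
rewards = actions                                   # r = 1 for action 1, else 0
grad_log_pi = np.where(actions == 1, 1 - p1, -p1)   # d/dtheta log pi(a) for each sample

baseline = rewards.mean()                           # average reward over the batch
g_no_baseline = grad_log_pi * rewards
g_with_baseline = grad_log_pi * (rewards - baseline)

print(g_no_baseline.mean(), g_with_baseline.mean())   # nearly identical: the estimate stays unbiased
print(g_no_baseline.var(), g_with_baseline.var())     # the variance is smaller with the baseline
```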
Cool. So we have one more episode of "what does the gradient do?", and we're going to look at a different example. In this case, the robot is folding a jacket, and the reward is one if it's neatly folded, 0.5 if it's folded with some wrinkles, and zero if it's not folded. We're again going to sample from our policy. We have some samples of not touching the jacket, folding only the sleeves but not folding the full thing, flattening the jacket but not folding it, and one trajectory where it does fold the jacket. My question is: with the baseline in here this time, what does the gradient look like?
The rewards for trajectories one, two, and three will be zero, and the reward for four will be positive. So, because of our baseline, the gradient for four will be positive and we'll encourage that behavior, and we'll discourage the behavior in one, two, and three. In some ways this is good, because it means we are encouraging the good behavior, but there are also some things that are a little bit dissatisfying about it.
The first is that the gradient for one, two, and three is going to be identical for all of them; there isn't any differentiation between them, and some of these are actually better than others. This is in some ways a result of the reward function we're using, which is more sparse than it could be. The other thing is that we'll see some techniques in Friday's lecture that would be able to leverage the good parts of these trajectories: there is something good about folding the sleeves, and it would be nice to use that data to encourage the policy, whereas right now that trajectory will actually be discouraged and get negative gradients. So in some ways this is good, it will encourage folding, but the gradient is constant for trajectories one, two, and three, and on Friday we'll see ways to make this a little bit better.
One of the takeaways here is that policy gradient is still noisy and still high variance. It's really going to work best if you have dense rewards rather than sparse rewards, and it'll work best with large batches of data, because if you hadn't actually sampled this fourth trajectory and had only sampled failures, you would have zero gradient and you wouldn't be able to improve.
So the question was: can we just improve the reward function? Yes, a denser reward function would definitely help in these cases. There are still scenarios where we could do better. For example, if the robot folded the sleeves, got positive reward for that, and then dropped the jacket on the ground afterwards, that's still dissatisfying, because we wouldn't make use of the successful first part. We'll actually see how to get a good gradient for that first part on Friday. But yeah, a denser reward function for this particular set of trajectories would work a lot better.
Oh, so the question here is: when we use the baseline, are we also using the trick of only using future rewards? You can combine them. You need to think a little bit more carefully about what the baseline should be, because depending on where you are in the trajectory you'll be summing over fewer or more rewards, but you can definitely combine them.
Yeah, so the question was: if we're using average reward as the baseline, do we need to keep updating it, using data from the current policy? Yes, you'll want to keep updating it, because as the policy gets better, what counts as above average and below average will change, and you'll want to recompute it on your latest batch of data.
Cool. So, one more thing on this version of the algorithm, and then we'll move on to one last version. In terms of how to implement it: we derived a gradient that looks like this, and to actually implement it in practice, you could in principle run a backward pass for each of these terms individually. But if you computed them all individually, you'd have to do N × T backward passes, which would be rather inefficient. It would be nice if we could just do automatic differentiation on the full objective, with one forward pass and one backward pass.
And we can do that. It's possible to implement a surrogate objective whose gradient is the same as this gradient. I'm calling it a surrogate because it's not the same as the very first objective of maximizing expected rewards, but you can write down a surrogate objective for this particular estimator, and this is what it looks like on the slide. Note that I actually have a typo in these equations: the second sum should be inside the first sum. That's a small note; I'll see if I can fix it. But you can implement a surrogate objective and then use backprop and autodiff software to do just one backward pass.
Okay, and what this looks like in practice is a cross-entropy loss for a discrete-action policy, or something like squared error for a Gaussian policy. You could also use more sophisticated policy classes, like diffusion policies, as well.
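Here is a sketch of such a surrogate loss in PyTorch, combining the reward-to-go weights with a simple batch-mean baseline (a crude choice here, per the earlier Q&A), on placeholder data. The gradient of this scalar with respect to the policy weights is the policy gradient estimate, obtained with a single backward pass.

```python
import torch
from torch.distributions import Categorical

N, T, obs_dim, n_actions = 8, 20, 4, 3            # assumed sizes, for illustration
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, n_actions))

states = torch.randn(N, T, obs_dim)               # placeholder rollout data
actions = torch.randint(0, n_actions, (N, T))
rewards = torch.randn(N, T)

# Reward-to-go for every (trajectory, time step), then subtract a batch-average baseline.
rtg = torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])
weights = (rtg - rtg.mean()).detach()             # treated as data, not differentiated through

# Surrogate objective: each log pi(a_t|s_t) weighted by its own (reward-to-go - baseline),
# i.e. the second sum sits inside the first, as noted about the slide.
log_probs = Categorical(logits=policy(states)).log_prob(actions)   # shape (N, T)
surrogate_loss = -(log_probs * weights).mean()

surrogate_loss.backward()    # a single backward pass yields the whole gradient estimate
```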
Okay. So, summary so far: we showed how to estimate the gradient of the RL objective using this log-gradient trick. This corresponds to weighting the policy likelihood by future rewards, and we found that subtracting a baseline, like the average reward, will further improve the gradient. Even with these two things, the gradient is still noisy, so it works best with large batches and dense rewards. This is also our first online reinforcement learning algorithm, where we iterate between collecting a batch of data with our policy and improving our policy, and it formalizes this trial-and-error learning process.
And the intuition for everything we've done so far is: do more of the high-reward stuff and less of the low-reward stuff.
Okay. Now, the last thing we'll talk about today is what else is troublesome about policy gradients, and one thing we haven't really talked about much is the details of the sampling process. This is what the gradient in our algorithm looks like, and note that this gradient assumes we're sampling from π_θ, and we change θ in step three. This means that every single time we take a single gradient step, we need to recollect data to estimate our gradient. That's kind of sad, because usually we take thousands and thousands of gradient steps on neural networks, so that seems like a problem.
This property is called being on-policy: it means the gradient we're measuring requires data from our exact current policy. We can think of this as an attribute of an online RL algorithm, where on-policy means it only uses data from the current policy, and off-policy means you can reuse data from past policies.
So, can we derive an off-policy version of the policy gradient? It's only going to be a little off-policy: by that I mean we're just going to try to be able to take more than one gradient step on a single batch of data.
And to do this we can use importance sampling. Importance sampling is actually a really cool technique that's useful outside of reinforcement learning as well. It's about the scenario where you want to estimate a quantity, an expectation, without actually sampling from that distribution. Say you want to estimate the expectation of some function f(x) under samples from p(x), but you only have the ability to sample from q(x). This q is usually called your proposal distribution; it's able to propose samples to use. In our setting, q is like your old policy: we have samples from our old policy, but we want to estimate a quantity under our new policy after we've taken a gradient step.
What you can do in this case is write the expectation as an integral, E_{x~p}[f(x)] = ∫ p(x) f(x) dx, and as long as q(x) has reasonable support, we can multiply in a term q(x)/q(x) that equals one. Then we can write this as an expectation with respect to q: E_{x~p}[f(x)] = E_{x~q}[ (p(x)/q(x)) f(x) ].
So this means we can estimate the expectation by sampling from q and weighting the quantity by p over q. And that's pretty cool: it means we can, in principle, use samples from our old policy q, weight them by the probability of the sample under our new policy divided by its probability under the old policy, times the quantity we care about.
So we showed this for the simple case of p(x) and q(x). In terms of actually applying this to our policy: our objective is an expectation of rewards under p_θ(τ). If we instead have some previous policy that we've collected samples from, we can measure our objective using the samples from that old policy, where we weight the reward by the probability of the trajectory under the new policy divided by its probability under the old policy.
Now note that this requires your proposal distribution to have non-zero support wherever your new policy has probability. This is actually very important for it to work: if q(x) is zero there, the ratio is not well defined. But that's really the only catch. And in general, this importance sampling technique is going to work best when p and q, or when your old policy and your new policy, are very similar to each other.
Now, there are some details. We've shown this in terms of p_θ(τ). How do we actually do this in terms of π_θ(a|s)? That gets a little bit trickier. We can write down p_θ(τ), and this ratio, as the product of the different probabilities: the initial state term, the policy terms, and the dynamics terms. Fortunately, the initial state and dynamics probabilities cancel out in the ratio. But when we put the remaining probabilities into our update, we get a product over time steps of these policy probability ratios.
Now, unfortunately, this product can become very small or very large if the length of the trajectory is large, because you're multiplying many, many numbers together. If you have a hundred numbers that you're multiplying together and they're above one, the product might get very large; if they're below one, it might go to zero very quickly. So in practice we often consider expectations over time steps instead of over trajectories. That has a little bit of a hitch too, which is that this needs to be over state-action pairs rather than actions given states, but in practice we often approximate that state-distribution ratio as equal to one. This is incorrect, but it's kind of the best we can do, and then we get a final form with a single ratio of new policy over old policy, π_θ_new(a_t|s_t) / π_θ_old(a_t|s_t), for each time step.
Cool. The full algorithm looks the same, except now we can take multiple gradient steps on the same batch. This is good. And then we have this new form for our gradient.
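A sketch of what taking multiple gradient steps on one batch looks like in PyTorch, with placeholder data. The log-probabilities under the data-collecting policy are frozen once per batch, and each update weights the precomputed reward-to-go-minus-baseline terms (assumed given here) by the ratio of new to old policy probabilities; no KL constraint or other safeguard is included, since that is a topic for the next lecture.

```python
import torch
from torch.distributions import Categorical

N, T, obs_dim, n_actions = 8, 20, 4, 3
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(N * T, obs_dim)              # flattened (s_t, a_t) pairs from the rollouts (placeholder)
actions = torch.randint(0, n_actions, (N * T,))
weights = torch.randn(N * T)                      # reward-to-go minus baseline per step, treated as data

# Log-probs under the policy that collected the data, frozen for the whole batch.
with torch.no_grad():
    old_log_probs = Categorical(logits=policy(states)).log_prob(actions)

for _ in range(10):                               # multiple gradient steps on the same batch
    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)  # pi_new(a_t|s_t) / pi_old(a_t|s_t)
    loss = -(ratio * weights).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```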
Okay. We'll talk a little bit about some tricks another time; we don't have time to cover KL constraints today. But to review: we talked about online RL with policy gradients, which is an on-policy algorithm; we talked about baselines for reducing variance; and we just derived this off-policy policy gradient, which uses importance sampling and allows us to apply multiple gradient steps per batch. The intuition is to do more of the high-reward stuff and less of the low-reward stuff, and the gradient is still very noisy and works best with large batches and dense rewards. Cool, and that's it for today. Next time we'll talk about actor-critic methods, which will build closely on what we talked about today, including PPO.