Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 3: Policy Gradients
By Stanford Online
Summary
Topics Covered
- Imitate Own Data Weighted by Reward
- Causality Demands Future Rewards Only
- Baseline Subtracts Average to Cut Variance
- Importance Sampling Enables Reuse
Full Transcript
Okay, let's get started. As a recap of what we've talked about so far: we've formulated the reinforcement learning problem in terms of states and actions. We collect trajectories, which are sequences of states and actions. We have a reward function, which tells us how good a particular state and action is, and a policy, which is usually what we're trying to learn and which represents the behavior we want. And the goal of reinforcement learning, which we talked about in the first lecture, is to maximize the expected sum of rewards.
So we want to produce behavior that leads to positive outcomes.
We can write this down mathematically as an expectation over the probability distribution over trajectories, where that distribution is a function of the policy that is acting in the world and the environment dynamics, which might be stochastic, and of course the objective involves the reward function.
Last week on Friday we started talking about one way to do this: if you have expert demonstrations, you can try to imitate the expert by maximizing the likelihood of the actions that the demonstrator took, that is, producing a policy that takes the actions the expert would have taken. This is a very simple and scalable approach, but it cannot outperform the demonstrator and doesn't allow improvement through trial and error.
We also introduced the notion of an offline algorithm, which only uses an existing dataset and no new data from the learned policy, and an online algorithm, which uses new data from the learned policy. Today we'll be talking about ways we can use reinforcement learning to outperform the demonstrator and improve from practice, and specifically we'll start talking about online reinforcement learning algorithms. In particular, our first online RL algorithm will be policy gradients. We'll cover a couple of different variants of policy gradients and introduce both the mathematical derivation and some of the intuition behind them. Policy gradients are the basis for some popular reinforcement learning algorithms, including ones used to train legged robots and ones used to train language models, and they're also part of the default project. We won't get to the complete algorithm used to train those systems today; we'll be roughly halfway there, and by Friday we'll have most of it.
Yeah, and the key goals will be to get the intuition behind policy gradients, how to implement them, and when to use them.
Okay. So the outline of an online reinforcement learning algorithm is going to look something like this. First, we need to initialize our policy in some way: when you train a neural network, you need to initialize its weights. We could initialize the weights of our policy randomly, we could initialize it with imitation learning, or we could use heuristics; it depends on the problem setting. After we've initialized our policy, we'll run the policy to collect some data, so we actually try out different things in the world. Then we'll use that batch of data to improve the policy.
And then we will repeat that process: we'll collect new data with our updated policy and update the policy again using that latest batch of data.
So we're going to attempt the task and try to get better. What this might look like is that in the first iteration we attempt some simple 2D navigation task. The reward for some trajectories is bad, the reward for others is okay, and after more iterations we might have a policy that's a bit better and able to get higher reward, but isn't perfect. Sometimes it gets higher reward, sometimes it still gets low reward. Then we'll use that data to improve the policy again, and after more iterations hopefully we'll get a policy that is able to succeed.
So this is the notion of online reinforcement learning, where we're collecting data online and improving using that data.
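As a rough sketch, this collect-and-improve loop is just a few lines of Python. The collect_fn and update_fn callables here are hypothetical placeholders standing in for "run the policy to gather a batch of rollouts" and "use that batch to improve the policy"; this only illustrates the loop structure and is not code from the lecture.

```python
# Minimal skeleton of the online RL loop described above.
# collect_fn and update_fn are hypothetical callables supplied by the user.
def online_rl_loop(policy, collect_fn, update_fn, num_iterations):
    for _ in range(num_iterations):
        batch = collect_fn(policy)          # 1. run the current policy, collect a batch of data
        policy = update_fn(policy, batch)   # 2. use that batch to improve the policy
    return policy                           # 3. the next iteration collects data with the updated policy
```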
So we talked about this reinforcement learning objective. Say you have some policy and you want to understand how good that policy is. To do that, we can't just look at the weights of the neural network; we actually have to run the policy. So to estimate this expectation, which is a function of the policy, we can run the policy multiple times, observe the resulting rewards, and compute the average over the observed rewards. That average is an estimate of how good our policy is. In other words, we approximate the expectation with a sum over samples from our policy.
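Here is a minimal sketch of that Monte Carlo estimate in Python. The toy one-dimensional "step forward or backward" environment and the particular policy are made up for illustration; the point is just that J(θ) is estimated by averaging the sum of rewards over sampled rollouts.

```python
import numpy as np

def rollout(policy, rng, horizon=10):
    """Run one trajectory in a toy 1-D environment and return its sum of rewards."""
    state = 0.0
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state, rng)   # sample a_t ~ pi(a | s_t)
        state = state + action        # toy dynamics: s_{t+1} = s_t + a_t
        total_reward += action        # toy reward: forward progress
    return total_reward

def estimate_objective(policy, num_trajectories=1000, seed=0):
    """Monte Carlo estimate of J(theta) = E[sum of rewards] by averaging sampled returns."""
    rng = np.random.default_rng(seed)
    returns = [rollout(policy, rng) for _ in range(num_trajectories)]
    return np.mean(returns)

# Example: a stochastic policy that steps forward with probability 0.7.
policy = lambda s, rng: 1.0 if rng.random() < 0.7 else -1.0
print(estimate_objective(policy))   # roughly 10 * (0.7 - 0.3) = 4.0
```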
Now let's start to think about how this trial-and-error process actually works. We collect some data and we improve our policy. Collecting the data is straightforward: we've talked about how to run a policy to get a rollout in the environment. We sample an action, observe the next state, run that through the policy again to get a new action, and repeat. But how do we actually improve the policy after we've collected that data?
So we have our objective: we have some policy π_θ, and our goal is to maximize the expected rewards when we sample from that policy. We can treat the trajectories sampled from the policy as being drawn from some distribution that I'll write as p_θ(τ), with a θ because the distribution over trajectories is a function of the policy you're using, and our goal is to maximize the sum of rewards. I'm going to use shorthand here: r(τ) just means the sum of the rewards along that trajectory. So our goal is to maximize E_{τ ~ p_θ(τ)}[r(τ)] with respect to our policy parameters θ.
Now, how do we actually optimize this? If we want to use gradient descent, because we typically train neural networks with gradient descent, we need to figure out the gradient of this objective with respect to θ. We'll refer to this term as J(θ); this is the objective we're trying to maximize.
It's a little bit challenging to get a gradient of this directly, because the dependence on θ is through the trajectories that we sampled. We can't just differentiate through that sampling process, and the sampling process also depends on the dynamics of the world, which we also don't know.
So if we're going to think about getting a gradient of J(θ), it's helpful to write the expectation as an integral: J(θ) = ∫ p_θ(τ) r(τ) dτ. This is just the definition of an expectation.
Now, because an integral is a summation, which is a linear operation, we can bring the gradient inside the integral. In particular, that gives us ∇_θ J(θ) = ∫ ∇_θ p_θ(τ) r(τ) dτ.
This doesn't directly help us, because we don't want to have to evaluate this integral in order to take a gradient. At this point there is a convenient trick we can use to put this into a more useful form. The trick is the following identity: consider p_θ(τ) ∇_θ log p_θ(τ). Does anyone remember what the derivative of a log is? One over its argument, right. So p_θ(τ) ∇_θ log p_θ(τ) = p_θ(τ) · (1 / p_θ(τ)) · ∇_θ p_θ(τ), where we used the chain rule to apply the gradient. The two p_θ(τ) terms cancel out, and this equals ∇_θ p_θ(τ).
The reason this equality is convenient is that we can replace the ∇_θ p_θ(τ) in the integral with p_θ(τ) ∇_θ log p_θ(τ). Making that substitution, we get ∇_θ J(θ) = ∫ p_θ(τ) ∇_θ log p_θ(τ) r(τ) dτ.
Now this is something useful, because we can represent the first factor as an expectation again: ∇_θ J(θ) = E_{τ ~ p_θ(τ)}[∇_θ log p_θ(τ) r(τ)].
So what we did there is use this log trick to pull a p_θ(τ) back out and get an expectation, and we're starting to get to something we can actually evaluate. In particular, this expectation is something we can estimate with samples, and θ now appears on the inside. We still have the ∇_θ log p_θ(τ) term, which doesn't yet look like π_θ(a|s), so we need to work on that term, but we're getting towards something we can evaluate.
Yeah, so the question is how this is more useful than the first expression, since we still have the sampling, and the issue with the first case was that we can't differentiate through the sampling. The thing that's different here is that θ appears inside the expectation. We're still not going to differentiate through the sampling itself, but we will be able to differentiate the term inside with respect to θ, and we can draw many samples and measure this quantity to get a pretty accurate estimate of the gradient in a way that actually depends on θ.
Okay, let's talk a little bit about this ∇_θ log p_θ(τ) term. And in case it's useful in terms of where we're at so far, the slides show it as well: we looked at J(θ), wrote it down as an integral, found the convenient identity that p_θ(τ) ∇_θ log p_θ(τ) = ∇_θ p_θ(τ), and used that to derive the expectation form. That's where we are now.
For ∇_θ log p_θ(τ), we can write out the trajectory distribution: p_θ(τ) = p(s_1) ∏_t π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t). Then when we take the log, as on the slide, the product becomes a sum: log p_θ(τ) = log p(s_1) + Σ_t [log π_θ(a_t | s_t) + log p(s_{t+1} | s_t, a_t)].
Then, once we take the gradient of log p_θ(τ) with respect to θ, note that θ doesn't affect the initial state distribution and it doesn't affect the dynamics in any way. So ∇_θ log p_θ(τ) is equal to just the gradient of the middle term: Σ_t ∇_θ log π_θ(a_t | s_t).
This is really useful, because it means we can plug this in, and it's something we can actually evaluate. The gradient of log π_θ is just a backward pass through the log of the output of our neural network. So we get a gradient that we can evaluate.
When we plug this in, the gradient of J(θ) is an expectation over trajectories sampled from our policy; inside that expectation we can bring the gradient inside the summation, so we have a sum over time steps of ∇_θ log π_θ(a_t | s_t), and then we still have r(τ), which we could also write as the summation Σ_t r(s_t, a_t):
∇_θ J(θ) = E_{τ ~ p_θ(τ)}[ (Σ_t ∇_θ log π_θ(a_t | s_t)) r(τ) ].
This is what's called the policy gradient, or the vanilla policy gradient. And this is a lot nicer than the version up there, because we can sample trajectories from our policy, compute their reward, also compute the gradient of log π_θ for each of the actions, multiply and sum those together, and get a gradient that we can apply to the weights of our neural network that actually leads to higher reward.
So again, on the slides: we wrote down the probability of a trajectory; if you take the log of that, you get a summation instead of a product, with a log of each of the individual terms. Then if we plug that in and take the gradient with respect to θ, the first term and the last term go away because they don't depend on θ, and we get this final expression when we plug it in.
So this is the gradient. When we run policy gradient, we'll collect data, evaluate this equation to represent the gradient with respect to the weights of the policy, apply that gradient step to our policy, and then repeat the process.
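As a minimal sketch of one such update in PyTorch, on placeholder data standing in for a collected batch: the shapes, the small network, and the random states, actions, and returns are all assumptions for illustration. The key line builds a scalar whose gradient is the return-weighted sum of log-probabilities, so one backward pass yields the policy gradient estimate.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
obs_dim, n_actions, N, T = 4, 3, 8, 20            # assumed sizes, for illustration

policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(), torch.nn.Linear(32, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states = torch.randn(N, T, obs_dim)               # s_t for each trajectory and time step (placeholder)
actions = torch.randint(0, n_actions, (N, T))     # a_t actually taken in the rollouts (placeholder)
traj_returns = torch.randn(N)                     # r(tau_i) = sum of rewards of trajectory i (placeholder)

logits = policy(states)                                      # forward pass: action distribution per (i, t)
log_probs = Categorical(logits=logits).log_prob(actions)     # log pi_theta(a_t | s_t), shape (N, T)

# grad J ~= (1/N) sum_i [ (sum_t grad log pi(a_t|s_t)) * r(tau_i) ]
# Built as a scalar to differentiate; negated because optimizers minimize.
loss = -(log_probs.sum(dim=1) * traj_returns).mean()

optimizer.zero_grad()
loss.backward()                                   # one backward pass gives the gradient estimate
optimizer.step()                                  # one gradient step on the policy weights
```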
So yeah, we also talked about this verbally, but in order to estimate this gradient, we'll run the policy to collect some data and then improve the policy using that data. In particular, you can think of the first summation as collecting N samples to estimate that expectation, and ideally N is large, because the higher N is, the better an estimate you'll get of the expectation. Then, once we evaluate this gradient that we're estimating with samples, we take a gradient step on our policy parameters.
So the full algorithm looks like: sample trajectories from our policy, evaluate an estimate of the gradient using those samples, apply that to the weights, and then repeat the process: collect a new batch of data, recompute the gradient, and improve the policy.
So the question is what it looks like to use the data to estimate this gradient, in particular the grad-log-pi term. When we're using the data to compute that term, we're getting a bunch of states and actions, N trajectories that in this case each have length T time steps, and we'll take the states in those trajectories and pass them through the policy as a forward pass to get a probability distribution over actions. We'll then evaluate the probability of the action in the data under the distribution that the policy output. So if it's a categorical distribution, we get the probability of that action, take the log of that probability, and then take the gradient, which is just a backward pass through the network after taking the log of the policy output for that action. This is called the REINFORCE algorithm, or the vanilla policy gradient algorithm.
Yeah, so this term is just r(τ); I'm using r(τ) as shorthand for the sum of rewards just because I didn't want to write the summation five times. And yes, exactly: this is an estimate, because we're only taking N samples to estimate the expectation. If you took infinitely many samples it would be exact, but of course taking infinite samples is not possible.
Yeah, so the question was whether we compute this gradient by hand. It's going to depend on the architecture, but you can use automatic differentiation: whatever library you use, PyTorch or otherwise, can compute that for you, so you don't have to write it out. Great.
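As a sketch of that per-step computation for a discrete-action policy (the tiny linear "policy" and the particular state and action are placeholders): forward pass to get a distribution, look up the action from the data, take the log of its probability, and let autodiff do the backward pass.

```python
import torch
from torch.distributions import Categorical

obs_dim, n_actions = 4, 3
policy = torch.nn.Linear(obs_dim, n_actions)      # tiny stand-in for a policy network

state = torch.randn(obs_dim)                      # s_t taken from the collected data (placeholder)
action_taken = torch.tensor(2)                    # a_t the policy actually took (placeholder)

logits = policy(state)                            # forward pass -> categorical logits
log_prob = Categorical(logits=logits).log_prob(action_taken)   # log pi_theta(a_t | s_t)
log_prob.backward()                               # backward pass: grad of log pi w.r.t. the weights
print(policy.weight.grad.shape)                   # autodiff computed this for you
```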
So let's talk about some of the intuition behind this. There's the math, but what does this gradient actually mean? Let's start by looking at the first term. If we think back to imitation learning, this first term is actually the same as the gradient of the imitation learning objective. In imitation learning we are trying to mimic actions from some demonstration dataset by maximizing the log probability, or minimizing the negative log probability, and if you take the gradient of that, it's similarly the gradient of the log probability. So you can think of this first term as saying: I'm going to try to increase the likelihood of the actions in the online data. The second term then weights that by the reward of the trajectory, by how good that trajectory was.
So you can think of this as the imitation gradient, but weighted by the sum of rewards seen in the data. Essentially, you're going to increase the likelihood of actions that had high reward and decrease the likelihood of actions that were taken in negative-reward trajectories. You're imitating your own data, the data collected by the policy, but depending on whether it's high reward or low reward, you might actually have a negative gradient. If the reward is negative, that pushes you away from the bad stuff and towards the good stuff.
Yes? Yeah, so this is inside the first summation; you can see the i comes up right here. We're summing over a bunch of trajectories, and this is all within the summation over N.
So the question is whether the difference between this and imitation learning is that in imitation learning you're either right or wrong, whereas here you're this amount of right or this amount of wrong. Well, in principle you can apply this in scenarios where the reward is very sparse, even where the reward is just one if you succeeded and zero otherwise. Now, this algorithm may not do very well in those scenarios, and we'll talk about that. One of the key differences with imitation learning is that you're doing this on your own rollouts: the policy is collecting its own data and then applying this objective to its own data. And the second difference is that we're actually using the reward, whereas in imitation learning you're just blindly imitating all the data, without any notion of how good the data is; it kind of assumes that it's good data.
Yeah. How do you sample the first state of the trajectory? That's typically defined by the problem setting. In some problem settings, like training an agent to play pinball, the initial state is just given by the pinball machine. In other cases, like language modeling, the initial state or observation might be the first question the user asks. So it's usually thought of as a property of the environment.
Yeah, so the question is whether you can use this to improve imitation learning by weighting the trajectories by reward. Yeah, you could definitely do that. We'll also talk about some of the good things and bad things about this objective in the coming slides.
Yeah, so the question was whether, in practice, it makes sense to start with imitation learning, run that for some number of iterations, and then switch to this. That can definitely make sense. If you start from a completely randomly initialized policy, it will do random stuff. If you have a very dense reward signal, that can be okay, but in other scenarios, if you have access to demonstrations, then initializing with demonstrations can be a very good fit, and it will allow you to improve on top of your initial policy.
So the question is, if you don't have a good conception of what's good and bad, of what the reward function should be, can you use imitation learning to get an initial sense for that? We'll talk about reward learning in a few lectures, and how you can potentially use expert data for that. In practice it's challenging.
Yeah. So we talked about how this is intuitively increasing likelihood on high-reward trajectories and decreasing likelihood on negative-reward trajectories. If we look at this visually: say we took a few trajectories, one with good reward and one with bad reward, and the red curve shows our initial distribution over trajectories. What this gradient is going to do is increase the likelihood on the green part, so you'll get something that looks like this, and then decrease the likelihood on the red part, so you'll get something that looks like this. It'll encourage the policy to take actions that are more like the high-reward trajectory.
In even briefer terms, you can think of this as trying to do more of the good stuff and less of the bad stuff that you observe.
Okay? And you can also think of this as a formalization of trial and error: you try different things, and if you see good stuff happening you'll try to do more of that, and if you make a mistake you'll try to stop doing that.
Cool. So now I have a question for you. Let's look at an example. Say we're trying to have a humanoid learn how to walk in simulation, and we define the reward function to be the forward velocity of the robot. This means the reward will be positive if it's moving forward, negative if it's moving backwards, and zero if it's not moving.
First we'll sample from our policy. Say we have a policy that's not very good at the beginning. In the first trajectory maybe it falls backwards. In the second trajectory maybe it falls forwards, which technically gives it positive reward, so the sum of rewards will be positive. Maybe in a third trajectory it manages to stand still, so the rewards are zero. In a fourth trajectory maybe it takes one small step forward and then falls backwards, so it gets positive rewards at the beginning of the trajectory and negative rewards later, and the sum of rewards is negative. And maybe in the last trajectory it takes one large step backwards and then a small step forwards, so it has negative rewards and then some positive rewards, but the sum of rewards is still negative.
Now, my question for you is: what will the gradient we defined encourage the policy to do if this is your dataset, if these are the five trajectories that you sampled? This is going to be one of our simpler examples, and we'll move to more complex examples in a couple more slides. I'd encourage you to pair up for a minute, discuss it briefly, and then we'll come back and share in one or two minutes.
Okay, let's come back. If you want to share your answer, can you raise your hand?
Okay, so the answer was that the gradient will encourage it to fall forwards and potentially stand still. Cool, that's mostly correct. If we think through each of these trajectories: the sum of rewards over the entire trajectory will be negative for trajectories one, four, and five, it should be about zero for three, and for two the sum should be positive. What this means is that you'll get negative gradients for any actions taken in one, four, and five, positive gradients for all the actions taken in two, and roughly zero gradients for the actions taken in trajectory three. So it will encourage the policy to fall forwards, and it'll discourage it from doing the things in one, four, and five: it will discourage it from taking steps backwards or forwards, and it will also discourage it from falling backwards.
Cool. So yeah, we encourage the policy to fall forwards and not take steps. This is distressing, because we don't want our policy to fall forward; we want it to take steps. And so this is an example of how the gradient we're getting is noisy and high variance. It means that depending on what these trajectories are, the gradient might vary drastically. If you did actually have some trajectories that were walking forwards, you might get a very different gradient estimate from that set of trajectories. And that's essentially the definition of having noise, or having high variance.
Yeah. So the question is why we're getting multiple trajectories from the same policy, like why not just get one? There are two reasons. One is that the policy is not deterministic, so if we want to explore different strategies, we will get different samples of different strategies because it's nondeterministic. And the second reason is that the environment might be stochastic, so even with a deterministic policy you might see different outcomes, and this will help you learn about those different outcomes.
Yeah, there's one thing we can do about this that directly improves the gradient in a fairly simple way. In particular, one observation we might make is that if you're in the middle of a trajectory and you take an action, that action has no influence on previous states. Right now the gradient is using the rewards at all states in the trajectory in the update, and it's even telling us to take actions later in the trajectory more or less often depending on whether the reward earlier in the trajectory was good or bad.
So if one of our trajectories was to take a large step backwards and then a small step forwards, we should actually be able to learn from that trajectory that taking a small step forward is good.
The way we can do that is, instead of having the reward sum run from t = 1 to T, we have the summation start at the current time step. Let's use a different letter: we sum over t' from t to T of r(s_t', a_t'). By having the weight on each action only look at future rewards, we're incorporating the fact that current actions can only affect the future, not the past. Oh yes, sorry, one other thing you have to do here, yes, good point, is to move this reward summation inside the summation over time steps, so that each ∇_θ log π_θ(a_t | s_t) gets its own sum of future rewards:
∇_θ J(θ) ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_t | s_t) ( Σ_{t'=t}^{T} r(s_t', a_t') ).
Why can we do this? The reason is that our action at, say, time t = 100 is only going to influence rewards in the future; it isn't going to have any impact on states in the past. It's completely independent of s_50, or s_51, or s_99. Because of that independence, we can drop the earlier rewards from the term for that action: when increasing the likelihood of an action at time t, we only use the future rewards that that action can influence.
It's a little bit hard to write out mathematically why you can do that; I can think about whether there's a nice mathematical expression for it. But it's really about causality in the world: policy behavior at time t is not going to influence rewards at earlier time steps. Cool. And I think this equation is right; I don't have a typo in this one. So yeah, we're bringing the reward summation inside the other one, and it's only a function of the future time steps.
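A short sketch of that reward-to-go computation, with placeholder rewards: the helper computes, for each time step, the sum of rewards from that step onward, which is the quantity that now multiplies each ∇_θ log π_θ(a_t | s_t).

```python
import numpy as np

def reward_to_go(rewards):
    """rewards: array of r(s_t, a_t) for t = 1..T. Returns, for each t, sum_{t' >= t} r(s_t', a_t')."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Example: a large step backwards (-1.0) then a small step forwards (+0.3).
print(reward_to_go(np.array([-1.0, 0.3])))   # [-0.7, 0.3]
# The whole-trajectory return (-0.7) would penalize both actions, but reward-to-go
# still gives the later small step forward a positive weight (+0.3).
```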
And this means that for this example, we will be decreasing the likelihood of actions at the beginning of the trajectory, but the future summations become positive once the robot takes a step forward, so you'll actually start increasing the likelihood of taking a small step forwards.
Yeah, so a very good point: it still doesn't help you if you take a step forward and then take a step backwards. The reason that case is a little bit difficult to handle is that it's possible the step forwards actually caused you to fall backwards, or caused you to take a step backwards. You don't really know. A big part of reinforcement learning is figuring out what influence actions have on the future, and you need to learn that from experience. If you take a step forwards, from the algorithm's point of view it could actually have caused something bad to happen in the future, and so we can't drop those rewards.
So the question was: if you had a lot of really bad stuff at the beginning and then took a good action later in the trajectory, would you have enough time to accumulate reward for it? In your particular example, the horizon was T = 10 and you did something really good at time step six. At time step six, the term will look like ∇_θ log π_θ(a_6 | s_6) times the sum over t' from 6 to 10 of r(s_t', a_t'), so if you got a high reward from that action immediately or in the next four time steps, then you would increase the likelihood of that action.
Now, it's really going to depend a lot on the reward function. You're saying that for fixed-horizon problems, the gradient will be different if you're summing over ten rewards versus over four rewards. One thing that's important if you have a fixed time horizon T is that for this to be a proper state, for the problem to be truly Markovian, the state needs to contain information about the future dynamics. If the horizon is fixed at 10 and the policy doesn't know the horizon is about to end, then it's not a truly Markovian state; you need to give the policy information about how many time steps it has left to succeed, and you can incorporate that into the state. So you would actually have different states in those different circumstances.
Okay. So I have another example for you, and this actually comes back to the question that came up earlier, although it's slightly different. Say we're now using the version of the gradient that only depends on future rewards, and we have the same problem, but our policy is a little bit better. One of the trajectories falls forward, one slowly stumbles forward, one walks forward, and one runs forward. My question for you is: what will the gradient encourage the policy to do in this case, using the updated version? So I'll encourage you to pair up again and discuss this, and then we'll come back in about two minutes.
Okay, let's come back. Please raise your hand if you want to share your answer.
Okay, yeah. So the first answer was that it's going to get the highest reward for the behavior that's moving forward at the highest velocity, so it will encourage that the most, and you were saying that even if it makes any sort of forward progress, it'll still encourage that. Yeah. Another hypothesis was that, going back to the previous example, it might encourage it to fall forward and then get back up again, and repeat, especially if falling forward is easy. Now, getting up again is actually a little bit more challenging, so that might also be tricky. And another thought was that the fourth trajectory might dominate if it has a much higher reward than the others, whereas if they're more even in terms of velocity and the reward is just about moving forward, it won't necessarily dominate and you'll do all four.
Yeah. So all of the rewards in these cases will be positive. I think what a lot of people are getting at is that the reward of the last trajectory will be the highest, so it will be encouraged the most. But because trajectories one, two, and three also have positive rewards, the policy is still going to be encouraged to do those things as well. Because the weight on those terms is non-negative when the reward is positive, you're still increasing the likelihood of the actions in those trajectories. Now, it won't do them as frequently as running forward, but it will still be encouraged to do them, and that might be a little concerning, because some of the time it will actually be encouraging the policy to stumble forward, and ideally we'd want a policy that runs as fast as possible.
Yeah. So the question was whether this is because of a local minimum: with this reward function the best outcome is to run forward, so why are we getting this kind of behavior from the objective? The reason is that the gradient is noisy; it comes down to variance. If you do have a policy that runs forward, it will encourage it to keep running forward, but when you have other data in there, that other data will influence the update in some ways. It won't necessarily converge to a local optimum, because at this point it will encourage running forward, and then the next batch of data will run forward more, which will encourage it to run forward even more. Eventually it will converge to the global optimum, but it will get there more slowly, because the gradient has a lot of variance depending on the samples that you select.
Will this encourage the robot to roll forward? With this data, it will not encourage it to roll forward, unless some weird combination of gradients causes that. Generally it will not. If you had data of rolling forward, then it would encourage rolling forward.
The question is, can the agent learn from a subset of the trajectory instead of the whole trajectory? In this case, it will learn from all future steps in the trajectory, so that is a subset, but it might not be the subset you want, and in the next lecture we'll talk about more sophisticated, smarter ways to estimate this.
Cool. So the thing we saw here is that things are sensitive to the scaling of the reward function. And on the flip side, if everything were negative, as came up in the question before, we would only be getting negative gradients for our policy and no positive gradients at all. That's a big problem with this kind of approach: it depends heavily on the scale of the reward. And there's actually a very simple thing we can do to fix this.
In particular, when we're using this gradient (I'll use the first version, E_{τ ~ p_θ(τ)}[∇_θ log p_θ(τ) r(τ)], just to make it a bit shorter), instead of weighting by r(τ), we can subtract a constant b, where that constant could be the average reward, and weight by (r(τ) - b). If we subtract the average reward, things that are better than average will get a positive weight and increase in likelihood, and things that are below average will get a negative weight and decrease in likelihood.
Now, you might be wondering whether we can just add this b here. Did I just make that up? Why are we allowed to do that? Well, we can look at the effect this has on the gradient, and we can show that subtracting a constant has, in expectation, zero effect on the gradient. In particular, the extra term this introduces into ∇_θ J(θ) is E_{τ ~ p_θ(τ)}[∇_θ log p_θ(τ) · b], and it's possible to show that this is equal to zero.
In particular, if we write this out as an integral, the expectation is ∫ p_θ(τ) ∇_θ log p_θ(τ) b dτ. We can then use the same trick that we used before, but in reverse: this is equal to ∫ ∇_θ p_θ(τ) b dτ. I'm a little bit out of space, but the gradient can be moved outside the integral, because integration is a linear operation, and we can also move b outside the integral because it's a constant. So this is equal to b ∇_θ ∫ p_θ(τ) dτ. Now, that integral is just equal to one, the gradient of one is zero, and b · 0 = 0.
So this means that, in expectation, this is going to have zero effect on the gradient. To write that out a little more cleanly: the effect the baseline has on the gradient is E[∇_θ log p_θ(τ) · b]; we write that as an integral, use the log trick in reverse, move b and the gradient outside of the integral, note that the integral is equal to one and the gradient of one is zero, and we get zero. So this means we can add this baseline.
Yeah, so the question was that this constant term wouldn't be helpful if we could truly evaluate the expectation, and yeah, that's correct. If we could measure this expectation exactly, we wouldn't need it. The reason it's helpful is that we're estimating the expectation with samples, and you can think of the baseline as a way to reduce the variance of that estimate, and it's an unbiased way to do that. In particular, as long as the baseline is a constant, it won't change the behavior of the gradient in expectation. That's what unbiased means; it's a term from statistics about the effect on random variables. And you can show that it reduces the variance of your estimate of the gradient.
Yeah, so the question is whether b is really a constant; it seems to depend on the trajectories. We'll use the same value of b for an entire batch, for all of the examples: we'll collect a batch of data, compute b, and then use that same b for all of the updates on that batch of data.
So how does it reduce the variance of the gradient? You can write down the variance of the gradient estimate analytically and show that it reduces it. I don't want to go through the math, because we have some other stuff to get through, but intuitively you can think about it in terms of this example: it will encourage more of the above-average stuff and less of the below-average stuff.
I could also mention that average reward is actually not the optimal baseline. You can derive the variance, take its derivative, and set it equal to zero to find the optimal baseline. But average reward is pretty good; the optimal baseline is very close to average reward, with some additional weighting terms.
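As an illustration of those two claims, here is a small numerical check on a toy one-step problem of my own construction, not from the lecture: a single-parameter policy with π_θ(a=1) = sigmoid(θ), reward 1 for action 1 and 0 otherwise. Subtracting the average-reward baseline leaves the average gradient essentially unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5
p1 = 1.0 / (1.0 + np.exp(-theta))                  # probability of taking action 1

actions = (rng.random(200_000) < p1).astype(float) # sampled actions
rewards = actions                                   # r = 1 for action 1, else 0
grad_log_pi = np.where(actions == 1, 1 - p1, -p1)   # d/dtheta log pi(a) for each sample

baseline = rewards.mean()                           # average reward over the batch
g_no_baseline = grad_log_pi * rewards
g_with_baseline = grad_log_pi * (rewards - baseline)

print(g_no_baseline.mean(), g_with_baseline.mean())   # nearly identical: the estimate stays unbiased
print(g_no_baseline.var(), g_with_baseline.var())     # the variance is smaller with the baseline
```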
Cool. So we have one more episode of "what does the gradient do?", and we're going to look at a different example. In this case, the robot is folding a jacket, and the reward is one if it's neatly folded, 0.5 if it's folded with some wrinkles, and zero if it's not folded. We're again going to sample from our policy. We have some samples of not touching the jacket, folding only the sleeves but not folding the full thing, flattening the jacket but not folding it, and one trajectory where it does fold the jacket. My question is: with the baseline in here this time, what does the gradient look like?
The rewards for trajectories one, two, and three will be zero, and the reward for four will be positive. So, because of our baseline, the gradient for four will be positive and we'll encourage that behavior, and we'll discourage the behavior in one, two, and three. In some ways this is good, because it means we are encouraging the good behavior, but there are also some things that are a little bit dissatisfying about it.
The first is that the gradient for one, two, and three is going to be identical for all of them; there isn't any differentiation between them, and some of these are actually better than others. This is in some ways a result of the reward function we're using, which is more sparse than it could be. The other thing is that we'll see some techniques in Friday's lecture that would be able to leverage the good parts of these trajectories: there is something good about folding the sleeves, and it would be nice to use that data to encourage the policy, whereas right now that trajectory will actually be discouraged and get negative gradients. So in some ways this is good, it will encourage folding, but the gradient is constant for trajectories one, two, and three, and on Friday we'll see ways to make this a little bit better.
One of the takeaways here is that policy gradient is still noisy and still high variance. It's really going to work best if you have dense rewards rather than sparse rewards, and it'll work best with large batches of data, because if you hadn't actually sampled this fourth trajectory and had only sampled failures, you would have zero gradient and you wouldn't be able to improve.
So the question was: can we just improve the reward function? Yes, a denser reward function would definitely help in these cases. There are still scenarios where we could do better. For example, if the robot folded the sleeves, got positive reward for that, and then dropped the jacket on the ground afterwards, that's still dissatisfying, because we wouldn't make use of the successful first part. We'll actually see how to get a good gradient for that first part on Friday. But yeah, a denser reward function for this particular set of trajectories would work a lot better.
Oh, so the question here is: when we use the baseline, are we also using the trick of only using future rewards? You can combine them. You need to think a little bit more carefully about what the baseline should be, because depending on where you are in the trajectory you'll be summing over fewer or more rewards, but you can definitely combine them.
Yeah, so the question was: if we're using average reward as the baseline, do we need to keep updating it, using data from the current policy? Yes, you'll want to keep updating it, because as the policy gets better, what counts as above average and below average will change, and you'll want to recompute it on your latest batch of data.
Cool. So, one more thing on this version of the algorithm, and then we'll move on to one last version. In terms of how to implement it: we derived a gradient that looks like this, and to actually implement it in practice, you could in principle run a backward pass for each of these terms individually. But if you computed them all individually, you'd have to do N × T backward passes, which would be rather inefficient. It would be nice if we could just do automatic differentiation on the full objective, with one forward pass and one backward pass.
And we can do that. It's possible to implement a surrogate objective whose gradient is the same as this gradient. I'm calling it a surrogate because it's not the same as the very first objective of maximizing expected rewards, but you can write down a surrogate objective for this particular estimator, and this is what it looks like on the slide. Note that I actually have a typo in these equations: the second sum should be inside the first sum. That's a small note; I'll see if I can fix it. But you can implement a surrogate objective and then use backprop and autodiff software to do just one backward pass.
Okay, and what this looks like in practice is a cross-entropy loss for a discrete-action policy, or something like squared error for a Gaussian policy. You could also use more sophisticated policy classes, like diffusion policies, as well.
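Here is a sketch of such a surrogate loss in PyTorch, combining the reward-to-go weights with a simple batch-mean baseline (a crude choice here, per the earlier Q&A), on placeholder data. The gradient of this scalar with respect to the policy weights is the policy gradient estimate, obtained with a single backward pass.

```python
import torch
from torch.distributions import Categorical

N, T, obs_dim, n_actions = 8, 20, 4, 3            # assumed sizes, for illustration
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, n_actions))

states = torch.randn(N, T, obs_dim)               # placeholder rollout data
actions = torch.randint(0, n_actions, (N, T))
rewards = torch.randn(N, T)

# Reward-to-go for every (trajectory, time step), then subtract a batch-average baseline.
rtg = torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])
weights = (rtg - rtg.mean()).detach()             # treated as data, not differentiated through

# Surrogate objective: each log pi(a_t|s_t) weighted by its own (reward-to-go - baseline),
# i.e. the second sum sits inside the first, as noted about the slide.
log_probs = Categorical(logits=policy(states)).log_prob(actions)   # shape (N, T)
surrogate_loss = -(log_probs * weights).mean()

surrogate_loss.backward()    # a single backward pass yields the whole gradient estimate
```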
Okay. So, summary so far: we showed how to estimate the gradient of the RL objective using this log-gradient trick. This corresponds to weighting the policy likelihood by future rewards, and we found that subtracting a baseline, like the average reward, will further improve the gradient. Even with these two things, the gradient is still noisy, so it works best with large batches and dense rewards. This is also our first online reinforcement learning algorithm, where we iterate between collecting a batch of data with our policy and improving our policy, and it formalizes this trial-and-error learning process.
And the intuition for everything we've done so far is: do more of the high-reward stuff and less of the low-reward stuff.
Okay. Now, the last thing we'll talk about today is what else is troublesome about policy gradients, and one thing we haven't really talked about much is the details of the sampling process. This is what the gradient in our algorithm looks like, and note that this gradient assumes we're sampling from π_θ, and we change θ in step three. This means that every single time we take a single gradient step, we need to recollect data to estimate our gradient. That's kind of sad, because usually we take thousands and thousands of gradient steps on neural networks, so that seems like a problem.
This property is called being on-policy: it means the gradient we're measuring requires data from our exact current policy. We can think of this as an attribute of an online RL algorithm, where on-policy means it only uses data from the current policy, and off-policy means you can reuse data from past policies.
So, can we derive an off-policy version of the policy gradient? It's only going to be a little off-policy: by that I mean we're just going to try to be able to take more than one gradient step on a single batch of data.
And to do this we can use importance sampling. Importance sampling is actually a really cool technique that's useful outside of reinforcement learning as well. It's about the scenario where you want to estimate a quantity, an expectation, without actually sampling from that distribution. Say you want to estimate the expectation of some function f(x) under samples from p(x), but you only have the ability to sample from q(x). This q is usually called your proposal distribution; it's able to propose samples to use. In our setting, q is like your old policy: we have samples from our old policy, but we want to estimate a quantity under our new policy after we've taken a gradient step.
What you can do in this case is write the expectation as an integral, E_{x~p}[f(x)] = ∫ p(x) f(x) dx, and as long as q(x) has reasonable support, we can multiply in a term q(x)/q(x) that equals one. Then we can write this as an expectation with respect to q: E_{x~p}[f(x)] = E_{x~q}[ (p(x)/q(x)) f(x) ].
So this means we can estimate the expectation by sampling from q and weighting the quantity by p over q. And that's pretty cool: it means we can, in principle, use samples from our old policy q, weight them by the probability of the sample under our new policy divided by its probability under the old policy, times the quantity we care about.
So we showed this for the simple case of p(x) and q(x). In terms of actually applying this to our policy: our objective is an expectation of rewards under p_θ(τ). If we instead have some previous policy that we've collected samples from, we can measure our objective using the samples from that old policy, where we weight the reward by the probability of the trajectory under the new policy divided by its probability under the old policy.
Now note that this requires your proposal distribution to have non-zero support wherever your new policy has probability. This is actually very important for it to work: if q(x) is zero there, the ratio is not well defined. But that's really the only catch. And in general, this importance sampling technique is going to work best when p and q, or when your old policy and your new policy, are very similar to each other.
Now, there are some details. We've shown this in terms of p_θ(τ). How do we actually do this in terms of π_θ(a|s)? That gets a little bit trickier. We can write down p_θ(τ), and this ratio, as the product of the different probabilities: the initial state term, the policy terms, and the dynamics terms. Fortunately, the initial state and dynamics probabilities cancel out in the ratio. But when we put the remaining probabilities into our update, we get a product over time steps of these policy probability ratios.
Now, unfortunately, this product can become very small or very large if the length of the trajectory is large, because you're multiplying many, many numbers together. If you have a hundred numbers that you're multiplying together and they're above one, the product might get very large; if they're below one, it might go to zero very quickly. So in practice we often consider expectations over time steps instead of over trajectories. That has a little bit of a hitch too, which is that this needs to be over state-action pairs rather than actions given states, but in practice we often approximate that state-distribution ratio as equal to one. This is incorrect, but it's kind of the best we can do, and then we get a final form with a single ratio of new policy over old policy, π_θ_new(a_t|s_t) / π_θ_old(a_t|s_t), for each time step.
Cool. The full algorithm looks the same, except now we can take multiple gradient steps on the same batch. This is good. And then we have this new form for our gradient.
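A sketch of what taking multiple gradient steps on one batch looks like in PyTorch, with placeholder data. The log-probabilities under the data-collecting policy are frozen once per batch, and each update weights the precomputed reward-to-go-minus-baseline terms (assumed given here) by the ratio of new to old policy probabilities; no KL constraint or other safeguard is included, since that is a topic for the next lecture.

```python
import torch
from torch.distributions import Categorical

N, T, obs_dim, n_actions = 8, 20, 4, 3
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(N * T, obs_dim)              # flattened (s_t, a_t) pairs from the rollouts (placeholder)
actions = torch.randint(0, n_actions, (N * T,))
weights = torch.randn(N * T)                      # reward-to-go minus baseline per step, treated as data

# Log-probs under the policy that collected the data, frozen for the whole batch.
with torch.no_grad():
    old_log_probs = Categorical(logits=policy(states)).log_prob(actions)

for _ in range(10):                               # multiple gradient steps on the same batch
    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)  # pi_new(a_t|s_t) / pi_old(a_t|s_t)
    loss = -(ratio * weights).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```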
Okay. We'll talk a little bit about some tricks another time; we don't have time to cover KL constraints today. But to review: we talked about online RL with policy gradients, which is an on-policy algorithm; we talked about baselines for reducing variance; and we just derived this off-policy policy gradient, which uses importance sampling and allows us to apply multiple gradient steps per batch. The intuition is to do more of the high-reward stuff and less of the low-reward stuff, and the gradient is still very noisy and works best with large batches and dense rewards. Cool, and that's it for today. Next time we'll talk about actor-critic methods, which will build closely on what we talked about today, including PPO.