Introduction to Reinforcement Learning (Julius Rückin)
By Cyrill Stachniss
Summary
## Key takeaways

- **AlphaGo Masters Go from Scratch**: A Google AI beat the world champion at Go in 2016 using reinforcement learning, learning completely from scratch to play at a superhuman level despite the game's complexity requiring human intuition. [01:47], [02:31]
- **Quadruped Climbs via Sim RL**: An ETH Zurich group trained a quadruped robot with reinforcement learning in simulation to climb, jump, and crouch over diverse obstacles in massively parallel fashion, achieving complex behavior without preprogramming. [03:34], [04:24]
- **Drone Outraces Human Champion**: The University of Zurich developed a reinforcement learning algorithm that races a drone as fast as or better than the drone racing world champion, with the AI drone showing superior reaction time and complex racing behavior. [04:35], [05:15]
- **RL Agent Maximizes Reward Sum**: The goal in reinforcement learning is to find a policy that maximizes the expected sum of rewards over a sequence of actions in a Markov decision process, optimizing long-term behavior rather than acting greedily. [09:25], [09:57]
- **Policy Gradient Enables Model-Free**: Policy gradient algorithms directly optimize parameterized policies such as neural networks without needing a dynamics model, handling continuous state and action spaces by computing gradients of expected rewards. [53:44], [54:38]
- **Actor-Critic Cuts Variance**: Actor-critic algorithms use a critic to estimate state values as a baseline, reducing the high variance of policy gradient estimates via the advantage function and enabling training of complex behaviors like robot control. [01:04:04], [01:13:56]
Topics Covered
- RL Masters Go Without Human Intuition
- Rewards Enable Supervision-Free Control
- Value Iteration Beats Policy Iteration
- Policy Gradients Ditch Dynamics Models
- Actor-Critic Cuts Variance Conquers Reality
Full Transcript
hey everyone my name is Julius and today I was asked to give you a brief introduction about reinforcement learning also in the context of this self-driving cars lecture so we will see
also robots today using reinforcement learning this is meant to be an introduction I will stress that because you will soon realize reinforcement learning is quite a broad topic you
can not only use it for self-driving cars you can use it for plenty of problems to solve them and there are many different algorithms with pros and cons we will
talk about some of them in a bit and if you don't fully grasp every detail that's completely fine I want you to get an overview of the different classes of algorithms today and maybe what are the
advantages what are the disadvantages how to different algorithms address them so if you do have questions in between feel free to ask we can also do it afterwards whatever you like just raise
your hand and then I can try to re explain something if it's unclear so if you're interested after that lecture to delve deeper into that topic
I can really recommend you the lecture series by David Silver from Google DeepMind he talks in detail about reinforcement learning I think it's like 10 lectures where he goes into the
details of all the algorithms that we'll see today but also of more than that so that's a really nice source if you're a visual learner and
listening to lectures is something you like I can recommend going there and if you want to have a textbook and look at the math in a bit more detail the go-to book
although it's quite a few years old is still Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto so let's start with a few examples
of what RL is used for today so maybe some of you saw the headlines in 2016 that a Google AI actually has beaten the
world champion I guess in Go did some of you see that like did you yes perfect so this is actually a computer program that is trained with reinforcement
learning to play the board game go and it is so significant because maybe some of you do play go and know how it roughly works I only roughly know how it works but I got told it's quite complex
you have many board configurations and it's believed that you need some human intuition component to play this game well and still we can come up with the reinforcement learning algorithm that
learns completely from scratch how to play this game and actually at a level where it beats one of the best players in that game it's not limited to board games we
can since a while already also play chess we can also do it with reinforcement learning now we can play video games Atari games for example if
you know them and in both chess and board games and go and also Atari games actually with reinforcement learning we are able to learn to play games better
than the best human players in the world but reinforcement learning is not limited only to games we can actually do useful things with it we can
look into controlling nuclear power plants with it we can look into Financial Market decisions with it and maybe the most interesting part for you in this lecture series is we can learn
to control robots with reinforcement learning and today I want to show you two small examples in the beginning in a video the first one probably you've seen before is such a
quadruped robot and it's a work from ETH Zurich from the group of Marco Hutter and they actually taught with a reinforcement learning algorithm the
robot to climb these boxes so how they do it is they actually build a simulation of different obstacles and the robot tries from scratch to move
around these obstacles and above the obstacles in a massively parallel fashion and by training in such a diverse fashion on such diverse obstacles what we
can actually achieve is quite impressive complex climbing Behavior with that robot and this is fully learned in simulation with reinforcement learning
we can jump we can climb down I think in a bit you will also see that yes it can crouch below objects this is all quite complex robot behavior that is not
preprogrammed for example like what you would do with classical P control this is all learned and another really nice example
recent example of reinforcement learning and robotics is drone racing at the University of Zurich in the group of Professor Scaramuzza they actually developed a reinforcement learning
algorithm that learns to race a drone as fast or better than the actual drone racing champion at the moment and racing this drone you will
see in a bit the blue drone is the AI controlled drone and the red drone I think is the human drone and the blue drone has of course
better reaction time but even afterwards can learn such complex racing behavior from reinforcement learning that it actually out competes a human world champion in drone
racing and what you will see today subsequently are RL algorithms that form the basis for both of these works so at the end we will see an algorithm that is the basis with some
state-of-the-art adaptations to it for both of these works why is it worth talking about reinforcement learning because reinforcement learning is not just
another machine learning Paradigm you may have heard about how to train neural networks to do perception task like detecting objects in your self-driving car lecture for example detecting cars
or pedestrians but there someone gave you a collected data set and this data set is annotated by human annotators so someone
told you explicitly in a supervised fashion this is a car this is a pedestrian here in our examples we don't have this we don't have a pre-recorded
data set instead what we do have is in general what we call an agent this agent could be the robot that you saw before and it takes decisions which is what we
call taking an action the action could be motor commands to for example do the drone racing this means that we don't need
necessarily a supervisor the only thing that you do need to Define is a reward and you can reward for example your drone in the racing scenario by
successfully passing through a gate you can also reward it by the lap time and say shorter lap times are the intended result so I don't supervise
each of the motor commands but only say what's the intended outcome by defining such a reward function this is quite powerful but this has also some challenges worth talking
about in this lecture so by just providing this reward function you do have delayed feedback in the classical perception use case we have immediate feedback for your neural network did I
detect the car or not here my drone has to execute quite a lot of motor commands until I can actually evaluate how fast it completed the lap or
not so this is generally then also a harder task to learn and we do make decisions sequentially as I said before so we do one motor command after another
just by the nature of the problem this means my actions that I do take actually dictate or determine what is the next image for example that I do
see with my camera on the drone and this means also that the usual assumption that we do have in perception that we have a data set that is independently
and identically distributed is not true here if I am a robot standing in this room and I do have a camera with such a field of view looking at the wall or turning to the
left actually gives me a picture of the wall looking to the right most likely doesn't give me a picture of the wall but of the window so I do have correlated observations which makes
learning also harder so let's take one step back and formally Define the reinforcement learning problem again as said we do
have an agent the agent takes a decision in general in an environment it could be a race track it could be a robot deployed in this room it doesn't matter this environment delivers a next
observation for example in our example from before it could be an observation from a camera giving me back the picture of the window and I do have a reward signal that defines the intended
behavior for example if my goal would be with a deployed robot in this room to find my laptop then I get a reward of plus one if I do get to my laptop I
explore this room and as soon as I find the laptop I do get the reward so I do need to make decisions sequentially this is why it's a circle basically for each
action I do get back an observation from the environment and the reward of how good it was to execute this action and the goal is now to
maximize the expected sum of these rewards so for each step each action that I take as My Little Robot here in the room I do get a reward that can be
zero if I don't find the laptop and at that one point it's a one so what I do want to do is not only maximizing the immediate reward for one action but I
always want to maximize the sum of rewards over a sequence of actions this means I don't act greedily but I try to optimize actually the
long-term behavior of my agent and one thing that we need to formalize is a so-called Markov decision process the Markov decision process is the
underlying mathematical framework of all reinforcement learning problems and it consists of the following components that we mentioned already and now define in a mathematical way so we do
have the state space which we denote usually as this curly S and the state space is the set of all states that my environment can be in for example in
this room and my navigation task as a robot the robot can be in any position in this environment defining the state further we have an action space that
defines which actions can I execute in my navigation case of the robot it could be going forward going to the right or turning
actions then I do have as we saw before the reward function telling me if that was a good action to execute or not this reward function depends on the
action that I took in a given state s and on which state I end up in for example if I'm standing here I take an action that directly turns towards
the laptop in our example that's a good action if I would turn away from it that's a bad action and in which state I end up is actually defined by something that we
call a state transition function based on the current state of the environment and the action that I chose I do get a probability for all
possible next states that I could end up in meaning if I stand here and I execute the action of going forward for example I end up in this next
state with probability one because I don't have actuation noise but for example if you think about the drone the drone motor might break down or you have slip in a real robot then you might also
end up in a next state that you didn't actually intend to end up in with an action so we do have uncertainty about the outcomes of the actions that we execute
defined by the environment model and this model is called a state transition function we have two important assumptions here for our problem the first is
when we always talk about a state of the environment we mean that all the relevant features in my environment that I need to know about to solve the problem are actually fully known or how
we call it fully observable this means to get to my laptop in an efficient fashion and to model this as an MDP I actually need to know where I am and if
you think about your localization lectures and mobile robotics that you took before maybe this is not always true you don't always know with certainty where your robot is but you can say for some places it's more likely
or less likely so this is important to keep in mind in the basic setup we always assume that all the information is observable in the environment
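Collecting the ingredients named so far, the MDP can be written compactly as a tuple; the notation below is a common convention and may differ slightly from the lecture's slides:

```latex
\text{MDP} = (\mathcal{S}, \mathcal{A}, R, p), \qquad
R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}, \qquad
p(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)
```

Here \(\mathcal{S}\) is the state space, \(\mathcal{A}\) the action space, \(R\) the reward function, and \(p\) the state transition function.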
and as a second assumption that is encoded in our environment model is something that we call the Markov property you might have heard about that during your occupancy mapping lecture in
occupancy mapping we do say for a given map state and the received observation the given map state fully determines together with the observation the next map state
right what we do here is quite similar we do say our current state of the environment together with the action a that we chose fully determines How likely it is to end
up in the next state this means I don't need to remember all the previous States and actions that I chose but I just need to be able to observe my current state to
fully say how the environment will develop so going one step back we need one more ingredient to fully define our problem I always talked about the robot or the agent that takes
actions but what actually determines which action we take this is what we call a policy function a policy function is internal
to the agent and it maps for a given state to an action that I want to execute so for example my policy if it's a rather good policy and my navigation
example would be okay I directly turn to the left to see the laptop it could be an arbitrary policy taking random steps that are not ideal towards the window it doesn't matter but it's some function
that determines my behavior and important to also note is it doesn't have to be a deterministic function so for a state like usually what we talked
about now you end up in a new state taking one particular action but I do not only have problems where it's good to choose
deterministically based on the state my next action sometimes it's good to have a so-called stochastic policy which means I act with some stochasticity in my
behavior for example if you think about rock paper scissors if you play this against an opponent and you always play the same move in the same state that your game is in your opponent will be
hopefully smart enough to figure this out and you don't get the optimum result of winning this game after a few rounds so it's much better to act with a bit of randomness in there depending on the state and this is what we call a
stochastic policy and we will see later where it comes in handy the mathematical formulation of the whole problem that we talked about is exactly this it looks a bit clunky but
it makes sense to go through the mathematical notation once because it will reoccur in the algorithms that we will see that solve the RL
problem so what you see on the right is the reward function as before depending on the current state s at time t the action that we take in that state and
the next state that we end up in and we do have something that we call a discount Factor the discount factor is valuing how important it is for me to
get an immediate High reward versus how important it is to get a long-term High sum of rewards so if my discount factor
is going towards zero essentially what it says is I care more about my immediate rewards than I care about later rewards but as we saw before there are problems where you also might want
to have behavior that acts long-term optimally so you can actually rate the importance of future delayed rewards by setting the discount Factor appropriately and then what we actually
want to have is a behavior or what we call a policy that is optimal the optimal policy is something we call pi star from now on and the optimal policy
is out of any behavior that I might execute the behavior that maximizes the expected sum of rewards the sum of
rewards we saw before what it is and the expectation is important the expectation is important because you have multiple sources of Randomness we talked about
the state transition function defining how likely it is when executing an action in a certain state to end up in the next state so there is stochasticity in the
next state that we end up in we do have as we said before most likely stochastic policies so when following our policy that is stochastic we sample
from a probability distribution so we might end up with different likely or unlikely actions and a side technicality is that your problem has to start somewhere I have to deploy or
spawn my robot for example somewhere in this room in our previous example and sometimes it might be more likely that it spawns near the laptop
sometimes it will spawn further away but where it spawns is essentially defined by this initial state distribution that we call
mu and to facilitate our lives a bit we now do one trick and we refine the notation a bit and instead of having the expectation over these three things we
have an expectation over what we call episodes and an episode is exactly what we talked about before it's an initial state an action
that I execute in the state we end up in the next state we again choose an action and so on and so on until we are at the terminal state after n steps in our
episode and the probability that this episode occurs is exactly given by the individual terms that you saw before it's given by the initial state distribution times the likelihood or the
probability that we execute an action in the state times ending up in the next state given the executed action and the state that we were in before
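In symbols, the episode (trajectory) probability just assembled and the resulting objective read roughly as follows, with \(\mu\) the initial state distribution and \(n\) the episode length:

```latex
p(\tau) = \mu(s_0) \prod_{t=0}^{n-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
\qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\tau \sim p(\tau)}\!\left[ \sum_{t=0}^{n-1} \gamma^{t} R(s_t, a_t, s_{t+1}) \right]
```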
this way now it doesn't look so scary anymore the formula we just have an expectation over all the episodes that we might experience and we want to
maximize the expected sum of rewards for all these episodes to make this a bit clear how such a simple uh problem could look like I want
to show you one example of a maze the maze on the right has a start position it has a gold position and we want to find the shortest path to enforce that we give a reward of minus one each time
step that we execute an action so to maximize that we are forced to find the shortest path and you can figure it out manually
by hand that for example standing here at the start and going one to the right to the minus 15 the number of steps that you need to take to
reach that goal is 15 the shortest path that you can take from that position onwards is exactly 15 steps this means if someone would give
you this maze with the numbers for example let's say I'm at the minus 15 what would you now intuitively do to decide if I go up or
down who of you knows if I'm at the minus 15 how do I decide if I go up or down based on this what is the uh you know lesser
number so like exactly exactly so you go to the minus 14 because you want to maximize the expected sum of rewards which is given by this
number one detail that is still important to realize we talked about the expected uh sum of rewards so for example in a slightly different problem
where I'm standing here the desk is a lake that I could fall into as a robot and my goal is exactly on the opposite side of this Lake I can go around the
lake by executing a path that is very nearby the lake but if I do have actuation noise for example that lets me sometimes slip into the lake I might
never reach the goal so what you also need to consider again is the dynamics model of your environment so in that case it could be better to have some safety distance because then in
expectation on average you reach the goal faster because going into the lake never gives you a goal-reaching behavior and the nice thing is exactly
what you said how to do it we can derive from such uh information which we call the state value function a policy and as you said the best action for example
being in the minus 15 correctly is going up that we can do for all the states that we might be in and we derive a policy and actually it turns out that we
can arbitrarily go from such a state value function deriving a policy and from a policy back to a state value function so what we will see soon is
that it's actually equivalent of giving you a value function from that you can derive a policy or giving you a policy from that you can derive a value function and this
is an important equivalence that we will make use of in algorithms that actually will later determine our optimal Behavior so just to make it clear again
what we said before the value function is exactly this we want to maximize the sum of rewards in expectation following
our given policy pi so the episode tau actually is what we experience when we execute the policy that is given by
pi and we make use of one identity that comes in handy later to define the algorithms to find Optimal policies
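Written out, the state value function and the identity about to be used (splitting off the k = 0 term of the discounted sum) are approximately:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s \right]
= \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)} \big[ R(s, a, s') + \gamma\, V^{\pi}(s') \big]
```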
if you look at the case for k equal to zero the discount factor term gamma to the zero equals one and you have the immediate reward of being in a state s I execute an action
with respect to my policy and end up in the next state so I get immediate feedback for just the next action and then I can look up from the
next state that I end up in again my value function at this next state which defines from there on the remaining sum of rewards so the state value function
of a state is exactly determined by the immediate reward that you can get by executing the policy plus the next state's value
function and similarly we have something that we call an action value function or also sometimes named a Q function for a given policy and it's similar to the state value function but there's one
difference instead of following the policy of my agent I instead force my agent now to execute one very specific
action a in the current state so instead of following my policy I execute this action a for one time step I observe the immediate reward and from the next time
step onwards as defined here I do again ask my state value function when I follow my policy from the next time step what would be the value of doing
that we will see later why both of these functions are important so just to make it clear how do we establish this
equivalence between policy and value function the first thing that we answer is how do we actually derive the policy what you just did before from the value
function and what we do is actually we can Loop over all states for example all positions in my Maze and I can check for
all actions that I might want to take in a current state what is actually the immediate reward that I get plus if I do
have the value function what is the state value of being in the next state that I end up in and this is averaged over our Dynamics model because we need to
account for the noise in our environment and then what I do is exactly what you said before I take the action that
maximizes this expected Q value in the end and this is rather complicated because of one small detail to integrate
out over all the possible states that I end up in I need to have this dynamics model or the state transition function and it turns out that it's often much easier if someone would have given you
the Q function instead of a state value function because then I can just loop for a given state over all possible actions that I take and then take the action with the
maximum Q value the other way around we can do it similarly so going from a policy to a state value
function is also important because if you think about it maybe as an agent you somehow have an initial behavior that you might want to execute but you do want to know how good is this behavior
in each individual State and this is exactly given by the state value function so it's important to talk about how to evaluate basically the quality of your policy this is what we call a
policy evaluation and what we do is we iteratively start from a random guess how our state value function might look
like in terms of the maze for example if I don't have any prior knowledge I insert just zeros for all the states and
what I do then is I iterate again over all states I choose in the state all possible actions a analyze How likely is
it given my policy that I take this action and then integrate over all possible next States How likely it is to end up there and if I do that for a
concrete pair of State action and next state that I end up in I can get a reward plus my next state value function and it's important to realize the state
value function in the beginning is an initial guess so I don't really know how good it is to be in that next state but we can
compute this reward the immediate one so at least for this one time step I do get a concrete estimate of how good it was to execute this time step so when I now
back up my value for this state S I do have a slightly better estimate of V so in the next iteration I do have a better
estimate for my next state V Prime so I can again collect an intermediate reward plus now having a bit of a better
estimate for the state value function at the next position and this way I converge to the value function for a given
policy and this operation is called a Bellman expectation equation because for a given policy we average over the distribution
that the policy induces and we can iteratively apply these Bellman equations to in this case converge to the state
value function for a given policy and now one thing that is still unclear because we talked about how to go from a state value function to a
policy and from a policy to a value function is how do I actually end up with the best policy with the optimal one right because at the moment the only thing that I can do is going from some
policy to the corresponding value function and back but this doesn't mean that the policy is actually inducing the optimal behavior and we need one more identity
to actually get to an algorithm that allows us to do that so if you remember the optimal policy was denoted as pi
star and actually as it turns out finding the policy pi star that maximizes the expected sum of rewards is exactly
the same as finding what we call an optimal state value function V star because for an optimal policy to be optimal it has to be the
state value function that is maximizing my sum of expected rewards and then I can exploit that I can literally transition between policy
and state value function as we saw before so instead of finding the policy directly because this is kind of not handy we can also directly find the
optimal State value function and if we do find the optimal State value function we can use our previous algorithm to go back to the optimal policy similarly we can do that for Q
functions to find the optimal policy we can also find an optimal Q function because as we said before sometimes it's hard to go from a state value function to a policy because you need to know the
environment dynamics but for the Q function to get the optimal policy I can just loop over all actions a and choose the action with the highest Q value so I
don't necessarily need to know my state transition function today we talk about an algorithm that finds me the optimal state value function so we are not talking
about finding optimal Q value functions we do see how to generate the first algorithm now that finds me the optimal State value function and does
find me the optimal policy and to do that it's pretty much similar to what you saw for policy evaluation you don't know the optimal State value function so
you need to start from an initial guess and you can now again iteratively refine your guess how do you
do that well I start looping over all states and for a single state s I try out all possible actions
and they lead to some new state s prime and for this new state that I end up in I can again compute my immediate
reward of ending up there plus asking my not really accurate guess of an optimal value function V how good is it to end up in the next
state so again my initial guess for the optimal state value function at what happens being in the next state and proceeding with my policy is wrong for sure unless
someone by accident gave you the one already that's correct but what I can do again is I can evaluate the immediate
reward so again I need to integrate over the uncertainty in my environment model given here by ending up in the next state S
Prime and then instead of averaging over the policy that we were given before as you may remember this is not what we do anymore but instead what we
do now is exactly what you said before we want to take the action that leads us to the maximum right and to which maximum to
the maximum Q value here for a given state and a given action a so instead of following a policy we now try to enforce choosing the next best action the next best
action in the sense that we maximize the immediate reward that we can compute plus some initial estimate of what happens after but this way we can
correct our initial guess of V for the next state s prime and actually you can show that this iteratively converges to the optimal value function so for policy
evaluation we averaged over our stochastic policy for value iteration finding the optimal value function we solve a maximization problem
here and this is again a Bellman equation because it's an iterative process but this time it's the Bellman optimality equation because we want to enforce
optimal behavior and as I said this estimate V converges to our optimal State value
function V Star from the previous slide if you have this you also directly get the optimal policy by applying the previous algorithm
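As a concrete illustration, here is a minimal sketch of this value iteration loop with greedy policy extraction on a hypothetical six-state corridor, a stripped-down stand-in for the maze example; the state layout, names, and toy dynamics are illustrative assumptions, not taken from the lecture:

```python
# Minimal value iteration on a hypothetical 6-state corridor (a stand-in for
# the maze example). States 0..5, goal at state 5 (absorbing). Actions: -1
# (left) and +1 (right). Reward is -1 per step, so maximizing the sum of
# rewards means finding the shortest path.

N_STATES = 6
GOAL = N_STATES - 1
ACTIONS = (-1, +1)
GAMMA = 1.0  # undiscounted shortest-path problem

def step_model(s, a):
    """Deterministic dynamics model p(s' | s, a): move and clamp to the corridor."""
    if s == GOAL:
        return s, 0.0                      # absorbing goal: no further cost
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, -1.0                    # every step costs -1

def value_iteration(tol=1e-9):
    """Bellman optimality backup: V(s) <- max_a [ r + gamma * V(s') ]."""
    V = [0.0] * N_STATES                   # initial guess, as in the lecture
    while True:
        delta = 0.0
        for s in range(N_STATES):
            best = max(r + GAMMA * V[s_next]
                       for s_next, r in (step_model(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                    # values stopped changing: converged
            return V

def extract_policy(V):
    """Greedy policy: for each state take the action with the highest Q value."""
    policy = []
    for s in range(N_STATES):
        q = {a: r + GAMMA * V[s_next]
             for a in ACTIONS
             for s_next, r in [step_model(s, a)]}
        policy.append(max(q, key=q.get))
    return policy

V_star = value_iteration()
print(V_star)                  # negated shortest-path lengths to the goal
print(extract_policy(V_star))  # greedy actions derived from V*
```

With the per-step reward of minus one, the converged values count the negated distance to the goal, mirroring the minus 15 logic discussed for the maze; swapping the max over actions for an average under a fixed policy turns the same loop into the policy evaluation described earlier.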
again so let's look at this algorithm one more time because this looks like okay now everything is set right now we do have an algorithm that finds me the
optimal state value function so I end up also with an optimal policy but there are actually good reasons why this algorithm does not solve everything in
reinforcement learning for example if you look at this for loop we loop over all states s for the maze example it's kind of intuitive right you go over each
position that you might end up in and you execute this for loop but what would you do for example in the drone racing for like the continuum of all the poses
that your drone can be in if you have a continuous state space it's not so straightforward to loop over this anymore and another thing is for example we always talked about discrete actions
there's a finite number of actions we can make one step forward backward left right in our maze but what is this algorithm doing if we have continuous
actions for example continuous motor commands in the Drone racing case again then we have another optimization problem in
here that is not so straightforward to solve so the basic form of this value iteration algorithm only works well for discrete finite action spaces and this
sorry State spaces and discrete finite action spaces but for these if you do know the Dynamics model we can relatively
efficiently find our optimal policy so let's summarize until now what we learned how to find a policy so how
to find a policy is what we solve by finding actually so-called optimal State value functions and we saw in the algorithm that there are some
limitations we saw for example that you need to know if I go back this Dynamics model to evaluate this whole formula this Dynamics model in
real world is not always straight forwardly given right because do do you know how the environment completely behaves in the super complex drone racing part or in this quadrupled leg
Parkour part I don't so there are definitely real world instances where it's not so handy to assume that you do have knowledge about your environment to that extent that you can define a
probability distribution like that we also quickly teaser that to some extent if you would have the optimal action value function this solves the
problem because instead of Performing policy evaluation averaging or integrating over our state transition function what we can actually do is looping over all actions that I can
execute and take the action that leads to the maximum Q value we don't talk about that in detail it's definitely talked about in the literature that I showed you in the beginning and if you
want to look at up these algorithms the first one that is quite uh basic and uh popular is called sarar so Sr S A RSA and the second one that you might
have heard of is Q learning and both of these actually try to achieve the same as value iteration but not finding an optimal State value but finding an optimal action value and this way you
can get rid of this assumption we do see another solution in a bit how to get rid of this assumption but first for completeness we said value iteration and
also by the way Q learning and such only work for discrete State spaces in their rudimental form and only work for discreete Action spaces but as we saw in
the robotics case quite often you have continuous action spaces and you do have continuous State spaces and partially all these problems
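To make the Q-learning idea just mentioned concrete, here is an illustrative sketch (not code from the lecture) of the tabular update applied to a small, made-up batch of experienced transitions. Note that it never touches the dynamics model T:

```python
# Tabular Q-learning on a tiny, hypothetical 3-state chain (illustrative only).
# Each experienced transition is a tuple (s, a, r, s_next); the update needs
# no dynamics model T(s' | s, a), only experience.
ALPHA, GAMMA = 0.5, 0.9
STATES, ACTIONS = [0, 1, 2], ["stay", "go"]   # state 2 is terminal

experience = [
    (0, "go", 0.0, 1), (1, "go", 1.0, 2),      # the rewarding path
    (0, "stay", 0.0, 0), (1, "stay", 0.0, 1),  # the unrewarding alternatives
]

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for _ in range(200):  # replay the experience until the estimates settle
    for s, a, r, s2 in experience:
        best_next = max(Q[(s2, b)] for b in ACTIONS)  # max over next actions
        # Q-learning update: move Q(s,a) toward r + gamma * max_b Q(s', b)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in [0, 1]}
print(greedy)  # both non-terminal states end up preferring "go"
```

The loops over the Q-table and over `ACTIONS` are exactly why the basic form only works for discrete, finite spaces.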
Partially, all of these problems are due to our initial idea that we might solve the problem of finding an optimal policy not by finding the policy itself, but by finding the corresponding optimal state value function; because, if you remember, we had to loop over the states to compute that optimal state value function. So what we want to do next is look at another class of reinforcement learning algorithms. The class we saw so far, because it finds some kind of value function, is called value-based reinforcement learning; the next class that we will see performs direct policy optimization, and what that means we will talk about in a bit. If there are no direct questions, I would move on to how to define and optimize your policy directly. Do you have any questions up to here? Are the limitations following from this algorithm clear?

[Student] When we optimize the state value function, do we also get the optimal action value function?

That's a really good question. So the question is: can such an algorithm also deliver us the optimal action value function, the optimal Q function? This exact value iteration, as written down here, can only find you the optimal state value function. But the motivation carries over, because we saw the identity that the Q function is basically the immediate reward of executing a specific action, one that is not necessarily drawn from your policy, plus the state value function from the next state onwards. So a similar trick works: both SARSA and Q-learning, the two algorithms to remember if you want exactly what you asked for, exploit the identity that I can evaluate my immediate reward plus an initially wrong guess of what will happen from the next state onwards. In different forms, that is the main trick in all of these algorithms. Any other questions regarding this algorithm or the equivalence between policy and value functions?

[Student] How do you choose the value of the gamma?

Yeah, the gamma, the discount factor, is something you can choose as a user. What the gamma expresses is this: if it's close to zero or set to zero, it basically means that whatever happens after my immediate next state doesn't matter for my optimal behavior. If it's set to one or close to one, it means it's very important to me how well I'm doing after executing this next step. So setting gamma to zero leads to something very greedy, where I optimize the short-term reward of my behavior, whereas setting gamma close to one expresses the idea that in many scenarios I want a long-term planning strategy that leads to better behavior overall. A simple example could be a navigation problem like in Google Maps: say I want to go to Cologne by train. One option is to go to the station, jump on the regional train, and I'm there; the second option is to go to a stop nearby, get on a bus, and then take the tram to Cologne. We know that the tram takes much longer, but we also know that the regional train is quite often delayed, right? So the greedy strategy, taking the route that looks immediately faster towards Cologne, is not necessarily the best, because it doesn't account for the waiting time when the regional train is delayed. It could be overall better to take the longer tram route with less waiting time in between. How far-sighted you want to be is what you design with this gamma. And it's important to note that for different gamma you can have different state value functions, of course: with changing gamma, the behavior that is supposedly optimal under that gamma changes, and the closer it is to zero, the more short-sighted that behavior is. How to set it, of course, depends on the actual problem you want to solve.
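To illustrate this effect with made-up numbers: compare a "greedy" option that pays off immediately against a "far-sighted" option that pays more but later. Which one has the higher discounted return flips with gamma:

```python
# Discounted return G = sum_t gamma^t * r_t for two hypothetical reward sequences.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

greedy_option = [1.0, 0.0, 0.0, 0.0]      # small reward right away
farsighted_option = [0.0, 0.0, 0.0, 2.0]  # bigger reward, but delayed

for gamma in (0.1, 0.99):
    g = discounted_return(greedy_option, gamma)
    f = discounted_return(farsighted_option, gamma)
    print(f"gamma={gamma}: greedy={g:.3f}, far-sighted={f:.3f}")
# With gamma=0.1 the immediate reward wins; with gamma=0.99 the delayed
# reward wins, i.e. the supposedly optimal behavior changes with gamma.
```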
Okay, maybe in the interest of time we move on to what I already teased: the second class of algorithms, which finds a policy directly. The big question is how we can actually do that, because, if you think about it, our problem is finding the maximizing policy within the set of all possible policy functions. What does that mean? It's kind of hard to grasp. Before, we exploited the fact that we can basically build a lookup table: for each position in the maze we assign an action, and this defines our policy function, right? But it's not so clear what this optimization over a set of functions means in general.

We can do a trick that you might remember from your perception lecture for self-driving cars, where you want to train a neural network that takes in, for example, an image and outputs where the human is. It's not so easy to define that mapping directly, right? What we do instead is place a function approximator in the middle, namely the neural network, which is parameterized by a set of parameters theta, and this set of parameters theta defines the function that maps from the input, meaning the images, to the output: where your obstacles or object detections, for example pedestrians, are. We can do the same here: we can handily define the policy function by parameterizing it. So instead of building this clunky lookup table as we did before, we now use any form of parameterizable function that maps from a given state s to a desired output, the action a.

We write it in such a general form because for our algorithms it really doesn't matter too much what this network or function looks like. It can be a linear function consisting of just two parameters, slope and intercept; it can be an arbitrarily complex neural network. And this gives you the first advantage of such function approximators: a neural network or a linear regression can process continuous state spaces, continuous inputs, and nowadays also quite arbitrarily complex inputs. Your state doesn't need to be an entry in a lookup table over a discrete set of states; your state could be the continuous pose of the drone, your state could be a graph, your state could be an image. It really doesn't matter too much; you just need the right function approximator. Nowadays we speak of deep reinforcement learning whenever we use an expressive neural network for these function approximators, which then gives us the policy.

The second advantage is that we can now directly optimize this function: the function is completely given by its parameters, for example the parameters of a neural network, and we want to find the parameters that maximize the expected sum of rewards, which is exactly this. So we went from a rather clunky optimization problem to an optimization problem that you actually know, again from perception. If you think about it, your object detection task is finding a parameter theta such that the neural network, for a given image, minimizes the detection error; there you want to minimize a loss function, and here we instead want to maximize the expected sum of rewards. It's pretty similar in nature. And as I said, this can handle continuous state spaces and continuous action spaces; in the easiest case, think about linear regression: you have a continuous variable as input and a continuous output, so the action could now also be continuous. And, as we said, it allows us to directly optimize the parameters, meaning it allows us to directly optimize our policy: by tweaking the parameters of your function, you tweak the behavior of the agent.
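As a minimal, made-up illustration of such a parameterized policy: a softmax over a linear function of a continuous state. The parameters theta fully define the mapping from state to action probabilities, so tweaking theta tweaks the behavior (the weights and state values below are arbitrary):

```python
import math

# A hypothetical stochastic policy pi_theta(a|s): continuous 2-D state in,
# probabilities over two discrete actions out. theta[a] is one weight row
# per action; these parameters completely define the policy.
def policy(theta, state):
    scores = [sum(w * x for w, x in zip(theta[a], state)) for a in range(len(theta))]
    z = [math.exp(s - max(scores)) for s in scores]  # numerically stable softmax
    return [v / sum(z) for v in z]

theta = [[1.0, -0.5],   # weights for action 0
         [-1.0, 0.5]]   # weights for action 1
state = [0.3, 2.0]      # a continuous state: no lookup table needed

probs = policy(theta, state)
print(probs)  # probabilities sum to 1; changing theta changes the behavior
```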
And how do we tweak this? From a neural network lecture you may have heard before, you know that we like to optimize such parameters with gradient descent. So what we ask for is: find me the gradient of this function on the right; then we use gradient descent, or rather gradient ascent, because we want to maximize a function instead of minimizing a loss, and then we can run any gradient-based optimization algorithm to find the parameter theta that maximizes my function. It could be a Newton method; it could be stochastic gradient descent in the simple case.

But there's one problem: to do gradient ascent, or any gradient-based optimization, you need the gradient of the function you want to maximize with respect to these parameters. And the function we want to maximize is the expectation, over episodes, of the sum of rewards. What does it mean to compute the gradient of this expected sum of rewards? We derive this; it's called the policy gradient theorem. It's a bit involved, so stay with me. We won't go through every detail, but it's important to see some steps, because they reveal further advantages of direct policy optimization.

The first thing we realize, after applying something called the score function estimator trick (maybe some of you know it from statistical learning classes on maximum likelihood estimation; if you want the details, I link below a nice lecture from Cambridge that explains it), is that this is nothing specific to reinforcement learning; it comes up all the time in machine learning when you want to maximize the expectation of a function. The key idea to take away, without the full mathematical derivation, is that for the gradient of an expectation we can swap the expectation and the gradient sign. Why? Because the expectation is an integral over a probability times, in this case, our return R; that's how an expectation is defined, and by the Leibniz integral rule you can exchange the gradient and the integral. By pulling the gradient inside the integral it ends up inside the expectation, and after a bit more manipulation what we get inside is the gradient of the log-probability of an episode.

Why this is important becomes clear in the next step. This thing evaluates to this equation, and to see why, we need to understand what the logarithm of the probability of an episode actually is. You saw in the beginning, if you remember, that an episode is just the sequence of state, action, reward, next state, and so on, and its probability is just the product of the individual parts: the initial state distribution, then how likely it is to choose an action in the current state s0, how likely it is to end up in the next state s1 of our episode, and so on. The logarithm applied to this product is the sum of the logarithms, right? So the log-probability of the episode is the sum of the log-probabilities of these individual parts. And of this whole sum we want to take the gradient; the gradient of a sum of terms is the sum of the gradients of the individual terms, because both are linear operators.

So we did two small mathematical tricks: the logarithm of a product is the sum of the individual log-probabilities, and the gradient of a sum is the sum of the gradients. What you can then ask yourself is: for each of the individual terms in this product, what is the gradient of its log-probability? It turns out you need to ask which of these terms actually depends on my parameter theta, because when a term doesn't depend on theta, its gradient is zero and it drops out of the sum. Maybe you see it already: which of these terms depends on theta? Think about what we use theta for. Here it's quite explicit: theta parameterizes our policy, so only the policy itself depends on theta, right? Just being in some state doesn't depend on theta, and my dynamics model T doesn't depend on theta either, since the action it conditions on is given. This means I can drop the initial state distribution mu, something we anyway did not clearly define, and I can drop my dynamics model, the state transition function. The only thing I end up with is the sum of gradients of the log-probabilities of my policy.
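Collecting the steps just described (swap gradient and expectation via the score function trick, expand the log-probability of an episode, drop the terms that do not depend on theta, and use that rewards before time t do not depend on the action at time t), the resulting policy gradient can be written as follows; this is the standard statement, condensed here rather than derived in full:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \sum_{t=0}^{T}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \underbrace{\sum_{t'=t}^{T} r_{t'}}_{\approx\, Q^{\pi_\theta}(s_t,\, a_t)} \right]
```

Only the policy term survives inside the gradient; the initial state distribution and the dynamics model have dropped out.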
And this is nice, because now we get the effect we wanted before: we can derive a policy-gradient-based algorithm that does not require knowing in detail how the environment behaves. Instead, we can just execute actions without knowing the environment model. Algorithms that do not require knowledge of the state transition function, the dynamics model, are what we call model-free RL algorithms; whenever an algorithm relies on knowledge of exactly this term T from before, it's a model-based algorithm. So the nice thing is that with the derivation of this policy gradient we end up with a model-free algorithm; we don't need full knowledge of our environment.

The term here on the right is just the sum of rewards that I get for an individual episode tau from the current time step t on: I am at some time step t while executing my episode tau, and the remainder of the rewards I get from there on is exactly this term. It turns out this is exactly our Q function for the policy we followed, because when we execute an episode we experience a certain action and from then on follow our policy again; this is just the definition you saw before.

And that's it: from here we can derive a policy optimization algorithm, a policy-gradient-based algorithm, that directly optimizes our policy. We start with any behavior, any policy, which means we start with a randomly initialized parameter vector theta. We somehow need to accumulate experience, episodes tau, to actually compute this quantity over here. So what we do is basically what you saw in the first video: remember the quadruped robot, where they zoomed out in their simulation and you saw many robots doing their thing on their individual obstacles? That is exactly this step. You start with some guess of how the policy might work, you collect a lot of data with it, and you get these episodes of state, action, next state, reward, and so on.

Based on this experience, gathered in simulation for example, you can now compute exactly the policy gradient we saw before. You don't explicitly evaluate the expectation; instead, because you sampled a lot of episodes tau, you can simply compute the sum of experienced rewards for each episode. And the gradient of the log-probability of your policy pi is given by whatever differentiation tool you use: for example, if your policy is parameterized by a neural network, you can define it in PyTorch, and the automatic differentiation there, the backpropagation algorithm, derives these gradients for you. So this we know how to compute, say by simulating a lot of quadruped climbing exercises, and then we average over all the episodes we experienced; we call this a Monte Carlo estimate of the policy gradient.

Once we have the gradient, we can take one step in its direction, because the gradient points into the region of policy space where policies most likely achieve a higher expected sum of rewards; that is exactly how the gradient was defined. So I do, in this case, very simple gradient ascent and end up with a slightly better policy. Then I need to repeat: I only take a small step, since a gradient is always just a local approximation, right, you cannot jump directly to the globally optimal policy. So we repeat the whole procedure: we again start our massively parallel simulation, for example in the quadruped case, collect a lot of experience, again compute the Monte Carlo gradient estimate from that experience, and iterate until we converge to a decently performing policy.
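A bare-bones sketch of this loop on a made-up two-armed bandit (one state, two actions), just to show the mechanics: sample actions with the current policy, weight each log-probability gradient by the experienced reward, and take a gradient-ascent step. All names and numbers are illustrative, not from the lecture:

```python
import math, random

random.seed(0)

# Softmax policy over two actions, parameterized by theta (one score per action).
def probs(theta):
    z = [math.exp(t - max(theta)) for t in theta]
    return [v / sum(z) for v in z]

def pull(action):
    # Hypothetical stochastic environment: action 1 pays more often on average.
    return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0

theta = [0.0, 0.0]   # random-ish initial guess: a uniform policy
ALPHA = 0.1

for _ in range(2000):                        # one "episode" = one pull here
    p = probs(theta)
    a = 0 if random.random() < p[0] else 1   # sample action from pi_theta
    r = pull(a)                              # experienced reward (the return)
    # Gradient of log softmax: d/d theta_k of log p[a] = 1{k==a} - p[k].
    for k in range(2):
        theta[k] += ALPHA * r * ((1.0 if k == a else 0.0) - p[k])

print(probs(theta))  # the better arm (action 1) should end up preferred
```

Even on this trivial problem the estimate is noisy, which is exactly the variance issue discussed next.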
But there's one problem: if you think about it, this quantity here, the sum of rewards from time step t on, has quite high variance. Let me make one illustrative example with our navigation task again. I am the robot, wandering around the room until I end up facing my laptop; I execute many actions until I finally find it, and I get a plus one as a reward. Now I can experience almost the same sequence of actions, identical up to one step before the end, but this time, because of stochasticity in the environment or in the policy, I don't turn towards the laptop, I turn away from it. So I get a completely different reward, although I executed almost the same action sequence and experienced a very similar sequence of states. This is why you sometimes observe that this classical algorithm for policy-gradient-based optimization, called REINFORCE, is not very sample efficient, as we say: it needs a lot of gathered experience, in simulation or in the real world, to converge to a stable estimate of the policy gradient.

We can do better with one last class of algorithms that I'll show you today, called actor-critic algorithms, whose goal is to reduce the variance of this estimate. But before I go into the details, I want to quickly ask whether there are questions about how policy-gradient-based optimization works in general, or specifically about the steps of these algorithms. If anything is unclear, feel free to ask now.

[Student] Generally, with gradient-based optimization we can end up in a local maximum and then cannot get away from it. How does this method handle that?

Yeah, that's a very good question. We basically execute first-order optimization methods here, right? We compute the gradient, which is only a local approximation of where to move to get a better-performing policy, and this algorithm by default is not guaranteed to end up in a global maximum. We cannot guarantee that; we have the same restrictions you might know from optimization lectures, and we end up in local optima. There are certain tricks to get around this to some extent, but as far as I know there is no mathematical proof that you reach the global optimum. Of course, if you can guarantee that the function you want to maximize is concave (or convex, in the minimization view), then you can use your knowledge from optimization theory and say: even with my gradient estimate, I will converge to the global optimum. But in general, your objective depends on the policy executing very complex maneuvers; in drone racing, for example, it's very hard to believe that the function you want to optimize is concave or convex or linear, anything that would give you convergence to a global optimum. Even worse, the main motivation in the beginning was to parameterize our policy with these parameters theta, and if I choose, say, a neural network with nonlinear activation functions, my policy itself is nonlinear. So there are quite a few places where this assumption of global optimality breaks. But we can at least guarantee that the REINFORCE algorithm, given enough experience, converges to a local optimum, and that is usually good enough to solve our task.

Do you have any other questions before we move on to the final round, our third class of methods to solve the RL
problem? Okay, and if questions come up in between, to reassure yourself that you understood everything so far, feel free to ask.

So, we talked about needing to gather quite a lot of experience, because the experience you gather even for very similar action sequences can end in different rewards, right? This is what I said at the beginning of the lecture: RL is hard to some extent because you have delayed feedback, and the actions you choose influence what you will see next. That is one reason why you have high variance in your policy gradient estimates: you vary in what you gather as experience.

So there is the class of actor-critic algorithms, which aims to do the same thing as the REINFORCE algorithm you saw before, but modifies it to reduce the variance of the policy gradient estimates. Remember our derivation of the policy gradient: using the gathered experience amounts to plugging in the Q function here. If we only use raw experience, we only learn about the experienced states and actions. But maybe we can do better: if I know that the state I am in has features similar to another state I haven't observed yet, and I have a function that can indicate that these features, seen in another state and action, led to a high Q value, then I want to transfer this knowledge. So one problem of the REINFORCE algorithm is that it only learns from the actual experienced data; perhaps instead you can learn features from that data that explain which states are good and which are bad. This is again similar to the perception idea: you show a neural network a finite set of images with annotations, and it learns which features to focus on to find the pedestrian or the car. Similarly, here you might hope for a function approximator for this Q function that learns which features of the input drive the Q value to be high or low.

So what we do, again, is plug in a function approximator for this Q value. And if you think about it, this Q value evaluates how good it is to take a certain action in the next step and then follow my policy, parameterized by theta, from the next step onwards; so it evaluates how well my current policy pi_theta is doing. This is what we call a critic, and this is why these are called actor-critic algorithms. The critic does not actively decide what to do next; it evaluates how good my policy, parameterized by theta, is. The actor part comes from the policy itself: the policy pi_theta determines which behavior is induced, since only the policy can actually execute an action and drive the behavior of the agent. That's why we also call the policy function approximator pi_theta the actor. So we have a policy function approximator as before, and we do policy-gradient-based updates, but we don't judge them only by the experienced rewards; we also use what the critic has learned about good and bad states. We do the trick: this is your policy gradient from before, and we just add a small but important piece of notation saying: this part is based on the experience I gathered, and this part is based on what I learned from that experience via my function approximator indicating good and bad states. You can think of it, similarly to our pi_theta (the theta is actually missing here... ah, it's over here, sorry, a change of notation), again as a neural network, or a linear regression, or anything else that gives you the critic function.

And there is a second trick to reduce variance. It is a bit mathematically involved, but intuitively: this quantity can be quite large in magnitude. You gather experience, you get a reward in one time step, in the next, and so on, so the magnitude of this Q value can be large, and a smaller magnitude in your critic estimate usually also means lower variance. So we want to find so-called baseline functions. Baseline functions exist solely for one purpose, reducing the variance, and, importantly (that's why I wrote "action-independent" here), a baseline function depends only on a state and maps to some real number. How they reduce variance is exactly this: instead of using the Q function, we use the Q function minus the baseline function. We remove some part of the magnitude of the Q function, but only the part that does not depend on the action, because a part that depended on the action would actually change our policy gradient.

Now, you saw a function earlier that also evaluates the quality of a policy, like the Q function does, but that does not depend on the action, only on the state. Can you remember what we called it in the first half of the lecture? We started with that function.

[Student] The state value function?

Right, the state value V. Perfect, exactly. By definition the state value function depends only on the state, so it is a good candidate for a baseline function that reduces variance. And, without going into the details, one can show that out of all possible baseline functions, the state value function is the one that reduces the variance of my policy gradient estimates the most.
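A quick, self-contained illustration of the baseline idea (a made-up setup, not from the lecture): subtracting an action-independent constant from the reward leaves the score-function estimator's mean unchanged but can shrink its variance dramatically, especially when rewards carry a large common offset:

```python
import math, random, statistics

random.seed(1)

# One-state softmax policy over two actions; rewards are noisy with a big offset.
theta = [0.2, -0.2]
z = [math.exp(t) for t in theta]
p = [v / sum(z) for v in z]

def grad_log_pi(a):  # d/d theta_0 of log pi(a): 1{a==0} - p[0]
    return (1.0 if a == 0 else 0.0) - p[0]

def reward(a):       # large common offset: exactly where a baseline helps
    return 10.0 + (0.5 if a == 1 else 0.0) + random.gauss(0.0, 0.1)

baseline = 10.25     # any action-independent constant is a valid baseline

plain, with_b = [], []
for _ in range(20000):
    a = 0 if random.random() < p[0] else 1
    r = reward(a)
    plain.append(r * grad_log_pi(a))                 # vanilla estimator
    with_b.append((r - baseline) * grad_log_pi(a))   # baseline-corrected

# Nearly the same mean (the true gradient), far smaller variance:
print(statistics.mean(plain), statistics.mean(with_b))
print(statistics.variance(plain), statistics.variance(with_b))
```

Here the baseline is a hand-picked constant; the lecture's point is that the state value function is the best such action-independent choice.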
And we use exactly that: for the generic baseline function we now plug in our state value function, which we saw at the beginning of the lecture. So my critic changed to the Q value of a certain state and action minus the state value of being in that state. We'll talk about the meaning of this in a second, but first let's go back to the definition of our policy gradient. It's what we plugged in before, right: first it was a Q value, now it's the Q value minus a baseline function. We can expand this to get rid of the Q value: the Q value of a certain state when executing a particular action is exactly the reward I would receive plus the state value of acting according to my policy from the next state on. This is how the Q value was defined at the beginning: we asked what the immediate reward would be for executing an action a that does not necessarily follow my policy, and what, from the next state onwards, following my policy again, the expected sum of rewards would be. This way we get rid of two functions and work with just one, the state value function. And we can do one more trick: we parameterize the state value function inside the critic by parameters psi; again it can be any neural network or linear regression that aims to learn which features of the state drive a high or low state value.

Let's talk a bit about what this formula actually means, because that's important. In the literature you will find it as the so-called advantage function. The advantage function, for a given policy pi, always depends on the current state you are in and an action you want to choose, and it is the difference between the Q value and the state value that you saw before. We do one trick and approximate the advantage function with our critic, parameterized by psi. And then we are basically asking: how good is it to execute a particular action a_t, receive an immediate reward, and then follow our policy, versus directly following our policy in the current state? Note that this a_t does not necessarily stem from the policy we follow, while the state value V does. So we are asking: would choosing a_t in state s_t improve what my policy is currently doing, or would it decrease the quality of my policy? Going back, the advantage therefore scales the policy gradient here by the improvement, or drop, in quality when choosing such an action. So instead of evaluating the Q value directly, we now evaluate the advantage of the experienced action a_t in the current state s_t.
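With made-up numbers, the one-step advantage estimate just described simply asks: did the experienced action leave us better off than the critic expected? A positive value means the action beat the current policy's expectation (all critic values below are hypothetical):

```python
GAMMA = 0.99

def advantage(r, v_s, v_next, gamma=GAMMA):
    """One-step advantage estimate: A(s, a) ~ r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

# Hypothetical critic outputs for one experienced transition (s, a, r, s'):
V_s, V_next, r = 2.0, 2.5, 0.1
adv_example = advantage(r, V_s, V_next)
print(adv_example)  # 0.575 > 0: action a looks better than the policy's average
```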
deriving everything that you need to know now about the RL algorithm that was used to train the uh robots that you saw in the beginning the Drone and also the
quadrupled robot and we call it here vanilla ector critic algorithm because it's now basically plugging in into the reinforce algorithm theide idea of having these
parameterized critic functions. We start, as in REINFORCE, with a random initial guess of our policy parameterized by theta, and we parameterize the critic by psi. Then, until convergence, we again run an iterative scheme like in the REINFORCE algorithm. First, we execute the massively parallel simulation that you saw in the beginning to gather data: state, action, reward, next state, and so on. How do we do that? We follow our current stochastic policy parameterized by theta, so we sample an action from this policy, and we see how good executing this action actually was by accumulating rewards over time, similar to, or actually the same as, the REINFORCE algorithm. Then we again, as in the REINFORCE algorithm, compute the policy gradient, but this time we don't directly use the experience that we gathered in this step. Instead, we use our critic function to evaluate the advantage of choosing the experienced action a_t in the current state s_t, and this is computed by evaluating the state value function: the immediate reward that we experienced, plus the state value estimate given by our critic in the next state, minus the state value estimate of our critic in the current state. And now there's one thing missing.
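As a small self-contained sketch of the advantage estimate just described (the toy numbers and the lookup-table critic `value` are illustrative assumptions, not from the lecture):

```python
# One-step advantage estimate used by the critic:
#   A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t)
# `value` stands in for the parameterized critic V_psi; here it is just
# a toy lookup table so the snippet is self-contained.

gamma = 0.99  # discount factor (assumed)

value = {"s0": 1.0, "s1": 2.0}  # toy critic: state -> state-value estimate

def advantage(reward, state, next_state):
    """Critic-based advantage estimate for one experienced transition."""
    return reward + gamma * value[next_state] - value[state]

# Transition: in s0 we took some action, got reward 0.5, landed in s1.
adv = advantage(0.5, "s0", "s1")  # 0.5 + 0.99*2.0 - 1.0 = 1.48
```

A positive value means the experienced action looks better than what the current policy achieves on average from that state; a negative value means it looks worse.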
You probably realize that here we plug in the state value baseline function, we plug in the function approximator idea that you saw before, and we plug in the advantage function for our critic. So we use all three tricks together to derive this policy gradient update, and that's technically the only difference to the REINFORCE algorithm, because now we can use this estimate again to do classical gradient ascent. We have similar problems as you asked about before: we can only converge to a locally optimal policy. Then we need to refit our critic, because it was also an initial guess, and the critic should learn how good my policy is doing in a certain state. If we start with a random guess in the beginning, that is most likely off, so I also need some correction, or learning, in my critic part here. And again we want to do this model-free, because policy-gradient-based algorithms have the nice property that they don't need to know the dynamics model, and we want to make sure that learning a critic doesn't need a dynamics model either. If you remember, to evaluate the state value function in the beginning we needed the dynamics model, but we can do it model-free by
doing the following. In the second step we gathered experience, so for each episode that we executed we can compute the sum of rewards, and averaging over all time steps and all experienced episodes basically gives us a value estimate: being in a certain state s_t, what rewards did I experience afterwards when I executed my policy? Because I do nothing other than follow my policy here, this is exactly the definition of a state value function. So by using all the time steps in an episode and all the simulated episodes, and averaging over them, I again get a Monte Carlo estimate, and I look for the parameters that minimize the mean squared error. Maybe some of you know this from SLAM lectures: it's essentially a least-squares minimization again, and you can use any tool, any optimization algorithm, to do this minimization. Usually we do the same as for the policy gradient, but this time gradient descent, not ascent, because we want to minimize the error. And that's it: this is your vanilla actor-critic algorithm, and this is the algorithm that is also executed, in an adapted form, to train the racing drone and the quadruped robot. If you understood the lecture up to here, this basically allows you to go to these
papers and to understand at least the actor-critic state-of-the-art reinforcement learning algorithms, because mostly what they do is add a bag of further tricks that reduce the variance in the policy gradient estimation even more. We don't cover this today, it is a bit technical, but in general what most of these algorithms try to improve is how to deal with this high variance. And because of these tricks, adding a baseline function, using the advantage function with a critic, using neural networks as function approximators, we can actually learn quite complex behaviors, given sufficient compute, like outperforming the world champion in drone racing. This is not doable with the plain REINFORCE algorithm, because we would need to gather far too much experience, and you can't easily deal with these complex state and action spaces; for that you usually need to plug in neural networks. This is basically what people call deep reinforcement learning: whenever you hear that people applied deep reinforcement learning methods, they actually used some variant of the methods that we saw today, with neural networks as the function approximators. It's essentially the same bag of tricks and techniques. Do you have questions regarding how actor-critic algorithms work? Again, it's similar to the REINFORCE algorithm, and the difference comes in where and how we compute the policy gradient estimate. If you don't have any now, we can of course also talk about it later, and you can also drop us an email. We share the slides; I know it's a lot of content, so make sure you revisit the slides and try to get an intuitive understanding of this. This is everything from an algorithmic point of view that I wanted to show you today. Let's recap a bit what you actually saw, because you saw quite a lot, though not in all the details; for those, please also go to the literature that I showed you in the beginning, which gives really nice explanations of the details.
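The vanilla actor-critic loop described above (gather rollouts with the stochastic policy, update the actor with critic-based advantages, refit the critic toward Monte Carlo value estimates) can be sketched on a toy problem. Everything here, the two-state MDP, the hyperparameters, the tabular critic, is an illustrative assumption and not the code used for the drone or the quadruped:

```python
import numpy as np

# Toy vanilla actor-critic sketch: a 2-state, 2-action MDP where action a
# moves you to state a, and only (state 0, action 1) yields reward 1.
# The optimal policy therefore alternates: action 1 in state 0, action 0 in state 1.

rng = np.random.default_rng(0)
gamma, lr_actor, lr_critic = 0.9, 0.1, 0.5
theta = np.zeros((2, 2))   # actor parameters: one logit per (state, action)
V = np.zeros(2)            # tabular critic: state-value estimates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(T=10):
    """Run one episode following the current stochastic policy."""
    s, traj = 0, []
    for _ in range(T):
        a = rng.choice(2, p=softmax(theta[s]))
        r = 1.0 if (s == 0 and a == 1) else 0.0
        s_next = a                          # deterministic dynamics: next state = action
        traj.append((s, a, r, s_next))
        s = s_next
    return traj

for _ in range(300):                        # iterate: gather data, update actor, refit critic
    episodes = [rollout() for _ in range(10)]
    returns = {0: [], 1: []}                # Monte Carlo returns per visited state
    for traj in episodes:
        G = 0.0
        for s, a, r, s_next in reversed(traj):
            G = r + gamma * G
            returns[s].append(G)
        for s, a, r, s_next in traj:
            adv = r + gamma * V[s_next] - V[s]      # critic-based advantage estimate
            grad_log = -softmax(theta[s])           # gradient of log pi_theta(a|s)
            grad_log[a] += 1.0
            theta[s] += lr_actor * adv * grad_log   # policy-gradient ascent step
    for s in (0, 1):                        # refit critic toward the Monte Carlo estimate
        if returns[s]:
            V[s] += lr_critic * (np.mean(returns[s]) - V[s])

pi0, pi1 = softmax(theta[0]), softmax(theta[1])
# After training, state 0 should prefer action 1 and state 1 should prefer action 0.
```

The tabular critic plays the role that a neural network value function plays in the real systems; the loop structure (simulate, compute advantages, gradient ascent on the actor, least-squares refit of the critic) is the same.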
So the first thing that we saw: RL is a powerful tool. What do I mean by that? You can solve a lot of different problems: you can control a nuclear power plant, you can do drone racing, you can solve almost arbitrary sequential decision-making problems. They need to follow the formalism of the Markov decision process: everything you saw today, you need to be able to formalize as a Markov decision process, and then you can use the algorithms that we saw today. That's why I think it's a powerful tool. Another reason is what we discussed in the beginning: you don't need explicit supervision anymore. You don't necessarily need a human annotator who supervises every action you took. This is one advantage people mention when discussing, for example, whether it makes sense for self-driving cars to learn from a human driver. Learning from a human driver means we need expert data on how a human drives, and then for each action that you as the agent execute, you can compare it to what a human would do. But this is always limited by the data you have available. In reinforcement learning, you only need to define the reward function that incentivizes your behavior, which is complex in itself, and then you can use any of these algorithms to figure out the
policy. We also saw today the equivalence of going from a policy to a state value function and back. If you remember, that came in quite handy when deriving the indirect, value-based RL algorithms, of which we saw one example, value iteration. This equivalence was important because it means that instead of finding the optimal policy directly, we can find an optimal state value function and then, because of this equivalence, derive the optimal policy from it. However, we also saw some limitations. For example, value iteration in the form we saw is a model-based algorithm, so we need knowledge of the state transition function, which is quite hard to define for real-world problems. So what we usually aim for is a model-free algorithm, although that also has disadvantages; but if you can't fully define your dynamics model and can't learn it from data, then model-free algorithms are usually the way to go. The other problem we saw is that the general form of value iteration, and also Q-learning and SARSA, which we talked about briefly, are in their basic implementations not capable of performing updates for continuous action spaces, and also not for continuous state spaces. We did see that the idea of indirectly optimizing an optimal state value function partially contributes to these problems. So instead we asked: can we also directly optimize a policy? It turns out yes, we can, by framing it as a classical optimization problem. So RL can be understood as an optimization problem, which is good because then we can plug in function approximators, for example a neural network, which deals with continuous states and continuous actions out of the box, without changing anything. And we saw the policy gradient theorem, which essentially tells us how to compute the gradient of the expected sum of rewards for a given policy. If we have this gradient, we can derive our policy-gradient-based algorithms, such as REINFORCE, to then iteratively optimize, based on this gradient estimate, toward the optimal
policy. But we also saw that there is a problem with policy-gradient-based algorithms such as REINFORCE: we have a really high variance in the policy gradient estimates, and usually this means you need to gather a lot of experience, for example in the massively parallel simulation that we saw in the beginning.
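Written out in the notation used earlier in the lecture, this recap corresponds to the policy gradient theorem and its actor-critic variant (here G_t denotes the experienced return along a sampled trajectory):

```latex
% Policy gradient theorem, REINFORCE form (G_t: experienced return, high variance):
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]

% Actor-critic form: replace G_t by the critic's advantage estimate (lower variance):
\nabla_\theta J(\theta)
  \approx \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_\psi(s_t, a_t)\Big],
\qquad
A_\psi(s_t, a_t) = r_t + \gamma\, V_\psi(s_{t+1}) - V_\psi(s_t)
```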
To circumvent that, to some extent, we introduced actor-critic algorithms. I think I didn't tell you this before, but if you think about it, actor-critic algorithms are actually a hybrid of what you saw in the first half of the lecture and the second half. In the second half we talked about how to directly optimize a policy with REINFORCE; we still do that, but we also use knowledge from value functions, like the value function for a given policy, which was essential for the indirect, value-based algorithms, and we merge them into one algorithm. By merging these ideas from the first half with ideas from the second half, we end up at these hybrid actor-critic algorithms, which have the nice property that most of the state-of-the-art robotics control behavior you see today is trained with one or another variant of them. That's it for today. I think it was a lot. If you have any questions, you can ask them now; if you want me to go back to something that we
should discuss in more detail, feel free to ask. [Question: how far do these models go in applications, especially control applications, without knowledge of the system?] That's a really good question. In the setup that I showed you today, what we always said is: I interact with the environment, I gather some feedback, the policy I execute in the beginning is probably suboptimal, and based on this gathered feedback I adapt my policy behavior with one or another algorithm. So I act online in an environment, and how I act there is not necessarily optimal right away; otherwise I wouldn't have an RL problem in the first place. This means that if you deploy your robot directly in the real world, you usually need safety precautions, because the policy you execute might not lead to the intended behavior in the beginning. So directly deploying a robot in the real world is usually hard. What we do instead is fall back to the simulations that I showed in the beginning, because there it's not so harmful if you execute suboptimal or safety-critical behavior. We go into that simulation, which hopefully resembles what we see in the real world, and we learn there. It takes long: to give a sense of the time scale, I don't recall the exact numbers, but to get such complex behavior as drone racing or parkour you need many, many environment interactions, meaning: in a certain state I execute an action, I try it, and I see what the reward is. It's hundreds of millions of these interactions for state-of-the-art algorithms, so executing this in the real world, where your robot is quite slow and sometimes breaks down, is basically impossible. So what people do is fall back to these simulations, use the compute they have, parallelize it, and over time it converges to a good policy in the simulated environment. Then they deploy the policy network pi_theta that you saw online on their robot: you observe the state with your sensors, as you did in simulation, and then you just run inference on your neural network and execute the action that the network proposes as the next best action, as trained in simulation. Because neural network inference is usually fast, unless you train a giant network, this is how they achieve roughly real-time behavior; of course it depends on your application how real-time it needs to
be. [Question: why do we generally end up optimizing the state value function rather than the action value function? Even when we started with the actor-critic, we used the action value as the critic, but then we ended up bringing in the baseline function.] For the actor-critic itself, the reason is exactly what you were saying: in the REINFORCE algorithm, the experience that I gathered is an estimate for the Q value, the left-hand side here. Why do I end up with this form in the actor-critic case? For exactly one reason: you want to minimize the variance. People realized: I could use the Q function and learn it too, but I would probably need another neural network for it. I don't need to, because when I expand the definition of a Q value, I get the immediate reward from the action that I took plus the discounted value of the next state; this identity is exactly the Q function as before. Then you subtract your baseline to reduce the variance, and it comes in handy that you don't need two neural networks but just one: you plug in the next state once and the current state once. So you also need to train
fewer neural networks. For the first half of the lecture, again, you're right: I did say it's usually easier to derive the policy from a Q function, because we saw that policy evaluation with a state value function needs knowledge of the dynamics model. We do have algorithms for that, namely SARSA and Q-learning, which I didn't cover in detail today, but they realize exactly what you said: for these indirect, value-based algorithms the state value function does not come in so handy (here it falls directly out of how the advantage function works, but there it doesn't). So people developed, for example, the SARSA and Q-learning algorithms, which directly optimize the Q function. Then, if you have your optimal Q function, for a given state you can just loop over all actions and pick the action with the highest Q value; this is how you would derive your policy there. So that's a very good point: for the first half we showed methods that work with state value functions, but indirect, value-based algorithms nowadays usually use the Q function because you can directly read off the policy, and for the second half it's exactly this derivation where the state value comes in handy. Do you have more questions? All right, then thank you for staying with me.