Introduction to Reinforcement Learning (Julius Rückin)
By Cyrill Stachniss
Summary
## Key takeaways

- **AlphaGo Masters Go from Scratch**: A Google AI beat the world champion at Go in 2016 using reinforcement learning, learning completely from scratch to play at a superhuman level despite the game's complexity requiring human intuition. [01:47], [02:31]
- **Quadruped Climbs via Sim RL**: An ETH Zurich group trained a quadruped robot with reinforcement learning in simulation to climb, jump, and crouch over diverse obstacles in massively parallel fashion, achieving complex behavior without preprogramming. [03:34], [04:24]
- **Drone Outraces Human Champion**: The University of Zurich developed a reinforcement learning algorithm that races a drone as fast as or better than the drone racing world champion, with the AI drone showing superior reaction time and complex racing behavior. [04:35], [05:15]
- **RL Agent Maximizes Reward Sum**: The goal in reinforcement learning is to find a policy that maximizes the expected sum of rewards over a sequence of actions in a Markov decision process, optimizing long-term behavior rather than acting greedily. [09:25], [09:57]
- **Policy Gradient Enables Model-Free**: Policy gradient algorithms directly optimize parameterized policies such as neural networks without needing a dynamics model, handling continuous state and action spaces by computing gradients of expected rewards. [53:44], [54:38]
- **Actor-Critic Cuts Variance**: Actor-critic algorithms use a critic to estimate state values as a baseline, reducing the high variance of policy gradient estimates via the advantage function and enabling training of complex behaviors like robot control. [01:04:04], [01:13:56]
Topics Covered
- RL Masters Go Without Human Intuition
- Rewards Enable Supervision-Free Control
- Value Iteration Beats Policy Iteration
- Policy Gradients Ditch Dynamics Models
- Actor-Critic Cuts Variance Conquers Reality
Full Transcript
hey everyone my name is Julius and today I was asked to give you a brief introduction about reinforcement learning also in the context of this self-driving cars lecture so we will see
also robots today using reinforcement learning this is meant to be an introduction I will stress that because you will soon realize reinforcement learning is quite a broad topic you
can not only use it for self-driving cars you can use it for plenty of problems to solve them and there are many different algorithms with pros and cons we will
talk about some of them in a bit and if you don't fully grasp every detail that's completely fine I want you to get an overview of the different classes of algorithms today and maybe what are the
advantages what are the disadvantages how to different algorithms address them so if you do have questions in between feel free to ask we can also do it afterwards whatever you like just raise
your hand and then I can try to re explain something if it's unclear so if you're interested after that lecture to delve deeper into that topic
I can really recommend you the lecture series by David Silver from Google DeepMind he talks in detail about reinforcement learning I think it's like 10 lectures where he goes into the
details of all the algorithms that we'll see today but also of more than that so that's a really nice source if you're a visual learner and
listening to lectures is something you like I can recommend going there and if you want to have a textbook and look at the math in a bit more detail the go-to book
although it's quite a few years old is still Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto so let's start with a few examples
of what RL is used for today so maybe some of you saw the headlines in 2016 that a Google AI actually has beaten the
world champion I guess in Go did some of you see that like did you yes perfect so this is actually a computer program that is trained with reinforcement
learning to play the board game go and it is so significant because maybe some of you do play go and know how it roughly works I only roughly know how it works but I got told it's quite complex
you have many board configurations and it's believed that you need some human intuition component to play this game well and still we can come up with the reinforcement learning algorithm that
learns completely from scratch how to play this game and actually at a level where it beats one of the best players in that game it's not limited to board games we
can since a while already also play chess we can also do it with reinforcement learning now we can play video games Atari games for example if
you know them and in both chess and board games and go and also Atari games actually with reinforcement learning we are able to learn to play games better
than the best human players in the world but reinforcement learning is not limited only to games we can actually do useful things with it we can
look into controlling nuclear power plants with it we can look into Financial Market decisions with it and maybe the most interesting part for you in this lecture series is we can learn
to control robots with reinforcement learning and today I want to show you two small examples in the beginning in a video the first one probably you've seen before is such a
quadruped robot and it's a work from ETH Zurich from the group of Marco Hutter and they actually taught with a reinforcement learning algorithm the
robot to climb these boxes so how they do it is they actually build a simulation of different obstacles and the robot tries from scratch to move
around these obstacles and above the obstacles in a massively parallel fashion and by training in such a diverse fashion on such diverse obstacles what we
can actually achieve is quite impressive complex climbing Behavior with that robot and this is fully learned in simulation with reinforcement learning
we can jump we can climb down I think in a bit you will also see that yes it can crouch below objects this is all quite complex robot behavior that is not
preprogrammed for example like what you would do with classical P control this is all learned and another really nice example
recent example of reinforcement learning and robotics is drone racing at the University of Zurich in the group of Professor Scaramuzza they actually developed a reinforcement learning
algorithm that learns to race a drone as fast or better than the actual drone racing champion at the moment and racing this drone you will
see in a bit the blue drone is the AI controlled drone and the red drone I think is the human drone and the blue drone has of course
better reaction time but even afterwards can learn such complex racing behavior from reinforcement learning that it actually out competes a human world champion in drone
racing and what you will see today subsequently are RL algorithms that form the basis for both of these works so at the end we will see an algorithm that is the basis with some
state-of-the-art adaptations to it for both of these works why is it worth talking about reinforcement learning because reinforcement learning is not just
another machine learning Paradigm you may have heard about how to train neural networks to do perception task like detecting objects in your self-driving car lecture for example detecting cars
or pedestrians but there someone gave you a collected data set and this data set is annotated by human annotators so someone
told you explicitly in a supervised fashion this is a car this is a pedestrian here in our examples we don't have this we don't have a pre-recorded
data set instead what we do have is in general what we call an agent this agent could be the robot that you saw before and it takes decisions which is what we
call taking an action the action could be motor commands to for example do the drone racing this means that we don't need
necessarily a supervisor the only thing that you do need to Define is a reward and you can reward for example your drone in the racing scenario by
successfully passing through a gate you can also reward it by the lap time and say shorter lap times are the intended result so I don't supervise
each of the motor commands but only say what's the intended outcome by defining such a reward function this is quite powerful but this has also some challenges worth talking
about in this lecture so by just providing this reward function you do have delayed feedback in the classical perception use case we have immediate feedback for your neural network did I
detect the car or not here my drone has to execute quite a lot of motor commands until I can actually evaluate how fast it completed the lap or
not so this is generally then also a harder task to learn and we do make decisions sequentially as I said before so we do one motor command after another
just by the nature of the problem this means my actions that I do take actually dictate or determine what is the next image for example that I do
see with my camera on the drone and this means also that the usual assumption that we do have in perception that we have a data set that is independently
and identically distributed is not true here if I am a robot standing in this room and I do have a camera with such a field of view looking at the wall or turning to the
left actually gives me a picture of the wall looking to the right most likely doesn't give me a picture of the wall but of the window so I do have correlated observations which makes
learning also harder so let's take one step back and formally Define the reinforcement learning problem again as said we do
have an agent the agent takes a decision in general in an environment it could be a race track it could be a robot deployed in this room it doesn't matter this environment delivers a next
observation for example in our example from before it could be an observation from a camera giving me back the picture of the window and I do have a reward signal that defines the intended
behavior for example if my goal would be with a deployed robot in this room to find my laptop then I get a reward of plus one if I do get to my laptop I
explore this room and as soon as I find the laptop I do get the reward so I do need to make decisions sequentially this is why it's a circle basically for each
action I do get back an observation from the environment and the reward of how good it was to execute this action and the goal is now to
maximize the expected sum of these rewards so for each step each action that I take as My Little Robot here in the room I do get a reward that can be
zero if I don't find the laptop and at that one point it's a one so what I do want to do is not only maximizing the immediate reward for one action but I
always want to maximize the sum of rewards over a sequence of actions this means I don't act greedily but I try to optimize actually the
long-term behavior of my agent and one thing that we need to formalize is a so-called Markov decision process the Markov decision process is the
underlying mathematical framework of all reinforcement learning problems and it consists of the following components that we mentioned already and now define in a mathematical way so we do
have the state space which we denote usually as this curly S and the state space is the set of all states that my environment can be in for example in
this room and my navigation task as a robot the robot can be in any position in this environment defining the state further we have an action space that
defines which actions can I execute in my navigation case of the robot it could be going forward going to the right or turning
actions then I do have as we saw before the reward function telling me if that was a good action to execute or not this reward function depends on the
action that I took in a given state s and on which state I end up in for example if I'm standing here I take an action that directly turns towards
the laptop in our example that's a good action if I would turn away from it that's a bad action and in which state I end up is actually defined by something that we
call a state transition function based on the current state of the environment and the action that I chose I do get a probability for all
possible next states that I could end up in meaning if I stand here and I execute the action of going forward for example I end up in this next
state with probability one because I don't have actuation noise but for example if you think about the drone the drone motor might break down or you have slip in a real robot then you might also
end up in a next state that you didn't actually intend to end up in with an action so we do have uncertainty about the outcomes of the actions that we execute
defined by the environment model and this model is called a state transition function we have two important assumptions here for our problem the first is
when we always talk about a state of the environment we mean that all the relevant features in my environment that I need to know about to solve the problem are actually fully known or how
we call it fully observable this means to get to my laptop in an efficient fashion and to model this as an MDP I actually need to know where I am and if
you think about your localization lectures and mobile robotics that you took before maybe this is not always true you don't always know with certainty where your robot is but you can say for some places it's more likely
or less likely so this is important to keep in mind in the basic setup we always assume that all the information is observable in the environment
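Collecting the ingredients named so far, the MDP can be written compactly as a tuple; the notation below is a common convention and may differ slightly from the lecture's slides:

```latex
\text{MDP} = (\mathcal{S}, \mathcal{A}, R, p), \qquad
R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}, \qquad
p(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)
```

Here \(\mathcal{S}\) is the state space, \(\mathcal{A}\) the action space, \(R\) the reward function, and \(p\) the state transition function.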
and as a second assumption that is encoded in our environment model is something that we call the Markov property you might have heard about that during your occupancy mapping lecture in
occupancy mapping we do say for a given map state and the received observation the given map state fully determines together with the observation the next map state
right what we do here is quite similar we do say our current state of the environment together with the action a that we chose fully determines How likely it is to end
up in the next state this means I don't need to remember all the previous States and actions that I chose but I just need to be able to observe my current state to
fully say how the environment will develop so going one step back we need one more ingredient to fully define our problem I always talked about the robot or the agent that takes
actions but what actually determines which action we take this is what we call a policy function a policy function is internal
to the agent and it maps for a given state to an action that I want to execute so for example my policy if it's a rather good policy and my navigation
example would be okay I directly turn to the left to see the laptop it could be an arbitrary policy taking random steps that are not ideal towards the window it doesn't matter but it's some function
that determines my behavior and important to also note is it doesn't have to be a deterministic function so for a state like usually what we talked
about now you end up in a new state taking one particular action but I do not only have problems where it's good to choose
deterministically based on the state my next action sometimes it's good to have a so-called stochastic policy which means I act with some stochasticity in my
behavior for example if you think about rock paper scissors if you play this against an opponent and you always play the same move in the same state that your game is in your opponent will be
hopefully smart enough to figure this out and you don't get the optimum result of winning this game after a few rounds so it's much better to act with a bit of randomness in there depending on the state and this is what we call a
stochastic policy and we will see later where it comes in handy the mathematical formulation of the whole problem that we talked about is exactly this it looks a bit clunky but
it makes sense to go through the mathematical notation once because it will reoccur in the algorithms that we will see that solve the RL
problem so what you see on the right is the reward function as before depending on the current state s at time t the action that we take in that state and
the next state that we end up in and we do have something that we call a discount Factor the discount factor is valuing how important it is for me to
get an immediate High reward versus how important it is to get a long-term High sum of rewards so if my discount factor
is going towards zero essentially what it says is I care more about my immediate rewards than I care about later rewards but as we saw before there are problems where you also might want
to have behavior that acts long-term optimally so you can actually rate the importance of future delayed rewards by setting the discount Factor appropriately and then what we actually
want to have is a behavior or what we call a policy that is optimal the optimal policy is something we call pi star from now on and the optimal policy
is out of any behavior that I might execute the behavior that maximizes the expected sum of rewards the sum of
rewards we saw before what it is and the expectation is important the expectation is important because you have multiple sources of Randomness we talked about
the state transition function defining how likely it is when executing an action in a certain state to end up in the next state so there is stochasticity in the
next state that we end up in we do have as we said before most likely stochastic policies so when following our policy that is stochastic we sample
from a probability distribution so we might end up with different likely or unlikely actions and a side technicality is that your problem has to start somewhere I have to deploy or
spawn my robot for example somewhere in this room in our previous example and sometimes it might be more likely that it spawns near the laptop
sometimes it will spawn further away but where it spawns is essentially defined by this initial state distribution that we call
mu and to facilitate our lives a bit we now do one trick and we refine the notation a bit and instead of having the expectation over these three things we
have an expectation over what we call episodes and an episode is exactly what we talked about before it's an initial state an action
that I execute in the state we end up in the next state we again choose an action and so on and so on until we are at the terminal state after n steps in our
episode and the probability that this episode occurs is exactly given by the individual terms that you saw before it's given by the initial state distribution times the likelihood or the
probability that we execute an action in the state times ending up in the next state given the executed action and the state that we were in before
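In symbols, the episode (trajectory) probability just assembled and the resulting objective read roughly as follows, with \(\mu\) the initial state distribution and \(n\) the episode length:

```latex
p(\tau) = \mu(s_0) \prod_{t=0}^{n-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
\qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\tau \sim p(\tau)}\!\left[ \sum_{t=0}^{n-1} \gamma^{t} R(s_t, a_t, s_{t+1}) \right]
```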
this way now it doesn't look so scary anymore the formula we just have an expectation over all the episodes that we might experience and we want to
maximize the expected sum of rewards for all these episodes to make this a bit clear how such a simple uh problem could look like I want
to show you one example of a maze the maze on the right has a start position it has a gold position and we want to find the shortest path to enforce that we give a reward of minus one each time
step that we execute an action so to maximize that we are forced to find the shortest path and you can figure it out manually
by hand that for example standing here at the start and going one to the right to the minus 15 the number of steps that you need to take to
reach that goal is 15 the shortest path that you can take from that position onwards is exactly 15 steps this means if someone would give
you this maze with the numbers for example let's say I'm at the minus 15 what would you now intuitively do to decide if I go up or
down who of you knows if I'm at the minus 15 how do I decide if I go up or down based on this what is the uh you know lesser
number so like exactly exactly so you go to the minus 14 because you want to maximize the expected sum of rewards which is given by this
number one detail that is still important to realize we talked about the expected uh sum of rewards so for example in a slightly different problem
where I'm standing here the desk is a lake that I could fall into as a robot and my goal is exactly on the opposite side of this Lake I can go around the
lake by executing a path that is very nearby the lake but if I do have actuation noise for example that lets me sometimes slip into the lake I might
never reach the goal so what you also need to consider again is the dynamics model of your environment so in that case it could be better to have some safety distance because then in
expectation on average you reach the goal faster because going into the lake never gives you a goal-reaching behavior and the nice thing is exactly
what you said how to do it we can derive from such uh information which we call the state value function a policy and as you said the best action for example
being in the minus 15 correctly is going up that we can do for all the states that we might be in and we derive a policy and actually it turns out that we
can arbitrarily go from such a state value function deriving a policy and from a policy back to a state value function so what we will see soon is
that it's actually equivalent of giving you a value function from that you can derive a policy or giving you a policy from that you can derive a value function and this
is an important equivalence that we will make use of in algorithms that actually will later determine our optimal Behavior so just to make it clear again
what we said before the value function is exactly this we want to maximize the sum of rewards in expectation following
our given policy pi so the episode tau actually is what we experience when we execute the policy that is given by
pi and we make use of one identity that comes in handy later to define the algorithms to find Optimal policies
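Written out, the state value function and the identity about to be used (splitting off the k = 0 term of the discounted sum) are approximately:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s \right]
= \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)} \big[ R(s, a, s') + \gamma\, V^{\pi}(s') \big]
```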
if you look at the case for k equal to zero the discount factor term gamma to the zero equals one and you have the immediate reward of being in a state s I execute an action
with respect to my policy and end up in the next state so I get immediate feedback for just the next action and then I can look up from the
next state that I end up in again my value function at this next state which defines from there on the remaining sum of rewards so the state value function
of a state is exactly determined by the immediate reward that you can get by executing the policy plus the next state's value
function and similarly we have something that we call an action value function or also sometimes named a Q function for a given policy and it's similar to the state value function but there's one
difference instead of following the policy of my agent I instead force my agent now to execute one very specific
action a in the current state so instead of following my policy I execute this action a for one time step I observe the immediate reward and from the next time
step onwards as defined here I do again ask my state value function when I follow my policy from the next time step what would be the value of doing
that we will see later why both of these functions are important so just to make it clear how do we establish this
equivalence between policy and value function the first thing that we answer is how do we actually derive the policy what you just did before from the value
function and what we do is actually we can Loop over all states for example all positions in my Maze and I can check for
all actions that I might want to take in a current state what is actually the immediate reward that I get plus if I do
have the value function what is the state value of being in the next state that I end up in and this is averaged over our Dynamics model because we need to
account for the noise in our environment and then what I do is exactly what you said before I take the action that
maximizes this expected Q value in the end and this is rather complicated because of one small detail to integrate
out over all the possible states that I end up in I need to have this dynamics model or the state transition function and it turns out that it's often much easier if someone would have given you
the Q function instead of a state value function because then I can just loop for a given state over all possible actions that I take and then take the action with the
maximum Q value the other way around we can do it similarly so going from a policy to a state value
function is also important because if you think about it maybe as an agent you somehow have an initial behavior that you might want to execute but you do want to know how good is this behavior
in each individual State and this is exactly given by the state value function so it's important to talk about how to evaluate basically the quality of your policy this is what we call a
policy evaluation and what we do is we iteratively start from a random guess how our state value function might look
like in terms of the maze for example if I don't have any prior knowledge I insert just zeros for all the states and
what I do then is I iterate again over all states I choose in the state all possible actions a analyze How likely is
it given my policy that I take this action and then integrate over all possible next States How likely it is to end up there and if I do that for a
concrete pair of State action and next state that I end up in I can get a reward plus my next state value function and it's important to realize the state
value function in the beginning is an initial guess so I don't really know how good it is to be in that next state but we can
compute this reward the immediate one so at least for this one time step I do get a concrete estimate of how good it was to execute this time step so when I now
back up my value for this state S I do have a slightly better estimate of V so in the next iteration I do have a better
estimate for my next state V Prime so I can again collect an intermediate reward plus now having a bit of a better
estimate for the state value function at the next position and this way I converge to the value function for a given
policy and this operation is called a Bellman expectation equation because for a given policy we average over the distribution
that the policy induces and we can iteratively apply these Bellman equations to in this case converge to the state
value function for a given policy and now one thing that is still unclear because we talked about how to go from a state value function to a
policy and from a policy to a value function is how do I actually end up with the best policy with the optimal one right because at the moment the only thing that I can do is going from some
policy to the corresponding value function and back but this doesn't mean that the policy is actually inducing the optimal behavior and we need one more identity
to actually get to an algorithm that allows us to do that so if you remember the optimal policy was denoted as pi
star and actually as it turns out finding the policy pi star that maximizes the expected sum of rewards is exactly
the same as finding what we call an optimal state value function V star because for an optimal policy to be optimal it has to be the
state value function that is maximizing my sum of expected rewards and then I can exploit that I can literally transition between policy
and state value function as we saw before so instead of finding the policy directly because this is kind of not handy we can also directly find the
optimal State value function and if we do find the optimal State value function we can use our previous algorithm to go back to the optimal policy similarly we can do that for Q
functions to find the optimal policy we can also find an optimal Q function because as we said before sometimes it's hard to go from a state value function to a policy because you need to know the
environment dynamics but for the Q function to get the optimal policy I can just loop over all actions a and choose the action with the highest Q value so I
don't necessarily need to know my state transition function today we talk about an algorithm that finds me the optimal state value function so we are not talking
about finding optimal Q value functions we do see how to generate the first algorithm now that finds me the optimal State value function and does
find me the optimal policy and to do that it's pretty much similar to what you saw for policy evaluation you don't know the optimal State value function so
you need to start from an initial guess and you can now again iteratively refine your guess how do you
do that well I start looping over all states and for a single state s I try out all possible actions
and they lead to some new state s prime and for this new state that I end up in I can again compute my immediate
reward of ending up there plus asking my not really accurate guess of an optimal value function V how good is it to end up in the next
state so again my initial guess for the optimal state value function at what happens being in the next state and proceeding with my policy is wrong for sure unless
someone by accident gave you the one already that's correct but what I can do again is I can evaluate the immediate
reward so again I need to integrate over the uncertainty in my environment model given here by ending up in the next state S
Prime and then instead of averaging over the policy that we were given before as you may remember this is not what we do anymore but instead what we
do now is exactly what you said before we want to take the action that leads us to the maximum right and to which maximum to
the maximum Q value here for a given state and a given action a so instead of following a policy we now try to enforce choosing the next best action the next best
action in the sense that we maximize the immediate reward that we can compute plus some initial estimate of what happens after but this way we can
correct our initial guess of V for the next state s prime and actually you can show that this iteratively converges to the optimal value function so for policy
evaluation we averaged over our stochastic policy for value iteration finding the optimal value function we solve a maximization problem
here and this is again a Bellman equation because it's an iterative process but this time it's the Bellman optimality equation because we want to enforce
optimal behavior and as I said this estimate V converges to our optimal State value
function V Star from the previous slide if you have this you also directly get the optimal policy by applying the previous algorithm
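As a concrete illustration, here is a minimal sketch of this value iteration loop with greedy policy extraction on a hypothetical six-state corridor, a stripped-down stand-in for the maze example; the state layout, names, and toy dynamics are illustrative assumptions, not taken from the lecture:

```python
# Minimal value iteration on a hypothetical 6-state corridor (a stand-in for
# the maze example). States 0..5, goal at state 5 (absorbing). Actions: -1
# (left) and +1 (right). Reward is -1 per step, so maximizing the sum of
# rewards means finding the shortest path.

N_STATES = 6
GOAL = N_STATES - 1
ACTIONS = (-1, +1)
GAMMA = 1.0  # undiscounted shortest-path problem

def step_model(s, a):
    """Deterministic dynamics model p(s' | s, a): move and clamp to the corridor."""
    if s == GOAL:
        return s, 0.0                      # absorbing goal: no further cost
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, -1.0                    # every step costs -1

def value_iteration(tol=1e-9):
    """Bellman optimality backup: V(s) <- max_a [ r + gamma * V(s') ]."""
    V = [0.0] * N_STATES                   # initial guess, as in the lecture
    while True:
        delta = 0.0
        for s in range(N_STATES):
            best = max(r + GAMMA * V[s_next]
                       for s_next, r in (step_model(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                    # values stopped changing: converged
            return V

def extract_policy(V):
    """Greedy policy: for each state take the action with the highest Q value."""
    policy = []
    for s in range(N_STATES):
        q = {a: r + GAMMA * V[s_next]
             for a in ACTIONS
             for s_next, r in [step_model(s, a)]}
        policy.append(max(q, key=q.get))
    return policy

V_star = value_iteration()
print(V_star)                  # negated shortest-path lengths to the goal
print(extract_policy(V_star))  # greedy actions derived from V*
```

With the per-step reward of minus one, the converged values count the negated distance to the goal, mirroring the minus 15 logic discussed for the maze; swapping the max over actions for an average under a fixed policy turns the same loop into the policy evaluation described earlier.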
again so let's look at this algorithm one more time because this looks like okay now everything is set right now we do have an algorithm that finds me the
optimal state value function so I end up also with an optimal policy but there are actually good reasons why this algorithm does not solve everything in
reinforcement learning for example if you look at this for loop we loop over all states s for the maze example it's kind of intuitive right you go over each
position that you might end up in and you execute this for loop but what would you do for example in the drone racing for like the continuum of all the poses
that your drone can be in if you have a continuous state space it's not so straightforward to loop over this anymore and another thing is for example we always talked about discrete actions
there's a finite number of actions we can make one step forward backward left right in our maze but what is this algorithm doing if we have continuous
actions for example continuous motor commands in the Drone racing case again then we have another optimization problem in
here that is not so straightforward to solve so the basic form of this value iteration algorithm only works well for discrete finite action spaces and this
sorry State spaces and discrete finite action spaces but for these if you do know the Dynamics model we can relatively
efficiently find our optimal policy so let's summarize until now what we learned how to find a policy so how
to find a policy is what we solve by finding actually so-called optimal State value functions and we saw in the algorithm that there are some
limitations we saw for example that you need to know if I go back this Dynamics model to evaluate this whole formula this Dynamics model in
real world is not always straight forwardly given right because do do you know how the environment completely behaves in the super complex drone racing part or in this quadrupled leg
Parkour part I don't so there are definitely real world instances where it's not so handy to assume that you do have knowledge about your environment to that extent that you can define a
probability distribution like that we also quickly teaser that to some extent if you would have the optimal action value function this solves the
problem because instead of Performing policy evaluation averaging or integrating over our state transition function what we can actually do is looping over all actions that I can
execute and take the action that leads to the maximum Q value we don't talk about that in detail it's definitely talked about in the literature that I showed you in the beginning and if you
want to look at up these algorithms the first one that is quite uh basic and uh popular is called sarar so Sr S A RSA and the second one that you might
have heard of is Q learning and both of these actually try to achieve the same as value iteration but not finding an optimal State value but finding an optimal action value and this way you
can get rid of this assumption we do see another solution in a bit how to get rid of this assumption but first for completeness we said value iteration and
also by the way Q learning and such only work for discrete State spaces in their rudimental form and only work for discreete Action spaces but as we saw in
the robotics case quite often you have continuous action spaces and you do have continuous State spaces and partially all these problems
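To make the Q-learning idea just mentioned concrete, here is an illustrative sketch (not code from the lecture) of the tabular update applied to a small, made-up batch of experienced transitions. Note that it never touches the dynamics model T:

```python
# Tabular Q-learning on a tiny, hypothetical 3-state chain (illustrative only).
# Each experienced transition is a tuple (s, a, r, s_next); the update needs
# no dynamics model T(s' | s, a), only experience.
ALPHA, GAMMA = 0.5, 0.9
STATES, ACTIONS = [0, 1, 2], ["stay", "go"]   # state 2 is terminal

experience = [
    (0, "go", 0.0, 1), (1, "go", 1.0, 2),      # the rewarding path
    (0, "stay", 0.0, 0), (1, "stay", 0.0, 1),  # the unrewarding alternatives
]

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for _ in range(200):  # replay the experience until the estimates settle
    for s, a, r, s2 in experience:
        best_next = max(Q[(s2, b)] for b in ACTIONS)  # max over next actions
        # Q-learning update: move Q(s,a) toward r + gamma * max_b Q(s', b)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in [0, 1]}
print(greedy)  # both non-terminal states end up preferring "go"
```

The loops over the Q-table and over `ACTIONS` are exactly why the basic form only works for discrete, finite spaces.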
Partially, all of these problems are due to our initial idea that we might solve the problem of finding an optimal policy not by finding the policy itself, but by finding the corresponding optimal state value function; because, if you remember, we had to loop over the states to compute that optimal state value function. So what we want to do next is look at another class of reinforcement learning algorithms. The class we saw so far, because it finds some kind of value function, is called value-based reinforcement learning; the next class that we will see performs direct policy optimization, and what that means we will talk about in a bit. If there are no direct questions, I would move on to how to define and optimize your policy directly. Do you have any questions up to here? Are the limitations following from this algorithm clear?

[Student] When we optimize the state value function, do we also get the optimal action value function?

That's a really good question. So the question is: can such an algorithm also deliver us the optimal action value function, the optimal Q function? This exact value iteration, as written down here, can only find you the optimal state value function. But the motivation carries over, because we saw the identity that the Q function is basically the immediate reward of executing a specific action, one that is not necessarily drawn from your policy, plus the state value function from the next state onwards. So a similar trick works: both SARSA and Q-learning, the two algorithms to remember if you want exactly what you asked for, exploit the identity that I can evaluate my immediate reward plus an initially wrong guess of what will happen from the next state onwards. In different forms, that is the main trick in all of these algorithms. Any other questions regarding this algorithm or the equivalence between policy and value functions?

[Student] How do you choose the value of the gamma?

Yeah, the gamma, the discount factor, is something you can choose as a user. What the gamma expresses is this: if it's close to zero or set to zero, it basically means that whatever happens after my immediate next state doesn't matter for my optimal behavior. If it's set to one or close to one, it means it's very important to me how well I'm doing after executing this next step. So setting gamma to zero leads to something very greedy, where I optimize the short-term reward of my behavior, whereas setting gamma close to one expresses the idea that in many scenarios I want a long-term planning strategy that leads to better behavior overall. A simple example could be a navigation problem like in Google Maps: say I want to go to Cologne by train. One option is to go to the station, jump on the regional train, and I'm there; the second option is to go to a stop nearby, get on a bus, and then take the tram to Cologne. We know that the tram takes much longer, but we also know that the regional train is quite often delayed, right? So the greedy strategy, taking the route that looks immediately faster towards Cologne, is not necessarily the best, because it doesn't account for the waiting time when the regional train is delayed. It could be overall better to take the longer tram route with less waiting time in between. How far-sighted you want to be is what you design with this gamma. And it's important to note that for different gamma you can have different state value functions, of course: with changing gamma, the behavior that is supposedly optimal under that gamma changes, and the closer it is to zero, the more short-sighted that behavior is. How to set it, of course, depends on the actual problem you want to solve.
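To illustrate this effect with made-up numbers: compare a "greedy" option that pays off immediately against a "far-sighted" option that pays more but later. Which one has the higher discounted return flips with gamma:

```python
# Discounted return G = sum_t gamma^t * r_t for two hypothetical reward sequences.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

greedy_option = [1.0, 0.0, 0.0, 0.0]      # small reward right away
farsighted_option = [0.0, 0.0, 0.0, 2.0]  # bigger reward, but delayed

for gamma in (0.1, 0.99):
    g = discounted_return(greedy_option, gamma)
    f = discounted_return(farsighted_option, gamma)
    print(f"gamma={gamma}: greedy={g:.3f}, far-sighted={f:.3f}")
# With gamma=0.1 the immediate reward wins; with gamma=0.99 the delayed
# reward wins, i.e. the supposedly optimal behavior changes with gamma.
```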
Okay, maybe in the interest of time we move on to what I already teased: the second class of algorithms, which finds a policy directly. The big question is how we can actually do that, because, if you think about it, our problem is finding the maximizing policy within the set of all possible policy functions. What does that mean? It's kind of hard to grasp. Before, we exploited the fact that we can basically build a lookup table: for each position in the maze we assign an action, and this defines our policy function, right? But it's not so clear what this optimization over a set of functions means in general.

We can do a trick that you might remember from your perception lecture for self-driving cars, where you want to train a neural network that takes in, for example, an image and outputs where the human is. It's not so easy to define that mapping directly, right? What we do instead is place a function approximator in the middle, namely the neural network, which is parameterized by a set of parameters theta, and this set of parameters theta defines the function that maps from the input, meaning the images, to the output: where your obstacles or object detections, for example pedestrians, are. We can do the same here: we can handily define the policy function by parameterizing it. So instead of building this clunky lookup table as we did before, we now use any form of parameterizable function that maps from a given state s to a desired output, the action a.

We write it in such a general form because for our algorithms it really doesn't matter too much what this network or function looks like. It can be a linear function consisting of just two parameters, slope and intercept; it can be an arbitrarily complex neural network. And this gives you the first advantage of such function approximators: a neural network or a linear regression can process continuous state spaces, continuous inputs, and nowadays also quite arbitrarily complex inputs. Your state doesn't need to be an entry in a lookup table over a discrete set of states; your state could be the continuous pose of the drone, your state could be a graph, your state could be an image. It really doesn't matter too much; you just need the right function approximator. Nowadays we speak of deep reinforcement learning whenever we use an expressive neural network for these function approximators, which then gives us the policy.

The second advantage is that we can now directly optimize this function: the function is completely given by its parameters, for example the parameters of a neural network, and we want to find the parameters that maximize the expected sum of rewards, which is exactly this. So we went from a rather clunky optimization problem to an optimization problem that you actually know, again from perception. If you think about it, your object detection task is finding a parameter theta such that the neural network, for a given image, minimizes the detection error; there you want to minimize a loss function, and here we instead want to maximize the expected sum of rewards. It's pretty similar in nature. And as I said, this can handle continuous state spaces and continuous action spaces; in the easiest case, think about linear regression: you have a continuous variable as input and a continuous output, so the action could now also be continuous. And, as we said, it allows us to directly optimize the parameters, meaning it allows us to directly optimize our policy: by tweaking the parameters of your function, you tweak the behavior of the agent.
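As a minimal, made-up illustration of such a parameterized policy: a softmax over a linear function of a continuous state. The parameters theta fully define the mapping from state to action probabilities, so tweaking theta tweaks the behavior (the weights and state values below are arbitrary):

```python
import math

# A hypothetical stochastic policy pi_theta(a|s): continuous 2-D state in,
# probabilities over two discrete actions out. theta[a] is one weight row
# per action; these parameters completely define the policy.
def policy(theta, state):
    scores = [sum(w * x for w, x in zip(theta[a], state)) for a in range(len(theta))]
    z = [math.exp(s - max(scores)) for s in scores]  # numerically stable softmax
    return [v / sum(z) for v in z]

theta = [[1.0, -0.5],   # weights for action 0
         [-1.0, 0.5]]   # weights for action 1
state = [0.3, 2.0]      # a continuous state: no lookup table needed

probs = policy(theta, state)
print(probs)  # probabilities sum to 1; changing theta changes the behavior
```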
And how do we tweak this? From a neural network lecture you may have heard before, you know that we like to optimize such parameters with gradient descent. So what we ask for is: find me the gradient of this function on the right; then we use gradient descent, or rather gradient ascent, because we want to maximize a function instead of minimizing a loss, and then we can run any gradient-based optimization algorithm to find the parameter theta that maximizes my function. It could be a Newton method; it could be stochastic gradient descent in the simple case.

But there's one problem: to do gradient ascent, or any gradient-based optimization, you need the gradient of the function you want to maximize with respect to these parameters. And the function we want to maximize is the expectation, over episodes, of the sum of rewards. What does it mean to compute the gradient of this expected sum of rewards? We derive this; it's called the policy gradient theorem. It's a bit involved, so stay with me. We won't go through every detail, but it's important to see some steps, because they reveal further advantages of direct policy optimization.

The first thing we realize, after applying something called the score function estimator trick (maybe some of you know it from statistical learning classes on maximum likelihood estimation; if you want the details, I link below a nice lecture from Cambridge that explains it), is that this is nothing specific to reinforcement learning; it comes up all the time in machine learning when you want to maximize the expectation of a function. The key idea to take away, without the full mathematical derivation, is that for the gradient of an expectation we can swap the expectation and the gradient sign. Why? Because the expectation is an integral over a probability times, in this case, our return R; that's how an expectation is defined, and by the Leibniz integral rule you can exchange the gradient and the integral. By pulling the gradient inside the integral it ends up inside the expectation, and after a bit more manipulation what we get inside is the gradient of the log-probability of an episode.

Why this is important becomes clear in the next step. This thing evaluates to this equation, and to see why, we need to understand what the logarithm of the probability of an episode actually is. You saw in the beginning, if you remember, that an episode is just the sequence of state, action, reward, next state, and so on, and its probability is just the product of the individual parts: the initial state distribution, then how likely it is to choose an action in the current state s0, how likely it is to end up in the next state s1 of our episode, and so on. The logarithm applied to this product is the sum of the logarithms, right? So the log-probability of the episode is the sum of the log-probabilities of these individual parts. And of this whole sum we want to take the gradient; the gradient of a sum of terms is the sum of the gradients of the individual terms, because both are linear operators.

So we did two small mathematical tricks: the logarithm of a product is the sum of the individual log-probabilities, and the gradient of a sum is the sum of the gradients. What you can then ask yourself is: for each of the individual terms in this product, what is the gradient of its log-probability? It turns out you need to ask which of these terms actually depends on my parameter theta, because when a term doesn't depend on theta, its gradient is zero and it drops out of the sum. Maybe you see it already: which of these terms depends on theta? Think about what we use theta for. Here it's quite explicit: theta parameterizes our policy, so only the policy itself depends on theta, right? Just being in some state doesn't depend on theta, and my dynamics model T doesn't depend on theta either, since the action it conditions on is given. This means I can drop the initial state distribution mu, something we anyway did not clearly define, and I can drop my dynamics model, the state transition function. The only thing I end up with is the sum of gradients of the log-probabilities of my policy.
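Collecting the steps just described (swap gradient and expectation via the score function trick, expand the log-probability of an episode, drop the terms that do not depend on theta, and use that rewards before time t do not depend on the action at time t), the resulting policy gradient can be written as follows; this is the standard statement, condensed here rather than derived in full:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \sum_{t=0}^{T}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \underbrace{\sum_{t'=t}^{T} r_{t'}}_{\approx\, Q^{\pi_\theta}(s_t,\, a_t)} \right]
```

Only the policy term survives inside the gradient; the initial state distribution and the dynamics model have dropped out.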
And this is nice, because now we get the effect we wanted before: we can derive a policy-gradient-based algorithm that does not require knowing in detail how the environment behaves. Instead, we can just execute actions without knowing the environment model. Algorithms that do not require knowledge of the state transition function, the dynamics model, are what we call model-free RL algorithms; whenever an algorithm relies on knowledge of exactly this term T from before, it's a model-based algorithm. So the nice thing is that with the derivation of this policy gradient we end up with a model-free algorithm; we don't need full knowledge of our environment.

The term here on the right is just the sum of rewards that I get for an individual episode tau from the current time step t on: I am at some time step t while executing my episode tau, and the remainder of the rewards I get from there on is exactly this term. It turns out this is exactly our Q function for the policy we followed, because when we execute an episode we experience a certain action and from then on follow our policy again; this is just the definition you saw before.

And that's it: from here we can derive a policy optimization algorithm, a policy-gradient-based algorithm, that directly optimizes our policy. We start with any behavior, any policy, which means we start with a randomly initialized parameter vector theta. We somehow need to accumulate experience, episodes tau, to actually compute this quantity over here. So what we do is basically what you saw in the first video: remember the quadruped robot, where they zoomed out in their simulation and you saw many robots doing their thing on their individual obstacles? That is exactly this step. You start with some guess of how the policy might work, you collect a lot of data with it, and you get these episodes of state, action, next state, reward, and so on.

Based on this experience, gathered in simulation for example, you can now compute exactly the policy gradient we saw before. You don't explicitly evaluate the expectation; instead, because you sampled a lot of episodes tau, you can simply compute the sum of experienced rewards for each episode. And the gradient of the log-probability of your policy pi is given by whatever differentiation tool you use: for example, if your policy is parameterized by a neural network, you can define it in PyTorch, and the automatic differentiation there, the backpropagation algorithm, derives these gradients for you. So this we know how to compute, say by simulating a lot of quadruped climbing exercises, and then we average over all the episodes we experienced; we call this a Monte Carlo estimate of the policy gradient.

Once we have the gradient, we can take one step in its direction, because the gradient points into the region of policy space where policies most likely achieve a higher expected sum of rewards; that is exactly how the gradient was defined. So I do, in this case, very simple gradient ascent and end up with a slightly better policy. Then I need to repeat: I only take a small step, since a gradient is always just a local approximation, right, you cannot jump directly to the globally optimal policy. So we repeat the whole procedure: we again start our massively parallel simulation, for example in the quadruped case, collect a lot of experience, again compute the Monte Carlo gradient estimate from that experience, and iterate until we converge to a decently performing policy.
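A bare-bones sketch of this loop on a made-up two-armed bandit (one state, two actions), just to show the mechanics: sample actions with the current policy, weight each log-probability gradient by the experienced reward, and take a gradient-ascent step. All names and numbers are illustrative, not from the lecture:

```python
import math, random

random.seed(0)

# Softmax policy over two actions, parameterized by theta (one score per action).
def probs(theta):
    z = [math.exp(t - max(theta)) for t in theta]
    return [v / sum(z) for v in z]

def pull(action):
    # Hypothetical stochastic environment: action 1 pays more often on average.
    return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0

theta = [0.0, 0.0]   # random-ish initial guess: a uniform policy
ALPHA = 0.1

for _ in range(2000):                        # one "episode" = one pull here
    p = probs(theta)
    a = 0 if random.random() < p[0] else 1   # sample action from pi_theta
    r = pull(a)                              # experienced reward (the return)
    # Gradient of log softmax: d/d theta_k of log p[a] = 1{k==a} - p[k].
    for k in range(2):
        theta[k] += ALPHA * r * ((1.0 if k == a else 0.0) - p[k])

print(probs(theta))  # the better arm (action 1) should end up preferred
```

Even on this trivial problem the estimate is noisy, which is exactly the variance issue discussed next.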
But there's one problem: if you think about it, this quantity here, the sum of rewards from time step t on, has quite high variance. Let me make one illustrative example with our navigation task again. I am the robot, wandering around the room until I end up facing my laptop; I execute many actions until I finally find it, and I get a plus one as a reward. Now I can experience almost the same sequence of actions, identical up to one step before the end, but this time, because of stochasticity in the environment or in the policy, I don't turn towards the laptop, I turn away from it. So I get a completely different reward, although I executed almost the same action sequence and experienced a very similar sequence of states. This is why you sometimes observe that this classical algorithm for policy-gradient-based optimization, called REINFORCE, is not very sample efficient, as we say: it needs a lot of gathered experience, in simulation or in the real world, to converge to a stable estimate of the policy gradient.

We can do better with one last class of algorithms that I'll show you today, called actor-critic algorithms, whose goal is to reduce the variance of this estimate. But before I go into the details, I want to quickly ask whether there are questions about how policy-gradient-based optimization works in general, or specifically about the steps of these algorithms. If anything is unclear, feel free to ask now.

[Student] Generally, with gradient-based optimization we can end up in a local maximum and then cannot get away from it. How does this method handle that?

Yeah, that's a very good question. We basically execute first-order optimization methods here, right? We compute the gradient, which is only a local approximation of where to move to get a better-performing policy, and this algorithm by default is not guaranteed to end up in a global maximum. We cannot guarantee that; we have the same restrictions you might know from optimization lectures, and we end up in local optima. There are certain tricks to get around this to some extent, but as far as I know there is no mathematical proof that you reach the global optimum. Of course, if you can guarantee that the function you want to maximize is concave (or convex, in the minimization view), then you can use your knowledge from optimization theory and say: even with my gradient estimate, I will converge to the global optimum. But in general, your objective depends on the policy executing very complex maneuvers; in drone racing, for example, it's very hard to believe that the function you want to optimize is concave or convex or linear, anything that would give you convergence to a global optimum. Even worse, the main motivation in the beginning was to parameterize our policy with these parameters theta, and if I choose, say, a neural network with nonlinear activation functions, my policy itself is nonlinear. So there are quite a few places where this assumption of global optimality breaks. But we can at least guarantee that the REINFORCE algorithm, given enough experience, converges to a local optimum, and that is usually good enough to solve our task.

Do you have any other questions before we move on to the final round, our third class of methods to solve the RL
problem? Okay, and if questions come up in between, to reassure yourself that you understood everything so far, feel free to ask.

So, we talked about needing to gather quite a lot of experience, because the experience you gather even for very similar action sequences can end in different rewards, right? This is what I said at the beginning of the lecture: RL is hard to some extent because you have delayed feedback, and the actions you choose influence what you will see next. That is one reason why you have high variance in your policy gradient estimates: you vary in what you gather as experience.

So there is the class of actor-critic algorithms, which aims to do the same thing as the REINFORCE algorithm you saw before, but modifies it to reduce the variance of the policy gradient estimates. Remember our derivation of the policy gradient: using the gathered experience amounts to plugging in the Q function here. If we only use raw experience, we only learn about the experienced states and actions. But maybe we can do better: if I know that the state I am in has features similar to another state I haven't observed yet, and I have a function that can indicate that these features, seen in another state and action, led to a high Q value, then I want to transfer this knowledge. So one problem of the REINFORCE algorithm is that it only learns from the actual experienced data; perhaps instead you can learn features from that data that explain which states are good and which are bad. This is again similar to the perception idea: you show a neural network a finite set of images with annotations, and it learns which features to focus on to find the pedestrian or the car. Similarly, here you might hope for a function approximator for this Q function that learns which features of the input drive the Q value to be high or low.

So what we do, again, is plug in a function approximator for this Q value. And if you think about it, this Q value evaluates how good it is to take a certain action in the next step and then follow my policy, parameterized by theta, from the next step onwards; so it evaluates how well my current policy pi_theta is doing. This is what we call a critic, and this is why these are called actor-critic algorithms. The critic does not actively decide what to do next; it evaluates how good my policy, parameterized by theta, is. The actor part comes from the policy itself: the policy pi_theta determines which behavior is induced, since only the policy can actually execute an action and drive the behavior of the agent. That's why we also call the policy function approximator pi_theta the actor. So we have a policy function approximator as before, and we do policy-gradient-based updates, but we don't judge them only by the experienced rewards; we also use what the critic has learned about good and bad states. We do the trick: this is your policy gradient from before, and we just add a small but important piece of notation saying: this part is based on the experience I gathered, and this part is based on what I learned from that experience via my function approximator indicating good and bad states. You can think of it, similarly to our pi_theta (the theta is actually missing here... ah, it's over here, sorry, a change of notation), again as a neural network, or a linear regression, or anything else that gives you the critic function.

And there is a second trick to reduce variance. It is a bit mathematically involved, but intuitively: this quantity can be quite large in magnitude. You gather experience, you get a reward in one time step, in the next, and so on, so the magnitude of this Q value can be large, and a smaller magnitude in your critic estimate usually also means lower variance. So we want to find so-called baseline functions. Baseline functions exist solely for one purpose, reducing the variance, and, importantly (that's why I wrote "action-independent" here), a baseline function depends only on a state and maps to some real number. How they reduce variance is exactly this: instead of using the Q function, we use the Q function minus the baseline function. We remove some part of the magnitude of the Q function, but only the part that does not depend on the action, because a part that depended on the action would actually change our policy gradient.

Now, you saw a function earlier that also evaluates the quality of a policy, like the Q function does, but that does not depend on the action, only on the state. Can you remember what we called it in the first half of the lecture? We started with that function.

[Student] The state value function?

Right, the state value V. Perfect, exactly. By definition the state value function depends only on the state, so it is a good candidate for a baseline function that reduces variance. And, without going into the details, one can show that out of all possible baseline functions, the state value function is the one that reduces the variance of my policy gradient estimates the most.
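A quick, self-contained illustration of the baseline idea (a made-up setup, not from the lecture): subtracting an action-independent constant from the reward leaves the score-function estimator's mean unchanged but can shrink its variance dramatically, especially when rewards carry a large common offset:

```python
import math, random, statistics

random.seed(1)

# One-state softmax policy over two actions; rewards are noisy with a big offset.
theta = [0.2, -0.2]
z = [math.exp(t) for t in theta]
p = [v / sum(z) for v in z]

def grad_log_pi(a):  # d/d theta_0 of log pi(a): 1{a==0} - p[0]
    return (1.0 if a == 0 else 0.0) - p[0]

def reward(a):       # large common offset: exactly where a baseline helps
    return 10.0 + (0.5 if a == 1 else 0.0) + random.gauss(0.0, 0.1)

baseline = 10.25     # any action-independent constant is a valid baseline

plain, with_b = [], []
for _ in range(20000):
    a = 0 if random.random() < p[0] else 1
    r = reward(a)
    plain.append(r * grad_log_pi(a))                 # vanilla estimator
    with_b.append((r - baseline) * grad_log_pi(a))   # baseline-corrected

# Nearly the same mean (the true gradient), far smaller variance:
print(statistics.mean(plain), statistics.mean(with_b))
print(statistics.variance(plain), statistics.variance(with_b))
```

Here the baseline is a hand-picked constant; the lecture's point is that the state value function is the best such action-independent choice.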
And we use exactly that: for the generic baseline function we now plug in our state value function, which we saw at the beginning of the lecture. So my critic changed to the Q value of a certain state and action minus the state value of being in that state. We'll talk about the meaning of this in a second, but first let's go back to the definition of our policy gradient. It's what we plugged in before, right: first it was a Q value, now it's the Q value minus a baseline function. We can expand this to get rid of the Q value: the Q value of a certain state when executing a particular action is exactly the reward I would receive plus the state value of acting according to my policy from the next state on. This is how the Q value was defined at the beginning: we asked what the immediate reward would be for executing an action a that does not necessarily follow my policy, and what, from the next state onwards, following my policy again, the expected sum of rewards would be. This way we get rid of two functions and work with just one, the state value function. And we can do one more trick: we parameterize the state value function inside the critic by parameters psi; again it can be any neural network or linear regression that aims to learn which features of the state drive a high or low state value.

Let's talk a bit about what this formula actually means, because that's important. In the literature you will find it as the so-called advantage function. The advantage function, for a given policy pi, always depends on the current state you are in and an action you want to choose, and it is the difference between the Q value and the state value that you saw before. We do one trick and approximate the advantage function with our critic, parameterized by psi. And then we are basically asking: how good is it to execute a particular action a_t, receive an immediate reward, and then follow our policy, versus directly following our policy in the current state? Note that this a_t does not necessarily stem from the policy we follow, while the state value V does. So we are asking: would choosing a_t in state s_t improve what my policy is currently doing, or would it decrease the quality of my policy? Going back, the advantage therefore scales the policy gradient here by the improvement, or drop, in quality when choosing such an action. So instead of evaluating the Q value directly, we now evaluate the advantage of the experienced action a_t in the current state s_t.
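With made-up numbers, the one-step advantage estimate just described simply asks: did the experienced action leave us better off than the critic expected? A positive value means the action beat the current policy's expectation (all critic values below are hypothetical):

```python
GAMMA = 0.99

def advantage(r, v_s, v_next, gamma=GAMMA):
    """One-step advantage estimate: A(s, a) ~ r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

# Hypothetical critic outputs for one experienced transition (s, a, r, s'):
V_s, V_next, r = 2.0, 2.5, 0.1
adv_example = advantage(r, V_s, V_next)
print(adv_example)  # 0.575 > 0: action a looks better than the policy's average
```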
deriving everything that you need to know now about the RL algorithm that was used to train the uh robots that you saw in the beginning the Drone and also the
quadrupled robot and we call it here vanilla ector critic algorithm because it's now basically plugging in into the reinforce algorithm theide idea of having these
parameterized critic functions. We start, as in REINFORCE, with a random initial guess of our policy parameterized by theta, and we parameterize the critic by psi. Then, until convergence, we again run an iterative scheme like in the REINFORCE algorithm. First, we execute the massively parallel simulation that you saw in the beginning to gather data: state, action, reward, next state, and so on. How do we do that? We follow our current stochastic policy parameterized by theta, so we sample an action from this policy, and we see how good executing this action actually was by accumulating rewards over time, similar to, or actually the same as, the REINFORCE algorithm. Then we again, as in the REINFORCE algorithm, compute the policy gradient, but this time we don't directly use the experience that we gathered in this step. Instead, we use our critic function to evaluate the advantage of choosing the experienced action a_t in the current state s_t, and this is computed by evaluating the state value function: the immediate reward that we experienced, plus the state value estimate given by our critic in the next state, minus the state value estimate of our critic in the current state. And now there's one thing missing.
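As a small self-contained sketch of the advantage estimate just described (the toy numbers and the lookup-table critic `value` are illustrative assumptions, not from the lecture):

```python
# One-step advantage estimate used by the critic:
#   A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t)
# `value` stands in for the parameterized critic V_psi; here it is just
# a toy lookup table so the snippet is self-contained.

gamma = 0.99  # discount factor (assumed)

value = {"s0": 1.0, "s1": 2.0}  # toy critic: state -> state-value estimate

def advantage(reward, state, next_state):
    """Critic-based advantage estimate for one experienced transition."""
    return reward + gamma * value[next_state] - value[state]

# Transition: in s0 we took some action, got reward 0.5, landed in s1.
adv = advantage(0.5, "s0", "s1")  # 0.5 + 0.99*2.0 - 1.0 = 1.48
```

A positive value means the experienced action looks better than what the current policy achieves on average from that state; a negative value means it looks worse.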
You probably realize that here we plug in the state value baseline function, we plug in the function approximator idea that you saw before, and we plug in the advantage function for our critic. So we use all three tricks together to derive this policy gradient update, and that's technically the only difference to the REINFORCE algorithm, because now we can use this estimate again to do classical gradient ascent. We have similar problems as you asked about before: we can only converge to a locally optimal policy. Then we need to refit our critic, because it was also an initial guess, and the critic should learn how good my policy is doing in a certain state. If we start with a random guess in the beginning, that is most likely off, so I also need some correction, or learning, in my critic part here. And again we want to do this model-free, because policy-gradient-based algorithms have the nice property that they don't need to know the dynamics model, and we want to make sure that learning a critic doesn't need a dynamics model either. If you remember, to evaluate the state value function in the beginning we needed the dynamics model, but we can do it model-free by
doing the following. In the second step we gathered experience, so for each episode that we executed we can compute the sum of rewards, and averaging over all time steps and all experienced episodes basically gives us a value estimate: being in a certain state s_t, what rewards did I experience afterwards when I executed my policy? Because I do nothing other than follow my policy here, this is exactly the definition of a state value function. So by using all the time steps in an episode and all the simulated episodes, and averaging over them, I again get a Monte Carlo estimate, and I look for the parameters that minimize the mean squared error. Maybe some of you know this from SLAM lectures: it's essentially a least-squares minimization again, and you can use any tool, any optimization algorithm, to do this minimization. Usually we do the same as for the policy gradient, but this time gradient descent, not ascent, because we want to minimize the error. And that's it: this is your vanilla actor-critic algorithm, and this is the algorithm that is also executed, in an adapted form, to train the racing drone and the quadruped robot. If you understood the lecture up to here, this basically allows you to go to these
papers and to understand at least the actor-critic state-of-the-art reinforcement learning algorithms, because mostly what they do is add a bag of further tricks that reduce the variance in the policy gradient estimation even more. We don't cover this today, it is a bit technical, but in general what most of these algorithms try to improve is how to deal with this high variance. And because of these tricks, adding a baseline function, using the advantage function with a critic, using neural networks as function approximators, we can actually learn quite complex behaviors, given sufficient compute, like outperforming the world champion in drone racing. This is not doable with the plain REINFORCE algorithm, because we would need to gather far too much experience, and you can't easily deal with these complex state and action spaces; for that you usually need to plug in neural networks. This is basically what people call deep reinforcement learning: whenever you hear that people applied deep reinforcement learning methods, they actually used some variant of the methods that we saw today, with neural networks as the function approximators. It's essentially the same bag of tricks and techniques. Do you have questions regarding how actor-critic algorithms work? Again, it's similar to the REINFORCE algorithm, and the difference comes in where and how we compute the policy gradient estimate. If you don't have any now, we can of course also talk about it later, and you can also drop us an email. We share the slides; I know it's a lot of content, so make sure you revisit the slides and try to get an intuitive understanding of this. This is everything from an algorithmic point of view that I wanted to show you today. Let's recap a bit what you actually saw, because you saw quite a lot, though not in all the details; for those, please also go to the literature that I showed you in the beginning, which gives really nice explanations of the details.
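The vanilla actor-critic loop described above (gather rollouts with the stochastic policy, update the actor with critic-based advantages, refit the critic toward Monte Carlo value estimates) can be sketched on a toy problem. Everything here, the two-state MDP, the hyperparameters, the tabular critic, is an illustrative assumption and not the code used for the drone or the quadruped:

```python
import numpy as np

# Toy vanilla actor-critic sketch: a 2-state, 2-action MDP where action a
# moves you to state a, and only (state 0, action 1) yields reward 1.
# The optimal policy therefore alternates: action 1 in state 0, action 0 in state 1.

rng = np.random.default_rng(0)
gamma, lr_actor, lr_critic = 0.9, 0.1, 0.5
theta = np.zeros((2, 2))   # actor parameters: one logit per (state, action)
V = np.zeros(2)            # tabular critic: state-value estimates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(T=10):
    """Run one episode following the current stochastic policy."""
    s, traj = 0, []
    for _ in range(T):
        a = rng.choice(2, p=softmax(theta[s]))
        r = 1.0 if (s == 0 and a == 1) else 0.0
        s_next = a                          # deterministic dynamics: next state = action
        traj.append((s, a, r, s_next))
        s = s_next
    return traj

for _ in range(300):                        # iterate: gather data, update actor, refit critic
    episodes = [rollout() for _ in range(10)]
    returns = {0: [], 1: []}                # Monte Carlo returns per visited state
    for traj in episodes:
        G = 0.0
        for s, a, r, s_next in reversed(traj):
            G = r + gamma * G
            returns[s].append(G)
        for s, a, r, s_next in traj:
            adv = r + gamma * V[s_next] - V[s]      # critic-based advantage estimate
            grad_log = -softmax(theta[s])           # gradient of log pi_theta(a|s)
            grad_log[a] += 1.0
            theta[s] += lr_actor * adv * grad_log   # policy-gradient ascent step
    for s in (0, 1):                        # refit critic toward the Monte Carlo estimate
        if returns[s]:
            V[s] += lr_critic * (np.mean(returns[s]) - V[s])

pi0, pi1 = softmax(theta[0]), softmax(theta[1])
# After training, state 0 should prefer action 1 and state 1 should prefer action 0.
```

The tabular critic plays the role that a neural network value function plays in the real systems; the loop structure (simulate, compute advantages, gradient ascent on the actor, least-squares refit of the critic) is the same.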
So the first thing that we saw: RL is a powerful tool. What do I mean by that? You can solve a lot of different problems: you can control a nuclear power plant, you can do drone racing, you can solve almost arbitrary sequential decision-making problems. They need to follow the formalism of the Markov decision process: everything you saw today, you need to be able to formalize as a Markov decision process, and then you can use the algorithms that we saw today. That's why I think it's a powerful tool. Another reason is what we discussed in the beginning: you don't need explicit supervision anymore. You don't necessarily need a human annotator who supervises every action you took. This is one advantage people mention when discussing, for example, whether it makes sense for self-driving cars to learn from a human driver. Learning from a human driver means we need expert data on how a human drives, and then for each action that you as the agent execute, you can compare it to what a human would do. But this is always limited by the data you have available. In reinforcement learning, you only need to define the reward function that incentivizes your behavior, which is complex in itself, and then you can use any of these algorithms to figure out the
policy. We also saw today the equivalence of going from a policy to a state value function and back. If you remember, that came in quite handy when deriving the indirect, value-based RL algorithms, of which we saw one example, value iteration. This equivalence was important because it means that instead of finding the optimal policy directly, we can find an optimal state value function and then, because of this equivalence, derive the optimal policy from it. However, we also saw some limitations. For example, value iteration in the form we saw is a model-based algorithm, so we need knowledge of the state transition function, which is quite hard to define for real-world problems. So what we usually aim for is a model-free algorithm, although that also has disadvantages; but if you can't fully define your dynamics model and can't learn it from data, then model-free algorithms are usually the way to go. The other problem we saw is that the general form of value iteration, and also Q-learning and SARSA, which we talked about briefly, are in their basic implementations not capable of performing updates for continuous action spaces, and also not for continuous state spaces. We did see that the idea of indirectly optimizing an optimal state value function partially contributes to these problems. So instead we asked: can we also directly optimize a policy? It turns out yes, we can, by framing it as a classical optimization problem. So RL can be understood as an optimization problem, which is good because then we can plug in function approximators, for example a neural network, which deals with continuous states and continuous actions out of the box, without changing anything. And we saw the policy gradient theorem, which essentially tells us how to compute the gradient of the expected sum of rewards for a given policy. If we have this gradient, we can derive our policy-gradient-based algorithms, such as REINFORCE, to then iteratively optimize, based on this gradient estimate, toward the optimal
policy. But we also saw that there is a problem with policy-gradient-based algorithms such as REINFORCE: we have a really high variance in the policy gradient estimates, and usually this means you need to gather a lot of experience, for example in the massively parallel simulation that we saw in the beginning.
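Written out in the notation used earlier in the lecture, this recap corresponds to the policy gradient theorem and its actor-critic variant (here G_t denotes the experienced return along a sampled trajectory):

```latex
% Policy gradient theorem, REINFORCE form (G_t: experienced return, high variance):
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]

% Actor-critic form: replace G_t by the critic's advantage estimate (lower variance):
\nabla_\theta J(\theta)
  \approx \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_\psi(s_t, a_t)\Big],
\qquad
A_\psi(s_t, a_t) = r_t + \gamma\, V_\psi(s_{t+1}) - V_\psi(s_t)
```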
To circumvent that, to some extent, we introduced actor-critic algorithms. I think I didn't tell you this before, but if you think about it, actor-critic algorithms are actually a hybrid of what you saw in the first half of the lecture and the second half. In the second half we talked about how to directly optimize a policy with REINFORCE; we still do that, but we also use knowledge from value functions, like the value function for a given policy, which was essential for the indirect, value-based algorithms, and we merge them into one algorithm. By merging these ideas from the first half with ideas from the second half, we end up at these hybrid actor-critic algorithms, which have the nice property that most of the state-of-the-art robotics control behavior you see today is trained with one or another variant of them. That's it for today. I think it was a lot. If you have any questions, you can ask them now; if you want me to go back to something that we
should discuss in more detail, feel free to ask. [Question: how far do these models go in applications, especially control applications, without knowledge of the system?] That's a really good question. In the setup that I showed you today, what we always said is: I interact with the environment, I gather some feedback, the policy I execute in the beginning is probably suboptimal, and based on this gathered feedback I adapt my policy behavior with one or another algorithm. So I act online in an environment, and how I act there is not necessarily optimal right away; otherwise I wouldn't have an RL problem in the first place. This means that if you deploy your robot directly in the real world, you usually need safety precautions, because the policy you execute might not lead to the intended behavior in the beginning. So directly deploying a robot in the real world is usually hard. What we do instead is fall back to the simulations that I showed in the beginning, because there it's not so harmful if you execute suboptimal or safety-critical behavior. We go into that simulation, which hopefully resembles what we see in the real world, and we learn there. It takes long: to give a sense of the time scale, I don't recall the exact numbers, but to get such complex behavior as drone racing or parkour you need many, many environment interactions, meaning: in a certain state I execute an action, I try it, and I see what the reward is. It's hundreds of millions of these interactions for state-of-the-art algorithms, so executing this in the real world, where your robot is quite slow and sometimes breaks down, is basically impossible. So what people do is fall back to these simulations, use the compute they have, parallelize it, and over time it converges to a good policy in the simulated environment. Then they deploy the policy network pi_theta that you saw online on their robot: you observe the state with your sensors, as you did in simulation, and then you just run inference on your neural network and execute the action that the network proposes as the next best action, as trained in simulation. Because neural network inference is usually fast, unless you train a giant network, this is how they achieve roughly real-time behavior; of course it depends on your application how real-time it needs to
be. [Question: why do we generally end up optimizing the state value function rather than the action value function? Even when we started with the actor-critic, we used the action value as the critic, but then we ended up bringing in the baseline function.] For the actor-critic itself, the reason is exactly what you were saying: in the REINFORCE algorithm, the experience that I gathered is an estimate for the Q value, the left-hand side here. Why do I end up with this form in the actor-critic case? For exactly one reason: you want to minimize the variance. People realized: I could use the Q function and learn it too, but I would probably need another neural network for it. I don't need to, because when I expand the definition of a Q value, I get the immediate reward from the action that I took plus the discounted value of the next state; this identity is exactly the Q function as before. Then you subtract your baseline to reduce the variance, and it comes in handy that you don't need two neural networks but just one: you plug in the next state once and the current state once. So you also need to train
fewer neural networks. For the first half of the lecture, again, you're right: I did say it's usually easier to derive the policy from a Q function, because we saw that policy evaluation with a state value function needs knowledge of the dynamics model. We do have algorithms for that, namely SARSA and Q-learning, which I didn't cover in detail today, but they realize exactly what you said: for these indirect, value-based algorithms the state value function does not come in so handy (here it falls directly out of how the advantage function works, but there it doesn't). So people developed, for example, the SARSA and Q-learning algorithms, which directly optimize the Q function. Then, if you have your optimal Q function, for a given state you can just loop over all actions and pick the action with the highest Q value; this is how you would derive your policy there. So that's a very good point: for the first half we showed methods that work with state value functions, but indirect, value-based algorithms nowadays usually use the Q function because you can directly read off the policy, and for the second half it's exactly this derivation where the state value comes in handy. Do you have more questions? All right, then thank you for staying with me.