The FASTEST introduction to Reinforcement Learning on the internet
By Gonkee
Summary
## Key takeaways - **RL Mimics Trial-and-Error Learning**: Reinforcement learning solves tasks like tying shoes or walking by having an agent learn through trial and error, using positive and negative rewards to reinforce desired behaviors, much like humans and animals learn in the physical world. [00:36], [00:59] - **Agent-Environment Boundary Arbitrary**: The boundary between agent and environment is arbitrary and depends on what you can directly control; for example, when learning to drive, it shifts from brain controlling limbs, to body controlling car, to car navigating road. [04:35], [08:50] - **Markov Property Enables Simple Decisions**: The Markov property ensures each state depends only on the immediate previous state, allowing agents to decide actions based solely on the current state without needing historical context, though real situations may violate this for complex dependencies. [11:32], [12:46] - **Q-Learning Beats Monte Carlo Efficiency**: Q-learning improves on Monte Carlo by using temporal difference updates for action values relative to the best next action, making it more sample-efficient, especially in larger mazes where Monte Carlo fails to reach the target. [36:22], [50:21] - **Dopamine Signals TD Error in Brain**: Dopamine neurons spike not on actual rewards but on predictive cues like lights signaling juice, representing temporal difference error—the surprise between expected and actual future reward—propagating reward knowledge backward through time. [01:12:06], [01:16:01] - **RL Lacks World Models for Reality**: Model-free RL learns without understanding environment physics, requiring millions of samples like 18 million Atari frames for human-level play, unlike humans who use intuitive world models for efficient learning of everyday tasks. [01:20:24], [01:23:10]
Topics Covered
- Computers Excel at Precision, Fail at Everyday Tasks
- Agent-Environment Boundary Shapes Control Realms
- Markov Property Demands Sufficient State Information
- Temporal Difference Solves Credit Assignment Problem
- RL Dopamine Mirrors Brain's Reward Prediction Error
Full Transcript
how would I teach a robot to tie a shoe computers are good at doing exactly what you tell them to do which is both a
blessing and a curse if I want to add two very large numbers that's perfect the computer can do that but if I want a computer or robot to do everyday tasks
like tie a shoe fold a t-shirt or even just walk it's impossible to specify exactly how to move its joints and interact with the
environment and exactly what the forces masses and friction of the physical world are reinforcement learning is a field of machine learning that tries to solve precisely that it's quite
appealing to people because of how it resembles the way that humans and animals learn in the physical world through trial and error and using positive and negative reward to
reinforce the behaviors that you want and punish the behaviors that you don't want in recent years some of the achievements of reinforcement learning include learning to play Atari games in
2013 defeating top human go players in 2016 with AlphaGo defeating professional Dota 2 players in 2018 with OpenAI Five solving a Rubik's Cube with a
robot hand by OpenAI in 2019 certain applications in robot dogs self-driving fine-tuning language models etc and all
those AI learns to YouTube videos of which I have made one as well this video is over an hour long very dense in
information and not very entertaining its purpose is to help those who are interested learn as much of reinforcement learning as possible as
easily as possible and as fast as possible this is because I think that this technology and field of study will be very important in the next few decades when artificial intelligence
gradually brings its capabilities to the physical world and I want to make the knowledge as accessible as possible to all people we will go over much of the foundational Theory and math while
demonstrating it in practice with a series of example tasks that gradually increase in complexity that will be the bulk of the video and then following that we will cover the interesting
correlations between reinforcement learning and Neuroscience and how the brain works and then we will talk about the aspects in which reinforcement learning is still
lacking and some of the directions of future research that I think will be of great help this video is not meant to be watched just once so feel free to
re-watch it as many times as you need to extract its value the prerequisite knowledge of this video includes maths up until basic calculus and a basic
understanding of neural networks in order to research for this video some of the main sources of information I used include the book reinforcement learning
an introduction by Sutton and Barto which is widely regarded as the holy bible of reinforcement learning the OpenAI Spinning Up website which has compiled a lot of information about the
most famous and influential algorithms large language models such as ChatGPT which are really good at searching for compiling and summarizing widely known
knowledge and the reinforcement learning subreddit where you can read through other people's questions and answers on the topic as well as these resources I'll leave a link to some of the interesting
articles that I mention in this video in the description now before we begin have a look at my hair I've chosen to film the introduction at the very end and throughout the different segments of the
video you'll notice that my hair has grown quite a bit indeed this video was approximately half a year in the making with even just the filming spanning several months and no I didn't avoid
haircuts just to prove a point I was just lazy but if you watch through the video and find Value in it consider supporting the channel on patreon link in description I've put some of the
behind the scenes demo videos on there as of now but those are in this video anyways so consider supporting if you're interested in topics related to Ai and Robotics and you'd like to see behind
the scenes of future videos with that being said let's get started [Music] in reinforcement learning which I'll
refer to in the rest of the video as RL there is an agent and an environment agent is the car environment is the road agent is the player environment is the
map agent is the hand environment is a cube and agent is the strategy environment is the chess match the definition of agents and environments
which generalize to all these situations is that we can say the agent is whatever you can directly control and the environment is whatever you cannot
directly control but you're trying to indirectly interact with through the agent now when the agent interacts with the environment obviously there is two
directions of influence influence of the agent on the environment is called an action and as a human you can intuitively think of the movements in
your limbs and your muscles influence of the environments on the agents is called State and as a human it's your sensory input you're observing the state of the
environment as a human the sensory input and motor output Loop is kind of like a form of this agent environment interaction where the brain is the agent
responsible for integrating data and decision- making in RL we're trying to create such an agent such a brain using a computer so of course all these states
and actions have to be represented as numbers the state could be position and velocity colors of pixels temperature battery level Etc and the action could
include numbers representing how fast to spin a motor which direction to go which ability to activate in a video game Etc now remember where exactly you draw the
boundary between agent and environment is actually more nuanced than you might think and it depends entirely on your realm of control and what specific task you're
trying to achieve only what you can directly control what you can take for granted is the agent for example imagine
I'm learning how to drive a car but I'm starting from zero I don't even know how to move my limbs yet we need to write a
program let's call it program a and put it in my brain in order to learn how to move my limbs here what we can directly control are the neuron signals that my
brain sends down my spinal cord into my muscles what we cannot directly control but are trying to influence indirectly are the movements of my limbs so we can
say the agent is my brain the environment is my limbs the actions are my neuron signals and the state includes the positions joint angles and movements
of my limbs as observed through Visual feedback and proprioception now pretend we've fine-tuned this program enough so now I'm pretty confident with limb
movement in my forearm for example I can open and close my grip do some wrist flexion and extension and occasionally if I'm feeling extra adventurous maybe even
dabble in some ulnar and radial deviation now we have to write program B to control how I'll interact with the
car the agent is me as a living creature the environment is the car the actions are my limb movements and the state consists of the elements of the car
which I can observe visually such as looking at the speedometer or through tactile feedback like gripping the steering wheel or stepping on the brakes
now that I know how to operate the car I have to learn to drive using program C which let's say can now directly control how the car moves forwards and backwards and how it turns
now the agents is the whole car the environment is the road the actions are just how the car moves and the state includes other cars and their positions
how many lanes there are what the speed limit is ETC which I observe visually so in this agent environment model the boundary between agents and environment
can be quite arbitrary and the time scale for these actions and state observations can also be quite arbitrary like for example if I close my
fist is that one action or is it five actions one for each finger the good thing about this arbitrariness is because you can frame a problem in so
many different ways you can choose to frame it in a way that gives the problem structure and makes it easier to solve so let's go over the way that reinforcement learning frames a task in
order to solve it which is called the Markov decision process so first of all there's also this added signal known as a reward which is just a single number
that comes after each action that represents whether that action had a good or bad outcome the more positive the reward the better and the more negative the reward the worse now it's up to you to interpret whether the
reward comes from the agents or the environment if it's a human learning something we can generally decide for ourselves whether we did good or bad at that task but if you're training a dog
then the reward of food seems to be provided externally conventionally the reward is considered to be a signal from the environment to the agent so let's stick with that so now let's write out
what a Markov decision process is also known as an MDP it's a discrete repeating one-after-the-other sequence of state
action reward State action reward State action reward and so on this can either go on forever or it can terminate like if you're playing a level in a video
game and you die on that level in which case it's called an episode and the episode terminates just a small note on the notation here I've chosen to have
the time step of the reward take on the time step of the state and action before it so it would go s0 a0 r0 S1 A1 R1 Etc
this is because I thought it makes more sense this way however I found out that in other sources about reinforcement learning usually they have it so the reward's time step is the same as the
state and action after it so it goes s0 a0 R1 S1 A1 R2 Etc this is not a huge deal but as a result in some of the
later mathematical formulas you might see that it's a bit different compared to those formulas in other sources where the time step of the reward is off by
one other sources also often choose to use uppercase letters like capital R but here I've chosen to use lowercase r once again not a big deal let's get on
with the video ideally this sequence must follow a rule called the Markov property which means that each state is only dependent on its immediately
previous state and not any other state for example if you're making a robot to catch a ball and you've designed it so that the state information it receives
only includes its position then it's unclear what the next state will be because it's unclear in which direction and how fast the ball is moving you would have to look at multiple
consecutive states to deduce that information if instead you also included the velocity in the state information then it would follow the Markov property and each state is only
determined by the immediately previous one of course the robot isn't necessarily going to know how gravity works or how one state transitions to
the next but at least there's sufficient information for it to learn that now mdps are they perfect do they apply to every single situation probably not
first of all lots of RL methods rely on this Markov property so that they can just go off of the most recent state to decide on the next action and keep it simple but obviously not all situations
are going to satisfy this Markov property and if we have certain states that depend on earlier states then the tools that we develop using this framework aren't going to work as well also with a simple sequence like this
it's hard to deal with varying time scales if you want to work on the scale of individual muscle movements in one moment and then in the next moment zoom out to let's say which city to drive to
then you'd probably need a more complex framework than that another thing is there's only one single number as a reward signal if you're doing tasks that are more nuanced and can be evaluated in
multiple aspects then maybe having only a single number as a feedback for each action can be a bit limiting but this is pretty much the best that we've been able to come up with so far and
basically all of reinforcement learning is built on the foundation of mdps maybe in the future someone could figure out a way that works better maybe that could be you the viewer but anyways let's talk
about some of the math in an mdp within an mdp the ultimate goal of the agent is to figure out how to for any given time
step use an input state to decide on an output action in a way that maximizes the total subsequent reward
over time so let's cover these two mathematical constructs the strategy that takes a state and decides on an action and the cumulative reward over
time the strategy we call that the policy function represented by the symbol pi you can intuitively think of it as a mathematical function that takes
a state as an input and gives an action as an output but technically the notation for this policy function is usually written a bit differently to
allow it to have some Randomness as in given the same state the policy can produce a range of different actions each with their own probability the randomness is important for exploring
different strategies so RL people usually write the policy function as giving the probability of picking a certain action given this vertical bar
is a conditional given that you're in a certain state
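Written out, that notation is roughly the standard form:

```latex
\pi(a \mid s) \;=\; \Pr\bigl(a_t = a \mid s_t = s\bigr)
```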
now the next thing the total cumulative reward is called the return represented by capital G with a subscript t the subscript t represents which time step it's for so for example if we're talking about the sum of
rewards starting from R1 then that would be G1 this is just all the rewards added up starting from time step T and
progressively multiplied by this discount factor gamma in increasing powers now gamma is just a number between zero and one and this can be
lowered to make rewards in the future less significant perhaps because the future is inherently harder to predict so maybe you'd want that to factor less into your decision- making or gamma can
just be set to one in which case this would be an undiscounted sum but then it would only work for terminating episodes or else it'll just add up to infinity
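As a formula, using this video's convention that the reward r_t shares the time step of the state and action before it:

```latex
G_t \;=\; r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \dots \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad 0 \le \gamma \le 1
```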
of course it's better to try to maximize the total reward over time instead of just the immediate reward for a single time step because what gets you more reward immediately might make you
get less reward later on kind of like delayed gratification so remember how the goal is to for any given time step use an input state to determine an
output action in order to maximize the total subsequent reward over time using this mathematical terminology now we can instead say that the goal is to find the policy
which maximizes return find the policy which maximizes [Music] return all right let's try applying this to a practical example usually when
you're teaching RL you start with some sort of task involving a grid and we're going to do that too the reason is that even though no real life problem will
ever be this simple in a grid each State corresponds to a single Square so it's really easy to visualize and demonstrate some key Concepts like the exploration
exploitation trade-off temporal difference the credit assignment problem World models and Sample efficiency using our simple grid demo we're going to go
through what each of these terms mean how they relate to each other and why they're important for reinforcement learning we have a maze well a really simple one with just one wall but let's
still call it a maze where the agent starts on a certain cell and must get to the target cell by moving up down left or right and it cannot move through
walls the maximum episode length is 20 so the episode terminates when it either reaches the Target or when it's been going for 20 steps already and still hasn't reached the target let's start thinking about how we can apply the
theory we've developed so far and frame this task as a Markov decision process so first we'll need to define a state an action and a reward so each state
will be two numbers representing the XY coordinates that the agent is on each action will be a single number between zero and three representing either up
down left or right the reward after each action will be zero if the agent enters the target cell or minus one otherwise which makes it so that the goal is to arrive at the target as fast as possible
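A minimal sketch of that framing in code (the wall layout, the target cell, and the mapping of actions 0-2 to directions are assumptions; the video only confirms that action 3 means right):

```python
# Hypothetical 5x3 maze: states are (x, y) cells, reward is 0 on entering the
# target cell and -1 otherwise; bumping a wall or the edge leaves you in place.
WALLS = {(2, 1)}                                          # assumed wall layout
TARGET = (4, 2)                                           # assumed target cell
MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}    # up, down, left, right (3 = right per the video)

def step(state, action):
    x, y = state
    dx, dy = MOVES[action]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (0 <= nxt[0] < 5 and 0 <= nxt[1] < 3):
        nxt = state                       # blocked moves keep the agent where it is
    reward = 0 if nxt == TARGET else -1
    return nxt, reward
```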
so here we can talk about our first concept which is world models as humans we can obviously see the whole
maze at a glance and immediately figure out how to solve it because we understand that these states represent coordinates the actions represent directions and how the reward works we
know that if you're in the state of (0,0) and you pick the action of three representing right that you'll move into (1,0) and get a reward of minus one we also know that in (1,1) if you pick
three and try to move right then you'll stay there because there's a wall in front of it if the agent also understands this then we say that it has a world model or a
model of the environment mathematically this just means having access to this function sometimes denoted as P which tells you given that you picked a
certain action in a certain State what reward and next state will it lead to or more precisely to account for some Randomness in the environment what's the
probability of any next state and reward given a combination of being in a certain state and picking a certain action
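In symbols, that function is usually written something like this (again with the reward indexed to the preceding state and action, as this video does):

```latex
p(s', r \mid s, a) \;=\; \Pr\bigl(s_{t+1} = s',\; r_t = r \;\bigm|\; s_t = s,\; a_t = a\bigr)
```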
a function like this is an example of a world model it turns out that whether an RL agent has access to a world model whether it understands how the world works makes a massive
difference to how well it can learn to do tasks which obviously makes sense this is also why a lot of the current research in RL is concerned with how to develop Better World models or make
agents learn Better World models on their own for complex tasks with our maze task though we're going to start by focusing on model free methods which assume that the agent does not have
access to a world model in other words the agent cannot see the whole maze at a glance like us humans it doesn't know what these state and action numbers represent or how they relate to each
other and it doesn't understand how the rewards work so it doesn't know that if you're in (0,0) and you pick three you'll end up in (1,0) for example these states actions and rewards are just meaningless
random numbers and the agent has to somehow learn to pick actions in a way that maximizes the rewards it gets so without access to any information all the agents can do at the start is just
randomly pick actions meaning that instead of using a policy where there's actually meaningful probabilities for the different actions it's effectively
using a policy that looks like this for all these meaningless input numbers all the meaningless output numbers have the same probability and we're meant to
somehow improve this policy over time so by picking random actions of course most of the time it gets nowhere and receives a total reward of -20 since the episodes
are cut off at 20 steps but the important thing is that it's writing down all of its experiences in order to learn from this process of trying actions is called
sampling and all these sequences of State action reward State action reward are known as trajectories the agent is collecting sample trajectories
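A rough sketch of that sampling loop, reusing the hypothetical step() and TARGET from the maze sketch above (the start cell is also an assumption):

```python
import random

def collect_episode(start=(0, 0), max_steps=20):
    """Roll out one episode with a uniformly random policy, cut off at 20 steps."""
    trajectory, state = [], start
    for _ in range(max_steps):
        action = random.randrange(4)               # pick one of the 4 actions at random
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if state == TARGET:                        # episode also ends on reaching the target
            break
    return trajectory
```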
eventually due to sheer luck there will be a few trajectories where it actually does reach the target and these are the important ones let's see how the learning happens so first we're going to
now use this other mathematical Concepts that we've developed and calculate the returns for each time step in our successful trajectory using a gamma of let's say
0.8 the way we can do this because of the way that later rewards are progressively multiplied by gamma is we can start at the end and work backwards
the return of the last time step will just be zero and to get to the previous one you multiply that by gamma and then add on the previous
reward then to get to the previous one you multiply that whole sum by gamma and add on the previous reward again then you repeat
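That backwards pass can be sketched in a few lines, using the recursion G_t = r_t + gamma * G_{t+1} over the rewards recorded in a trajectory:

```python
def compute_returns(rewards, gamma=0.8):
    """Work backwards through a list of rewards to get the return at every time step."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g    # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns

print(compute_returns([-1, -1, -1, 0]))   # roughly [-2.44, -1.8, -1.0, 0.0]
```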
in the end you'll find that even though this is a successful trajectory all the returns are negative because all the rewards are negative obviously except for when the agent reaches the target but in the
unsuccessful trajectories all the calculated returns are going to be even more negative because the agent never reaches the target so keeps collecting the minus one rewards for the full 20
time steps so now that we have all our sample trajectories and we've calculated the returns for each time step we have to somehow use these experiences to achieve a better policy there are
different ways that you can go about this you can directly work with these probabilities for example push the probability up for actions that had good returns and push the probability down
for actions that had bad returns that would be called a policy gradients method gradient meaning rate of change and we have to calculate how much to
change the policy based on which actions are good and which actions are bad but doing that requires us to know what's considered good and what's considered
bad consider this example where we have three actions initially all set to the same probability let's say we collect a lot of sample trajectories and on
average action one gives a return of -5 action two gives a return of zero and the third action gives a return of five then it's pretty straightforward you
push the probability for action one down keep action two roughly the same and then push action three up but what if
instead action one gave five action two gave 10 and action 3 gave 15 or what if it's 95 100 and 105 or if it's like our
maze example where everything's negative see the issue here is not that all three numbers are positive or negative and you can increase or decrease all probabilities and still have them add up
to one when you're actually using policy gradient methods the effect on the probabilities is more like a tug of war especially with neural networks which have this thing called a soft Max
function that makes sure everything adds up to one the real issue is that proportionally the effects are now different 10 to 15 is a 50% increase
while 100 to 105 is only a 5% increase what we want is for all cases to have the same effect because an increase of five is an increase of five no matter what especially since within
the same environment maybe some states are just better on average and some states are just worse on average no matter what action you take so what we need is these things called value
functions as our baselines which keep track of the average return you should expect to get when you're following a certain policy and help evaluate how
good or bad States and actions are V is the state value function representing how much return you should expect to get when you're in a certain State and
follow a certain policy Q is the action value function representing how much return you should expect if you're in a certain State pick a certain action and
then follow the given policy notice that these functions necessarily have a policy function attached to them because if it's a different policy
the actions that are picked will be different so naturally you would expect to get different returns these are just random example numbers that I made up but the key is that it's different for
different policies in fact the state value function is basically just all the action values within that state averaged
according to the probabilities given by the policy
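Written out in the usual notation, as expectations of the return G_t:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s\right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s,\, a_t = a\right], \qquad
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
```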
if you just want to not care about which specific policy we're talking about and refer to the best case scenario if you somehow picked the best action every time then that would be V* and Q* the optimal value functions we
have written these value functions not as probabilities without any Randomness unlike the policy and World model up here because RL is still developing and
up till now people have not yet considered Randomness and variance for the value functions as much there are fields of research though like Bayesian
RL and distributional RL which do consider variants in value functions which you can think of as like Risk to reward ratio as in how good a state or
action is but also how uncertain that goodness is there are so many unanswered questions in RL but anyways that's our
value functions once we have our value functions then we can improve our policy and now we have a few different options as to how we do this first we could just
use the policy gradient method we talked about before without using a value function Baseline there's issues with that as we've said with the different averages it doesn't work that well and
then we could also use the policy gradient method with a baseline such as our state value function to compare with all the different returns in the same state now this works better as all these
cases with different averages can be made to all have the same effect but it turns out there's also another method which is instead to just look at the
action value function and pick the action with the highest estimate every time in other words the policy doesn't keep track of different probabilities
anymore 100% of the probability just goes to the action with the highest estimate well not quite because some Randomness is needed or else there will be issues as we'll see but now there's
this cyclic dependency because usually the action value function is based on the policy function but now the policy function is also based on the action
value function so with this cyclic nature as long as you keep improving the accuracy of the action value function as we'll see later the policy will improve
alongside it out of the three methods you can essentially think of the first method as being more dependent on the policy function the last method as being
more dependent on a value function and the middle method as being a balance between the two taking the Best of Both Worlds between the policy and the value function all of them have their
advantages and disadvantages but this balanced method does tend to be the most powerful out of the three what I want to do though is focus exclusively on the
problem of evaluation for now the problem of how to develop good value functions having good evaluation is very important to having good Improvement so
let's first focus on this last method this action Value method even though it has its disadvantages which we'll get to later it allows us to not worry about
the policy function for now and just focus on some key concepts for good evaluation so of course the whole point of evaluation is we don't just magically
have access to such an action value Q function and must develop one and make it more accurate over time the simplest way we can do this is just to
take the trajectories we collected and then average out the return values for each combination of state and action we can imagine that for every possible
State action input the output of this Q function will initially be set to zero as shown in this table of course the agents has no knowledge that the table
is supposed to be a 5x3 grid like the maze but I set it up like this just to show the values that will actually end up changing so with every new return
value remember we're working backwards so let's say this state of (4,1) plus this action of 1 which would correspond to this right here we would move the value for that specific state action pair towards this result by a small amount let's say 10% now in this case since they're both zero it wouldn't matter but it would be a calculation like this 0 + (0 - 0) * 0.1 which is still 0 this is an example of how we could gradually adjust these values of course here it'll be zero so nothing will change but let's say the previous one with state (3,1) and action 3 where the result is -1 so we would say 0 + (-1 - 0) * 0.1 = -0.1 and then we would update our table with this new result now for the sake of time I'm not going to do all of the values but just to demonstrate here we have two instances of the same state action pair of (1,0) and 3 so for example the first result would be -4.57 so that our value is now 0 + (-4.57 - 0) * 0.1 = -0.457 and then when we get to the other one now we have -0.457 because that's the value that was originally there plus the new result of -4.73 minus the existing value and then times 0.1 so the new value is now more negative and adjusted towards whatever result we are using so this constant shifting of our values towards these new results by a small amount is how
everything gets averaged over time and over many iterations the good States and actions will end up having a good value in this function and the bad States and actions will have bad values ideally
eventually it'll be able to approach our actual environment obviously the states that are closer to our target should have better values and the states that are farther away should have worse values
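A minimal sketch of that running-average update, assuming a tabular Q stored as a dict keyed by (state, action) and a learning rate of 0.1:

```python
def monte_carlo_update(Q, state, action, G, alpha=0.1):
    """Move Q[(state, action)] towards the sampled return G by a fraction alpha."""
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (G - old)

Q = {}
monte_carlo_update(Q, (1, 0), 3, -4.57)   # value becomes -0.457
monte_carlo_update(Q, (1, 0), 3, -4.73)   # then roughly -0.884
```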
let's run the simulation for lots and lots of episodes in the beginning the trajectories don't really get anywhere but what you see is over time the Q function that we are developing which
evaluates the policy becomes more accurate and because it becomes more accurate the policy which picks the best action based on the Q function also improves but because the policy improved
the Q function is now evaluating a slightly different policy and has to become more accurate for the new policy yet again and then after that the policy improves again because of this cyclic
dependence between the policy and the Q function the Improvement happens in a cycle known as generalized policy iteration eventually the goal is for the
Q function to approach the Q star function where it accounts for the best case scenario out of all the different ways of picking actions then naturally
the policy based on this optimal Q* function will also be the optimal policy here in the corner we have some results to track how the agent is improving but
often times in RL you track Improvement using total undiscounted reward as opposed to discounted return which is what's used for the training that's not
too important but what is more important is this variable over here called Epsilon and what it does is help maintain a balance between
exploration and exploitation this is one of the downsides of this action Value method the reason the agent was able to develop the value function in the first
place was by following a random policy and gaining all these diverse experiences that's known as exploration but once it actually wants to use the value function to perform
better and get more rewards known as exploitation then if it just picks the action with the highest value every time then there's no Randomness anymore the
probability distribution ends up being one action having 100% probability and all the other actions having zero when you're improving in RL what you need is
a balance between exploration and exploitation so that you can both try out different things to improve and also take advantage of that Improvement to
perform well now with policy gradients methods that is not a concern since you're adjusting a whole bunch of different probabilities it'll naturally have some Randomness even if certain
actions tend to be more optimal but with this action Value method where you're just picking the best one we have to use this hack called Epsilon greedy to kind
of artificially introduce some Randomness back in basically Epsilon represents the proportion of the time where the agent picks a random action
and it's supposed to gradually decrease for example you might have Epsilon start off as 0.9 meaning you pick randomly 90% of the time and pick optimally 10% of
the time this is because what the agent thinks is optimal is not very accurate at the start but then as time goes on and the action value estimates become more accurate Epsilon gradually
decreases to 0.1 which is 10% random and 90% optimal
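A minimal sketch of epsilon-greedy action selection over a tabular Q (actions numbered 0-3 as in the maze):

```python
import random

def epsilon_greedy(Q, state, epsilon, actions=(0, 1, 2, 3)):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    if random.random() < epsilon:
        return random.choice(actions)                            # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploit

# epsilon itself is decayed over training, for example from 0.9 down to 0.1.
```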
this method that we just covered is known as a Monte Carlo method which according to Wikipedia is just a computational algorithm that relies on repeated random
sampling to obtain numerical results we used random sample trajectories to figure out what the values of the Q function should be now let's look at
what the limitations of this method are and what we can improve upon it [Music] the main issue with Monte Carlo methods like this is that it works on an episode
to episode basis when we collect our trajectories we've got good episodes and Bad episodes in the good episodes all
the actions will be evaluated better and in the Bad episodes all the actions will be evaluated worse the first problem this results in is that you have to wait
for the episode to end before you can even calculate any returns and adjust your evaluations for the actions if the episodes are really long then that's going to be really slow and if it's a
continuing task meaning it just keeps on going and doesn't terminate then this method won't even work the second problem is that within the same episode you don't know which actions were good
and which actions were bad take our successful trajectory for example if we draw out what the path actually looks like on the maze then it's clear that some actions were the right decision and
some were not using Monte Carlo methods we can only rely on the large number of samples to average out and hopefully correct for this suboptimal
action we have now run into the credit assignment problem the problem of figuring out what the impact of an individual action was within a long sequence of many actions so how can we
try to solve the credit assignment problem if the agent is just working with these trajectories and cannot see the whole maze at a glance like us humans the
answer is a concept known as temporal difference which breaks down evaluation to an action by action basis rather than an episode by episode basis here let's
imagine that in these two graphs the magnets represent time steps in a trajectory and we're using them to help construct a q function which estimates
how much return you'll get if you're in a state s take action a and then follow policy pi in Monte Carlo methods you're directly calculating the returns
in an absolute sense by adding up all these subsequent rewards like we've done on paper with these trajectories in order to contribute to this estimate that's why you have to wait until the
episode ends however in a temporal difference method even though we're still trying to construct an estimate of the returns we don't directly calculate the returns by adding up all these
subsequent rewards instead you come up with an estimate for each state action pair relative to the estimate for the
state action pair in the time step after it using just one reward as the difference not all the subsequent rewards the terminal States at the end
of episodes which don't have a reward and next state afterwards stay at zero acting as an anchor around which the other values develop if it's a task that
continues for forever and there is no terminal State then there is no absolute reference point as an anchor and all the values might drift up or down but since you just need to know their values
relative to each other to pick the best one that's not necessarily an issue anyways the actual math that describes this relativity is like this if we write
out a return as a discounted sum of rewards then we can see that we can take out the first reward and the rest of it will just be the next return in the
trajectory multiplied by gamma of course returns refer to within a specific trajectory so for our more General
averaged out estimate for States and actions we can replace the GT with q and move our estimate towards this result by
10% for example like we did before now here for the next state and action we've just used the state and action that that happens to occur in the sample
trajectory but since the environment has some Randomness this might not be the only subsequent State that's possible and since there's a range of different actions you can take this is also not
the only subsequent action that's possible we can't do much about the state because we're talking about a model-free method right now so we don't
have access to a world model and don't know the next state transitions but with the action as you may imagine there's a couple other ways that we can decide
what action to put in here so that we can make this calculation for example you can account for all the possible actions averaged out according to the
policy probabilities here it's a weighted sum across all the actions where you multiply by the probability or
you could take the best action with the highest estimate rather than the one that you ended up taking here you're taking the maximum out of all the
actions this algorithm is called sarsa this one is called expected sarsa and this one is called Q learning in other
words for any given State and action Monte Carlo is saying what is my return sarsa is saying how much better or worse
is my return relative to the new state and the action that I ended up taking expected sarsa is saying how much better or worse is my return relative to the
new state with all of its actions averaged out according to policy probabilities Q learning is saying how much better or worse is my return
relative to the new state with its best case scenario highest estimated action
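As one summary formula, the three updates just described all share the form below and differ only in the term X used to build the one-step target (written with this video's reward-indexing convention):

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\bigl[\, r_t + \gamma X - Q(s_t, a_t) \,\bigr],
\quad
X = \begin{cases}
Q(s_{t+1}, a_{t+1}) & \text{SARSA} \\
\sum_{a} \pi(a \mid s_{t+1})\, Q(s_{t+1}, a) & \text{expected SARSA} \\
\max_{a} Q(s_{t+1}, a) & \text{Q-learning}
\end{cases}
```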
of course alongside these different ways of learning the action value Q function you can also learn the state value V function like so where it's a bit simpler because it only takes states as opposed to states and actions right now
we're focusing on the Q function so this V function is not that important but do remember it because it'll come up later in the video furthermore I've written
out the math in simplified form with an arrow and just described it as moving one value towards another value but mathematically it's like this where you
assign to your estimate what it originally was plus the difference between the Target and the initial value multiplied by some sort of learning rate
alpha which we've used as 0.1 up until now anyways that's just the actual math of it if you're curious now notice how a Q function is usually supposed to have a
pi attached to it because it's evaluating for that given policy but here for these different ways of constructing estimates I haven't put a
pi on them and that's because they are estimating different things for sarsa and expected sarsa notice how we're constructing the estimates based on what
action the policy ended up taking or what action it's expected to take based on the policy's probabilities in these two cases since the estimates are based
on the policy Pi then it's estimating Q pi and then remember through this continuous cycle of estimating Q pi and then using it to improve pi and then
estimating the new Q pi again for the improved pi called generalized policy iteration eventually Q pi will become the optimal Q function Q* and pi will
become the optimal policy but for Q-learning if we're constructing the estimate based on the best action to take in the next state then it's no longer
dependent on any policy like the other two that means instead of estimating Q pi which will hopefully eventually become Q* it instead directly estimates
Q* so there is a terminology to describe this difference the first two are called on-policy methods where the behavior policy the policy used to
gather experiences is the same as the target policy the policy that is being evaluated or improved these two are evaluating pi
which is the policy that's used to gather the experiences and trajectories in the first place the last one is called an off policy method where the
behavior policy is not the same as the target policy it still uses the policy pi to gather experiences but it's not trying to evaluate the policy Pi it's
instead trying to evaluate the optimal policy directly so Monte Carlo SARSA expected SARSA and Q-learning all these methods differ in their sample
efficiency what that means is the number of samples either the number of episodes or the number of time steps that it takes to get good at a task if for the
same task one method requires 100 episodes to get good at it and another method requires only 50 episodes then the 50 episode method is more sample
efficient and that's what we want usually Monte Carlo is the least sample efficient because the other three have taken advantage of temporal difference
and it has not then it's usually followed by sarsa which is more sample efficient and then expected sarsa which is even more sample efficient and then
finally Q learning which is the most sample efficient because it has taken advantage of both temporal difference and of policy learning now off policy
has its advantages and disadvantages but this sample efficiency is sometimes one of the advantages of off-policy learning now we've said that Q-learning
tries to estimate the optimal Q* function through updating like this there's an equation that describes what this optimal Q* function will actually
end up being remember how the environment is random and after you take a certain action you won't always end up getting the exact same reward or the exact same next state and this is of
course described by the world model function that gives the probability of the different rewards and next states which we don't have access to
well if you do Q-learning with enough sample trajectories it will average out and you will end up approximating this optimal Q* function which is supposed
to be the expected value of this relationship we have been trying to capture the expected value is just a weighted average similar to an expected
sarsa but this time according to the probabilities in the environment meaning that even though we don't have access to this world model the samples we collect
will still average out according to it furthermore aside from this expected value the expression inside it expresses a property called Bellman's principle of
optimality which if you look on Wikipedia is a quote from some guy called Richard Bellman who was figuring out some of this math back in the
1950s it says an optimal policy has the property that whatever the initial state and initial decision are the remaining decisions must constitute an optimal policy with regard to the state
resulting from the first decision what this means is if there's a series of decisions which is optimal as a whole then if you look at all the decisions except for the first one those have to
be optimal too which kind of makes sense right if you want to get full marks in a test and the test is split into three sections then naturally that means you must get full marks in each individual
section it sounds pretty obvious but back then this was groundbreaking stuff so that's kind of the theory behind putting a Max around this subsequent bit
in order to find the optimal Q function and this is called a Bellman optimality equation there is a Bellman optimality equation for the other value function v
as well but we're focusing on Q for now I don't think it's too important to know the historical context and terminology for this bit of math but everyone who
teaches RL mentions it so I thought I'd include it as well to quickly demonstrate how Q Learning Works in practice here's the table again with some random numbers filled in let's say
you're in this state picked this action and then ended up in this state with a reward of minus1 then you'd simply look at the new state find the maximum value
out of all the actions there which is this one multiply by gamma and then add on the reward once you have that result you simply go back to the original state
and action and then move that value towards it by a learning rate of 10% for example this way you only need the
reward for one time step and you don't have to add up all the rewards in a trajectory so let's run the simulation again with Monte Carlo and Q learning
side by side and see if it has any Improvement here I've made it so they both use the same Epsilon they both use the same gamma of 0. 8 and they both use
the same learn rate of 0.1 or 10% for this small maze there is a bit of an improvements but the difference is not that big if we switch to a bigger maze
though then immediately you can see that Q learning is doing way better and Monte Carlo is never able to reach the target Square even once the benefits of a more efficient algorithm really start to show
in more complex environments now you might be thinking that this is only because Monte Carlo was was never able to reach the target even once and it just got unlucky so to
see what would happen if Monte Carlo did have some successful experiences I did another run where for the first 50 episodes it was artificially helped to
move to the Target as efficiently as possible but even after that Monte Carlo failed to reach the target again in the rest of the run the point is Q learning is better
[Music] so now that we've gotten up to Q learning which is more sample efficient than the other ones we have finally
gotten good enough at the fundamentals that we can slap on a neural network and move past this grid example the main
benefit of neural networks is that we can go from discrete States and actions to continuous States and actions in other words instead of only having whole
number positions and a limited number of actions in this grid neural networks can allow for an infinite continuous range of states and actions as you can imagine
this is necessary for basically any real world physical task where you're dealing with things like angles velocities positions and directions it's impossible
to use a table for continuous States and actions because you can't have an infinitely big table the first algorithm we'll look at is
called Deep Q Network or dqn which takes us halfway there as in it can allow for continuous States but
not continuous actions so we'll try and solve a task called lunar lander where you're supposed to activate the engines on this 2D spaceship and try to land it
in the middle the task is designed so that there's three engines which are either on or off and you can only turn on one at a time which means there's
four discrete actions one for having no engines on and one for each engine the reason deep Q networks only allow for a limited discrete number of actions is
because it's a value based method remember how there is methods that are more reliant on the policy function methods that are more reliant on value
functions and methods that are a balance between the two up till now we've been focusing on the last category where Monte Carlo sarsa expected sarsa and Q
learning are all value-based methods they all evaluate actions so obviously you can't just evaluate an infinite number of actions in order to pick the
best one for continuous actions we have to go back to policy gradients methods which we'll look at later anyways with deep Q networks you basically take Q
learning but swap out the table for a neural network the table takes input states and actions and outputs an evaluation and you can
go and adjust its outputs over time well a neural network can also take input states and actions and output an evaluation and then you'd go and adjust
its outputs using back propagation and gradient descent well with a slight adjustment you take an input State and output evaluations for all the actions
simultaneously because that's more efficient the neural network is a function approximator because while with the table you can keep track of a whole
bunch of exact values which don't affect each other with a neural network you have a limited number of parameters which affect how all the outputs are
calculated so here is lunar lander with a neural network instead of a table to approximate its Q function the neural network has layers of 64 nodes each in
the middle so I wasn't able to show all of them but it's essentially updating according to Q-learning so it's trying to estimate the optimal Q* function in a way that
is off-policy making it more sample efficient compared to the grid this environment uses much smaller time steps and many more of them so we are using
a gamma discount factor of 0.999 and a learning rate of 0.001
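A rough sketch of such a Q-network and its one-step target, assuming PyTorch and two hidden layers of 64 units (the video does not show its code, and as mentioned further down a full DQN also adds a target network and a replay buffer):

```python
import torch
import torch.nn as nn

# 8 state variables in, one Q-value estimate per discrete action out.
q_net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)

def td_target(reward, next_state, done, gamma=0.999):
    """Q-learning target r + gamma * max_a Q(s', a); no bootstrap at episode end."""
    with torch.no_grad():
        bootstrap = 0.0 if done else q_net(next_state).max().item()
    return reward + gamma * bootstrap
```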
as you can see we have our familiar epsilon here which dictates the probability that it'll take random actions as opposed to the most optimal one this environment comes from a python library called OpenAI Gym and the state
consists of eight variables the X and Y coordinates of the spaceship the X and Y velocities of the spaceship the angle the angular velocity which is how fast
it's spinning and two yes or no values saying whether its two legs are in contact with the ground the rewards at every time step will be higher if the spaceship is closer to The Landing Pad
moving more slowly and in a more horizontal position whenever the engines fire it results in a small penalty in the reward to try to get the spaceship
to use less Fuel and there's a bigger reward when the legs are in contact with the ground finally there's a massive penalty if it crashes and a massive
reward if it lands safely of course at the start the untrained spaceship will keep crashing but over time it does improve now there's actually many
variants of deep Q networks also known as DQNs there's double DQN dueling DQN rainbow DQN etc these algorithms all differ in the little tricks that they
use but fundamentally they're all basically taking Q learning and replacing the table with neural networks in fact even this normal dqn which came
out in 2013 in a paper called playing Atari with deep reinforcement learning uses a couple of these tricks such as a Target Network and a replay buffer but
that's outside the scope of this video we won't go into all of those tricks after around 100 episodes the spaceship has gotten pretty good and pretty
consistent as well it doesn't crash anymore it stays upright and whenever it veers off to the side because there's a random Force that's applied to it at the
start of every episode to make it more challenging it's able to slowly readjust and move back into the middle before landing so let's try and visualize what the network has learned here are some
examples of outputs from a network trained on lunar lander I've laid it out so that this rectangular area corresponds to the X and Y coordinates of the spaceship which is the first two
numbers in the state and the other numbers in the States can be altered down here using sliders as you can see it's been trained to the point where you can understand the outputs reasonably
well so when the spaceship is not moving completely horizontal and not rotating then it will not fire any engine if it's right above the target when it's off to the left it'll fire its left engine to
move back towards the right when it's off to the right it'll fire Its Right engine to move back towards the left and if it's too far down it'll fire its down engine to come back up other than that
most things make sense but not everything if the x velocity is positive so it's moving to the right it's more likely to fire Its Right engine to move
back left but if it's moving to the left for some reason the left engine is not fired more frequently if it's got a negative y velocity meaning it's moving downwards
it's more likely to fire its down engine to slow down but for some reason if it's moving upwards it'll start firing Its Right engine if the angle is positive meaning
counterclockwise it'll fire the left engine more and if it's angled clockwise it'll fire the right engine more the same thing can be seen for angular
velocity which is how fast it's rotating for comparison I've also visualized a random untrained Network and it's much harder to see a pattern or explanation
in how its inputs relate to its [Music] outputs now let's talk about policy gradients if you'll remember from before this is where we push up the
probabilities for actions that did well and push down the probabilities for actions that did poorly here instead of just focusing on evaluation with a value
function and having the improvements of the policy be just based on that we are now separately keeping track of a distribution of probabilities and
altering it at will furthermore it turns out that not only can you have a discrete probability distribution you can also alter policy gradients to work
with A continuous probability distribution defined by a mean and standard deviation for example which we'll look at later this is of course important for many tasks where there is
a physical aspect where it's not feasible to classify actions into discrete options so let's start off by talking about a fundamental difference
between learning a better value function for evaluation versus learning a better policy function for improvement if you'll notice whenever we have learned a
value function up until now no matter if it's with a table or with a neural network no matter if it's Monte Carlo or SARSA or expected SARSA or Q-learning
there's always some sort of Target that we calculate and then try to get the function to move closer to and this perfectly fits how a neural network traditionally functions where you're
trying to use gradient descent to minimize a loss function in order to get closer to the targets or ground truth
that you give to the network however in order to improve the policy function with all the probabilities that it keeps track of what's the target if we're just
adjusting probabilities up and down depending on whether the action did well then that's more akin to trying out different things and seeing what works
there is no such clearcut Target without a ground truth Target we no longer have a loss function to minimize without
something to minimize we are instead going to look for something to maximize what quantity should we maximize so that
our policy performs as best as possible such a thing is described as an objective function J which is written in a way that says it takes in the whole
policy with all its Network parameters Theta the whole policy not just a specific state or action and outputs some sort of evaluation of how well it's
doing so it's kind of like another type of evaluation but it's different from the value function such as Q in that Q for example takes a lot of effort to
estimate and calculate and it actually serves the useful function of helping us differentiate between which states and actions are good and which states and actions
are bad J on the other hand just evaluates how well the policy is doing as a whole so we could even say that our graph that tracks Total Rewards over
time is an example of a j function calculating the value of J is not that hard and not that meaningful I mean we kind of just did with this reward graph
and it doesn't really tell us anything about how to get better so the point of J is not really to calculate its value unlike the value functions of Q or V the
point is that when you set it as the thing to maximize for this neural network through gradient Ascent you can end up deriving some pretty useful math
that helps improve the policy function so here's where we have to figure out how to actually Define j in order to derive that math J is often defined as
different things in different sources on the internet this one frames it as what's the expected total reward that you'll get in every trajectory generated by the policy this one is
what's the state value V of an initial States in an episode with a start and end this one is the average State value V across all states and this one is the average reward per time step the point
is it doesn't really matter they're basically all slightly different ways of saying the same thing which is how much reward are you going to get on average
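To make those descriptions concrete, here is roughly how those four definitions are usually written in standard notation (a reconstruction, since the on-screen formulas aren't in the transcript):

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_t r_t\Big]
\qquad
J(\theta) = V^{\pi_\theta}(s_0)
\qquad
J(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)
\qquad
J(\theta) = \lim_{T \to \infty} \tfrac{1}{T}\, \mathbb{E}\Big[\textstyle\sum_{t=1}^{T} r_t\Big]
$$

where $d^{\pi_\theta}$ is how often the policy visits each state; all four boil down to "how much reward do you get on average".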
the definition of J isn't that important what is important is that with all these different definitions if you differentiate it with respect to the parameters Theta with a whole bunch of complicated math involving something called the policy gradient theorem you end up with something that is generally of the following form
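The general form referred to here is not shown in the transcript, but it is usually written roughly as follows, where $\Phi_t$ is a placeholder for the "how good was this action" quantity discussed next:

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\;\Phi_t\Big]
$$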
now that looks really complicated so let's break down this equation first of all the left-hand side is the gradient of J with respect to Theta which just means how to change the parameters Theta in order to increase the performance as measured by J and then the next piece is you take Pi of a given s which is how likely you are to take a certain action and then you log it and then you take the gradient of that with respect to the parameters Theta that sounds complicated but basically it's saying how to increase the probability of an action a then the weighting term written above as Phi is some sort of quantity there are multiple definitions for this quantity which we'll get to later on but it's some sort of quantity saying how good that action a is and then finally there's the expected value and sum across all the time steps which is basically to say you're averaging out across all these
states and actions so basically it's saying in order to increase performance increase the probability of all actions according to how good each action is even though the math looks very complicated it's basically doing what we said at the start push up the probabilities for actions that did well and push down the probabilities for actions that did poorly
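As a concrete illustration of that "push probabilities up or down weighted by how good the action was" idea, here is a minimal REINFORCE-style sketch in PyTorch. The environment (CartPole as a stand-in), the tiny network, and the use of plain returns as the weighting are my own assumptions for illustration, not the exact setup from the video.

```python
# Minimal REINFORCE-style policy gradient sketch (assumptions: CartPole as a
# stand-in environment, a tiny 2-layer policy network, plain returns G_t as
# the "how good was this action" weighting).
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                   # sample from the policy's probabilities
        log_probs.append(dist.log_prob(action))  # log pi(a|s), needed for the gradient
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Compute the discounted return G_t that followed each action.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)

    # Gradient ascent on J: minimise -(log pi * G_t), i.e. push the probability
    # of each action up in proportion to how well things went after taking it.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```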
so let's first look at this quantity that says how good each action is as we've said before we
could just use GT which is the total return that occurred after that action but of course this suffers from the problem that different states might have higher or lower rewards on average so
it's inconsistent and has what we call High variance we want to figure out ways to reduce this variance of course then we can do what
we proposed which is to use the value function V as a Baseline and make it fair for all the different states but then this GT still suffers from the
problem that you have to wait for a whole episode to end and you can't use temporal difference so you could also rewrite these two but replace it with
the action value function Q so it could be either Q or Q minus V and now we're fully using temporal difference but then
with this Q minus V it's too much effort to have to use two different networks one for Q and one for V so you can use an equivalent form which rewrites Q as the immediate reward plus V for the next state multiplied by gamma so now we only need V and then we have the most advanced and powerful way which is a method called generalized advantage estimation where the math is a bit complicated so I won't write it out the point is there's quite a few different ways of calculating this quantity and they generally aim to go from evaluating how good each
action is on its own which is what we did with the value based methods like deep Q learning to evaluating how good each action is relative to the other
actions in the same state also known as its advantage and this is with the goal of reducing variance notice how the variance was not really an issue with
these value based methods like deep Q networks because you're just picking the best action anyways but now it's important to consider in order to adjust these action probabilities proportionally the main candidates for this quantity are written out below
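Written out in standard notation (again a reconstruction of what the video shows on screen), the candidate weightings just described are roughly:

$$
\Phi_t \;\in\; \Big\{\; G_t,\quad G_t - V(s_t),\quad Q(s_t, a_t),\quad Q(s_t, a_t) - V(s_t),\quad r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \;\Big\}
$$

and the generalized advantage estimate that is mentioned but not written out is usually given as

$$
\hat{A}_t^{\mathrm{GAE}} \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l},
\qquad
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
$$

with $\lambda$ trading off bias against variance.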
so now that we've looked at how to calculate how good each action is let's look at the other part of the equation which is how to increase the
probability of said action here we have two main cases one where it's a discrete probability distribution and one where it's a continuous one and we'll say that
the continuous one is a normal or gaussian distribution so it has a mean and standard deviation the discrete one is simple whichever action it is you
just push the probability for that one up or down and if you're using a softmax function the other probabilities will adjust to accommodate it with the
continuous one the actions will now be a continuous range of numbers and what you're doing instead is whatever number the action is you're dragging the mean
towards that number proportional to how far away it is and with the standard deviation then you can just increase it if the action is far away from the mean
and decrease it if the action is close by each action is weighted according to how good it is a small sketch of both cases is below
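To make the discrete-versus-continuous distinction concrete, here is a small PyTorch sketch of the two kinds of policy heads; the state and layer sizes are arbitrary illustrative choices. In both cases the update pushes log-probabilities up or down, only the distribution object changes.

```python
import torch
import torch.nn as nn

obs = torch.randn(8)  # a dummy 8-dimensional state, purely for illustration

# Discrete head: logits -> softmax probabilities over a fixed set of actions.
discrete_head = nn.Linear(8, 4)                      # 4 discrete actions
dist_d = torch.distributions.Categorical(logits=discrete_head(obs))
a_d = dist_d.sample()
log_prob_d = dist_d.log_prob(a_d)                    # pushing this up raises P(a_d)

# Continuous head: a mean per action dimension plus a learned standard deviation.
mean_head = nn.Linear(8, 2)                          # 2 continuous action dimensions
log_std = nn.Parameter(torch.zeros(2))               # std is learned, state-independent here
dist_c = torch.distributions.Normal(mean_head(obs), log_std.exp())
a_c = dist_c.sample()
log_prob_c = dist_c.log_prob(a_c).sum()              # pushing this up drags the mean toward a_c

# Either log-probability can be weighted by "how good the action was" and
# plugged into the same policy-gradient loss as before.
```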
so now we've covered our two main components of optimizing this objective function one is adjusting the action probabilities which is done by the policy function and the other one is
evaluating how good each action is to inform that adjustment and this is done with at least one value function sometimes more if this evaluation
process doesn't use this return value that you have to wait till the end of an episode to figure out and instead fully uses temporal difference then
usually it's called an actor critic method essentially the actor which is the policy function acts so it picks
action and tries to improve and the critic which is the evaluation with value functions critiques the actor in order to help it improve
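As a sketch of how the actor and critic fit together, here is a minimal one-step advantage actor-critic update (a simplified illustration, not the exact algorithm used later in the video): the critic supplies a TD-style advantage, the actor uses it to reweight log-probabilities, and the critic itself is trained toward the TD target. It would be called once per environment step.

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))   # policy: state -> action logits
critic = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))  # value: state -> V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)
gamma = 0.99

def update(state, action, reward, next_state, done):
    """One actor-critic update from a single transition (tensors; done is 0.0 or 1.0)."""
    v = critic(state).squeeze(-1)
    v_next = critic(next_state).squeeze(-1).detach()
    td_target = reward + gamma * v_next * (1.0 - done)
    advantage = (td_target - v).detach()                  # how much better than expected this turned out

    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = -(dist.log_prob(action) * advantage)     # push the action's probability up or down
    critic_loss = (td_target - v).pow(2)                  # move V(s) toward the TD target

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```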
now I don't think it's too important to know the fine detail of all this math as long as you intuitively understand what it's doing which is push probabilities for
good actions up and probabilities for bad actions down using a range of different methods to evaluate how good or bad each action was in order to
improve overall performance with the mathematical equations specifically with these typical objective functions which result in a gradient of this form some of the more advanced RL algorithms don't even use this for example TRPO and PPO are two algorithms that use what's
known as surrogate objective functions which are like the objective function J but formulated differently it looks really complicated and when the neural
network tries to optimize that it often performs better than when it tries to optimize this stuff but it still follows the general structure of an actor and a
Critic so here is lunar lander with a policy gradients method I've downloaded this trained model from the internet and it
was trained with the algorithm proximal policy optimization PPO so a bit different from the classical objective function gradients that we looked at but essentially you can see that instead of
having one network to calculate evaluations for all actions within a state we have two networks one to calculate probabilities for all actions within a state and one to calculate a single evaluation for the whole state in order to help with adjusting those probabilities in this case the state value function V among the probabilities the red one represents which action it actually ended up picking so you can see that there is inbuilt randomness here unlike with
value based methods like deep Q networks where you pick the best option but artificially add in Randomness with Epsilon greedy the next example is
called bipedal walker and has continuous actions instead of discrete actions this model was trained using a method called soft actor critic SAC and you can see that
not only have we gone from multiple discrete options to one continuous distribution we also now have multiple continuous distributions for all of the joints of this creature the actual
actions that it ended up taking out of these probability distributions are written here on the right this time the advantage is calculated using the average action value function Q for all
the actions instead of the state value function V let's talk about one of the most interesting things about reinforcement learning which is that throughout its
development people have often found links to neuroscience and how the brain works here I'll cover two of the most well-known and well accepted
similarities between reinforcement learning and the brain the first is with temporal difference and dopamine and the second is with actor critic and the dorsolateral and ventral striatum so let's first look at temporal difference and dopamine specifically the reward prediction error hypothesis of dopamine
in the 1990s a scientist named Wolfram Schultz and his colleagues performed a well-known series of experiments on monkeys where monkeys were trained to press a lever after seeing a light slide up as a trigger cue and as a reward they would receive some apple juice in their mouth what happened was at first
before the monkeys got good at this task whenever the monkeys got apple juice their dopamine neurons would get a sharp spike in activity which makes it seem
like dopamine signals reward however after the monkeys got good at the task they learned that the light lighting up signaled that a drop of apple juice was about to come and their
dopamine no longer spiked in response to the actual juice itself but rather to the light that predicted it afterwards in a different experiment there was not
one but two levers and there would be a light sliding up above the correct lever first as an instruction cue after which there would be the trigger cue telling the monkey to press the lever after which the monkey would receive the apple juice now as the monkeys got good at this task the dopamine neurons stopped responding to the trigger cue and
started responding to the instruction cue so this continual shifting forward of the dopamine Spike suggests that it signifies not reward but rather the
earliest sign that predicts there will be a reward so how does this relate to temporal difference well think about our value functions q and V let's say
specifically V in this context which is supposed to predict how much reward an agent will end up getting the monkeys start off by expecting zero reward so
their V function starts off at zero but at whatever point in time the monkey realizes it's going to get a reward the
V function changes to being positive so conceptually you can think of this dopamine Spike as a surprise or a shift in expectation
or even a prediction error because if the V function was supposed to predict future reward then you could say that when the new information arrived and it
changed there was a realization that the old prediction was wrong that there was an error so that's how you can think of it conceptually but mathematically it's
a concept called the temporal difference error or TD error remember how when you are learning your value functions you do so with these temporal difference
updates where you're moving your expectation to some sort of Target well in order to do that update of course you'd have to first figure out what the
difference is between that Target and what your expectation is currently the difference between what you originally expected to get versus what you actually
got plus what you now expect to get with the new information this is the TD error
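In symbols, for the state value function this is usually written as

$$
\delta_t \;=\; \underbrace{r_{t+1} + \gamma V(s_{t+1})}_{\text{what you got plus what you now expect}} \;-\; \underbrace{V(s_t)}_{\text{what you originally expected}}
$$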
of course you could write out the TD error for the Q function as well and in fact you could write multiple different forms of it since there's different ways of learning the Q function but we're focusing on the V function here because it's simpler and
the TD error usually refers to the V function either way this TD error is the signal that guides learning now remember that through this learning temporal difference is essentially propagating knowledge of reward back through the time steps which is what enables this value function to predict and capture information from the future if there's a consistent cue like the trigger cue or instruction cue which you can think of as a state then the TD error the surprise of reward gets propagated back to there which is why this dopamine shift occurs until there's no cue before that which can predict it
anymore at which point you can think of it as there not being a consistent earlier state so this learning cannot happen what's interesting is the
dopamine also showed signs of being a negative shift in expectation when the monkeys were expecting apple juice but they accidentally hit the wrong lever so they didn't get it their dopamine
neurons experienced a sharp drop in activity signaling a negative temporal difference error they're now returning from an expectation of positive reward
to an expectation of zero reward so anyways that's the link between dopamine and temporal difference the reward prediction error hypothesis of dopamine
extending upon that is how part of the brain exhibits similar Behavior to an actor critic structure this is now a more nuanced topic and I won't be able
to refer to one single study that's as obvious as the monkey one but let's look at a diagram of a human brain as you may or may not know different parts of the brain are responsible for different
things you've got areas responsible for sensory input motor output visual perception language and speech Etc this image shows the outside of the brain if
we look inside the brain you'll see we have this green bit here called the striatum and that's responsible for according to Wikipedia motor and action
planning decision making motivation reinforcement and reward perception sounds familiar to what reinforcement learning is doing right well it turns out there's been extensive evidence that
two subparts of the striatum in particular behave similarly to an actor and a critic the dorsolateral striatum participates in action selection and
execution like the actor and the ventral striatum participates in evaluation and prediction in order to guide the actor like the critic this 2008 article goes
into detail on the topic and they've provided a neat little diagram here where on the left side are the components of our typical reinforcement learning framework and on the right side
is how their functions map onto parts of the brain if you'll see here in a usual actor critic RL algorithm the TD error feeds into both the actor and the
critic from our previous section on dopamine you might remember that the TD error is a signal that guides learning of a value function so it makes sense
that it feeds into the critic but you might also remember that in the math of policy gradients one of the ways of calculating Advantage is also the TD
error so it also informs the actor on how to improve and sure enough in the human brain the dopamine signals which are supposed to behave like the TD error feed into both the ventral striatum and the dorsolateral striatum that's just one part of the evidence that supports this correlation which we are able to understand with the stuff that we just went through the other evidence involves
a lot more Neuroscience knowledge which you might be able to see with all these fancy names of different parts of the brain so we won't go into that so the correlation between temporal difference
and dopamine and actor critic and the dorsolateral and ventral striatum those are two of the main links that have been discovered between reinforcement learning and neuroscience now I'll talk about some of the key subfields in reinforcement learning that I think are the most important for it to develop in the coming decades evidently we don't
have robots in our homes that can fold our t-shirts yet and some of the main issues still holding back RL are covered in this famous article from 2018 titled
deep reinforcement learning doesn't work yet many of its points are still relevant today one of the main points was that RL can be really unreliable
meaning that you could write some code to train a model and run it 10 times without changing anything other than the automatic random seed that all the networks are initialized by and then
half the time it works and half the time it doesn't indeed when I was trying to train lunar lander with policy gradients specifically when I run the code
multiple times without changing anything there were enormous differences between the performance of different runs this is part of why I ended up just downloading a model from the Internet or
else I wouldn't finish this video on time reinforcement learning can also be really sample inefficient meaning it takes an insanely High number of time
steps and episodes to train models in the article it talks about how Rainbow DQN which was one of the best RL algorithms for learning Atari games was
able to reach human level performance after 18 million frames now how long is 18 million frames assuming 60 frames per
second that's 83 hours so on tasks that might take a normal human a few minutes to learn reinforcement learning takes an almost laughable amount of time now that
might be fine for video games where you have infinite simulation anyways but if you're trying to apply reinforcement learning to situations where you cannot be so sample inefficient like the
physical world then it's not practical to require 10 million 50 million or even 100 million frames of training like some of these research papers show the article had many more points so you
could go check that out if you're interested but the main point here is to talk about some key subfields of RL that are still in continuous development and will hopefully help resolve some of
these problems I'll specifically focus on two subfields model-based reinforcement learning and imitation learning slash inverse reinforcement learning the first one is model-based reinforcement learning this refers to that world model function that we talked about the one that we assumed we do not have access to throughout the whole
video it's a very complex and difficult topic that takes way too much time to cover in this video but essentially a world model can help you predict the next state and reward if you take a
certain action in a certain State meaning that it encodes some sort of understanding of how the world Works which allows you to imagine and simulate how things will turn out model-free
reinforcement learning which we have been covering up until now I think is not a very practical way of learning for example let's say I'm doing a task where
if this tennis ball touches the table I lose points without a world model without any understanding of the environment it's basically like hm I
wonder what happens if I let go of the ball here oh I lose points well what about if I let go of the ball here okay I also lose points well what about
here oh okay I also lose points and then maybe after 300 trials I'll figure out that if I let go of the ball above this
half of the table I'll lose points and then I'll need another 300 trials to figure out what happens if it's this half of the table now as a human if you learn like this without basic understanding of the environment of stuff like contact between objects inertia and gravity you look stupid and if you think about it that makes it all
the more impressive that reinforcement learning is able to do all these tasks without a world model without any understanding of the environment but that's also why it takes 10 million 50
million 100 million frames the next time you do any everyday task take notice that a world model is essential to understanding what's going to happen to a t-shirt when you fold it which way
you're going to fall when you lose your balance how the water flows in a cup when you fill it up Etc so for model-based reinforcement learning in order to
obtain a world model there's two methods a world model can either be handcrafted and given to the agent like if we already know the physics in an environment or if we already know the
rules of a board game or it can be learned by the agent at the same time that it learns a policy or value function and then once you have the
world model you can also use it in two main ways you can either just train as usual but with imagined simulated experiences from the world model kind
of like that study which found that people were able to improve at shooting free throws by imagining themselves shooting free throws or you can use the world model to plan ahead while you're
actually doing the task by simulating different possibilities and searching for the best option a way of doing this is Monte Carlo tree search Monte Carlo meaning simulated tree search meaning considering how different possibilities of the future may branch out and searching for the best one
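To give a flavour of "plan by simulating different possibilities and picking the best one", here is a toy random-shooting planner over a learned world model. Everything here, including the world-model interface, is a hypothetical placeholder; real systems like MuZero use learned models together with far more sophisticated search such as Monte Carlo tree search.

```python
import torch
import torch.nn as nn

# Hypothetical learned world model: (state, action) -> (predicted next state, predicted reward).
class WorldModel(nn.Module):
    def __init__(self, state_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + n_actions, 128), nn.ReLU(),
                                 nn.Linear(128, state_dim + 1))
        self.n_actions = n_actions

    def forward(self, state, action):
        one_hot = torch.zeros(self.n_actions)
        one_hot[action] = 1.0
        out = self.net(torch.cat([state, one_hot]))
        return out[:-1], out[-1]          # predicted next state, predicted reward

def plan(model, state, horizon=10, n_candidates=64, gamma=0.99):
    """Random-shooting planning: imagine many action sequences inside the model
    and return the first action of whichever sequence looked best."""
    best_return, best_first_action = -float("inf"), 0
    for _ in range(n_candidates):
        s, total, first = state, 0.0, None
        for t in range(horizon):
            a = torch.randint(model.n_actions, ()).item()
            first = a if first is None else first
            s, r = model(s, a)                 # imagined transition, no real environment used
            total += (gamma ** t) * r.item()
        if total > best_return:
            best_return, best_first_action = total, first
    return best_first_action
```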
Google DeepMind has some good examples of model-based RL methods developing over time from AlphaGo to AlphaGo Zero to AlphaZero to MuZero over time they've made them able to do more and more tasks expanding from just go to other board games like chess and shogi and then to Atari games as well the first three are given a model and only use the model for planning the last one MuZero learns a model by itself and uses the model for both learning
from imagined experiences and planning another example is the Dreamer series with Dreamer and then DreamerV2 and then DreamerV3 these ones learn a model by themselves and use the model only for learning from imagined experiences and not for planning the most recent version DreamerV3 was able
to obtain diamonds in a modified version of Minecraft so that is very cool as you may imagine world models are very complicated and advanced so we won't go into them too much here the second key field of RL actually two closely related fields with a similar goal is imitation learning and inverse
reinforcement learning which aim to learn from a so-called expert's demonstrations if we think about many day-to-day tasks such as tying shoelaces folding a shirt or even shooting a basketball almost all of them are usually learned by watching someone else demonstrate you wouldn't expect a 3-year-old to learn how to tie shoelaces
just through trial and error so if we want robots to one day learn tasks of similar complexity it would be great if we could somehow teach through demonstration the earliest attempts to
achieve this goal were through something called imitation learning which is trying to learn a policy a mapping from states to actions not from a reward
signal but just from example trajectories from someone who supposedly knows what they're doing so for example if I want to teach an agent to drive a
car around a racetrack I'll just drive it around the racetrack myself a couple times and then record the states and actions that occurred as training data
now notice how there's no need to design a reward function here which is actually a pretty great feature because designing reward functions can actually be quite
difficult to get the agent to do exactly what you want it to do for example if I'm not careful and I just give the agent a reward for passing the finish line it might end up just going back and forth across that finish line to collect reward instead of actually going around the racetrack and that's not what I wanted it to do if we think about an even more complex task of just generally
driving safely changing lanes and signaling turns Etc how do you even Define the reward function for that these are examples of where an expert's demonstration could be greatly useful
anyways without the reward function the simplest way of learning the behavior within imitation learning is to learn a policy just from the generated
trajectories such as with a table or a neural network and this is called behavioral cloning a minimal sketch of the idea is below
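Behavioural cloning really is just supervised learning on the recorded (state, action) pairs. Here is a minimal sketch, assuming the expert dataset already exists as tensors and that actions are discrete; the shapes and network are illustrative choices only.

```python
import torch
import torch.nn as nn

# Assumed to exist: expert_states of shape (N, state_dim), expert_actions of shape (N,)
# (integer action labels) recorded while the demonstrator drove around the track.
def behavioral_cloning(expert_states, expert_actions, state_dim=8, n_actions=4, epochs=100):
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()          # "predict the expert's action" as classification
    for _ in range(epochs):
        logits = policy(expert_states)
        loss = loss_fn(logits, expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```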
however the issue with that is it works fine when the agent is able to perfectly follow the example trajectory but if it veers ever so slightly off course it won't know what to do because the examples didn't cover that and it'll
just keep deviating more and more to solve that there's an improvement upon behavioral cloning which is still within the realm of imitation learning called dataset aggregation or DAgger for short machine learning people do love coming up with goofy acronyms in this case the agent will purposefully add some noise to the example trajectories to generate instances where it has veered off course a bit and then a human has to go and annotate what it should do in those instances this makes for a more robust type of imitation learning but of course the downside is it's very tedious to have to have humans go and label all that data a rough sketch of that loop is below
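One way to read that dataset-aggregation loop as code, following the video's noise-injection description (canonical DAgger instead rolls out the current learner policy and has the expert label the states it visits). The expert-labelling function is a placeholder for the human or scripted expert, and the shapes are assumed to match the behavioural-cloning sketch above.

```python
import torch

def dagger(expert_states, expert_actions, expert_label_fn, rounds=5, noise=0.1):
    """Hypothetical DAgger-style loop: perturb states to simulate veering off
    course, ask the expert what to do there, add those labels to the dataset,
    and retrain with behavioural cloning."""
    states, actions = expert_states, expert_actions
    for _ in range(rounds):
        policy = behavioral_cloning(states, actions)             # reuse the sketch above
        off_course = states + noise * torch.randn_like(states)   # states slightly off the demo
        new_actions = expert_label_fn(off_course)                # human/expert annotates these
        states = torch.cat([states, off_course])
        actions = torch.cat([actions, new_actions])
    return policy
```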
despite that imitation learning has been used to do some pretty cool stuff for example this paper shows a human demonstrating to a robot how to catch a ball in a cup and you can see that at first when the robot
tries to imitate that it's not successful but over time it gets closer and closer and finally after 100 trials is able to catch the ball in the
cup I also found this example of a robot dog learning to walk by imitating a real dog and a surgery robot learning to suture by imitating human footage in
contrast to imitation learning inverse reinforcement learning or inverse RL refers to methods that rather try to learn what the reward function is
supposed to be it's kind of like observing some behavior and rather than trying to mindlessly follow it actually try to figure out what the underlying
goal is and what outcomes are good or bad I didn't find as many cool demos for inverse RL as for imitation learning but one paper showed a method called inverse
Q learning which focuses on learning the Q function from expert demonstration and subsequently the reward function the agent seems to be able to play Atari
games perform movement in Minecraft control a simulated robot arm Etc just from watching 20 to 30 expert demos also pretty cool so there we have it to recap this video we first went through what a Markov decision process is and then applied it to a grid maze example we solved this
grid maze example with a Monte Carlo method and then showed how we can improve upon that with temporal difference using a method called Q learning then we added on a neural
network to Q learning making deep Q networks in order to learn lunar lander now going from discrete to continuous state spaces after that we went through
the math of policy gradients and with policy gradients we were also able to handle continuous action spaces not just continuous state spaces then we talked about how aspects
of reinforcement learning show correlations to neuroscience and certain parts of the brain and finally we discussed the current shortcomings of RL and what directions of future
research could help with it I hope you found value from this video if you did and you're interested in more videos on related topics of AI and Robotics consider supporting the channel on
patreon where I'll put behind the scenes clips and previews Link in description I hope you learned a lot about reinforcement learning I hope you found the field fascinating and I hope one day
we will have robots that can tie our shoes and fold our clothes with that being said see you next time