
L3: Bellman Optimality Equation (P1-Motivating example)—Mathematical Foundations of RL

By WINDY Lab

Summary

## Key takeaways

- **Bellman Optimality: Policy Improvement Key**: The current policy is not good because at s1 it moves right into the forbidden area; we improve it by selecting the action with the highest action value, such as a3 (moving down), to avoid the forbidden area and expect more rewards. [03:18], [03:33]
- **Action Values Guide Better Policies**: Action values evaluate actions; choosing the one with the highest action value, such as a3 with value 9 at s1, leads to a better policy because it means expecting more rewards. [03:14], [04:35]
- **Greedy Improvement Iterates to Optimal**: Even if other states like s2, s4, and s3 have suboptimal actions, repeatedly choosing the highest-valued action in every state and iterating refines the policy until it converges to the optimal one. [05:12], [05:25]
- **Bellman Optimality Equation Enables This**: The mathematical tool for analyzing this iterative process of selecting the highest-valued actions is the Bellman optimality equation. [05:49], [01:04]
- **Compute qπ via vπ Example**: With γ = 0.9, after solving the Bellman equation for vπ(s), compute qπ(s1, a3) as the immediate reward 0 plus γ·vπ(s3), yielding 9, the highest value among the actions. [02:22], [02:51]

Topics Covered

  • Greedy improvement yields better policy
  • Iteration converges to optimal policy

Full Transcript

Hello everyone, welcome back to my course on reinforcement learning. This lecture will introduce the concept of optimal policy and the Bellman optimality equation. This figure shows the structure of our course. In the previous lecture, we studied the Bellman equation. In this lecture, we will study the Bellman optimality equation, a special case of the Bellman equation that plays a crucial role in reinforcement learning. I strongly suggest you pay special attention

to two core concepts and a fundamental tool in today's lecture: the optimal state value, the optimal policy, and the Bellman optimality equation. We know that the goal of reinforcement learning is to find optimal policies, so it is very important to learn the concepts in this lecture. This is the outline of this lecture. It might seem like a lot, but each part is relatively short. I'll start with a brief introduction. First, I will present motivating examples,

followed by definitions of the optimal policy and the optimal state value. Then we will thoroughly explore the Bellman optimality equation from Sections 3 to 8, including what this equation looks like, how to solve it, its solutions, its properties, and more. With the Bellman optimality equation, we can derive the optimal policy and also discuss its properties toward the end of the lecture. Let's begin with an example.

We've seen this example several times before, and it is quite straightforward. It involves a policy π represented by green arrows. Our task essentially involves two steps: first, solving the Bellman equation to determine the state values; second, calculating the action values. On this basis, we will introduce an interesting phenomenon. First, the Bellman equation. This is also a revisit of our last lecture.

It allows us to obtain an equation for every state, for instance vπ(s1). Starting from s1 and moving right, I receive an immediate reward of -1; then comes γ, and because I jump to s2, the term that follows is vπ(s2). That is, vπ(s1) = -1 + γ·vπ(s2). Similar equations can be obtained for the other states. This set of equations can be solved in many ways, since it forms a system of linear equations.

If we set γ to 0.9, we can calculate these state values. You can do the calculations on your own; I directly present the results here. After determining the state values, we can proceed to solve for the action values. How do we do this? Let's consider s1 as an example. There are five actions, and each action is associated with a value. For instance, let's examine action a3: if I move down from s1 by taking action a3,
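As a minimal sketch of this step, the linear system can be solved by fixed-point iteration. The rewards and transitions below are assumptions reconstructed from the numbers quoted in the lecture (they are consistent with vπ(s1) = -1 + γ·vπ(s2) and, later, qπ(s1, a3) = 9), not necessarily the lecture's exact grid:

```python
# Assumed setup: four states s1..s4; under the current policy,
# s1 -> s2 (reward -1, entering the forbidden area),
# s2 -> s4, s3 -> s4, s4 -> s4 (each with reward 1).
gamma = 0.9
rewards = {"s1": -1.0, "s2": 1.0, "s3": 1.0, "s4": 1.0}          # r_pi(s)
next_state = {"s1": "s2", "s2": "s4", "s3": "s4", "s4": "s4"}    # deterministic policy

# Bellman equation: v(s) = r_pi(s) + gamma * v(next state).
# Solve by fixed-point iteration; for gamma < 1 this is a contraction,
# so it converges to the unique solution.
v = {s: 0.0 for s in rewards}
for _ in range(1000):
    v = {s: rewards[s] + gamma * v[next_state[s]] for s in rewards}

print({s: round(val, 4) for s, val in sorted(v.items())})
# -> {'s1': 8.0, 's2': 10.0, 's3': 10.0, 's4': 10.0}
```

These are the state values used in the rest of the example; in particular, vπ(s3) = 10 is what makes qπ(s1, a3) come out to 9 below.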

what would happen?

First, I would receive an immediate reward of 0, followed by γ, and then I transition to s3, which gives vπ(s3). Substituting the specific values for γ and vπ(s3), we get qπ(s1, a3) = 0 + γ·vπ(s3) = 9. Similarly, the other actions can be calculated. Why do I highlight a3 in blue? It's because, as you can see, action a3 actually corresponds to the highest action value. Now we ask an important question:
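This computation can be sketched directly. Only a3's numbers (reward 0, vπ(s3) = 10, q = 9) are quoted in the lecture; the (reward, next state) pairs for the other four actions are illustrative assumptions:

```python
gamma = 0.9
# State values v_pi from solving the Bellman equation with gamma = 0.9.
v = {"s1": 8.0, "s2": 10.0, "s3": 10.0, "s4": 10.0}

# Assumed (immediate reward, next state) for each of the five actions at s1.
actions_at_s1 = {
    "a1": (-1.0, "s1"),  # up: bounce off the boundary, stay at s1
    "a2": (-1.0, "s2"),  # right: enter the forbidden area
    "a3": (0.0, "s3"),   # down (the numbers quoted in the lecture)
    "a4": (-1.0, "s1"),  # left: bounce off the boundary, stay at s1
    "a5": (0.0, "s1"),   # stay
}

# q_pi(s1, a) = immediate reward + gamma * v_pi(next state)
q = {a: r + gamma * v[s2] for a, (r, s2) in actions_at_s1.items()}
print(q["a3"])            # 9.0, the highest action value at s1
print(max(q, key=q.get))  # 'a3'
```

The `max` over action values is exactly the greedy selection discussed next.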

the current policy is not very good, because at s1 it moves to the right, leading into the forbidden area. So the question here is: while the policy is not good, how can we improve it to get a better one?

The answer lies in the action values. Let's look at the current policy, which can be expressed like this: when a equals a2 (moving right), its probability is 1, and for all other actions it is 0. Under this policy, we've calculated the action values and discovered that a3 corresponds to the highest action value. Could we choose a3 as a new policy?

Specifically, this new policy at s1 assigns a probability of 1 to action a* and a probability of 0 to all other actions, indicating that the new policy would definitely choose a*. What is a*? It corresponds to the action with the highest action value, which in this example is a3. Therefore, while the current policy moves right, the new policy πnew would move down. As you can see, moving right is not as good as moving down,

since moving down avoids the forbidden area. Why does choosing the action with the highest action value lead to a better policy?

Intuitively, this is because action values can be used to evaluate actions. If I select an action with a high action value, it means I can expect to receive a lot of rewards, and consequently the policy will be better. This is quite easy to understand intuitively. However, mathematically it is not so straightforward. Why? Because this example is quite simple: for states like s2, s4, and s3, the optimal actions are already chosen, so in this case choosing the action with the highest action value at s1 leads us downward. But what if that is not the case?

For example, if at s2 the action is to move up, at s4 to move down, and at s3 to move left, meaning the policy is not optimal in the other states, can we still obtain the optimal action at s1 by calculating the highest action value?

There's some uncertainty here. But what I can tell you is that by repeatedly performing this process and iterating, we will eventually reach an optimal policy. That is, for every state, we choose the action with the highest action value to get a new policy, and then continue iterating to refine the policy until it converges to the optimal one. The mathematical tool for analyzing this process

is precisely the Bellman optimality equation.
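The iterative process just described, greedy improvement in every state repeated until convergence, can be sketched as value iteration, which repeatedly applies the Bellman optimality operator. The deterministic grid model below is an illustrative assumption (its states, actions, and rewards echo the lecture's example but are not its exact layout):

```python
gamma = 0.9

# Assumed model: for each state, the available actions and their
# (immediate reward, next state). s2 is the forbidden area; s4 is the target.
model = {
    "s1": {"right": (-1.0, "s2"), "down": (0.0, "s3"), "stay": (0.0, "s1")},
    "s2": {"left": (0.0, "s1"), "down": (1.0, "s4"), "stay": (-1.0, "s2")},
    "s3": {"right": (1.0, "s4"), "up": (0.0, "s1"), "stay": (0.0, "s3")},
    "s4": {"stay": (1.0, "s4"), "up": (-1.0, "s2"), "left": (0.0, "s3")},
}

# Value iteration: repeatedly apply the Bellman optimality operator
#   v(s) <- max_a [ r(s, a) + gamma * v(s') ],
# a contraction for gamma < 1, so the values converge to v*.
v = {s: 0.0 for s in model}
for _ in range(1000):
    v = {s: max(r + gamma * v[s2] for r, s2 in acts.values())
         for s, acts in model.items()}

# The optimal policy is greedy with respect to the converged values.
policy = {s: max(acts, key=lambda a: acts[a][0] + gamma * v[acts[a][1]])
          for s, acts in model.items()}
print(policy)  # {'s1': 'down', 's2': 'down', 's3': 'right', 's4': 'stay'}
```

In this sketch the converged policy moves down at s1, mirroring the lecture's improvement from "right" to "down"; the next sections make this iteration precise via the Bellman optimality equation.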
