Bellman Equation - Policy Iteration


This morning, let's talk about how to find the optimal policy for MDP (Markov Decision Process) problems. One important thing to note is that these problems can generally be divided into two types: Model-Based and Model-Free.

  • Model-Based: learn or use a model of the environment (i.e., the transition and reward functions are available). Famous algorithms include Dyna-Q, AlphaZero (Monte Carlo Tree Search), and World Models.
  • Model-Free: learn without explicitly modeling the environment; instead, a policy is learned directly through trial and error. Famous algorithms include Q-learning, SARSA, PPO, A2C, and so on (see the sketch after this list).
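
To make the distinction concrete, here is a minimal sketch contrasting the two kinds of update, assuming tabular states and actions; the names (P, R, Q, alpha) and data layouts are illustrative assumptions, not from any particular library.

import numpy as np

# Model-based: P(s'|s,a) and R(s,a,s') are known, so values can be backed up
# by planning, without ever interacting with the environment.
def model_based_backup(V, P, R, s, a, gamma=0.99):
    # P[s][a] is assumed to be a list of (next_state, probability) pairs.
    return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a])

# Model-free: P and R are unknown; Q is updated from sampled transitions
# (s, a, r, s2) gathered by trial and error (the standard Q-learning update).
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
    return Q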

Awesome, now we are clear about the difference between Model-Based and Model-Free approaches.

Policy Iteration Introduction

To find the optimal policy, we use Policy Iteration, a method commonly used in reinforcement learning (RL). In brief, it iteratively improves a policy by alternating between Policy Evaluation and Policy Improvement until it converges to the optimal policy. Let's introduce these two components first.

  • Policy Evaluation: compute $V_π(s)$ for a given policy $π$
    $$V_π(s) = \sum_a{π(a|s)}\sum_{s'}P(s'|s,a)[R(s,a,s') + γV_π(s')]$$
  • Policy Improvement: improve the policy by selecting, for each state, the action that maximizes the expected value.
    $$π'(s) = \arg\max_a\sum_{s'}P(s'|s,a)[R(s,a,s') + γV_π(s')]$$

The algorithm alternates between these two steps until the policy converges to the optimal policy, whose value function satisfies the Bellman Optimality Equation below.

$$V^{*}(s) = \max_a\sum_{s'}P(s'|s,a)[R(s,a,s')+γV^{*}(s')]$$

As you can see, the Bellman Expectation Equation and the Bellman Optimality Equation are used in policy evaluation and policy improvement, respectively. This enables policy iteration to both evaluate how good a policy is and improve it by selecting actions that maximize future rewards. Moreover, policy iteration is a part of Dynamic Programming (DP), as it exhibits key DP characteristics such as breaking problems into subproblems and combining their solutions.
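
Before moving to the grid-world example, here is a minimal, generic sketch of the two updates for a tabular MDP, written to mirror the formulas above. The data layout (a stochastic policy pi[s][a], a transition table P[s][a] of (next_state, probability) pairs, and a reward function R(s, a, s2)) is an assumption made purely for illustration.

def evaluate_policy_sweep(V, pi, P, R, states, gamma=0.99):
    # One sweep of the Bellman expectation backup over all states.
    new_V = dict(V)
    for s in states:
        new_V[s] = sum(
            pi[s][a] * sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a])
            for a in P[s]
        )
    return new_V

def improve_policy(V, P, R, states, gamma=0.99):
    # Greedy one-step lookahead: score every action, then keep the best one.
    new_pi = {}
    for s in states:
        action_values = {
            a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a])
            for a in P[s]
        }
        new_pi[s] = max(action_values, key=action_values.get)
    return new_pi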

Policy Iteration Implementation

Gridworld is a classic example used to demonstrate reinforcement learning (RL) concepts. Let’s use a 4×4 grid to illustrate how policy iteration works.

   0    1    2    3
+----+----+----+----+
|    |    |    |    |  0
+----+----+----+----+
|    |    |    |    |  1
+----+----+----+----+
|    |    |    |    |  2
+----+----+----+----+
|    |    |    | T  |  3
+----+----+----+----+

T stands for the goal (terminal) state of this task. The steps are listed below.

  1. Initialization
    • Initialize a random policy (e.g., move in any direction)
    • Initialize state-values arbitrarily (e.g., all 0 in this case)
  2. Policy Evaluation
    • Compute the value function for the current policy until it converges
  3. Policy Improvement
    • Improve the policy by choosing the best action based on $V(s)$
    • Check whether the policy has changed; the snippet below implements these steps on the 4×4 grid
import numpy as np

GRID_SIZE = 4
GAMMA = 0.99  # discount factor
THETA = 1e-3  # convergence threshold for policy evaluation

# State values, initialized to zero
V = np.zeros((GRID_SIZE, GRID_SIZE))

# Initial policy: move up ('U') in every state
policy = np.full((GRID_SIZE, GRID_SIZE), 'U')
# Terminal state has no policy
policy[3, 3] = 'X'

actions = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
all_actions = list(actions.keys())

def step(state, action):
    # The terminal state is absorbing; this branch is never reached in practice
    # because the terminal state is skipped in both loops below.
    if state == (3, 3):
        return state, 10
    act = actions[action]
    # Clamp the move so the agent stays inside the 4x4 grid.
    next_state = (
        min(max(state[0] + act[0], 0), 3),
        min(max(state[1] + act[1], 0), 3)
    )
    reward = -1  # every non-terminal step costs -1
    return next_state, reward

iteration = 0
is_stable = False
while not is_stable:
    iteration += 1
    print(f"\nIteration {iteration}")
    
    # Policy Evaluation
    eval_iter = 0
    while True:
        delta = 0
        eval_iter += 1
        for i in range(GRID_SIZE):
            for j in range(GRID_SIZE):
                if (i, j) == (3, 3):
                    continue
                v = V[i, j]
                a = policy[i, j]
                (ni, nj), r = step((i, j), a)
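                # In-place Bellman expectation backup; the policy is deterministic,
                # so the sum over actions collapses to the single chosen action.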
                V[i, j] = r + GAMMA * V[ni, nj]
                delta = max(delta, abs(v - V[i, j]))
        if delta < THETA:
            break
    
    print("\nValue function:")
    print(V)
    print("\nPolicy(Before):")
    print(policy)
    
    # Policy Improvement
    is_stable = True
    policy_changes = 0
    for i in range(GRID_SIZE):
        for j in range(GRID_SIZE):
            if (i, j) == (3, 3):  # Skip terminal state
                continue
            old_action = policy[i, j]
            action_values = {}
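            # Greedy one-step lookahead: score each action with the current V.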
            for a in all_actions:
                (ni, nj), r = step((i, j), a)
                action_values[a] = r + GAMMA * V[ni, nj]
            best_action = max(action_values, key=action_values.get)
            policy[i, j] = best_action
            if best_action != old_action:
                is_stable = False
                policy_changes += 1
    
    print(f"\nPolicy changes in this iteration: {policy_changes}")

    print("\nPolicy(After):")
    print(policy)
    
    if iteration > 100:  # Safety check to prevent infinite loops
        print("Warning: Maximum iterations reached")
        break
    
    print("-" * 100)

The code snippet shows how Policy Iteration, which combines Policy Evaluation and Policy Improvement, works on a classic RL grid-world task.

For Policy Evaluation, the value function after iterations 1 and 2 looks like this:

Iteration 1:
[[-99.90167837 -99.90167837 -99.90167837 -99.90167837]
 [-99.90266158 -99.90266158 -99.90266158 -99.90266158]
 [-99.90363497 -99.90363497 -99.90363497 -99.90363497]
 [-99.90459862 -99.90459862 -99.90459862   0.        ]]

Iteration 2:
[[-99.90363497 -99.90363497 -99.90363497 -99.90363497]
 [-99.90459862 -99.90459862 -99.90459862 -99.90459862]
 [-99.90555263 -99.90555263 -99.90555263  -1.        ]
 [-99.90649711 -99.90649711  -1.           0.        ]]

For Policy Improvement, the policy before and after improvement for iterations 1 and 2 looks like this:

Iteration 1:
Policy(Before):
[['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'X']]

Policy changes in this iteration: 2

Policy(After):
[['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'D']
 ['U' 'U' 'R' 'X']]

Iteration 2:

Policy(Before):
[['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'D']
 ['U' 'U' 'R' 'X']]

Policy changes in this iteration: 3

Policy(After):
[['U' 'U' 'U' 'U']
 ['U' 'U' 'U' 'D']
 ['U' 'U' 'D' 'D']
 ['U' 'R' 'R' 'X']]

The procedure repeats in subsequent loops until the policy converges. As you can see, the evaluation stage updates the value of each state, while the improvement stage updates the action chosen for each state.

Just think of navigating a maze with a map you can gradually improve. At first, you follow a rough idea of which way to go (initial policy). After trying it out, you evaluate how good your path was (policy evaluation). Then, based on what you learned, you redraw parts of the map to find better directions (policy improvement). You keep repeating this process—trying, learning, and improving—until your map leads you out of the maze in the most efficient way possible.

Conclusion

Now that we have a basic understanding of policy iteration, we've opened the door to exploring more advanced reinforcement learning algorithms. Before diving deeper into those, we'll also take a look at value iteration—something we'll cover another day.

CSY

Nagoya, Japan