Bellman Equation - Value Iteration


One commonly used algorithm for solving MDP problems, alongside Policy Iteration, is Value Iteration. It is similar to the approach described in Bellman Equation - Policy Iteration, but differs in its internal process.

Value Iteration Introduction

Value Iteration is a dynamic programming (DP) algorithm used to compute the optimal value function of an MDP (Markov Decision Process). Unlike Policy Iteration—which alternates between policy evaluation and policy improvement—Value Iteration combines both steps: it directly updates the value function under the assumption that the optimal action is taken at each step. The process is outlined below, followed by a small numeric example of a single backup.

  1. Initialization: set $V_0(s) = 0$ for all states
  2. Iteration: update the value function for every state using the Bellman Optimality Equation
    $$V_{k+1}(s) = \max_{a}[R(s,a) + γ\sum_{s'\in S}P(s'|s,a)V_k(s')]$$
  3. Check Convergence: measure the maximum change in value across all states and stop once it falls below a small threshold ε
    $$\Delta = \max_{s}|V_{k+1}(s) - V_{k}(s)|$$
  4. Extract Optimal Policy: once the convergence condition is met, return the optimal policy
    $$π^{*}(s) = \arg\max_{a}[R(s,a) + γ\sum_{s'\in S}P(s'|s,a)V^{*}(s')]$$
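To make step 2 concrete, here is a minimal sketch of a single Bellman backup on a hypothetical two-state, two-action MDP (the rewards and transition probabilities below are made up purely for illustration):

import numpy as np

# Toy example: two states, two actions, gamma = 0.9 (all values are illustrative only).
# R[s, a] is the immediate reward, P[s, a, s'] the transition probability.
GAMMA = 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

V = np.zeros(2)  # step 1: V_0(s) = 0 for all states

# Step 2: one backup, V_1(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V_0(s') ]
q = R + GAMMA * (P @ V)   # q[s, a], shape (2, 2)
V_next = q.max(axis=1)    # maximize over actions
print(V_next)             # [1. 2.] — with V_0 = 0, only the immediate rewards matter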

Value Iteration can be viewed as a fixed-point iteration method used to solve the Bellman equation. It relies on the principle that the Bellman operator is a contraction mapping in the space of value functions. We’ll explore the Bellman operator in more detail in another post.

$$T(V)(s) = \max_a[R(s,a) + γ\sum_{s'\in S}P(s'|s,a)V(s')]$$
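As a quick numerical illustration of the contraction property, applying the operator T to two arbitrary value functions shrinks the distance between them by at least a factor of γ. This sketch reuses the same made-up two-state MDP from above:

import numpy as np

GAMMA = 0.9
# Same illustrative two-state MDP as above (values are made up).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

def bellman_operator(V):
    """T(V)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return (R + GAMMA * (P @ V)).max(axis=1)

# Two arbitrary value functions.
V1 = np.array([10.0, -3.0])
V2 = np.array([0.0, 5.0])

before = np.max(np.abs(V1 - V2))
after = np.max(np.abs(bellman_operator(V1) - bellman_operator(V2)))
print(before, after, after <= GAMMA * before)  # the gap shrinks by at least a factor of gamma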

The convergence properties are listed below; a short numerical check of these bounds follows the list.

  1. Convergence: with γ < 1, Value Iteration is guaranteed to converge to the optimal value function $V^{*}$.
  2. Convergence Rate: the error after k iterations is bounded by $γ^k \cdot$ initial error, meaning the algorithm converges exponentially but may still require many iterations for high precision.
  3. Stopping Criterion: stopping once $\Delta < ε$ ensures that the value function is within $\dfrac{ε}{1 - γ}$ of the true $V^{*}$.
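As a rough sanity check of these bounds (the initial error of 10 is a hypothetical number chosen for illustration), with γ = 0.99 the factor $γ^k$ decays slowly, so many sweeps may be needed before the threshold is reached:

import math

GAMMA = 0.99
EPSILON = 1e-3
initial_error = 10.0  # hypothetical initial distance ||V_0 - V*||

# Smallest k with gamma^k * initial_error <= epsilon (from the convergence-rate bound).
k = math.ceil(math.log(EPSILON / initial_error) / math.log(GAMMA))
print(k)  # 917 iterations for gamma = 0.99

# Stopping with threshold epsilon guarantees the result is within epsilon / (1 - gamma) of V*.
print(EPSILON / (1 - GAMMA))  # 0.1 for gamma = 0.99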

Value Iteration Implementation

Below is a simple implementation on a 4×4 grid world: every move costs a reward of -1, and the bottom-right corner (3, 3) is the terminal state.

import numpy as np

GRID_SIZE = 4   # 4x4 grid world
GAMMA = 0.99    # discount factor
THETA = 1e-3    # convergence threshold

V = np.zeros((GRID_SIZE, GRID_SIZE))           # value function, initialized to 0
policy = np.full((GRID_SIZE, GRID_SIZE), 'U')  # greedy policy, extracted at the end
policy[3, 3] = 'X'                             # terminal state marker
actions = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
all_actions = list(actions.keys())

def step(state, action):
    """Deterministic transition: move within the grid, -1 reward per step."""
    if state == (3, 3):  # terminal state (never reached here, since the sweeps skip it)
        return state, 10
    act = actions[action]
    # Clamp the move so the agent stays inside the grid.
    next_state = (
        min(max(state[0] + act[0], 0), GRID_SIZE - 1),
        min(max(state[1] + act[1], 0), GRID_SIZE - 1),
    )
    reward = -1
    return next_state, reward

def value_iteration():
    iteration = 0
    while True:
        iteration += 1
        print(f"\nIteration {iteration}")
        
        delta = 0
        for i in range(GRID_SIZE):
            for j in range(GRID_SIZE):
                if (i, j) == (3, 3):
                    continue
                
                v = V[i, j]
                # Bellman optimality backup: take the max over all actions.
                action_values = {}
                for a in all_actions:
                    (ni, nj), r = step((i, j), a)
                    action_values[a] = r + GAMMA * V[ni, nj]

                V[i, j] = max(action_values.values())
                delta = max(delta, abs(v - V[i, j]))
        
        print("\nValue function:")
        print(V)
        
        if delta < THETA:
            break
        
        if iteration > 100:
            print("Warning: Maximum iterations reached")
            break

    # Extract the greedy policy from the converged value function.
    for i in range(GRID_SIZE):
        for j in range(GRID_SIZE):
            if (i, j) == (3, 3):
                continue
            
            action_values = {}
            for a in all_actions:
                (ni, nj), r = step((i, j), a)
                action_values[a] = r + GAMMA * V[ni, nj]
            best_action = max(action_values, key=action_values.get)
            policy[i, j] = best_action
    
    print("\nFinal Policy:")
    print(policy)
    return V, policy

if __name__ == "__main__":
    V, policy = value_iteration()

If we compare this code to the snippet used in Bellman Equation - Policy Iteration, the difference becomes clear: in Value Iteration, the policy is extracted only once at the end, rather than being improved in every iteration. Running the code should produce output similar to the following:

Iteration 1

Value function:
[[-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1.  0.]]

Iteration 2

Value function:
[[-1.99 -1.99 -1.99 -1.99]
 [-1.99 -1.99 -1.99 -1.99]
 [-1.99 -1.99 -1.99 -1.  ]
 [-1.99 -1.99 -1.    0.  ]]

Iteration 3

Value function:
[[-2.9701 -2.9701 -2.9701 -2.9701]
 [-2.9701 -2.9701 -2.9701 -1.99  ]
 [-2.9701 -2.9701 -1.99   -1.    ]
 [-2.9701 -1.99   -1.      0.    ]]

Iteration 4

Value function:
[[-3.940399 -3.940399 -3.940399 -2.9701  ]
 [-3.940399 -3.940399 -2.9701   -1.99    ]
 [-3.940399 -2.9701   -1.99     -1.      ]
 [-2.9701   -1.99     -1.        0.      ]]

Iteration 5

Value function:
[[-4.90099501 -4.90099501 -3.940399   -2.9701    ]
 [-4.90099501 -3.940399   -2.9701     -1.99      ]
 [-3.940399   -2.9701     -1.99       -1.        ]
 [-2.9701     -1.99       -1.          0.        ]]

Iteration 6

Value function:
[[-5.85198506 -4.90099501 -3.940399   -2.9701    ]
 [-4.90099501 -3.940399   -2.9701     -1.99      ]
 [-3.940399   -2.9701     -1.99       -1.        ]
 [-2.9701     -1.99       -1.          0.        ]]

Iteration 7

Value function:
[[-5.85198506 -4.90099501 -3.940399   -2.9701    ]
 [-4.90099501 -3.940399   -2.9701     -1.99      ]
 [-3.940399   -2.9701     -1.99       -1.        ]
 [-2.9701     -1.99       -1.          0.        ]]

Final Policy:
[['D' 'D' 'D' 'D']
 ['D' 'D' 'D' 'D']
 ['D' 'D' 'D' 'D']
 ['R' 'R' 'R' 'X']]

The final policy is the set of greedy actions derived from the value function after it has converged over several iterations; it is extracted once, at the end of the process, from the optimal value function. Because each sweep only updates values and involves no separate policy evaluation, a single iteration of Value Iteration is cheaper than one of Policy Iteration, although it typically needs more iterations to converge. The key differences between the two are summarized below:

| Aspect      | Value Iteration                            | Policy Iteration                               |
|-------------|--------------------------------------------|------------------------------------------------|
| Idea        | Extract the policy once at the end.        | Keep improving the policy until it is optimal. |
| Iteration   | Update values using a max over actions.    | Alternate policy evaluation and improvement.   |
| Cost        | Lower per iteration (only value updates).  | Higher per iteration (full policy evaluation). |
| Convergence | Needs more iterations.                     | Converges in fewer iterations.                 |
| Policy      | No explicit policy until the end.          | Updated in every iteration.                    |
| Usage       | Good for large state spaces.               | Good when few iterations are affordable.       |

Conclusion

Value Iteration is a practical, easy-to-implement algorithm for solving MDPs, converging to the optimal value function via the contraction property of the Bellman operator. It works well when the dynamics are known, as in many robotics applications, but it can be slow for large state spaces. In future posts, we will explore other algorithms, such as Q-learning combined with deep neural networks (DNNs). The underlying concepts, however, remain the same and are fundamentally important.

CSY

Nagoya, Japan