MDP(Markov Decision Process)


Today, let's talk about the cornerstone of RL. You can refer to this post to get a basic understanding of RL.

Introduction to MDP

The Markov Decision Process (MDP) is the foundation of RL, and it is introduced in Chapter 3 of Reinforcement Learning: An Introduction. So, what is an MDP, really? An image is worth a thousand words.

Simplified MDP with Financial Market - CSY

The image above depicts how an MDP works, even though it is not strictly formal. But what does it really indicate? An MDP is nothing more than a mathematical framework that guides us in modeling a decision-making process. Its core components are listed below, with a small code sketch after the list:

  1. State(S): a finite set of possible situations an agent can be in.
  2. Action(A): a finite set of choices an agent can take.
  3. Reward Function(R): immediate reward an agent receives.
  4. Transition Model(P): the probability of moving from one state(s) to the next state(s').
  5. Discount Factor(γ-gamma): a value between 0 and 1 that controls the importance of future rewards.
    • γ close to 0 means short-sighted
    • γ close to 1 means far-sighted
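
To make these five components concrete, here is a minimal sketch of a hypothetical two-state market MDP, loosely inspired by the figure above. All state names, actions, and numbers are made up purely for illustration.

```python
# A toy, hand-made MDP with the five components above.
# All names and numbers here are illustrative, not real data.

states = ["bull", "bear"]                 # State (S): finite set of states
actions = ["buy", "sell"]                 # Action (A): finite set of actions

# Transition Model (P): P[(s, a)] -> {s': probability}
P = {
    ("bull", "buy"):  {"bull": 0.8, "bear": 0.2},
    ("bull", "sell"): {"bull": 0.5, "bear": 0.5},
    ("bear", "buy"):  {"bull": 0.4, "bear": 0.6},
    ("bear", "sell"): {"bull": 0.2, "bear": 0.8},
}

# Reward Function (R): R[(s, a, s')] -> immediate reward
R = {
    ("bull", "buy", "bull"):  1.0, ("bull", "buy", "bear"): -1.0,
    ("bull", "sell", "bull"): 0.5, ("bull", "sell", "bear"): 0.0,
    ("bear", "buy", "bull"):  1.5, ("bear", "buy", "bear"): -0.5,
    ("bear", "sell", "bull"): 0.0, ("bear", "sell", "bear"): 0.2,
}

gamma = 0.9  # Discount Factor: closer to 1 -> far-sighted agent
```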

As you can see, the components of an MDP closely resemble those of RL. One could even say that RL derives from the MDP concept.

How an MDP Works: Goal and Policy

An MDP can be described simply as follows.

Goal: the primary goal in an MDP is to find an optimal policy.

Policy(π): a strategy that tells the agent what action to take in each state. Think of it as a mapping from states to actions ($π(s) → a$).

  • Deterministic Policy: always prescribes the same action in a given state.
  • Stochastic Policy: a probability distribution over possible actions.

Optimal Policy($π^{*}$): the policy that maximizes the expected cumulative discounted reward over time.
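
To make the distinction concrete, here is a minimal sketch of how the two kinds of policies might be represented in code. The states, actions, and probabilities are hypothetical and reuse the toy market MDP above.

```python
import random

# Deterministic policy: a plain mapping state -> action, pi(s) = a
deterministic_pi = {"bull": "buy", "bear": "sell"}

# Stochastic policy: a probability distribution over actions per state, pi(a|s)
stochastic_pi = {
    "bull": {"buy": 0.7, "sell": 0.3},
    "bear": {"buy": 0.2, "sell": 0.8},
}

def act(policy, state):
    """Pick an action: directly for a deterministic policy,
    by sampling for a stochastic one."""
    choice = policy[state]
    if isinstance(choice, dict):                       # stochastic case
        acts, probs = zip(*choice.items())
        return random.choices(acts, weights=probs, k=1)[0]
    return choice                                      # deterministic case

print(act(deterministic_pi, "bull"))  # always "buy"
print(act(stochastic_pi, "bear"))     # "sell" about 80% of the time
```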

The formal expressions are defined below.

Transition Probability: $p(s'|s,a) = Pr[S_{t+1}=s'|S_t=s,A_t=a]$
Reward Function: $r(s,a,s') = E[R_{t+1}|S_t=s,A_t=a,S_{t+1}=s']$

However, in RL, we typically focus on long-term rewards rather than immediate rewards from individual actions. As a result, policies in Markov Decision Processes (MDPs) are defined to maximize the cumulative reward, also known as the discounted reward sum. This objective is framed as an optimization problem.

Cumulative Reward: $G_t = R_{t+1} + γR_{t+2} + γ^2R_{t+3} + \cdots = \sum_{k=1}^{∞} γ^{k-1}R_{t+k}$
Value(Target) Function: $v_π(s) = E_π[G_t|S_t=s]$

The subscript π on the expectation E indicates the policy currently being followed (i.e., a probability distribution over actions). The goal is, of course, to find the optimal policy that yields the maximum possible reward — analogous to someone trying to maximize profit in the financial market.
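
To make the discounted sum tangible, here is a tiny sketch that computes $G_t$ from a finite list of rewards. The episode and its reward values are made up for illustration; for an infinite-horizon MDP the sum would be truncated or handled recursively.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    `rewards` holds R_{t+1}, R_{t+2}, ... for one finite episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

episode_rewards = [1.0, -1.0, 0.5, 2.0]            # hypothetical rewards
print(discounted_return(episode_rewards, gamma=0.9))
# 1.0 - 0.9 + 0.405 + 1.458 = 1.963
```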

Bellman Equation

So far, we have discussed the backbone of RL and its components in detail. Now let's dive into the Bellman Equation, which rewrites the value function in a recursive form. We derive the equation below.

$$G_t = R_{t+1} + γR_{t+2} + γ^2R_{t+3} + \cdots = \sum_{k=1}^{∞} γ^{k-1}R_{t+k}$$

$$G_t = R_{t+1} + γG_{t+1}$$
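
This recursion is the key step, and it is easy to sanity-check numerically with the same made-up episode rewards used earlier (illustrative values only).

```python
gamma = 0.9
rewards = [1.0, -1.0, 0.5, 2.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

# Direct sum: G_t = sum_k gamma^(k-1) * R_{t+k}
G_t  = sum(gamma ** k * r for k, r in enumerate(rewards))
# The same quantity one step later: G_{t+1}
G_t1 = sum(gamma ** k * r for k, r in enumerate(rewards[1:]))

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-12
```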

Given the value function,

$E_π[G_t|S_t=s] = E_π[R_{t+1}|S_t=s] + γE_π[G_{t+1}|S_t=s]$

The first term on the right-hand side is conditioned only on the state s, while the reward function defined above depends on the action and the next state as well. We therefore expand the expectation by summing over actions, weighted by the policy, and over next states, weighted by the transition probability.

$$E_π[R_{t+1}|S_t=s] = \sum_aπ(a|s)\sum_{s'}p(s'|s,a)r(s,a,s')$$

The second term on the right-hand side is handled similarly: by the Markov property, the expected return from time t+1 given the next state $s'$ is exactly $v_π(s')$, so we can rewrite it as below.

$$γE_π[G_{t+1}|S_t=s] = γ\sum_aπ(a|s)\sum_{s'}p(s'|s,a)v_π(s')$$

Combining these two terms, we obtain the Bellman equation for the value function.

$$v_π(s) = \sum_aπ(a|s)\sum_{s'}p(s'|s,a)[r(s,a,s')+γv_π(s')]$$
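
This equation can be read as a fixed-point condition and solved by simply iterating it until the values stop changing. Below is a minimal policy-evaluation sketch; it assumes the toy `states`, `P`, `R`, and `stochastic_pi` from the earlier sketches are in scope, and we will treat dynamic programming properly next time.

```python
def policy_evaluation(states, P, R, pi, gamma, tol=1e-8):
    """Iteratively apply the Bellman equation
    v(s) = sum_a pi(a|s) sum_s' p(s'|s,a) [r(s,a,s') + gamma*v(s')]
    until the largest change in one sweep falls below `tol`."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * sum(
                    p * (R[(s, a, s2)] + gamma * v[s2])
                    for s2, p in P[(s, a)].items()
                )
                for a in pi[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Uses the toy MDP and stochastic policy defined in the sketches above.
print(policy_evaluation(states, P, R, stochastic_pi, gamma=0.9))
```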

For the action-value function, we only need to consider a specific action in a given state. Therefore, the action-value function and the value function form a complementary (or dual) relationship.

$$v_π(s) = \sum_aπ(a|s)q_π(s,a)$$

$$q_π(s,a) \equiv E_π[G_t|S_t=s,A_t=a] = \sum_{s'}p(s'|s,a)[r(s,a,s')+γv_π(s')]$$
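
As a quick numerical check of this relationship, the sketch below computes $q_π(s,a)$ by a one-step lookahead on $v_π$ and then recovers $v_π(s)$ by averaging over the policy. It assumes the toy MDP, `stochastic_pi`, and `policy_evaluation` from the earlier sketches.

```python
def q_from_v(s, a, v, P, R, gamma):
    """q_pi(s,a) = sum_s' p(s'|s,a) [r(s,a,s') + gamma * v_pi(s')]"""
    return sum(p * (R[(s, a, s2)] + gamma * v[s2])
               for s2, p in P[(s, a)].items())

# v_pi obtained from the policy-evaluation sketch above
v = policy_evaluation(states, P, R, stochastic_pi, gamma=0.9)

for s in states:
    # v_pi(s) should equal sum_a pi(a|s) * q_pi(s,a)
    recovered = sum(stochastic_pi[s][a] * q_from_v(s, a, v, P, R, 0.9)
                    for a in stochastic_pi[s])
    print(s, round(v[s], 6), round(recovered, 6))  # the two values match
```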

In addition, the optimal value and action-value functions are denoted with an asterisk (*).

$$v_*(s) = \max_πv_π(s)$$

$$q_*(s,a) = \max_πq_π(s,a)$$

Whew, that was quite a long discussion — but the journey is just beginning. Next time, we’ll explore how to handle dynamic planning using the Bellman Equation, diving into value iteration and policy iteration with some code snippets.

CSY

Nagoya, Japan