Policy Gradient
In Value/Policy-Based Control, we discussed Value-Based and Policy-Based control methods. However, there are still many details to cover before we can progress to more advanced algorithms. One of the key concepts in policy-based approaches is the Policy Gradient, which forms the foundation for algorithms such as REINFORCE and PPO.
Policy-Based Pros and Cons
Here, I briefly list the features of policy-based methods to provide a quick understanding of why they are sometimes preferred over value-based approaches. For example, consider how cumbersome it can be to manually discretize action spaces in value-based algorithms like DQN.
Pros
- Continuous Action Spaces: naturally handle continuous and high-dimensional action spaces.
- Stochastic Policy: learns probability distributions over actions.
- Direct Optimization: optimizes the expected return directly, which often leads to smoother convergence.
Cons
- High Variance: gradient estimates tend to have high variance, leading to noisy updates and slower learning.
- Sample Inefficiency: typically requires more interactions with the environment to learn.
- Hyperparameter Sensitivity: sensitive to hyperparameter tuning.
Policy Parameterization
To model the policy as a function that maps states to actions (𝑓: X → Y), where the output represents a probability distribution over possible actions, we can define the policy as follows:
$$π(a|s,\theta) = \dfrac{e^{\theta \cdot ξ(s,a)}}{\sum_{a'}e^{\theta \cdot ξ(s,a')}}$$
where $ξ(s,a)$ is a feature vector for the state-action pair. This parameterization is known as the Gibbs (Softmax) policy.
Why define it this way? There are a couple of reasons.
- Stochastic Action Selection: encourages exploration, since every action keeps a non-zero probability.
- Differentiable: the softmax function is smooth and differentiable, which is crucial for gradient-based optimization.
- Log-Likelihood Gradient Form: the gradient of the log policy is easy to compute.
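As a concrete illustration, here is a minimal NumPy sketch of the Gibbs policy over a small discrete action set. The feature map `feature` and the parameter values are made up for the example; any featurization of $(s, a)$ would work.

```python
import numpy as np

def feature(s, a, n_actions, n_features=4):
    """Toy feature map xi(s, a): place the state features in the block belonging to action a."""
    phi = np.zeros(n_actions * n_features)
    phi[a * n_features:(a + 1) * n_features] = s
    return phi

def gibbs_policy(s, theta, n_actions):
    """pi(a|s, theta) = softmax over theta . xi(s, a)."""
    logits = np.array([theta @ feature(s, a, n_actions) for a in range(n_actions)])
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
theta = rng.normal(size=n_actions * n_features)   # policy parameters
s = rng.normal(size=n_features)                   # toy state features

probs = gibbs_policy(s, theta, n_actions)
a = rng.choice(n_actions, p=probs)                # stochastic action selection
print(probs, a)
```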
However, the Gibbs policy is only applicable to discrete action spaces. To handle continuous action spaces, the Gaussian policy is typically used, defined as follows:
$$π_{\theta}(a|s) = 𝒩(a \mid \mu_{\theta}(s), \Sigma_{\theta}(s))$$
where:
- $\mu_{\theta}(s)$: the mean vector
- $\Sigma_{\theta}(s)$: the covariance matrix
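A minimal sketch of such a Gaussian policy is shown below, assuming a linear mean $\mu_{\theta}(s) = W s$ and a state-independent diagonal covariance purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2

W = rng.normal(scale=0.1, size=(action_dim, state_dim))  # parameters of mu_theta(s) = W s
log_std = np.zeros(action_dim)                           # parameters of a diagonal Sigma_theta

def gaussian_policy_sample(s):
    """Sample a ~ N(mu_theta(s), Sigma_theta) and return (action, log pi(a|s))."""
    mu = W @ s
    std = np.exp(log_std)
    a = mu + std * rng.normal(size=action_dim)
    # log density of a diagonal Gaussian
    log_prob = -0.5 * np.sum(((a - mu) / std) ** 2 + 2 * log_std + np.log(2 * np.pi))
    return a, log_prob

s = rng.normal(size=state_dim)      # toy state
a, logp = gaussian_policy_sample(s)
print(a, logp)
```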
Policy Gradient Proof
To find the policy that maximizes the expected return an agent receives over time, we parameterize the policy using 𝜃. The first step is to define an objective function that can be used to update the policy. The definitions of the objective function for both episodic and continuing tasks are provided below:
Objective Function
Episodic
$$J(\theta) = 𝔼_{π_{\theta}}[\sum_{t=0}^Tr_t]$$
Continuing
$$J(\theta) = 𝔼_{π_{\theta}}[\sum_{t=0}^∞\gamma^{t}r_t]$$
where:
- $r_t$: reward at time t
- $γ(0 \le γ < 1)$: discount factor to weigh future rewards
- $T$: episode length
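In practice, $J(\theta)$ is estimated by sampling: roll out episodes under $π_{\theta}$, compute each discounted return, and average. A minimal sketch (the reward sequences are made up for illustration):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for a single episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Pretend these reward sequences came from three rollouts under pi_theta.
episodes = [[1.0, 0.0, 2.0], [0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]
gamma = 0.99

J_estimate = np.mean([discounted_return(ep, gamma) for ep in episodes])
print(J_estimate)
```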
Since we aim to maximize the expected return, the parameters are updated in the direction of the gradient (gradient ascent), the opposite of the gradient descent commonly used to train neural networks such as MLPs. The underlying idea, however, remains the same. The update rule is given as follows:
$$\theta \leftarrow \theta + \alpha\nabla_{\theta}J(\theta)$$
where:
- $\alpha$: learning rate
- $\nabla_{\theta}J(\theta)$: gradient of objective function with respect to 𝜽
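To make the sign flip concrete, here is a tiny sketch of gradient ascent on a stand-in concave objective; in policy-gradient methods, `grad_J` would be replaced by the estimator derived below.

```python
import numpy as np

# Toy objective J(theta) = -||theta - target||^2, a stand-in for the expected return.
target = np.array([1.0, -2.0])

def grad_J(theta):
    return -2.0 * (theta - target)          # gradient of the toy objective

theta = np.zeros(2)
alpha = 0.1
for _ in range(100):
    theta = theta + alpha * grad_J(theta)   # ascent: '+' instead of the '-' used for a loss

print(theta)                                # converges to the maximizer `target`
```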
Policy Gradient Theorem
The policy gradient theorem expresses the gradient of the expected return as:
$$\nabla_{\theta}J(\theta) = 𝔼_{π_{\theta}}[\sum_{t=0}^∞\nabla_{\theta}\log π_{\theta}(a_t|s_t) \cdot Q^{π_{\theta}}(s_t, a_t)]$$
It looks complex at first glance, right? It certainly did to me. Let's unpack it step by step.
Step 1: Rewrite the Objective Function
In this step, we can express the objective function as follows:
$$J(\theta) = \sum_{s}d_{0}(s)\sum_{a}π_{\theta}(a|s)Q^{π_{\theta}}(s,a)$$
where:
- $d_0(s)$: initial state distribution
However, we also need to account for the long-term distribution of states visited under the policy. Define $d^{π_{\theta}}(s)$, the discounted state distribution, as the probability of being in state $s$ under policy $π_{\theta}$, weighted by $γ^{t}$:
$$d^{π_{\theta}}(s) = \sum_{t=0}^∞ \gamma^{t}P(s_t=s)$$
where $P(s_t=s)$ is the probability of being in state $s$ at time $t$, starting from $d_0(s)$.
Finally, the objective can be rewritten as:
$$J(\theta) = \sum_{s} d^{π_{\theta}}(s) \sum_{a}π_{\theta}(a|s)r(s,a)$$
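For a small MDP this can be checked directly: the discounted state distribution is a geometric series, so $d^{π_{\theta}} = d_0 (I - γ P_{π_{\theta}})^{-1}$, and plugging it into the formula above gives $J(\theta)$. All numbers in the sketch below are made up for illustration.

```python
import numpy as np

gamma = 0.9
d0 = np.array([1.0, 0.0])                     # always start in state 0

# Toy 2-state, 2-action MDP.
pi = np.array([[0.7, 0.3],                    # pi(a|s), rows are states
               [0.4, 0.6]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],       # P[a, s, s'] for action 0
              [[0.5, 0.5], [0.3, 0.7]]])      #             for action 1
r = np.array([[1.0, 0.0],                     # r(s, a)
              [0.0, 2.0]])

# Marginalize the policy: P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a), r_pi[s] = sum_a pi(a|s) r(s,a)
P_pi = np.einsum('sa,asq->sq', pi, P)
r_pi = (pi * r).sum(axis=1)

# d^pi = sum_t gamma^t d0 P_pi^t = d0 (I - gamma P_pi)^{-1}  (geometric series)
d_pi = d0 @ np.linalg.inv(np.eye(2) - gamma * P_pi)

J = d_pi @ r_pi                               # J(theta) = sum_s d^pi(s) sum_a pi(a|s) r(s,a)
print(d_pi, J)
```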
Step 2: Compute the Gradient
Now comes the tricky part: many components depend on 𝜃. Applying the product rule:
$$\nabla_{\theta}J(\theta) = \sum_{s}[\nabla_{\theta}d^{π_{\theta}}(s) \sum_{a}π_{\theta}(a|s)r(s,a) + d^{π_{\theta}}(s) \sum_{a}\nabla_{\theta} π_{\theta}(a|s)r(s,a)]$$
Step 3: Log-Likelihood Trick
To handle the derivative of the policy, we apply the log-likelihood trick, which greatly simplifies working with the gradient:
$$\nabla_{\theta}π_{\theta}(a|s) = π_{\theta}(a|s) \cdot \dfrac{\nabla_{\theta}π_{\theta}(a|s)}{π_{\theta}(a|s)} = π_{\theta}(a|s)\nabla_{\theta}\log{π_{\theta}(a|s)}$$
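This identity is easy to sanity-check numerically. The sketch below compares both sides using finite differences on a tiny softmax policy with one parameter per action (a toy parameterization chosen only for brevity).

```python
import numpy as np

def pi(theta):
    """Softmax policy over 3 actions with one parameter per action."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.2, -0.5, 1.0])
a, eps = 1, 1e-6

# Finite-difference gradient of pi(a|theta) ...
grad_pi = np.array([
    (pi(theta + eps * np.eye(3)[i])[a] - pi(theta - eps * np.eye(3)[i])[a]) / (2 * eps)
    for i in range(3)
])

# ... and of log pi(a|theta).
grad_log_pi = np.array([
    (np.log(pi(theta + eps * np.eye(3)[i])[a]) - np.log(pi(theta - eps * np.eye(3)[i])[a])) / (2 * eps)
    for i in range(3)
])

# The trick: grad pi = pi * grad log pi
print(np.allclose(grad_pi, pi(theta)[a] * grad_log_pi, atol=1e-6))   # True
```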
Next, we consider the value function and its gradient.
Value function:
$$V^{π_{\theta}}(s) = \sum_{a}π_{\theta}(a|s)Q^{π_{\theta}}(s,a)$$
Gradient of value function:
$$\nabla_{\theta}V^{π_{\theta}}(s) = \sum_{a}[\nabla_{\theta}π_{\theta}(a|s)Q^{π_{\theta}}(s,a) + π_{\theta}(a|s)\nabla_{\theta}Q^{π_{\theta}}(s,a)]$$
$$=\sum_{a}[π_{\theta}(a|s)\nabla_{\theta}\log π_{\theta}(a|s)Q^{π_{\theta}}(s,a) + π_{\theta}(a|s)\nabla_{\theta}Q^{π_{\theta}}(s,a)]$$
Step 4: Relate the Value and Action-Value Functions
Given the relationship between the value function and the action-value function, as discussed in the context of MDPs (Markov Decision Processes), the following expressions establish a recursive relationship:
$$Q^{π_{\theta}}(s,a) = r(s,a) + γ\sum_{s'}P(s'|s,a)V^{π_{\theta}}(s')$$
$$\nabla_{\theta}Q^{π_{\theta}}(s,a) = γ\sum_{s'}P(s'|s,a)\nabla_{\theta}V^{π_{\theta}}(s')$$
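Substituting this back into the expression for $\nabla_{\theta}V^{π_{\theta}}(s)$ yields a recursion. Writing $\phi(s) = \sum_{a}π_{\theta}(a|s)\nabla_{\theta}\log π_{\theta}(a|s)Q^{π_{\theta}}(s,a)$ for the per-state term and $P^{π_{\theta}}(s'|s) = \sum_{a}π_{\theta}(a|s)P(s'|s,a)$, a compact way to see the unrolling is:
$$\nabla_{\theta}V^{π_{\theta}}(s) = \phi(s) + γ\sum_{s'}P^{π_{\theta}}(s'|s)\nabla_{\theta}V^{π_{\theta}}(s') = \sum_{t=0}^∞ γ^{t}\sum_{x}P(s_t=x \mid s_0=s)\,\phi(x)$$
Averaging over $s \sim d_0$ and recalling $J(\theta) = \sum_{s}d_0(s)V^{π_{\theta}}(s)$ is exactly what makes the discounted state distribution $d^{π_{\theta}}$ appear.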
Step 5: Combine
We now combine the results we have carefully derived. Unrolling the recursion above collects the $γ^{t}$-weighted visitation probabilities into the discounted state distribution $d^{π_{\theta}}$, leaving:
$$\nabla_{\theta}J(\theta) = \sum_{s}d^{π_{\theta}}(s) \sum_{a}\nabla_{\theta}π_{\theta}(a|s)Q^{π_{\theta}}(s,a)$$
$$= \sum_{s}d^{π_{\theta}}(s) \sum_{a}π_{\theta}(a|s)\nabla_{\theta}\log π_{\theta}(a|s)Q^{π_{\theta}}(s,a)$$
$$= 𝔼_{π_{\theta}}[\sum_{t=0}^∞\nabla_{\theta}\log π_{\theta}(a_t|s_t) \cdot Q^{π_{\theta}}(s_t, a_t)]$$
which matches exactly the formula stated at the beginning of this section.
Why is this important? Because it allows us to directly optimize the policy without relying on a value function—especially since, in some cases, finding such a value function may not be straightforward. Moreover, in many real-world tasks, we often need methods that can naturally handle continuous states and actions.
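In practice, the expectation is replaced by sampled trajectories, and $Q^{π_{\theta}}(s_t,a_t)$ is replaced by the sampled return from time $t$; this is essentially the REINFORCE estimator. Below is a minimal sketch on a toy two-armed bandit (a single-state stand-in chosen for brevity), using a softmax policy with one logit per action.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])            # toy bandit: action 1 is better

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(2)                          # one logit per action
alpha = 0.1

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)               # sample a ~ pi_theta
    r = rng.normal(true_means[a], 0.1)       # sampled return stands in for Q(s, a)

    grad_log_pi = -probs                     # d/d theta_j log softmax(theta)_a = 1{j=a} - pi_j
    grad_log_pi[a] += 1.0

    theta += alpha * grad_log_pi * r         # theta <- theta + alpha * grad log pi * return

print(softmax(theta))                        # probability mass shifts toward the better arm
```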
Conclusion
Whew, it’s really exciting to go through the derivation once we truly understand it. Policy Gradient links changes in the policy parameters directly to the expected reward, making policy updates possible. By sampling trajectories in different ways, we can estimate the gradient and update parameters more flexibly—whether using models like Transformers, LSTMs, RNNs, or even methods like Neuro Evolution of Augmenting Topologies (NEAT).