PPO-NES
Today, we will explore Natural Evolution Strategies (NES) and investigate how they can be combined with Proximal Policy Optimization (PPO) to potentially enhance its performance. To evaluate the effectiveness of this approach, we will use environments from Gymnasium as our benchmark for testing and comparison. Before diving into the integration, we’ll begin with a brief introduction to Evolution Strategies. A more detailed explanation of ES will be provided in a future post.
Evolution Strategy (ES)
ES is an optimization algorithm inspired by the principles of biological evolution. It is particularly well-suited for solving complex, continuous parameter optimization problems. As a member of the broader family of evolutionary algorithms, ES operates by iteratively refining a population of candidate solutions through mechanisms that resemble natural selection, mutation, and recombination.
Here, we will outline the steps of NES, a subclass of Evolution Strategies that improves a search distribution over candidate solutions by estimating the gradient of the expected fitness with respect to the distribution's parameters.
Natural Evolution Strategy (NES) Introduction
So, what does NES actually do?
NES is commonly used to optimize non-differentiable or black-box functions. Unlike gradient-based methods, which require explicit gradients of the objective function, NES optimizes a probability distribution over parameters, iteratively improving that distribution to maximize the expected value of a fitness function.
The core idea behind NES is the use of natural gradients, which take into account the geometry of the parameter space, leading to more efficient and stable updates. Naturally, defining an appropriate fitness function is crucial for the effectiveness of the algorithm.
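In symbols (the notation here is generic and not tied to any particular NES variant): if $\pi(\theta \mid \psi)$ is the search distribution with parameters $\psi$, NES ascends the natural gradient of the expected fitness,
$$J(\psi) = \mathbb{E}_{\theta \sim \pi(\cdot \mid \psi)}\big[f(\theta)\big], \qquad \psi \leftarrow \psi + \eta \, F^{-1} \nabla_{\psi} J(\psi),$$
where $F$ is the Fisher information matrix of the search distribution and $\eta$ is a learning rate.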
At its core, NES treats the problem as a search for a set of policy parameters $\theta$ that maximize a given fitness function. This can feel somewhat similar to a two-stage reinforcement learning optimization approach. However, unlike standard RL algorithms, NES does not learn from individual state transitions or actions; it treats the policy as a black box and uses only episode-level fitness scores to update the parameters of the search distribution.
Below, we outline the main steps of the NES algorithm.
NES Algorithm
- Initialize the Distribution:
- Initialize mean $\mu_0$ and covariance matrix $\Sigma_0$
- Define the fitness function
- Sample Candidates:
- Sample parameter vectors $\theta_1, \theta_2, \ldots, \theta_n \sim \mathcal{N}(\mu, \Sigma)$
- Evaluate Fitness:
- Calculate a fitness score for each individual
- Estimate the Gradient:
- Compute the gradient of the expected fitness function with respect to the distribution parameters
where the gradient with respect to the mean is approximated as:
$$\nabla_{\mu}\mathbb{E}[f(\theta)] \approx \dfrac{1}{n}\sum_{i=1}^{n}f(\theta_i) \cdot \Sigma^{-1}(\theta_i - \mu)$$
- Compute the Natural Gradient:
- Multiply the gradient by the inverse Fisher information matrix $F^{-1}$ to obtain the natural gradient $\tilde{\nabla} = F^{-1}\nabla$
- Update Distribution Parameters:
$$\mu \leftarrow \mu + \eta_{\mu} \cdot \tilde{\nabla}_{\mu}\mathbb{E}[f(\theta)]$$
$$\Sigma \leftarrow \Sigma + \eta_{\Sigma} \cdot \tilde{\nabla}_{\Sigma}\mathbb{E}[f(\theta)]$$
Steps 2 through 6 are repeated until the algorithm converges; a minimal code sketch of these steps is shown below.
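To make the outline concrete, here is a minimal, self-contained sketch in Python. For simplicity it assumes a fixed isotropic Gaussian ($\Sigma = \sigma^2 I$), so only the mean is adapted; the function and hyperparameter names are illustrative and are not taken from the code sample later in this post.

```python
import numpy as np

def simple_nes(fitness_fn, mu0, sigma=0.1, lr=0.05, population_size=50, generations=200):
    """Minimal NES sketch: isotropic Gaussian search distribution with a fixed
    sigma, so only the mean is adapted (steps 2-6 of the outline above)."""
    mu = np.array(mu0, dtype=np.float64)
    for _ in range(generations):
        # 2. Sample candidates: theta_i = mu + sigma * eps_i, eps_i ~ N(0, I)
        eps = np.random.randn(population_size, mu.size)
        thetas = mu + sigma * eps
        # 3. Evaluate fitness for each candidate
        fitness = np.array([fitness_fn(theta) for theta in thetas])
        # Standardize fitness scores to keep the gradient estimate well scaled
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        # 4./5. Gradient estimate of E[f(theta)] w.r.t. mu; with a fixed isotropic
        # Gaussian the Fisher matrix for the mean is sigma^-2 * I, so the natural
        # gradient differs only by a constant factor absorbed into the learning rate
        grad_mu = (eps * fitness[:, None]).sum(axis=0) / (population_size * sigma)
        # 6. Update the distribution mean
        mu = mu + lr * grad_mu
    return mu

# Toy usage: maximize f(theta) = -||theta - 3||^2, whose optimum is theta = [3, 3, 3]
best = simple_nes(lambda th: -np.sum((th - 3.0) ** 2), mu0=np.zeros(3))
```

Standardizing (or rank-normalizing) the fitness values each generation is a common trick to keep the update magnitude stable as the fitness scale changes.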
PPO-NES Implementation
In the code snippet below, we highlight only the parts that differ from a standard PPO agent. The key difference lies in the agent implementation, where NES is introduced as a wrapper that periodically re-optimizes the parameters of the Actor-Critic network. For ease of understanding and use, the integration uses the simplest form of NES.
```python
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.optim as optim

# FrameStack, AtariActorCritic and NaturalEvolutionStrategy are defined in the
# full code sample referenced below; "ALE/Assault-v5" requires Gymnasium's
# Atari extras (ale-py).


class PPONESAgent:
    """Hybrid PPO-NES Agent for Atari environments"""

    def __init__(self, env_name="ALE/Assault-v5", lr=3e-4, gamma=0.99, clip_ratio=0.2,
                 update_epochs=4, batch_size=32, buffer_size=128,
                 nes_population_size=20, nes_sigma=0.1, nes_frequency=10):
        self.env_name = env_name
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")

        # Create environment
        self.env = gym.make(env_name, render_mode=None)
        self.frame_stack = FrameStack(num_frames=4)

        # PPO parameters
        self.lr = lr
        self.gamma = gamma
        self.clip_ratio = clip_ratio
        self.update_epochs = update_epochs
        self.batch_size = batch_size
        self.buffer_size = buffer_size

        # NES parameters
        self.nes_frequency = nes_frequency
        self.generation = 0

        # Initialize network
        self.actor_critic = AtariActorCritic(
            self.env.observation_space,
            self.env.action_space
        ).to(self.device)
        self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=lr)

        # Initialize NES over the flattened network parameters
        param_count = sum(p.numel()
                          for p in self.actor_critic.parameters() if p.requires_grad)
        self.nes = NaturalEvolutionStrategy(
            param_count,
            population_size=nes_population_size,
            sigma=nes_sigma
        )

        # Training storage
        self.states = []
        self.actions = []
        self.logprobs = []
        self.rewards = []
        self.values = []
        self.dones = []

        # Metrics
        self.episode_rewards = deque(maxlen=100)
        self.episode_lengths = deque(maxlen=100)
        self.training_rewards = []
        self.nes_rewards = []

        # Persistent episode tracking
        self.current_episode_reward = 0
        self.current_episode_length = 0
        self.episode_active = False

        # State management
        self.current_state = None

    def evaluate_fitness(self, params, num_episodes=3):
        """Evaluate fitness of a parameter vector as its mean episode return"""
        # Save current parameters
        original_params = self.parameters_to_vector()
        # Load new parameters
        self.vector_to_parameters(params)

        total_reward = 0
        for _ in range(num_episodes):
            state, _ = self.env.reset()
            state = self.frame_stack.reset(state)
            state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            episode_reward = 0
            done = False
            max_steps = 1000  # Limit episode length for faster evaluation
            for step in range(max_steps):
                with torch.no_grad():
                    action, _, _, _ = self.actor_critic.get_action_and_value(
                        state)
                next_state, reward, terminated, truncated, _ = self.env.step(
                    action.item())
                done = terminated or truncated
                state = self.frame_stack.step(next_state)
                state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
                episode_reward += reward
                if done:
                    break
            total_reward += episode_reward

        # Restore original parameters
        self.vector_to_parameters(original_params)
        return total_reward / num_episodes

    def nes_evolution_step(self):
        """Perform one NES evolution step"""
        print(f"Running NES evolution step (Generation {self.generation})")
        # Generate population
        population, epsilon = self.nes.generate_population()

        # Evaluate fitness for each individual
        fitnesses = []
        for i, individual in enumerate(population):
            fitness = self.evaluate_fitness(individual)
            fitnesses.append(fitness)
            print(
                f"Individual {i+1}/{len(population)}: Fitness = {fitness:.2f}")
        fitnesses = np.array(fitnesses)

        # Update NES
        self.nes.update(fitnesses, epsilon)

        # Load best parameters
        if self.nes.best_params is not None:
            self.vector_to_parameters(self.nes.best_params)

        print(f"Best fitness this generation: {np.max(fitnesses):.2f}")
        print(f"Best fitness overall: {self.nes.best_fitness:.2f}")
        self.nes_rewards.append(np.max(fitnesses))
        self.generation += 1
```

PPO-NES Agent - CSY
You can refer to the code sample for details.
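To give a sense of how the NES step might be interleaved with regular PPO training, here is a hedged sketch; `collect_rollout` and `ppo_update` are hypothetical method names standing in for the standard PPO parts omitted from the snippet above.

```python
# Sketch of a training loop (not the author's exact code): run standard PPO
# updates and trigger a NES evolution step every `nes_frequency` updates.
agent = PPONESAgent(env_name="ALE/Assault-v5", nes_frequency=10)

num_updates = 500  # arbitrary for illustration
for update in range(1, num_updates + 1):
    agent.collect_rollout()   # hypothetical: gather a buffer of transitions
    agent.ppo_update()        # hypothetical: clipped-surrogate PPO update
    if update % agent.nes_frequency == 0:
        agent.nes_evolution_step()  # method shown in the snippet above
```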
Conclusion
PPO is a cornerstone algorithm among on-policy RL methods. By incorporating NES, we can encourage exploration over a broader range of the state space and reduce the risk of converging to suboptimal policies. However, this benefit comes at a computational cost: each NES generation requires evaluating a whole population of candidate policies over multiple episodes. Fortunately, these evaluations are independent of one another, so NES is highly parallelizable, and we can leverage CUDA or other parallel computing frameworks to accelerate it.
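As a rough illustration of that parallelism (a sketch only; the agent code above is single-process, and `evaluate_candidate`, `evaluate_population`, and `num_workers` are assumed names), the fitness loop of a generation can be farmed out across worker processes:

```python
from multiprocessing import Pool

import numpy as np

def evaluate_candidate(params):
    # Placeholder fitness: in the agent above this would roll out a few
    # episodes with the candidate parameters and return the mean return.
    return -float(np.sum(np.asarray(params) ** 2))

def evaluate_population(population, num_workers=8):
    # Candidate evaluations are independent, so a NES generation's fitness
    # loop parallelizes trivially across processes (or across GPUs).
    with Pool(num_workers) as pool:
        return pool.map(evaluate_candidate, population)

if __name__ == "__main__":
    fitnesses = evaluate_population(np.random.randn(20, 10))
```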
That said, there remains much to explore—not only in the realm of RL algorithms but also in the deeper, perhaps still undiscovered, evolutionary mechanisms found in nature that could inspire future breakthroughs.