PPO-NES


Today, we will explore Natural Evolution Strategies (NES) and investigate how they can be combined with Proximal Policy Optimization (PPO) to potentially enhance its performance. To evaluate the effectiveness of this approach, we will use environments from Gymnasium as our benchmark for testing and comparison. Before diving into the integration, we’ll begin with a brief introduction to Evolution Strategies. A more detailed explanation of ES will be provided in a future post.

Evolution Strategy (ES)

ES is an optimization algorithm inspired by the principles of biological evolution. It is particularly well-suited for solving complex, continuous parameter optimization problems. As a member of the broader family of evolutionary algorithms, ES operates by iteratively refining a population of candidate solutions through mechanisms that resemble natural selection, mutation, and recombination.

Here, we will outline the steps of NES, a subclass of Evolution Strategies that updates a search distribution by estimating the gradient of the expected fitness with respect to the distribution's parameters.

Natural Evolution Strategy (NES) Introduction

So, what does NES actually do?
NES is commonly used to optimize non-differentiable or black-box functions. Unlike gradient-based methods that require explicit gradients of the objective function, NES directly optimizes a probability distribution over parameters. It iteratively improves this distribution to maximize the expected value of a fitness function.

The core idea behind NES is the use of natural gradients, which take into account the geometry of the parameter space, leading to more efficient and stable updates. Naturally, defining an appropriate fitness function is crucial for the effectiveness of the algorithm.

At its core, NES treats the problem as searching for a set of policy parameters $\theta$ that maximize a given fitness function. This can feel somewhat similar to a two-stage reinforcement learning optimization approach. However, unlike standard RL algorithms, NES does not learn from individual state transitions or actions; any environment interaction is folded into the fitness evaluation, and NES itself only updates the parameters of the search distribution.

Below, we outline the main steps of the NES algorithm.

NES Algorithm

  1. Initialize the Distribution:
    • Initialize the mean $\mu_0$ and covariance matrix $\Sigma_0$
    • Define the fitness function
  2. Sample Candidates:
    • Sample parameter vectors $\theta_1, \theta_2, \ldots, \theta_n \sim \mathcal{N}(\mu, \Sigma)$
  3. Evaluate Fitness:
    • Calculate the fitness score for each individual
  4. Estimate the Gradient:
    • Compute the gradient of the expected fitness function with respect to the distribution parameters

where the gradient with respect to the mean is approximated as:

$$\nabla_{\mu}\mathbb{E}[f(\theta)] \approx \dfrac{1}{n}\sum_{i=1}^{n} f(\theta_i) \cdot \Sigma^{-1}(\theta_i - \mu)$$

  5. Compute the Natural Gradient:
    • Multiply the gradient by the inverse Fisher Information Matrix $F^{-1}$
  6. Update Distribution Parameters:

$$\mu \leftarrow \mu + \eta_{\mu} \cdot \nabla_{\mu}\mathbb{E}[f(\theta)]$$

$$\Sigma \leftarrow \Sigma + \eta_{\Sigma} \cdot \nabla_{\Sigma}\mathbb{E}[f(\theta)]$$

Steps 2-6 are repeated until the algorithm converges.
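To make these steps concrete, here is a minimal, self-contained sketch of the loop above in Python. It assumes the simplest setting: an isotropic Gaussian search distribution with a fixed standard deviation $\sigma$ (so $\Sigma = \sigma^2 I$ and the covariance update in step 6 is skipped), in which case the natural-gradient step on the mean reduces to an average of fitness-weighted noise. The function and variable names are illustrative and not taken from the code sample later in this post.

import numpy as np


def nes_sketch(fitness_fn, dim, iterations=200, population_size=50,
               sigma=0.1, lr=0.05, seed=0):
    """Minimal NES loop with a fixed isotropic Gaussian (sketch only)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)                               # step 1: initialize the distribution

    for _ in range(iterations):
        epsilon = rng.standard_normal((population_size, dim))
        thetas = mu + sigma * epsilon                # step 2: sample candidates
        fitnesses = np.array([fitness_fn(t) for t in thetas])  # step 3: evaluate fitness

        # steps 4-5: Monte Carlo estimate of the gradient of the expected
        # fitness w.r.t. mu; normalizing the fitness values stabilizes the step
        advantages = (fitnesses - fitnesses.mean()) / (fitnesses.std() + 1e-8)
        grad_mu = (epsilon.T @ advantages) / (population_size * sigma)

        mu = mu + lr * grad_mu                       # step 6: update the mean
    return mu


# Toy usage: maximize f(theta) = -||theta - 3||^2, whose optimum is theta = 3
best_theta = nes_sketch(lambda th: -np.sum((th - 3.0) ** 2), dim=5)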

PPO-NES Implementation

In the code snippet, we highlight only the parts that differ from a standard PPO agent. The key difference lies in the agent implementation, where NES is introduced as a wrapper to optimize the parameters of the Actor-Critic network. This integration is demonstrated using the simplest form of NES for ease of understanding and use.

import gymnasium as gym                  # Atari (ALE) environments need the gymnasium[atari] extras
import numpy as np
import torch
import torch.optim as optim
from collections import deque

# FrameStack, AtariActorCritic and NaturalEvolutionStrategy are defined in the
# full code sample referenced below.


class PPONESAgent:
    """Hybrid PPO-NES Agent for Atari environments"""

    def __init__(self, env_name="ALE/Assault-v5", lr=3e-4, gamma=0.99, clip_ratio=0.2,
                 update_epochs=4, batch_size=32, buffer_size=128,
                 nes_population_size=20, nes_sigma=0.1, nes_frequency=10):

        self.env_name = env_name
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")

        # Create environment
        self.env = gym.make(env_name, render_mode=None)
        self.frame_stack = FrameStack(num_frames=4)

        # PPO parameters
        self.lr = lr
        self.gamma = gamma
        self.clip_ratio = clip_ratio
        self.update_epochs = update_epochs
        self.batch_size = batch_size
        self.buffer_size = buffer_size

        # NES parameters
        self.nes_frequency = nes_frequency
        self.generation = 0

        # Initialize network
        self.actor_critic = AtariActorCritic(
            self.env.observation_space,
            self.env.action_space
        ).to(self.device)

        self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=lr)

        # Initialize NES
        param_count = sum(p.numel()
                          for p in self.actor_critic.parameters() if p.requires_grad)
        self.nes = NaturalEvolutionStrategy(
            param_count,
            population_size=nes_population_size,
            sigma=nes_sigma
        )

        # Training storage
        self.states = []
        self.actions = []
        self.logprobs = []
        self.rewards = []
        self.values = []
        self.dones = []

        # Metrics
        self.episode_rewards = deque(maxlen=100)
        self.episode_lengths = deque(maxlen=100)
        self.training_rewards = []
        self.nes_rewards = []

        # Persistent episode tracking
        self.current_episode_reward = 0
        self.current_episode_length = 0
        self.episode_active = False

        # State management
        self.current_state = None

    def evaluate_fitness(self, params, num_episodes=3):
        """Evaluate fitness of parameter vector"""
        # Save current parameters
        original_params = self.parameters_to_vector()

        # Load new parameters
        self.vector_to_parameters(params)

        total_reward = 0
        for _ in range(num_episodes):
            state, _ = self.env.reset()
            state = self.frame_stack.reset(state)
            state = torch.FloatTensor(state).unsqueeze(0).to(self.device)

            episode_reward = 0
            done = False
            max_steps = 1000  # Limit episode length for faster evaluation

            for step in range(max_steps):
                with torch.no_grad():
                    action, _, _, _ = self.actor_critic.get_action_and_value(
                        state)

                next_state, reward, terminated, truncated, _ = self.env.step(
                    action.item())
                done = terminated or truncated

                state = self.frame_stack.step(next_state)
                state = torch.FloatTensor(state).unsqueeze(0).to(self.device)

                episode_reward += reward

                if done:
                    break

            total_reward += episode_reward

        # Restore original parameters
        self.vector_to_parameters(original_params)

        return total_reward / num_episodes

    def nes_evolution_step(self):
        """Perform one NES evolution step"""
        print(f"Running NES evolution step (Generation {self.generation})")

        # Generate population
        population, epsilon = self.nes.generate_population()

        # Evaluate fitness for each individual
        fitnesses = []
        for i, individual in enumerate(population):
            fitness = self.evaluate_fitness(individual)
            fitnesses.append(fitness)
            print(
                f"Individual {i+1}/{len(population)}: Fitness = {fitness:.2f}")

        fitnesses = np.array(fitnesses)

        # Update NES
        self.nes.update(fitnesses, epsilon)

        # Load best parameters
        if self.nes.best_params is not None:
            self.vector_to_parameters(self.nes.best_params)
            print(f"Best fitness this generation: {np.max(fitnesses):.2f}")
            print(f"Best fitness overall: {self.nes.best_fitness:.2f}")

        self.nes_rewards.append(np.max(fitnesses))
        self.generation += 1

PPO-NES Agent - CSY


You can refer to the code sample for details.
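The NaturalEvolutionStrategy class itself is not shown in the snippet above. For orientation, here is a minimal sketch that matches the interface the agent relies on (generate_population(), update(fitnesses, epsilon), best_params, best_fitness), again using the simplest fixed-sigma isotropic form; the implementation in the actual code sample may differ in detail.

import numpy as np


class NaturalEvolutionStrategy:
    """Sketch of a fixed-sigma NES over a flat parameter vector."""

    def __init__(self, param_count, population_size=20, sigma=0.1, lr=0.02, seed=0):
        self.rng = np.random.default_rng(seed)
        self.mu = np.zeros(param_count, dtype=np.float32)   # mean of the search distribution
        self.sigma = sigma                                   # fixed isotropic std deviation
        self.population_size = population_size
        self.lr = lr
        self.best_params = None
        self.best_fitness = -np.inf

    def generate_population(self):
        # Sample candidates theta_i = mu + sigma * epsilon_i
        epsilon = self.rng.standard_normal(
            (self.population_size, self.mu.shape[0])).astype(np.float32)
        population = self.mu + self.sigma * epsilon
        return population, epsilon

    def update(self, fitnesses, epsilon):
        # Keep track of the best candidate seen so far
        best_idx = int(np.argmax(fitnesses))
        if fitnesses[best_idx] > self.best_fitness:
            self.best_fitness = float(fitnesses[best_idx])
            self.best_params = self.mu + self.sigma * epsilon[best_idx]

        # Fitness-normalized gradient estimate and mean update (steps 4-6)
        advantages = (fitnesses - fitnesses.mean()) / (fitnesses.std() + 1e-8)
        grad_mu = (epsilon.T @ advantages) / (self.population_size * self.sigma)
        self.mu = (self.mu + self.lr * grad_mu).astype(np.float32)

In the hybrid agent, one would typically synchronize the distribution mean with the current network parameters before each evolution step, so that NES perturbs the PPO-trained weights rather than an arbitrary initial vector.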

Conclusion

PPO is a cornerstone algorithm among on-policy RL methods. By incorporating NES, we can enhance exploration across a broader range of state spaces, helping to avoid convergence to suboptimal policies. However, this benefit comes at a computational cost: every NES generation requires full environment rollouts for each individual in the population. Fortunately, these evaluations are independent of one another, so NES is highly parallelizable, and we can leverage CUDA or other parallel computing frameworks to accelerate it.
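As a rough illustration of that parallelism, the sketch below evaluates a NES population with a process pool. The evaluate_candidate function here is a hypothetical stand-in (a toy quadratic); in the agent above it would rebuild the policy from the parameter vector and run a few episodes, much like evaluate_fitness does.

from concurrent.futures import ProcessPoolExecutor

import numpy as np


def evaluate_candidate(params):
    # Hypothetical stand-in fitness: in PPO-NES this would load `params` into
    # a copy of the Actor-Critic network and return the mean episode reward.
    return -float(np.sum(params ** 2))


def evaluate_population(population, workers=8):
    # Each candidate is independent, so fitness evaluation parallelizes trivially.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        fitnesses = list(pool.map(evaluate_candidate, population))
    return np.array(fitnesses)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    population = rng.standard_normal((20, 1000))   # 20 candidates, 1000 parameters each
    print(evaluate_population(population))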

That said, there remains much to explore—not only in the realm of RL algorithms but also in the deeper, perhaps still undiscovered, evolutionary mechanisms found in nature that could inspire future breakthroughs.

CSY

Nagoya, Japan